Dear all,

In English, possession can be indicated with an apostrophe followed by s, as in "this man's computer". Dutch works almost the same way, except that in most cases there is no apostrophe. We only use an apostrophe when the word ends in an s or in a/o/e/i/u. For example:

Jans hoed (hat)
Jos' tas (bag)
Monica's jas (coat)
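
To make the rule concrete, here is a minimal sketch in Perl that builds the possessive form exactly as I describe it above. The helper name dutch_possessive is made up for this email and is not part of any module, and the rule is simplified to the three cases listed.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not an existing API):
# builds the Dutch possessive of a name following the simplified rule above.
sub dutch_possessive {
    my ($name) = @_;
    return "${name}'"  if $name =~ /s\z/i;          # ends in s: apostrophe only
    return "${name}'s" if $name =~ /[aeiou]\z/i;    # ends in a/o/e/i/u: apostrophe + s
    return "${name}s";                              # default: plain s, no apostrophe
}

# Print the possessive of the three example names.
print "$_ -> ", dutch_possessive($_), "\n" for qw( Jan Jos Monica );
```

Note the braces in "${name}'s": without them Perl may read the apostrophe as the old package separator.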

The Lingua::Stem::Snowball module does not know this. The small script below this email demonstrates that.

The default case is stemmed correctly: Jans -> Jan. For the exceptions, Jos' and Monica's, the stemmer leaves the apostrophe at the end. The spelling Jan's for Jans, which is erroneous in Dutch, is also stemmed wrongly.

In Lucy this leads to Jos' and Monica' ending up as words in the lexicon. Messages containing "Monica's" will not be found when searching for "Monica". The second script below demonstrates this with the word Halsema's.
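
What I would expect to happen is roughly the following normalisation before stemming. This is only a sketch under my own assumptions: strip_possessive is a made-up helper, not something Lingua::Stem::Snowball or Lucy provides, and it handles only the apostrophe forms. The plain-s form (Jans) is left to the stemmer.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not an existing API): removes a trailing
# possessive apostrophe, with or without a following s, from a single token.
# Both the ASCII apostrophe and the typographic one (U+2019) are accepted.
sub strip_possessive {
    my ($token) = @_;
    $token =~ s/['\x{2019}]s?\z//;
    return $token;
}

# Normalise the example words before they would be handed to a stemmer.
my @words = map { strip_possessive($_) } qw( Jans Jos' Monica's Jan's Halsema's );
print join( ' ', @words ), "\n";
```

Tokens cleaned this way could then be passed to stem_in_place as in the first script below. Whether this is the right place to handle it, rather than in the stemmer itself, is exactly my question.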

Is this indeed a bug? Is there a way to work around this?

Kind regards,
Arjan Widlak

United Knowledge
http://www.unitedknowledge.nl

---Lingua::Stem::Snowball--------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;

use Encode;
use Lingua::Stem::Snowball;

my @words = qw( Jans Jos' Monica's Jan's );

my $stemmer = Lingua::Stem::Snowball->new( lang => 'nl' );
$stemmer->stem_in_place( \@words );

foreach my $word ( @words ) {
    say encode( 'utf8', $word );
}
---Lingua::Stem::Snowball--------------------------------------------------------------------------------------

---Lucy---------------------------------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;
use Encode;

use Lucy::Plan::Schema;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::CaseFolder;
use Lucy::Analysis::SnowballStemmer;
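use Lucy::Index::IndexReader;
use Lucy::Index::LexiconReader;
use Lucy::Plan::FullTextType;     # used below for the field type
use Lucy::Search::QueryParser;    # used below to parse the query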
use utf8; #data in script itself

# create an index
my $document1 = {
    searchstring => qq|In een column schrijft hij een reactie op Femke Halsema's voorstel om te komen tot meer samenwerking op links.|,
};

my $message_storage = "/tmp";
my $schema          = Lucy::Plan::Schema->new;
my $case_folder     = Lucy::Analysis::CaseFolder->new;
my $tokenizer       = Lucy::Analysis::RegexTokenizer->new;

my $stemmer = Lucy::Analysis::SnowballStemmer->new(
    language    => 'nl',
);
my $polyanalyzer    = Lucy::Analysis::PolyAnalyzer->new(
    language    => 'nl',
    analyzers   => [ $case_folder, $tokenizer, $stemmer ],
);

# Field Types
my $type_text    = Lucy::Plan::FullTextType->new(
    analyzer        => $polyanalyzer,
    indexed         => 1,
    stored          => 1,
    sortable        => 0
);

$schema->spec_field( name => "searchstring", type => $type_text );
my $indexer = Lucy::Index::Indexer->new(
    schema      => $schema,
    index       => $message_storage,
    create      => 1,
    truncate    => 1,
);

$indexer->add_doc( $document1 );
$indexer->commit;

# See what we find
my $query_parser = Lucy::Search::QueryParser->new(
    schema  => $schema,
    fields  => [ 'searchstring' ],
);

my $query = $query_parser->parse( 'Halsema' );

my $searcher = Lucy::Search::IndexSearcher->new(
    index => $message_storage,
);

my $hits = $searcher->hits(
    query       => $query,
    offset      => 0,
    num_wanted  => 10000,
);

say encode( 'utf8', "\n\tHits from the index:");
while ( my $hit = $hits->next ) {
    say encode( 'utf8', "found hit on: " . $hit->{ searchstring } );
}

# See what's in the lexicon
my $polyreader = Lucy::Index::IndexReader->open(
    index => $message_storage,
);
my $seg_readers = $polyreader->seg_readers;

say encode('utf8', "\n\tIndividual words in the lexicon:");
foreach my $seg_reader ( @$seg_readers ) {
    my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
    my $lexicon    = $lex_reader->lexicon( field => 'searchstring' );

    while ( $lexicon->next ) {
        say encode( 'utf8', $lexicon->get_term );
    }
}
---Lucy---------------------------------------------------------------------------------------------------------------


