Dear all,
In English, possession can be indicated by an apostrophe followed by s, as in
"this man's computer". Dutch works almost the same way, but in most cases
without the apostrophe. We only use an apostrophe when the word ends in
s or in a/o/e/i/u. For example:
Jans hoed (hat)
Jos' tas (bag)
Monica's jas (coat)
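To make the rule concrete, here is a minimal sketch of it in Perl. The
helper name dutch_possessive is made up for this illustration, and it only
implements the simplified rule described above, not the full Dutch
spelling rule.

```perl
use strict;
use warnings;

# Hypothetical helper: form the possessive of a name using the
# simplified rule from this email (an assumption, not the full rule).
sub dutch_possessive {
    my ($name) = @_;
    return $name . "'"  if $name =~ /s\z/i;        # Jos    -> Jos'
    return $name . "'s" if $name =~ /[aoeiu]\z/i;  # Monica -> Monica's
    return $name . "s";                            # Jan    -> Jans
}

print dutch_possessive($_), "\n" for qw( Jan Jos Monica );
```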
The Lingua::Stem::Snowball module does not know this. The small script
below this email demonstrates that.
The default case is stemmed correctly: Jans -> Jan. For the exceptions, Jos'
and Monica's, the stemmer leaves the apostrophe at the end. The spelling
Jan's for Jans, which is erroneous in Dutch, is also stemmed wrongly.
In Lucy this leads to Jos' and Monica' ending up as terms in the lexicon.
Messages containing "Monica's" will not be found when searching for "Monica".
This is demonstrated with the word Halsema's in the second script below.
Is this indeed a bug? Is there a way to work around this?
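One workaround I can imagine, as an untested sketch: strip the possessive
marker before the text reaches the stemmer. The helper strip_possessive
below is hypothetical, not part of Lucy or Lingua::Stem::Snowball, and it
would have to be applied both to the text being indexed and to the query
terms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lucy or Lingua::Stem::Snowball):
# remove a trailing possessive marker from a single word.
sub strip_possessive {
    my ($word) = @_;
    # First try 's (Monica's -> Monica, Jan's -> Jan);
    # otherwise try a bare apostrophe after s (Jos' -> Jos).
    $word =~ s/'s\z//i or $word =~ s/(?<=s)'\z//i;
    return $word;
}

print strip_possessive($_), "\n" for qw( Jans Jos' Monica's Jan's Halsema's );
```

Plain possessives such as Jans are left untouched here, since the stemmer
already handles them.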
Kind regards,
Arjan Widlak
United Knowledge
http://www.unitedknowledge.nl
---Lingua::Stem::Snowball--------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;
use Encode;
use Lingua::Stem::Snowball;
# Regular possessive, the two apostrophe cases, and a common misspelling
my @words = qw( Jans Jos' Monica's Jan's );
my $stemmer = Lingua::Stem::Snowball->new( lang => 'nl' );
$stemmer->stem_in_place( \@words );
foreach my $word ( @words ) {
    say encode( 'utf8', $word );
}
---Lingua::Stem::Snowball--------------------------------------------------------------------------------------
---Lucy---------------------------------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;
use Encode;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::CaseFolder;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Index::IndexReader;
use Lucy::Index::LexiconReader;
use utf8; #data in script itself
# create an index
my $document1 = {
    searchstring => qq|In een column schrijft hij een reactie op
Femke Halsema's voorstel om te komen tot meer samenwerking op links.|,
};
my $message_storage = "/tmp";
my $schema = Lucy::Plan::Schema->new;
my $case_folder = Lucy::Analysis::CaseFolder->new;
my $tokenizer = Lucy::Analysis::RegexTokenizer->new;
my $stemmer = Lucy::Analysis::SnowballStemmer->new(
    language => 'nl',
);
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    language  => 'nl',
    analyzers => [ $case_folder, $tokenizer, $stemmer ],
);
# Field Types
my $type_text = Lucy::Plan::FullTextType->new(
    analyzer => $polyanalyzer,
    indexed  => 1,
    stored   => 1,
    sortable => 0,
);
$schema->spec_field( name => "searchstring", type => $type_text );
my $indexer = Lucy::Index::Indexer->new(
    schema   => $schema,
    index    => $message_storage,
    create   => 1,
    truncate => 1,
);
$indexer->add_doc( $document1 );
$indexer->commit;
# See what we find
my $query_parser = Lucy::Search::QueryParser->new(
    schema => $schema,
    fields => [ 'searchstring' ],
);
my $query = $query_parser->parse( 'Halsema' );
my $searcher = Lucy::Search::IndexSearcher->new(
    index => $message_storage,
);
my $hits = $searcher->hits(
    query      => $query,
    offset     => 0,
    num_wanted => 10000,
);
say encode( 'utf8', "\n\tHits from the index:" );
while ( my $hit = $hits->next ) {
    say encode( 'utf8', "found hit on: " . $hit->{searchstring} );
}
# See what's in the lexicon
my $polyreader = Lucy::Index::IndexReader->open(
    index => $message_storage,
);
my $seg_readers = $polyreader->seg_readers;
say encode( 'utf8', "\n\tIndividual words in the lexicon:" );
foreach my $seg_reader ( @$seg_readers ) {
    my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
    my $lexicon = $lex_reader->lexicon( field => 'searchstring' );
    while ( $lexicon->next ) {
        say encode( 'utf8', $lexicon->get_term );
    }
}
---Lucy---------------------------------------------------------------------------------------------------------------