Very helpful and clear reply. Thanks a lot, Peter.
On Monday, March 3, 2014 2:22 PM, Peter Karman <[email protected]> wrote:

On 3/2/14 4:18 PM, Anil Pachuri wrote:
> Hi there,
>
> How should one handle synonym terms in Lucy? I wonder if expanding
> the query (e.g. terms separated by 'OR') is the best way to do this.
> Is there a built-in function/sample code available in Lucy that shows
> how to handle synonym terms at the index level? Please advise.
>

As you allude, there are two ways to solve the problem: at index time,
or at search time. There are trade-offs to both; I prefer to do as much
at index time as possible, for a couple of reasons.

One, stuffing the index with extra data at index time means the
search-time code doesn't have to work harder (running a long OR'd
string, e.g.).

Two, it makes debugging easier IME, because standard searching code
gets the same results as customized searching code. E.g., you can dump
a lexicon to see exactly what is in the index, synonyms included.

OTOH, see the caveats below.

I don't know of any examples in the wild for doing this at index time,
but I imagine something like this would work:

    my %doc   = get_doc_to_index();
    my @terms = get_terms_from_doc(\%doc);  # should analyze like Lucy does
    my %synonyms;
    for my $term (@terms) {
        for my $syn (get_synonyms($term)) {
            $synonyms{$syn}++;  # hash keys avoid duplicates
        }
    }
    # make sure your schema has a 'synonyms' field defined
    $doc{synonyms} = join ' ', keys %synonyms;
    add_to_indexer(\%doc);

The caveats here (and anytime you do this at index-time) include:

 * snipping/highlighting will be strange, since a match in the
   'synonyms' field will have zero context.

 * you're increasing the size of your index with content that doesn't
   actually exist in your document corpus. That can have unforeseen
   usability impact, depending on your application.

 * the 'synonyms' field is "virtual" or "private" so you'll have to
   decide whether you want to expose it as part of your public
   interface or not.
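The get_synonyms() call in the index-time snippet is left undefined. One way it might look is a plain in-memory lookup, sketched below. The synonym groups, the %synonyms_for table and the lowercasing step are all assumptions made for illustration and are not part of Lucy. A real version would probably load its data from a thesaurus file and normalize terms the same way the index analyzer does.

```perl
use strict;
use warnings;

# Hypothetical synonym groups: every term in a group is treated as
# equivalent to the others. Illustrative data only.
my @groups = (
    [qw( car automobile auto )],
    [qw( quick fast rapid )],
);

# Build a lookup from each term to the other members of its group.
my %synonyms_for;
for my $group (@groups) {
    for my $term (@$group) {
        push @{ $synonyms_for{$term} }, grep { $_ ne $term } @$group;
    }
}

# Return the synonyms for a term (lowercased first), or an empty list
# when the term is not in the table.
sub get_synonyms {
    my ($term) = @_;
    return @{ $synonyms_for{ lc $term } || [] };
}

print join( ' ', get_synonyms('car') ), "\n";    # prints "automobile auto"
```

Returning an empty list for unknown terms means the inner loop in the indexing snippet simply does nothing for them, so no special-casing is needed there.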
Otherwise, if you do this at search-time with query expansion, I would
expect a small (maybe not measurable) performance hit and more
complicated search code. You could use the Search::Query term_expander
feature[0].

    my $parser = Search::Query->parser(
        dialect       => 'Lucy',
        term_expander => sub {
            my ($term, $field) = @_;
            return ($term) if ref $term;    # skip ranges
            return ( get_array_of_synonyms_for_term($term), $term );
        },
    );
    my $query      = $parser->parse($str);
    my $lucy_query = $query->as_lucy_query();
    my $hits       = $lucy_searcher->hits( query => $lucy_query );

A third way to approach the problem, though it doesn't directly answer
the question you posed, is to treat the synonyms as 'suggestions' for
further searches, rather than searching for them automatically.
Something like LucyX::Suggester[1] could be extended to include
synonyms in addition to spellings.

[0] https://metacpan.org/pod/Search::Query::Parser#term_expander
[1] https://metacpan.org/pod/LucyX::Suggester

--
Peter Karman . http://peknet.com/ . [email protected]
