roman created LUCENE-4499:
-----------------------------

             Summary: Multi-word synonym filter (synonym expansion at indexing time).
                 Key: LUCENE-4499
                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/other
    Affects Versions: 4.1, 5.0
            Reporter: roman
            Priority: Minor
             Fix For: 5.0


I apologize for bringing up multi-token synonym expansion again. There is 
an old, unresolved issue at LUCENE-1622 [1].

While solving the problem for our needs [2], I discovered that the current 
SolrSynonym parser (and the wonderful FST) already provide almost everything 
needed to handle synonym expansion satisfactorily at both query and index 
time. It seems that people often need to use the synonym filter *slightly* 
differently at indexing and query time.

In our case, we must do different things during indexing and querying.

Example sentence: Mirrors of the Hubble space telescope pointed at XA5

This is what we need (a comma marks a position bump):
 
  indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5
  querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5)
  

This translates to the following needs:
  indexing time: 
    single-token synonyms => return only synonyms
    multi-token synonyms => return original tokens AND the synonyms
 
We need the original tokens for proximity queries: if we indexed 'hubble 
space telescope' as a single token, we could not search for 
'hubble NEAR telescope'.

  query time:
    single-token: return only its synonyms (but preserve case)
    multi-token: return only synonyms
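
The rules above can be sketched in a few lines of plain Python. This is only 
an illustration of the expected token output, not the patch and not the Lucene 
API: the synonym table, the stopword list, and the names expand() and render() 
are all assumptions made up for this example, and case preservation at query 
time is not modeled.

```python
# Hypothetical stopword list and synonym table, invented for this sketch.
STOPWORDS = {"of", "the", "at"}
SYNONYMS = {
    ("hubble", "space", "telescope"): ["hubble space telescope", "hst"],
    ("xa5",): ["xa5", "astroobject#5"],
}


def expand(text, keep_originals):
    """Return a list of positions; each position is a list of stacked terms.

    keep_originals=True  -> indexing rules (multi-token matches keep the
                            original tokens and add the synonyms)
    keep_originals=False -> query rules (multi-token matches return only
                            the synonyms)
    Single-token matches return only their synonyms in both modes.
    """
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    max_len = max(len(key) for key in SYNONYMS)
    positions = []
    i = 0
    while i < len(tokens):
        # Greedy longest match starting at token i.
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                match = key
                break
        if match is None:
            positions.append([tokens[i]])
            i += 1
        elif len(match) == 1:
            positions.append(list(SYNONYMS[match]))
            i += 1
        else:
            if keep_originals:
                # Stack the synonyms on the first original token, then keep
                # the remaining originals at their own positions.
                positions.append([match[0]] + SYNONYMS[match])
                positions.extend([t] for t in match[1:])
            else:
                positions.append(list(SYNONYMS[match]))
            i += len(match)
    return positions


def render(positions):
    """Comma marks a position bump, '|' separates terms at one position."""
    return ",".join("|".join(p) for p in positions)


SENTENCE = "Mirrors of the Hubble space telescope pointed at XA5"
print("indexing:", render(expand(SENTENCE, keep_originals=True)))
print("querying:", render(expand(SENTENCE, keep_originals=False)))
```

With keep_originals=True, 'hubble', 'space' and 'telescope' keep their own 
positions, which is what makes a proximity search like 'hubble NEAR telescope' 
possible against the index.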



You may (not) be surprised, but Lucene already supports ALL these requirements. 
The patch is an attempt to state the problem differently. I am not sure if it 
is the best option; however, it works perfectly for our needs and it seems it 
could work for the general public too, especially if the SynonymFilterFactory 
had preconfigured sets of SynonymMapBuilders, so that people could just choose 
the one that fits their situation.


links:
[1] https://issues.apache.org/jira/browse/LUCENE-1622
[2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
[3] a similar request: 
http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html



