Re: Is there a way for me to handle a multiword synonym correctly?

Matthew Hall Fri, 07 Aug 2009 07:43:19 -0700

Create a field specifically for this type of match.

At indexing time you can then manipulate your data in such a way that it can be matched regardless of punctuation.

In this field you would convert all non-letter characters into spaces and collapse every run of white space into a single space (so "   " becomes " "). You could also lowercase it at the same time.

Then at search time, perform a special search against this field that applies the same transformation to the query string. At that point plain old phrase queries should work for you.
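A minimal sketch of that normalization step, in plain Java with no Lucene calls. The class and method names (PunctuationNormalizer, normalize) are made up for illustration, and trimming the leading and trailing space is my own assumption rather than part of the recipe above. The idea is to run the same method over the text going into the dedicated field and over the query string before building the phrase query.

```java
import java.util.Locale;

// Hypothetical helper (not part of Lucene): applies the same clean-up
// to indexed text and to query strings so the two can be compared.
final class PunctuationNormalizer {
    private PunctuationNormalizer() {}

    static String normalize(String text) {
        // 1. Turn every non-letter character (punctuation, digits, symbols) into a space.
        String lettersOnly = text.replaceAll("[^\\p{L}]", " ");
        // 2. Collapse runs of white space into one space; trimming the ends is an added assumption.
        String collapsed = lettersOnly.replaceAll("\\s+", " ").trim();
        // 3. Lowercase so matching is case-insensitive.
        return collapsed.toLowerCase(Locale.ROOT);
    }
}
```

With this, both SAP.EM.FIN.AM and "SAP EM FIN AM" normalize to the same string, so once the field is tokenized on white space each word becomes its own term and a slop-0 phrase query can match either form. Note that digits are dropped too, since the recipe says "non-letter"; widen the character class if numbers matter in your data.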


Our corpus contains remarkably obnoxious items in it like: Rara<^tm3.1Ipc>

So we need to do much the same thing you are describing, and the technique above worked like a charm.


Matt

Donna L Gresh wrote:

I saw some discussion on the board but I'm not sure I've got quite the same problem. As an example, I have a query that might be a technical skill:
SAP EM FIN AM
I would like that to match a document that has *either* SAP.EM.FIN.AM or "SAP EM FIN AM" (in that order and all together, not spread out through the document).
The approach I had tried was at index time if I saw SAP.EM.FIN.AM I would consider "SAP EM FIN AM" a synonym for it, using the Lucene in Action example. Luke shows me that I have two terms in the index for this document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears differently in the index than if it had been organically found as just the string of tokens, in which case there would be separate terms for SAP, EM, and so on.
At query time if I look for "SAP EM FIN AM" it is formed as a phrase query with a slop of 0 which does *not* match the one term version "SAP EM FIN AM". (For that matter a simple boolean query doesn't find it either.) Luke confirms the fact that the phrase query does not find my synonym term. The query "SAP EM FIN AM" finds *only* documents that originally had those separated tokens in them.
Is there a way to handle this situation such that at index time I can turn SAP.EM.FIN.AM into something that will be found with a query for "SAP EM FIN AM"?
Thanks for any pointers
Donna



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



