Thanks to all; I will take your suggestions into account. I think I should have given the concrete use case, though. Going back to my first example: I have an email received by a user, and from that email I extract topics of interest to associate with DBpedia terms (basically DBpedia documents). The problem is that a term such as Apple may be a fruit or a company (Apple Computers). To disambiguate, I wanted to compare each candidate's DBpedia abstract with the text of the email to find out which term is the best one to choose.
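
A minimal sketch of that comparison, assuming a plain bag-of-words cosine similarity computed outside Lucene (this is not Lucene's scoring, only an illustration of the idea). The class name, the method names, and the two candidate abstracts are hypothetical placeholders, not real DBpedia data:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class AbstractDisambiguator {

    // Lowercase the text, split on anything that is not a letter or digit,
    // and count how often each token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two term-frequency vectors (0.0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Return the candidate label whose abstract is most similar to the email text.
    static String bestCandidate(String emailText, Map<String, String> candidateAbstracts) {
        Map<String, Integer> emailTf = termFrequencies(emailText);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> candidate : candidateAbstracts.entrySet()) {
            double score = cosine(emailTf, termFrequencies(candidate.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = candidate.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented abstracts, for illustration only.
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("Apple (fruit)",
                "The apple is a sweet edible fruit produced by an apple tree.");
        candidates.put("Apple Inc.",
                "Apple is a technology company that designs computers, phones and software.");
        String email = "The company announced new computers and a software update for its phones.";
        System.out.println(bestCandidate(email, candidates));
    }
}
```

In this toy input the company sense wins because it shares more content words with the email. Words such as "the" and "a" still contribute to both scores, which is one reason removing stop words before comparing is likely to help.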
Thanks.

2013/9/4 Allison, Timothy B. <talli...@mitre.org>:
> I agree with Ivan and Koji. You also might want to look into MoreLikeThis,
> which should take care of finding the highest tf*idf terms for you to use in
> your query --
> http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
>
> Best,
>
> Tim
>
> ________________________________________
> From: Ivan Krišto [ivan.kri...@gmail.com]
> Sent: Wednesday, September 04, 2013 3:17 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene Text Similarity
>
> On 09/03/2013 07:33 PM, David Miranda wrote:
>
> Is there any way to check the similarity of texts with Lucene? I have
> DBpedia indexed and want to find the DBpedia abstracts that are most
> similar to another text. If I search the abstract field with a particular
> text, the result is not very satisfactory. E.g., DBpedia abstract:
> "SoundCloud is an online audio distribution platform which allows
> collaboration, promotion and distribution of audio recordings." My text:
> "Private Track From DJ Sneak. Download the track now in the SoundCloud
> website."
>
>
> You are attacking an extremely hard problem here -- searching short
> documents with a long query. This creates a lot of problems, such as
> putting the document frequency of a term at the same magnitude as its own
> frequency, which instantly kills some similarity measures.
>
> All you can do is experiment a lot with different similarity measures
> and preprocessing steps.
>
> Similarity measures are simple; just try them all for each preprocessing
> combination.
>
> Suggestions of preprocessing steps:
> - remove all stop words
> - remove all function words (you can find a list of them on Wikipedia)
> - boost all uppercase words or words containing at least one uppercase
>   letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
> - break the search text into sentences, then search the index for each
>   sentence (combine the results using a Borda count or something similar)
> - do what Koji suggested
>
> Regards,
> Ivan Krišto
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
Regards,
David Miranda
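
The per-sentence suggestion above (search the index once per sentence, then merge the result lists with a Borda count) could be sketched roughly as follows. This assumes each per-sentence search has already produced a best-first list of document ids; the class name, the method name, and the ids are hypothetical, and the scoring rule (a document at position i in a list of length n earns n - i points) is only one common Borda variant:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BordaCount {

    // Merge several best-first rankings into one. A document at position i
    // (0-based) in a ranking of length n earns n - i points; documents that do
    // not appear in a ranking earn nothing from it. Ties are broken by id so
    // that the output order is deterministic.
    static List<String> combine(List<List<String>> rankings) {
        Map<String, Integer> points = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int i = 0; i < n; i++) {
                points.merge(ranking.get(i), n - i, Integer::sum);
            }
        }
        List<String> merged = new ArrayList<>(points.keySet());
        merged.sort((x, y) -> {
            int byPoints = Integer.compare(points.get(y), points.get(x));
            return byPoints != 0 ? byPoints : x.compareTo(y);
        });
        return merged;
    }

    public static void main(String[] args) {
        // One hypothetical ranking per sentence of the search text.
        List<List<String>> perSentence = Arrays.asList(
                Arrays.asList("SoundCloud", "Podcast", "Disc_jockey"),
                Arrays.asList("Disc_jockey", "SoundCloud"),
                Arrays.asList("SoundCloud", "Website"));
        System.out.println(combine(perSentence));
    }
}
```

Because points depend on each list's length, longer result lists carry more weight here; truncating every per-sentence result to the same depth before merging would avoid that.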