On 09/03/2013 07:33 PM, David Miranda wrote:

Is there any way to check the similarity of texts with Lucene? I have DBpedia indexed and want to find the DBpedia abstracts that are most similar to another text. If I search the abstract field with a particular text, the result is not very satisfactory. E.g.:

DBpedia abstract: "SoundCloud is an online audio distribution platform which allows collaboration, promotion and distribution of audio recordings."

My text: "Private Track From DJ Sneak. Download the track now in the SoundCloud website."
You are attacking an extremely hard problem here: searching short documents with a long query. This creates a lot of problems; for example, the document frequency of a term ends up at the same magnitude as the term's own frequency, which instantly kills some similarity measures.

All you can do is experiment a lot with different similarity measures and preprocessing steps. The similarity measures are the simple part: just try them all for each preprocessing combination.

Suggestions for preprocessing steps:
- remove all stop words
- remove all function words (you can find lists of them on Wikipedia)
- boost all uppercase words or words containing at least one uppercase letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
- break the search text into sentences, then search the index for each sentence (combine the results using Borda count or something similar)
- do what Koji suggested

Regards,
Ivan Krišto
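P.S. The sentence-by-sentence suggestion with Borda count could be sketched roughly as below. This is only an illustration in Python, not Lucene code: `search` is a hypothetical stand-in for whatever function runs one query against the index and returns document ids in ranked order, and the function names, the naive sentence splitter and the `top_k` cutoff are all assumptions made for the example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def borda_fuse(rankings, depth=None):
    """Merge ranked lists of document ids with a Borda count.

    A document at position pos (0-based) in a list earns depth - pos points;
    depth defaults to the length of the longest list.
    """
    n = depth or max((len(r) for r in rankings), default=0)
    scores = defaultdict(int)
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking[:n]):
            scores[doc_id] += n - pos
    # Highest total first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def search_by_sentence(text, search, top_k=10):
    """Run one query per sentence and fuse the ranked results.

    `search` is a hypothetical callable: query string -> ranked list of ids.
    """
    rankings = [search(sentence)[:top_k] for sentence in split_sentences(text)]
    return borda_fuse(rankings, depth=top_k)
```

With this scheme a document that shows up for several sentences tends to outrank one that matches only a single sentence very well, which may or may not be what you want; it is one more knob to experiment with.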