First, start with Solr: use the edismax query parser with the default query operator set to "OR", set pf, pf2, and pf3, and then simply query with the raw text of the paragraph. The results will be ordered by how closely the indexed paragraphs match the query paragraph.
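
As a rough illustration of the request described above, here is a minimal sketch that assembles the edismax parameters into a URL query string using only the Java standard library. The field name "text", the host, and the core name "collection1" are assumptions for the example, not something from your setup; defType, q.op, qf, pf, pf2, pf3, fl and q are the standard Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class EdismaxParagraphQuery {

    // Builds the query-string part of a Solr select request.
    // "field" is the (assumed) indexed field that holds the paragraph text.
    static String buildQueryString(String paragraph, String field) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax"); // use the extended dismax parser
        params.put("q.op", "OR");         // terms are optional, not required
        params.put("qf", field);          // field searched for individual terms
        params.put("pf", field);          // boost docs matching the whole phrase
        params.put("pf2", field);         // boost docs matching word pairs
        params.put("pf3", field);         // boost docs matching word triples
        params.put("fl", "id,score");     // return the score with each hit
        params.put("q", paragraph);       // the raw paragraph text is the query

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Hypothetical host and core name; adjust to your installation.
        String base = "http://localhost:8983/solr/collection1/select?";
        System.out.println(base + buildQueryString("The quick brown fox", "text"));
    }
}
```

For long paragraphs it is probably safer to send these parameters in a POST body rather than in the URL, since the query text can get large.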

This is also a good technique for detecting plagiarism, where much of the text is similar if not identical.

Once you have some experience with this technique in Solr, look at the parsed query that edismax generates (add debugQuery=true to the request to see it) and build the same query in your Lucene Java code.
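
To give an idea of the shape of that query, here is a simplified sketch in plain Java that produces the same kinds of clauses: the individual terms as optional clauses, plus phrase clauses for the whole text (pf), every word pair (pf2), and every word triple (pf3). It is not the exact query edismax generates, and the tokenizer here (lowercase, split on non-alphanumerics) is only a stand-in for whatever analyzer your field uses. The resulting string is in Lucene query syntax, so it should be usable with the classic QueryParser, or you can build the equivalent BooleanQuery and PhraseQuery objects directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShingleQuerySketch {

    // Stand-in tokenizer (an assumption): lowercase, split on anything
    // that is not a letter or digit. A real analyzer does more than this.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns every run of n consecutive tokens as a quoted phrase.
    static List<String> shingles(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add("\"" + String.join(" ", tokens.subList(i, i + n)) + "\"");
        }
        return out;
    }

    // Optional term clauses, then the whole-text phrase (like pf),
    // then word pairs (like pf2) and word triples (like pf3).
    static String buildQuery(String paragraph) {
        List<String> tokens = tokenize(paragraph);
        List<String> clauses = new ArrayList<>(tokens);
        if (tokens.size() > 1) {
            clauses.addAll(shingles(tokens, tokens.size()));
        }
        clauses.addAll(shingles(tokens, 2));
        clauses.addAll(shingles(tokens, 3));
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("The quick brown fox."));
    }
}
```

With the default OR operator every clause is optional, so a document scores higher the more terms, pairs, and triples it shares with the query paragraph.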

-- Jack Krupansky

-----Original Message----- From: Malgorzata Urbanska
Sent: Friday, June 14, 2013 12:23 PM
To: java-user@lucene.apache.org
Subject: compare paragraphs of text - which Query Class to use?

Hello,

I've just started using Lucene and I'm not sure which Query Classes I
should use in my project.

My goal is to compare paragraphs of text. Paragraph A is a query and
paragraph B is a document for which I would like to calculate similarity
score.

Paragraphs A and B may or may not be exactly the same.
Generally, I would like to check whether they talk about the same topic.

In my project I have a set of paragraphs A and a set of paragraphs B, so I'm
looking for a universal solution that allows me to compute a similarity
score for each paragraph A against all paragraphs B.

Do you have any suggestions? I really appreciate all of the ideas.

--
gosia
