Re: How to index word-pairs and phrases

markharw00d Tue, 19 Feb 2008 00:44:44 -0800

Further to Grant's useful background - there is an analyzer specifically for multi-word terms in "contrib".

See Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
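For readers unfamiliar with the idea, a "shingle" is a token made of several adjacent words, so a word-pair index is simply an inverted index whose terms are two-word shingles. The sketch below illustrates this in plain Java with no Lucene dependency. It is not the code in the contrib package; the class WordPairIndex and its methods are hypothetical names chosen for this example, and the tokenization rule (lower-case, split on anything that is not a letter, digit, or apostrophe) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: a tiny in-memory word-pair index.
public class WordPairIndex {
    // term (a word pair such as "let it") -> sorted ids of the documents containing it
    private final Map<String, SortedSet<Integer>> postings = new LinkedHashMap<>();

    /** Lower-cases the text and splits it into words (a stand-in for an Analyzer). */
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}']+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Returns every run of `size` adjacent words, joined by a single space. */
    public static List<String> shingles(List<String> words, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + size)));
        }
        return out;
    }

    /** Indexes a document under each of its word pairs. */
    public void add(int docId, String text) {
        for (String pair : shingles(words(text), 2)) {
            postings.computeIfAbsent(pair, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of the documents that contain the given word pair. */
    public SortedSet<Integer> lookup(String pair) {
        return postings.getOrDefault(pair, new TreeSet<>());
    }

    public static void main(String[] args) {
        WordPairIndex index = new WordPairIndex();
        index.add(1, "Let it be, let it be");
        index.add(2, "Let it snow");
        System.out.println(shingles(words("Let it snow"), 2)); // [let it, it snow]
        System.out.println(index.lookup("let it"));            // [1, 2]
        System.out.println(index.lookup("it snow"));           // [2]
    }
}
```

Passing a larger size to shingles() gives longer phrases instead of pairs, at the cost of many more distinct terms in the index.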


Cheers
Mark

Hi Ghinwa,
A Term is simply a unit of tokenization that has been indexed for a Field, produced by a TokenStream. In the demo, on the main site, this can be seen in the file called IndexFiles.java on line 56:
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
The key is that the StandardAnalyzer is used to get a TokenStream. The TokenStream.next() method returns a Token. A Term is just a Token that has been indexed for a given Field. In a nutshell, though, a Term is whatever you want it to be, based on your Analyzer. It could be a whole document as a single string or it could be a string containing a single character. In reality, it usually corresponds to a single word.
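To make the point about granularity concrete, here is a small plain-Java sketch of three tokenization strategies applied to the same text. It does not use Lucene's Analyzer or TokenStream classes; the class TermGranularity and its methods are hypothetical stand-ins for what different analyzers might emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: the same text cut into terms three different ways.
public class TermGranularity {
    /** The whole input becomes one term. */
    public static List<String> wholeString(String text) {
        return Collections.singletonList(text);
    }

    /** One term per whitespace-delimited word. */
    public static List<String> perWord(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** One term per non-whitespace char (surrogate pairs are not handled in this sketch). */
    public static List<String> perCharacter(String text) {
        List<String> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "hey jude";
        System.out.println(wholeString(text));  // [hey jude]
        System.out.println(perWord(text));      // [hey, jude]
        System.out.println(perCharacter(text)); // [h, e, y, j, u, d, e]
    }
}
```

Whichever strategy is used at index time generally has to be used at query time as well, otherwise the query terms will not match the indexed ones.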
Have a look at the wiki, also, as there are many talks and articles explaining all of this. Lucene In Action, while outdated, is also an excellent book for explaining this stuff (with the exception that some of the code examples no longer work on the latest Lucene version).
Getting beyond that, you will need to look into adding spell checking (in Lucene's "contrib" area) and other features like stemming and synonyms to handle the various issues you bring up.
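As a rough illustration of how typos can be tolerated, one common building block is edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below is a textbook Levenshtein implementation in plain Java. It is offered as an assumption about the kind of measure a spell checker can use, not as a description of the contrib code, and the class name EditDistance is hypothetical.

```java
// Hypothetical illustration only: textbook Levenshtein distance with two rolling rows.
public class EditDistance {
    /** Returns the minimum number of single-char inserts, deletes, and substitutions. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("yesterday", "yestreday")); // 2
        System.out.println(levenshtein("cant", "cannot"));         // 2
    }
}
```

A query word could then be matched against indexed words within a small distance, such as 1 or 2, although comparing against every indexed word gets slow on a large vocabulary.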
Hope that helps,
Grant

On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
Hi,
I am new to Lucene and have been reading the documentation. I would like to use Lucene to query a song database by lyrics. The query could potentially contain typos, or even wrong words, word contractions (can't versus cannot), etc.
I would like to create an inverted list by word pairs and possibly phrases and not just by isolated words. For example:
<w1,w2>   <d1, d10, d27>
<w2,w3>   <d2, d13>
...

OR even
<phrase 1> <d1, d3,...>
<phrase 2> <...>
...
It seems to me that, by default, the index in Lucene stores statistics for isolated words. The Lucene documentation refers to the word "Term" all the time and seems to imply that "Term" can be a word or a phrase, but I can't see how IndexWriter can read a document and index it by word pairs.
Thank you in advance for the answers and my apologies if I did not get the terminology quite right.
-Ghinwa
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



