Mark, The ShingleFilter contrib has not been committed yet - it's still here:
https://issues.apache.org/jira/browse/LUCENE-400 Steve On 02/19/2008 at 2:33 AM, markharw00d wrote: > Further to Grant's useful background - there is an analyzer specifically > for multi-word terms in "contrib". See > Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle > > Cheers > Mark > > Hi Ghinwa, > > > > A Term is simply a unit of tokenization that has been indexed for a > > Field, produced by a TokenStream. In the demo, on the main site, > > this can be seen in the file called IndexFiles.java on line 56: > > IndexWriter writer = new IndexWriter(INDEX_DIR, new > > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); > > > > The key being that the StandardAnalyzer is used to get a TokenStream. > > The TokenStream.next() method returns a Token. All a Term is, is a > > Token that has been indexed for a given Field. In a nutshell, though, > > a Term is whatever you want it to be, based on your Analyzer. It could > > be a whole document as a single string or it could be a string > > containing a single character. In reality, it usually corresponds to a > > single word. > > > > Have a look at the wiki, also, as there are many talks and articles > > explaining all of this. Lucene In Action, while outdated, is also an > > excellent book for explaining this stuff (with the exception that some > > of the code examples no longer work on the latest Lucene version) > > > > Getting beyond that, you will need to look into adding spell checking > > (in Lucene's "contrib" area) and other features like stemming and > > synonyms to handle the various issues you bring up. > > > > Hope that helps, > > Grant > > > > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote: > > > > > Hi, > > > > > > I am new to Lucene and have been reading the documentation. I would > > > like to use Lucene to query a song database by lyrics. The query could > > > potentially contain typos, or even wrong words, word contractions > > > (can't versus cannot), etc.. > > > > > > I would like to create an inverted list by word pairs and possibly > > > phrases and not just by isolated words. For example: > > > <w1,w2> < d1, d10, d27> > > > <w2,w3> <d2, d13> > > > ... > > > > > > OR even > > > <phrase 1> <d1, d3,...> > > > <phrase 2> <...> > > > ... > > > > > > It seems to me that, by default, the index in Lucene stores statistics > > > for isolated words. The Lucene documentation refers to the word "Term" > > > all the time and seems to imply that "Term" can be a word or a phrase, > > > but I can't see how IndexWriter can read a document and index it by > > > word pairs. > > > > > > thank you in advance for the answers and my apologies if I did not > > > get the terminology quite right. > > > > > > -Ghinwa > > > > -------------------------- > > Grant Ingersoll > > http://lucene.grantingersoll.com > > http://www.lucenebootcamp.com > > > > Lucene Helpful Hints: > > http://wiki.apache.org/lucene-java/BasicsOfPerformance > > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] For > > additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > --------------------------------------------------------------------- To > unsubscribe, e-mail: [EMAIL PROTECTED] For > additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]