Hi Ghinwa, o.a.l.analysis.ngram.NgramTokenizer is a *character* level n-gram filter - from text "word1 word2 word3" you get tokens "wo", "or", "rd", "d1", etc.
The ShingleFilter gives you "word1 word2" and "word2 word3". Steve On 02/19/2008 at 11:28 AM, Ghinwa Choueiter wrote: > What about > \contrib\analyzers\src\java\org\apache\lucene\analysis\ngram ?? > > Does this tokenizer do what I need? > > thank you, > -Ghinwa > > On Tue, 19 Feb 2008, Steven A Rowe wrote: > > > Mark, > > > > The ShingleFilter contrib has not been committed yet - it's still here: > > > > https://issues.apache.org/jira/browse/LUCENE-400 > > > > Steve > > > > On 02/19/2008 at 2:33 AM, markharw00d wrote: > > > Further to Grant's useful background - there is an analyzer > > > specifically for multi-word terms in "contrib". See > > > > > > Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle > > > > > > Cheers > > > Mark > > > > Hi Ghinwa, > > > > > > > > A Term is simply a unit of tokenization that has been indexed for a > > > > Field, produced by a TokenStream. In the demo, on the main site, > > > > this can be seen in the file called IndexFiles.java on line 56: > > > > IndexWriter writer = new IndexWriter(INDEX_DIR, new > > > > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); > > > > > > > > The key being that the StandardAnalyzer is used to get a TokenStream. > > > > The TokenStream.next() method returns a Token. All a Term is, is a > > > > Token that has been indexed for a given Field. In a nutshell, > > > > though, a Term is whatever you want it to be, based on your Analyzer. > > > > It could be a whole document as a single string or it could be a > > > > string containing a single character. In reality, it usually > > > > corresponds to a single word. > > > > > > > > Have a look at the wiki, also, as there are many talks and articles > > > > explaining all of this. Lucene In Action, while outdated, is also an > > > > excellent book for explaining this stuff (with the exception that > > > > some of the code examples no longer work on the latest Lucene version) > > > > > > > > Getting beyond that, you will need to look into adding spell checking > > > > (in Lucene's "contrib" area) and other features like stemming and > > > > synonyms to handle the various issues you bring up. > > > > > > > > Hope that helps, > > > > Grant > > > > > > > > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote: > > > > > > > > > Hi, > > > > > > > > > > I am new to Lucene and have been reading the documentation. I would > > > > > like to use Lucene to query a song database by lyrics. The query > > > > > could potentially contain typos, or even wrong words, word > > > > > contractions (can't versus cannot), etc.. > > > > > > > > > > I would like to create an inverted list by word pairs and possibly > > > > > phrases and not just by isolated words. For example: <w1,w2> < d1, > > > > > d10, d27> <w2,w3> <d2, d13> ... > > > > > > > > > > OR even > > > > > <phrase 1> <d1, d3,...> > > > > > <phrase 2> <...> > > > > > ... > > > > > > > > > > It seems to me that, by default, the index in Lucene stores > > > > > statistics for isolated words. The Lucene documentation refers to > > > > > the word "Term" all the time and seems to imply that "Term" can be a > > > > > word or a phrase, but I can't see how IndexWriter can read a > > > > > document and index it by word pairs. > > > > > > > > > > thank you in advance for the answers and my apologies if I did not > > > > > get the terminology quite right. > > > > > > > > > > -Ghinwa > > > > > > > > -------------------------- > > > > Grant Ingersoll > > > > http://lucene.grantingersoll.com > > > > http://www.lucenebootcamp.com > > > > > > > > Lucene Helpful Hints: > > > > http://wiki.apache.org/lucene-java/BasicsOfPerformance > > > > http://wiki.apache.org/lucene-java/LuceneFAQ --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]