RE: How to index word-pairs and phrases

Steven A Rowe Tue, 19 Feb 2008 08:41:22 -0800

Hi Ghinwa,

o.a.l.analysis.ngram.NGramTokenizer is a *character*-level n-gram tokenizer: from
text "word1 word2 word3" you get tokens "wo", "or", "rd", "d1", etc.


The ShingleFilter, by contrast, works at the *word* level: from the same text it
gives you "word1 word2" and "word2 word3".
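
To make the difference concrete, here is a minimal plain-Java sketch of the two
ideas. It does not use Lucene at all: the class NGramVsShingle and the methods
charNGrams and shingles are names made up for this illustration, and the real
NGramTokenizer and ShingleFilter have their own options and output order that
this sketch does not try to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helper class, not part of Lucene.
public class NGramVsShingle {

    // Character-level n-grams: slide a window of n characters over the raw text.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Word-level shingles: slide a window of n whitespace-separated words.
    static List<String> shingles(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "word1 word2 word3";
        // Starts with "wo", "or", "rd", "d1", then continues across the spaces.
        System.out.println(charNGrams(text, 2));
        // The two word pairs: "word1 word2" and "word2 word3".
        System.out.println(shingles(text, 2));
    }
}
```

For an inverted list keyed by word pairs, the second method is the shape of
token you want; the first one is more useful for things like fuzzy matching on
misspelled words.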

Steve

On 02/19/2008 at 11:28 AM, Ghinwa Choueiter wrote:
> What about
> \contrib\analyzers\src\java\org\apache\lucene\analysis\ngram  ??
> 
> Does this tokenizer do what I need?
> 
> thank you,
> -Ghinwa
> 
> On Tue, 19 Feb 2008, Steven A Rowe wrote:
> 
> > Mark,
> > 
> > The ShingleFilter contrib has not been committed yet - it's still here:
> > 
> >   https://issues.apache.org/jira/browse/LUCENE-400
> > 
> > Steve
> > 
> > On 02/19/2008 at 2:33 AM, markharw00d wrote:
> > > Further to Grant's useful background - there is an analyzer
> > > specifically for multi-word terms in "contrib". See
> > > 
> > > Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
> > > 
> > > Cheers
> > > Mark
> > > > Hi Ghinwa,
> > > > 
> > > > A Term is simply a unit of tokenization that has been indexed for a
> > > > Field, produced by a TokenStream.   In the demo, on the main site,
> > > > this can be seen in the file called IndexFiles.java on line 56:
> > > > IndexWriter writer = new IndexWriter(INDEX_DIR, new
> > > > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
> > > > 
> > > > > The key point is that the StandardAnalyzer is used to get a TokenStream.
> > > > The TokenStream.next() method returns a Token.  All a Term is, is a
> > > > Token that has been indexed for a given Field.  In a nutshell,
> > > > though, a Term is whatever you want it to be, based on your Analyzer.
> > > >  It could be a whole document as a single string or it could be a
> > > > string containing a single character.  In reality, it usually
> > > > corresponds to a single word.
> > > > 
> > > > > Have a look at the wiki, also, as there are many talks and articles
> > > > > explaining all of this.  Lucene in Action, while outdated, is also an
> > > > > excellent book for explaining this stuff (with the exception that
> > > > > some of the code examples no longer work on the latest Lucene version).
> > > > 
> > > > Getting beyond that, you will need to look into adding spell checking
> > > > (in Lucene's "contrib" area) and other features like stemming and
> > > > synonyms to handle the various issues you bring up.
> > > > 
> > > > Hope that helps,
> > > > Grant
> > > > 
> > > > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I am new to Lucene and have been reading the documentation. I would
> > > > > like to use Lucene to query a song database by lyrics. The query
> > > > > could potentially contain typos, or even wrong words, word
> > > > > > contractions (can't versus cannot), etc.
> > > > > 
> > > > > I would like to create an inverted list by word pairs and possibly
> > > > > > phrases and not just by isolated words. For example:
> > > > > > <w1,w2>   <d1, d10, d27>
> > > > > > <w2,w3>   <d2, d13>
> > > > > > ...
> > > > > 
> > > > > OR even
> > > > > <phrase 1> <d1, d3,...>
> > > > > <phrase 2> <...>
> > > > > ...
> > > > > 
> > > > > It seems to me that, by default, the index in Lucene stores
> > > > > statistics for isolated words. The Lucene documentation refers to
> > > > > the word "Term" all the time and seems to imply that "Term" can be a
> > > > > word or a phrase, but I can't see how IndexWriter can read a
> > > > > document and index it by word pairs.
> > > > > 
> > > > > thank you in advance for the answers and my apologies if I did not
> > > > > get the terminology quite right.
> > > > > 
> > > > > -Ghinwa
> > > > 
> > > > --------------------------
> > > > Grant Ingersoll
> > > > http://lucene.grantingersoll.com
> > > > http://www.lucenebootcamp.com
> > > > 
> > > > Lucene Helpful Hints:
> > > > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > > > http://wiki.apache.org/lucene-java/LuceneFAQ


