RE: How to index word-pairs and phrases

Steven A Rowe Tue, 19 Feb 2008 08:06:55 -0800

Mark,

The ShingleFilter contrib has not been committed yet - it's still here:


   https://issues.apache.org/jira/browse/LUCENE-400

Steve

On 02/19/2008 at 2:33 AM, markharw00d wrote:
> Further to Grant's useful background - there is an analyzer specifically
> for multi-word terms in "contrib". See
> Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
> 
> Cheers
> Mark
> > Hi Ghinwa,
> > 
> > A Term is simply a unit of tokenization that has been indexed for a
> > Field, produced by a TokenStream.   In the demo, on the main site,
> > this can be seen in the file called IndexFiles.java on line 56:
> > IndexWriter writer = new IndexWriter(INDEX_DIR, new
> > StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
> > 
> > The key being that the StandardAnalyzer is used to get a TokenStream.
> > The TokenStream.next() method returns a Token.  All a Term is, is a
> > Token that has been indexed for a given Field.  In a nutshell, though,
> > a Term is whatever you want it to be, based on your Analyzer.  It could
> > be a whole document as a single string or it could be a string
> > containing a single character.  In reality, it usually corresponds to a
> > single word.
> > 
> > Have a look at the wiki, also, as there are many talks and articles
> > explaining all of this.  Lucene In Action, while outdated, is also an
> > excellent book for explaining this stuff (with the exception that some
> > of the code examples no longer work on the latest Lucene version)
> > 
> > Getting beyond that, you will need to look into adding spell checking
> > (in Lucene's "contrib" area) and other features like stemming and
> > synonyms to handle the various issues you bring up.
> > 
> > Hope that helps,
> > Grant
> > 
> > On Feb 18, 2008, at 7:36 PM, Ghinwa Choueiter wrote:
> > 
> > > Hi,
> > > 
> > > I am new to Lucene and have been reading the documentation. I would
> > > like to use Lucene to query a song database by lyrics. The query could
> > > potentially contain typos, or even wrong words, word contractions
> > > (can't versus cannot), etc..
> > > 
> > > I would like to create an inverted list by word pairs and possibly
> > > phrases and not just by isolated words. For example:
> > > <w1,w2>   < d1, d10, d27>
> > > <w2,w3>   <d2, d13>
> > > ...
> > > 
> > > OR even
> > > <phrase 1> <d1, d3,...>
> > > <phrase 2> <...>
> > > ...
> > > 
> > > It seems to me that, by default, the index in Lucene stores statistics
> > > for isolated words. The Lucene documentation refers to the word "Term"
> > > all the time and seems to imply that "Term" can be a word or a phrase,
> > > but I can't see how IndexWriter can read a document and index it by
> > > word pairs.
> > > 
> > > thank you in advance for the answers and my apologies if I did not
> > > get the terminology quite right.
> > > 
> > > -Ghinwa
> > 
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> > http://www.lucenebootcamp.com
> > 
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> > 
> > 
> > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED] For
> > additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> > 
> 
> 
> 
> --------------------------------------------------------------------- To
> unsubscribe, e-mail: [EMAIL PROTECTED] For
> additional commands, e-mail: [EMAIL PROTECTED]
> 
>

 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: How to index word-pairs and phrases

Reply via email to