Also, is there a way to pass a null or no tokenizer when writing to the index 
the field "words" ?? I have no need for tokenizing the words and the exact 
query will always be known. 

To understand better the problem, when are performing words comparison in large 
number of text documents. Each word in each sentence is compared with the rest 
of the words in the other sentences. A similarity score is computed for each 
pair and stored in the index for fast retrieval in the future (computation of 
the score is resource intensive). What we used to do is construct a matrix and 
store the words in alphabetical order (for binary search) and then load the 
words when the program is launched. Due to the size of the files generated, the 
update was a real struggle.

Thus, we decided to use Lucene and store a score for each pair of words. 
Updates should be much easier and faster, however improving the search is 
something we're looking into. We are new to Lucene, and would appreciate any 
input in this regard. 

Knowing that the document would contain only two fields : score and words and 
that no tokenization is needed, what would be the most efficient way for 
implementing this index using Lucene ? 


--- On Sun, 11/2/08, semelak ss <[EMAIL PROTECTED]> wrote:

> From: semelak ss <[EMAIL PROTECTED]>
> Subject: Re: Exact Phrase Query
> To: java-user@lucene.apache.org
> Date: Sunday, November 2, 2008, 7:26 AM
> I was in a hurry when copying and pasting the code. What
> I've been using is only writer. RamWriter was never used
> as it never really worked (thanks to you, I now understand
> the reason).
> 
> The above is not really related to the problem I was
> facing. I modified my code so that an
> indexreader/indexwriter is opened right before the words
> comparison takes place and is closed right after. (currently
> not using RamDir due to the problems faced earlier)
> 
> Considering that the program is basically a loop that does
> thousands and thousands of comparison, this is definitely
> not the most efficient way of handling things.
> 
> I would appreciate any input in this regard on how to
> improve the efficiency. 
> 
> 
> 
> --- On Sat, 11/1/08, Erick Erickson
> <[EMAIL PROTECTED]> wrote:
> 
> > From: Erick Erickson <[EMAIL PROTECTED]>
> > Subject: Re: Exact Phrase Query
> > To: java-user@lucene.apache.org, [EMAIL PROTECTED]
> > Date: Saturday, November 1, 2008, 5:06 PM
> > ahhhh, finally. I'm almost completely sure you
> can't
> > *write* to a
> > RAMDirectory
> > and expect the underlying FSDir to be updated. The
> intent
> > of RAMDirectorys
> > is to *read* in an index from disk and keep it in
> memory.
> > Essentially I
> > believe
> > that your RAMDirecotry constructor is taking a
> snapshot of
> > the underlying
> > disk index, modifying that in-memory copy, and
> throwing it
> > away without
> > ever writing it to disk. I wouldn't expect opening
> the
> > FSDirectory after
> > writing
> > to the RAMDirectory to find anything. Ever.
> > 
> > If you really need the RAMDir, I suspect you'll
> have to
> > open an FS-based
> > writer as well as a RAM-based writer, and write to
> both
> > when necessary.
> > You'll probably also have to open/search your
> RAM-based
> > index as the
> > faster alternative to re-opening the FS-based index.
> Either
> > way, reopening
> > the index is probably expensive, are you sure you need
> to?
> > Is there a way
> > to keep your information in an internal data structure
> for
> > some period of
> > time?
> > 
> > Best
> > Erick
> > 
> > 
> > 
> > On Sat, Nov 1, 2008 at 6:31 PM, semelak ss
> > <[EMAIL PROTECTED]> wrote:
> > 
> > > I am not entirely sure if this can be the cause,
> but
> > here is something I
> > > thought might be related:
> > > The idea is have an index containing documents
> where
> > each document has a
> > > combination of two words : word1 and word2 and a
> score
> > for these two words.
> > > The index would be searched first if the two
> words
> > exist, and if not the
> > > score would be computed on the fly and then added
> to
> > the index. This process
> > > would be repeated thousands of times for
> thousands of
> > words.
> > >
> > > Hence, I have an indexwriter and a searcher
> > > --------------------
> > > RAMDirectory ramDir = new
> RAMDirectory(INDEX_DIR);
> > > IndexWriter  ramWriter = new IndexWriter(ramDir,
> new
> > WhitespaceAnalyzer(),
> > > true,IndexWriter.MaxFieldLength.UNLIMITED);
> > > writer = new IndexWriter(INDEX_DIR,new
> > WhitespaceAnalyzer(),true
> > > ,IndexWriter.MaxFieldLength.UNLIMITED);
> > >
> > > FSDirectory fsdir =
> > FSDirectory.getDirectory(INDEX_DIR);
> > > IndexReader ir = IndexReader.open(fsdir);
> > > _searcher = new IndexSearcher(ir);
> > > --------------------
> > >
> > > The indexWriter is closed near the end of the
> program
> > (it's open while
> > > searching for words combinations ).
> > >
> > > When using Luke,, I was able to search
> successfully
> > for exact phrases. My
> > > guess is that the problem I am facing has
> something to
> > do with the
> > > indexWriter, but I can not pinpoint the exact
> cause of
> > the problem.
> > >
> > >
> > > --- On Sat, 11/1/08, semelak ss
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: semelak ss
> <[EMAIL PROTECTED]>
> > > > Subject: Re: Exact Phrase Query
> > > > To: java-user@lucene.apache.org
> > > > Date: Saturday, November 1, 2008, 10:03 AM
> > > > When using Luke,, searching for the
> followings
> > gives me hits
> > > > now:
> > > > "insurer storm"
> > > > The synatx of the query as parsed by Luke is
> :
> > > > word:"insurer storm"
> > > >
> > > > The code I am using is as follows:
> > > > ----------------------
> > > > _searcher = new IndexSearcher(INDEX_DIR);
> > > > _parser = new QueryParser("word",
> new
> > > > WhitespaceAnalyzer());
> > > > Query q = _parser.parse(query);
> > > > System.out.println(q.toString()); // this
> outputs
> > ->
> > > > word:"insurer storm"
> > > > TopDocs vv= _searcher.search(q, 1);
> > > > Hits tmph = _searcher.search(q);
> > > > ---------------------------------
> > > >
> > > > both vv and tmph give no results (their size
> is
> > 0)
> > > >
> > > >
> > > >
> > > > --- On Fri, 10/31/08, semelak ss
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: semelak ss
> > <[EMAIL PROTECTED]>
> > > > > Subject: Re: Exact Phrase Query
> > > > > To: java-user@lucene.apache.org
> > > > > Date: Friday, October 31, 2008, 9:41 AM
> > > > > For indexing, I use the following:
> > > > > ===========
> > > > > writer = new IndexWriter(INDEX_DIR,new
> > > > > WhitespaceAnalyzer(),true
> > > > > ,IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > Document doc = new Document();
> > > > > String tmpword =
> this.getProperForm(word1,
> > word2);
> > > > > doc.add(new Field("WORDS",
> > tmpword,
> > > > > Field.Store.YES,
> Field.Index.TOKENIZED));
> > > > > doc.add(new Field("score",
> > > > Double.toString(score)
> > > > > , Field.Store.YES, Field.Index.NO));
> > > > > writer.addDocument(adoc);
> > > > > ============
> > > > >
> > > > > For searching,, I use the following
> (query =
> > > > > "homeowner work" gives no
> hits ,,
> > > > > "homeowner" gives results):
> > > > > ============
> > > > > _searcher = new
> IndexSearcher(INDEX_DIR);
> > > > > _parser = new
> QueryParser("WORDS",
> > new
> > > > > WhitespaceAnalyzer());
> > > > > q = _parser.parse(query);
> > > > > Hits tmph = _searcher.search(q);
> > > > >
> > > > > ============
> > > > >
> > > > > A sample document (contained in the
> index)
> > is the
> > > > > following:
> > > > >
> > > > > filed: value
> > > > > -----: -----
> > > > > WORDS:"homeowners work"
> > > > > score: 0.1515417
> > > > >
> > > > >
> > > > > Also, please note that I tried using
> Luke to
> > browse
> > > > the
> > > > > index and the fields seem to be filled
> out
> > with words
> > > > just
> > > > > as expected. Searching, however, with
> exact
> > phrases
> > > > yield no
> > > > > answer. Searching with single words
> gives
> > hits.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --- On Fri, 10/31/08, Erick Erickson
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Erick Erickson
> > > > <[EMAIL PROTECTED]>
> > > > > > Subject: Re: Exact Phrase Query
> > > > > > To: java-user@lucene.apache.org,
> > > > [EMAIL PROTECTED]
> > > > > > Date: Friday, October 31, 2008,
> 5:57 AM
> > > > > > You need to give us more
> information
> > for
> > > > meaningful
> > > > > replies,
> > > > > > like
> > > > > > the analyzers you use when
> indexing and
> > > > searching, the
> > > > > > exact
> > > > > > query you use, perhaps the
> snippets of
> > the code,
> > > > etc.
> > > > > >
> > > > > > That said, things to check:
> > > > > > Get a copy of Luke and examine
> your
> > index. You
> > > > can
> > > > > even
> > > > > > run queries through that tool and
> see
> > what gets
> > > > sent
> > > > > to the
> > > > > > database and what responses you
> get
> > with those
> > > > > analyzers.
> > > > > >
> > > > > > Make sure you're analyzers at
> query
> > and index
> > > > time
> > > > > are
> > > > > > doing
> > > > > > what you expect. Query.toString()
> is
> > your friend.
> > > > If
> > > > > you
> > > > > > don't
> > > > > > take the time to understand
> analyzers,
> > you'll
> > > > > spend
> > > > > > lots of time
> > > > > > spinning your wheels.
> > > > > >
> > > > > > And you really should wait more
> than 9
> > minutes
> > > > before
> > > > > > pinging
> > > > > > the list....
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 31, 2008 at 8:44 AM,
> > semelak ss
> > > > > > <[EMAIL PROTECTED]>
> wrote:
> > > > > >
> > > > > > > I have documents containing
> > multiple words
> > > > in the
> > > > > the
> > > > > > field "word"
> > > > > > > for example, one of the
> documents
> > contain in
> > > > the
> > > > > field
> > > > > > "word" the
> > > > > > > following:
> > > > > > > homeowners work
> > > > > > >
> > > > > > > When searching for single
> words
> > (i.e.
> > > > homewoners
> > > > > ) I
> > > > > > get hits.
> > > > > > >
> > > > > > > However, searching for the
> exact
> > phrase
> > > > > > "homeowners work" gives
> me no
> > > > > > > hits!! I use the double
> quotes
> > when
> > > > searching for
> > > > > > exact phrases.
> > > > > > >
> > > > > > > Any idea why ??
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> > > > > >
> [EMAIL PROTECTED]
> > > > > > > For additional commands,
> e-mail:
> > > > > > [EMAIL PROTECTED]
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail:
> > > > > [EMAIL PROTECTED]
> > > > > For additional commands, e-mail:
> > > > > [EMAIL PROTECTED]
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> >
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail:
> > > > [EMAIL PROTECTED]
> > > > For additional commands, e-mail:
> > > > [EMAIL PROTECTED]
> > >
> > >
> > >
> > >
> > >
> > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > >
> > >
> 
> 
>       
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]


      


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to