Hello Erick, If it weren't for your help and kind response, I would be struggling now with the initial problem I had. The solution to that problem turned out to be the one you mentioned in your response (indexwriters/indexreaders both being opened at the same time).
The problem I mentioned in my last response is different from the initial question I posted. It's really a request for thoughts and people inputs on how to improve searching given the structure of the data described in my last response. Again, I appreciate your help (and I am not saying this because I am looking forward to your response.) --- On Sun, 11/2/08, Erick Erickson <[EMAIL PROTECTED]> wrote: > From: Erick Erickson <[EMAIL PROTECTED]> > Subject: Re: Exact Phrase Query > To: java-user@lucene.apache.org, [EMAIL PROTECTED] > Date: Sunday, November 2, 2008, 12:11 PM > Sorry, but I've really run out of patience here. You > have consistently > stated only > part of the problem, never posting enough information to > allow me to answer > helpfully. You haven't even taken the time to proofread > your posts, which > has wasted my (limited, volunteer) time. > > In the future, please consider the fact that people trying > to help with your > > problem are volunteering their time and respect that fact > by making a > greater effort to make it easy and efficient for us to help > with what is, > after all, *your* problem. > > Best > Erick > > On Sun, Nov 2, 2008 at 11:03 AM, semelak ss > <[EMAIL PROTECTED]> wrote: > > > Also, is there a way to pass a null or no tokenizer > when writing to the > > index the field "words" ?? I have no need > for tokenizing the words and the > > exact query will always be known. > > > > To understand better the problem, when are performing > words comparison in > > large number of text documents. Each word in each > sentence is compared with > > the rest of the words in the other sentences. A > similarity score is computed > > for each pair and stored in the index for fast > retrieval in the future > > (computation of the score is resource intensive). What > we used to do is > > construct a matrix and store the words in alphabetical > order (for binary > > search) and then load the words when the program is > launched. Due to the > > size of the files generated, the update was a real > struggle. > > > > Thus, we decided to use Lucene and store a score for > each pair of words. > > Updates should be much easier and faster, however > improving the search is > > something we're looking into. We are new to > Lucene, and would appreciate any > > input in this regard. > > > > Knowing that the document would contain only two > fields : score and words > > and that no tokenization is needed, what would be the > most efficient way for > > implementing this index using Lucene ? > > > > > > --- On Sun, 11/2/08, semelak ss > <[EMAIL PROTECTED]> wrote: > > > > > From: semelak ss <[EMAIL PROTECTED]> > > > Subject: Re: Exact Phrase Query > > > To: java-user@lucene.apache.org > > > Date: Sunday, November 2, 2008, 7:26 AM > > > I was in a hurry when copying and pasting the > code. What > > > I've been using is only writer. RamWriter was > never used > > > as it never really worked (thanks to you, I now > understand > > > the reason). > > > > > > The above is not really related to the problem I > was > > > facing. I modified my code so that an > > > indexreader/indexwriter is opened right before > the words > > > comparison takes place and is closed right after. > (currently > > > not using RamDir due to the problems faced > earlier) > > > > > > Considering that the program is basically a loop > that does > > > thousands and thousands of comparison, this is > definitely > > > not the most efficient way of handling things. > > > > > > I would appreciate any input in this regard on > how to > > > improve the efficiency. > > > > > > > > > > > > --- On Sat, 11/1/08, Erick Erickson > > > <[EMAIL PROTECTED]> wrote: > > > > > > > From: Erick Erickson > <[EMAIL PROTECTED]> > > > > Subject: Re: Exact Phrase Query > > > > To: java-user@lucene.apache.org, > [EMAIL PROTECTED] > > > > Date: Saturday, November 1, 2008, 5:06 PM > > > > ahhhh, finally. I'm almost completely > sure you > > > can't > > > > *write* to a > > > > RAMDirectory > > > > and expect the underlying FSDir to be > updated. The > > > intent > > > > of RAMDirectorys > > > > is to *read* in an index from disk and keep > it in > > > memory. > > > > Essentially I > > > > believe > > > > that your RAMDirecotry constructor is taking > a > > > snapshot of > > > > the underlying > > > > disk index, modifying that in-memory copy, > and > > > throwing it > > > > away without > > > > ever writing it to disk. I wouldn't > expect opening > > > the > > > > FSDirectory after > > > > writing > > > > to the RAMDirectory to find anything. Ever. > > > > > > > > If you really need the RAMDir, I suspect > you'll > > > have to > > > > open an FS-based > > > > writer as well as a RAM-based writer, and > write to > > > both > > > > when necessary. > > > > You'll probably also have to open/search > your > > > RAM-based > > > > index as the > > > > faster alternative to re-opening the > FS-based index. > > > Either > > > > way, reopening > > > > the index is probably expensive, are you > sure you need > > > to? > > > > Is there a way > > > > to keep your information in an internal data > structure > > > for > > > > some period of > > > > time? > > > > > > > > Best > > > > Erick > > > > > > > > > > > > > > > > On Sat, Nov 1, 2008 at 6:31 PM, semelak ss > > > > <[EMAIL PROTECTED]> wrote: > > > > > > > > > I am not entirely sure if this can be > the cause, > > > but > > > > here is something I > > > > > thought might be related: > > > > > The idea is have an index containing > documents > > > where > > > > each document has a > > > > > combination of two words : word1 and > word2 and a > > > score > > > > for these two words. > > > > > The index would be searched first if > the two > > > words > > > > exist, and if not the > > > > > score would be computed on the fly and > then added > > > to > > > > the index. This process > > > > > would be repeated thousands of times > for > > > thousands of > > > > words. > > > > > > > > > > Hence, I have an indexwriter and a > searcher > > > > > -------------------- > > > > > RAMDirectory ramDir = new > > > RAMDirectory(INDEX_DIR); > > > > > IndexWriter ramWriter = new > IndexWriter(ramDir, > > > new > > > > WhitespaceAnalyzer(), > > > > > > true,IndexWriter.MaxFieldLength.UNLIMITED); > > > > > writer = new IndexWriter(INDEX_DIR,new > > > > WhitespaceAnalyzer(),true > > > > > ,IndexWriter.MaxFieldLength.UNLIMITED); > > > > > > > > > > FSDirectory fsdir = > > > > FSDirectory.getDirectory(INDEX_DIR); > > > > > IndexReader ir = > IndexReader.open(fsdir); > > > > > _searcher = new IndexSearcher(ir); > > > > > -------------------- > > > > > > > > > > The indexWriter is closed near the end > of the > > > program > > > > (it's open while > > > > > searching for words combinations ). > > > > > > > > > > When using Luke,, I was able to search > > > successfully > > > > for exact phrases. My > > > > > guess is that the problem I am facing > has > > > something to > > > > do with the > > > > > indexWriter, but I can not pinpoint the > exact > > > cause of > > > > the problem. > > > > > > > > > > > > > > > --- On Sat, 11/1/08, semelak ss > > > > <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > From: semelak ss > > > <[EMAIL PROTECTED]> > > > > > > Subject: Re: Exact Phrase Query > > > > > > To: java-user@lucene.apache.org > > > > > > Date: Saturday, November 1, 2008, > 10:03 AM > > > > > > When using Luke,, searching for > the > > > followings > > > > gives me hits > > > > > > now: > > > > > > "insurer storm" > > > > > > The synatx of the query as parsed > by Luke is > > > : > > > > > > word:"insurer storm" > > > > > > > > > > > > The code I am using is as follows: > > > > > > ---------------------- > > > > > > _searcher = new > IndexSearcher(INDEX_DIR); > > > > > > _parser = new > QueryParser("word", > > > new > > > > > > WhitespaceAnalyzer()); > > > > > > Query q = _parser.parse(query); > > > > > > System.out.println(q.toString()); > // this > > > outputs > > > > -> > > > > > > word:"insurer storm" > > > > > > TopDocs vv= _searcher.search(q, > 1); > > > > > > Hits tmph = _searcher.search(q); > > > > > > --------------------------------- > > > > > > > > > > > > both vv and tmph give no results > (their size > > > is > > > > 0) > > > > > > > > > > > > > > > > > > > > > > > > --- On Fri, 10/31/08, semelak ss > > > > > > <[EMAIL PROTECTED]> > wrote: > > > > > > > > > > > > > From: semelak ss > > > > <[EMAIL PROTECTED]> > > > > > > > Subject: Re: Exact Phrase > Query > > > > > > > To: > java-user@lucene.apache.org > > > > > > > Date: Friday, October 31, > 2008, 9:41 AM > > > > > > > For indexing, I use the > following: > > > > > > > =========== > > > > > > > writer = new > IndexWriter(INDEX_DIR,new > > > > > > > WhitespaceAnalyzer(),true > > > > > > > > ,IndexWriter.MaxFieldLength.UNLIMITED); > > > > > > > Document doc = new > Document(); > > > > > > > String tmpword = > > > this.getProperForm(word1, > > > > word2); > > > > > > > doc.add(new > Field("WORDS", > > > > tmpword, > > > > > > > Field.Store.YES, > > > Field.Index.TOKENIZED)); > > > > > > > doc.add(new > Field("score", > > > > > > Double.toString(score) > > > > > > > , Field.Store.YES, > Field.Index.NO)); > > > > > > > writer.addDocument(adoc); > > > > > > > ============ > > > > > > > > > > > > > > For searching,, I use the > following > > > (query = > > > > > > > "homeowner work" > gives no > > > hits ,, > > > > > > > "homeowner" gives > results): > > > > > > > ============ > > > > > > > _searcher = new > > > IndexSearcher(INDEX_DIR); > > > > > > > _parser = new > > > QueryParser("WORDS", > > > > new > > > > > > > WhitespaceAnalyzer()); > > > > > > > q = _parser.parse(query); > > > > > > > Hits tmph = > _searcher.search(q); > > > > > > > > > > > > > > ============ > > > > > > > > > > > > > > A sample document (contained > in the > > > index) > > > > is the > > > > > > > following: > > > > > > > > > > > > > > filed: value > > > > > > > -----: ----- > > > > > > > WORDS:"homeowners > work" > > > > > > > score: 0.1515417 > > > > > > > > > > > > > > > > > > > > > Also, please note that I > tried using > > > Luke to > > > > browse > > > > > > the > > > > > > > index and the fields seem to > be filled > > > out > > > > with words > > > > > > just > > > > > > > as expected. Searching, > however, with > > > exact > > > > phrases > > > > > > yield no > > > > > > > answer. Searching with single > words > > > gives > > > > hits. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- On Fri, 10/31/08, Erick > Erickson > > > > > > > > <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > > > > > From: Erick Erickson > > > > > > <[EMAIL PROTECTED]> > > > > > > > > Subject: Re: Exact > Phrase Query > > > > > > > > To: > java-user@lucene.apache.org, > > > > > > [EMAIL PROTECTED] > > > > > > > > Date: Friday, October > 31, 2008, > > > 5:57 AM > > > > > > > > You need to give us more > > > information > > > > for > > > > > > meaningful > > > > > > > replies, > > > > > > > > like > > > > > > > > the analyzers you use > when > > > indexing and > > > > > > searching, the > > > > > > > > exact > > > > > > > > query you use, perhaps > the > > > snippets of > > > > the code, > > > > > > etc. > > > > > > > > > > > > > > > > That said, things to > check: > > > > > > > > Get a copy of Luke and > examine > > > your > > > > index. You > > > > > > can > > > > > > > even > > > > > > > > run queries through that > tool and > > > see > > > > what gets > > > > > > sent > > > > > > > to the > > > > > > > > database and what > responses you > > > get > > > > with those > > > > > > > analyzers. > > > > > > > > > > > > > > > > Make sure you're > analyzers at > > > query > > > > and index > > > > > > time > > > > > > > are > > > > > > > > doing > > > > > > > > what you expect. > Query.toString() > > > is > > > > your friend. > > > > > > If > > > > > > > you > > > > > > > > don't > > > > > > > > take the time to > understand > > > analyzers, > > > > you'll > > > > > > > spend > > > > > > > > lots of time > > > > > > > > spinning your wheels. > > > > > > > > > > > > > > > > And you really should > wait more > > > than 9 > > > > minutes > > > > > > before > > > > > > > > pinging > > > > > > > > the list.... > > > > > > > > > > > > > > > > Best > > > > > > > > Erick > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 31, 2008 at > 8:44 AM, > > > > semelak ss > > > > > > > > > <[EMAIL PROTECTED]> > > > wrote: > > > > > > > > > > > > > > > > > I have documents > containing > > > > multiple words > > > > > > in the > > > > > > > the > > > > > > > > field "word" > > > > > > > > > for example, one of > the > > > documents > > > > contain in > > > > > > the > > > > > > > field > > > > > > > > "word" the > > > > > > > > > following: > > > > > > > > > homeowners work > > > > > > > > > > > > > > > > > > When searching for > single > > > words > > > > (i.e. > > > > > > homewoners > > > > > > > ) I > > > > > > > > get hits. > > > > > > > > > > > > > > > > > > However, searching > for the > > > exact > > > > phrase > > > > > > > > "homeowners > work" gives > > > me no > > > > > > > > > hits!! I use the > double > > > quotes > > > > when > > > > > > searching for > > > > > > > > exact phrases. > > > > > > > > > > > > > > > > > > Any idea why ?? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > > > To unsubscribe, > e-mail: > > > > > > > > > > > [EMAIL PROTECTED] > > > > > > > > > For additional > commands, > > > e-mail: > > > > > > > > > [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > To unsubscribe, e-mail: > > > > > > > > [EMAIL PROTECTED] > > > > > > > For additional commands, > e-mail: > > > > > > > > [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > To unsubscribe, e-mail: > > > > > > > [EMAIL PROTECTED] > > > > > > For additional commands, e-mail: > > > > > > [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: > > > > [EMAIL PROTECTED] > > > > > For additional commands, e-mail: > > > > [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: > > > [EMAIL PROTECTED] > > > For additional commands, e-mail: > > > [EMAIL PROTECTED] > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]