Lokeya <[EMAIL PROTECTED]> wrote on 21/03/2007 22:09:06:
>
> Initially I was writing into the Index 7,00,000 times. I chaged the code
to
> now write only 70 times which means I am putting lot of data in an array
> list and add to doc and index at one shot. This is where the improvement
> came from. To be precise IndexWriter is now adding document 70 times Vs.
> 7,00,000 times.
I was just concluding from the provided code;
So now your index should have only 70 documents, right?
(Search "granularity" would be 1/70 of collection size, is that really
useful?)
If now you open the index writer once, add all those 70 documents, call
optimize once, close index writer once, that's the right way.
> If this is not clear please let me know. I have't pasted the latest code
> where I have fixed the lock issue as well. If required I can do that.
>
> Thanks everyone for quick turnaround and it really helped me a lot.
>
>
> Doron Cohen wrote:
> >
> > Lokeya <[EMAIL PROTECTED]> wrote on 18/03/2007 13:19:45:
> >
> >>
> >> Yep I did that, and now my code looks as follows.
> >> The time taken for indexing one file is now
> >> => Elapsed Time in Minutes :: 0.3531
> >> which is really great
> >
> > I am jumping in late so appologies if I am missing something.
> > However I don't understand where this improvement came
> > from. (more below)
> >
> >>
> >> //Create an Index under LUCENE
> >> IndexWriter writer = new IndexWriter("./LUCENE", new
> >> StopStemmingAnalyzer(),
> >> false);
> >> Document doc = new Document();
> >>
> >> //Get Array List Elements and add them as fileds to doc
> >> for(int k=0; k < alist_Title.size(); k++) {
> >> doc.add(new Field("Title",
> >> alist_Title.get(k).toString(),
> >> Field.Store.YES,
> >> Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> for(int k=0; k < alist_Descr.size(); k++) {
> >> doc.add(new Field("Description",
> >> alist_Descr.get(k).toString(),
> >> Field.Store.YES,
> >> Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> //Add the document created out of those fields to
> >> // the IndexWriter which will create and index
> >> writer.addDocument(doc);
> >> writer.optimize();
> >> writer.close();
> >>
> >> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
> >> System.out.println("Elapsed Time for "+
> >> dirName +" :: " + elapsedTimeMillis);
> >> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
> >> System.out.println("Elapsed Time in Minutes :: "+
> >> elapsedTimeMin);
> >
> > the code listed above is, still, for each indexed document,
> > doing this:
> >
> > 1. open index writer
> > 2. create a document (and add fields to it)
> > 3. add the doc to the index
> > 4. optimize the index
> > 5. close the index writer
> >
> > As already mentioned here this is very non optimal.
> > So I don't understand where the improvement came from.
> >
> > It should be done like this:
> >
> > 1. open/create the index
> > 2. for each valid doc input {
> > a. create a document (add fields)
> > b. add the doc to the index
> > }
> > 3. optimize the index (optionally)
> > 4. close the index
> >
> > It would probably also help both coding and discussion to capture
> > some logic in methods, e.g.
> >
> > IndexWriter createIndex();
> > IndexWriter openIndex();
> > void closeIndex();
> > Document createDocument(String oai_citeseer_fileName)
> >
> >> but after processing 4 dumpfiles(which means
> >> 40,000 small xml's), I get :
> >>
> >> caught a class java.io.IOException
> >> 40114 with message: Lock obtain timed out:
> >> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
> >>
> >>
> >> This is the new issue now.What could be the reason ?. I am surprised
> > because
> >> I am only writing to Index under ./LUCENE/ evrytime and not doing
> > anything
> >> with the index(ofcrs to avoid such synchronization issues !)
> >
> > If the code listed above is getting this exception, and this runs on
> > Windows, then perhaps this is related to issue 665 - intensive IO
> > activity, write lock acquired and released for each and every added
> > document. If so, this too would be avoided by opening the index just
once.
> >
> > Last comment - I did not understand if a single index is used for all
> > 70 metadata files, or each one has its own index, totalling 70 indexes.
> > I would think it should be only one index for all, and then you better
> > avoid calling optimize 70 times - that would waste time too. Better
> > organize the code like this:
> >
> > 1. create/open index
> > 2. for each meta data file {
> > a. for each "valid element" in meta data file {
> > a.1 create document
> > a.2 add doc to index
> > }
> > }
> > 3. optimize
> > 4. close index
> >
> > Regards,
> > Doron
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]