Re: Issue while parsing XML files due to control characters, help appreciated.

Doron Cohen Wed, 21 Mar 2007 22:29:47 -0800

Lokeya <[EMAIL PROTECTED]> wrote on 21/03/2007 22:09:06:

>
> Initially I was writing into the Index 7,00,000 times. I chaged the code
to
> now write only 70 times which means I am putting lot of data in an array
> list and add to doc and index at one shot. This is where the improvement
> came from. To be precise IndexWriter is now adding document 70 times Vs.
> 7,00,000 times.


I was just concluding from the provided code;

So now your index should have only 70 documents, right?

(Search "granularity" would be 1/70 of collection size, is that really
useful?)

If now you open the index writer once, add all those 70 documents, call
optimize once, close index writer once, that's the right way.

> If this is not clear please let me know. I have't pasted the latest code
> where I have fixed the lock issue as well. If required I can do that.
>
> Thanks everyone for quick turnaround and it really helped me a lot.
>
>
> Doron Cohen wrote:
> >
> > Lokeya <[EMAIL PROTECTED]> wrote on 18/03/2007 13:19:45:
> >
> >>
> >> Yep I did that, and now my code looks as follows.
> >> The time taken for indexing one file is now
> >> => Elapsed Time in Minutes :: 0.3531
> >> which is really great
> >
> > I am jumping in late so appologies if I am missing something.
> > However I don't understand where this improvement came
> > from. (more below)
> >
> >>
> >> //Create an Index under LUCENE
> >> IndexWriter writer = new IndexWriter("./LUCENE", new
> >>                                      StopStemmingAnalyzer(),
> >>                                      false);
> >> Document doc = new Document();
> >>
> >> //Get Array List Elements  and add them as fileds to doc
> >> for(int k=0; k < alist_Title.size(); k++)             {
> >>   doc.add(new Field("Title",
> >>                     alist_Title.get(k).toString(),
> >>                     Field.Store.YES,
> >>                     Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> for(int k=0; k < alist_Descr.size(); k++)             {
> >>   doc.add(new Field("Description",
> >>                     alist_Descr.get(k).toString(),
> >>                     Field.Store.YES,
> >>                     Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> //Add the document created out of those fields to
> >> // the IndexWriter which will create and index
> >> writer.addDocument(doc);
> >> writer.optimize();
> >> writer.close();
> >>
> >> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
> >> System.out.println("Elapsed Time for  "+
> >>                    dirName +" :: " + elapsedTimeMillis);
> >> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
> >> System.out.println("Elapsed Time in Minutes ::  "+
> >>                    elapsedTimeMin);
> >
> > the code listed above is, still, for each indexed document,
> > doing this:
> >
> >   1. open index writer
> >   2. create a document (and add fields to it)
> >   3. add the doc to the index
> >   4. optimize the index
> >   5. close the index writer
> >
> > As already mentioned here this is very non optimal.
> > So I don't understand where the improvement came from.
> >
> > It should be done like this:
> >
> >   1. open/create the index
> >   2. for each valid doc input {
> >        a. create a document (add fields)
> >        b. add the doc to the index
> >      }
> >   3. optimize the index (optionally)
> >   4. close the index
> >
> > It would probably also help both coding and discussion to capture
> > some logic in methods, e.g.
> >
> > IndexWriter createIndex();
> > IndexWriter openIndex();
> > void closeIndex();
> > Document createDocument(String oai_citeseer_fileName)
> >
> >> but after processing 4 dumpfiles(which means
> >> 40,000 small xml's), I get :
> >>
> >> caught a class java.io.IOException
> >>  40114  with message: Lock obtain timed out:
> >> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
> >>
> >>
> >> This is the new issue now.What could be the reason ?. I am surprised
> > because
> >> I am only writing to Index under ./LUCENE/ evrytime and not doing
> > anything
> >> with the index(ofcrs to avoid such synchronization issues !)
> >
> > If the code listed above is getting this exception, and this runs on
> > Windows, then perhaps this is related to issue 665 - intensive IO
> > activity, write lock acquired and released for each and every added
> > document. If so, this too would be avoided by opening the index just
once.
> >
> > Last comment - I did not understand if a single index is used for all
> > 70 metadata files, or each one has its own index, totalling 70 indexes.
> > I would think it should be only one index for all, and then you better
> > avoid calling optimize 70 times - that would waste time too. Better
> > organize the code like this:
> >
> > 1. create/open index
> > 2. for each meta data file {
> >    a. for each "valid element" in meta data file {
> >         a.1 create document
> >         a.2 add doc to index
> >       }
> >    }
> > 3. optimize
> > 4. close index
> >
> > Regards,
> > Doron


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Issue while parsing XML files due to control characters, help appreciated.

Reply via email to