Re: Indexing and Duplication

Kelvin Tan Tue, 19 Mar 2002 20:45:54 -0800

Ype,

----- Original Message -----
From: "Ype Kingma" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, March 19, 2002 3:57 AM
Subject: Re: Indexing and Duplication



> Kelvin,
>
> >Ype,
> >
> >That would be a good solution to my problem if only I weren't performing
> >multi-threaded indexing. :(
> >The Reader obtained by any one thread may not be an accurate reflection of
> >the actual state of the index, just the state when the Reader was
> >instantiated.
>
> Why share index readers between threads?
> For searching this is fine, of course, but importing can be done
> differently.

Even if each thread has its own reader, after it has obtained the reader,
another thread may have written to the index...

>
> You might consider changing the functionality of your threads a bit:
> one or more threads for indexing and one or
> more other threads for extracting the lucene documents.
>
> You could eg. use a bounded queue of batches of lucene docs as input to
> the indexing threads. The extracting thread(s) can then put
> lucene docs in the next batch and put the batch on the queue.
>
> The only exclusive serial part would then be opening the index reader,
> deleting a batch of old docs, and closing the reader. Adding a batch of
> new docs can be done by eg. two threads while not using the reader.
>
> For incremental imports an index reader is also needed to check whether a
> document has been imported or not. Such checks might be done up front
> during a single run of the import program.
>
> In this way the index readers are used for rather short periods
> to do some batch of work, and there is no need to share them
> between threads.

Hmmmm... interesting. It's a good suggestion, and I'll need to think a bit
more about it.
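
A minimal sketch of the queue-of-batches arrangement described above, using
only the JDK. The Doc class and the HashMap standing in for the index are
assumptions for illustration; in real code the delete step would go through
the index reader and the add step through the index writer, as discussed in
this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BatchPipelineSketch {
    /** Hypothetical stand-in for a Lucene Document: an identifier plus a body. */
    static final class Doc {
        final String id;
        final String body;
        Doc(String id, String body) { this.id = id; this.body = body; }
    }

    /** Empty batch used as an end-of-input marker, one per extracting thread. */
    private static final List<Doc> END = new ArrayList<>();

    static Map<String, String> run(int extractors, int docsPerExtractor, int batchSize) {
        // Bounded: put() blocks once 4 batches are waiting, which caps memory use.
        BlockingQueue<List<Doc>> queue = new ArrayBlockingQueue<>(4);
        // Stand-in for the index; only the indexing thread ever touches it.
        Map<String, String> index = new HashMap<>();

        List<Thread> threads = new ArrayList<>();
        for (int e = 0; e < extractors; e++) {
            final int extractor = e;
            threads.add(new Thread(() -> {
                try {
                    List<Doc> batch = new ArrayList<>();
                    for (int i = 0; i < docsPerExtractor; i++) {
                        // Every extractor emits the same ids, so a later batch replaces an earlier doc.
                        batch.add(new Doc("id-" + i, "from extractor " + extractor));
                        if (batch.size() == batchSize) {
                            queue.put(batch);
                            batch = new ArrayList<>();
                        }
                    }
                    if (!batch.isEmpty()) queue.put(batch);
                    queue.put(END);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        threads.add(new Thread(() -> {
            try {
                int finished = 0;
                while (finished < extractors) {
                    List<Doc> batch = queue.take();
                    if (batch == END) { finished++; continue; }
                    // Serial step: drop old versions of the batch, then add the new ones.
                    for (Doc d : batch) index.remove(d.id);
                    for (Doc d : batch) index.put(d.id, d.body);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }));

        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return index;
    }
}
```

Calling BatchPipelineSketch.run(3, 10, 4) leaves one entry per identifier in
the stand-in index, however the three extracting threads interleave. The
sketch uses a single indexing thread for simplicity; the suggestion above
also allows several threads to add while no reader is open.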

>
> >My current solution is that I hold a collection of documents, keyed by
> >my object identifier, and only write them to the writer after indexing is
>
> What's the difference between 'writing to the writer' and 'indexing'?
>

Sorry. I should've been more explicit. Indexing means the creation of
Document objects and adding fields to them, not adding them to the writer
yet.

> >done. I chose it because it saved me having to write, then delete a
> >document, etc. However, it's not so ideal because the memory consumed by
> >such an approach may be prohibitive.
>
> >What do you think?
>
> Memory usage can be limited by using a bounded queue. A single batch
> of docs on the queue can be limited by eg. the total size of the docs.
>
> I assumed you need to delete old docs while adding new ones. In case
> you don't need to delete old docs, you might not need an
> index reader at all.
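
The size limit per batch mentioned above could look roughly like this. It
is a sketch only: documents are represented as plain strings and "size" is
taken to be the character count, both of which are assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SizeBoundedBatcher {
    /** Splits docs into consecutive batches whose total length stays within maxChars. */
    static List<List<String>> batches(List<String> docs, int maxChars) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentSize = 0;
        for (String doc : docs) {
            // Close the current batch if adding this doc would exceed the limit.
            if (!current.isEmpty() && currentSize + doc.length() > maxChars) {
                out.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            // A doc larger than the limit still gets a batch of its own.
            current.add(doc);
            currentSize += doc.length();
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }
}
```

Combined with a bounded queue, the memory held at any moment is at most
roughly the queue capacity times maxChars, plus any single oversized
document.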

I know. My approach wasn't working with batches at all. Each indexing thread
was just adding documents to a hashtable. The main thread would then iterate
through the hashtable and add them to the writer.
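
For reference, a stripped-down sketch of that hashtable approach, in which
fields for the same identifier are merged into one pending entry. The
nested map is a hypothetical stand-in for a Lucene Document, and the field
names are made up for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MergeByIdSketch {
    // id -> (field name -> value); stand-in for the pending Document objects.
    private final Map<String, Map<String, String>> pending = new ConcurrentHashMap<>();

    /** Adds one field to the pending entry for id, creating the entry on first use. */
    void addField(String id, String name, String value) {
        // computeIfAbsent is atomic, so two threads never create two entries for one id.
        pending.computeIfAbsent(id, k -> new ConcurrentHashMap<>()).put(name, value);
    }

    /** Two indexing threads contribute different fields for the same identifiers. */
    static Map<String, Map<String, String>> demo(int ids) {
        MergeByIdSketch m = new MergeByIdSketch();
        Thread titles = new Thread(() -> {
            for (int i = 0; i < ids; i++) m.addField("id-" + i, "title", "Title " + i);
        });
        Thread bodies = new Thread(() -> {
            for (int i = 0; i < ids; i++) m.addField("id-" + i, "body", "Body " + i);
        });
        titles.start();
        bodies.start();
        try {
            titles.join();
            bodies.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // The main thread would now iterate over the entries and add each one to the writer.
        return m.pending;
    }
}
```

Everything stays in memory until the final pass over the map, which is
where the memory concern with this approach comes from.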

This seems like a silly question, but will keeping hold of Document objects
cause me to run into "Too many files open" problems? If each document object
has a Field.Text which contains a Reader, and the Reader isn't closed till
the document is indexed, would this be an issue? Is the memory consumed by
Document objects directly proportional to the size of the object the Reader
reads?
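
To make the open-files part of the question concrete: a file-backed Reader
holds an operating-system file descriptor from the moment it is opened
until it is closed, so many pending documents with open readers means many
descriptors held at once. One possible way around that, sketched under the
assumption that the readers are file-backed, is to keep only the path while
a document waits and to open the Reader just before it is indexed. The
Pending class and the sink callback are hypothetical; the sink stands in
for "build the Document around this Reader and add it to the writer".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.BiConsumer;

final class LazyReaderSketch {
    /** What is held in memory while a document waits: an id and a path, no open file. */
    static final class Pending {
        final String id;
        final Path source;
        Pending(String id, Path source) { this.id = id; this.source = source; }
    }

    /**
     * Opens each file only for the duration of the sink call, so this loop
     * holds at most one file descriptor at a time.
     */
    static void indexAll(List<Pending> pending, BiConsumer<String, Reader> sink) {
        for (Pending p : pending) {
            try (Reader r = Files.newBufferedReader(p.source, StandardCharsets.UTF_8)) {
                sink.accept(p.id, r);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```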

Thanks.

Regards,
Kelvin

>
> Ype
>
>
> >Regards,
> >Kelvin
> >----- Original Message -----
> >From: "Ype Kingma" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Sunday, March 17, 2002 6:15 AM
> >Subject: Re: Indexing and Duplication
> >
> >
> > > Kelvin,
> > >
> > > >I've got a little problem with indexing that I'd like to throw to
> > > >everyone.
> > > >
> > > >My objects have a unique identifier. When indexing, before I create a
> > > >new document, I'd like to check if a document has already been created
> > > >with this identifier. If so, I'd like to retrieve the document
> > > >corresponding to this identifier, and add the fields I currently have
> > > >to this document's fields and write it. If no such document exists,
> > > >then I'd create a new document, add my fields and write it. What this
> > > >really does, I guess, is ensure that a document object represents a
> > > >body of information which really belongs together, eliminating
> > > >duplication.
> > > >
> > > >With the current API, writing and retrieving is performed by the
> > > >IndexWriter and IndexReader respectively. This effectively means that
> > > >in order to do the above, I'd have to close the writer, create a new
> > > >instance of the index reader after each document has been added in
> > > >order for the reader to have the most updated version of the index (!).
> > > >
> > > >Does anyone have any suggestions how I might approach this?
> > >
> > > Avoid closing and opening too much by batching n docs at a time
> > > on the index reader and then do the things needed for the n docs on the
> > > index writer. You might have to delete docs on the reader, too.
> > >
> > > The reasons for using the reader for reading/searching/deleting
> > > and using the writer for adding have been discussed some time ago on
> > > this list. I can't provide a pointer into the list archives as I don't
> > > recall the original subject header, sorry.
> > >
> > > Regards,
> > > Ype
> > >
> > > --
> > > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> > >
> >
> >


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
