See below...
On Wed, Mar 26, 2008 at 11:29 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> Thank you for the reply. What I did not mention before is that for
> iteration we don't care about scoring, so that's not an issue at all.
> Creating a Filter with a BitSet seems a much better idea than keeping
> a HitIterator in memory. Am I right that in such a case, with
> MatchAllDocsQuery, memory usage would be around
> (NUM_OF_DOCS_IN_INDEX / 8) bytes?
Yes, it's very close to that number as far as I can tell.
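For concreteness, that back-of-the-envelope number is easy to check in plain Java (the 10-million-document index size below is just an assumed example; java.util.BitSet stands in for the filter's bits, no Lucene involved):

```java
import java.util.BitSet;

public class FilterMemoryEstimate {
    public static void main(String[] args) {
        int numDocs = 10000000; // assumed index size: one bit per doc
        BitSet bits = new BitSet(numDocs);
        // size() reports the bits actually allocated (rounded up to a
        // multiple of 64); dividing by 8 gives the payload in bytes.
        long approxBytes = bits.size() / 8;
        System.out.println(approxBytes); // 1250000, i.e. ~1.2 MB
    }
}
```

So even for a 10M-doc index the filter costs on the order of a megabyte, which is usually a fine trade against re-running the query.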
>
> We haven't checked it yet, but do you think the time for accessing
> documents (reader.doc(i)) is large enough to make iterating in a
> HitCollector (without accessing any objects) over documents
> already returned almost unnoticeable?
I don't quite understand this. Accessing a single document
in a HitCollector via reader.doc(i) is probably unnoticeable. Accessing
every document in the HitCollector is bad, very bad. If you're doing
something like spinning through the HitCollector N times, then
returning the Nth document and breaking, I don't know. You'll just have
to experiment, I'm afraid.
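If you do go the Filter/BitSet route from my earlier mail, the chunked iteration itself is cheap: walk the set bits, hand out a chunk, and clear each bit as it goes out (the "zero out" bookkeeping). A minimal sketch in plain Java, with java.util.BitSet standing in for the filter's bits (no Lucene APIs here; names like nextChunk are mine):

```java
import java.util.Arrays;
import java.util.BitSet;

public class ChunkedIteration {
    // Return up to chunkSize doc IDs from the remaining matches,
    // clearing each bit as it is handed out so the next call resumes
    // where this one stopped.
    static int[] nextChunk(BitSet remaining, int chunkSize) {
        int[] chunk = new int[chunkSize];
        int n = 0;
        for (int doc = remaining.nextSetBit(0);
             doc >= 0 && n < chunkSize;
             doc = remaining.nextSetBit(doc + 1)) {
            chunk[n++] = doc;
            remaining.clear(doc);
        }
        return Arrays.copyOf(chunk, n);
    }

    public static void main(String[] args) {
        BitSet matched = new BitSet();
        for (int i = 0; i < 10; i++) matched.set(i * 3); // docs 0,3,...,27
        System.out.println(Arrays.toString(nextChunk(matched, 4))); // [0, 3, 6, 9]
        System.out.println(Arrays.toString(nextChunk(matched, 4))); // [12, 15, 18, 21]
    }
}
```

Each chunk is a linear scan from the first remaining bit, so the total work over all chunks is linear in the index size rather than quadratic.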
> And another question - if I don't care about scoring, is there a way to
> make Lucene not spend time on calculating scores (I don't know if
> that time matters)? HitCollector receives a doc and its score (as far as
> I remember, the difference here is that it is not normalized to a value
> between 0 and 1). Is there a way (and does it make sense) to make
> scoring faster in such a case?
I *think* ConstantScoreQuery is your friend here, although I haven't
used it personally.
>
> And to make things clear - am I right that if I operate on the same
> searcher over requests for doc chunks, I see neither additions
> nor deletions which could happen meanwhile? So if I wanted to
> iterate over a point-in-time view, keeping the same searcher open would do.
> Thanks and regards,
>
You must close and re-open a reader to see any changes since the last
time you opened that reader. So I think your assumption is OK.
Have fun,
Erick
> wojtek
>
>
> 2008/3/26, Erick Erickson <[EMAIL PROTECTED]>:
> > Why not keep a Filter in memory? It consists of a single bit per
> > document, and the ordinal position of that bit is the Lucene doc ID.
> > You could create this reasonably quickly for the *first* query that
> > came in via HitCollector.
> >
> > Then each time you wanted another chunk, use the filter to know which
> > docs to return. You could either, say, extend the Filter class and add
> > some bookkeeping, or just zero out each bit that you returned to the
> > user.
> >
> > NOTE: you don't get relevance this way, but for the case of returning
> > all docs, do you really want it?
> >
> > About updating the index: remember that there is no "update in place".
> > So you'll only have to check whether any document in the filter has
> > been deleted when you are returning. Then you'd have to do something
> > about looking for any new additions as you returned the last document
> > in the set...
> > But remember that until you close/reopen the searcher, you won't see
> > changes anyway.....
> >
> > But you may not need to do any of this. If, each time you return a
> > chunk, you're using a Hits object, then this is the first thing I'd
> > change. A Hits object re-executes the query every 100th element you
> > look at. So, assume you have something like
> >
> > (bad pseudo code here)
> > Hits hits = searcher.search(query);
> > // spin past the docs already returned in earlier chunks
> > for (int idx = 0; idx < firstDocInChunk; ++idx) {
> >     hits.doc(idx);
> > }
> > // now assemble the current chunk for return
> > for (int idx = firstDocInChunk; idx < firstDocInChunk + chunkSize; ++idx) {
> >     Document doc = hits.doc(idx);
> > }
> > and if the first doc you want to return is number 1,000, you'll
> > actually be re-executing the query 10 times. Which probably accounts
> > for your quadratic time.
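A rough cost model makes that quadratic growth concrete. This is plain Java, not Lucene code: it just assumes the every-100-docs re-execution behavior described above and a chunked walk over the index (the chunk size and doc counts are made up):

```java
public class HitsCost {
    // Reaching position p through a Hits object costs roughly p/100
    // query re-executions. Summing that over every chunk boundary up
    // to totalDocs gives the total re-executions for a full walk.
    static long totalReExecutions(int totalDocs, int chunkSize) {
        long total = 0;
        for (int start = 0; start < totalDocs; start += chunkSize) {
            total += (start + chunkSize) / 100; // cost to reach chunk end
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalReExecutions(1000, 100));  // 55
        System.out.println(totalReExecutions(10000, 100)); // 5050
    }
}
```

Ten times the documents costs roughly a hundred times the re-executions (55 vs. 5050), which is the quadratic behavior reported in the original question.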
> >
> > So I'd try just using a new HitCollector each time and see if that
> > solves your problems before getting fancy. There really shouldn't be
> > any noticeable difference between the first and last request unless
> > you're doing something like accessing the documents before you get to
> > the first one you expect to return. And a TopDocs should even
> > preserve scoring.......
> >
> > Best
> >
> > Erick
> >
> >
> >
> >
> > On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> >
> > > Hi all,
> > >
> > > our problem is to choose the best (fastest) way to iterate over a
> > > huge set of documents (the basic and most important case is to
> > > iterate over all documents in the index). A slow process accesses
> > > the documents, and currently this is done by repeating a query (for
> > > instance MatchAllDocsQuery). It processes the first N docs, then
> > > repeats the query and processes the next N docs, and so on.
> > > Repeating the query means quadratic time! So we are thinking about
> > > changing the way docs are accessed.
> > > In the case of a generic query, the only way we see to speed this up
> > > is to keep the HitCollector in memory between requests for doc
> > > chunks. Isn't that approach too memory-consuming?
> > > In the case of iterating over all documents, I was wondering if
> > > there is a way to determine the set of index ids over which we could
> > > iterate (and of course control index changes - if the index changes
> > > between requests, we should probably invalidate the 'iterating
> > > session').
> > > What is the best solution for this problem?
> > > Thanks and regards,
> > >
> > > wojtek
> > >
> >
>