See below...
On Wed, Mar 26, 2008 at 11:29 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> Thank you for the reply. What I did not mention before is that for
> iteration we don't care about scoring, so that's not an issue at all.
> Creating a Filter with a BitSet seems a much better idea than keeping
> a HitIterator in memory. Am I right that in such a case, with
> MatchAllDocsQuery, memory usage would be around
> (NUM_OF_DOCS_IN_INDEX / 8) bytes?
Yes, it's very close to that number as far as I can tell.
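For concreteness, that back-of-the-envelope number is easy to check in plain Java (the 10-million-document index size below is just an assumed example; java.util.BitSet stands in for the filter's bits, no Lucene involved):

```java
import java.util.BitSet;

public class FilterMemoryEstimate {
    public static void main(String[] args) {
        int numDocs = 10000000; // assumed index size: one bit per doc
        BitSet bits = new BitSet(numDocs);
        // size() reports the bits actually allocated (rounded up to a
        // multiple of 64); dividing by 8 gives the payload in bytes.
        long approxBytes = bits.size() / 8;
        System.out.println(approxBytes); // 1250000, i.e. ~1.2 MB
    }
}
```

So even for a 10M-doc index the filter costs on the order of a megabyte, which is usually a fine trade against re-running the query.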
>
> We haven't checked it yet, but do you think the time for accessing
> documents (reader.doc(i)) is large enough to make iterating in a
> HitCollector (without accessing any objects) over documents
> already returned almost unnoticeable?
I don't quite understand this. Accessing a single document
in a HitCollector via reader.doc(i) is probably unnoticeable. Accessing
every document in the HitCollector is bad, very bad. If you're doing
something like spinning through the HitCollector N times, then
returning the Nth document and breaking, I don't know. You'll just have
to experiment, I'm afraid.
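If you do go the Filter/BitSet route from my earlier mail, the chunked iteration itself is cheap: walk the set bits, hand out a chunk, and clear each bit as it goes out (the "zero out" bookkeeping). A minimal sketch in plain Java, with java.util.BitSet standing in for the filter's bits (no Lucene APIs here; names like nextChunk are mine):

```java
import java.util.Arrays;
import java.util.BitSet;

public class ChunkedIteration {
    // Return up to chunkSize doc IDs from the remaining matches,
    // clearing each bit as it is handed out so the next call resumes
    // where this one stopped.
    static int[] nextChunk(BitSet remaining, int chunkSize) {
        int[] chunk = new int[chunkSize];
        int n = 0;
        for (int doc = remaining.nextSetBit(0);
             doc >= 0 && n < chunkSize;
             doc = remaining.nextSetBit(doc + 1)) {
            chunk[n++] = doc;
            remaining.clear(doc);
        }
        return Arrays.copyOf(chunk, n);
    }

    public static void main(String[] args) {
        BitSet matched = new BitSet();
        for (int i = 0; i < 10; i++) matched.set(i * 3); // docs 0,3,...,27
        System.out.println(Arrays.toString(nextChunk(matched, 4))); // [0, 3, 6, 9]
        System.out.println(Arrays.toString(nextChunk(matched, 4))); // [12, 15, 18, 21]
    }
}
```

Each chunk is a linear scan from the first remaining bit, so the total work over all chunks is linear in the index size rather than quadratic.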
> And another question - if I don't care about scoring, is there a way to
> make Lucene not spend time on calculating scores (I don't know if
> that time matters)? HitCollector receives a doc and its score (as far as
> I remember, the difference here is that it is not normalized to a value
> between 0 and 1). Is there a way (and does it make sense) to make
> scoring faster in such a case?
I *think* ConstantScoreQuery is your friend here, although I haven't
used it personally.
>
> And to make things clear - am I right that if I operate on the same
> searcher over requests for doc chunks, I see neither additions
> nor deletions which could happen meanwhile? So if I wanted to
> iterate over a point-in-time view, keeping the same searcher open would do.
> Thanks and regards,
>
You must close and re-open a reader to see any changes since the last
time you opened that reader. So I think your assumption is OK.
Have fun,
Erick
> wojtek
>
>
> 2008/3/26, Erick Erickson <[EMAIL PROTECTED]>:
> > Why not keep a Filter in memory? It consists of a single bit per
> > document, and the ordinal position of that bit is the Lucene doc ID.
> > You could create this reasonably quickly for the *first* query that
> > came in via HitCollector.
> >
> > Then each time you wanted another chunk, use the filter to know which
> > docs to return. You could either, say, extend the Filter class and add
> > some bookkeeping, or just zero out each bit that you returned to the
> > user.
> >
> > NOTE: you don't get relevance this way, but for the case of returning
> > all docs, do you really want it?
> >
> > About updating the index: remember that there is no "update in place".
> > So you'll only have to check whether any document in the filter has
> > been deleted when you are returning. Then you'd have to do something
> > about looking for any new additions as you returned the last document
> > in the set...
> > But remember that until you close/reopen the searcher, you won't see
> > changes anyway.....
> >
> > But you may not need to do any of this. If, each time you return a
> > chunk, you're using a Hits object, then this is the first thing I'd
> > change. A Hits object re-executes the query every 100th element you
> > look at. So, assume you have something like
> >
> > (bad pseudo code here)
> > Hits hits = searcher.search(query);
> > // spin past the docs already returned in earlier chunks
> > for (int idx = 0; idx < firstDocInChunk; ++idx) {
> >     hits.doc(idx);
> > }
> > // now assemble the current chunk for return
> > for (int idx = firstDocInChunk; idx < firstDocInChunk + chunkSize; ++idx) {
> >     Document doc = hits.doc(idx);
> > }
> > and if the first doc you want to return is number 1,000, you'll
> > actually be re-executing the query 10 times. Which probably accounts
> > for your quadratic time.
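A rough cost model makes that quadratic growth concrete. This is plain Java, not Lucene code: it just assumes the every-100-docs re-execution behavior described above and a chunked walk over the index (the chunk size and doc counts are made up):

```java
public class HitsCost {
    // Reaching position p through a Hits object costs roughly p/100
    // query re-executions. Summing that over every chunk boundary up
    // to totalDocs gives the total re-executions for a full walk.
    static long totalReExecutions(int totalDocs, int chunkSize) {
        long total = 0;
        for (int start = 0; start < totalDocs; start += chunkSize) {
            total += (start + chunkSize) / 100; // cost to reach chunk end
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalReExecutions(1000, 100));  // 55
        System.out.println(totalReExecutions(10000, 100)); // 5050
    }
}
```

Ten times the documents costs roughly a hundred times the re-executions (55 vs. 5050), which is the quadratic behavior reported in the original question.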
> >
> > So I'd try just using a new HitCollector each time and see if that
> > solves your problems before getting fancy. There really shouldn't be
> > any noticeable difference between the first and last request unless
> > you're doing something like accessing the documents before you get to
> > the first one you expect to return. And a TopDocs should even
> > preserve scoring.......
> >
> > Best
> >
> > Erick
> >
> >
> >
> >
> > On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[EMAIL PROTECTED]> wrote:
> >
> > > Hi all,
> > >
> > > our problem is to choose the best (fastest) way to iterate over a
> > > huge set of documents (the basic and most important case is to
> > > iterate over all documents in the index). A slow process accesses
> > > the documents, and currently this is done by repeating a query (for
> > > instance MatchAllDocsQuery). It processes the first N docs, then
> > > repeats the query and processes the next N docs, and so on.
> > > Repeating the query means quadratic time! So we are thinking about
> > > changing the way docs are accessed.
> > > In the case of a generic query, the only way we see to speed this up
> > > is to keep the HitCollector in memory between requests for doc
> > > chunks. Isn't that approach too memory-consuming?
> > > In the case of iterating over all documents, I was wondering if
> > > there is a way to determine the set of index ids over which we could
> > > iterate (and of course control index changes - if the index changes
> > > between requests, we should probably invalidate the 'iterating
> > > session').
> > > What is the best solution for this problem?
> > > Thanks and regards,
> > >
> > > wojtek
> > >
> >
>