Why not ask Lucene to do the sort on your time field, instead of pulling millions of docIDs to the client and sorting them there? You could even do index-time sorting by the time field if you want, which makes early termination possible (faster sorted searches).
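For example, a sorted search plus IndexSearcher.searchAfter lets the client walk the hits in time order one page at a time, without ever holding all the docIDs. A rough, untested sketch against the 4.6 APIs, assuming the time field is a numeric long named "time" and an IndexSearcher is already open (the class name, page size, and MatchAllDocsQuery are just placeholders):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class TimeSortedPager {

  // Walk all hits in increasing "time" order, one page at a time,
  // letting Lucene do the sort instead of pulling every docID to the client.
  static void walkByTime(IndexSearcher searcher, int pageSize) throws IOException {
    Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
    Query query = new MatchAllDocsQuery();   // placeholder for the real query

    ScoreDoc after = null;
    while (true) {
      TopDocs page = (after == null)
          ? searcher.search(query, pageSize, byTime)
          : searcher.searchAfter(after, query, pageSize, byTime);
      if (page.scoreDocs.length == 0) {
        break;
      }
      for (ScoreDoc hit : page.scoreDocs) {
        // hit.doc is the docID; load stored fields only for the handful of
        // documents you actually need, e.g. searcher.doc(hit.doc)
      }
      // the last hit of this page becomes the "after" anchor for the next page
      after = page.scoreDocs[page.scoreDocs.length - 1];
    }
  }
}

Each request then only materializes one page of hits, so IndexReader.document is never asked for millions of documents at once.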
But if, even with Lucene doing the sort, you still need to load millions of documents per search request, you are in trouble: you need to re-formulate your use case somehow to take advantage of what Lucene is good at (getting top results for a search). Maybe you can use faceting to do whatever aggregation you are currently doing after retrieving those millions of documents. Maybe you could write a custom collector that uses doc values to do your own custom aggregation (a rough sketch is below, after the quoted thread).

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar <ta...@sumologic.com> wrote:

> Thanks for the reply, Michael! In my application, I need to get millions of
> documents per search.
>
> The use case is the following: return documents in increasing order of the
> time field. The client (caller) can't hold more than a few thousand docs at
> a time, so it gets all docIds and the corresponding time field for each doc,
> sorts them on time, and fetches n docs at a time. To support this use case,
> I am:
>
> - getting all docIds first,
> - sorting the docIds on the time field,
> - querying n docIds at a time from the client, which makes an
>   indexReader.document(docId) call for each of the n docs at the server,
>   combines these docs, and returns them.
>
> indexReader.document(docId) is creating a bottleneck. What alternatives do
> you suggest?
>
> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Are you maybe trying to load too many documents for each search request?
>>
>> The IR.document API is designed to be used to load just a few hits, like
>> a page worth or ~10 documents, per search.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar <ta...@sumologic.com> wrote:
>>
>>> I am running Lucene 4.6.1. I am trying to get the documents corresponding
>>> to docIds. All threads get stuck (not stuck exactly, but they spend a LOT
>>> of time) at:
>>>
>>> java.lang.Thread.State: RUNNABLE
>>>     at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>     at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>     at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>     at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>
>>> There is no disk throttling. What could cause this?
>>>
>>> Thanks,
>>> Tarun
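Here is the custom-collector sketch mentioned above. It is untested and only illustrative: the field name "time", the class name, and the sum/count aggregation are placeholders for whatever you actually compute. Because it reads the value from doc values inside collect(), it never loads stored fields at all:

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Aggregates a numeric doc values field ("time") while collecting hits,
// so IndexReader.document is never called.
public class TimeAggregatingCollector extends Collector {

  private NumericDocValues times;
  public long sum;     // placeholder aggregation
  public long count;

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed for this aggregation
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // per-segment doc values for the "time" field (may be null if absent)
    times = context.reader().getNumericDocValues("time");
  }

  @Override
  public void collect(int doc) {
    if (times != null) {
      sum += times.get(doc);
      count++;
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;   // order does not matter for a simple aggregation
  }
}

You would run it with searcher.search(query, new TimeAggregatingCollector()) and read sum and count when the search returns.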