Somehow you need to get the sorting server-side ... that's really the only way to do your use case efficiently.
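For example, something like this untested sketch (it assumes your time field was indexed as a sortable long field named "time", and searcher/query/pageSize are placeholders) pages through hits in time order, so the client never holds more than one page at a time:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    // Untested sketch: let Lucene sort by "time" and page with searchAfter,
    // so the client never holds more than pageSize hits at once.
    void pageInTimeOrder(IndexSearcher searcher, Query query, int pageSize) throws IOException {
        Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
        ScoreDoc after = null;
        while (true) {
            TopDocs page = (after == null)
                ? searcher.search(query, pageSize, byTime)
                : searcher.searchAfter(after, query, pageSize, byTime);
            if (page.scoreDocs.length == 0) {
                break;  // no more hits
            }
            for (ScoreDoc sd : page.scoreDocs) {
                // load/process searcher.doc(sd.doc) here, one page at a time
            }
            after = page.scoreDocs[page.scoreDocs.length - 1];  // resume point for the next page
        }
    }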
Why can't you have each of your N shards sort its own hits, and then do a merge sort on the client side to get the top hits?
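I believe since 4.0, TopDocs.merge can do that merge for you, as long as every shard ran the same query with the same Sort. Rough, untested sketch (again assuming a sortable long field named "time"):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    // Rough sketch: each shard returns its own sorted top hits,
    // then TopDocs.merge picks the global top N on the client.
    TopDocs topAcrossShards(IndexSearcher[] searchers, Query query, int topN) throws IOException {
        Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
        TopDocs[] shardHits = new TopDocs[searchers.length];
        for (int i = 0; i < searchers.length; i++) {
            shardHits[i] = searchers[i].search(query, topN, byTime);  // each shard sorts its own hits
        }
        return TopDocs.merge(byTime, topN, shardHits);  // merge-sort the per-shard results
    }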
Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 7, 2016 at 5:48 AM, Tarun Kumar <ta...@sumologic.com> wrote:

> Any suggestions pls?
>
> On Mon, Jul 4, 2016 at 3:37 PM, Tarun Kumar <ta...@sumologic.com> wrote:
>
>> Hey Michael,
>>
>> docIds from multiple indices (on multiple machines) need to be
>> aggregated and sorted, and the first few thousand need to be queried.
>> These few thousand docs can be distributed among multiple machines,
>> and each machine will fetch the docs that are in its own indices. So
>> pulling the sorting onto the server side alone won't cover the use
>> case. Is there an alternative to get documents for given docIds
>> faster?
>>
>> Thanks
>> Tarun
>>
>> On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>
>>> Why not ask Lucene to do the sort on your time field, instead of
>>> pulling millions of docIds to the client and having it sort? You
>>> could even do index-time sorting by the time field if you want,
>>> which makes early termination possible (faster sorted searches).
>>>
>>> But if, even with Lucene doing the sort, you still need to load
>>> millions of documents per search request, you are in trouble: you
>>> need to re-formulate your use case somehow to take advantage of
>>> what Lucene is good at (getting top results for a search).
>>>
>>> Maybe you can use faceting to do whatever aggregation you are
>>> currently doing after retrieving those millions of documents.
>>>
>>> Maybe you could make a custom collector, and use doc values, to do
>>> your own custom aggregation.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar <ta...@sumologic.com> wrote:
>>>
>>>> Thanks for the reply, Michael! In my application, I need to get
>>>> millions of documents per search.
>>>>
>>>> The use case is the following: return documents in increasing
>>>> order of the time field. The client (caller) can't hold more than
>>>> a few thousand docs at a time, so it gets all docIds and the
>>>> corresponding time field for each doc, sorts them on time, and
>>>> fetches n docs at a time. To support this use case, I am:
>>>>
>>>> - getting all docIds first,
>>>> - sorting the docIds on the time field,
>>>> - querying n docIds at a time from the client, which makes an
>>>> indexReader.document(docId) call for each of the n docs on the
>>>> server, combines these docs, and returns them.
>>>>
>>>> indexReader.document(docId) is the bottleneck. What alternatives
>>>> do you suggest?
>>>>
>>>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>>
>>>>> Are you maybe trying to load too many documents for each search
>>>>> request?
>>>>>
>>>>> The IR.document API is designed to be used to load just a few
>>>>> hits, like a page's worth or ~10 documents, per search.
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar <ta...@sumologic.com> wrote:
>>>>>
>>>>>> I am running Lucene 4.6.1. I am trying to get documents
>>>>>> corresponding to docIds. All threads get stuck (not stuck
>>>>>> exactly, but they spend a LOT of time) at:
>>>>>>
>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>     at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>>>     at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>>>     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>>>     at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>>>     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>>>     at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>>>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>>>     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>>>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>>>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>>>
>>>>>> There is no disk throttling. What can cause this?
>>>>>>
>>>>>> Thanks
>>>>>> Tarun
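P.S.: the CompressingStoredFieldsReader frames in that trace are the cost of decompressing stored fields on every IndexReader.document call. If your aggregation only needs the time field, the "custom collector + doc values" idea above skips stored fields entirely. A minimal, untested sketch against the 4.6 Collector API, assuming "time" was indexed as a NumericDocValuesField:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Untested sketch: aggregate over the "time" doc values during collection,
    // never touching stored fields (no IndexReader.document calls at all).
    class TimeSumCollector extends Collector {
        private NumericDocValues times;
        long sum;  // stand-in for whatever aggregate you actually need

        @Override
        public void setScorer(Scorer scorer) {
            // scores are not needed for this aggregation
        }

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            times = context.reader().getNumericDocValues("time");  // column-stride values for this segment
        }

        @Override
        public void collect(int doc) {
            if (times != null) {
                sum += times.get(doc);  // per-hit value with no stored-fields I/O
            }
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;  // hit order doesn't matter for a sum
        }
    }

You'd run it with searcher.search(query, new TimeSumCollector()) and read the aggregate off the collector afterwards.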