Any suggestions, please?

On Mon, Jul 4, 2016 at 3:37 PM, Tarun Kumar <ta...@sumologic.com> wrote:
> Hey Michael,
>
> docIds from multiple indices (on multiple machines) need to be
> aggregated and sorted, and then the first few thousand need to be
> queried. Those few thousand docs can be spread across multiple
> machines, and each machine will fetch the docs that live in its own
> indices. So pushing the sort down to the server side won't suffice for
> this use case. Is there an alternative way to get the documents for
> given docIds faster?
>
> Thanks
> Tarun
>
> On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>
>> Why not ask Lucene to do the sort on your time field, instead of
>> pulling millions of docIds to the client and having it sort? You could
>> even do index-time sorting by the time field if you want, which makes
>> early termination possible (faster sorted searches).
>>
>> But if, even with Lucene doing the sort, you still need to load
>> millions of documents per search request, you are in trouble: you need
>> to re-formulate your use case somehow to take advantage of what Lucene
>> is good for (getting the top results for a search).
>>
>> Maybe you can use faceting to do whatever aggregation you are
>> currently doing after retrieving those millions of documents.
>>
>> Maybe you could make a custom collector, and use doc values, to do
>> your own custom aggregation.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
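>>
>> To make the first suggestion concrete, the minimal shape is something
>> like this (an untested sketch, not your code: it assumes the time field
>> is indexed as a numeric LongField named "time", and the index path and
>> page size of 1000 are placeholders). Lucene does the sort, and
>> searchAfter pages through the hits so the caller only ever holds one
>> page at a time:
>>
>>   import java.io.File;
>>
>>   import org.apache.lucene.index.DirectoryReader;
>>   import org.apache.lucene.search.IndexSearcher;
>>   import org.apache.lucene.search.MatchAllDocsQuery;
>>   import org.apache.lucene.search.Query;
>>   import org.apache.lucene.search.ScoreDoc;
>>   import org.apache.lucene.search.Sort;
>>   import org.apache.lucene.search.SortField;
>>   import org.apache.lucene.search.TopDocs;
>>   import org.apache.lucene.store.FSDirectory;
>>
>>   public class SortedPagingDemo {
>>     public static void main(String[] args) throws Exception {
>>       DirectoryReader reader =
>>           DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
>>       IndexSearcher searcher = new IndexSearcher(reader);
>>
>>       Query query = new MatchAllDocsQuery(); // stand-in for the real query
>>       // Lucene does the sort on the numeric time field, server-side.
>>       Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
>>
>>       TopDocs page = searcher.search(query, 1000, byTime); // first page
>>       while (page.scoreDocs.length > 0) {
>>         for (ScoreDoc hit : page.scoreDocs) {
>>           searcher.doc(hit.doc); // load only this page's documents
>>         }
>>         // Resume after the last hit instead of re-collecting everything.
>>         ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
>>         page = searcher.searchAfter(last, query, 1000, byTime);
>>       }
>>       reader.close();
>>     }
>>   }
>>
>> And for the collector idea: if what you ultimately need is an aggregate
>> rather than the stored documents themselves, a collector that reads doc
>> values never touches stored fields at all. Again only a sketch; it
>> assumes each segment has a NumericDocValuesField named "time", and the
>> min/max/count aggregation is just an example stand-in for whatever you
>> actually compute:
>>
>>   import java.io.IOException;
>>
>>   import org.apache.lucene.index.AtomicReaderContext;
>>   import org.apache.lucene.index.NumericDocValues;
>>   import org.apache.lucene.search.Collector;
>>   import org.apache.lucene.search.Scorer;
>>
>>   // Usage: searcher.search(query, new TimeStatsCollector());
>>   public class TimeStatsCollector extends Collector {
>>     private NumericDocValues times;
>>     private long min = Long.MAX_VALUE, max = Long.MIN_VALUE, count = 0;
>>
>>     @Override
>>     public void setScorer(Scorer scorer) {
>>       // Scores are not needed for this aggregation.
>>     }
>>
>>     @Override
>>     public void setNextReader(AtomicReaderContext context) throws IOException {
>>       // Assumes every segment indexed NumericDocValuesField("time", ...).
>>       times = context.reader().getNumericDocValues("time");
>>     }
>>
>>     @Override
>>     public void collect(int doc) {
>>       long t = times.get(doc); // columnar read, no stored-fields access
>>       if (t < min) min = t;
>>       if (t > max) max = t;
>>       count++;
>>     }
>>
>>     @Override
>>     public boolean acceptsDocsOutOfOrder() {
>>       return true; // hit order does not matter for min/max/count
>>     }
>>
>>     public long getMin() { return min; }
>>     public long getMax() { return max; }
>>     public long getCount() { return count; }
>>   }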
>>
>> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar <ta...@sumologic.com> wrote:
>>
>>> Thanks for the reply, Michael! In my application, I need to get
>>> millions of documents per search.
>>>
>>> The use case is the following: return documents in increasing order
>>> of the time field. The client (caller) can't hold more than a few
>>> thousand docs at a time, so it gets all docIds and the corresponding
>>> time field for each doc, sorts them on time, and then fetches n docs
>>> at a time. To support this use case, I am:
>>>
>>> - getting all docIds first,
>>> - sorting the docIds on the time field,
>>> - querying n docIds at a time from the client, which makes an
>>>   indexReader.document(docId) call for each of the n docs on the
>>>   server, combines those docs, and returns them.
>>>
>>> indexReader.document(docId) is creating the bottleneck. What
>>> alternatives do you suggest?
>>>
>>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>
>>>> Are you maybe trying to load too many documents for each search
>>>> request?
>>>>
>>>> The IR.document API is designed to load just a few hits, like a
>>>> page's worth or ~10 documents, per search.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar <ta...@sumologic.com> wrote:
>>>>
>>>>> I am running Lucene 4.6.1 and am trying to get the documents
>>>>> corresponding to a set of docIds. All threads get stuck (not stuck
>>>>> exactly, but they spend a LOT of time) at:
>>>>>
>>>>> java.lang.Thread.State: RUNNABLE
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>>     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>>     at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>>     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>>     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>>
>>>>> There is no disk throttling. What could cause this?
>>>>>
>>>>> Thanks
>>>>> Tarun
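>>>>>
>>>>> P.S. For reference, the fetch path on the server is essentially the
>>>>> loop below (a condensed, illustrative sketch, not the exact code;
>>>>> the index path and docId list are placeholders). Each iteration is
>>>>> one reader.document() call, i.e. the stored-fields read at the top
>>>>> of the stack above:
>>>>>
>>>>>   import java.io.File;
>>>>>
>>>>>   import org.apache.lucene.document.Document;
>>>>>   import org.apache.lucene.index.DirectoryReader;
>>>>>   import org.apache.lucene.index.IndexReader;
>>>>>   import org.apache.lucene.store.NIOFSDirectory;
>>>>>
>>>>>   public class FetchByDocIds {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>       IndexReader reader =
>>>>>           DirectoryReader.open(new NIOFSDirectory(new File("/path/to/index")));
>>>>>       int[] requestedDocIds = {0, 1, 2}; // a few thousand ids in practice
>>>>>       for (int docId : requestedDocIds) {
>>>>>         // Each call decompresses a stored-fields chunk and does a
>>>>>         // random read -- the visitDocument frame in the trace above.
>>>>>         Document doc = reader.document(docId);
>>>>>         // ... serialize doc into the response ...
>>>>>       }
>>>>>       reader.close();
>>>>>     }
>>>>>   }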