Re: Order docIds to reduce disk seeks

Vijay B Fri, 21 Nov 2014 05:54:13 -0800

The source of the data is an Oracle DB, and that is not an option for us due
to the volume of requests we expect and the amount of text we are pulling.
The IndexSearcher is managed via SearcherManager.


>>> If you are querying for most of the fields in the index then it might be
>>> more efficient to iterate through all of the stored fields. I'm not sure
>>> how to do this with the API however.

Could you please elaborate on this?


On Wed, Nov 19, 2014 at 5:59 AM, Barry Coughlan <[email protected]>
wrote:

> Hi Vijay,
>
> Could you just bypass Lucene altogether and send the documents to Carrot
> from the same place that Lucene got them?
>
> If for some reason you can not do that, here are some suggestions (note:
> I'm not a Lucene expert):
>
> 1. If you have other stored fields in your index, ensure you are only
> retrieving the text field: is.doc(scoreDoc.doc,
> Collections.singleton("doc_text")).
> 2. Re-use the IndexSearcher object instead of re-opening the index for
> different queries. I'm not sure from your code sample if you do this
> already.
> 3. Time your code to ensure that retrieving the stored field is the
> bottleneck in your case. If it turns out that searching is slow then you
> could store your UUID using DocValues and look up the document IDs in
> memory.
> 4. If you are querying for most of the fields in the index then it might be
> more efficient to iterate through all of the stored fields. I'm not sure
> how to do this with the API however.
>
> Barry
>
> On Tue, Nov 18, 2014 at 8:11 PM, Vijay B <[email protected]> wrote:
>
> > Hi Barry,
> >
> > Here is our use case. We fetch document text from Lucene and feed it to
> > the http://carrotsearch.com/ library to generate document clusters as a
> > text processing step. The Carrot Search API needs to be fed a list of
> > org.carrot2.core.Document
> > <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html>
> > objects constructed from the document title and complete text:
> > <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html#Document(java.lang.String, java.lang.String, java.lang.String)>
> >
> >
> >
> >
> > On Tue, Nov 18, 2014 at 2:53 PM, Barry Coughlan <[email protected]>
> > wrote:
> >
> > > Hi Vijay,
> > >
> > > I'm guessing Michael means that perhaps your text processing step could
> > > be better solved by using Lucene features. The use case of Lucene you
> > > describe in your post is better suited to a key-value store or a
> > > relational database.
> > >
> > > Can you give more details on what your text processing step does?
> > >
> > > Barry
> > >
> > > On Nov 18, 2014 7:41 PM, "Vijay B" <[email protected]> wrote:
> > > >
> > > > Hi Mike, could you provide some pointers on using the inverted index?
> > > > Any examples, or which API classes to use to accomplish this?
> > > >
> > > > On Tue, Nov 18, 2014 at 12:40 PM, Michael McCandless <
> > > > [email protected]> wrote:
> > > >
> > > > > Even if you sort all hits by docID it's likely too slow to visit
> > > > > every single one and load the stored document ...
> > > > >
> > > > > Try to find another way to solve your problem, making use of the
> > > > > inverted index?
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > >
> > > > > On Mon, Nov 17, 2014 at 6:05 PM, Rose, Stuart J <[email protected]>
> > > > > wrote:
> > > > > > Hi Vijay,
> > > > > >
> > > > > > ...sorting the documents you need to retrieve by docID order first...
> > > > > >
> > > > > > means sorting them by their 'document number', which is the value in
> > > > > > the 'scoreDoc.doc' field and is the value that the reader takes to
> > > > > > 'retrieve' the document from the index. If you write a comparator to
> > > > > > sort the elements in the ScoreDoc[] by their doc field, then that will
> > > > > > put them in 'docID order' and the reader will always be skipping
> > > > > > forward to the next doc, which will probably reduce its seek time.
> > > > > >
> > > > > > Regards,
> > > > > > Stuart
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Vijay B [mailto:[email protected]]
> > > > > > Sent: Monday, November 17, 2014 9:16 AM
> > > > > > To: [email protected]
> > > > > > Subject: Order docIds to reduce disk seeks
> > > > > >
> > > > > > Could someone point me to how to order docIds as per
> > > > > > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > > > > >
> > > > > > "Limit usage of stored fields and term vectors. Retrieving these from
> > > > > > the index is quite costly. Typically you should only retrieve these
> > > > > > for the current "page" the user will see, not for all documents in the
> > > > > > full result set. For each document retrieved, Lucene must seek to a
> > > > > > different location in various files. Try sorting the documents you
> > > > > > need to retrieve by docID order first."
> > > > > >
> > > > > > To give some background:
> > > > > >
> > > > > > We are using plain vanilla Lucene (version 4.2.1) for our application.
> > > > > > We index our documents using stored fields. We add two fields for each
> > > > > > document: UUID, a 9-digit number that represents an internal id, and
> > > > > > doc_text, the document text (approx. 7K to 20K in size). In our search
> > > > > > code, we use a BooleanQuery to retrieve by UUID and fetch the document
> > > > > > text to use it for other processing. We are noticing slow response
> > > > > > times with the searches. I understand that stored field retrieval is
> > > > > > slow and should be limited, but this is mandatory for our app.
> > > > > >
> > > > > >
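> > > > > > Current code:
> > > > > >
> > > > > > import org.apache.lucene.document.Document;
> > > > > > import org.apache.lucene.index.DirectoryReader;
> > > > > > import org.apache.lucene.search.BooleanQuery;
> > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > import org.apache.lucene.search.ScoreDoc;
> > > > > > import org.apache.lucene.search.TopScoreDocCollector;
> > > > > > import org.apache.lucene.store.FSDirectory;
> > > > > >
> > > > > > // Collect up to BooleanQuery.getMaxClauseCount() hits, in docID order
> > > > > > TopScoreDocCollector collector =
> > > > > >     TopScoreDocCollector.create(BooleanQuery.getMaxClauseCount(), true);
> > > > > >
> > > > > > dirReader = DirectoryReader.open(FSDirectory.open(......));
> > > > > > IndexSearcher indexSearcher = new IndexSearcher(dirReader);
> > > > > > indexSearcher.search(query, collector);
> > > > > > ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
> > > > > >
> > > > > > for (ScoreDoc scoreDoc : scoreDocs) {
> > > > > >     // Loading the stored document and reading doc_text is the slow part
> > > > > >     Document luceneDoc = indexSearcher.doc(scoreDoc.doc);
> > > > > >     String text = luceneDoc.get("doc_text"); // these calls take a lot of time
> > > > > >
> > > > > >     // process text
> > > > > > }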
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> >
>
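
Stuart's suggestion in this thread (sort the hits by their doc field before loading stored documents) might look roughly like the sketch below. It is only an illustration: Hit is a hypothetical stand-in for Lucene's ScoreDoc so the example runs without Lucene on the classpath, sortByDocId is an invented helper name, and the comparator syntax assumes Java 8 or later. With Lucene available, the same comparator idea would presumably apply directly to the ScoreDoc[] returned by the collector.

```java
import java.util.Arrays;
import java.util.Comparator;

public class DocIdOrder {

    /** Hypothetical stand-in for Lucene's ScoreDoc, which has public doc and score fields. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns a copy of the hits sorted by ascending docID; the input array is not modified. */
    static Hit[] sortByDocId(Hit[] hits) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator.comparingInt((Hit h) -> h.doc));
        return sorted;
    }

    public static void main(String[] args) {
        // Hits as they might come back from a search, ordered by score rather than docID
        Hit[] hits = { new Hit(42, 3.0f), new Hit(7, 2.2f), new Hit(19, 1.5f) };

        // Visit the hits in docID order so the reader only moves forward
        for (Hit hit : sortByDocId(hits)) {
            System.out.println(hit.doc); // prints 7, 19, 42 on separate lines
        }
    }
}
```

Note Mike's caveat above: even in docID order, loading every stored document may still be too slow, so it is worth timing before and after.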
