Re: Order docIds to reduce disk seeks

Vijay B Tue, 18 Nov 2014 12:12:25 -0800

Hi Barry,

Here is our use case. We fetch document text from Lucene and feed it to the
http://carrotsearch.com/ library to generate document clusters as a text
processing step. The Carrot Search API needs to be fed a list of
org.carrot2.core.Document
<http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html>
objects, each constructed from the document title and complete text.
<http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html#Document(java.lang.String, java.lang.String, java.lang.String)>





On Tue, Nov 18, 2014 at 2:53 PM, Barry Coughlan <b.coughl...@gmail.com>
wrote:

> Hi Vijay,
>
> I'm guessing Michael means that perhaps your text processing step could be
> better solved by using Lucene features. The use case of Lucene you describe
> in your post is better suited to a key value store or a relational
> database.
>
> Can you give more details on what your text processing step does?
>
> Barry
>
> On Nov 18, 2014 7:41 PM, "Vijay B" <vijay.nip...@gmail.com> wrote:
> >
> > Hi Mike, could you provide some pointers on using the inverted index? Any
> > examples, or which API classes to use to accomplish this?
> >
> > On Tue, Nov 18, 2014 at 12:40 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > Even if you sort all hits by docID it's likely too slow to visit every
> > > single one and load the stored document ...
> > >
> > > Try to find another way to solve your problem, making use of the inverted
> > > index?
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Mon, Nov 17, 2014 at 6:05 PM, Rose, Stuart J <stuart.r...@pnnl.gov>
> > > wrote:
> > > > Hi Vijay,
> > > >
> > > > ...sorting the documents you need to retrieve by docID order first...
> > > >
> > > > means sorting them by their 'document number', which is the value in the
> > > > 'scoreDoc.doc' field and is the value that the reader takes to 'retrieve'
> > > > the document from the index. If you write a comparator to sort the elements
> > > > in the ScoreDoc[] by their doc field, then that will put them in 'docID
> > > > order' and the reader will always be skipping forward to the next doc, which
> > > > will probably reduce its seek time.
> > > >
> > > > Regards,
> > > > Stuart
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Vijay B [mailto:vijay.nip...@gmail.com]
> > > > Sent: Monday, November 17, 2014 9:16 AM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Order docIds to reduce disk seeks
> > > >
> > > > Could someone point me to how to order docIds as per
> > > > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > > >
> > > > "Limit usage of stored fields and term vectors. Retrieving these from
> > > > the index is quite costly. Typically you should only retrieve these for the
> > > > current "page" the user will see, not for all documents in the full result
> > > > set. For each document retrieved, Lucene must seek to a different location
> > > > in various files. Try sorting the documents you need to retrieve by docID
> > > > order first."
> > > >
> > > > To give some background:
> > > >
> > > > We are using plain vanilla Lucene (version 4.2.1) for our application.
> > > > We index our documents using stored fields. We add two fields related to
> > > > our documents: UUID, a 9-digit number that represents the internal id, and
> > > > doc_text, the document text (approx. 7K to 20K in size). In our search
> > > > code, we use a BooleanQuery to retrieve by UUID, fetch the document text,
> > > > and use it for other processing. We are noticing slow response times with
> > > > the searches. I understand that stored field retrieval is slower and should
> > > > be limited, but this is mandatory for our app.
> > > >
> > > >
> > > > Current code:
> > > >
> > > > TopScoreDocCollector collector =
> > > >     TopScoreDocCollector.create(BooleanQuery.getMaxClauseCount(), true);
> > > >
> > > > dirReader = DirectoryReader.open(FSDirectory.open(......));
> > > > IndexSearcher indexSearcher = new IndexSearcher(dirReader);
> > > > indexSearcher.search(query, collector);
> > > > ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
> > > >
> > > > for (ScoreDoc scoreDoc : scoreDocs) {
> > > >     // these calls take a lot of time
> > > >     Document luceneDoc = indexSearcher.doc(scoreDoc.doc);
> > > >     String text = luceneDoc.get("doc_text");
> > > >
> > > >     // process text
> > > > }
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > >
> > >
> > >
>
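
Stuart's suggestion in the quoted thread, sorting the hits by their doc
field before loading stored fields, could be sketched roughly as follows.
The Hit class here is a hypothetical stand-in for Lucene's ScoreDoc (which
exposes a public int doc field), used only so the example runs without
Lucene on the classpath; with Lucene, the same comparator would be applied
to the ScoreDoc[] directly. The method name sortByDocId is also an
assumption, not part of any Lucene API.

```java
import java.util.Arrays;
import java.util.Comparator;

public class DocIdOrder {

    // Hypothetical stand-in for org.apache.lucene.search.ScoreDoc,
    // which carries a public int "doc" and a float "score".
    public static final class Hit {
        public final int doc;
        public final float score;

        public Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns a copy of the hits sorted by ascending doc number, so that
    // stored fields can be loaded while only moving forward through the index.
    // The input array is left untouched, preserving the relevance order.
    public static Hit[] sortByDocId(Hit[] hits) {
        Hit[] copy = hits.clone();
        Arrays.sort(copy, Comparator.comparingInt((Hit h) -> h.doc));
        return copy;
    }

    public static void main(String[] args) {
        Hit[] byScore = {
            new Hit(42, 1.5f), new Hit(7, 0.9f), new Hit(19, 1.1f)
        };
        for (Hit hit : sortByDocId(byScore)) {
            // With Lucene, this is where indexSearcher.doc(hit.doc) would go.
            System.out.println("load doc " + hit.doc);
        }
    }
}
```

Sorting a copy keeps the original score order available for anything that
still needs it. As Mike notes in the thread, this ordering may still be too
slow if the stored document has to be loaded for every single hit.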
