Re: Order docIds to reduce disk seeks

Barry Coughlan Tue, 18 Nov 2014 11:55:29 -0800

Hi Vijay,

I'm guessing Michael means that perhaps your text processing step could be
better solved by using Lucene features. The use of Lucene you describe
in your post is better suited to a key-value store or a relational database.


Can you give more details on what your text processing step does?

Barry

On Nov 18, 2014 7:41 PM, "Vijay B" <[email protected]> wrote:
>
> Hi Mike, could you provide some pointers on using the inverted index? Any
> examples, or which API classes to use to accomplish this?
>
> On Tue, Nov 18, 2014 at 12:40 PM, Michael McCandless <
> [email protected]> wrote:
>
> > Even if you sort all hits by docID it's likely too slow to visit every
> > single one and load the stored document ...
> >
> > Try to find another way to solve your problem, making use of the
> > inverted index?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Nov 17, 2014 at 6:05 PM, Rose, Stuart J <[email protected]>
> > wrote:
> > > Hi Vijay,
> > >
> > > ...sorting the documents you need to retrieve by docID order first...
> > >
> > > means sorting them by their 'document number', which is the value in
> > > the 'scoreDoc.doc' field and is the value that the reader takes to
> > > 'retrieve' the document from the index. If you write a comparator to
> > > sort the elements in the ScoreDoc[] by their doc field, then that will
> > > put them in 'docID order' and the reader will always be skipping
> > > forward to the next doc, which will probably reduce its seek time.
> > >
> > > Regards,
> > > Stuart
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Vijay B [mailto:[email protected]]
> > > Sent: Monday, November 17, 2014 9:16 AM
> > > To: [email protected]
> > > Subject: Order docIds to reduce disk seeks
> > >
> > > Could someone point me to how to order docIds as per
> > > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > >
> > > "Limit usage of stored fields and term vectors. Retrieving these from
> > > the index is quite costly. Typically you should only retrieve these for
> > > the current "page" the user will see, not for all documents in the full
> > > result set. For each document retrieved, Lucene must seek to a different
> > > location in various files. Try sorting the documents you need to
> > > retrieve by docID order first."
> > >
> > > To give some background:
> > >
> > > We are using plain vanilla Lucene (version 4.2.1) for our application.
> > > We index our documents using stored fields. We add two fields related
> > > to our documents: UUID, a 9-digit number representing the internal id,
> > > and doc_text, the document text (approx. 7K to 20K in size). In our
> > > search code, we use a BooleanQuery to retrieve by UUID, fetch the
> > > document text, and use it for other processing. We are noticing slow
> > > response times with the searches. I understand that stored field
> > > retrieval is slow and should be limited, but this is mandatory for our
> > > app.
> > >
> > >
> > > Current code:
> > >
> > > TopScoreDocCollector collector =
> > >     TopScoreDocCollector.create(BooleanQuery.getMaxClauseCount(), true);
> > >
> > > dirReader = DirectoryReader.open(FSDirectory.open(......));
> > > IndexSearcher indexSearcher = new IndexSearcher(dirReader);
> > > indexSearcher.search(query, collector);
> > > ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
> > >
> > > for (ScoreDoc scoreDoc : scoreDocs) {
> > >     // these calls take a lot of time
> > >     Document luceneDoc = indexSearcher.doc(scoreDoc.doc);
> > >     String text = luceneDoc.get("doc_text");
> > >
> > >     // process text
> > > }
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
> >
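
For reference, the docID ordering Stuart describes in the quoted message can be sketched roughly as below. This is a minimal illustration that uses only the JDK: `Hit` is a hypothetical stand-in for Lucene's `ScoreDoc` (which exposes public `doc` and `score` fields), and `sortByDocId` is an illustrative helper name, not a Lucene API. With a real index, the same comparison would be applied to the `ScoreDoc[]` before the loop that calls `indexSearcher.doc(...)`; as Mike points out, this may still be too slow when every hit's stored document is loaded.

```java
import java.util.Arrays;

public class DocIdOrder {

    /** Hypothetical stand-in for Lucene's ScoreDoc (public doc and score fields). */
    static final class Hit {
        final int doc;      // Lucene's internal document number
        final float score;  // relevance score, unused by the sort

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns a copy of the hits sorted by ascending docID; the input array is not modified. */
    static Hit[] sortByDocId(Hit[] hits) {
        Hit[] sorted = Arrays.copyOf(hits, hits.length);
        // Ascending docID means stored fields are read in one forward pass.
        Arrays.sort(sorted, (a, b) -> Integer.compare(a.doc, b.doc));
        return sorted;
    }

    public static void main(String[] args) {
        // Hits as they might arrive in score order (illustrative values only).
        Hit[] hits = { new Hit(42, 1.5f), new Hit(7, 0.9f), new Hit(19, 1.1f) };
        for (Hit h : sortByDocId(hits)) {
            System.out.println(h.doc); // prints 7, then 19, then 42
        }
    }
}
```

Sorting a copy keeps the original score order available in case the results must later be presented by relevance.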
