Hi Barry,

Here is our use case. We fetch document text from Lucene and feed it to the http://carrotsearch.com/ library, which generates document clusters as a text processing step. The Carrot Search API needs to be fed a list of org.carrot2.core.Document <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html> objects constructed from the document title and complete text. <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html#Document(java.lang.String, java.lang.String, java.lang.String)>

On Tue, Nov 18, 2014 at 2:53 PM, Barry Coughlan <[email protected]> wrote:

> Hi Vijay,
>
> I'm guessing Michael means that perhaps your text processing step could be
> better solved by using Lucene features. The use case of Lucene you describe
> in your post is better suited to a key value store or a relational
> database.
>
> Can you give more details on what your text processing step does?
>
> Barry
>
> On Nov 18, 2014 7:41 PM, "Vijay B" <[email protected]> wrote:
> >
> > Hi Mike, could you provide some pointers on using the inverted index? Any
> > examples, or which API classes to use to accomplish this?
> >
> > On Tue, Nov 18, 2014 at 12:40 PM, Michael McCandless <[email protected]> wrote:
> >
> > > Even if you sort all hits by docID it's likely too slow to visit every
> > > single one and load the stored document ...
> > >
> > > Try to find another way to solve your problem, making use of the
> > > inverted index?
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Mon, Nov 17, 2014 at 6:05 PM, Rose, Stuart J <[email protected]> wrote:
> > > > Hi Vijay,
> > > >
> > > > ...sorting the documents you need to retrieve by docID order first...
> > > >
> > > > means sorting them by their 'document number', which is the value in
> > > > the 'scoreDoc.doc' field and is the value that the reader takes to
> > > > 'retrieve' the document from the index. If you write a comparator to
> > > > sort the elements in the ScoreDoc[] by their doc field then that will
> > > > put them in 'docID order' and the reader will always be skipping
> > > > forward to the next doc, which will probably reduce its seek time.
> > > >
> > > > Regards,
> > > > Stuart
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Vijay B [mailto:[email protected]]
> > > > Sent: Monday, November 17, 2014 9:16 AM
> > > > To: [email protected]
> > > > Subject: Order docIds to reduce disk seeks
> > > >
> > > > Could someone point me to how to order docIds as per
> > > > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > > >
> > > > "Limit usage of stored fields and term vectors. Retrieving these from
> > > > the index is quite costly. Typically you should only retrieve these
> > > > for the current "page" the user will see, not for all documents in
> > > > the full result set. For each document retrieved, Lucene must seek to
> > > > a different location in various files. Try sorting the documents you
> > > > need to retrieve by docID order first."
> > > >
> > > > To give some background:
> > > >
> > > > We are using plain vanilla Lucene (version 4.2.1) for our
> > > > application. We index our documents using stored fields. We add two
> > > > fields related to our documents:
> > > >
> > > > UUID: a 9-digit number that represents the internal id
> > > > doc_text: the document text (7K to 20K in size approx.)
> > > >
> > > > In our search code, we use a BooleanQuery to retrieve by UUID and
> > > > fetch the document text to use it for other processing. We are
> > > > noticing slow response times with the searches. I understand that
> > > > stored field retrieval is slower and should be limited, but this is
> > > > mandatory for our app.
> > > >
> > > >
> > > > Current code:
> > > >
> > > > TopScoreDocCollector collector =
> > > >     TopScoreDocCollector.create(BooleanQuery.getMaxClauseCount(), true);
> > > >
> > > > dirReader = DirectoryReader.open(FSDirectory.open(......));
> > > > IndexSearcher indexSearcher = new IndexSearcher(dirReader);
> > > > indexSearcher.search(query, collector);
> > > > ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
> > > >
> > > > for (ScoreDoc scoreDoc : scoreDocs) {
> > > >     Document luceneDoc = indexSearcher.doc(scoreDoc.doc);
> > > >     String text = luceneDoc.get("doc_text"); //these calls take lot of time
> > > >
> > > >     //process text
> > > > }
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
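
For reference, here is a minimal sketch of the docID ordering Stuart describes in the quoted thread. It is only an illustration: `Hit` is a hypothetical stand-in for Lucene's ScoreDoc (with the same `doc` and `score` fields) so that the snippet compiles without the Lucene jar, and the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

class DocIdOrderSketch {

    // Hypothetical stand-in for Lucene's ScoreDoc: only the two fields used here.
    static final class Hit {
        final int doc;      // document number the reader uses to fetch the stored document
        final float score;  // relevance score; ignored when ordering by docID

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns a copy of the hits sorted by ascending doc number.
    // The input array is left untouched, so the score order is still available.
    static Hit[] sortByDocId(Hit[] hits) {
        Hit[] sorted = Arrays.copyOf(hits, hits.length);
        Arrays.sort(sorted, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Integer.compare(a.doc, b.doc);
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Hits as they might come back from a search, ordered by score.
        Hit[] byScore = { new Hit(42, 3.1f), new Hit(7, 2.4f), new Hit(19, 1.0f) };

        // Visit them in docID order; with Lucene this is where the stored
        // document would be loaded for each hit.
        for (Hit hit : sortByDocId(byScore)) {
            System.out.println(hit.doc);
        }
    }
}
```

With the real API, the same comparator would presumably be applied to the ScoreDoc[] from collector.topDocs().scoreDocs before the loop that calls indexSearcher.doc(scoreDoc.doc). As Mike notes in the thread, this ordering alone may still be too slow when every hit's stored document has to be loaded.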
