Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
Hi Sergey,

   I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader), which means you have to have the whole thing in memory.  Also, if
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.
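
A minimal sketch of that distinction with the Lucene 4.x field API (the class and field names below are only placeholders for illustration):

  import java.io.Reader;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;

  public class FieldSketch {

      // Stored field: the whole extracted text has to be in memory as a String.
      public static Document storedDoc(String wholeText) {
          Document doc = new Document();
          doc.add(new TextField("content", wholeText, Field.Store.YES));
          return doc;
      }

      // Unstored field: a Reader is enough, so the text can be streamed while
      // it is tokenized.  TextField(String, Reader) is indexed but never stored.
      public static Document unstoredDoc(Reader content) {
          Document doc = new Document();
          doc.add(new TextField("content", content));
          return doc;
      }
  }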

I'd like to ask something about using a Tika parser with a Reader (and Lucene Store.NO).

Consider a case where we have a service that accepts a very large PDF file. This file will be stored on disk, or perhaps in a DB, and the service will also use Tika to extract the content and populate a Lucene Document. Since the original PDF already occupies some space, duplicating its content in a Document with Store.YES fields may not be the best idea in some cases.

So I wonder, would it be possible for a given Tika Parser, let's say the PDF parser, to report, via the Metadata, the start and end offsets of the content? The consumer could then create, say, an InputStreamReader over that content region and index it with Store.NO and this Reader.

Does that make sense at all? If so, I can create a minor enhancement request for parsers to get access to low-level information such as the start/stop delimiters of the content and report it.
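
To make the idea concrete, here is a rough sketch of what the consumer side might look like if a parser could report such offsets. The metadata keys are purely hypothetical (nothing in Tika reports them today), and this glosses over whether the raw bytes of that region would actually be readable as plain text:

  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;

  import org.apache.commons.io.input.BoundedInputStream;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.TextField;
  import org.apache.tika.metadata.Metadata;

  public class RegionIndexingSketch {

      // Hypothetical keys -- no Tika parser reports these today.
      static final String CONTENT_START = "X-Content-Start-Offset";
      static final String CONTENT_END   = "X-Content-End-Offset";

      public static Document indexRegion(Path file, Metadata metadata) throws Exception {
          long start = Long.parseLong(metadata.get(CONTENT_START));
          long end   = Long.parseLong(metadata.get(CONTENT_END));

          InputStream in = Files.newInputStream(file);
          in.skip(start);                                  // jump to the reported region
          Reader reader = new InputStreamReader(
                  new BoundedInputStream(in, end - start), // stop at the end offset
                  StandardCharsets.UTF_8);

          Document doc = new Document();
          // TextField(String, Reader) is tokenized but not stored, so nothing
          // from the original file is duplicated in the index.
          doc.add(new TextField("content", reader));
          return doc;
      }
  }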

Cheers, Sergey





   Some thoughts:

   At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.
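
A minimal sketch of that approach on the indexing side, assuming the per-document text has already been extracted somehow (the extraction part is left out, and the field names are arbitrary):

  import java.util.Map;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;

  public class PerAttachmentIndexing {

      // One Lucene document per container/embedded document.  "texts" maps an
      // attachment path (e.g. "report.pdf/embedded-1.docx") to its extracted text.
      public static void index(IndexWriter writer, Map<String, String> texts) throws Exception {
          for (Map.Entry<String, String> e : texts.entrySet()) {
              Document doc = new Document();
              doc.add(new StringField("path", e.getKey(), Field.Store.YES));
              doc.add(new TextField("content", e.getValue(), Field.Store.NO));
              writer.addDocument(doc);
          }
      }
  }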

   You could also break large documents into logical sections and index those 
as separate documents, but that gets very use-case dependent.

     In practice, for many, many use cases I've come across, you can index quite large documents 
with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber."  There may be 
a performance hit at highlighting time for large documents, depending on which highlighter you use.  
In the old days there was a 10k default limit on the number of tokens per field, but that is long gone.
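
For anyone who does want to cap the number of tokens per field nowadays, something along these lines should work (Lucene 4.x; the exact Version constant and the 10,000 figure are just examples):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.util.Version;

  public class TokenLimitSketch {

      // Recreate the old behaviour explicitly: stop indexing a field after
      // maxTokens tokens instead of relying on a long-gone global default.
      public static IndexWriterConfig limitedConfig(int maxTokens) {
          Analyzer base = new StandardAnalyzer(Version.LUCENE_4_9);
          Analyzer limited = new LimitTokenCountAnalyzer(base, maxTokens);
          return new IndexWriterConfig(Version.LUCENE_4_9, limited);
      }
  }

  // e.g. new IndexWriter(directory, TokenLimitSketch.limitedConfig(10000));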

   For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.

  Cheers,

               Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene, and our
initial attempt was to index the output of
ToTextContentHandler.toString() as a Lucene Text field.
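
For reference, this is roughly what that initial approach looks like (AutoDetectParser and the field name are just for illustration):

  import java.io.InputStream;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.ToTextContentHandler;

  public class WholeTextIndexing {

      // Parse the whole stream into one String, then index it as a single text field.
      public static Document parseAndIndex(InputStream stream) throws Exception {
          ToTextContentHandler handler = new ToTextContentHandler();
          new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());

          Document doc = new Document();
          doc.add(new TextField("content", handler.toString(), Field.Store.NO));
          return doc;
      }
  }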

This is unlikely to be effective for large files, so I wonder what strategies
exist for more effective indexing/tokenization of the possibly large content.

Perhaps a custom ContentHandler could index content fragments as separate
entries of a Lucene field every time its characters(...) method is called,
something I've been planning to experiment with.
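
A rough sketch of that idea (Store.NO, one field value per SAX text event); whether it actually saves memory is the open question, since the Document still accumulates every fragment until addDocument() is called:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.xml.sax.helpers.DefaultHandler;

  public class FragmentIndexingHandler extends DefaultHandler {

      private final Document doc = new Document();

      // Add each SAX text fragment as another value of the multi-valued "content" field.
      @Override
      public void characters(char[] ch, int start, int length) {
          doc.add(new TextField("content", new String(ch, start, length), Field.Store.NO));
      }

      public Document getDocument() {
          return doc;
      }
  }

The handler could then be passed to parser.parse(...), with writer.addDocument(handler.getDocument()) once parsing completes.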

Any feedback would be appreciated.
Cheers, Sergey

