Hi,
On 02/07/14 13:54, Ken Krugler wrote:

On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote:

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files.

What are your concerns here?

We are writing a utility for (CXF JAX-RS) users to start experimenting with search with the help of Tika and Lucene, so my concerns are rather vague for now. I suspect that parsing a large file into a possibly massive String and indexing it as a single Lucene Text field won't be optimal in terms of memory and/or performance.
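
For reference, here is a simplified sketch of what the utility currently does (the AutoDetectParser and the "contents" field name are just placeholders for this example, not the exact code):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;

public class SingleFieldIndexer {
    // Parses the whole document into one String, then indexes it as a single
    // Text field, so the complete text is buffered in memory before Lucene sees it.
    public static void index(IndexWriter writer, String path) throws Exception {
        ToTextContentHandler handler = new ToTextContentHandler();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        Document doc = new Document();
        doc.add(new TextField("contents", handler.toString(), Field.Store.NO));
        writer.addDocument(doc);
    }
}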

And what's the max amount of text in one file you think you'll need to
index?
I have no idea about that at this point. I'd like to make sure our utility can help users effectively index Tika output into Lucene if they ever need to.

Thanks, Sergey


-- Ken

So I wonder what strategies exist for a more effective indexing/tokenization
of the possibly large content.

Perhaps a custom ContentHandler could index content fragments into a unique
Lucene field every time its characters(...) method is called; that is
something I've been planning to experiment with.
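
To make the idea concrete, here is a rough sketch of such a handler (the numbered field names like "contents.0", "contents.1" and the skipping of whitespace-only fragments are just assumptions for the example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentIndexingContentHandler extends DefaultHandler {

    private final Document doc;
    private final String fieldName;
    private int fragment; // counter used to build a unique field name per fragment

    public FragmentIndexingContentHandler(Document doc, String fieldName) {
        this.doc = doc;
        this.fieldName = fieldName;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // each non-empty fragment becomes its own field, e.g. "contents.0"
            doc.add(new TextField(fieldName + "." + fragment++, text, Field.Store.NO));
        }
    }
}

The handler would be passed to parser.parse(...) together with a fresh Document, and writer.addDocument(doc) called once parsing completes, so the full text is never concatenated into a single String. An alternative would be to reuse the same field name for every fragment, which Lucene treats as one multi-valued field and which might be simpler to query than numbered field names.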

Any feedback will be appreciated.
Cheers, Sergey

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com
