Hi,
On 02/07/14 13:54, Ken Krugler wrote:

On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote:

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files.

What are your concerns here?

We are writing a utility for (CXF JAX-RS) users to start experimenting with search with the help of Tika and Lucene, so my concerns are rather vague for now. I suspect that parsing a large file into a possibly massive String and indexing it as a single Lucene Text field won't be optimal in terms of memory and/or performance.
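
For reference, here is a simplified sketch of what the utility currently does (the AutoDetectParser and the "contents" field name are just placeholders for this example, not the exact code):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;

public class SingleFieldIndexer {
    // Parses the whole document into one String, then indexes it as a single
    // Text field, so the complete text is buffered in memory before Lucene sees it.
    public static void index(IndexWriter writer, String path) throws Exception {
        ToTextContentHandler handler = new ToTextContentHandler();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        Document doc = new Document();
        doc.add(new TextField("contents", handler.toString(), Field.Store.NO));
        writer.addDocument(doc);
    }
}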

And what's the max amount of text in one file you think you'll need to
index?
I have no idea about that at this point. I'd like to make sure our utility can help users effectively index Tika output into Lucene if they ever need to.

Thanks, Sergey


-- Ken

So I wonder what strategies exist for a more effective indexing/tokenization
of the possibly large content.

Perhaps a custom ContentHandler could index content fragments into a unique
Lucene field every time its characters(...) method is called; that is
something I've been planning to experiment with.
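
To make the idea concrete, here is a rough sketch of such a handler (the numbered field names like "contents.0", "contents.1" and the skipping of whitespace-only fragments are just assumptions for the example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentIndexingContentHandler extends DefaultHandler {

    private final Document doc;
    private final String fieldName;
    private int fragment; // counter used to build a unique field name per fragment

    public FragmentIndexingContentHandler(Document doc, String fieldName) {
        this.doc = doc;
        this.fieldName = fieldName;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // each non-empty fragment becomes its own field, e.g. "contents.0"
            doc.add(new TextField(fieldName + "." + fragment++, text, Field.Store.NO));
        }
    }
}

The handler would be passed to parser.parse(...) together with a fresh Document, and writer.addDocument(doc) called once parsing completes, so the full text is never concatenated into a single String. An alternative would be to reuse the same field name for every fragment, which Lucene treats as one multi-valued field and which might be simpler to query than numbered field names.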

Any feedback will be appreciated.
Cheers, Sergey

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Blog: http://sberyozkin.blogspot.com
