Hi,
On 02/07/14 13:54, Ken Krugler wrote:
On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
Hi All,
We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.
This is unlikely to be effective for large files.
What are your concerns here?
We are writing a utility to let (CXF JAX-RS) users start experimenting with
search with the help of Tika and Lucene. As such my concerns are
rather vague for now. I suspect that parsing a large file into a
possibly very large/massive String and indexing it as a single Lucene
Text field won't be optimal for memory and/or performance.
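To make the concern concrete, here is a minimal stdlib-only sketch of the alternative: instead of materializing the whole parse output as one String, consume it through a Reader in bounded chunks. The `indexChunk` method and `ChunkedConsumer` class are hypothetical names; in a real utility the chunk would go into a Lucene field rather than a list:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: consume parser output through a Reader in
// fixed-size chunks, so memory use is bounded by the buffer size rather
// than by the size of the whole document.
public class ChunkedConsumer {
    static final List<String> chunks = new ArrayList<>();

    // Placeholder for indexing one fragment, e.g. adding a Lucene field.
    static void indexChunk(String fragment) {
        chunks.add(fragment);
    }

    // Read the content in chunkSize-character pieces and index each piece.
    static void consume(Reader content, int chunkSize) throws IOException {
        char[] buf = new char[chunkSize];
        int n;
        while ((n = content.read(buf)) != -1) {
            indexChunk(new String(buf, 0, n));
        }
    }

    public static void main(String[] args) throws IOException {
        consume(new StringReader("a large body of parsed text"), 8);
        System.out.println(chunks.size());
    }
}
```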
And what's the max amount of text in one file you think you'll need to
index?
This is something I have no idea about. I'd like to make sure our utility
can help other users to effectively index Tika output into Lucene if
they ever need it.
Thanks, Sergey
-- Ken
So I wonder what
strategies exist for more effective indexing/tokenization of the
possibly large content.
Perhaps a custom ContentHandler could index content fragments into a unique
Lucene field every time its characters(...) method is called; that is
something I've been planning to experiment with.
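The idea above can be sketched with only the JDK's SAX classes (Tika's ContentHandler is the same SAX interface). This hypothetical `FragmentHandler` captures each characters(...) callback as a separate fragment; the fragments list stands in for adding one Lucene field per fragment, as the commented-out line suggests:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a ContentHandler that records every non-empty
// characters(...) callback as its own fragment instead of building one
// large String for the whole document.
public class FragmentHandler extends DefaultHandler {
    public final List<String> fragments = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        String fragment = new String(ch, start, length).trim();
        if (!fragment.isEmpty()) {
            // In the real utility this might become something like:
            // doc.add(new TextField("content_" + fragments.size(), fragment, Store.NO));
            fragments.add(fragment);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>first part</p><p>second part</p></doc>";
        FragmentHandler handler = new FragmentHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.fragments); // prints the captured fragments
    }
}
```

Note that a SAX parser is free to split one text node across several characters(...) calls, so in practice fragments may be smaller than whole text nodes; any per-fragment field scheme needs to tolerate that.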
The feedback will be appreciated
Cheers, Sergey
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
Blog: http://sberyozkin.blogspot.com