On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote:

> Hi All,
> 
> We've been experimenting with indexing the parsed content in Lucene and
> our initial attempt was to index the output from
> ToTextContentHandler.toString() as a Lucene Text field.
> 
> This is unlikely to be effective for large files.

What are your concerns here?

And what's the max amount of text in one file you think you'll need to index?

-- Ken

> So I wonder what
> strategies exist for more effective indexing/tokenization of the
> possibly large content.
> 
> Perhaps a custom ContentHandler can index content fragments in a unique
> Lucene field every time its characters(...) method is called, something
> I've been planning to experiment with.
> 
> Any feedback would be appreciated.
> Cheers, Sergey
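FWIW, here's a rough sketch of the per-fragment idea using only the JDK SAX API — a handler that buffers characters(...) callbacks and flushes fixed-size fragments. The class name, fragment size, and field-naming scheme are all illustrative assumptions; the actual Lucene wiring (adding each fragment as a TextField on a Document) is left as a comment:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffer characters(...) callbacks and flush
// fixed-size fragments. In a real indexer, each flushed fragment could
// become its own field, e.g. new TextField("content_" + i, fragment, ...).
public class FragmentingHandler extends DefaultHandler {
    private final int maxChars;                       // max chars per fragment
    private final StringBuilder buf = new StringBuilder();
    private final List<String> fragments = new ArrayList<>();

    public FragmentingHandler(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in arbitrary chunks; accumulate, then flush
        // deterministically whenever a full fragment is available.
        buf.append(ch, start, length);
        while (buf.length() >= maxChars) {
            fragments.add(buf.substring(0, maxChars));
            buf.delete(0, maxChars);
        }
    }

    @Override
    public void endDocument() {
        // Flush any trailing partial fragment.
        if (buf.length() > 0) {
            fragments.add(buf.toString());
            buf.setLength(0);
        }
    }

    public List<String> getFragments() {
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>The quick brown fox jumps over the lazy dog</doc>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FragmentingHandler handler = new FragmentingHandler(16);
        parser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        for (String f : handler.getFragments()) {
            System.out.println("[" + f + "]");
        }
    }
}
```

One caveat with splitting on raw character counts: fragments can cut words in half at the boundaries, which hurts phrase queries — you'd likely want to flush on whitespace, or overlap fragments slightly.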

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




