On Jul 2, 2014, at 5:27am, Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Hi All,
>
> We've been experimenting with indexing the parsed content in Lucene and
> our initial attempt was to index the output from
> ToTextContentHandler.toString() as a Lucene Text field.
>
> This is unlikely to be effective for large files.

What are your concerns here? And what's the max amount of text in one file you think you'll need to index?

-- Ken

> So I wonder what strategies exist for a more effective
> indexing/tokenization of the possibly large content.
>
> Perhaps a custom ContentHandler can index content fragments in a unique
> Lucene field every time its characters(...) method is called, something
> I've been planning to experiment with.
>
> The feedback will be appreciated
> Cheers, Sergey

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
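The fragment-per-characters(...) idea Sergey describes can be sketched as a plain SAX handler. This is only an illustration, not Tika or Lucene API: the class name, the size threshold, and the use of a `List<String>` in place of real Lucene field writes are all assumptions; in a real setup each flushed fragment would be handed to an `IndexWriter` instead of collected in memory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text from characters(...) callbacks and
// flushes a fragment once the buffer crosses a size threshold. In a real
// Tika/Lucene pipeline the flush() would add a Lucene field or document
// rather than append to a list (that part is an assumption, not shown here).
public class FragmentingHandler extends DefaultHandler {
    private final List<String> fragments = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private final int maxFragmentChars;

    public FragmentingHandler(int maxFragmentChars) {
        this.maxFragmentChars = maxFragmentChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may split text into callbacks of arbitrary size, so buffer
        // rather than treating each callback as a complete fragment.
        buffer.append(ch, start, length);
        if (buffer.length() >= maxFragmentChars) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        flush(); // emit any trailing text shorter than the threshold
    }

    private void flush() {
        if (buffer.length() > 0) {
            fragments.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public List<String> getFragments() {
        return fragments;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>hello world</p><p>more text here</p></doc>";
        FragmentingHandler handler = new FragmentingHandler(10);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        for (String fragment : handler.getFragments()) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

The same pattern plugs into Tika by passing the handler to a parser's parse(...) call; the buffering keeps fragment sizes bounded regardless of how the underlying parser chunks its character events.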