If you want to give it a try: we created a crawling Tika parser, which adds recursive, incremental crawling capabilities to Tika. There we also implemented a handler as a decorator that writes into a Lucene index.
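
As a rough illustration of the handler idea, here is a minimal sketch that uses only the JDK's SAX classes. It is not the leech implementation; the names `ChunkingContentHandler`, `chunkSize` and `sink` are hypothetical. It buffers the text passed to `characters(...)` and hands fixed-size chunks to a callback, so the full document text never has to be held in one String. Writing each chunk into a Lucene field is assumed to happen inside the callback and is left out here.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects SAX character events and passes
 * fixed-size text chunks to a sink instead of building one large String.
 */
class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink; // assumed to do the actual indexing
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit every complete chunk; the remainder waits for the next call.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing has finished.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

Such a handler could be passed to a parser wherever a SAX `ContentHandler` is accepted. Note that cutting at a fixed character count can split a word across two chunks, so a real version would probably cut at whitespace.
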
Check out 'Create a Lucene index' here:
https://github.com/leechcrawler/leech/blob/master/codeSnippets.md

Looking into the code may also serve as a starting point.

best

Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:
> Hi All,
>
> We've been experimenting with indexing the parsed content in Lucene and our
> initial attempt was to index the output from ToTextContentHandler.toString()
> as a Lucene Text field.
>
> This is unlikely to be effective for large files. So I wonder what strategies
> exist for a more effective indexing/tokenization of the possibly large
> content.
>
> Perhaps a custom ContentHandler can index content fragments in a unique
> Lucene field every time its characters(...) method is called, something
> I've been planning to experiment with.
>
> The feedback will be appreciated
> Cheers, Sergey

--
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer
Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de
http://www.dfki.uni-kl.de/~reuschling/