If you want to give it a try: we created a crawling Tika parser, which adds
recursive, incremental crawling capabilities to Tika. There we also
implemented a handler, as a decorator, that writes into a Lucene index.

Check out 'Create a Lucene index' here:

https://github.com/leechcrawler/leech/blob/master/codeSnippets.md

The code itself may also serve as a starting point.

best

Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:
> Hi All,
> 
> We've been experimenting with indexing the parsed content in Lucene and
> our initial attempt was to index the output from
> ToTextContentHandler.toString() as a Lucene Text field.
> 
> This is unlikely to be effective for large files. So I wonder what
> strategies exist for a more effective indexing/tokenization of the
> possibly large content.
> 
> Perhaps a custom ContentHandler can index content fragments in a unique
> Lucene field every time its characters(...) method is called, something
> I've been planning to experiment with.
> 
> The feedback will be appreciated.
> 
> Cheers, Sergey
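
To make the custom ContentHandler idea from the quoted mail a bit more
concrete, here is a minimal sketch using only the SAX classes from the JDK.
The class name ChunkingContentHandler, the chunkSize parameter and the sink
callback are all hypothetical; the sink is where a call into Lucene would
presumably go, and such a handler would be handed to a Tika parser's parse
method in place of ToTextContentHandler. It is not the leech implementation,
just one possible shape of the approach.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch: collects SAX character events into bounded chunks
 * and hands each chunk to a sink, instead of building one large String.
 */
class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;          // maximum characters per emitted chunk
    private final Consumer<String> sink;  // assumption: would wrap the Lucene indexing call
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit full chunks as soon as enough text has accumulated.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once the document ends.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

Note that this sketch cuts strictly at chunkSize characters, so a chunk
boundary can fall in the middle of a word; a real handler would probably cut
at whitespace before passing the text on to the analyzer.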

-- 
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de  http://www.dfki.uni-kl.de/~reuschling/

------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________