Hello Romain,

> I'm asking myself a few questions, mainly about speed (indexing time) and
> document parsing (a way to index the most commonly used office document
> formats). For document parsing, I'm planning to use several open source
> libraries. The company I'm doing this for will be indexing a few gigabytes
> of data, around 5 GB I think. Any advice about this project? Comments and
> suggestions are welcome.
>

For the parsing you should have a look at Apache Tika. It supports the most
common formats and wraps the open source libraries it uses for each format
behind a simple, uniform API. That should spare you the trouble of
interfacing with each individual library.
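
To give a rough idea, here is a minimal sketch using the Tika facade class
(org.apache.tika.Tika). The class name TikaExtractSketch and the helper
methods detectType and extractText are only illustrative names, and the
sketch assumes the Tika parser modules are on the classpath in addition to
tika-core. Treat it as a starting point, not as a tuned setup for your
5 GB of documents.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Illustrative sketch: class and method names are hypothetical.
class TikaExtractSketch {
    // The facade detects the format and delegates to a matching parser.
    private static final Tika TIKA = new Tika();

    // Returns the detected media type, e.g. "application/pdf".
    static String detectType(Path file) throws IOException {
        return TIKA.detect(file.toFile());
    }

    // Returns the plain text of the document, ready to hand to an indexer.
    // Needs the Tika parser modules; tika-core alone extracts nothing.
    static String extractText(Path file) throws IOException, TikaException {
        return TIKA.parseToString(file.toFile());
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the document path is passed as the first argument.
        Path file = Paths.get(args[0]);
        String text = extractText(file);
        System.out.println(detectType(file) + ": " + text.length() + " characters");
    }
}
```

The string returned by extractText is what you would put into the indexed
field. Note that parseToString truncates its output at a default maximum
length (100,000 characters, if I remember correctly); the facade has a
setMaxStringLength method to change that for large documents.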

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com
