On Thu, Feb 18, 2010 at 8:39 AM, Thomas Müller <[email protected]> wrote:
> The fulltext index is (potentially) slow, specially fulltext
> extraction. Therefore, fulltext index should be done asynchronously if

Would this be in line with the spec?

> it takes too long. Also, in a clustered environment, at least text
> extraction should only be done in one cluster node. I would still use
> Apache Tika and Apache Lucene for this.

PDF extraction in particular can kill the performance of an entire cluster. In our structure a PDF can be part of a document, and the document has to be node-scope indexed every time it is saved again. So we store an extracted text version of the PDF as a binary (so that it goes into the DataStore) and index that extracted version. This way only one node in the cluster does the extraction, and only one user is blocked. The other nodes just index the extracted text version, which is quite fast.

I'm not sure whether this kind of option should be part of JR.

Regards,
Ard

> Regards,
> Thomas
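
For illustration, the extract-once approach described above could be sketched roughly as follows. This is only a minimal sketch, not Jackrabbit code: the class `ExtractOnceSketch`, the in-memory map, and the pluggable `extractor` function are all hypothetical stand-ins. In a real cluster the map would be the shared store (for example the DataStore) and the extractor would be something like a Tika-based parser.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: extract text once per binary, reuse it on every re-index. */
public class ExtractOnceSketch {
    // Assumption: stands in for the extracted text stored alongside the binary.
    private final Map<String, String> extractedByDigest = new ConcurrentHashMap<>();
    // Assumption: stands in for the expensive extractor (e.g. a PDF parser).
    private final Function<byte[], String> extractor;
    // Counts how often the expensive extraction actually ran.
    private final AtomicInteger extractions = new AtomicInteger();

    public ExtractOnceSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Returns the text to index; runs the extractor only for content not seen before. */
    public String textForIndexing(byte[] binary) {
        return extractedByDigest.computeIfAbsent(sha1Hex(binary), key -> {
            extractions.incrementAndGet();
            return extractor.apply(binary);
        });
    }

    public int extractionCount() {
        return extractions.get();
    }

    /** Content digest used as the lookup key, so identical binaries share one entry. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Saving the same document again then only costs a digest and a lookup. The slow extraction runs once per distinct binary, and every later index pass reuses the stored text.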
