Re: [jr3] Search index in content

Ard Schrijvers Thu, 18 Feb 2010 01:24:23 -0800

On Thu, Feb 18, 2010 at 8:39 AM, Thomas Müller <[email protected]> wrote:


>
> The fulltext index is (potentially) slow, especially fulltext
> extraction. Therefore, fulltext index should be done asynchronously if

Would this be in line with the spec?

> it takes too long. Also, in a clustered environment, at least text
> extraction should only be done in one cluster node. I would still use
> Apache Tika and Apache Lucene for this.

PDF extraction in particular can kill the performance of an entire
cluster. In our structure a PDF can be part of a document, and the
document needs to be node-scope indexed every time it is saved again.
So we store an extracted text version of the PDF as a binary (so that
it goes into the DataStore) and index that extracted version. Only one
node in the cluster now does the extraction, and only one user is
blocked. The other nodes just index the extracted text version, which
is quite fast. Not sure whether we should have this kind of option as
part of JR.
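
To illustrate the idea, here is a minimal plain-Java sketch, not our
actual code and not a Jackrabbit API. ExtractedTextStore, textFor and
the extractor function are hypothetical names. The in-memory map stands
in for shared storage such as the DataStore, and keying by a SHA-1
content hash is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of Jackrabbit: keeps one extracted text
// version per binary, keyed by a hash of the binary's content.
class ExtractedTextStore {
    // Stand-in for shared storage (assumption: something like the DataStore).
    private final Map<String, String> textByHash = new ConcurrentHashMap<>();
    // Stand-in for the expensive extraction step (for example a PDF parser).
    private final Function<byte[], String> extractor;
    // Counts how often the expensive extraction actually ran.
    final AtomicInteger extractions = new AtomicInteger();

    ExtractedTextStore(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    // Returns the stored text for this binary and runs the extractor
    // only when no extracted version exists yet.
    String textFor(byte[] binary) {
        return textByHash.computeIfAbsent(sha1Hex(binary), key -> {
            extractions.incrementAndGet();
            return extractor.apply(binary);
        });
    }

    // Hex-encoded SHA-1 digest of the given bytes.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The "extractor" here just upper-cases the bytes as a placeholder.
        ExtractedTextStore store = new ExtractedTextStore(
                bytes -> new String(bytes, StandardCharsets.UTF_8).toUpperCase());
        byte[] pdf = "pretend pdf bytes".getBytes(StandardCharsets.UTF_8);
        store.textFor(pdf); // first save: extraction runs
        store.textFor(pdf); // later saves: stored text is reused
        System.out.println("extractions: " + store.extractions.get());
    }
}
```

In a real cluster the map would have to be storage that all nodes can
see, so that whichever node saves the document first does the
extraction and the others only read and index the stored text.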

regards Ard

>
> Regards,
> Thomas
>
