Could you try caching the word segmentation results themselves? That would be easier than serializing the TokenStream object.
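
Just as a rough, untested sketch of what I have in mind (the class and method names are made up, and a plain in-memory map stands in for your DB): run the analyzer once per distinct text, record the term, position-increment and offset attributes, and replay them later through a small TokenStream subclass. That replay stream can be handed straight to a field, so the analyzer never runs twice for the same text.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Hypothetical sketch: cache the tokens an Analyzer produces and replay them on demand. */
public class CachedAnalysis {

    /** One recorded token: the term text plus the attributes we want to keep. */
    static final class Tok {
        final String term;
        final int posInc, startOffset, endOffset;
        Tok(String term, int posInc, int startOffset, int endOffset) {
            this.term = term; this.posInc = posInc;
            this.startOffset = startOffset; this.endOffset = endOffset;
        }
    }

    // In-memory cache keyed by the raw text; a real system could persist this in a DB instead.
    private final Map<String, List<Tok>> cache = new ConcurrentHashMap<>();
    private final Analyzer analyzer;

    public CachedAnalysis(Analyzer analyzer) {
        this.analyzer = analyzer;
    }

    /** Returns a TokenStream for the text, running the analyzer only the first time. */
    public TokenStream tokenStream(String field, String text) throws IOException {
        List<Tok> toks = cache.get(text);
        if (toks == null) {
            toks = analyzeOnce(field, text);
            cache.put(text, toks);
        }
        return new ReplayTokenStream(toks);
    }

    /** Consumes the analyzer's TokenStream once and records its tokens. */
    private List<Tok> analyzeOnce(String field, String text) throws IOException {
        List<Tok> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(new Tok(term.toString(), posInc.getPositionIncrement(),
                                offset.startOffset(), offset.endOffset()));
            }
            ts.end();
        }
        return out;
    }

    /** Replays previously recorded tokens as a TokenStream. */
    static final class ReplayTokenStream extends TokenStream {
        private final List<Tok> toks;
        private int i = 0;
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posInc = addAttribute(PositionIncrementAttribute.class);
        private final OffsetAttribute offset = addAttribute(OffsetAttribute.class);

        ReplayTokenStream(List<Tok> toks) { this.toks = toks; }

        @Override
        public boolean incrementToken() {
            if (i >= toks.size()) return false;
            clearAttributes();
            Tok t = toks.get(i++);
            term.setEmpty().append(t.term);
            posInc.setPositionIncrement(t.posInc);
            offset.setOffset(t.startOffset, t.endOffset);
            return true;
        }

        @Override
        public void reset() { i = 0; }
    }
}

You could then build the document with something like new TextField("body", cachedAnalysis.tokenStream("body", text)) instead of letting IndexWriter call the analyzer. Two caveats: only the term, position-increment and offset attributes are preserved above, so anything else your analysis chain sets (token types, payloads, flags) would have to be recorded as well, and Lucene's CachingTokenFilter only caches a stream in memory so it can be consumed again within one pass; it is not meant for persisting results across indexing runs.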

At 2021-11-22 16:40:42, "Omri" <omri.sui...@clearmash.com> wrote:
>We are indexing a lot of similar texts using Lucene analyzers.
>From our performance tests we see that the analysis (converting the text
>into the TokenStream object) is taking more time than we want.
>Before digging into the analysis code, I was thinking about caching the
>analysis result, since we have many repeated texts that we index at
>different times.
>The basic idea is to serialize the TokenStream and store it in a DB. When we
>encounter the same text, we would load it and initialize an analyzer with the
>loaded TokenStream.
>In this context:
>1 - is it "safe" to serialize the TokenStream?
>2 - is there existing code that already serializes a TokenStream?
>3 - how do we initialize an existing analyzer with a TokenStream?
>
>Thanks!
>
>Best,
>Omri
