I tested this. I use a single static analyzer for all my documents,
and the caching analyzer was not working properly. I had to add a
method to clear the cache each time a new document was to be indexed,
and then it worked as expected. I have never looked into Lucene's
inner workings, so I am not sure if what I did is correct.
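
For the record, the clearing method I added looks roughly like this
(a sketch of the idea only; the internal map name "cache" is my own,
so this may not be the right fix):

// Added to the caching analyzer: forget all cached token lists so the
// next document is analyzed fresh instead of replaying stale tokens.
public void clearCache() {
    cache.clear(); // "cache" = the analyzer's internal token map (assumed name)
}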
I also had to comment out some code because I merged the memory stuff
from trunk with Lucene 2.0.
Performance was certainly much better (4 times faster in my very rough
testing), but that operation is only a very small part of my
processing, so I will keep the original way, without caching the
tokens, just to be able to use the unmodified Lucene 2.0. I found a
data problem in my tests, but as I am not going to pursue that
improvement for now, I did not look into it.
thanks,
javier
On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
Out of interest, I've checked an implementation of something like
this into AnalyzerUtil on SVN trunk:
/**
 * Returns an analyzer wrapper that caches all tokens generated by the
 * underlying child analyzer's token stream, and delivers those cached
 * tokens on subsequent calls to
 * <code>tokenStream(String fieldName, Reader reader)</code>.
 * <p>
 * This can help improve performance in the presence of expensive
 * Analyzer / TokenFilter chains.
 * <p>
 * Caveats:
 * 1) Caching only works if the equals() and hashCode() methods are
 *    properly implemented on the Reader passed to
 *    <code>tokenStream(String fieldName, Reader reader)</code>.
 * 2) Caching the tokens of large Lucene documents can lead to out of
 *    memory exceptions.
 * 3) The Token instances delivered by the underlying child analyzer
 *    must be immutable.
 *
 * @param child
 *            the underlying child analyzer
 * @return a new analyzer
 */
public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
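
Usage might look something like this (just a sketch; StandardAnalyzer
is only an example child, and the same Reader instance is passed twice
so the cache lookup can hit):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.AnalyzerUtil;

// Wrap an expensive analyzer so repeated analysis of the same input is avoided.
Analyzer cached = AnalyzerUtil.getTokenCachingAnalyzer(new StandardAnalyzer());
Reader reader = new StringReader("the quick brown fox");
TokenStream first  = cached.tokenStream("content", reader); // analyzes and caches
TokenStream second = cached.tokenStream("content", reader); // replays cached tokens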
Check it out, and let me know if this is close to what you had in mind.
Wolfgang.
On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
> I've never tried it, but I guess you could write an Analyzer and
> TokenFilter that not only feeds into IndexWriter on
> IndexWriter.addDocument(), but as a sneaky side effect also
> simultaneously saves its tokens into a list so that you could later
> turn that list into another TokenStream to be added to MemoryIndex.
> How much this might help depends on how expensive your analyzer
> chain is. For some examples on how to set up analyzers for chains
> of token streams, see MemoryIndex.keywordTokenStream and class
> AnalyzerUtil in the same package.
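>
> Roughly, the recording filter could look like this (an untested
> sketch; the class name TokenRecordingFilter and its list argument are
> made up for illustration, not existing Lucene API):
>
> import java.io.IOException;
> import java.util.List;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> // Passes tokens through unchanged while recording each one in a list,
> // so the list can later be replayed as a TokenStream for MemoryIndex.
> public class TokenRecordingFilter extends TokenFilter {
>     private final List recorded; // filled as a side effect of indexing
>
>     public TokenRecordingFilter(TokenStream input, List recorded) {
>         super(input);
>         this.recorded = recorded;
>     }
>
>     public Token next() throws IOException {
>         Token token = input.next();
>         if (token != null) recorded.add(token); // remember for later replay
>         return token;
>     }
> }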
>
> Wolfgang.
>
> On Nov 22, 2006, at 4:15 AM, jm wrote:
>
>> checking one last thing, just in case...
>>
>> as I mentioned, I have previously indexed the same document in
>> another index (for another purpose); since I am going to use the
>> same analyzer, would it be possible to avoid analyzing the doc
>> again?
>>
>> I see IndexWriter.addDocument() returns void, so there does not seem
>> to be an easy way to do that, no?
>>
>> thanks
>>
>> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>>
>>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>>
>>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>>> > enough I will explore the other options then.
>>>
>>> To get started you can use something like this:
>>>
>>> for each document D:
>>>     MemoryIndex index = createMemoryIndex(D, ...)
>>>     for each query Q:
>>>         float score = index.search(Q)
>>>         if (score > 0.0) System.out.println("it's a match");
>>>
>>> private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
>>>     MemoryIndex index = new MemoryIndex();
>>>     Enumeration iter = doc.fields();
>>>     while (iter.hasMoreElements()) {
>>>         Field field = (Field) iter.nextElement();
>>>         index.addField(field.name(), field.stringValue(), analyzer);
>>>     }
>>>     return index;
>>> }
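>>>
>>> In plain Java the outer loop might then look like this (a sketch;
>>> the docs array, queries array and analyzer are placeholders you
>>> would supply):
>>>
>>> // Match each document against all queries, one document at a time.
>>> for (int i = 0; i < docs.length; i++) {
>>>     MemoryIndex index = createMemoryIndex(docs[i], analyzer);
>>>     for (int j = 0; j < queries.length; j++) {
>>>         float score = index.search(queries[j]); // 0.0 means no match
>>>         if (score > 0.0f) System.out.println("doc " + i + " matches query " + j);
>>>     }
>>> }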
>>>
>>> >
>>> >
>>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I have to decide between using a RAMDirectory and a
>>> >> > MemoryIndex, but I am not sure which approach will work
>>> >> > better...
>>> >> >
>>> >> > I have to run many items (tens of thousands) against some
>>> >> > queries (100 at most), but I have to do it one item at a time.
>>> >> > And I already have the Lucene Document associated with each
>>> >> > item, from a previous operation I perform.
>>> >> >
>>> >> > From what I read, MemoryIndex should be faster, but apparently
>>> >> > I cannot reuse the document I already have, and I have to
>>> >> > create a new MemoryIndex per item.
>>> >>
>>> >> A MemoryIndex object holds one document.
>>> >>
>>> >> > Using RAMDirectory I could use only one of them, plus one
>>> >> > IndexWriter, and create an IndexSearcher and IndexReader per
>>> >> > item, for searching and then removing the item each time.
>>> >> >
>>> >> > Any thoughts?
>>> >>
>>> >> The MemoryIndex impl is optimized to work efficiently without
>>> >> reusing the MemoryIndex object for a subsequent document. See the
>>> >> source code. Reusing the object would not further improve
>>> >> performance.
>>> >>
>>> >> Wolfgang.