On Nov 26, 2006, at 8:57 AM, jm wrote:

I tested this. I use a single static analyzer for all my documents,
and the caching analyzer was not working properly. I had to add a
method to clear the cache each time a new document was to be indexed,
and then it worked as expected. I have never looked into Lucene's
inner workings, so I am not sure if what I did is correct.

Makes sense. I've now incorporated that as well, adding a clear() method and extracting the functionality into a public class, AnalyzerUtil.TokenCachingAnalyzer.
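
In other words, usage would look something like this (a rough sketch
only; the constructor shown is illustrative, not necessarily the final
signature - only the clear() method is confirmed above):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.memory.AnalyzerUtil;

   // One shared caching analyzer, cleared once per document so stale
   // tokens from the previous document are never replayed.
   void indexAll(IndexWriter writer, Document[] docs) throws java.io.IOException {
     AnalyzerUtil.TokenCachingAnalyzer analyzer =
         new AnalyzerUtil.TokenCachingAnalyzer(new StandardAnalyzer());
     for (int i = 0; i < docs.length; i++) {
       analyzer.clear();                      // reset the cache for the new document
       writer.addDocument(docs[i], analyzer); // analyze once, caching tokens as a side effect
     }
   }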


I also had to comment out some code because I merged the memory stuff
from trunk with Lucene 2.0.

Performance was certainly much better (4 times faster in my very rough
testing), but for my processing that operation is only a very small
part, so I will keep the original way, without caching the tokens, just
to be able to use the unmodified Lucene 2.0. I found a data problem in
my tests, but as I was not going to pursue that improvement for now, I
did not look into it.

Ok.
Wolfgang.


thanks,
javier

On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
Out of interest, I've checked an implementation of something like this
into AnalyzerUtil on SVN trunk:

   /**
    * Returns an analyzer wrapper that caches all tokens generated by the
    * underlying child analyzer's token stream, and delivers those cached
    * tokens on subsequent calls to
    * <code>tokenStream(String fieldName, Reader reader)</code>.
    * <p>
    * This can help improve performance in the presence of expensive
    * Analyzer / TokenFilter chains.
    * <p>
    * Caveats:
    * 1) Caching only works if the equals() and hashCode() methods are
    * properly implemented on the Reader passed to
    * <code>tokenStream(String fieldName, Reader reader)</code>.
    * 2) Caching the tokens of large Lucene documents can lead to out of
    * memory exceptions.
    * 3) The Token instances delivered by the underlying child analyzer
    * must be immutable.
    *
    * @param child
    *            the underlying child analyzer
    * @return a new analyzer
    */
   public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
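
For illustration, the idea boils down to something like this (a
simplified sketch, not the actual SVN code; here the cache is only
committed once a token stream has been fully consumed):

   import java.io.IOException;
   import java.io.Reader;
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.Map;
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenFilter;
   import org.apache.lucene.analysis.TokenStream;

   public static Analyzer getTokenCachingAnalyzerSketch(final Analyzer child) {
     return new Analyzer() {
       // Reader -> ArrayList of Tokens; relies on the Reader's
       // equals()/hashCode(), per caveat 1) above.
       private final Map cache = new HashMap();

       public TokenStream tokenStream(String fieldName, final Reader reader) {
         final ArrayList tokens = (ArrayList) cache.get(reader);
         if (tokens != null) {
           // cache hit: replay the previously recorded tokens
           return new TokenStream() {
             private final Iterator iter = tokens.iterator();
             public Token next() {
               return iter.hasNext() ? (Token) iter.next() : null;
             }
           };
         }
         // cache miss: delegate to the child, remembering each token
         final ArrayList buf = new ArrayList();
         return new TokenFilter(child.tokenStream(fieldName, reader)) {
           public Token next() throws IOException {
             Token t = input.next();
             if (t != null) buf.add(t);   // tokens must be immutable (caveat 3)
             else cache.put(reader, buf); // stream exhausted: commit to cache
             return t;
           }
         };
       }
     };
   }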


Check it out, and let me know if this is close to what you had in mind.

Wolfgang.

On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:

> I've never tried it, but I guess you could write an Analyzer and
> TokenFilter that not only feeds into IndexWriter on
> IndexWriter.addDocument(), but as a sneaky side effect also
> simultaneously saves its tokens into a list, so that you could later
> turn that list into another TokenStream to be added to MemoryIndex.
> How much this might help depends on how expensive your analyzer
> chain is. For some examples of how to set up analyzers for chains
> of token streams, see MemoryIndex.keywordTokenStream and class
> AnalyzerUtil in the same package.
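>
> Roughly, such a capturing filter could look like this (an untested
> sketch against the Lucene 2.0 API; the class and the replay() method
> are made up for illustration):
>
>   import java.io.IOException;
>   import java.util.ArrayList;
>   import java.util.Iterator;
>   import java.util.List;
>   import org.apache.lucene.analysis.Token;
>   import org.apache.lucene.analysis.TokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>
>   // Passes tokens through to the consumer (e.g. IndexWriter) while
>   // recording them, so they can later be replayed into a MemoryIndex
>   // without re-analyzing the text.
>   public class CapturingTokenFilter extends TokenFilter {
>     private final List tokens = new ArrayList();
>
>     public CapturingTokenFilter(TokenStream input) { super(input); }
>
>     public Token next() throws IOException {
>       Token t = input.next();
>       if (t != null) tokens.add(t); // the sneaky side effect
>       return t;
>     }
>
>     // Replay the captured tokens, e.g. for
>     // MemoryIndex.addField(String fieldName, TokenStream stream).
>     public TokenStream replay() {
>       final Iterator iter = tokens.iterator();
>       return new TokenStream() {
>         public Token next() {
>           return iter.hasNext() ? (Token) iter.next() : null;
>         }
>       };
>     }
>   }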
>
> Wolfgang.
>
> On Nov 22, 2006, at 4:15 AM, jm wrote:
>
>> Checking one last thing, just in case...
>>
>> As I mentioned, I have previously indexed the same document in another
>> index (for another purpose). Since I am going to use the same analyzer,
>> would it be possible to avoid analyzing the doc again?
>>
>> I see IndexWriter.addDocument() returns void, so there does not seem
>> to be an easy way to do that, no?
>>
>> thanks
>>
>> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>>
>>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>>
>>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>>> > enough I will explore the other options then.
>>>
>>> To get started you can use something like this:
>>>
>>> for each document D:
>>>      MemoryIndex index = createMemoryIndex(D, ...)
>>>      for each query Q:
>>>          float score = index.search(Q)
>>>          if (score > 0.0) System.out.println("it's a match")
>>>
>>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
>>>      MemoryIndex index = new MemoryIndex();
>>>      Enumeration iter = doc.fields();
>>>      while (iter.hasMoreElements()) {
>>>        Field field = (Field) iter.nextElement();
>>>        index.addField(field.name(), field.stringValue(), analyzer);
>>>      }
>>>      return index;
>>>    }
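>>>
>>> In plain Java the driver loop might look like this (an illustrative
>>> sketch; docs and queries stand for your own data):
>>>
>>>    import org.apache.lucene.analysis.Analyzer;
>>>    import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>    import org.apache.lucene.document.Document;
>>>    import org.apache.lucene.index.memory.MemoryIndex;
>>>    import org.apache.lucene.search.Query;
>>>
>>>    void matchAll(Document[] docs, Query[] queries) {
>>>      Analyzer analyzer = new StandardAnalyzer();
>>>      for (int i = 0; i < docs.length; i++) {
>>>        // one short-lived MemoryIndex per document
>>>        MemoryIndex index = createMemoryIndex(docs[i], analyzer);
>>>        for (int j = 0; j < queries.length; j++) {
>>>          float score = index.search(queries[j]);
>>>          if (score > 0.0f) System.out.println("it's a match");
>>>        }
>>>      }
>>>    }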
>>>
>>>
>>>
>>> >
>>> >
>>> > On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I have to decide between using a RAMDirectory and MemoryIndex,
>>> >> > but I am not sure which approach will work better...
>>> >> >
>>> >> > I have to run many items (tens of thousands) against some
>>> >> > queries (100 at most), but I have to do it one item at a time.
>>> >> > And I already have the Lucene Document associated with each
>>> >> > item, from a previous operation I perform.
>>> >> >
>>> >> > From what I read MemoryIndex should be faster, but apparently
>>> >> > I cannot reuse the document I already have, and I have to
>>> >> > create a new MemoryIndex per item.
>>> >>
>>> >> A MemoryIndex object holds one document.
>>> >>
>>> >> > Using the RAMDirectory I can use only one of them, also one
>>> >> > IndexWriter, and create an IndexSearcher and IndexReader per
>>> >> > item, for searching and removing the item each time.
>>> >> >
>>> >> > Any thoughts?
>>> >>
>>> >> The MemoryIndex impl is optimized to work efficiently without
>>> >> reusing the MemoryIndex object for a subsequent document. See the
>>> >> source code. Reusing the object would not further improve
>>> >> performance.
>>> >>
>>> >> Wolfgang.
>>> >>