Out of interest, I've checked an implementation of something like this into AnalyzerUtil SVN trunk:

  /**
   * Returns an analyzer wrapper that caches all tokens generated by the
   * underlying child analyzer's token stream, and delivers those cached
   * tokens on subsequent calls to
   * <code>tokenStream(String fieldName, Reader reader)</code>.
   * <p>
   * This can help improve performance in the presence of expensive
   * Analyzer / TokenFilter chains.
   * <p>
   * Caveats:
   * 1) Caching only works if the equals() and hashCode() methods are
   *    properly implemented on the Reader passed to
   *    <code>tokenStream(String fieldName, Reader reader)</code>.
   * 2) Caching the tokens of large Lucene documents can lead to out of
   *    memory exceptions.
   * 3) The Token instances delivered by the underlying child analyzer
   *    must be immutable.
   *
   * @param child
   *            the underlying child analyzer
   * @return a new analyzer
   */
  public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
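To make the caching idea (and caveat 1 about equals()/hashCode()) concrete, here is a minimal self-contained sketch. The class and method names are hypothetical stand-ins for the real Analyzer/TokenStream machinery; the point is only that the cache is keyed by the input, so the key type's equals() and hashCode() determine whether a cache hit ever happens:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a token-caching wrapper. The "child" function
// stands in for an expensive Analyzer/TokenFilter chain; the cache is
// keyed by the input, so correct equals()/hashCode() on the key is what
// makes caching work (caveat 1 in the javadoc above).
public class CachingTokenizer {
    private final Function<String, List<String>> child;   // expensive child tokenizer
    private final Map<String, List<String>> cache = new HashMap<>();
    private int childCalls = 0;                           // for demonstration only

    public CachingTokenizer(Function<String, List<String>> child) {
        this.child = child;
    }

    public List<String> tokenize(String input) {
        // computeIfAbsent invokes the child only on a cache miss
        return cache.computeIfAbsent(input, in -> {
            childCalls++;
            return child.apply(in);
        });
    }

    public int getChildCalls() { return childCalls; }

    public static void main(String[] args) {
        CachingTokenizer t = new CachingTokenizer(
            s -> java.util.Arrays.asList(s.toLowerCase().split("\\s+")));
        System.out.println(t.tokenize("Hello World"));   // prints [hello, world]
        System.out.println(t.tokenize("Hello World"));   // cache hit, same list
        System.out.println(t.getChildCalls());           // prints 1
    }
}
```

Note that the returned token list is shared between callers, which mirrors caveat 3: the cached tokens must be treated as immutable.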


Check it out, and let me know if this is close to what you had in mind.

Wolfgang.

On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:

I've never tried it, but I guess you could write an Analyzer and TokenFilter that not only feeds into IndexWriter on IndexWriter.addDocument(), but as a sneaky side effect also simultaneously saves its tokens into a list, so that you could later turn that list into another TokenStream to be added to MemoryIndex. How much this might help depends on how expensive your analyzer chain is. For some examples of how to set up analyzers for chains of token streams, see MemoryIndex.keywordTokenStream and class AnalyzerUtil in the same package.
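The "sneaky side effect" pattern can be sketched in a few lines. This is a self-contained illustration, not the real Lucene TokenStream API: a pass-through filter records every token it forwards, and the recorded list can then be replayed to a second consumer without re-running the analyzer chain.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a pass-through token filter that records every
// token it forwards into a list, so the tokens can later be replayed
// (e.g. into a MemoryIndex) without re-analyzing the text.
public class TokenCapture {

    // Forwards tokens from a source iterator while recording them.
    static class CapturingFilter implements Iterator<String> {
        private final Iterator<String> source;
        private final List<String> captured;

        CapturingFilter(Iterator<String> source, List<String> captured) {
            this.source = source;
            this.captured = captured;
        }
        public boolean hasNext() { return source.hasNext(); }
        public String next() {
            String token = source.next();
            captured.add(token);   // side effect: remember the token
            return token;          // pass it through unchanged
        }
    }

    public static void main(String[] args) {
        List<String> captured = new ArrayList<>();
        Iterator<String> original =
            java.util.Arrays.asList("quick", "brown", "fox").iterator();

        // First consumer (standing in for IndexWriter.addDocument) drains
        // the filtered stream; tokens are captured as a side effect.
        CapturingFilter filter = new CapturingFilter(original, captured);
        while (filter.hasNext()) filter.next();

        // Second consumer (standing in for MemoryIndex.addField) replays
        // the captured tokens without re-running the analyzer chain.
        System.out.println(captured);   // prints [quick, brown, fox]
    }
}
```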

Wolfgang.

On Nov 22, 2006, at 4:15 AM, jm wrote:

checking one last thing, just in case...

As I mentioned, I have previously indexed the same document in another
index (for another purpose). Since I am going to use the same analyzer,
would it be possible to avoid analyzing the doc again?

I see IndexWriter.addDocument() returns void, so there does not seem to
be an easy way to do that, no?

thanks

On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

On Nov 21, 2006, at 12:38 PM, jm wrote:

> Ok, thanks, I'll give MemoryIndex a go, and if that is not good enough
> I will explore the other options then.

To get started you can use something like this:

for each document D:
    MemoryIndex index = createMemoryIndex(D, ...)
    for each query Q:
        float score = index.search(Q)
        if (score > 0.0) System.out.println("it's a match");




   private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
     MemoryIndex index = new MemoryIndex();
     Enumeration iter = doc.fields();
     while (iter.hasMoreElements()) {
       Field field = (Field) iter.nextElement();
       // Re-analyze each field's string value into the MemoryIndex;
       // fields without a string value (e.g. Reader-valued or binary
       // fields) would need special handling.
       index.addField(field.name(), field.stringValue(), analyzer);
     }
     return index;
   }



>
>
> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>
>> > Hi,
>> >
>> > I have to decide between using a RAMDirectory and MemoryIndex, but
>> > not sure what approach will work better...
>> >
>> > I have to run many items (tens of thousands) against some
>> queries (100
>> > at most), but I have to do it one item at a time. And I already
>> have
>> > the lucene Document associated with each item, from a previous
>> > operation I perform.
>> >
>> > From what I read MemoryIndex should be faster, but apparently I
>> cannot
>> > reuse the document I already have, and I have to create a new
>> > MemoryIndex per item.
>>
>> A MemoryIndex object holds one document.
>>
>> > Using the RAMDirectory I can use only one of
>> > them, also one IndexWriter, and create a IndexSearcher and
>> IndexReader
>> > per item, for searching and removing the item each time.
>> >
>> > Any thoughts?
>>
>> The MemoryIndex impl is optimized to work efficiently without reusing
>> the MemoryIndex object for a subsequent document. See the source
>> code. Reusing the object would not further improve performance.
>>
>> Wolfgang.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>








