Out of interest, I've checked an implementation of something like
this into AnalyzerUtil SVN trunk:
/**
 * Returns an analyzer wrapper that caches all tokens generated by the
 * underlying child analyzer's token stream, and delivers those cached
 * tokens on subsequent calls to
 * <code>tokenStream(String fieldName, Reader reader)</code>.
 * <p>
 * This can help improve performance in the presence of expensive
 * Analyzer / TokenFilter chains.
 * <p>
 * Caveats:
 * 1) Caching only works if equals() and hashCode() are properly
 *    implemented on the Reader passed to
 *    <code>tokenStream(String fieldName, Reader reader)</code>.
 * 2) Caching the tokens of large Lucene documents can lead to out of
 *    memory exceptions.
 * 3) The Token instances delivered by the underlying child analyzer
 *    must be immutable.
 *
 * @param child
 *          the underlying child analyzer
 * @return a new analyzer
 */
public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
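Caveat 1 can be demonstrated without Lucene at all. The sketch below is hypothetical standalone code, not part of AnalyzerUtil: java.io.StringReader inherits identity-based equals()/hashCode() from Object, so a cache keyed by the Reader never hits for a second reader over the same text.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of caveat 1 (not the actual AnalyzerUtil code): a token
// cache keyed by the Reader only works if the Reader implements value-based
// equals()/hashCode(). StringReader does not, so two readers over the same
// text are distinct keys and the cache always misses.
public class ReaderCacheDemo {
    private static final Map<Reader, List<String>> cache =
        new HashMap<Reader, List<String>>();

    // Simulates two tokenStream() calls with equal but distinct Readers.
    public static boolean secondCallHitsCache() {
        cache.clear();
        cache.put(new StringReader("hello world"),
                  Arrays.asList("hello", "world"));
        // A fresh reader over identical text: identity hashCode() differs,
        // so the lookup misses.
        return cache.containsKey(new StringReader("hello world"));
    }

    public static void main(String[] args) {
        System.out.println(secondCallHitsCache()); // prints false
    }
}
```

A Reader subclass that overrides equals()/hashCode() based on its content (or simply keying the cache by the field's String value) would make the lookup hit.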
Check it out, and let me know if this is close to what you had in mind.
Wolfgang.
On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
I've never tried it, but I guess you could write an Analyzer and
TokenFilter that not only feeds into IndexWriter on
IndexWriter.addDocument(), but as a sneaky side effect also
simultaneously saves its tokens into a list, so that you could later
turn that list into another TokenStream to be added to a MemoryIndex.
How much this might help depends on how expensive your analyzer
chain is. For some examples of how to set up analyzers for chains
of token streams, see MemoryIndex.keywordTokenStream and class
AnalyzerUtil in the same package.
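The side effect described above can be sketched generically, without Lucene (all names below are hypothetical, not actual Lucene API): a pass-through wrapper that hands each element to its consumer while also recording it in a list, so the list can later be replayed as a second stream. A Lucene TokenFilter doing the same would append each Token to a list in next() before returning it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the "sneaky side effect" idea: a pass-through
// iterator that records every element it delivers, so the recorded list
// can later be replayed without re-running the (expensive) child stream.
public class RecordingIterator<T> implements Iterator<T> {
    private final Iterator<T> child;
    private final List<T> recorded = new ArrayList<T>();

    public RecordingIterator(Iterator<T> child) {
        this.child = child;
    }

    public boolean hasNext() {
        return child.hasNext();
    }

    public T next() {
        T element = child.next();
        recorded.add(element); // side effect: keep a copy for later replay
        return element;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Replays the captured elements; in Lucene terms, this is where you
    // would wrap the list in a new TokenStream for MemoryIndex.
    public Iterator<T> replay() {
        return recorded.iterator();
    }
}
```

Note that this only works safely if the recorded elements are immutable, which is the same constraint as caveat 3 of the caching analyzer above.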
Wolfgang.
On Nov 22, 2006, at 4:15 AM, jm wrote:
checking one last thing, just in case...
as I mentioned, I have previously indexed the same document in another
index (for another purpose). Since I am going to use the same analyzer,
would it be possible to avoid analyzing the doc again?
I see IndexWriter.addDocument() returns void, so there does not seem to
be an easy way to do that, no?
thanks
On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
On Nov 21, 2006, at 12:38 PM, jm wrote:
> Ok, thanks, I'll give MemoryIndex a go, and if that is not good
> enough I will explore the other options then.
To get started you can use something like this:

for each document D:
    MemoryIndex index = createMemoryIndex(D, ...)
    for each query Q:
        float score = index.search(Q)
        if (score > 0.0) System.out.println("it's a match");

private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
    MemoryIndex index = new MemoryIndex();
    Enumeration iter = doc.fields();
    while (iter.hasMoreElements()) {
        Field field = (Field) iter.nextElement();
        index.addField(field.name(), field.stringValue(), analyzer);
    }
    return index;
}
>
>
> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>
>> > Hi,
>> >
>> > I have to decide between using a RAMDirectory and MemoryIndex,
>> > but not sure what approach will work better...
>> >
>> > I have to run many items (tens of thousands) against some
>> > queries (100 at most), but I have to do it one item at a time.
>> > And I already have the Lucene Document associated with each item,
>> > from a previous operation I perform.
>> >
>> > From what I read MemoryIndex should be faster, but apparently I
>> > cannot reuse the document I already have, and I have to create a
>> > new MemoryIndex per item.
>>
>> A MemoryIndex object holds one document.
>>
>> > Using the RAMDirectory I can use only one of them, also one
>> > IndexWriter, and create an IndexSearcher and IndexReader per
>> > item, for searching and removing the item each time.
>> >
>> > Any thoughts?
>>
>> The MemoryIndex impl is optimized to work efficiently without
>> reusing the MemoryIndex object for a subsequent document. See the
>> source code. Reusing the object would not further improve
>> performance.
>>
>> Wolfgang.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>
>