Out of interest, I've checked an implementation of something like
this into AnalyzerUtil SVN trunk:
/**
 * Returns an analyzer wrapper that caches all tokens generated by the
 * underlying child analyzer's token stream, and delivers those cached
 * tokens on subsequent calls to
 * <code>tokenStream(String fieldName, Reader reader)</code>.
 * <p>
 * This can help improve performance in the presence of expensive
 * Analyzer / TokenFilter chains.
 * <p>
 * Caveats:
 * 1) Caching only works if equals() and hashCode() are properly
 *    implemented on the Reader passed to
 *    <code>tokenStream(String fieldName, Reader reader)</code>.
 * 2) Caching the tokens of large Lucene documents can lead to out of
 *    memory exceptions.
 * 3) The Token instances delivered by the underlying child analyzer
 *    must be immutable.
 *
 * @param child
 *          the underlying child analyzer
 * @return a new analyzer
 */
public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }
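Caveat 1 can be demonstrated without Lucene at all. The sketch below is hypothetical standalone code, not part of AnalyzerUtil: java.io.StringReader inherits identity-based equals()/hashCode() from Object, so a cache keyed by the Reader never hits for a second reader over the same text.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of caveat 1 (not the actual AnalyzerUtil code): a token
// cache keyed by the Reader only works if the Reader implements value-based
// equals()/hashCode(). StringReader does not, so two readers over the same
// text are distinct keys and the cache always misses.
public class ReaderCacheDemo {
    private static final Map<Reader, List<String>> cache =
        new HashMap<Reader, List<String>>();

    // Simulates two tokenStream() calls with equal but distinct Readers.
    public static boolean secondCallHitsCache() {
        cache.clear();
        cache.put(new StringReader("hello world"),
                  Arrays.asList("hello", "world"));
        // A fresh reader over identical text: identity hashCode() differs,
        // so the lookup misses.
        return cache.containsKey(new StringReader("hello world"));
    }

    public static void main(String[] args) {
        System.out.println(secondCallHitsCache()); // prints false
    }
}
```

A Reader subclass that overrides equals()/hashCode() based on its content (or simply keying the cache by the field's String value) would make the lookup hit.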
Check it out, and let me know if this is close to what you had in mind.
Wolfgang.
On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
I've never tried it, but I guess you could write an Analyzer and
TokenFilter that not only feeds into IndexWriter on
IndexWriter.addDocument(), but as a sneaky side effect also
simultaneously saves its tokens into a list, so that you could later
turn that list into another TokenStream to be added to a MemoryIndex.
How much this might help depends on how expensive your analyzer
chain is. For some examples of how to set up analyzers for chains
of token streams, see MemoryIndex.keywordTokenStream and class
AnalyzerUtil in the same package.
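The side effect described above can be sketched generically, without Lucene (all names below are hypothetical, not actual Lucene API): a pass-through wrapper that hands each element to its consumer while also recording it in a list, so the list can later be replayed as a second stream. A Lucene TokenFilter doing the same would append each Token to a list in next() before returning it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the "sneaky side effect" idea: a pass-through
// iterator that records every element it delivers, so the recorded list
// can later be replayed without re-running the (expensive) child stream.
public class RecordingIterator<T> implements Iterator<T> {
    private final Iterator<T> child;
    private final List<T> recorded = new ArrayList<T>();

    public RecordingIterator(Iterator<T> child) {
        this.child = child;
    }

    public boolean hasNext() {
        return child.hasNext();
    }

    public T next() {
        T element = child.next();
        recorded.add(element); // side effect: keep a copy for later replay
        return element;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Replays the captured elements; in Lucene terms, this is where you
    // would wrap the list in a new TokenStream for MemoryIndex.
    public Iterator<T> replay() {
        return recorded.iterator();
    }
}
```

Note that this only works safely if the recorded elements are immutable, which is the same constraint as caveat 3 of the caching analyzer above.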
Wolfgang.
On Nov 22, 2006, at 4:15 AM, jm wrote:
checking one last thing, just in case...
as I mentioned, I have previously indexed the same document in another
index (for another purpose). Since I am going to use the same analyzer,
would it be possible to avoid analyzing the doc again?
I see IndexWriter.addDocument() returns void, so there does not seem to
be an easy way to do that, no?
thanks
On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
On Nov 21, 2006, at 12:38 PM, jm wrote:
> Ok, thanks, I'll give MemoryIndex a go, and if that is not good
> enough I will explore the other options then.
To get started you can use something like this:

for each document D:
    MemoryIndex index = createMemoryIndex(D, ...)
    for each query Q:
        float score = index.search(Q)
        if (score > 0.0) System.out.println("it's a match");

private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
    MemoryIndex index = new MemoryIndex();
    Enumeration iter = doc.fields();
    while (iter.hasMoreElements()) {
        Field field = (Field) iter.nextElement();
        index.addField(field.name(), field.stringValue(), analyzer);
    }
    return index;
}
>
>
> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>
>> > Hi,
>> >
>> > I have to decide between using a RAMDirectory and MemoryIndex,
>> > but not sure what approach will work better...
>> >
>> > I have to run many items (tens of thousands) against some
>> > queries (100 at most), but I have to do it one item at a time.
>> > And I already have the Lucene Document associated with each item,
>> > from a previous operation I perform.
>> >
>> > From what I read MemoryIndex should be faster, but apparently I
>> > cannot reuse the document I already have, and I have to create a
>> > new MemoryIndex per item.
>>
>> A MemoryIndex object holds one document.
>>
>> > Using the RAMDirectory I can use only one of them, also one
>> > IndexWriter, and create an IndexSearcher and IndexReader per
>> > item, for searching and removing the item each time.
>> >
>> > Any thoughts?
>>
>> The MemoryIndex impl is optimized to work efficiently without
>> reusing the MemoryIndex object for a subsequent document. See the
>> source code. Reusing the object would not further improve
>> performance.
>>
>> Wolfgang.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>
>