Thanks. I'll try to get this working and see whether there is a perf difference over the weekend.
On 11/23/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
Out of interest, I've checked an implementation of something like this
into AnalyzerUtil in SVN trunk:

/**
 * Returns an analyzer wrapper that caches all tokens generated by the
 * underlying child analyzer's token stream, and delivers those cached
 * tokens on subsequent calls to
 * <code>tokenStream(String fieldName, Reader reader)</code>.
 * <p>
 * This can help improve performance in the presence of expensive
 * Analyzer / TokenFilter chains.
 * <p>
 * Caveats:
 * 1) Caching only works if the equals() and hashCode() methods are
 *    properly implemented on the Reader passed to
 *    <code>tokenStream(String fieldName, Reader reader)</code>.
 * 2) Caching the tokens of large Lucene documents can lead to out of
 *    memory exceptions.
 * 3) The Token instances delivered by the underlying child analyzer
 *    must be immutable.
 *
 * @param child
 *            the underlying child analyzer
 * @return a new analyzer
 */
public static Analyzer getTokenCachingAnalyzer(final Analyzer child) { ... }

Check it out, and let me know if this is close to what you had in mind
(a rough sketch of the idea appears at the end of this thread).

Wolfgang.

On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:

> I've never tried it, but I guess you could write an Analyzer and
> TokenFilter that not only feeds into IndexWriter on
> IndexWriter.addDocument(), but as a sneaky side effect also
> simultaneously saves its tokens into a list, so that you could later
> turn that list into another TokenStream to be added to MemoryIndex
> (see the second sketch at the end of this thread). How much this
> might help depends on how expensive your analyzer chain is. For some
> examples of how to set up analyzers for chains of token streams, see
> MemoryIndex.keywordTokenStream and class AnalyzerUtil in the same
> package.
>
> Wolfgang.
>
> On Nov 22, 2006, at 4:15 AM, jm wrote:
>
>> checking one last thing, just in case...
>>
>> As I mentioned, I have previously indexed the same document in
>> another index (for another purpose). Since I am going to use the
>> same analyzer, would it be possible to avoid analyzing the doc
>> again?
>>
>> I see IndexWriter.addDocument() returns void, so there does not
>> seem to be an easy way to do that, no?
>>
>> thanks
>>
>> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>>
>>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>>
>>>> Ok, thanks, I'll give MemoryIndex a go, and if that is not good
>>>> enough I will explore the other options then.
>>>
>>> To get started you can use something like this (a complete runnable
>>> version appears at the end of this thread):
>>>
>>> for each document D:
>>>     MemoryIndex index = createMemoryIndex(D, ...)
>>>     for each query Q:
>>>         float score = index.search(Q)
>>>         if (score > 0.0) System.out.println("it's a match");
>>>
>>> private MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
>>>     MemoryIndex index = new MemoryIndex();
>>>     Enumeration iter = doc.fields();
>>>     while (iter.hasMoreElements()) {
>>>         Field field = (Field) iter.nextElement();
>>>         index.addField(field.name(), field.stringValue(), analyzer);
>>>     }
>>>     return index;
>>> }
>>>
>>>> On 11/21/06, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>>>> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have to decide between using a RAMDirectory and MemoryIndex,
>>>>>> but I am not sure which approach will work better...
>>>>>>
>>>>>> I have to run many items (tens of thousands) against some
>>>>>> queries (100 at most), but I have to do it one item at a time.
>>>>>> And I already have the Lucene Document associated with each
>>>>>> item, from a previous operation I perform.
>>>>>>
>>>>>> From what I read MemoryIndex should be faster, but apparently I
>>>>>> cannot reuse the document I already have, and I have to create a
>>>>>> new MemoryIndex per item.
>>>>>
>>>>> A MemoryIndex object holds one document.
>>>>>
>>>>>> Using the RAMDirectory I can use only one of them, and also one
>>>>>> IndexWriter, and create an IndexSearcher and IndexReader per
>>>>>> item, for searching and then removing the item each time.
>>>>>>
>>>>>> Any thoughts?
>>>>>
>>>>> The MemoryIndex impl is optimized to work efficiently without
>>>>> reusing the MemoryIndex object for a subsequent document. See the
>>>>> source code. Reusing the object would not further improve
>>>>> performance.
>>>>>
>>>>> Wolfgang.
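For what it's worth, here is a rough sketch of what a token-caching
wrapper along the lines of the getTokenCachingAnalyzer() javadoc above
might look like. It is an illustration only, not the actual AnalyzerUtil
code, and it assumes Lucene 2.x-era APIs (TokenStream.next() returning
Token); the comments call out where the javadoc's three caveats bite:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenCachingAnalyzerSketch {

    public static Analyzer getTokenCachingAnalyzer(final Analyzer child) {
        // key: (fieldName, reader) pair -- caveat 1: the Reader must
        // implement equals() and hashCode() for lookups to ever hit
        final Map cache = new HashMap();
        return new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                final ArrayList key = new ArrayList(2);
                key.add(fieldName);
                key.add(reader);
                final ArrayList cached = (ArrayList) cache.get(key);
                if (cached != null) {
                    // cache hit: replay the previously recorded tokens
                    final Iterator iter = cached.iterator();
                    return new TokenStream() {
                        public Token next() {
                            return iter.hasNext() ? (Token) iter.next() : null;
                        }
                    };
                }
                // cache miss: delegate to the child analyzer and record
                // each token as it streams by. Caveat 2: the list grows
                // with document size; caveat 3: the Tokens are shared
                // across calls, so they must be immutable.
                final ArrayList recording = new ArrayList();
                cache.put(key, recording);
                final TokenStream stream = child.tokenStream(fieldName, reader);
                return new TokenStream() {
                    public Token next() throws IOException {
                        Token token = stream.next();
                        if (token != null) recording.add(token);
                        return token;
                    }
                };
            }
        };
    }
}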
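The "sneaky side effect" idea from the Nov 22 message could look
something like the following hypothetical TokenFilter (the class and
method names here are made up for illustration). It passes tokens
through to IndexWriter unchanged while recording them, so the document
can later be fed to MemoryIndex.addField(String, TokenStream) without
re-analyzing:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class RecordingTokenFilter extends TokenFilter {

    private final ArrayList recorded = new ArrayList();

    public RecordingTokenFilter(TokenStream input) {
        super(input);
    }

    // pass-through: IndexWriter consumes this stream as usual,
    // but every token is also appended to the list
    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) recorded.add(token); // assumes immutable Tokens
        return token;
    }

    // a second stream that replays the recorded tokens, suitable
    // for MemoryIndex.addField(fieldName, replay())
    public TokenStream replay() {
        final Iterator iter = recorded.iterator();
        return new TokenStream() {
            public Token next() {
                return iter.hasNext() ? (Token) iter.next() : null;
            }
        };
    }
}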
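And for completeness, a compilable version of the matching loop that
the pseudocode above sketches, under the same Lucene 2.x assumptions
(the contrib class org.apache.lucene.index.memory.MemoryIndex):

import java.util.Enumeration;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class MemoryIndexMatcher {

    // for each document, build a throwaway MemoryIndex and run every
    // query against it; a score > 0.0 means the document matches
    public static void matchAll(Document[] docs, Query[] queries,
                                Analyzer analyzer) {
        for (int i = 0; i < docs.length; i++) {
            MemoryIndex index = createMemoryIndex(docs[i], analyzer);
            for (int j = 0; j < queries.length; j++) {
                float score = index.search(queries[j]);
                if (score > 0.0f) System.out.println("it's a match");
            }
        }
    }

    private static MemoryIndex createMemoryIndex(Document doc,
                                                 Analyzer analyzer) {
        MemoryIndex index = new MemoryIndex();
        Enumeration iter = doc.fields();
        while (iter.hasMoreElements()) {
            Field field = (Field) iter.nextElement();
            index.addField(field.name(), field.stringValue(), analyzer);
        }
        return index;
    }
}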
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]