It is really not that involved. Just implement the abstract methods of IndexReader. Many can be no-op'd because they will never be called in a "read only" situation. Methods related to normalization and such can also be no-op'd because you are only dealing with a single document. I would think you will find this approach at least an order of magnitude faster, if not two.
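A minimal sketch of the tokenize-and-hash-map half of this suggestion, assuming the Lucene 1.4 analysis API. The class and method names below are hypothetical (not from the thread), and the custom IndexReader whose termDocs()/termPositions() would consult these maps is deliberately omitted:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Hypothetical helper: builds the term -> positions map for a single
 * in-memory "document". A custom IndexReader subclass could answer
 * termDocs()/termPositions() queries from this map instead of reading
 * a RAMDirectory.
 */
public final class SingleDocTermTable {

    // maps term text (String) -> ArrayList of Integer token positions
    private final HashMap positionsByTerm = new HashMap();

    public SingleDocTermTable(String fieldName, String text, Analyzer analyzer)
            throws IOException {
        TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(text));
        try {
            int position = -1;
            for (Token token = stream.next(); token != null; token = stream.next()) {
                position += token.getPositionIncrement();
                ArrayList positions = (ArrayList) positionsByTerm.get(token.termText());
                if (positions == null) {
                    positions = new ArrayList();
                    positionsByTerm.put(token.termText(), positions);
                }
                positions.add(new Integer(position));
            }
        } finally {
            stream.close();
        }
    }

    /** Document frequency: 1 if the single document contains the term, else 0. */
    public int docFreq(String term) {
        return positionsByTerm.containsKey(term) ? 1 : 0;
    }

    /** Term frequency within the single document, as needed for scoring. */
    public int termFreq(String term) {
        ArrayList positions = (ArrayList) positionsByTerm.get(term);
        return (positions == null) ? 0 : positions.size();
    }
}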
-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single strings

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)". I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:

> I think you are not approaching this the correct way.
>
> Pseudo code:
>
> Subclass IndexReader.
>
> Get tokens from String 'document' using Lucene analyzers.
>
> Build simple hash-map based data structures using tokens for terms, and term positions.
>
> Reimplement termDocs() and termPositions() to use the structures from above.
>
> Run searches.
>
> Start again with next document.
>
> -----Original Message-----
> From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 14, 2005 2:56 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [Performance] Streaming main memory indexing of single strings
>
> Otis, this might be a misunderstanding.
>
> - I'm not calling optimize(). That piece is commented out if you look again at the code.
> - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index:
>
> repeat N times (where N is millions or billions):
>     add a single string (aka document) to an empty index
>     query the index
>     drop the index (or delete its document)
>
> with the following API being called N times: float match(String text, Query query)
>
> So there's no possibility of adding many documents and thereafter running the query. This in turn seems to mean that the IndexWriter can't be kept open - unless I manually delete each document after each query to repeatedly reuse the RAMDirectory, which I've also tried before without any significant performance gain - deletion seems to have substantial overhead in itself. Perhaps it would be better if there were a Directory.deleteAllDocuments() or similar. Did you have some other approach in mind?
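For concreteness, the manual per-query deletion variant Wolfgang describes just above might look like the following sketch, assuming the Lucene 1.4 IndexReader API; the helper class and method name are illustrative only. As he reports, this yields no significant gain, since the deletion itself carries overhead:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public final class SingleDocIndexUtil {

    /** Deletes every document so the same (RAM)Directory can be refilled. */
    public static void clear(Directory dir) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (!reader.isDeleted(i)) reader.delete(i); // mark doc i deleted
            }
        } finally {
            reader.close(); // flushes the deletions back to the directory
        }
    }
}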
> As I said, Lucene's design doesn't seem to fit this streaming use case pattern well. In *this* scenario one could easily do without any locking, and without byte-level organization in RAMDirectory and RAMFile, etc., because a single small string isn't a large persistent multi-document index.
>
> For some background, here's a small example of the kind of XQuery functionality the Nux/Lucene integration enables:
>
> (: An XQuery that finds all books authored by James that have
>    something to do with "fish", sorted by relevance :)
> declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
> declare variable $query := "fish*~";
>
> for $book in /books/book[author="James" and
>     lucene:match(string(.), $query) > 0.0]
> let $score := lucene:match(string($book), $query)
> order by $score descending
> return (<score>{$score}</score>, $book)
>
> More interestingly, one can use this for classifying and routing XML messages based on rules (i.e. queries) inspecting their content...
>
> Any other clues about potential improvements would be greatly appreciated.
>
> Wolfgang.
>
> On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:
>
>> It looks like you are calling that IndexWriter code in some loops,
>> opening it and closing it in every iteration of the loop and also
>> calling optimize. All of those things could be improved.
>> Keep your IndexWriter open, don't close it, and optimize the index
>> only once you are done adding documents to it.
>>
>> See the highlights and the snippets in the first hit:
>> http://www.lucenebook.com/search?query=when+to+optimize
>>
>> Otis
>>
>> --- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anyone could let me know how to improve Lucene
>>> performance for "streaming main memory indexing of single strings".
>>> This would help to effectively integrate Lucene with the Nux XQuery
>>> engine.
>>>
>>> Below is a small microbenchmark simulating STREAMING XQuery fulltext
>>> search as typical for XML network routers, message queuing systems,
>>> P2P networks, etc. In this on-the-fly main memory indexing scenario,
>>> each individual string is immediately matched as soon as it becomes
>>> available without any persistence involved. This usage scenario and
>>> corresponding performance profile is quite different in comparison to
>>> fulltext search over persistent (read-mostly) indexes.
>>>
>>> The benchmark runs at some 3000 Lucene queries/sec (lucene-1.4.3),
>>> which is unfortunate news considering the XQuery engine can easily
>>> walk hundreds of thousands of XML nodes per second. Ideally I'd like
>>> to run at some 100000 queries/sec. Running this through the JDK 1.5
>>> profiler, it seems that most time is spent in and below the following
>>> calls:
>>>
>>> writer = new IndexWriter(dir, analyzer, true);
>>> writer.addDocument(...);
>>> writer.close();
>>>
>>> I tried quite a few variants of the benchmark with various options,
>>> unfortunately with little or no effect. Lucene just does not seem to
>>> be designed to do this sort of "transient single string index" thing.
>>> All code paths related to opening, closing, reading, writing, querying
>>> and object creation seem to be designed for large persistent indexes.
>>>
>>> Any advice on what I'm missing or what could be done about it would
>>> be greatly appreciated.
>>>
>>> Wolfgang.
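For contrast, Otis's advice above (keep one IndexWriter open, optimize once at the end) corresponds to the conventional batch-indexing pattern sketched below. As Wolfgang explains, it does not fit the one-document-per-query streaming case; it is shown only to make the contrast concrete. The class name is illustrative; the field name and analyzer mirror the benchmark that follows:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class BatchIndexer {

    /** Indexes many documents with a single writer, optimizing once at the end. */
    public static Directory index(String[] texts) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < texts.length; i++) {
            Document doc = new Document();
            doc.add(Field.UnStored("content", texts[i]));
            writer.addDocument(doc); // writer stays open across all documents
        }
        writer.optimize(); // optimize once, after all adds
        writer.close();
        return dir;
    }
}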
>>> P.S. the benchmark code is attached as a file below:

package nux.xom.pool;

import java.io.IOException;
//import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
//import org.apache.lucene.analysis.LowerCaseTokenizer;
//import org.apache.lucene.analysis.PorterStemFilter;
//import org.apache.lucene.analysis.SimpleAnalyzer;
//import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
//import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class LuceneMatcher { // TODO: make non-public

    private final Analyzer analyzer;
//  private final Directory dir = new RAMDirectory();

    public LuceneMatcher() {
        this(new StandardAnalyzer());
//      this(new SimpleAnalyzer());
//      this(new StopAnalyzer());
//      this(new Analyzer() {
//          public final TokenStream tokenStream(String fieldName, Reader reader) {
//              return new PorterStemFilter(new LowerCaseTokenizer(reader));
//          }
//      });
    }

    public LuceneMatcher(Analyzer analyzer) {
        if (analyzer == null)
            throw new IllegalArgumentException("analyzer must not be null");
        this.analyzer = analyzer;
    }

    public Query parseQuery(String expression) throws ParseException {
        QueryParser parser = new QueryParser("content", analyzer);
//      parser.setPhraseSlop(0);
        return parser.parse(expression);
    }

    /**
     * Returns the relevance score by matching the given index against the given
     * Lucene query expression. The index must not contain more than one Lucene
     * "document" (aka string to be searched).
     */
    public float match(Directory index, Query query) {
        Searcher searcher = null;
        try {
            searcher = new IndexSearcher(index);
            Hits hits = searcher.search(query);
            float score = hits.length() > 0 ? hits.score(0) : 0.0f;
            return score;
        } catch (IOException e) { // should never happen (RAMDirectory)
            throw new RuntimeException(e);
        } finally {
            try {
                if (searcher != null) searcher.close();
            } catch (IOException e) { // should never happen (RAMDirectory)
                throw new RuntimeException(e);
            }
        }
    }

//  public float match(String text, Query query) {
//      return match(createIndex(text), query);
//  }

    public Directory createIndex(String text) {
        Directory dir = new RAMDirectory();
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(dir, analyzer, true);
//          writer.setUseCompoundFile(false);
//          writer.mergeFactor = 2;
//          writer.minMergeDocs = 1;
//          writer.maxMergeDocs = 1;

            writer.addDocument(createDocument(text));
//          writer.optimize();
            return dir;
        } catch (IOException e) { // should never happen (RAMDirectory)
            throw new RuntimeException(e);
        } finally {
            try {
                if (writer != null) writer.close();
            } catch (IOException e) { // should never happen (RAMDirectory)
                throw new RuntimeException(e);
            }
        }
    }

    private Document createDocument(String content) {
        Document doc = new Document();
        doc.add(Field.UnStored("content", content));
//      doc.add(Field.Text("x", content));
        return doc;
    }

    /**
     * Lucene microbenchmark simulating STREAMING XQuery fulltext search as
     * typical for XML network routers, message queuing systems, P2P networks,
     * etc. In this on-the-fly main memory indexing scenario, each individual
     * string is immediately matched as soon as it becomes available without
     * any persistence involved. This usage scenario and corresponding
     * performance profile is quite different in comparison to fulltext search
     * over persistent (read-mostly) indexes.
     *
     * Example XPath: count(/table/row[lucene:match(string(./firstname),
     * "James") > 0.0])
     */
    public static void main(String[] args) throws Exception {
        int k = -1;
        int runs = 5;
        if (args.length > ++k) runs = Integer.parseInt(args[k]);

        int nodes = 10000;
        if (args.length > ++k) nodes = Integer.parseInt(args[k]);

        String content = "James is out in the woods";
        if (args.length > ++k) content = args[k];

        String expression = "James";
        if (args.length > ++k) expression = args[k];

        LuceneMatcher matcher = new LuceneMatcher();
        Query query = matcher.parseQuery(expression); // to be reused N times

        for (int r = 0; r < runs; r++) {
            long start = System.currentTimeMillis();
            int matches = 0;

            for (int i = 0; i < nodes; i++) {
//              if (LuceneUtil.match(content + i, expression) > 0.0f) {
                if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
                    matches++;
                }
            }

            long end = System.currentTimeMillis();
            System.out.println("matches=" + matches);
            System.out.println("secs=" + ((end-start) / 1000.0f));
            System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
            System.out.println();
        }
    }
}
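Assuming lucene-1.4.3.jar is on the classpath, the benchmark takes up to four optional positional arguments (runs, node count, content string, query expression), for example:

java nux.xom.pool.LuceneMatcher 5 10000 "James is out in the woods" James

Each run prints the match count, elapsed seconds, and queries/sec.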