Re: Using Lucene for searching tokens, not storing them.

karl wettin Sun, 16 Apr 2006 23:14:13 -0700


16 apr 2006 kl. 19.18 skrev karl wettin:

For any interested party, I do this because I have a fairly smallcorpus with very heavy load. I think there is a lot to win by notcreating new instances of what not, seeking in the file-centricDirectory, parsing pseudo-UTF8, et.c. at query time. I simply storeall instance of everything (the index in a bunch of Lists and Maps.Bits are cheaper than ticks.


I will most definitely follow this path.

My tests used the IMDB tv-series as corpus. It contains about 45 000documents and has plenty of unique terms.


On my G4 the 190 000 queries took:
193 476 milliseconds on a RAMDirectory
123 193 milliseconds with my code branch.

That is about 40% less time. The code contains lots of things thatcan be optimized for both memory and CPU. Pretty sure it can becranked down to use a fraction of the ticks spent by a RAMDirectory.I aim at 1/3.

The FSDirectory take 5MB with no fields stored. My implementationoccupies about 100MB RAM, but that includes me treating all fields asStore.YES so it is not comparable at this stage.

I did not time the indexing, but it felt as it was about three tofive times as fast.

Personally I'll be using Prevayler for persistence(java.io.Serializable with transactions).



Basically, this is what I did:

public final class Document implements java.io.Serializable {
    private static final long serialVersionUID = 1l;

    private Integer documentNumber;
    private Map<Term, int[]> termsPositions;
    private Map<String, TermFreqVector> termFrequecyVectorsByField;
    private TermFreqVector[] termFrequencyVectors;


public final class Term implements Comparable, java.io.Serializable {
    private static final long serialVersionUID = 1l;

    private int orderIndex;
    private ArrayList<Document> documents;


public class MemImplManager implements Serializable {
    private static final long serialVersionUID = 1l;

    private transient Map<String, byte[]> normsByFieldCache;
    private Map<String, ArrayList<Byte>> normsByField;

    private ArrayList<Term> orderedTerms;
    private ArrayList<Document> documents;

    private Map<String, Map<String, Term>> termsByFieldAndName;

    private class MemImplReader extends IndexReader {
        ...

So far everything is not fully implemented yet, hence my test onlycontains SpanQueries.


        for (int i = 0; i < 10000; i++) {
            placeQuery(new String[]{"csi", "ny"});
            placeQuery(new String[]{"csi", "new", "york"});
            placeQuery(new String[]{"star", "trek", "enterprise"});
            placeQuery(new String[]{"star", "trek", "deep", "space"});
            placeQuery(new String[]{"lust", "in", "space"});
            placeQuery(new String[]{"lost", "in", "space"});
            placeQuery(new String[]{"lost"});
            placeQuery(new String[]{"that", "70", "show"});
            placeQuery(new String[]{"the", "y-files"});
            placeQuery(new String[]{"csi", "las", "vegas"});
            placeQuery(new String[]{"stargate", "sg-1"});
            placeQuery(new String[]{"stargate", "atlantis"});
            placeQuery(new String[]{"miami", "vice"});
            placeQuery(new String[]{"miami", "voice"});
            placeQuery(new String[]{"big", "brother"});
            placeQuery(new String[]{"my", "name", "is", "earl"});
            placeQuery(new String[]{"falcon", "crest"});
            placeQuery(new String[]{"dallas"});
            placeQuery(new String[]{"v"});
        }


protected Query buildQuery(String[] nameTokens) {

        BooleanQuery q = new BooleanQuery();
        BooleanQuery bqStrategies = new BooleanQuery();

        /**name ^10 */
        {
            SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
            for (int i = 0; i < spanQueries.length; i++) {

spanQueries[i] = new SpanTermQuery(new Term("name",nameTokens[i]));

SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0,true);

            nameQuery.setBoost(10);

bqStrategies.add(new BooleanClause(nameQuery,BooleanClause.Occur.SHOULD));

        }

        /** aka name in order ^1 */
        {
            SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
            for (int i = 0; i < spanQueries.length; i++) {

spanQueries[i] = new SpanTermQuery(new Term("akaName", nameTokens[i]));

SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0,true);

            nameQuery.setBoost(1);

bqStrategies.add(new BooleanClause(nameQuery,BooleanClause.Occur.SHOULD));

q.add(new BooleanClause(bqStrategies,BooleanClause.Occur.MUST));


        return q;
    }




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Using Lucene for searching tokens, not storing them.

Reply via email to