For those who don't recall, TeeTokenFilter was added in https://issues.apache.org/jira/browse/LUCENE-1058 to handle what I would consider a fairly common case where two or more fields share a number of common analysis steps. For instance, if one wanted a field containing just the proper nouns found in the body of a text, or just the dates, or the tokens matching a certain type, then a TeeTokenFilter can be set up so that those special tokens are "siphoned off" from the main analysis path and saved for the other field, thus eliminating the need to run the analysis steps a second time.
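To make the tee/sink idea concrete, here is a minimal self-contained sketch of the pattern, independent of the Lucene API (the class and method names below are illustrative, not the actual TeeTokenFilter/SinkTokenizer classes): every token flows through the main path unchanged, and tokens matching a predicate are also copied into a side buffer that a second field can replay without re-analyzing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the tee/sink pattern (illustrative only; not the Lucene API).
public class TeeSketch {
  // Passes every token through unchanged; tokens matching 'siphon' are also
  // copied into 'sink' so a second field can consume them without re-analysis.
  static List<String> tee(List<String> tokens, Predicate<String> siphon, List<String> sink) {
    List<String> out = new ArrayList<>();
    for (String tok : tokens) {
      if (siphon.test(tok)) {
        sink.add(tok);  // the "sink" field sees only the siphoned tokens
      }
      out.add(tok);     // the main field still sees everything
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> sink = new ArrayList<>();
    // Siphon off capitalized tokens, standing in for "proper nouns".
    List<String> main = tee(Arrays.asList("on", "March", "5", "we", "met", "Grant"),
                            t -> Character.isUpperCase(t.charAt(0)), sink);
    System.out.println(main.size() + " " + sink); // prints: 6 [March, Grant]
  }
}
```

The key property is that the input is tokenized only once; the sink's tokens are copies made as the main stream is consumed.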

So, I have done some preliminary performance testing of TeeTokenFilter using the patch inlined below (unfortunately, some of the changes are just formatting). See the testPerformance() method at the bottom of this message. I simulated some of the scenarios above by siphoning off every Xth token in the sink test, while in the non-sink test I ran the analysis twice, keeping only every Xth token on the second pass. I then looped over various values of X ranging from 1 to 500 and analyzed a String containing Y tokens, where Y ranges from 100 to 10000.

For the smaller token counts, any performance difference is negligible. However, even at 500 tokens one starts to see a difference. The first thing to note is that TeeTokenFilter (TTF) is much _slower_ in the case where all tokens are siphoned off (X = 1). I believe the reason is the cost of Token.clone(), but I have not yet validated this by profiling. However, once X > 2, the time spent producing analysis tokens drops to somewhere in the 50-65% range of the total time of the two-field approach.
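One way to see why X = 1 is the worst case: the sink in the test below clones every token whose position is a multiple of X, so for Y tokens it performs roughly ceil(Y/X) clones. A small back-of-the-envelope sketch (not a benchmark; the numbers are just the clone counts implied by the modulo logic):

```java
// Expected sink size for the modulo siphon: tokens at positions 0, X, 2X, ...
// out of Y total, i.e. ceil(Y / X). At X = 1 every token pays a Token.clone();
// larger X amortizes that cost away.
public class SinkSize {
  static int sinkSize(int totalTokens, int modCount) {
    return (totalTokens + modCount - 1) / modCount; // ceiling division
  }

  public static void main(String[] args) {
    for (int x : new int[] {1, 2, 10, 100, 500}) {
      System.out.println("X=" + x + " clones=" + sinkSize(10000, x));
      // X=1 clones=10000, X=2 clones=5000, X=10 clones=1000,
      // X=100 clones=100, X=500 clones=20
    }
  }
}
```

So at X = 1 the tee path does a full extra copy of the stream on top of the analysis itself, which would explain it losing to the two-field baseline.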

My code is below; just add it to TeeSinkTokenTest in o.a.l.analysis in the test tree.

Is my test/thinking valid on this idea? Intuitively it makes sense to me, but I am also tired and did most of the write-up on the plane the other day, so mistakes happen. In fact, I was somewhat surprised by the poor showing in the X = 1 case, as I would have thought it would be in the ballpark of the two-field case, but I will need to investigate further.

Thanks,
Grant

-----
  /**
   * Not an explicit test, just useful to print out some info on performance.
   *
   * @throws Exception
   */
  public void testPerformance() throws Exception {
    int[] tokCount = {100, 500, 1000, 2000, 5000, 10000, 50000};
    for (int k = 0; k < tokCount.length; k++) {
      StringBuffer buffer = new StringBuffer();
      System.out.println("-----Tokens: " + tokCount[k] + "-----");
      for (int i = 0; i < tokCount[k]; i++) {
        buffer.append(English.intToEnglish(i).toUpperCase()).append(' ');
      }
      //make sure both paths produce the same tokens
      ModuloSinkTokenizer sink = new ModuloSinkTokenizer(100);
      Token next = new Token();
      TokenStream result = new TeeTokenFilter(new StandardFilter(new StandardTokenizer(new StringReader(buffer.toString()))), sink);
      while ((next = result.next(next)) != null) {
        //consume the whole stream so the sink gets populated
      }
      result = new ModuloTokenFilter(new StandardFilter(new StandardTokenizer(new StringReader(buffer.toString()))), 100);
      next = new Token();
      List tmp = new ArrayList();
      while ((next = result.next(next)) != null) {
        tmp.add(next.clone());
      }
      List sinkList = sink.getTokens();
      assertTrue("tmp Size: " + tmp.size() + " is not: " + sinkList.size(), tmp.size() == sinkList.size());
      for (int i = 0; i < tmp.size(); i++) {
        Token tfTok = (Token) tmp.get(i);
        Token sinkTok = (Token) sinkList.get(i);
        assertTrue(tfTok.termText() + " is not equal to " + sinkTok.termText() + " at token: " + i, tfTok.termText().equals(sinkTok.termText()));
      }
      //simulate two fields, each being analyzed once
      int[] modCounts = {1, 2, 5, 10, 20, 50, 100, 200, 500};
      for (int j = 0; j < modCounts.length; j++) {
        int tfPos = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 20; i++) {
          next = new Token();
          result = new StandardFilter(new StandardTokenizer(new StringReader(buffer.toString())));
          while ((next = result.next(next)) != null) {
            tfPos += next.getPositionIncrement();
          }
          next = new Token();
          result = new ModuloTokenFilter(new StandardFilter(new StandardTokenizer(new StringReader(buffer.toString()))), modCounts[j]);
          while ((next = result.next(next)) != null) {
            tfPos += next.getPositionIncrement();
          }
        }
        long finish = System.currentTimeMillis();
        System.out.println("ModCount: " + modCounts[j] + " Two fields took " + (finish - start) + " ms");
        int sinkPos = 0;
        start = System.currentTimeMillis();
        for (int i = 0; i < 20; i++) {
          sink = new ModuloSinkTokenizer(modCounts[j]);
          next = new Token();
          result = new TeeTokenFilter(new StandardFilter(new StandardTokenizer(new StringReader(buffer.toString()))), sink);
          while ((next = result.next(next)) != null) {
            sinkPos += next.getPositionIncrement();
          }
          //System.out.println("Modulo--------");
          result = sink;
          while ((next = result.next(next)) != null) {
            sinkPos += next.getPositionIncrement();
          }
        }
        finish = System.currentTimeMillis();
        System.out.println("ModCount: " + modCounts[j] + " Tee fields took " + (finish - start) + " ms");
        assertTrue(sinkPos + " does not equal: " + tfPos, sinkPos == tfPos);

      }
      System.out.println("- End Tokens: " + tokCount[k] + "-----");
    }

  }


  class ModuloTokenFilter extends TokenFilter {

    int modCount;

    ModuloTokenFilter(TokenStream input, int mc) {
      super(input);
      modCount = mc;
    }

    int count = 0;

    //return only every modCount-th token
    public Token next(Token result) throws IOException {
      while ((result = input.next(result)) != null && count % modCount != 0) {
        count++;
      }
      count++;
      return result;
    }
  }

  class ModuloSinkTokenizer extends SinkTokenizer {
    int count = 0;
    int modCount;


    ModuloSinkTokenizer(int mc) {
      modCount = mc;
      lst = new ArrayList(modCount + 1);
    }

    public void add(Token t) {
      if (t != null && count % modCount == 0) {
        lst.add(t.clone());
      }
      count++;
    }
  }

