For those who don't recall, TeeTokenFilter was added in https://issues.apache.org/jira/browse/LUCENE-1058
to handle what I would consider a fairly common case in which two
or more fields share a number of common analysis steps. For
instance, if you want a field containing only the proper nouns found
in the body of a text, or just the dates, or the tokens matching a
certain type, then TeeTokenFilter can be set up so that those
special tokens are "siphoned off" from the main analysis path and
saved for the other field, eliminating the need to run the
analysis steps a second time.
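The siphoning idea can be sketched in plain Java. This is only an
illustration of the pattern, not the actual Lucene
TeeTokenFilter/SinkTokenizer API; the TeeSketch class, its process()
method, and the predicate are hypothetical names I made up for the
example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the tee/sink idea: every token flows through the main
// path unchanged, but tokens matching a predicate are also copied
// into a side "sink" list for reuse by a second field.
class TeeSketch {
    static List<String> process(List<String> tokens,
                                Predicate<String> siphon,
                                List<String> sink) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            if (siphon.test(tok)) {
                sink.add(tok);  // siphoned copy for the second field
            }
            out.add(tok);       // main analysis path is unaffected
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        List<String> main = TeeSketch.process(
                List.of("on", "Monday", "Grant", "flew"),
                tok -> Character.isUpperCase(tok.charAt(0)), // crude "proper noun" test
                sink);
        System.out.println(main);  // [on, Monday, Grant, flew]
        System.out.println(sink);  // [Monday, Grant]
    }
}
```

The point is that the predicate runs once, during the main field's
analysis, instead of re-tokenizing the whole text for the second field.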
So, I have done some preliminary performance testing of
TeeTokenFilter using the patch inlined below (unfortunately, some of
the changes are just formatting). See the testPerformance() method at
the bottom of this message. I simulated the scenarios above by
siphoning off every Xth token in the sink test, while running the
analysis twice in the non-sink test (but skipping all tokens
between X and 2X, exclusive). I then looped over values of X
ranging from 1 to 500 and analyzed a String containing Y tokens, where
Y ranges from 100 to 10,000.
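To make the selection concrete: with X = modCount, both the sink test
and the second pass of the non-sink test keep token positions 0, X,
2X, and so on. The helper below is hypothetical (not part of the
patch), just a worked example of that modulo selection:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: for modCount = X, positions 0, X, 2X, ...
// are the tokens that get siphoned off / kept.
class ModuloSketch {
    static List<Integer> keptPositions(int totalTokens, int modCount) {
        List<Integer> kept = new ArrayList<>();
        for (int pos = 0; pos < totalTokens; pos++) {
            if (pos % modCount == 0) {
                kept.add(pos);  // same token the sink would clone
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With 10 tokens and X = 3, tokens 0, 3, 6, 9 are kept.
        System.out.println(ModuloSketch.keptPositions(10, 3));  // [0, 3, 6, 9]
    }
}
```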
For the smaller token counts, any performance difference is
negligible. However, even at 500 tokens, one starts to see a
difference. The first thing to note is that TeeTokenFilter (TTF) is
much _slower_ when all tokens are siphoned off (X = 1). I believe
the reason is the cost of Token.clone(), but I have not yet
validated this by profiling. Once X > 2, however, the tee/sink
approach spends only 50-65% of the time that the two-field approach
takes to produce the analysis tokens.
My code is below; just add it to the TeeSinkTokenTest in
o.a.l.analysis under the test tree.
Is my test/thinking on this idea valid? Intuitively it makes sense to
me, but I am also tired and did most of this write-up on the
plane the other day, so mistakes happen. In fact, I was somewhat
surprised by the poor showing in the X = 1 case, as I would have
expected it to be in the ballpark of the two-field case, but I will
need to investigate further.
Thanks,
Grant
-----
/**
 * Not an explicit test, just useful for printing out some performance info.
 */
public void testPerformance() throws Exception {
  int[] tokCount = {100, 500, 1000, 2000, 5000, 10000, 50000};
  for (int k = 0; k < tokCount.length; k++) {
    StringBuffer buffer = new StringBuffer();
    System.out.println("-----Tokens: " + tokCount[k] + "-----");
    for (int i = 0; i < tokCount[k]; i++) {
      buffer.append(English.intToEnglish(i).toUpperCase()).append(' ');
    }
    //make sure we produce the same tokens
    ModuloSinkTokenizer sink = new ModuloSinkTokenizer(100);
    Token next = new Token();
    TokenStream result = new TeeTokenFilter(new StandardFilter(
        new StandardTokenizer(new StringReader(buffer.toString()))), sink);
    while ((next = result.next(next)) != null) {
    }
    result = new ModuloTokenFilter(new StandardFilter(
        new StandardTokenizer(new StringReader(buffer.toString()))), 100);
    next = new Token();
    List tmp = new ArrayList();
    while ((next = result.next(next)) != null) {
      tmp.add(next.clone());
    }
    List sinkList = sink.getTokens();
    assertTrue("tmp Size: " + tmp.size() + " is not: " + sinkList.size(),
        tmp.size() == sinkList.size());
    for (int i = 0; i < tmp.size(); i++) {
      Token tfTok = (Token) tmp.get(i);
      Token sinkTok = (Token) sinkList.get(i);
      assertTrue(tfTok.termText() + " is not equal to " + sinkTok.termText()
          + " at token: " + i,
          tfTok.termText().equals(sinkTok.termText()));
    }
    //simulate two fields, each being analyzed once
    int[] modCounts = {1, 2, 5, 10, 20, 50, 100, 200, 500};
    for (int j = 0; j < modCounts.length; j++) {
      int tfPos = 0;
      long start = System.currentTimeMillis();
      for (int i = 0; i < 20; i++) {
        next = new Token();
        result = new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString())));
        while ((next = result.next(next)) != null) {
          tfPos += next.getPositionIncrement();
        }
        next = new Token();
        result = new ModuloTokenFilter(new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString()))),
            modCounts[j]);
        while ((next = result.next(next)) != null) {
          tfPos += next.getPositionIncrement();
        }
      }
      long finish = System.currentTimeMillis();
      System.out.println("ModCount: " + modCounts[j] + " Two fields took "
          + (finish - start) + " ms");
      int sinkPos = 0;
      start = System.currentTimeMillis();
      for (int i = 0; i < 20; i++) {
        sink = new ModuloSinkTokenizer(modCounts[j]);
        next = new Token();
        result = new TeeTokenFilter(new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString()))), sink);
        while ((next = result.next(next)) != null) {
          sinkPos += next.getPositionIncrement();
        }
        //System.out.println("Modulo--------");
        result = sink;
        while ((next = result.next(next)) != null) {
          sinkPos += next.getPositionIncrement();
        }
      }
      finish = System.currentTimeMillis();
      System.out.println("ModCount: " + modCounts[j] + " Tee fields took "
          + (finish - start) + " ms");
      assertTrue(sinkPos + " does not equal: " + tfPos, sinkPos == tfPos);
    }
    System.out.println("- End Tokens: " + tokCount[k] + "-----");
  }
}
class ModuloTokenFilter extends TokenFilter {
  int modCount;
  int count = 0;

  ModuloTokenFilter(TokenStream input, int mc) {
    super(input);
    modCount = mc;
  }

  //return only every modCount-th token
  public Token next(Token result) throws IOException {
    while ((result = input.next(result)) != null && count % modCount != 0) {
      count++;
    }
    count++;
    return result;
  }
}
class ModuloSinkTokenizer extends SinkTokenizer {
  int count = 0;
  int modCount;

  ModuloSinkTokenizer(int mc) {
    modCount = mc;
    lst = new ArrayList(modCount + 1);
  }

  public void add(Token t) {
    if (t != null && count % modCount == 0) {
      lst.add(t.clone());
    }
    count++;
  }
}