For those who don't recall, TeeTokenFilter was added in https://issues.apache.org/jira/browse/LUCENE-1058
to handle what I would consider a fairly common case in which two
or more fields share a number of common analysis steps. For
instance, if you want a field containing only the proper nouns found
in the body of a text, or just the dates, or the tokens matching a
certain type, then TeeTokenFilter can be set up so that those
special tokens are "siphoned off" from the main analysis path and
saved for the other field, eliminating the need to run the
analysis steps a second time.
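The siphoning idea can be sketched in plain Java. This is only an
illustration of the pattern, not the actual Lucene
TeeTokenFilter/SinkTokenizer API; the TeeSketch class, its process()
method, and the predicate are hypothetical names I made up for the
example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the tee/sink idea: every token flows through the main
// path unchanged, but tokens matching a predicate are also copied
// into a side "sink" list for reuse by a second field.
class TeeSketch {
    static List<String> process(List<String> tokens,
                                Predicate<String> siphon,
                                List<String> sink) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            if (siphon.test(tok)) {
                sink.add(tok);  // siphoned copy for the second field
            }
            out.add(tok);       // main analysis path is unaffected
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        List<String> main = TeeSketch.process(
                List.of("on", "Monday", "Grant", "flew"),
                tok -> Character.isUpperCase(tok.charAt(0)), // crude "proper noun" test
                sink);
        System.out.println(main);  // [on, Monday, Grant, flew]
        System.out.println(sink);  // [Monday, Grant]
    }
}
```

The point is that the predicate runs once, during the main field's
analysis, instead of re-tokenizing the whole text for the second field.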
So, I have done some preliminary performance testing of
TeeTokenFilter using the patch inlined below (unfortunately, some of
the changes are just formatting). See the testPerformance() method at
the bottom of this message. I simulated the scenarios above by
siphoning off every Xth token in the sink test, while running the
analysis twice in the non-sink test (but skipping all tokens
between X and 2X, exclusive). I then looped over values of X
ranging from 1 to 500 and analyzed a String containing Y tokens, where
Y ranges from 100 to 10,000.
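To make the selection concrete: with X = modCount, both the sink test
and the second pass of the non-sink test keep token positions 0, X,
2X, and so on. The helper below is hypothetical (not part of the
patch), just a worked example of that modulo selection:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: for modCount = X, positions 0, X, 2X, ...
// are the tokens that get siphoned off / kept.
class ModuloSketch {
    static List<Integer> keptPositions(int totalTokens, int modCount) {
        List<Integer> kept = new ArrayList<>();
        for (int pos = 0; pos < totalTokens; pos++) {
            if (pos % modCount == 0) {
                kept.add(pos);  // same token the sink would clone
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With 10 tokens and X = 3, tokens 0, 3, 6, 9 are kept.
        System.out.println(ModuloSketch.keptPositions(10, 3));  // [0, 3, 6, 9]
    }
}
```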
For the smaller token counts, any performance difference is
negligible. However, even at 500 tokens, one starts to see a
difference. The first thing to note is that TeeTokenFilter (TTF) is
much _slower_ when all tokens are siphoned off (X = 1). I believe
the reason is the cost of Token.clone(), but I have not yet
validated this by profiling. Once X > 2, however, the tee/sink
approach spends only 50-65% of the time that the two-field approach
takes to produce the analysis tokens.
My code is below; just add it to the TeeSinkTokenTest in
o.a.l.analysis under the test tree.
Is my test/thinking on this idea valid? Intuitively it makes sense to
me, but I am also tired and did most of this write-up on the
plane the other day, so mistakes happen. In fact, I was somewhat
surprised by the poor showing in the X = 1 case, as I would have
expected it to be in the ballpark of the two-field case, but I will
need to investigate further.
Thanks,
Grant
-----
/**
 * Not an explicit test, just useful for printing out some performance info.
 */
public void testPerformance() throws Exception {
  int[] tokCount = {100, 500, 1000, 2000, 5000, 10000, 50000};
  for (int k = 0; k < tokCount.length; k++) {
    StringBuffer buffer = new StringBuffer();
    System.out.println("-----Tokens: " + tokCount[k] + "-----");
    for (int i = 0; i < tokCount[k]; i++) {
      buffer.append(English.intToEnglish(i).toUpperCase()).append(' ');
    }
    //make sure we produce the same tokens
    ModuloSinkTokenizer sink = new ModuloSinkTokenizer(100);
    Token next = new Token();
    TokenStream result = new TeeTokenFilter(new StandardFilter(
        new StandardTokenizer(new StringReader(buffer.toString()))), sink);
    while ((next = result.next(next)) != null) {
    }
    result = new ModuloTokenFilter(new StandardFilter(
        new StandardTokenizer(new StringReader(buffer.toString()))), 100);
    next = new Token();
    List tmp = new ArrayList();
    while ((next = result.next(next)) != null) {
      tmp.add(next.clone());
    }
    List sinkList = sink.getTokens();
    assertTrue("tmp Size: " + tmp.size() + " is not: " + sinkList.size(),
        tmp.size() == sinkList.size());
    for (int i = 0; i < tmp.size(); i++) {
      Token tfTok = (Token) tmp.get(i);
      Token sinkTok = (Token) sinkList.get(i);
      assertTrue(tfTok.termText() + " is not equal to " + sinkTok.termText()
          + " at token: " + i,
          tfTok.termText().equals(sinkTok.termText()));
    }
    //simulate two fields, each being analyzed once
    int[] modCounts = {1, 2, 5, 10, 20, 50, 100, 200, 500};
    for (int j = 0; j < modCounts.length; j++) {
      int tfPos = 0;
      long start = System.currentTimeMillis();
      for (int i = 0; i < 20; i++) {
        next = new Token();
        result = new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString())));
        while ((next = result.next(next)) != null) {
          tfPos += next.getPositionIncrement();
        }
        next = new Token();
        result = new ModuloTokenFilter(new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString()))),
            modCounts[j]);
        while ((next = result.next(next)) != null) {
          tfPos += next.getPositionIncrement();
        }
      }
      long finish = System.currentTimeMillis();
      System.out.println("ModCount: " + modCounts[j] + " Two fields took "
          + (finish - start) + " ms");
      int sinkPos = 0;
      start = System.currentTimeMillis();
      for (int i = 0; i < 20; i++) {
        sink = new ModuloSinkTokenizer(modCounts[j]);
        next = new Token();
        result = new TeeTokenFilter(new StandardFilter(
            new StandardTokenizer(new StringReader(buffer.toString()))), sink);
        while ((next = result.next(next)) != null) {
          sinkPos += next.getPositionIncrement();
        }
        //System.out.println("Modulo--------");
        result = sink;
        while ((next = result.next(next)) != null) {
          sinkPos += next.getPositionIncrement();
        }
      }
      finish = System.currentTimeMillis();
      System.out.println("ModCount: " + modCounts[j] + " Tee fields took "
          + (finish - start) + " ms");
      assertTrue(sinkPos + " does not equal: " + tfPos, sinkPos == tfPos);
    }
    System.out.println("- End Tokens: " + tokCount[k] + "-----");
  }
}
class ModuloTokenFilter extends TokenFilter {
  int modCount;
  int count = 0;

  ModuloTokenFilter(TokenStream input, int mc) {
    super(input);
    modCount = mc;
  }

  //return only every modCount-th token
  public Token next(Token result) throws IOException {
    while ((result = input.next(result)) != null && count % modCount != 0) {
      count++;
    }
    count++;
    return result;
  }
}
class ModuloSinkTokenizer extends SinkTokenizer {
  int count = 0;
  int modCount;

  ModuloSinkTokenizer(int mc) {
    modCount = mc;
    lst = new ArrayList(modCount + 1);
  }

  public void add(Token t) {
    if (t != null && count % modCount == 0) {
      lst.add(t.clone());
    }
    count++;
  }
}