FWIW, it seems like these issues should be brought up on java-dev. Even if the changes in Lucene are backward compatible, that's not much help if the large majority of users are going to take a hit similar to the one Solr is taking.

On Aug 9, 2009, at 11:47 PM, Mark Miller wrote:

isMethodOverriden is just nasty - copying Methods, security checks, walking the type hierarchy, this, that, some more. I bet cglib has a really fast version - too bad there is no built-in equivalent.

It's not nearly as clean, but what if a new TokenStream simply identified itself as supporting increment, and the default impl returns false? The developer knows at compile time, right? There is almost no reason to keep asking the code over and over again, especially since it's so expensive. Then reusable doubles the cost.
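
A rough sketch of that idea, assuming a capability method on the base class. All class and method names below (SketchTokenStream, supportsIncrementToken, and so on) are hypothetical and are not the Lucene API; the point is only that the consumer asks once through a plain virtual call instead of through reflection.

```java
// Hypothetical names throughout; this is not the Lucene TokenStream API.
abstract class SketchTokenStream {
    /** Default answer: the stream only implements the old next()-style API. */
    boolean supportsIncrementToken() {
        return false;
    }

    /** New-style API; returns false once the stream is exhausted. */
    boolean incrementToken() {
        throw new UnsupportedOperationException("old-style stream");
    }

    /** Old-style API; returns null once the stream is exhausted. */
    String next() {
        throw new UnsupportedOperationException("new-style stream");
    }
}

/** A new-style stream declares its capability with a constant answer. */
final class NewStyleStream extends SketchTokenStream {
    private int remaining = 3;

    @Override
    boolean supportsIncrementToken() {
        return true;
    }

    @Override
    boolean incrementToken() {
        return remaining-- > 0;
    }
}

/** An old-style stream inherits the default "false" and overrides next(). */
final class OldStyleStream extends SketchTokenStream {
    private int remaining = 2;

    @Override
    String next() {
        return remaining-- > 0 ? "tok" : null;
    }
}

final class FlagDemo {
    /** The consumer picks the API with one virtual call, no reflection. */
    static int countTokens(SketchTokenStream ts) {
        int n = 0;
        if (ts.supportsIncrementToken()) {
            while (ts.incrementToken()) n++;
        } else {
            while (ts.next() != null) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(new NewStyleStream()));
        System.out.println(countTokens(new OldStyleStream()));
    }
}
```

The trade-off is the one mentioned above: it is less clean, because an implementer who overrides the new method but forgets to flip the flag silently stays on the old path.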

Mark Miller wrote:
Michael Busch wrote:
Are you sure that the initialization costs of the TokenStream/AttributeSource cause the slowdown? With the bw-comp. code, every call of a Token method now goes through a delegation layer. I'm afraid that might cause a slowdown.
It's isMethodOverriden and TokenStream<init>(AttributeSource).

The code that figures out what Attributes to put into the map uses reflection, but only if the impl wasn't seen before; otherwise the attributes are looked up in a cache.
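
To make the "reflect once, then hit the cache" pattern concrete, here is a minimal sketch in modern Java. The marker interface, the attribute types, and the AttributeInterfaceCache class are all made up for illustration; this is not the AttributeSource code, and it only inspects directly implemented interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical marker interface and attribute types, for illustration only.
interface SketchAttribute {}
interface TermAttr extends SketchAttribute {}
interface OffsetAttr extends SketchAttribute {}
final class TokenImpl implements TermAttr, OffsetAttr {}

final class AttributeInterfaceCache {
    // One entry per implementation class that has been seen so far.
    private static final Map<Class<?>, List<Class<?>>> CACHE = new ConcurrentHashMap<>();

    // Counts how often the reflective scan actually runs (demo only).
    static final AtomicInteger reflectiveScans = new AtomicInteger();

    /** Returns the attribute interfaces of implClass, scanning only on first sight. */
    static List<Class<?>> attributeInterfaces(Class<?> implClass) {
        return CACHE.computeIfAbsent(implClass, AttributeInterfaceCache::scan);
    }

    /** Walks the superclass chain and collects directly implemented attribute interfaces. */
    private static List<Class<?>> scan(Class<?> implClass) {
        reflectiveScans.incrementAndGet();
        List<Class<?>> found = new ArrayList<>();
        for (Class<?> c = implClass; c != null; c = c.getSuperclass()) {
            for (Class<?> itf : c.getInterfaces()) {
                boolean isAttribute = itf != SketchAttribute.class
                        && SketchAttribute.class.isAssignableFrom(itf);
                if (isAttribute && !found.contains(itf)) {
                    found.add(itf);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        attributeInterfaces(TokenImpl.class);
        attributeInterfaces(TokenImpl.class); // served from the cache
        System.out.println("reflective scans: " + reflectiveScans.get());
    }
}
```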

The culprit could also be the reflection code that checks which TokenStream methods are implemented.
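
For readers who have not seen this kind of check, the sketch below shows one way a reflective "is this method overridden?" test can be written, with a per-class cache so the hierarchy walk happens once per class rather than once per instance. BaseStream, LegacyStream, ModernStream and OverrideCheck are hypothetical stand-ins; this is not the code in TokenStream.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins, not the Lucene classes.
class BaseStream {
    public boolean incrementToken() { return false; }
}

/** Does not override incrementToken(). */
class LegacyStream extends BaseStream {}

/** Overrides incrementToken(). */
class ModernStream extends BaseStream {
    @Override
    public boolean incrementToken() { return false; }
}

final class OverrideCheck {
    // Result of the reflective walk, remembered per class.
    private static final Map<Class<?>, Boolean> CACHE = new ConcurrentHashMap<>();

    static boolean overridesIncrementToken(Class<? extends BaseStream> clazz) {
        return CACHE.computeIfAbsent(clazz, OverrideCheck::scan);
    }

    /** Walks from clazz up to (but excluding) BaseStream, looking for a declaration. */
    private static boolean scan(Class<?> clazz) {
        for (Class<?> c = clazz; c != null && c != BaseStream.class; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("incrementToken");
                return true; // declared somewhere below the base class
            } catch (NoSuchMethodException e) {
                // not declared at this level; keep walking up
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(overridesIncrementToken(ModernStream.class)); // true
        System.out.println(overridesIncrementToken(LegacyStream.class)); // false
    }
}
```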

I can't look at the code right now (writing on my cell).
Even if this is "fixable", I don't really like the fact that users who upgrade to 2.9 will potentially see such a performance hit unless they implement incrementToken() and reusableTokenStream.
Looks like you take a good hit, but keep in mind that the test is almost a worst-case scenario as well - the Document text is extremely short.

Michael

On Aug 9, 2009, at 11:13 AM, Yonik Seeley <yo...@lucidimagination.com > wrote:

FYI
https://issues.apache.org/jira/browse/SOLR-1353

On Sun, Aug 9, 2009 at 2:02 PM, Yonik Seeley<yo...@lucidimagination.com > wrote:
It looks like implementing the new attribute stuff will not be enough - the token architecture has changed enough that we must cache TokenStreams to get back to good performance.
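
One common way to cache streams is to keep one reusable instance per thread and reset it for each new piece of text. The sketch below assumes exactly that; ReusableStream and CachingAnalyzer are hypothetical classes, and the reusableTokenStream signature here is simplified for illustration rather than copied from Lucene or Solr.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical classes; a whitespace tokenizer stands in for a real analysis chain.
final class ReusableStream {
    // Counts constructions so the effect of caching is visible (demo only).
    static final AtomicInteger constructed = new AtomicInteger();

    private String[] tokens = new String[0];
    private int pos;

    ReusableStream() {
        constructed.incrementAndGet();
    }

    /** Points the existing instance at new text instead of building a new stream. */
    void reset(String text) {
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    /** Returns the next token, or null when the text is exhausted. */
    String next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }
}

final class CachingAnalyzer {
    // One stream per thread, created lazily on first use.
    private final ThreadLocal<ReusableStream> perThread =
            ThreadLocal.withInitial(ReusableStream::new);

    ReusableStream reusableTokenStream(String text) {
        ReusableStream ts = perThread.get();
        ts.reset(text);
        return ts;
    }

    public static void main(String[] args) {
        CachingAnalyzer analyzer = new CachingAnalyzer();
        for (String doc : new String[] {"a b", "c d e"}) {
            ReusableStream ts = analyzer.reusableTokenStream(doc);
            for (String t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t);
            }
        }
        System.out.println("streams constructed: " + ReusableStream.constructed.get());
    }
}
```

The per-instance setup cost is then paid once per thread rather than once per document, which matters most when documents are very short.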

-Yonik
http://www.lucidimagination.com


On Sun, Aug 9, 2009 at 12:57 PM, Yonik Seeley<yo...@lucidimagination.com > wrote:
OK, I've isolated (magnified) the effect with a test I just checked in. Indexing documents directly at the UpdateHandler was 85% faster before the latest Lucene update.

Run the test like this:

ant test -Dtestcase=TestIndexingPerformance -Dargs="-server -Diter=100000"; grep throughput build/test-results/*TestIndexingPerformance.xml

To run on an older trunk version, just copy over
src/test/org/apache/solr/update/TestIndexingPerformance.java
src/test/test-files/solr/conf/solrconfig_perf.xml

I had a throughput of 10946 docs/sec before the Lucene update, and 5849 after.

-Yonik
http://www.lucidimagination.com


On Sun, Aug 9, 2009 at 12:10 PM, Yonik Seeley<yo...@lucidimagination.com > wrote:
On Sun, Aug 9, 2009 at 12:01 PM, Grant Ingersoll<gsing...@apache.org > wrote:
Or bite the bullet and upgrade to the incrementToken() method.

Right - I'm not sure if that would fix it or not - I haven't been
involved in the new Token attribute stuff...
I'm currently writing a basic indexing unit test that we can use to
measure this (the standard solrconfig does stuff that slows down
indexing a lot, but helps in catching bugs on edge cases by creating
many segments).

-Yonik
http://www.lucidimagination.com







--
- Mark

http://www.lucidimagination.com




--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
