FWIW, seems like these issues should be brought up on java-dev. Even
if the changes in Lucene are back compatible, that's not much help if
the large majority of users are going to take a similar hit to what
Solr is taking.
On Aug 9, 2009, at 11:47 PM, Mark Miller wrote:
isMethodOverriden is just nasty - copying Methods, security checks,
walking the type hierarchy, this, that, some more. I bet cglib has a
really fast version - too bad there is no built-in equivalent.
It's not nearly as clean, but what if a new TokenStream simply
identified itself as supporting increment, and the default impl
returns false? The developer knows at compile time, right? There is
almost no reason to keep asking the code over and over again,
especially since it's so expensive. Then reusable doubles the cost.
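A minimal sketch of the self-identification idea above, set next to a reflection-based check for contrast. All names here (SketchTokenStream, supportsIncrementToken, overridesIncrementToken, NewStyleStream, OldStyleStream) are hypothetical and are not Lucene's actual API; the reflective walk is only a simplified stand-in for the kind of work isMethodOverriden does.

```java
// Hypothetical stand-in for a token stream base class; not Lucene's actual API.
abstract class SketchTokenStream {
    /** Old-style API: returns the next token text, or null at end of stream. */
    String next() { return null; }

    /** New-style API: advances to the next token; false at end of stream. */
    boolean incrementToken() { return false; }

    /** Proposed capability flag: the default says "old API only". */
    boolean supportsIncrementToken() { return false; }

    /**
     * Reflection-based alternative: walks the class hierarchy looking for a
     * declaration of incrementToken() below the base class.
     */
    static boolean overridesIncrementToken(Class<?> clazz) {
        for (Class<?> c = clazz; c != null && c != SketchTokenStream.class; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("incrementToken");
                return true;
            } catch (NoSuchMethodException e) {
                // Not declared at this level; keep walking up.
            }
        }
        return false;
    }
}

// A stream written against the new API declares that fact explicitly.
final class NewStyleStream extends SketchTokenStream {
    private int remaining = 2;
    @Override boolean incrementToken() { return remaining-- > 0; }
    @Override boolean supportsIncrementToken() { return true; }
}

// A legacy stream only overrides next() and inherits the default flag.
final class OldStyleStream extends SketchTokenStream {
    @Override String next() { return null; }
}
```

The flag is a plain virtual call with a constant result, so nothing has to be recomputed per instance. The trade-off is the one noted above: it is less clean, because an implementer can override incrementToken() and forget to flip the flag.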
Mark Miller wrote:
Michael Busch wrote:
Are you sure that the initialization costs of the
TokenStream/AttributeSource cause the slowdown? With the bw-comp. code
now every call of a Token method goes through a delegation layer. I'm
afraid that might cause a slowdown?
It's isMethodOverriden and TokenStream<init>(AttributeSource).
The code that figures out what Attributes to put into the map uses
reflection, but only if the impl wasn't seen before; otherwise the
attributes are looked up in a cache.
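The "reflect once per implementation class, then hit a cache" pattern described above could look roughly like the following. This is a sketch under assumptions, not the Lucene code: OverrideCheckCache, CachedBaseStream, OverridingStream, PlainStream and the reflectiveLookups counter are all hypothetical names, and the cached value is a simple "overrides incrementToken()" flag rather than an attribute list.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical base class; stands in for whatever type the check is run against.
abstract class CachedBaseStream {
    boolean incrementToken() { return false; }
}

final class OverridingStream extends CachedBaseStream {
    @Override boolean incrementToken() { return true; }
}

final class PlainStream extends CachedBaseStream { }

// Hypothetical helper: memoizes a reflective check per implementation class.
final class OverrideCheckCache {
    private static final Map<Class<?>, Boolean> CACHE = new ConcurrentHashMap<>();

    // Counts how often the slow reflective path actually runs.
    static final AtomicInteger reflectiveLookups = new AtomicInteger();

    /** Cached answer to "does this class override incrementToken()?". */
    static boolean overridesIncrementToken(Class<? extends CachedBaseStream> clazz) {
        return CACHE.computeIfAbsent(clazz, OverrideCheckCache::slowCheck);
    }

    // The expensive part: walk the hierarchy and ask each level via reflection.
    private static boolean slowCheck(Class<?> clazz) {
        reflectiveLookups.incrementAndGet();
        for (Class<?> c = clazz; c != null && c != CachedBaseStream.class; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("incrementToken");
                return true;
            } catch (NoSuchMethodException e) {
                // Not declared at this level; keep walking up.
            }
        }
        return false;
    }
}
```

With a cache like this, the reflection cost is paid once per class rather than once per stream instance, although every construction still pays for a map lookup.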
The culprit could also be the reflection code that checks which
TokenStream methods are implemented.
I can't look at the code right now (writing on my cell).
Even if this is "fixable", I don't really like the fact that users
who upgrade to 2.9 will potentially see such a performance hit
unless they implement incrementToken() and reusableTokenStream.
Looks like you take a good hit, but keep in mind that test is
almost worst case scenario as well - the Document text is extremely
short.
Michael
On Aug 9, 2009, at 11:13 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:
FYI
https://issues.apache.org/jira/browse/SOLR-1353
On Sun, Aug 9, 2009 at 2:02 PM, Yonik Seeley<yo...@lucidimagination.com> wrote:
It looks like implementing the new attribute stuff will not be enough
- the token architecture has changed enough that it looks like we must
cache tokenstreams to get back to good performance.
-Yonik
http://www.lucidimagination.com
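One way to picture the tokenstream caching mentioned in the message above is a per-thread stream that is reset for each new piece of text instead of being rebuilt. This is only a sketch under assumptions: ReusableWhitespaceStream, CachingAnalyzer, reusableStream and the constructed counter are hypothetical names, and the constructor merely stands in for the expensive setup discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stream whose construction stands in for the expensive setup.
final class ReusableWhitespaceStream {
    // Counts constructions so the effect of caching is observable.
    static final AtomicInteger constructed = new AtomicInteger();

    private String[] tokens = new String[0];
    private int pos;

    ReusableWhitespaceStream() { constructed.incrementAndGet(); }

    /** Points the existing instance at new text instead of allocating a new stream. */
    void reset(String text) {
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    /** Returns the next token, or null at end of stream. */
    String next() { return pos < tokens.length ? tokens[pos++] : null; }
}

// Hypothetical analyzer that keeps one stream per thread and reuses it.
final class CachingAnalyzer {
    private final ThreadLocal<ReusableWhitespaceStream> cached =
            ThreadLocal.withInitial(ReusableWhitespaceStream::new);

    ReusableWhitespaceStream reusableStream(String text) {
        ReusableWhitespaceStream stream = cached.get(); // built on first use per thread
        stream.reset(text);
        return stream;
    }

    List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        ReusableWhitespaceStream stream = reusableStream(text);
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }
}
```

The per-stream setup cost is then paid once per thread rather than once per field or document, which matters most for very short documents like the ones in the test discussed here. A reused stream must be fully consumed or reset before the next use on the same thread.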
On Sun, Aug 9, 2009 at 12:57 PM, Yonik Seeley<yo...@lucidimagination.com> wrote:
OK, I've isolated (magnified) the effect with a test I just checked in.
Indexing documents directly at the UpdateHandler was 85% faster before
the latest lucene update.
Run the test like this:
ant test -Dtestcase=TestIndexingPerformance -Dargs="-server -Diter=100000"; grep throughput build/test-results/*TestIndexingPerformance.xml
To run on an older trunk version, just copy over
src/test/org/apache/solr/update/TestIndexingPerformance.java
src/test/test-files/solr/conf/solrconfig_perf.xml
I had a throughput of 10946 docs/sec before the lucene update,
and 5849 after.
-Yonik
http://www.lucidimagination.com
On Sun, Aug 9, 2009 at 12:10 PM, Yonik Seeley<yo...@lucidimagination.com> wrote:
On Sun, Aug 9, 2009 at 12:01 PM, Grant Ingersoll<gsing...@apache.org> wrote:
Or bite the bullet and upgrade to the incrementToken() method.
Right - I'm not sure if that would fix it or not - I haven't been
involved in the new Token attribute stuff...
I'm currently writing a basic indexing unit test that we can use to
measure this (the standard solrconfig does stuff that slows down
indexing a lot, but helps in catching bugs on edge cases by creating
many segments).
-Yonik
http://www.lucidimagination.com
--
- Mark
http://www.lucidimagination.com
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search