Re: add CJKTokenizer to solr

Daniel Alheiros Fri, 22 Jun 2007 03:43:14 -0700

Hi Hoss.

I've run a few tests using reflection to instantiate a simple object. The
results vary a lot depending on the JVM, and because the JVM optimizes code
as it executes, they also vary with usage. Still, I think we have something
to consider:

I ran 1,000 samples (5 clean runs of a 200-sample loop), each sample
creating 100,000 objects. The results were:

With reflection:
    - Average                      : 0.0005418
    - Worst (first clean execution): 0.0007760

Without reflection:
    - Average                      : 0.0000469
    - Worst (first clean execution): 0.0002140

Comparing the averages (0.0005418 / 0.0000469), creating the object through
reflection costs roughly 11.5 times as much as creating it directly.
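
For anyone who wants to repeat the comparison, here is a minimal sketch of
this kind of measurement. It is not the code I ran: the class and method
names are hypothetical, the reflective path (a Class.forName lookup on every
call) is only one assumed way of "using reflection", and reporting in seconds
is my choice here. A naive loop like this is also easily distorted by JIT
optimizations, so treat the output as indicative only.

```java
public class ReflectionBenchmark {
    // Hypothetical stand-in for the "simple object" being instantiated.
    public static class Simple {
        public Simple() {}
    }

    static final int OBJECTS_PER_SAMPLE = 100_000;
    static final int SAMPLES = 200;

    // Direct construction with the new operator.
    public static Object createDirect() {
        return new Simple();
    }

    // Reflective construction: look the class up by name on every call
    // (assumed worst case, not necessarily what a real factory would do).
    public static Object createReflective(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    // Time one sample of OBJECTS_PER_SAMPLE creations, in seconds.
    public static double secondsPerSample(boolean reflective) {
        String name = Simple.class.getName();
        long start = System.nanoTime();
        for (int i = 0; i < OBJECTS_PER_SAMPLE; i++) {
            Object o = reflective ? createReflective(name) : createDirect();
            if (o == null) {
                throw new IllegalStateException("creation returned null");
            }
        }
        return (System.nanoTime() - start) / 1e9;
    }

    // Average over SAMPLES samples.
    public static double average(boolean reflective) {
        double total = 0;
        for (int i = 0; i < SAMPLES; i++) {
            total += secondsPerSample(reflective);
        }
        return total / SAMPLES;
    }

    public static void main(String[] args) {
        System.out.println("With reflection    : " + average(true));
        System.out.println("Without reflection : " + average(false));
    }
}
```

The first sample of each kind runs before the JIT has warmed up, which is
why a "first clean execution" figure is worth reporting separately from the
average.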

But my question is: do we need to create factories that frequently, or are
they created once and reused (are they thread safe)? The term Factory made
me think of a class that is responsible for building instances of other
classes, so usually they can be singletons... If they don't need to be
created all the time, reflection will have no real impact and will give
extra flexibility in terms of incorporating new Tokenizers (it would make it
easier to keep Solr and Lucene versions less coupled).
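
To illustrate the "create once, reuse" idea, here is a rough sketch of a
factory that does the class lookup a single time and then reuses the
resolved constructor. Everything in it is hypothetical: CachedReflectiveFactory
and DummyTokenizer are names made up for the example, not Solr or Lucene
classes, and the single Reader constructor argument is an assumption about
what a tokenizer would need.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Constructor;

public class CachedReflectiveFactory {
    // Hypothetical stand-in for a tokenizer with a (Reader) constructor.
    public static class DummyTokenizer {
        final Reader input;

        public DummyTokenizer(Reader input) {
            this.input = input;
        }
    }

    // Resolved once in the factory constructor and never reassigned,
    // so one factory instance can be shared between threads.
    private final Constructor<?> ctor;

    public CachedReflectiveFactory(String className) {
        try {
            // The class-name lookup happens here, once per factory.
            this.ctor = Class.forName(className).getConstructor(Reader.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot resolve " + className, e);
        }
    }

    // Each call still goes through reflection, but skips the lookup.
    public Object create(Reader input) {
        try {
            return ctor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate tokenizer", e);
        }
    }

    public static void main(String[] args) {
        CachedReflectiveFactory factory =
                new CachedReflectiveFactory(DummyTokenizer.class.getName());
        Object tokenizer = factory.create(new StringReader("some text"));
        System.out.println(tokenizer.getClass().getName());
    }
}
```

Whether the remaining per-call cost of newInstance matters would still need
to be measured; the sketch only shows that the lookup itself does not have
to be repeated.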

Environment:
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
Heap size: 256M
Running on a PowerPC Mac, Mac OS X 10.4.9, with 1.5 GB RAM

Regards,
Daniel

On 21/6/07 20:39, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : Why don't we instead create an UberFactory that takes the Tokenizer
> : class as a parameter and instantiates the proper Tokenizer?
> 
> The idea has come up before ... and there's really no reason why it
> wouldn't be okay to include a reflection based factory like this in Solr
> -- it just hasn't been done yet.
> 
> One of the reasons is that there are some performance costs associated
> with the reflection, so we wouldn't want to completely replace the existing
> "configuration via factory name" model with a "configure via class name
> and an uber factory does the reflection quietly in the background" model
> because it's the kind of approach that would really only make sense for
> simple prototypes -- in any system where you are really concerned about
> performance, reflection on every analyzer call would probably be pretty
> expensive.  (although i'd love to see benchmarks prove me wrong)
> 
> Another question in my mind is "why doesn't solr provide an optional jar
> with factories for every tokenizer/tokenfilter in the lucene contribs?"
> ... the only answer to that is that no one has bothered to crank out a
> patch that does it.
> 
> http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
> http://www.nabble.com/foo-tf1737025.html#a4720545
> 
> 
> -Hoss
> 
