All true and good points. Lucene held up quite nicely in the search
aspect (at least perf. wise) and I generally don't think making these
kinds of comparisons is all that useful (we call it apples and oranges
in English :-) ).
What I am trying to get at is: if this paper was just about Lucene and
never mentioned a single other system, what, if anything, can we take
from it that can help us make Lucene better? I know, for instance,
from my own personal experience, that 2.3 is somewhere in the range of
3-5+ times faster than 2.2 (which I know is faster than 1.9). That
being said, the paper clearly states that Lucene was not capable of
doing the WT10g docs because performance degraded too much. Now, I
know Lucene is pretty darn capable of a lot of things and people are
using it to do web search, etc. at very large scales (I have
personally talked w/ people doing it). So, what I worry about is that
either we are:
a) missing something in our default setup,
b) missing something in our docs and our education efforts, or
c) missing some capability in our indexing, such that it is
crashing.
Now, what is to be done? It may well be nothing, but I just want to
make sure we are comfortable with that decision or whether it is worth
asking for a volunteer who has access to the WT10g docs to go have a
look at it and see what happens. I personally don't have access to
these docs, otherwise I would try it out. What we don't want to
happen is for potential supporters/contributors to read that paper and
say "Lucene isn't for me because of this."
Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we really
want Lucene to be going forward (the New Year is good for this kind of
assessment as well). What are its strengths and weaknesses? What can
we improve in the short term and what needs to improve in the longer
term? Maybe it's just that time of year to send out your Lucene Wish
List... :-)
Cheers,
Grant
PS: Samir, any chance of contributing back your ranking
algorithms? :-)
On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:
There is an expression in French, "comparer des pommes et des poires",
which literally means "to compare apples and pears". That's what this
paper is about. From my point of view, such a comparison would be
interesting only if a cross analysis of different criteria (for
example, retrieval effectiveness (aka search quality), search time,
indexing time, index size, query language, index structure, and so on)
were done. Comparing different systems based on only one criterion is
not well-grounded. There is always a kind of trade-off: for example,
leaving aside other parameters (ranking algorithm, frequency
statistics, document structure, etc.), indexing with Zettair is much
faster than indexing with Lucene, but if we consider search time,
Lucene is better than Zettair. Why? For many reasons, but probably
because Zettair doesn't have the complex document structure of Lucene,
besides the difference in ranking algorithm (Okapi BM25 vs. tf-idf).
Some systems compute and store the scores at indexing time, which
makes them faster at search time but less flexible if you want to
change or implement a new ranking algorithm.
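To make the Okapi BM25 vs. tf-idf contrast concrete, here is a minimal
sketch of the two weighting schemes for a single term in a single
document. It uses textbook forms only: it is not Lucene's or Zettair's
actual scoring code, the function names are made up for illustration,
the BM25 idf is a common non-negative variant, and the defaults
k1=1.2 and b=0.75 are conventional values, not ones taken from the
paper.

```python
import math


def tf_idf(tf, df, num_docs):
    """Textbook tf-idf: raw term frequency times log inverse document frequency."""
    return tf * math.log(num_docs / df)


def bm25(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 weight for one term in one document.

    Assumptions: k1 and b are conventional defaults, and the idf below
    is a commonly used variant that never goes negative.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The term-frequency part saturates: it can never exceed (k1 + 1).
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    # Illustrative numbers only: 10,000 docs, term appears in 100 of them.
    for tf in (1, 5, 50):
        print(tf, tf_idf(tf, 100, 10000), bm25(tf, 100, 10000, 300, 300))
```

In this sketch the tf-idf weight grows linearly with term frequency,
while the BM25 weight levels off and also depends on document length.
Both functions take their statistics as arguments at search time; an
engine that stores finished scores in the index would have to re-index
to switch between them, which is the flexibility trade-off described
above.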
Still, when a well-respected researcher in the field says Lucene
didn't do so hot in certain areas,
If we consider search quality, that's simply not true, provided we
know how to implement a popular ranking algorithm such as Okapi BM25
(at least) in Lucene. I've been working with Lucene for four years
now, and all the experiments of my thesis have been done using Lucene
(with many adaptations to implement the most recent ranking
algorithms, including different language models, divergence from
randomness, etc.). I also participated in major IR campaigns (NTCIR,
CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...). For other information, search the web ;-)
Samir
-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, December 7, 2007 21:01
To: java-dev@lucene.apache.org
Subject: Re: O/S Search Comparisons
Yes, and even if they did not use the stock defaults, I would bet
there would be complaints about what was done wrong at every turn.
This seems like a very difficult thing to do. How long does it take to
fully learn how to correctly utilize each search engine for the task
at hand? I am sure longer than these busy men could possibly take. It
seems that such a comparison could only be done legitimately if
experts for each search engine set up the indexing/searching
processes. Even then the results seem like they could be difficult to
measure... e.g., was each search engine configured so that it would
only break on spaces for indexing and do nothing else special at all?
So many small settings, and so much knowledge, are needed to ensure
each engine is on level ground...
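As a small illustration of why the "only break on spaces" question
matters, the sketch below tokenizes the same text two ways. Neither
function is any engine's real analyzer; both names are hypothetical,
and the second one just stands in for "something a bit more special"
(lowercasing and splitting on non-alphanumerics).

```python
import re


def whitespace_tokens(text):
    """Break on whitespace only; case and punctuation are left untouched."""
    return text.split()


def normalized_tokens(text):
    """Lowercase, then keep only runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    sample = "Open-Source search, Lucene."
    print(whitespace_tokens(sample))
    print(normalized_tokens(sample))
```

The two functions produce different terms, and a different number of
terms, from the same input, so engines configured differently at this
step end up with different index sizes and different matches before
ranking even comes into play.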
I doubt it will ever happen, but some sort of open source search-off
would be pretty cool <g>. Then each camp could properly configure
their search engine for each task.
- Mark
Mike Klaas wrote:
There is a good chance that they were using stock indexing defaults,
based on:
Lucene:
"In the present work, the simple applications
bundled with the library were used to index the collection."
On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
Yeah, I wasn't too excited over it and I certainly didn't lose any
sleep over it, but there are some interesting things of note in there
concerning Lucene, including the claim that it fell over on indexing
the WT10g docs (page 40), and I am always looking for ways to improve
things. Overall, I think Lucene held up pretty well in the
evaluation, and I know how suspect _any_ evaluation is given the
myriad ways of doing search. Still, when a well-respected researcher
in the field says Lucene didn't do so hot in certain areas, I don't
think we can dismiss them out of hand. So regardless of whether the
tests are right or wrong, it is worth addressing either the failures
in Lucene or the failures in the test, so that we make sure we are
properly educating our users on how best to use Lucene.
I emailed the authors asking for information on how the test was run,
etc., so we'll see if anything comes of it.
On Dec 7, 2007, at 12:04 PM, robert engels wrote:
I wouldn't get too excited over this. Once again, it does not seem
the evaluator understands the nature of GC based systems, and the
memory statistics are quite out of whack. But it is hard to tell
because there is no data on how memory consumption was actually
measured.
A far better way of measuring memory consumption is to cap the
process at different levels (max ram sizes), and compare the
performance at each level.
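A rough sketch of that capped-memory approach: run the same benchmark
under several maximum heap sizes and time each run. The only JVM
option relied on is the standard -Xmx flag; the benchmark class name
"IndexBenchmark", the classpath, and the cap values are placeholders,
not anything from the paper.

```python
import subprocess
import time


def capped_commands(main_class, heap_caps_mb, classpath="."):
    """Build one JVM command line per heap cap, using the standard -Xmx flag."""
    return [
        ["java", f"-Xmx{cap}m", "-cp", classpath, main_class]
        for cap in heap_caps_mb
    ]


def time_command(cmd):
    """Run a command to completion and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start


def sweep(main_class, heap_caps_mb, classpath="."):
    """Time the same benchmark once per heap cap; returns (cap, exit_code, seconds)."""
    rows = []
    for cap, cmd in zip(heap_caps_mb, capped_commands(main_class, heap_caps_mb, classpath)):
        code, seconds = time_command(cmd)
        rows.append((cap, code, seconds))
    return rows
```

Calling sweep("IndexBenchmark", [64, 128, 256, 512]) would give one
timing per cap; a non-zero exit code at a small cap is itself a data
point (the run did not fit). This caps only the Java heap, so it says
nothing about the disk cache, which would have to be controlled
separately.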
There is also the fact that a process takes memory from the disk
cache, and vice versa, which heavily affects search performance, etc.
Since there is no detailed data (that I could find) about system
configuration, etc. the results are highly suspect.
There is also no mention of performance on multi-processor systems.
Some systems (like Lucene) pay a penalty to support multi-processing
(both in Java and Lucene), and only realize the benefit when
operating in a multi-processor environment.
Based on the sheer speed of XMLSearch and Zettair, those seem likely
candidates for inspecting their design.
On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:
Was wondering if people have seen
http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
Has some interesting comparisons. Obviously, the comparison of
Lucene indexing is done w/ 1.9 so it probably needs to be done
again. Just wondering if people see any opportunities to improve
Lucene from it. I am going to try and contact the authors to see
if I can get what their setup values were (mergeFactor, Analyzer,
etc.) as I think it would be interesting to run the tests again on
2.3.
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ