RE: O/S Search Comparisons

2007-12-07 Thread Samir Abdou
There is an expression in French, "comparer des pommes et des poires",
which literally means to compare apples and pears.  That's what this paper
does. From my point of view, such a comparison would be interesting only if
it cross-analyzed several criteria (for example, retrieval effectiveness
(aka search quality), search time, indexing time, index size, query
language, index structure, and so on). Comparing different systems on a
single criterion is not well-grounded.  There is always a trade-off: for
example, setting aside other parameters (ranking algorithm, frequency
statistics, document structure, etc.), indexing with Zettair is much faster
than indexing with Lucene, but at search time Lucene is faster than Zettair.
Why? There are many reasons, but probably Zettair doesn't have Lucene's
complex document structure, besides the difference in ranking algorithm
(Okapi BM25 vs. tf-idf). Some systems compute and store the scores at
indexing time, which makes them faster at search time but less flexible if
you want to change or implement a new ranking algorithm.
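
As a rough illustration of the two ranking functions mentioned above, here is
a minimal textbook-style sketch in Python. It is not Lucene's or Zettair's
actual implementation: the function names, the idf variant, and the default
k1 and b values are assumptions chosen for illustration only.

```python
import math


def tf_idf_score(tf, df, num_docs):
    """Textbook tf-idf weight of one term in one document (illustrative)."""
    if tf == 0 or df == 0:
        return 0.0
    # Raw term frequency times inverse document frequency.
    return tf * math.log(num_docs / df)


def bm25_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 weight of one term in one document (illustrative)."""
    if tf == 0 or df == 0:
        return 0.0
    # One common non-negative idf variant; other variants exist.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# tf-idf grows linearly with tf, while BM25 saturates as tf increases.
for tf in (1, 10, 100):
    print(tf,
          tf_idf_score(tf, df=5, num_docs=100),
          bm25_score(tf, df=5, num_docs=100, doc_len=100, avg_doc_len=100))
```

Because both functions work from raw statistics (tf, df, document length), an
engine that keeps those statistics in the index can switch between them at
search time, whereas one that stores precomputed scores would have to reindex.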

Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,

If we consider search quality, that's simply not true, provided we know how
to implement popular ranking algorithms such as Okapi BM25 (at least) in
Lucene. I've been working with Lucene for four years now; all the
experiments in my thesis were done using Lucene (with many adaptations to
implement the most recent ranking algorithms, including various language
models, divergence from randomness, etc.).  I also participated in major IR
campaigns (NTCIR, CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...).  For other information, search the web ;-)

Samir 


 -Original Message-
 From: Mark Miller [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 7, 2007 21:01
 To: java-dev@lucene.apache.org
 Subject: Re: O/S Search Comparisons
 
 Yes, and even if they did not use the stock defaults, I would bet there
 would be complaints about what was done wrong at every turn. This seems
 like a very difficult thing to do. How long does it take to fully learn
 how to correctly utilize each search engine for the task at hand? I am
 sure longer than these busy men could possibly take. It seems that such
 a comparison could only be done legitimately if experts for each search
 engine set up the indexing/searching processes. Even then the results
 seem like they could be difficult to measure... e.g. was each search
 engine configured so that it would only break on spaces for indexing and
 do nothing else special at all? So many small settings and so much
 knowledge are needed to ensure each engine is on level ground...
 
 I doubt it will ever happen, but some sort of open source search-off
 would be pretty cool <g>. Then each camp could properly configure their
 search engine for each task.
 
 - Mark
 
 Mike Klaas wrote:
  There is a good chance that they were using stock indexing defaults,
  based on:
 
  Lucene:
   "In the present work, the simple applications
  bundled with the library were used to index the collection."
 
  On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
 
  Yeah, I wasn't too excited over it and I certainly didn't lose any
  sleep over it, but there are some interesting things of note in there
  concerning Lucene, including the claim that it fell over on indexing
  WT10g docs (page 40), and I am always looking for ways to improve
  things.  Overall, I think Lucene held up pretty well in the
  evaluation, and I know how suspect _any_ evaluation is given the
  myriad ways of doing search.  Still, when a well-respected researcher
  in the field says Lucene didn't do so hot in certain areas, I don't
  think we can dismiss them out of hand.  So regardless of the tests
  being right or wrong, it is worth addressing either the failures
  in Lucene or the failures in the test, so that we make sure we are
  properly educating our users on how best to use Lucene.
 
  I emailed the authors asking for information on how the test was run
  etc., so we'll see if anything comes of it.
 
  On Dec 7, 2007, at 12:04 PM, robert engels wrote:
 
  I wouldn't get too excited over this. Once again, it does not seem
  the evaluator understands the nature of GC based systems, and the
  memory statistics are quite out of whack. But it is hard to tell
  because there is no data on how memory consumption was actually
  measured.
 
  A far better way of measuring memory consumption is to cap the
  process at different levels (max ram sizes), and compare the
  performance at each level.
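
The capped-memory approach described above could be driven by a small script
along these lines. This is only a sketch: -Xmx is the standard JVM option for
the maximum heap size, but the helper name, the IndexBenchmark main class and
the classpath are hypothetical placeholders, not anything from the paper or
from Lucene.

```python
def heap_capped_commands(main_class, classpath, heap_caps_mb):
    """Build one `java` command line per maximum heap size (in MB)."""
    return [
        ["java", f"-Xmx{cap}m", "-cp", classpath, main_class]
        for cap in heap_caps_mb
    ]


# Hypothetical benchmark class and classpath, for illustration only.
for cmd in heap_capped_commands("IndexBenchmark", "lucene-core.jar:.",
                                [64, 128, 256, 512]):
    print(" ".join(cmd))
```

Each command would then be run and timed separately (for example with
subprocess.run and time.perf_counter), so that performance can be compared
across the heap caps rather than relying on a single memory reading.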
 
  There is also fact that a process 

RE: file format inconsistency (any answer ?) IMPORTANT

2006-11-16 Thread Samir Abdou

When a field stores the offsets and positions of its terms within term
vectors (in the .tvf file), this is not specified in the file format
documentation.

But if you look at the writeField() method of TermVectorsWriter, you'll
see that if offsets and positions are required, they are written to the
.tvf file.

Hope this will help you,

Samir
 

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 15, 2006 19:36
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: file format inconsistency

: There is an inconsistency between the file formats page (on the Lucene
: website) and the source code. It concerns the positions and offsets of
: term vectors. It seems that the documentation (website) is not up to
: date. According to the file formats page, offsets and positions are not
: stored! Is that correct?

can you cite exactly what about the fileformats doc leads you to believe
this? ... a quick search for offsets and positions finds these lines
for me...

 If the third lowest-order bit is set (0x04), term positions are stored with
the term vectors.
 If the fourth lowest-order bit is set (0x08), term offsets are stored with
the term vectors.

...and that's just to start with.
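
For illustration, the two flag bits quoted above can be tested with simple
bit masks. This is a minimal sketch: the constant and function names are made
up for the example and are not Lucene identifiers; only the bit values 0x04
and 0x08 come from the documentation lines quoted above.

```python
# Bit values as quoted from the file formats documentation.
STORE_POSITIONS = 0x04  # third lowest-order bit
STORE_OFFSETS = 0x08    # fourth lowest-order bit


def describe_term_vector_flags(flags):
    """Report which optional term vector data a flag byte advertises."""
    return {
        "positions": bool(flags & STORE_POSITIONS),
        "offsets": bool(flags & STORE_OFFSETS),
    }


# A flag byte with both bits set advertises positions and offsets.
print(describe_term_vector_flags(STORE_POSITIONS | STORE_OFFSETS))
```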

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






file format inconsistency

2006-11-15 Thread Samir Abdou
Hi, 

There is an inconsistency between the file formats page (on the Lucene
website) and the source code. It concerns the positions and offsets of term
vectors. It seems that the documentation (website) is not up to date.
According to the file formats page, offsets and positions are not stored! Is
that correct?

Many thanks,

Samir

