[jira] Commented: (LUCENE-1083) JDiff report of changes between different versions of Lucene

2007-12-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549429
 ] 

Grant Ingersoll commented on LUCENE-1083:
-

Thanks, Matt.  I assume the antjdiff.jar needs to be included somewhere?  In 
order for this to work, you probably need to make the target check whether 
that library exists before running it (check out the way the Clover test 
coverage works). We can't include the actual JDiff libraries in Lucene b/c of 
licensing issues (I'm pretty sure, anyway), so ideally your task would also 
download the library when executed and install it properly for users.

 JDiff report of changes between different versions of Lucene
 

 Key: LUCENE-1083
 URL: https://issues.apache.org/jira/browse/LUCENE-1083
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Javadocs
Affects Versions: 2.2
Reporter: Matt Doar
 Attachments: jdiff_lucene_191_220.zip, jdiff_lucene_210_220.zip


 I think that a helpful addition to the release process for Lucene would be 
 [JDiff|http://www.jdiff.org] reports of the API changes between different 
 versions. I am attaching reports of the differences between 1.9.1 and 2.2.0 
 and also between 2.1.0 and 2.2.0. The reports could be changed to only show 
 the public methods. The start page is changes.html.
 This is the Ant target I added to the top-level build.xml file in the JDiff 
 directory to produce a report:
 {noformat}
 <target name="lucene" depends="dist">
   <taskdef name="jdiff"
            classname="jdiff.JDiffAntTask"
            classpath="${dist.dir}/antjdiff.jar" />
   <jdiff destdir="${reports.dir}/lucene"
          verbose="on"
          stats="on"
          docchanges="on">
     <old name="1.9.1">
       <dirset dir="${examples.dir}/lucene-1.9.1/src/java" includes="org/**" />
     </old>
     <new name="2.2.0">
       <dirset dir="${examples.dir}/lucene-2.2.0/src/java" includes="org/**" />
     </new>
   </jdiff>
 </target>
 {noformat}
 Disclaimer: I'm the author of JDiff




[jira] Resolved: (LUCENE-1082) IndexReader.lastModified - throws NPE

2007-12-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1082.


   Resolution: Fixed
Fix Version/s: 2.3

I just committed this.  Thanks Alan!

 IndexReader.lastModified - throws NPE
 -

 Key: LUCENE-1082
 URL: https://issues.apache.org/jira/browse/LUCENE-1082
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3
 Environment: Windows variants.
Reporter: Alan Boo
Assignee: Michael McCandless
 Fix For: 2.3

 Attachments: LUCENE-1082.patch


 IndexReader.lastModified(String dir) or its variants always throw an NPE on 
 2.3, perhaps something to do with SegmentInfo.
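
 For reference, a minimal sketch of the kind of call being reported (the index 
 path below is just a placeholder, not from the report):
 {noformat}
 import org.apache.lucene.index.IndexReader;

 public class LastModifiedCheck {
     public static void main(String[] args) throws Exception {
         // Static helper that reads index metadata without opening a full reader;
         // this is the call reported to hit a NullPointerException on 2.3-dev.
         long modified = IndexReader.lastModified("/path/to/index");
         System.out.println("Index last modified: " + modified);
     }
 }
 {noformat}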




Re: O/S Search Comparisons

2007-12-07 Thread Grant Ingersoll
Yeah, I wasn't too excited over it and I certainly didn't lose any  
sleep over it, but there are some interesting things of note in there  
concerning Lucene, including the claim that it fell over on indexing  
WT10g docs (page 40) and I am always looking for ways to improve  
things.  Overall, I think Lucene held up pretty well in the  
evaluation, and I know how suspect _any_ evaluation is given the  
myriad ways of doing search.  Still, when a well-respected researcher  
in the field says Lucene didn't do so hot in certain areas, I don't  
think we can dismiss them out of hand.   So regardless of the tests  
being right or wrong, they are worth either addressing the failures in  
Lucene or the failures in the test such that we make sure we are  
properly educating our users on how best to use Lucene.


I emailed the authors asking for information on how the test was run  
etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem  
the evaluator understands the nature of GC based systems, and the  
memory statistics are quite out of whack. But it is hard to tell  
because there is no data on how memory consumption was actually  
measured.


A far better way of measuring memory consumption is to cap the  
process at different levels (max ram sizes), and compare the  
performance at each level.


There is also the fact that a process takes memory from disk cache, and  
vice versa, which heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system  
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor systems.  
Some systems (like Lucene) pay a penalty to support multi-processing  
(both in Java and Lucene), and only realize this benefit when  
operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem likely  
candidates for inspecting their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:


Was wondering if people have seen 
http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf

Has some interesting comparisons.  Obviously, the comparison of  
Lucene indexing is done w/ 1.9 so it probably needs to be done  
again.  Just wondering if people see any opportunities to improve  
Lucene from it.  I am going to try and contact the authors to see  
if I can get what their setup values were (mergeFactor, Analyzer,  
etc.) as I think it would be interesting to run the tests again on  
2.3.


-Grant






[jira] Commented: (LUCENE-1083) JDiff report of changes between different versions of Lucene

2007-12-07 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549500
 ] 

Doug Cutting commented on LUCENE-1083:
--

The prior release is a new concept that needs to be added to the build to 
support this.  Perhaps a property in common-build.xml that names the subversion 
tags of the prior major and minor releases, and the jdiff target could use 
these.  The jdiff target should do nothing if jdiff.home is not defined.  Some 
links to the jdiff output should be added somewhere in the docs.  Folks who 
build releases would then be required to install jdiff and to define jdiff.home 
when they make releases, or else the releases will contain broken links.  This 
should be documented on the wiki's HowToRelease page.





Re: Comparable ScoreDoc

2007-12-07 Thread Chris Hostetter

: In general I would agree that people may want different implementations for
: compare(), but I hardly see that's the case for ScoreDoc. After all, you can
: either compare it by score or by doc (at least now). I believe that since
: most people use the TopDocsHitCollector, they prefer the compare-by-score
: approach ...

sure, but that's not all your suggested compareTo does ... it first 
compares by score and then does a secondary comparison by docId.  
some people might want docs added more recently to sort first instead, 
some might want docId left out of the comparison altogether.

This is where Comparators are more useful than compareTo methods.  We can 
add *lots* of different static inner Comparator classes to ScoreDoc, but 
if we add any compareTo method it could wind up burning someone down the 
road.
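
To make the idea concrete, here is a minimal sketch (not committed code) of 
what a couple of static Comparators over ScoreDoc could look like; doc and 
score are ScoreDoc's existing public fields, everything else is illustrative:
{noformat}
import java.util.Comparator;

import org.apache.lucene.search.ScoreDoc;

// Illustrative only: reusable Comparators as an alternative to making
// ScoreDoc itself Comparable.
public class ScoreDocComparators {

    // Orders by descending score, breaking ties by ascending doc id.
    public static final Comparator SCORE_THEN_DOC = new Comparator() {
        public int compare(Object o1, Object o2) {
            ScoreDoc a = (ScoreDoc) o1;
            ScoreDoc b = (ScoreDoc) o2;
            if (a.score != b.score) {
                return a.score > b.score ? -1 : 1;
            }
            return a.doc - b.doc;
        }
    };

    // Orders purely by doc id, ignoring score.
    public static final Comparator DOC_ONLY = new Comparator() {
        public int compare(Object o1, Object o2) {
            return ((ScoreDoc) o1).doc - ((ScoreDoc) o2).doc;
        }
    };
}
{noformat}
Callers could then pick whichever ordering fits, e.g. 
Arrays.sort(topDocs.scoreDocs, ScoreDocComparators.SCORE_THEN_DOC) using 
java.util.Arrays, without ScoreDoc committing to a single compareTo.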


-Hoss





Re: Comparable ScoreDoc

2007-12-07 Thread Shai Erera
In general I would agree that people may want different implementations for
compare(), but I hardly see that's the case for ScoreDoc. After all, you can
either compare it by score or by doc (at least now). I believe that since
most people use the TopDocsHitCollector, they prefer the compare-by-score
approach ...

What about access to inner fields? Like I wrote, Comparables are
self-contained in the sense that they know how to compare themselves to other
instances of the same type. However, Comparators can only compare public variables.

On Dec 6, 2007 7:20 PM, Michael Busch [EMAIL PROTECTED] wrote:

 Shai Erera wrote:

 
  Comparators however have an advantage - in that specific case I could
 create
  two Comparators: (1) compares by the score and then by doc (2) compares
 by

 That's why I hesitate to add the Comparable interface to ScoreDoc:
 Different people might want different implementations of compare(), and
 that makes it questionable which default compare() implementation we
 should commit. I think Comparator solves this problem quite nicely.

 -Michael





-- 
Regards,

Shai Erera


Re: O/S Search Comparisons

2007-12-07 Thread Mark Miller
Did it crash on the 10 GB? I thought it said that it just took way too 
long (7 times the best or something). Frankly, either case is suspect. 
Last summer I indexed about 5 million docs with a total size at the 
*very* least of 10 GB on my 3 year old desktop. It didn't take much more 
than 8 hours to index and searches were still lightning fast. Maybe 
they forgot to give the JVM more than the default amount of RAM <g>


- Mark

Grant Ingersoll wrote:
All true and good points.  Lucene held up quite nicely in the search 
aspect (at least perf. wise) and I generally don't think making these 
kinds of comparisons is all that useful (we call it apples and oranges 
in English :-)  ).


What I am trying to get at is if this paper was just about Lucene and 
never mentioned a single other system, what, if anything, can we take 
from it that can help us make Lucene better.   I know, for instance, 
from my own personal experience, that 2.3 is somewhere in the range of 
3-5+ times faster than 2.2 (which I know is faster than 1.9).  That 
being said, the paper clearly states that Lucene was not capable of 
doing the WT10g docs because performance degraded too much.  Now, I 
know Lucene is pretty darn capable of a lot of things and people are 
using it to do web search, etc. at very large scales (I have 
personally talked w/ people doing it).  So, what I worry about is that 
either we are:

a) missing something in our defaults setup
b) missing something in our docs and our education efforts, or
c) we are missing some capability in our indexing such that it is 
crashing


Now, what is to be done?  It may well be nothing, but I just want to 
make sure we are comfortable with that decision or whether it is worth 
asking for a volunteer who has access to the WT10g docs to go have a 
look at it and see what happens.  I personally don't have access to 
these docs, otherwise I would try it out.  What we don't want to 
happen is for potential supporters/contributors to read that paper and 
say Lucene isn't for me because of this.


Sometimes, when something like this comes up, it gives you the 
opportunity to take a step back and ask what are the things we really 
want Lucene to be going forward (the New Year is good for this kind of 
assessment as well).  What are its strengths and weaknesses?  What can 
we improve in the short term and what needs to improve in the longer 
term?  Maybe it's just that time of year to send out your Lucene Wish 
List... :-)


Cheers,
Grant

PS:  Samir, any chance of contributing back your ranking algorithms?  :-)


On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:


There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears". That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross-analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded. There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with Zettair is much faster than indexing with
Lucene, but if we consider searching time, Lucene is better than Zettair.
Why? For many reasons, but probably because Zettair doesn't have the complex
document structure of Lucene, besides the ranking algorithm (Okapi BM25 vs.
tf-idf). Some systems compute and store the scores at indexing time, which
makes them faster at searching time but less flexible if you want to
change/implement a new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider search quality, that's simply not true if we know how to
implement popular ranking algorithms such as Okapi BM25 (at least) in Lucene.
I've been working with Lucene for four years now; all the experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models,
divergence from randomness, etc.). I also participated in major IR campaigns
(NTCIR, CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...); for other information search the web ;-)

Samir



-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, December 7, 2007 21:01
To: java-dev@lucene.apache.org
Subject: Re: O/S Search Comparisons

Yes, and even if they did not use the stock defaults, I would bet there

Re: O/S Search Comparisons

2007-12-07 Thread Grant Ingersoll
All true and good points.  Lucene held up quite nicely in the search  
aspect (at least perf. wise) and I generally don't think making these  
kinds of comparisons is all that useful (we call it apples and oranges  
in English :-)  ).


What I am trying to get at is if this paper was just about Lucene and  
never mentioned a single other system, what, if anything, can we take  
from it that can help us make Lucene better.   I know, for instance,  
from my own personal experience, that 2.3 is somewhere in the range of  
3-5+ times faster than 2.2 (which I know is faster than 1.9).  That  
being said, the paper clearly states that Lucene was not capable of  
doing the WT10g docs because performance degraded too much.  Now, I  
know Lucene is pretty darn capable of a lot of things and people are  
using it to do web search, etc. at very large scales (I have  
personally talked w/ people doing it).  So, what I worry about is that  
either we are:

a) missing something in our defaults setup
b) missing something in our docs and our education efforts, or
c) we are missing some capability in our indexing such that it is  
crashing


Now, what is to be done?  It may well be nothing, but I just want to  
make sure we are comfortable with that decision or whether it is worth  
asking for a volunteer who has access to the WT10g docs to go have a  
look at it and see what happens.  I personally don't have access to  
these docs, otherwise I would try it out.  What we don't want to  
happen is for potential supporters/contributors to read that paper and  
say Lucene isn't for me because of this.


Sometimes, when something like this comes up, it gives you the  
opportunity to take a step back and ask what are the things we really  
want Lucene to be going forward (the New Year is good for this kind of  
assessment as well).  What are its strengths and weaknesses?  What can  
we improve in the short term and what needs to improve in the longer  
term?  Maybe it's just that time of year to send out your Lucene Wish  
List... :-)


Cheers,
Grant

PS:  Samir, any chance of contributing back your ranking  
algorithms?  :-)



On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:


There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears". That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross-analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded. There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with Zettair is much faster than indexing with
Lucene, but if we consider searching time, Lucene is better than Zettair.
Why? For many reasons, but probably because Zettair doesn't have the complex
document structure of Lucene, besides the ranking algorithm (Okapi BM25 vs.
tf-idf). Some systems compute and store the scores at indexing time, which
makes them faster at searching time but less flexible if you want to
change/implement a new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider search quality, that's simply not true if we know how to
implement popular ranking algorithms such as Okapi BM25 (at least) in Lucene.
I've been working with Lucene for four years now; all the experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models,
divergence from randomness, etc.). I also participated in major IR campaigns
(NTCIR, CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...); for other information search the web ;-)

Samir



-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, December 7, 2007 21:01
To: java-dev@lucene.apache.org
Subject: Re: O/S Search Comparisons

Yes, and even if they did not use the stock defaults, I would bet  
there
would be complaints about what was done wrong at every turn. This  
seems
like a very difficult thing to do. How long does it take to fully  
learn
how to correctly utilize each search engine for the task at hand? I  
am
sure longer than these busy men could possibly take. It seems that  
such
a comparison could only be done legitimately if experts for each  
search

engine set up the indexing/searching 

RE: O/S Search Comparisons

2007-12-07 Thread Samir Abdou
There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears". That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross-analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded. There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with Zettair is much faster than indexing with
Lucene, but if we consider searching time, Lucene is better than Zettair.
Why? For many reasons, but probably because Zettair doesn't have the complex
document structure of Lucene, besides the ranking algorithm (Okapi BM25 vs.
tf-idf). Some systems compute and store the scores at indexing time, which
makes them faster at searching time but less flexible if you want to
change/implement a new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider search quality, that's simply not true if we know how to
implement popular ranking algorithms such as Okapi BM25 (at least) in Lucene.
I've been working with Lucene for four years now; all the experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models,
divergence from randomness, etc.). I also participated in major IR campaigns
(NTCIR, CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...); for other information search the web ;-)

Samir
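
As a rough illustration of the kind of adaptation being described (this is a
sketch, not Samir's actual code): a BM25-flavoured term-frequency saturation
and idf can be approximated by overriding DefaultSimilarity, although proper
BM25 document-length normalization needs more information than the 2.x
Similarity API passes to lengthNorm():
{noformat}
import org.apache.lucene.search.DefaultSimilarity;

// Sketch only: BM25-style tf saturation and idf on top of Lucene 2.x's
// Similarity extension points.  k1 is the usual BM25 saturation parameter;
// the length-normalization part (b) is left out because Similarity only
// sees a token count, not the average document length.
public class Bm25LikeSimilarity extends DefaultSimilarity {

    private static final float K1 = 1.2f;

    // Saturating term frequency: freq * (k1 + 1) / (freq + k1)
    public float tf(float freq) {
        return freq * (K1 + 1.0f) / (freq + K1);
    }

    // Okapi-style idf; the +1 keeps it positive for very common terms.
    public float idf(int docFreq, int numDocs) {
        return (float) Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }
}
{noformat}
Such a class would be installed with IndexWriter.setSimilarity(...) at indexing
time and Searcher.setSimilarity(...) at search time.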


 -Original Message-
 From: Mark Miller [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 7, 2007 21:01
 To: java-dev@lucene.apache.org
 Subject: Re: O/S Search Comparisons
 
 Yes, and even if they did not use the stock defaults, I would bet there
 would be complaints about what was done wrong at every turn. This seems
 like a very difficult thing to do. How long does it take to fully learn
 how to correctly utilize each search engine for the task at hand? I am
 sure longer than these busy men could possibly take. It seems that such
 a comparison could only be done legitimately if experts for each search
 engine set up the indexing/searching processes. Even then the results
 seem like they could be difficult to measure...eg was each search
 engine
 configured so that they would only break on spaces for indexing and do
 nothing else special at all? So many small settings and knowledge need
 to ensure each engine is on level ground...
 
 I doubt it will ever happen, but some sort of open source search off
 would be pretty cool <g>. Then each camp could properly configure their
 search engine for each task.
 
 - Mark
 
 Mike Klaas wrote:
  There is a good chance that they were using stock indexing defaults,
  based on:
 
  Lucene:
  "In the present work, the simple applications bundled with the library
  were used to index the collection."
 
  On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
 
  Yeah, I wasn't too excited over it and I certainly didn't lose any
  sleep over it, but there are some interesting things of note in
 there
  concerning Lucene, including the claim that it fell over on indexing
  WT10g docs (page 40) and I am always looking for ways to improve
  things.  Overall, I think Lucene held up pretty well in the
  evaluation, and I know how suspect _any_ evaluation is given the
  myriad ways of doing search.  Still, when a well-respected
 researcher
  in the field says Lucene didn't do so hot in certain areas, I don't
  think we can dismiss them out of hand.   So regardless of the tests
  being right or wrong, they are worth either addressing the failures
  in Lucene or the failures in the test such that we make sure we are
  properly educating our users on how best to use Lucene.
 
  I emailed the authors asking for information on how the test was run
  etc., so we'll see if anything comes of it.
 
  On Dec 7, 2007, at 12:04 PM, robert engels wrote:
 
  I wouldn't get too excited over this. Once again, it does not seem
  the evaluator understands the nature of GC based systems, and the
  memory statistics are quite out of whack. But it is hard to tell
  because there is no data on how memory consumption was actually
  measured.
 
  A far better way of measuring memory consumption is to cap the
  process at different levels (max ram sizes), and compare the
  performance at each level.
 
  There is also the fact that a process 

[jira] Created: (LUCENE-1084) increase default maxFieldLength?

2007-12-07 Thread Daniel Naber (JIRA)
increase default maxFieldLength?


 Key: LUCENE-1084
 URL: https://issues.apache.org/jira/browse/LUCENE-1084
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Daniel Naber


To my understanding, Lucene 2.3 will easily index large documents. So shouldn't 
we get rid of the 10,000 default limit for the field length? 10,000 isn't that 
much and as Lucene doesn't have any error logging by default, this is a common 
problem for users that is difficult to debug if you don't know where to look.

A better new default might be Integer.MAX_VALUE.
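
For context, the current behaviour and the usual workaround look roughly like
this (a sketch; the path and analyzer are just placeholders):
{noformat}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class MaxFieldLengthExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),
                new StandardAnalyzer(),
                true /* create */);

        // By default only the first 10,000 tokens of each field are indexed
        // (IndexWriter.DEFAULT_MAX_FIELD_LENGTH); the rest are silently dropped.
        // Users who want everything indexed must raise the limit explicitly:
        writer.setMaxFieldLength(Integer.MAX_VALUE);

        // ... add documents ...
        writer.close();
    }
}
{noformat}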





Re: O/S Search Comparisons

2007-12-07 Thread Mark Miller
Yes, and even if they did not use the stock defaults, I would bet there 
would be complaints about what was done wrong at every turn. This seems 
like a very difficult thing to do. How long does it take to fully learn 
how to correctly utilize each search engine for the task at hand? I am 
sure longer than these busy men could possibly take. It seems that such 
a comparison could only be done legitimately if experts for each search 
engine set up the indexing/searching processes. Even then the results 
seem like they could be difficult to measure...eg was each search engine 
configured so that they would only break on spaces for indexing and do 
nothing else special at all? So many small settings and knowledge need 
to ensure each engine is on level ground...


I doubt it will ever happen, but some sort of open source search off 
would be pretty cool <g>. Then each camp could properly configure their 
search engine for each task.


- Mark

Mike Klaas wrote:
There is a good chance that they were using stock indexing defaults, 
based on:


Lucene:
"In the present work, the simple applications bundled with the library were
used to index the collection."

On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:

Yeah, I wasn't too excited over it and I certainly didn't lose any 
sleep over it, but there are some interesting things of note in there 
concerning Lucene, including the claim that it fell over on indexing 
WT10g docs (page 40) and I am always looking for ways to improve 
things.  Overall, I think Lucene held up pretty well in the 
evaluation, and I know how suspect _any_ evaluation is given the 
myriad ways of doing search.  Still, when a well-respected researcher 
in the field says Lucene didn't do so hot in certain areas, I don't 
think we can dismiss them out of hand.   So regardless of the tests 
being right or wrong, they are worth either addressing the failures 
in Lucene or the failures in the test such that we make sure we are 
properly educating our users on how best to use Lucene.


I emailed the authors asking for information on how the test was run 
etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem 
the evaluator understands the nature of GC based systems, and the 
memory statistics are quite out of whack. But it is hard to tell 
because there is no data on how memory consumption was actually 
measured.


A far better way of measuring memory consumption is to cap the 
process at different levels (max ram sizes), and compare the 
performance at each level.


There is also the fact that a process takes memory from disk cache, and 
vice versa, which heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system 
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor systems. 
Some systems (like Lucene) pay a penalty to support multi-processing 
(both in Java and Lucene), and only realize this benefit when 
operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem likely 
candidates for inspecting their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:

Was wondering if people have seen 
http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf


Has some interesting comparisons.  Obviously, the comparison of 
Lucene indexing is done w/ 1.9 so it probably needs to be done 
again.  Just wondering if people see any opportunities to improve 
Lucene from it.  I am going to try and contact the authors to see 
if I can get what their setup values were (mergeFactor, Analyzer, 
etc.) as I think it would be interesting to run the tests again on 
2.3.


-Grant






Re: O/S Search Comparisons

2007-12-07 Thread Mike Klaas
There is a good chance that they were using stock indexing defaults,  
based on:


Lucene:
"In the present work, the simple applications bundled with the library were
used to index the collection."

On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:

Yeah, I wasn't too excited over it and I certainly didn't lose any  
sleep over it, but there are some interesting things of note in  
there concerning Lucene, including the claim that it fell over on  
indexing WT10g docs (page 40) and I am always looking for ways to  
improve things.  Overall, I think Lucene held up pretty well in the  
evaluation, and I know how suspect _any_ evaluation is given the  
myriad ways of doing search.  Still, when a well-respected  
researcher in the field says Lucene didn't do so hot in certain  
areas, I don't think we can dismiss them out of hand.   So  
regardless of the tests being right or wrong, they are worth either  
addressing the failures in Lucene or the failures in the test such  
that we make sure we are properly educating our users on how best  
to use Lucene.


I emailed the authors asking for information on how the test was  
run etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem  
the evaluator understands the nature of GC based systems, and the  
memory statistics are quite out of whack. But it is hard to tell  
because there is no data on how memory consumption was actually  
measured.


A far better way of measuring memory consumption is to cap the  
process at different levels (max ram sizes), and compare the  
performance at each level.


There is also the fact that a process takes memory from disk cache,  
and vice versa, which heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system  
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor  
systems. Some systems (like Lucene) pay a penalty to support multi- 
processing (both in Java and Lucene), and only realize this  
benefit when operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem  
likely candidates for inspecting their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:

Was wondering if people have seen http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf


Has some interesting comparisons.  Obviously, the comparison of  
Lucene indexing is done w/ 1.9 so it probably needs to be done  
again.  Just wondering if people see any opportunities to improve  
Lucene from it.  I am going to try and contact the authors to  
see if I can get what their setup values were (mergeFactor,  
Analyzer, etc.) as I think it would be interesting to run the  
tests again on 2.3.


-Grant



 



[jira] Commented: (LUCENE-1083) JDiff report of changes between different versions of Lucene

2007-12-07 Thread Matt Doar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549526
 ] 

Matt Doar commented on LUCENE-1083:
---

As an aside, Maven repositories in general could usefully be enhanced to
record this kind of information for a project, so that you could query them
for the current release, prior patch release, prior minor release, and prior
major release. For Lucene at 2.2.0, that would give 2.2.0, 2.1.x (non-existent,
so it defaults to 2.1.0?), 2.1.0 and 1.9.1. And then allow users to override
what each value is for a specific release of a project.








[jira] Commented: (LUCENE-1083) JDiff report of changes between different versions of Lucene

2007-12-07 Thread Matt Doar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549475
 ] 

Matt Doar commented on LUCENE-1083:
---

Grant,

I was imagining more that the release process for Lucene could be changed so 
that whoever is creating a release also runs JDiff to produce the HTML reports, 
which they then post on the website with the usual Javadocs, and possibly 
include the report in the released package, say in a directory named 
docs/changes next to where docs/api currently is.

The Maven javadoc task looks like it should be able to create the reports. I 
guess I should see if I can come up with a working example of that.



