search quality - assessment & improvements

2007-06-25 Thread Doron Cohen
hi, this could probably split into two threads but for context let's start it in a single discussion; Recently I was looking at the search quality of Lucene - Recall and Precision, focused at P@1,5,10,20 and, mainly, MAP. -- Part 1 -- I found out that quality can be enhanced by mo
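
The metrics named in this message, precision at a cutoff (P@k) and mean average precision (MAP), can be sketched as follows. This is only an illustration of the standard textbook definitions, not the evaluation code discussed in the thread; the class name `QualityMetrics`, its method names, and the document ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class: standard definitions of P@k, AP and MAP.
final class QualityMetrics {

    /** Fraction of the top k ranked documents that are judged relevant. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        // Divide by k, not by the result count: missing results count as misses.
        return (double) hits / k;
    }

    /** Average of the precision values taken at the rank of each relevant hit. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        // Relevant documents that were never retrieved contribute zero.
        return sum / relevant.size();
    }

    /** Mean of the per-query average precision values. */
    static double meanAveragePrecision(List<List<String>> runs, List<Set<String>> judgments) {
        if (runs.size() != judgments.size()) {
            throw new IllegalArgumentException("one judgment set is required per query");
        }
        if (runs.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < runs.size(); q++) {
            sum += averagePrecision(runs.get(q), judgments.get(q));
        }
        return sum / runs.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and judgments, for illustration only.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println("P@2 = " + precisionAtK(ranked, relevant, 2));
        System.out.println("AP  = " + averagePrecision(ranked, relevant));
    }
}
```

Dividing by k rather than by the number of returned results is one common convention; evaluation tools differ on this point, so the choice here is an assumption.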

Re: search quality - assessment & improvements

2007-06-25 Thread Grant Ingersoll
Just to throw in a few things: First off, this is great! As I am sure you are aware: https://issues.apache.org/jira/browse/LUCENE-836 On Jun 25, 2007, at 3:15 AM, Doron Cohen wrote: hi, this could probably split into two threads but for context let's start it in a single discussion; R

[jira] Created: (LUCENE-942) TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice

2007-06-25 Thread Aaron Isotton (JIRA)
TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice --- Key: LUCENE-942 URL: https://issues.apache.org/jira/browse/LUCENE-942 Project: Lucene - Jav

[EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed

2007-06-25 Thread Jason van Zyl
To whom it may engage... This is an automated request, but not an unsolicited one. For more information please visit http://gump.apache.org/nagged.html, and/or contact the folk at [EMAIL PROTECTED] Project lucene-java has an issue affecting its community integration. This issue affects

Re: search quality - assessment & improvements

2007-06-25 Thread Doug Cutting
Doron Cohen wrote: It is very important that we would be able to assess the search quality in a repeatable manner - so that anyone can repeat the quality tests, and maybe find ways to improve them. (This would also allow to verify the "improvements claims" above...). This capability seems like a

Re: search quality - assessment & improvements

2007-06-25 Thread Doron Cohen
Hey Grant, thanks for your comments! Grant Ingersoll wrote: > As I am sure you are aware: https://issues.apache.org/jira/browse/LUCENE-836 I remembered you mentioning setting our own doc/query judgment system but forgot it was in LUCENE-836, thanks for the reminder. > On Jun 25, 2007, at 3:1

Re: search quality - assessment & improvements

2007-06-25 Thread Grant Ingersoll
On Jun 25, 2007, at 2:19 PM, Doron Cohen wrote: IANAL and I didn't read the link, but I think people publish their MAP scores, etc. all the time on TREC data. I think it implies that you obtained the data through legal means. So you're saying that if person "X" got the TREC data legally, we

Re: search quality - assessment & improvements

2007-06-25 Thread Grant Ingersoll
On Jun 25, 2007, at 2:04 PM, Doug Cutting wrote: Doron Cohen wrote: It is very important that we would be able to assess the search quality in a repeatable manner - so that anyone can repeat the quality tests, and maybe find ways to improve them. (This would also allow to verify the "impro

Re: search quality - assessment & improvements

2007-06-25 Thread Marvin Humphrey
On Jun 25, 2007, at 11:56 AM, Grant Ingersoll wrote: To do this, we could use Reuters or Wikipedia. The hard part is generating the queries and having people make relevance judgments for a sufficient sample size. Wikipedia is a moving target. I think the collection would have to be sta

Re: search quality - assessment & improvements

2007-06-25 Thread Grant Ingersoll
Yes you are correct, we could use the specific version that we use for benchmarking. I was assuming that one, just didn't say it! :-) -Grant On Jun 25, 2007, at 3:00 PM, Marvin Humphrey wrote: On Jun 25, 2007, at 11:56 AM, Grant Ingersoll wrote: To do this, we could use Reuters or Wikipe

Re: search quality - assessment & improvements

2007-06-25 Thread Doug Cutting
Marvin Humphrey wrote: Wikipedia is a moving target. I think the collection would have to be static. In theory, one can evaluate against other search engines results for Wikipedia. However this may violate their EULAs... Doug ---

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-25 Thread Doron Cohen
Michael McCandless wrote: > OK, when you say "fair" I think you mean because you already had a > previous run that used compound file, you had to use compound file in > the run with the LUCENE-843 patch (etc)? Yes, that's true. > The recommendations above should speed up Lucene with or without m

Re: search quality - assessment & improvements

2007-06-25 Thread Chris Hostetter
: For the first change, logic is that Lucene's default length normalization : punishes long documents too much. I found contrib's sweet-spot-similarity : helpful here, but not enough. I found that a better doc-length : normalization method is one that considers collection statistics - e.g. : avera

[jira] Commented: (LUCENE-942) TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice

2007-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508018 ] Hoss Man commented on LUCENE-942: - this seems like both a documentation issue, and a bad state checking issue. the j

[jira] Resolved: (LUCENE-933) QueryParser can produce empty sub BooleanQueries when Analyzer produces no tokens for input

2007-06-25 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-933. Resolution: Fixed Lucene Fields: [Patch Available] (was: [New]) committed the backwards-comp

[jira] Commented: (LUCENE-942) TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice

2007-06-25 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508025 ] Doron Cohen commented on LUCENE-942: Perhaps simpler to make the scoreDocs[] array a private data member, which n

[jira] Commented: (LUCENE-942) TopDocCollector.topDocs throws ArrayIndexOutOfBoundsException when called twice

2007-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508039 ] Hoss Man commented on LUCENE-942: - that makes sense ... but there is still a state issue of "don't call topDocs() un
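
The two ideas raised in the LUCENE-942 comments above, keeping the result array as a private member and guarding against calls made in the wrong state, can be sketched roughly as below. This is not the LUCENE-942 patch and does not use Lucene's classes; `CachingTopCollector` and `Hit` are hypothetical names, and `java.util.PriorityQueue` stands in for whatever queue the real collector uses. The assumption behind the sketch is that draining the queue on every call is what makes a second call fail.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for a top-N hit collector; not Lucene's TopDocCollector.
final class CachingTopCollector {

    /** A (document id, score) pair. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int numHits;

    // Min-heap on score, so the weakest retained hit is evicted first.
    private final PriorityQueue<Hit> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    // Private result array, filled on the first topDocs() call and reused afterwards.
    private Hit[] cached;

    CachingTopCollector(int numHits) {
        this.numHits = numHits;
    }

    void collect(int doc, float score) {
        if (cached != null) {
            // State check: results were already materialized.
            throw new IllegalStateException("collect() called after topDocs()");
        }
        queue.add(new Hit(doc, score));
        if (queue.size() > numHits) {
            queue.poll();
        }
    }

    /** Returns the hits, best score first; safe to call more than once. */
    Hit[] topDocs() {
        if (cached == null) {
            cached = new Hit[queue.size()];
            // Polling yields ascending scores, so fill the array from the back.
            for (int i = cached.length - 1; i >= 0; i--) {
                cached[i] = queue.poll();
            }
        }
        return cached.clone();
    }
}
```

Returning a copy keeps callers from modifying the cached array; whether that cost is acceptable is a design choice the comments above do not settle.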

[jira] Updated: (LUCENE-940) SimpleDateFormat used in a non thread safe manner

2007-06-25 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-940: --- Attachment: lucene-940.patch Attached patch fixing DateFormat for parallel "doc making". Also fixing
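
For context on the issue title: `java.text.SimpleDateFormat` instances are not synchronized, so sharing one across threads can corrupt results. One common remedy is a per-thread instance, sketched below. This is not necessarily what the attached lucene-940.patch does; the class name `ThreadSafeDates` and the date pattern are assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper: one SimpleDateFormat per thread instead of a shared instance.
final class ThreadSafeDates {

    // The pattern is an assumption chosen for illustration.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd", Locale.US));

    private ThreadSafeDates() {
    }

    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            // Rethrown unchecked to keep call sites short in this sketch.
            throw new IllegalArgumentException("unparseable date: " + text, e);
        }
    }
}
```

Each thread pays for its own formatter once; the alternatives are synchronizing on a shared instance or creating a new formatter per call.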

Re: updating the forrest docs?

2007-06-25 Thread Chris Hostetter
: I think it makes sense to move to 0.8. okay ... i removed the sitemap file, regenerated everything, and then read through the diff to ensure there was nothing broken/missing -- the diff seemed to be entirely related to tweaks to the skinning between 0.7 and 0.8. (but i'm still curious how michael h

[jira] Resolved: (LUCENE-936) Typo on query parser syntax web page.

2007-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved LUCENE-936. - Resolution: Fixed Assignee: Hoss Man thanks for spotting this... Committed revision 550680. > Ty

[jira] Commented: (LUCENE-933) QueryParser can produce empty sub BooleanQueries when Analyzer produces no tokens for input

2007-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508054 ] Hoss Man commented on LUCENE-933: - woops ... sorry doron, i actually reviewed these patches the other day, but apare

[jira] Commented: (LUCENE-933) QueryParser can produce empty sub BooleanQueries when Analyzer produces no tokens for input

2007-06-25 Thread Doron Cohen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508058 ] Doron Cohen commented on LUCENE-933: great, thanks Hoss! > QueryParser can produce empty sub BooleanQueries when