Hudson build is back to normal : Solr-3.x #33

2010-06-08 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Solr-3.x/33/



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2492) Make PulsingCodec (wrapping StandardCodec) the default codec

2010-06-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876577#action_12876577
 ] 

Andrzej Bialecki  commented on LUCENE-2492:
---

How about adding some metadata to SegmentInfos ... if we figure out how to 
proceed with LUCENE-2491 then SegmentInfos could keep the list of codecs per 
file plus their init args.

 Make PulsingCodec (wrapping StandardCodec) the default codec
 

 Key: LUCENE-2492
 URL: https://issues.apache.org/jira/browse/LUCENE-2492
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0


 PulsingCodec can provide good gains by inlining the postings into the terms 
 dict for rare terms.  This is especially helpful for primary key like fields, 
 since every term is rare and batch lookups are common (see 
 http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html 
 for a simple perf test), but it should also be a gain for ordinary fields, 
 thanks to Zipf's law.
 I think we should make it the default
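
 To make the inlining idea concrete, here is a minimal toy sketch. It is not
 Lucene's codec API: ToyTermsDict, Entry and the cutoff parameter are all
 hypothetical names used for illustration. Terms whose docFreq is at or below
 the cutoff keep their postings inside the dictionary entry, so a lookup needs
 no second read; frequent terms store only a pointer into a separate postings
 list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy terms dictionary: rare terms keep postings inline, others point into a postings "file". */
final class ToyTermsDict {
    /** One dictionary entry; exactly one of inlineDocs / postingsOffset is in use. */
    static final class Entry {
        final int docFreq;
        final int[] inlineDocs;     // non-null when the postings were inlined
        final long postingsOffset;  // -1 when the postings were inlined

        Entry(int docFreq, int[] inlineDocs, long postingsOffset) {
            this.docFreq = docFreq;
            this.inlineDocs = inlineDocs;
            this.postingsOffset = postingsOffset;
        }
    }

    private final int cutoff;                                   // hypothetical inlining threshold
    private final Map<String, Entry> terms = new TreeMap<>();
    private final List<int[]> postingsFile = new ArrayList<>(); // stands in for a separate file

    ToyTermsDict(int cutoff) { this.cutoff = cutoff; }

    void add(String term, int[] docs) {
        if (docs.length <= cutoff) {
            // Rare term: store the doc ids directly in the dictionary entry.
            terms.put(term, new Entry(docs.length, docs.clone(), -1));
        } else {
            // Frequent term: store the doc ids elsewhere and keep only a pointer.
            postingsFile.add(docs.clone());
            terms.put(term, new Entry(docs.length, null, postingsFile.size() - 1));
        }
    }

    /** Returns the postings; seeks[0] is incremented when a second lookup was needed. */
    int[] postings(String term, int[] seeks) {
        Entry e = terms.get(term);
        if (e == null) return new int[0];
        if (e.inlineDocs != null) return e.inlineDocs;
        seeks[0]++;
        return postingsFile.get((int) e.postingsOffset);
    }
}
```

 With a cutoff of 1, a primary-key term is answered from the dictionary entry
 alone, while a common term costs one extra lookup.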

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [BUG/ISSUE] Distributed Search doesn't response the result set when use existing lucene index

2010-06-08 Thread Koji Sekiguchi
Are you sure you have a uniqueKey field in your Lucene index?  
Distributed search needs it.


Koji Sekiguchi from mobile


On 2010/06/08, at 15:52, Scott Zhang macromars...@gmail.com wrote:


Hi all,

I am coming from the solr-user mailing list. I have a problem with  
distributed search that looks like a bug in Solr itself.

I am trying to use Solr to search over 2 Lucene indexes. I followed  
the Solr tutorial and tested the distributed search example, and it  
works.

Then I used my own Lucene indexes. Searching each Solr instance on  
its own works and returns the expected results. But when I do a  
distributed search using shards, the response only reports  
numFound=14 and the result list is empty.

None of the docs in my existing Lucene indexes are returned by  
distributed search, but docs inserted with Solr's post.jar are  
returned successfully. I don't know why; it looks like the docs in my  
Lucene indexes differ somehow from the ones Solr writes.

My situation is that I already have 72 index folders that occupy a  
lot of disk, and reposting them to Solr would take a very long time,  
so I have to stick with my existing indexes. Is there a solution for  
this?



Thanks.
Regards.





Re: [BUG/ISSUE] Distributed Search doesn't response the result set when use existing lucene index

2010-06-08 Thread Scott Zhang
Hi Koji,

I am not sure how to set a uniqueKey field in my Lucene index. I am creating
the index with Lucene.Net:

Document doc = new Document();
doc.Add(new Field("id", product_obj.product_id.ToString(),
    Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("type", "product", Field.Store.YES,
    Field.Index.UN_TOKENIZED));
Field bField = new Field("keyword_level1", product_obj.title,
    Field.Store.NO, Field.Index.ANALYZED);
bField.SetBoost(10.0F);
doc.Add(bField);
// keyword_level1
doc.Add(new Field("keyword_level1", product_obj.sku, Field.Store.NO,
    Field.Index.NOT_ANALYZED));
if (product_obj.is_zuup)
{
    doc.Add(new Field("keyword_level1", "zuup", Field.Store.NO,
        Field.Index.NOT_ANALYZED));
}


Regards.
Scott






Re: [BUG/ISSUE] Distributed Search doesn't response the result set when use existing lucene index

2010-06-08 Thread Scott Zhang
I checked my existing Lucene indexes. All of the ID fields are stored but
not indexed. I don't want to rebuild these indexes, as that will take days.
Can Solr be changed a little to allow the ID field to be unindexed?


Thanks.
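
A toy model may help show why a stored-but-unindexed ID could produce a
non-zero numFound with an empty result list. It assumes (this is my
understanding, not something stated in this thread) that distributed search
runs in two phases: shards first report the ids of their hits, then the
coordinator asks each shard to fetch documents by id, which requires looking
the id up as an indexed term. ToyShard and its methods are hypothetical names,
not Solr classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy shard: every document stores its id, but the id is searchable only if it was indexed. */
final class ToyShard {
    private final List<String> storedIds = new ArrayList<>();
    private final Map<String, Integer> idIndex = new HashMap<>(); // id term -> doc number
    private final boolean idIndexed;

    ToyShard(boolean idIndexed) { this.idIndexed = idIndexed; }

    void add(String id) {
        storedIds.add(id);
        if (idIndexed) idIndex.put(id, storedIds.size() - 1);
    }

    /** Phase 1: run the query (here: match everything) and return the stored ids of the hits. */
    List<String> topIds() { return new ArrayList<>(storedIds); }

    /** Phase 2: fetch documents by id; this needs a term lookup on the id field. */
    List<String> fetchByIds(List<String> ids) {
        List<String> docs = new ArrayList<>();
        for (String id : ids) {
            Integer docNum = idIndex.get(id);
            if (docNum != null) docs.add("doc#" + storedIds.get(docNum));
        }
        return docs;
    }
}
```

In this model a shard built with idIndexed=false still reports hits in phase
1, but phase 2 comes back empty, which matches the symptom described above.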






[jira] Updated: (SOLR-1943) Disable clustering contrib in Solr trunk

2010-06-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-1943:


Attachment: SOLR-1943.patch

This patch effectively adds a readme file and renames build.xml. Will commit 
soon, to be able to go forward with LUCENE-2484.

 Disable clustering contrib in Solr trunk
 

 Key: SOLR-1943
 URL: https://issues.apache.org/jira/browse/SOLR-1943
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Attachments: SOLR-1943.patch


 With LUCENE-2484, Lucene's trunk API changed incompatibly. As the clustering 
 contrib depends on an older carrot2 jar file compiled against an older 
 version of Lucene (3.0), the tests failed to run (TermAttribute class 
 removed).
 We should be able to change the APIs in trunk without forcing external 
 projects like carrot2 to update their internals to work with Lucene trunk.
 The attached patch will simply rename build.xml to build.xml.disabled, so 
 the module is no longer built. After we create a release branch out of 
 trunk, we can simply enable it again after upgrading the carrot2 jar files.




[jira] Resolved: (SOLR-1943) Disable clustering contrib in Solr trunk

2010-06-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved SOLR-1943.
-

Resolution: Fixed

Committed revision: 952613




[jira] Updated: (LUCENE-2495) Add In/Out/putStream wrapper around Lucene IndexIn/Out/put

2010-06-08 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-2495:
--

Attachment: lucene-iostream.tar

Added classes:
LuceneDirectoryInputStream and
LuceneDirectoryOutputStream

that decorate the IndexInput and IndexOutput classes.

 Add In/Out/putStream wrapper around Lucene IndexIn/Out/put
 --

 Key: LUCENE-2495
 URL: https://issues.apache.org/jira/browse/LUCENE-2495
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Reporter: John Wang
 Attachments: lucene-iostream.tar


 Lucene Directory is an abstraction that builds IndexInput and IndexOutput 
 instances. Sometimes it is useful to add in custom files in the index 
 directory for custom searching.
 It is often useful in that case to have some sort of bridge between this and 
 code that understands the regular Java InputStream/OutputStream classes.
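
 As a rough illustration of such a bridge (this is not the attached patch), the
 sketch below adapts a random-access byte reader to java.io.InputStream. To
 stay runnable without Lucene on the classpath it uses a hypothetical interface,
 RandomAccessBytes, that mirrors the handful of IndexInput methods the adapter
 needs; IndexInputStream and ByteArrayBytes are likewise made-up names.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the subset of Lucene's IndexInput used by the adapter. */
interface RandomAccessBytes {
    byte readByte() throws IOException;
    void readBytes(byte[] b, int offset, int len) throws IOException;
    long getFilePointer();
    long length();
    void close() throws IOException;
}

/** Adapts a RandomAccessBytes to the regular java.io.InputStream contract. */
final class IndexInputStream extends InputStream {
    private final RandomAccessBytes in;

    IndexInputStream(RandomAccessBytes in) { this.in = in; }

    private long remaining() { return in.length() - in.getFilePointer(); }

    @Override
    public int read() throws IOException {
        if (remaining() <= 0) return -1;   // InputStream signals end of stream with -1
        return in.readByte() & 0xFF;       // bytes must be returned as 0..255
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) return 0;
        long left = remaining();
        if (left <= 0) return -1;
        int n = (int) Math.min(len, left); // never read past the end of the input
        in.readBytes(b, off, n);
        return n;
    }

    @Override
    public int available() {
        return (int) Math.min(Integer.MAX_VALUE, Math.max(0, remaining()));
    }

    @Override
    public void close() throws IOException { in.close(); }
}

/** In-memory implementation, used only to try the adapter without an index. */
final class ByteArrayBytes implements RandomAccessBytes {
    private final byte[] data;
    private int pos;

    ByteArrayBytes(byte[] data) { this.data = data; }

    public byte readByte() { return data[pos++]; }
    public void readBytes(byte[] b, int offset, int len) {
        System.arraycopy(data, pos, b, offset, len);
        pos += len;
    }
    public long getFilePointer() { return pos; }
    public long length() { return data.length; }
    public void close() { }
}
```

 The main design point is translating "read past the end is an error" into the
 InputStream convention of returning -1 at end of stream.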




[jira] Created: (LUCENE-2495) Add In/Out/putStream wrapper around Lucene IndexIn/Out/put

2010-06-08 Thread John Wang (JIRA)
Add In/Out/putStream wrapper around Lucene IndexIn/Out/put
--

 Key: LUCENE-2495
 URL: https://issues.apache.org/jira/browse/LUCENE-2495
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Reporter: John Wang
 Attachments: lucene-iostream.tar

Lucene Directory is an abstraction that builds IndexInput and IndexOutput 
instances. Sometimes it is useful to add in custom files in the index directory 
for custom searching.

It is often useful in that case to have some sort of bridge between this and 
code that understands the regular Java InputStream/OutputStream classes.




Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Hi Shai:

I am not sure I understand how changing Similarity would solve this
problem. Wouldn't you need the reader?

As for PayloadTermQuery, payload is not always the most efficient way of
storing such data, especially when number of terms  numdocs. (I am not
sure accessing the payload when you iterate is a good idea, but that is
another discussion)

Yes, what I described is exactly a simple CustomScoreQuery for a special
use-case. The problem is also in CustomScoreQuery, where nextDoc and advance
are calling the sub-scorers as a wrapper. This can be avoided if the Scorer
returns an iterator instead.
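
To make that concrete, here is a minimal sketch of the shape being proposed. It
deliberately does not use Lucene's classes: DocIterator stands in for
DocIdSetIterator, ProposedScorer for the changed Scorer, and DecayingScorer,
ArrayScorer and the decay factor are hypothetical names for illustration. The
decorator overrides only score(int) and hands out the inner iterator as-is, so
nextDoc and advance are not wrapped.

```java
/** Hypothetical stand-in for Lucene's DocIdSetIterator. */
abstract class DocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int nextDoc();
    abstract int advance(int target);
}

/** The Scorer shape from the proposal: iteration and scoring are separate. */
abstract class ProposedScorer {
    abstract DocIterator getDocIDSetIterator();
    abstract float score(int docId);
}

/** Decorator that changes only the score; the iterator is returned undecorated. */
final class DecayingScorer extends ProposedScorer {
    private final ProposedScorer inner;
    private final float decay; // hypothetical constant standing in for a real decay function

    DecayingScorer(ProposedScorer inner, float decay) {
        this.inner = inner;
        this.decay = decay;
    }

    @Override DocIterator getDocIDSetIterator() { return inner.getDocIDSetIterator(); }
    @Override float score(int docId) { return inner.score(docId) * decay; }
}

/** Trivial scorer over a sorted doc id array, used only to exercise the decorator. */
final class ArrayScorer extends ProposedScorer {
    private final int[] docs;

    ArrayScorer(int[] docs) { this.docs = docs; }

    @Override DocIterator getDocIDSetIterator() {
        return new DocIterator() {
            private int i = -1;
            @Override int nextDoc() {
                if (i < docs.length) i++;
                return i < docs.length ? docs[i] : NO_MORE_DOCS;
            }
            @Override int advance(int target) {
                int d = nextDoc();
                while (d < target) d = nextDoc();
                return d;
            }
        };
    }

    @Override float score(int docId) { return 1.0f; }
}
```

The caller iterates on the object returned by the innermost scorer, and only
the score call passes through the decorator.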

Separating scoring and doc iteration is a good idea anyway. I don't know
why they were combined originally.

Thanks

-John

On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera ser...@gmail.com wrote:

 So wouldn't it make sense to add some method to Similarity? Which receives
 the doc Id in question maybe ... just thinking here.

 Factoring Scorer like you propose would create 3 objects for
 scoring/iterating: Scorer (which really becomes an iterator), Similarity and
 CustomScoreFunction ...

 Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how you
 compute your age decay function (where you pull the data about the age of
 the document).

 Shai


 On Tue, Jun 8, 2010 at 6:41 PM, John Wang john.w...@gmail.com wrote:

 Hi Shai:

 Similarity in many cases is not sufficient for scoring. For example,
 to implement age decaying of a document (very useful for corpuses like news
 or tweets), you want to project the raw tfidf score onto a time curve, say
 f(x), to do this, you'd have a custom scorer that decorates the underlying
 scorer from your say, boolean query:

 public float score(){
 return myFunc(innerScorer.score());
 }

 This is fine, but then you would have to do this as well:

 public int nextDoc(){
return innerScorer.nextDoc();
 }

 and also:

 public int advance(int target){
    return innerScorer.advance(target);
 }

  The difference here is that nextDoc and advance are called far more
 often than score, and you are introducing an extra method call for each of
 them, which is not insignificant for queries that produce large recall sets.

 Hope this makes sense.

 Thanks

 -John

 On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera ser...@gmail.com wrote:

 I'm not sure I understand what you mean - Scorer is a DISI itself, and
 the scoring formula is mostly controlled by Similarity.

 What will be the benefits of the proposed change?

 Shai


 On Tue, Jun 8, 2010 at 8:25 AM, John Wang john.w...@gmail.com wrote:

 Hi guys:

 I'd like to make a proposal to change the Scorer class/api to the
 following:


 public abstract class Scorer{
DocIdSetIterator getDocIDSetIterator();
float score(int docid);
 }

 Reasons:

 1) To build a Scorer from an existing Scorer (e.g. that produces raw
 scores from tfidf), one would decorate it, and it would introduce overhead
 (in function calls) around nextDoc and advance, even if you just want to
 augment the score method, which is called far less often.

 2) The current contract forces scoring on the currentDoc in the
 underlying iterator. So once you pass the current doc, you can no longer
 score it. In one of our use cases, this is very inconvenient.

 What do you think? I can go ahead and open an issue and work on a patch
 if I get some agreement.

 Thanks

 -John







Re: any lucene 2.9.3 RC already available?

2010-06-08 Thread jm
thanks! found it already

On Tue, Jun 8, 2010 at 6:33 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 It's gene...@lucene.apache.org -- that list has been around forever.
 We are generally (heh) supposed to do the release vote on general,
 but in the past we have also done it on dev.

 If you search on Lucid, you'll find this thread and the VOTE thread on 
 general:

    http://www.lucidimagination.com/search/?q=vote+release+2.9.3

 Mike

 On Tue, Jun 8, 2010 at 12:22 PM, jm jmugur...@gmail.com wrote:
 thanks Mike. which list is that? In the past I have seen these sort of
 threads in this list. I guess it changed with the merge and now I am
 missing some other list?

 On Tue, Jun 8, 2010 at 6:19 PM, Michael McCandless
 luc...@mikemccandless.com wrote:
 Yes, see the thread [VOTE] Apache Lucene Java 2.9.3 and 3.0.2
 artifacts to be released on the general list; it has links to the RC
  changes.

 Mike

 On Tue, Jun 8, 2010 at 12:11 PM, jm jmugur...@gmail.com wrote:
 Hi,

 I think I read 2.9.3 was about to be released soon. I am chasing some
 memory issue in our process and it looks like
 https://issues.apache.org/jira/browse/LUCENE-2467 could be a culprit,
 so is there already any RC I could try? It does not matter if it's not
 official yet.

 thanks
 javi




Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
re: But Scorer is itself an iterator, so what prevents you from calling
nextDoc and advance on it without score()

Nothing. It is just inefficient to pay the method call overhead just to
override score.

re: If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
is not good enough I'd just develop my own.

That is what I am doing. I am just proposing the change (see my first email)
as an improvement.

re: Scorer is itself an iterator

yes, that is the current definition. The point of the proposal is to make
this change.

-John

On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera ser...@gmail.com wrote:

 Well … I don't know the reason as well and always thought Scorer and
 Similarity are confusing.

 But Scorer is itself an iterator, so what prevents you from calling
 nextDoc and advance on it without score(). And what would the returned
 DISI do when nextDoc is called, if not delegate to its subs?

 If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
 is not good enough I'd just develop my own.

 But perhaps others think differently?

 Shai





Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
I guess I must be missing something fundamental here :).

If Scorer is defined as you propose, and I create my Scorer which impls
getDISI() as return this - what do I lose? What's wrong w/ Scorer already
being a DISI?

You mention it is just inefficient to pay the method call overhead ... -
what overhead? Are you talking about the decorator delegating the call to
the wrapped scorer? I really think the compiler can handle that, no?
Especially if you make your nextDoc/advance final (which probably you
should) ...
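
For reference, the decorator under the current contract looks roughly like the
sketch below. The names are hypothetical stand-ins rather than Lucene's real
classes: IteratingScorer models a Scorer that is itself the iterator, and
BoostingScorer is a final class whose nextDoc/advance do nothing but forward.
Whether the JIT actually inlines those forwarding calls in a given query is an
assumption; the sketch shows the shape, not a measurement.

```java
/** Hypothetical stand-in for the current contract, where the scorer is also the iterator. */
abstract class IteratingScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int nextDoc();
    abstract int advance(int target);
    abstract float score(); // scores the doc the iterator is currently positioned on
}

/** Decorator that only wants to change score() but still has to forward iteration. */
final class BoostingScorer extends IteratingScorer {
    private final IteratingScorer inner;
    private final float boost; // hypothetical stand-in for a custom score function

    BoostingScorer(IteratingScorer inner, float boost) {
        this.inner = inner;
        this.boost = boost;
    }

    // The class is final, so these forwarding methods cannot be overridden further.
    @Override int nextDoc() { return inner.nextDoc(); }
    @Override int advance(int target) { return inner.advance(target); }
    @Override float score() { return inner.score() * boost; }
}

/** Minimal scorer over a sorted doc id array, used only to exercise the decorator. */
final class FixedDocsScorer extends IteratingScorer {
    private final int[] docs;
    private int i = -1;

    FixedDocsScorer(int[] docs) { this.docs = docs; }

    @Override int nextDoc() {
        if (i < docs.length) i++;
        return i < docs.length ? docs[i] : NO_MORE_DOCS;
    }

    @Override int advance(int target) {
        int d = nextDoc();
        while (d < target) d = nextDoc();
        return d;
    }

    @Override float score() { return 1.0f; }
}
```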

That doesn't seem to justify an API change that breaks backward compatibility
completely (even if we do it in 4.0 only) and changes all the current Scorers ...

Shai


Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
The problem with your proposal is that, currently, Lucene uses current
iteration state to compute score.
I.e. it already knows which of SHOULD BQ clauses matched for current
doc, so it's easier to calculate the score.
If you change API to allow scoring arbitrary documents (even those
that didn't match the query at all), you're opening a can of worms :)

As an alternative, you can try looking at MG4J sources. As far as I
understand, their scoring is decoupled from matching, just like you
(and I bet many more people) want. The matcher is separate, and the
scoring entity accepts current matcher state instead of document id,
so you get the best of both worlds.
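
A toy sketch of that arrangement follows. It is not MG4J's API (I have not
checked their interfaces); MatchState, StateScorer and ToyDisjunctionMatcher
are hypothetical names. The matcher only iterates and records which optional
clauses matched the current doc, and the scoring entity receives that state
rather than a bare document id.

```java
import java.util.BitSet;

/** Hypothetical snapshot of what the matcher knows about the current document. */
final class MatchState {
    int docId = -1;
    final BitSet matchedClauses = new BitSet(); // which optional clauses matched docId
}

/** Scoring entity: it sees matcher state, never a bare doc id. */
interface StateScorer {
    float score(MatchState state);
}

/** Disjunction over sorted doc id lists; iteration only, no scoring. */
final class ToyDisjunctionMatcher {
    private final int[][] clauses;
    private final int[] pos;
    private final MatchState state = new MatchState();

    ToyDisjunctionMatcher(int[][] clauses) {
        this.clauses = clauses;
        this.pos = new int[clauses.length];
    }

    /** Advances to the next matching doc; returns null when all clauses are exhausted. */
    MatchState next() {
        // Find the smallest doc id any clause is currently positioned on.
        int min = Integer.MAX_VALUE;
        for (int c = 0; c < clauses.length; c++) {
            if (pos[c] < clauses[c].length) min = Math.min(min, clauses[c][pos[c]]);
        }
        if (min == Integer.MAX_VALUE) return null;

        // Record which clauses matched that doc and move them forward.
        state.docId = min;
        state.matchedClauses.clear();
        for (int c = 0; c < clauses.length; c++) {
            if (pos[c] < clauses[c].length && clauses[c][pos[c]] == min) {
                state.matchedClauses.set(c);
                pos[c]++;
            }
        }
        return state;
    }
}
```

A scorer can then be as small as a function of the state, for example one that
returns the number of matched clauses, and it can never be asked to score a
document the matcher did not produce.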

On Tue, Jun 8, 2010 at 21:01, John Wang john.w...@gmail.com wrote:
 re: But Scorer is itself an iterator, so what prevents you from calling
 nextDoc and advance on it without score()

 Nothing. It is just inefficient to pay the method call overhead just to
 overload score.

 re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
 is not good enough I'd just develop my own.

 That is what I am doing. I am just proposing the change (see my first email)
 as an improvement.

 re: Scorer is itself an iterator

 yes, that is the current definition. The point of the proposal is to make
 this change.

 -John

 On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera ser...@gmail.com wrote:

 Well … I don't know the reason as well and always thought Scorer and
 Similarity are confusing.

 But Scorer is itself an iterator, so what prevents you from calling
 nextDoc and advance on it without score(). And what would the returned
 DISI do when nextDoc is called, if not delegate to its subs?

 If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
 is not good enough I'd just develop my own.

 But perhaps others think differently?

 Shai

 On Tuesday, June 8, 2010, John Wang john.w...@gmail.com wrote:
  Hi Shai:
      I am not sure I understand how changing Similarity would solve this
  problem, wouldn't you need the reader?
      As for PayloadTermQuery, payload is not always the most efficient
  way of storing such data, especially when number of terms  numdocs. (I am
  not sure accessing the payload when you iterate is a good idea, but that is
  another discussion)
 
      Yes, what I described is exactly a simple CustomScoreQuery for a
  special use-case. The problem is also in CustomScoreQuery, where nextDoc 
  and
  advance are calling the sub-scorers as a wrapper. This can be avoided if 
  the
  Scorer returns an iterator instead.
 
      Separating scoring and doc iteration is a good idea anyway. I don't
  know the reason to combine them originally.
  Thanks
  -John
 
 
  On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera ser...@gmail.com wrote:
 
  So wouldn't it make sense to add some method to Similarity? Which
  receives the doc Id in question maybe ... just thinking here.
 
  Factoring Scorer like you propose would create 3 objects for
  scoring/iterating: Scorer (which really becomes an iterator), Similarity 
  and
  CustomScoreFunction ...
 
  Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how you
  compute your age decay function (where you pull the data about the age of
  the document).
 
  Shai
 
 
  On Tue, Jun 8, 2010 at 6:41 PM, John Wang john.w...@gmail.com wrote:
  Hi Shai:
      Similarity in many cases is not sufficient for scoring. For example,
  to implement age decay of a document (very useful for corpuses like news
  or tweets), you want to project the raw tfidf score onto a time curve, say
  f(x). To do this, you'd have a custom scorer that decorates the underlying
  scorer from, say, your boolean query:
 
  public float score() { return myFunc(innerScorer.score()); }
 
      This is fine, but then you would have to do this as well:
 
  public int nextDoc() { return innerScorer.nextDoc(); }
 
  and also:
 
  public int advance(int target) { return innerScorer.advance(target); }
 
  The difference here is that nextDoc and advance are called far more often
  than score, and you are introducing an extra method call for each of them,
  which is not insignificant for queries that result in large recall sets.
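
The decorator described above can be sketched in a self-contained form. This is only an illustration: SketchScorer, ArrayScorer and DecayingScorer are hypothetical stand-ins rather than Lucene classes, and the constant decay factor stands in for myFunc.

```java
// Stand-in for Lucene's Scorer, reduced to the three calls discussed here.
abstract class SketchScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int nextDoc();
    abstract int advance(int target);
    abstract float score();
}

// Toy inner scorer over a sorted array of doc ids; every hit scores 1.0f.
final class ArrayScorer extends SketchScorer {
    private final int[] docs;
    private int pos = -1;

    ArrayScorer(int[] docs) { this.docs = docs; }

    @Override int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    @Override int advance(int target) {
        int doc;
        do { doc = nextDoc(); } while (doc < target);
        return doc;
    }

    @Override float score() { return 1.0f; }
}

// Decorator: only score() adds behaviour, but iteration must still be forwarded.
final class DecayingScorer extends SketchScorer {
    private final SketchScorer inner;
    private final float decay; // hypothetical constant standing in for myFunc

    DecayingScorer(SketchScorer inner, float decay) {
        this.inner = inner;
        this.decay = decay;
    }

    @Override int nextDoc() { return inner.nextDoc(); }                 // pure forwarding
    @Override int advance(int target) { return inner.advance(target); } // pure forwarding
    @Override float score() { return inner.score() * decay; }           // the only real work
}

final class DecoratorDemo {
    // Drives the scorer the way a collecting loop would: one nextDoc per hit.
    static float sumScores(SketchScorer scorer) {
        float total = 0f;
        while (scorer.nextDoc() != SketchScorer.NO_MORE_DOCS) {
            total += scorer.score();
        }
        return total;
    }

    public static void main(String[] args) {
        SketchScorer scorer = new DecayingScorer(new ArrayScorer(new int[] {1, 4, 9}), 0.5f);
        System.out.println(sumScores(scorer));
    }
}
```

Every hit pays for one forwarded nextDoc call in addition to the score call, which is the overhead the message is concerned with.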
 
 
 
  Hope this makes sense.
  Thanks
  -John
  On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera ser...@gmail.com wrote:
  I'm not sure I understand what you mean - Scorer is a DISI itself, and
  the scoring formula is mostly controlled by Similarity.
 
  What will be the benefits of the proposed change?
 
  Shai
 
  On Tue, Jun 8, 2010 at 8:25 AM, John Wang john.w...@gmail.com wrote:
 
 
 
 
  Hi guys:
 
      I'd like to make a proposal to change the Scorer class/api to the
  following:
 
 
  public abstract class Scorer {
     public abstract DocIdSetIterator getDocIDSetIterator();
     public abstract float score(int docid);
  }
 
  Reasons:
 
  1) To build a Scorer from an existing Scorer (e.g. that produces raw
  scores from tfidf), one would decorate it, and it would introduce overhead
  (in function calls) around nextDoc and advance, even if you just 

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
Shai, his wrapper Scorer will just look like:
DISI getDISI() {
  return delegate.getDISI();
}

float score(int doc) {
  return calcMyAwesomeScore(doc);
}

This saves the delegate.nextDoc() and delegate.advance() indirection calls.
But I already offered a better alternative :)
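
A fuller sketch of the wrapper above, assuming the proposed split between iteration and scoring. DocIterator, SplitScorer and the scoring formula are hypothetical names made up for this illustration; none of this is existing Lucene API.

```java
// Hypothetical stand-in for DocIdSetIterator (DISI).
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();
    int docID();
}

// Proposed shape: the scorer hands out an iterator and scores by doc id.
abstract class SplitScorer {
    abstract DocIterator iterator();
    abstract float score(int doc);
}

// Toy iterator over a sorted array of doc ids.
final class ArrayIterator implements DocIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayIterator(int[] docs) { this.docs = docs; }

    public int nextDoc() {
        pos++;
        return docID();
    }

    public int docID() {
        if (pos < 0) return -1; // not positioned yet
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

// Toy base scorer: every document scores 1.0f.
final class BaseSplitScorer extends SplitScorer {
    private final DocIterator it;

    BaseSplitScorer(int[] docs) { this.it = new ArrayIterator(docs); }

    @Override DocIterator iterator() { return it; }
    @Override float score(int doc) { return 1.0f; }
}

// Wrapper: hands back the delegate's iterator once, so iteration is not forwarded per doc.
final class WrappingSplitScorer extends SplitScorer {
    private final SplitScorer delegate;

    WrappingSplitScorer(SplitScorer delegate) { this.delegate = delegate; }

    @Override DocIterator iterator() { return delegate.iterator(); }

    // Hypothetical formula: lower doc ids score higher.
    @Override float score(int doc) { return delegate.score(doc) / (1 + doc); }
}

final class SplitDemo {
    static float sumScores(SplitScorer scorer) {
        DocIterator it = scorer.iterator(); // fetched once, outside the loop
        float total = 0f;
        for (int doc = it.nextDoc(); doc != DocIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            total += scorer.score(doc);
        }
        return total;
    }

    public static void main(String[] args) {
        SplitScorer scorer = new WrappingSplitScorer(new BaseSplitScorer(new int[] {0, 1, 3}));
        System.out.println(sumScores(scorer));
    }
}
```

The loop talks to the innermost iterator directly; only score(int) goes through the wrapper.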

On Tue, Jun 8, 2010 at 21:09, Shai Erera ser...@gmail.com wrote:
 I guess I must be missing something fundamental here :).

 If Scorer is defined as you propose, and I create my Scorer which impls
 getDISI() as return this - what do I lose? What's wrong w/ Scorer already
 being a DISI?

 You mention it is just inefficient to pay the method call overhead ... -
 what overhead? Are you talking about the decorator delegating the call to
 the wrapped scorer? I really think the compiler can handle that, no?
 Especially if you make your nextDoc/advance final (which probably you
 should) ...

 That doesn't seem to justify an API change, break bw completely (even if we
 do it in 4.0 only) and change all the current Scorers ...

 Shai

 On Tue, Jun 8, 2010 at 8:01 PM, John Wang john.w...@gmail.com wrote:

 re: But Scorer is itself an iterator, so what prevents you from calling
 nextDoc and advance on it without score()

 Nothing. It is just inefficient to pay the method call overhead just to
 override score.

 re: If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
 is not good enough I'd just develop my own.

 That is what I am doing. I am just proposing the change (see my first
 email) as an improvement.

 re: Scorer is itself an iterator

 yes, that is the current definition. The point of the proposal is to make
 this change.

 -John

 On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera ser...@gmail.com wrote:

 Well … I don't know the reason either, and always thought Scorer and
 Similarity are confusing.

 But Scorer is itself an iterator, so what prevents you from calling
 nextDoc and advance on it without score()? And what would the returned
 DISI do when nextDoc is called, if not delegate to its subs?

 If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
 is not good enough I'd just develop my own.

 But perhaps others think differently?

 Shai


Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Shai:

The method call overhead in this case is not insignificant because it is in
a very tight loop, and no, the compiler cannot optimize it away for you:
there is no inlining here because we are in the Java world.

 You are right, this breaks backward compatibility. But from 2.4 to 2.9,
we have done MUCH worse. :)

-John

On Tue, Jun 8, 2010 at 10:09 AM, Shai Erera ser...@gmail.com wrote:

 I guess I must be missing something fundamental here :).

 If Scorer is defined as you propose, and I create my Scorer which impls
 getDISI() as return this - what do I lose? What's wrong w/ Scorer already
 being a DISI?

 You mention it is just inefficient to pay the method call overhead ... -
 what overhead? Are you talking about the decorator delegating the call to
 the wrapped scorer? I really think the compiler can handle that, no?
 Especially if you make your nextDoc/advance final (which probably you
 should) ...

 That doesn't seem to justify an API change, break bw completely (even if we
 do it in 4.0 only) and change all the current Scorers ...

 Shai



Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Shai:

   Java cannot inline in this case.

   Actually, there is an urban legend about using final to hint to the
underlying compiler to inline :) (it turns out to be false, one reason being
dynamic classloading).

   Write a simple program and see for yourself (remember to turn on
-server in the VM options).
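
One possible shape for such a test program is sketched below. All names (Counter, DirectCounter, ForwardingCounter, InlineProbe) are made up for the illustration. Hand-rolled timing loops like this are easily distorted by JIT warm-up and call-site profiling, so the numbers are only a rough indication; a dedicated harness such as JMH is likely to give more trustworthy results.

```java
// Minimal interface so the call site is a virtual (interface) call.
interface Counter {
    int next();
}

// Does the work directly.
final class DirectCounter implements Counter {
    private int value;
    public int next() { return ++value; }
}

// Adds one level of delegation, like a decorating scorer would.
final class ForwardingCounter implements Counter {
    private final Counter inner;
    ForwardingCounter(Counter inner) { this.inner = inner; }
    public int next() { return inner.next(); }
}

final class InlineProbe {
    // Tight loop that calls next() 'calls' times and sums the results.
    static long drain(Counter counter, int calls) {
        long sum = 0;
        for (int i = 0; i < calls; i++) {
            sum += counter.next();
        }
        return sum;
    }

    // Wall-clock time for one drain; the sum is checked so the loop cannot be discarded.
    static long timeNanos(Counter counter, int calls) {
        long start = System.nanoTime();
        long sum = drain(counter, calls);
        long elapsed = System.nanoTime() - start;
        if (sum <= 0) throw new IllegalStateException("unexpected sum");
        return elapsed;
    }

    public static void main(String[] args) {
        int calls = 50_000_000;
        for (int round = 0; round < 5; round++) { // early rounds mostly measure warm-up
            long direct = timeNanos(new DirectCounter(), calls);
            long forwarded = timeNanos(new ForwardingCounter(new DirectCounter()), calls);
            System.out.println("round " + round + ": direct=" + direct
                    + "ns forwarded=" + forwarded + "ns");
        }
    }
}
```

Comparing the later rounds with and without -server gives a first impression of whether the extra delegation is visible on a given JVM.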

-John

On Tue, Jun 8, 2010 at 10:28 AM, Shai Erera ser...@gmail.com wrote:

 What do you mean we are not inlining? The compiler inlines methods .. at
 least it tries.

 Shai



Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Wouldn't you get it as well with the proposed api?
You would still be able to iterate the docs and at that point call score with
the docid. If you call score() along with iteration, you would still get the
information, no?
Making the scorer take a docid allows you to score any docid in the reader if
the query wants it to. Wouldn't that make it more flexible?

-John

On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot ear...@gmail.com wrote:

 To compute a score you have to see which of your subqueries did not
 match, which did, and what are the docfreqs/positions for them.
 When iterating, and calling score() only for current doc - parts of
 this data (maybe even all of it, not sure) is already gathered for
 you. If you allow calling score(int doc) - for arbitrary docId, you'll
 have to redo this work.

 2010/6/8 John Wang john.w...@gmail.com:
  Hi Earwin:
 
    I am not sure I understand here, e.g. what is the difference
 between:
 
    float myscorinCode(){
        return computeMyScore(scorer.score());
    }
 
    and
 
    float myscorinCode(){
        return computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID()));
    }
 
    In the case of BQ, when you get a hit, would you still be able to call
  subscorer.score(hit)? Why is the point of iteration important for BQ?
 
    Please elaborate.
 
  Thanks
 
  -John
 
  On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot ear...@gmail.com
 wrote:
 
  The problem with your proposal is that, currently, Lucene uses current
  iteration state to compute score.
  I.e. it already knows which of SHOULD BQ clauses matched for current
  doc, so it's easier to calculate the score.
  If you change API to allow scoring arbitrary documents (even those
  that didn't match the query at all), you're opening a can of worms :)
 
  As an alternative, you can try looking at MG4J sources. As far as I
  understand, their scoring is decoupled from matching, just like you
  (and I bet many more people) want. The matcher is separate, and the
  scoring entity accepts current matcher state instead of document id,
  so you get the best of both worlds.
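
To make the "scoring entity accepts current matcher state" idea concrete, here is a small sketch. It is not MG4J's actual API; MatchState, StateScorer and OrMatcher are hypothetical names, and the coord-like formula is only an example.

```java
// What the matcher knows about the current document.
final class MatchState {
    int doc = -1;
    final boolean[] matched; // which SHOULD clauses hit the current doc

    MatchState(int clauses) { this.matched = new boolean[clauses]; }
}

// Scoring is a function of the matcher state, not of a bare doc id.
interface StateScorer {
    float score(MatchState state);
}

// Toy disjunction over sorted postings arrays, one array per clause.
final class OrMatcher {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[][] postings;
    private final int[] pos; // cursor into each postings array
    final MatchState state;

    OrMatcher(int[][] postings) {
        this.postings = postings;
        this.pos = new int[postings.length];
        this.state = new MatchState(postings.length);
    }

    private int head(int clause) {
        return pos[clause] < postings[clause].length
                ? postings[clause][pos[clause]]
                : NO_MORE_DOCS;
    }

    // Moves to the smallest pending doc id and records which clauses matched it.
    int nextDoc() {
        int min = NO_MORE_DOCS;
        for (int c = 0; c < postings.length; c++) {
            min = Math.min(min, head(c));
        }
        state.doc = min;
        for (int c = 0; c < postings.length; c++) {
            boolean hit = min != NO_MORE_DOCS && head(c) == min;
            state.matched[c] = hit;
            if (hit) pos[c]++;
        }
        return min;
    }
}

final class StateScoringDemo {
    // Example scorer: fraction of clauses that matched the current doc.
    static final StateScorer COORD = state -> {
        int hits = 0;
        for (boolean m : state.matched) {
            if (m) hits++;
        }
        return (float) hits / state.matched.length;
    };

    static float sumScores(OrMatcher matcher, StateScorer scorer) {
        float total = 0f;
        while (matcher.nextDoc() != OrMatcher.NO_MORE_DOCS) {
            total += scorer.score(matcher.state); // no re-matching needed
        }
        return total;
    }

    public static void main(String[] args) {
        OrMatcher matcher = new OrMatcher(new int[][] {{1, 3}, {3, 5}});
        System.out.println(sumScores(matcher, COORD));
    }
}
```

Because the scorer only reads state the matcher has already gathered, it cannot be asked about a document the matcher is not positioned on, which sidesteps the arbitrary-docid problem.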
 

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
Some people don't do IO while searching at all. When you're over a
certain qps/index size threshold, you need fewer nodes to keep all your
index (or its hot parts) in memory than to keep the combined IO subsystem
throughput high enough to satisfy disk-based search demands.

2010/6/9 Doron Cohen cdor...@gmail.com:
 I too tend to ignore the overhead of delegated calls, especially compared
 to all the other IO ops and computations done by the stack of scorers. But
 accepting that you cannot ignore it, could you achieve the same goal by
 sub-classing the top query, where you subclass its weight to return a
 sub-class of its scorer which would only override score() but not the other
 methods, and in score() would apply that e.g. decay logic? This way no
 delegation is required for the other methods. A disadvantage of this is that
 you would need to subclass like this any kind of top level query that might
 come up in your app - so not sure if this is really acceptable in your case.
 Another disadvantage is that this is much more complicated code to write.
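
 A minimal sketch of the subclassing idea, leaving out the query and weight plumbing. PlainScorer and DecaySubclassScorer are hypothetical stand-ins, not Lucene classes, and the constant decay factor is only a placeholder for real decay logic.

```java
// Stands in for the scorer the top-level query already produces.
class PlainScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    PlainScorer(int[] docs) { this.docs = docs; }

    int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int advance(int target) {
        int doc;
        do { doc = nextDoc(); } while (doc < target);
        return doc;
    }

    float score() { return 2.0f; }
}

// Overrides only score(); nextDoc and advance are inherited, so nothing is forwarded.
final class DecaySubclassScorer extends PlainScorer {
    private final float decay; // hypothetical constant decay factor

    DecaySubclassScorer(int[] docs, float decay) {
        super(docs);
        this.decay = decay;
    }

    @Override float score() { return super.score() * decay; }
}

final class SubclassDemo {
    static float sumScores(PlainScorer scorer) {
        float total = 0f;
        while (scorer.nextDoc() != PlainScorer.NO_MORE_DOCS) {
            total += scorer.score();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumScores(new DecaySubclassScorer(new int[] {2, 5}, 0.25f)));
    }
}
```

 The cost noted above still applies: each kind of top-level scorer needs its own subclass, whereas a decorator can wrap any scorer.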

 Doron


Hudson build is back to normal : Lucene-3.x #36

2010-06-08 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-3.x/36/


