[jira] [Issue Comment Edited] (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080360#comment-13080360 ] Uwe Schindler edited comment on LUCENE-1768 at 8/6/11 5:57 AM: --- bq. I see some classes in Lucene use Version, but I don't know exactly how that works and why the standard query parser does not use it. Should it? Version is intended to be used for behavioural changes that affect index compatibility, so people can use new Lucene versions without reindexing. It does not help for API changes (it could sometimes, but only for cases where the API change amounts to: if versionA, call method a, else method b, where methods a and b trigger different APIs). Typical examples for Version are changes in tokenization (so most analyzers use it): when a bugfix in an analyzer produces different tokens than before, the version flag makes it possible to re-enable the buggy behaviour, so querying your index with the old tokens still works. The core query parser also uses it to change the behaviour of creating phrase queries (the flexible query parser is, as far as I know, still missing this). I am away this weekend; I will come back to you on Monday for the other questions. 
NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: core/queryparser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, week-7.patch, week-8.patch, week1.patch, week2.patch, week3.patch, week4.patch, week5-6.patch It would be good to specify some type of schema for the query parser in the future, to automatically create a NumericRangeQuery for the different numeric types. It would then be possible to index a numeric value (double, float, long, int) using NumericField; the query parser would then know which type the field is and correctly create a NumericRangeQuery for strings like [1.567..*] or (1.787..19.5]. There is currently no way to tell from the index whether a field is numeric, so the user will have to configure the FieldConfig objects in the ConfigHandler. But once this is done, it will not be that difficult to implement the rest. The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and the conversion of the entered numeric values (a simple Number.valueOf(...)-style conversion of the user-entered numbers). 
Everything else is identical; NumericRangeQuery also supports the MTQ rewrite modes (as it is an MTQ). Another thing is a change in Date semantics: there are some strange flags in the current parser that tell it how to handle dates. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
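The Version pattern Uwe describes above can be sketched in a few lines. This is an illustration only, with hypothetical names (the real Lucene Version is a richer enum, and real analyzers are more involved): an analyzer keeps its old, buggy tokenization when constructed with an older Version constant, so existing indexes keep matching.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the Version pattern: a hypothetical tokenizer that re-enables old
// (buggy) behaviour when built with an older Version, preserving index
// compatibility. Names here are illustrative, not the actual Lucene API.
public class VersionedTokenizerSketch {
    enum Version { LUCENE_29, LUCENE_31 }

    private final Version matchVersion;

    VersionedTokenizerSketch(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    List<String> tokenize(String text) {
        if (matchVersion.compareTo(Version.LUCENE_31) >= 0) {
            // fixed behaviour: split on any run of whitespace
            return Arrays.asList(text.trim().split("\\s+"));
        }
        // old buggy behaviour, kept so queries against old indexes still match:
        // splits on single spaces only, producing empty tokens on double spaces
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        System.out.println(new VersionedTokenizerSketch(Version.LUCENE_31).tokenize("a  b")); // [a, b]
        System.out.println(new VersionedTokenizerSketch(Version.LUCENE_29).tokenize("a  b")); // [a, , b]
    }
}
```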
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10015 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10015/ 1 tests failed. REGRESSION: org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload Error Message: Number of registered MBeans is not the same as info registry size expected:51 but was:45 Stack Trace: junit.framework.AssertionFailedError: Number of registered MBeans is not the same as info registry size expected:51 but was:45 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:134) Build Log (for compile errors): [...truncated 7951 lines...]
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it. Two questions remain on my side: * the one about discountOverlaps (see above) * what kind of index-time boosts do people usually use? Too big a boost might cause problems if we just divide the length by it. Maybe we should take the logarithm or something like that? Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring, core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc, gsoc2011 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_
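David's concern about large index-time boosts can be made concrete with a toy calculation. This is purely illustrative (neither the patch's code nor any Lucene API): dividing the document length by a large boost collapses the length signal, while a logarithmically dampened divisor keeps it in a sane range.

```java
// Illustration of the boost question above: a raw division by the boost lets
// an extreme boost (e.g. 1000) wipe out the length signal entirely, while a
// log-dampened divisor keeps the norm in a usable range. Hypothetical formulas.
public class BoostNormSketch {
    static double rawNorm(double docLength, double boost) {
        return docLength / boost;                   // boost 1000 makes every doc look tiny
    }
    static double dampenedNorm(double docLength, double boost) {
        return docLength / (1.0 + Math.log(boost)); // log dampens extreme boosts
    }
    public static void main(String[] args) {
        System.out.println(rawNorm(100, 1000));      // 0.1
        System.out.println(dampenedNorm(100, 1000)); // ~12.6
    }
}
```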
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Added a short explanation on the parameter for the Jelinek-Mercer method.
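For reference, the Jelinek-Mercer smoothing the parameter belongs to is conventionally written as a linear interpolation between the document language model and the collection model (standard formulation, not taken from the patch):

```latex
% Jelinek-Mercer smoothing: the parameter \lambda \in (0, 1) interpolates
% between the document model and the collection model.
p_\lambda(w \mid d) = (1 - \lambda)\,\frac{\mathrm{tf}(w, d)}{|d|} + \lambda\, p(w \mid C)
```

Small \(\lambda\) trusts the document; large \(\lambda\) smooths more heavily toward collection statistics.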
Re: Question about LUCENE-3097 - Post Group Faceting
The facet result for field productType will show the following counts: BOOK: 1 DVD: 0 So yes, because of post group faceting you'll miss the second facet. This is basically the same example I described in LUCENE-3097. I've also described three ways of calculating facet counts in combination with grouping. The third way, which I've named matrix counts (field value / group value combinations), would give the result that you expect. However, this isn't implemented yet. In Solr this would require changes in the FacetComponent. I hope this explains it a bit! Martijn On 5 August 2011 16:28, Joshua Harness jkharnes...@gmail.com wrote: Martijn - Thanks for the reply. I understand your answer about the segments. However, I'm still cloudy about faceting with respect to the group head. Perhaps an example will clarify my confusion. Suppose I have 3 order documents with the following data: orderNumber: 1 customerNumber: 1 totalInCents: 1500 productType: 'BOOK' orderNumber: 2 customerNumber: 1 totalInCents: 500 productType: 'BOOK' orderNumber: 3 customerNumber: 1 totalInCents: 1000 productType: 'DVD' Imagine I perform a search for items greater than or equal to 1000 cents, grouped by customer number. I would expect to get order numbers 1 and 3 back, grouped underneath the customer id. Let's assume that order number 1 is considered the most relevant document (in your scenario). Will the post group faceting miss that I actually have two facet values for productType: BOOK and DVD? Thanks! Josh On Fri, Aug 5, 2011 at 4:22 AM, Martijn v Groningen martijn.is.h...@gmail.com wrote: Hi Josh, For post grouping the documents don't need to reside in the same segment. Lucene's grouping module has a collector (TermAllGroupHeadsCollector) that can collect the most relevant document for each group (the group head). This collector can produce an int[] or a FixedBitSet that can be used during faceting to produce post group facets (the patch in SOLR-2665 uses this). 
During faceting only the group heads are known; because of this, field values that differ in documents less relevant than the most relevant document of a group aren't taken into account. This is the same as the example described in the description of LUCENE-3097. Hope this helps! Martijn On 4 August 2011 22:59, Joshua Harness jkharnes...@gmail.com wrote: Hello - Please let me know if this question is more appropriate for the user list. I had assumed the developer list was more appropriate since the ticket is still open. I was analyzing the comments on LUCENE-3097 https://issues.apache.org/jira/browse/LUCENE-3097 and had a couple of questions. A comment https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13033953&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13033953 started a small thread that mentioned that all documents in a given group would need to be contiguous and in the same segment. Also, a statement was made that 'The app would have to ensure this'. I was unclear on the result of this conversation. It sounded like maybe this could have turned out to not be the case. What is the status of this? Does my application have to ensure all the documents in the group are in the same segment? How would one accomplish this? Another comment https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13038297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13038297 mentioned that 'we pick only the head doc...as long as the head doc is guaranteed to have the same value for field X, it is safe to use that doc to represent the entire group for facet counting'. Does this mean that there is a restriction placed on me that the head document must have field values that match the rest of the documents in the same group? Or is this simply an implementation detail that uses the head document when this condition is the case or chooses another strategy when this is not the case? 
I am very interested in adopting this patch. However - I am attempting to understand any limitations/conditions so that I may use it correctly. Any advice would be greatly appreciated. Thanks! Josh Harness -- Met vriendelijke groet, Martijn van Groningen -- Met vriendelijke groet, Martijn van Groningen
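The exchange above can be simulated in a few lines: only the group head (the most relevant document per group) contributes to facet counts, so a value that appears only in a less relevant group member is missed. The data mirrors Josh's three orders; using the order total as the relevance score is an assumption made for the demo, and none of this is the actual grouping-module code.

```java
import java.util.*;

// Simulation of post-group (group-head) faceting: one head per customer is
// picked, and only heads are counted, so DVD ends up at 0 even though order 3
// matched the query. Pure illustration, not Lucene's TermAllGroupHeadsCollector.
public class GroupHeadFacetSketch {
    record Order(int orderNumber, int customerNumber, int totalInCents, String productType) {}

    static Map<String, Integer> groupHeadFacets(List<Order> hits) {
        // pick one head per customer: the hit with the highest total (assumed relevance)
        Map<Integer, Order> heads = new HashMap<>();
        for (Order o : hits) {
            heads.merge(o.customerNumber(), o,
                (a, b) -> a.totalInCents() >= b.totalInCents() ? a : b);
        }
        // facet over the heads only
        Map<String, Integer> counts = new TreeMap<>(Map.of("BOOK", 0, "DVD", 0));
        for (Order head : heads.values()) {
            counts.merge(head.productType(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Order> hits = List.of(                 // query: totalInCents >= 1000
            new Order(1, 1, 1500, "BOOK"),
            new Order(3, 1, 1000, "DVD"));
        System.out.println(groupHeadFacets(hits));  // {BOOK=1, DVD=0}
    }
}
```

Matrix counts, the third method Martijn mentions, would instead count every (field value, group) combination and report DVD: 1.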
[jira] [Updated] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Tankovic updated LUCENE-2308: Attachment: LUCENE-2308-21.patch 21st patch :) Fixed Javadocs errors. Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, LUCENE-2308.patch This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. 
We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
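A minimal sketch of the refactoring Michael describes: the per-field flags move into a reusable FieldType, and Field holds only the name and value. These class shapes are hypothetical; the eventual Lucene API may differ.

```java
// Sketch of LUCENE-2308's idea: index/store/analyze flags factored out of
// Field into a FieldType that is shared across many Field instances. The
// field names and flag set here are illustrative only.
public class FieldTypeSketch {
    static class FieldType {
        final boolean indexed, stored, analyzed, omitNorms;
        FieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
            this.indexed = indexed; this.stored = stored;
            this.analyzed = analyzed; this.omitNorms = omitNorms;
        }
    }
    static class Field {
        final String name, value;
        final FieldType type;          // one FieldType instance, reused per field
        Field(String name, String value, FieldType type) {
            this.name = name; this.value = value; this.type = type;
        }
    }
    public static void main(String[] args) {
        FieldType textType = new FieldType(true, true, true, false);
        Field title = new Field("title", "Lucene in Action", textType);
        Field body  = new Field("body",  "...", textType);
        System.out.println(title.type == body.type); // true: the type is shared
    }
}
```

Note this is not a schema: nothing here would be serialized into the index, exactly as the issue description says.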
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3357: Attachment: LUCENE-3357.patch Integration tests added. There are two of them; however, ant test only runs one? Unit and integration test cases for the new Similarities Key: LUCENE-3357 URL: https://issues.apache.org/jira/browse/LUCENE-3357 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Priority: Minor Labels: gsoc, gsoc2011, test Fix For: flexscoring branch Attachments: LUCENE-3357.patch Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created: * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations; * integration tests, in which a small collection is indexed and then searched using the Similarities. Performance tests will be performed in a separate issue.
[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080416#comment-13080416 ] David Mark Nemeskey edited comment on LUCENE-3357 at 8/6/11 3:52 PM: - Integration tests added. There are two of them; however, ant test runs only one?
[jira] [Created] (SOLR-2700) transaction logging
transaction logging --- Key: SOLR-2700 URL: https://issues.apache.org/jira/browse/SOLR-2700 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.
[jira] [Updated] (SOLR-2700) transaction logging
[ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2700: --- Attachment: SOLR-2700.patch Here's a draft patch. There is a tlog.number file created for each commit. The javabin format is used to serialize SolrInputDocuments. An in-memory map of pointers into the log is kept for documents not yet soft-committed, and the realtime-get component checks that first before using SolrCore.getNewestSearcher(). Seems to work for getting documents not in the newest searcher so far. Tons of stuff left to do:
- the tlog files are currently in the CWD
- need to handle deletes
- need to handle flushes in a performant way
- need to implement optional fsync for durability on power-failure
- would be nice to make some of this multi-threaded for better performance
- need to implement durability (apply updates from logs on startup)
- need to implement some form of cleanup for transaction logs
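The realtime-get path Yonik describes can be modeled in miniature: updates are appended to a log, an in-memory map keeps a pointer (offset) into the log for every document not yet visible to the newest searcher, and lookups consult that map first. Everything below is an in-memory toy, not the patch: the real draft serializes SolrInputDocuments in javabin format to tlog files on disk.

```java
import java.util.*;

// Toy model of SOLR-2700's realtime-get path: id -> offset pointers into an
// append-only log, checked before falling back to the "searcher" view.
public class TransactionLogSketch {
    private final List<String> log = new ArrayList<>();            // append-only "tlog"
    private final Map<String, Integer> pointers = new HashMap<>(); // id -> log offset
    private final Map<String, String> searcher = new HashMap<>();  // docs already visible

    void add(String id, String doc) {
        pointers.put(id, log.size());
        log.add(doc);
    }
    void softCommit() {            // make pending docs visible; pointers no longer needed
        for (Map.Entry<String, Integer> e : pointers.entrySet()) {
            searcher.put(e.getKey(), log.get(e.getValue()));
        }
        pointers.clear();
    }
    String realtimeGet(String id) { // check the pointer map before the searcher
        Integer offset = pointers.get(id);
        return offset != null ? log.get(offset) : searcher.get(id);
    }
    public static void main(String[] args) {
        TransactionLogSketch tlog = new TransactionLogSketch();
        tlog.add("1", "doc-one");
        System.out.println(tlog.realtimeGet("1")); // doc-one, served from the log
        tlog.softCommit();
        System.out.println(tlog.realtimeGet("1")); // doc-one, now via the searcher
    }
}
```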
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 10036 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/10036/ 2 tests failed. REGRESSION: org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByDirName Error Message: No such core: core1 Stack Trace: org.apache.solr.common.SolrException: No such core: core1 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:104) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByDirName(MergeIndexesExampleTestBase.java:129) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1335) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1240) REGRESSION: org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByCoreName Error Message: No such core: core1 Stack Trace: org.apache.solr.common.SolrException: No such core: core1 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:104) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByCoreName(MergeIndexesExampleTestBase.java:145) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1335) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1240) Build Log (for compile errors): [...truncated 14180 lines...]
[jira] [Created] (LUCENE-3364) Add score threshold into Scorer.score()
Add score threshold into Scorer.score() --- Key: LUCENE-3364 URL: https://issues.apache.org/jira/browse/LUCENE-3364 Project: Lucene - Java Issue Type: Improvement Components: core/query/scoring Reporter: John Wang This is an optimization for scoring. Consider a Scorer.score() implementation where features are gathered to calculate a score. Proposal: add a parameter to score, e.g. score(float threshold). This threshold is the minimum score to beat to make it into the current PriorityQueue. This could potentially save a great deal of wasted calculation in cases where recall is large. In our case specifically, some of the features needed for the calculation can be expensive to obtain; it would be nice to have a place to decide whether fetching these features is even necessary. Also, if we know the score would be low, the threshold can simply be returned. Let me know if this makes sense and I can work on a patch.
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080437#comment-13080437 ] Yonik Seeley commented on LUCENE-3364: -- Perhaps it would be easiest to just create a Collector that cuts off based on score?
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080439#comment-13080439 ] John Wang commented on LUCENE-3364: --- Hi Yonik: In a Collector, the decision to cut off a score would come too late, e.g. float score = Scorer.score(); // this is where the cost would occur boolean cutOff = decide(score); In my example, my score impl is: float s1 = cheapCalc(docid); float s2 = expensiveCalc(docid); return s1+s2; So now if I know expensiveCalc is bounded by N, and cheapCalc returns a very small number, I can simply skip the s2 calculation because this doc would be thrown out anyway. Hope I am making sense :) -John
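John's early-exit pattern can be sketched end to end. cheapCalc and expensiveCalc below are stand-ins for real feature computations, and the bound and threshold values are arbitrary; this is an illustration of the proposed score(float threshold) signature, not actual Lucene code.

```java
// Sketch of the proposed score(threshold): when the cheap feature plus a known
// upper bound on the expensive feature cannot beat the threshold, the
// expensive feature is never fetched and the threshold is returned instead.
public class ThresholdScorerSketch {
    static final float EXPENSIVE_BOUND = 1.0f;   // assumed upper bound on expensiveCalc
    static int expensiveCalls = 0;               // instrumentation for the demo

    static float cheapCalc(int docid)     { return (docid % 10) / 10.0f; }
    static float expensiveCalc(int docid) { expensiveCalls++; return 0.9f; }

    static float score(int docid, float threshold) {
        float s1 = cheapCalc(docid);
        if (s1 + EXPENSIVE_BOUND <= threshold) {
            return threshold;                    // can't make the queue: skip s2 entirely
        }
        return s1 + expensiveCalc(docid);
    }

    public static void main(String[] args) {
        System.out.println(score(1, 5.0f));      // 0.1 + 1.0 <= 5.0: expensive calc skipped
        System.out.println(score(9, 0.5f));      // 0.9 + 1.0 >  0.5: expensive calc runs
        System.out.println("expensive calls: " + expensiveCalls);
    }
}
```

A Collector sees the score only after Scorer.score() returns, which is why the threshold has to be pushed down into the scorer itself, as John argues.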
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080442#comment-13080442 ] Yonik Seeley commented on LUCENE-3364: -- Ah, gotcha - I see what you're saying now.
IndexReader.maxDoc() and other
Assuming there are no deletes, would the following work as a way to load the *last added document*, surviving optimize as well? The order of document ids in Lucene survives optimize, as far as I remember? IndexReader ir... int maxDoc = ir.maxDoc() - 1; if (maxDoc > 0) // ? What is the return value on an empty index, 0 or 1? Document d = ir.document(maxDoc); Would this correspond to the last committed document (at the commit point where the index reader was opened)? Or the last added document, including pending/uncommitted ones? (I am not getting the IndexReader from the IndexWriter, no NRT yet...) The problem I am trying to solve is incremental updates (there are no deletions). Having a unique, numerical uid stored in the index that increases with every add, I just need a way to find max(uid) at the last commit to get my delta from the database. The above solution was one of the options. 2. The second would be to iterate a TermsEnum for the uid field until I hit the end, but this sounds slow (even if I start skipping around like a monkey)? 3. The third option would be to index a reversed uid (HUGE_CONSTANT - uid), so it gets on top in the terms dictionary? 4. And finally, the last option I am thinking of would be to track max(uid) and write it as a user parameter with IndexWriter.commit(Map...), so I could read it easily (piggy-backing on the Lucene commit is as safe as it gets, better than persisting my own files...) I like the last option, but have no idea how to create a beforeCommitListener in solr? The most robust is 2/3, but maybe slow-ish (there are 100-200 million documents/UIDs). Any better ideas? (and no, the DIH wall clock timestamp is not good enough) I am talking about solr/lucene 4 trunk, we decided to take a risk :) Thanks, eks
Re: IndexReader.maxDoc() and other
On Sat, Aug 6, 2011 at 2:47 PM, eks dev eks...@yahoo.co.uk wrote: Assuming there are no deletes, would the following work as a way to load *last added document*, surviving optimize as well? Order of documentId-s in Lucene survives optimize as far as I remember? No longer... the default merge policy can now merge non-contiguous segments. You can of course still select a Log* merge policy, which never reorders ids with respect to each other. -Yonik http://www.lucidimagination.com
Re: IndexReader.maxDoc() and other
Thanks Yonik, assuming I am not going to index ID , than only an option 4. remains so far. I have no other ideas, and Log* merge policy would mean all 4 Indexing magic went to nothing :) Colud then the following do the job? clone DefaultIndexWriterProvider into my codebase (ugly, keep in sync , but doable) make it provide EnhancedSolrIndexWriter extends SolrIndexWriter @Override commit(...){ super.commit(MapString, String Core.getUserMap()); } the same with close(...) If yes, Is this feature something solr could use? MapString, String userParams somewhere in Core that gets committed with whatever it has at commit time. I could wrap up a patch by modifying SolrIndexWriter directly then? Nice thing about it, one could have possibility to keep small map of key value pairs in sync with commit points with all goods of TwoPhaseCommit... for no way for this to get out of sync things, like my use case below... I imagine DIH could use it as well - No longer... the default merge policy can now merge non-contiguous segments. You can of course still select a Log* merge policy, which never reorders ids with respect to each other. -Yonik http://www.lucidimagination.com From: eks dev eks...@yahoo.co.uk To: dev@lucene.apache.org Sent: Sat, 6 August, 2011 20:47:09 Subject: IndexReader.maxDoc() and other Assuming there are no deletes, would the following work as a way to load *last added document*, surviving optimize as well? Order of documentId-s in Lucene survives optimize as far as I remember? IndexReader ir... int maxDoc = ir.maxDoc() - 1; if(maxDoc0) //? What is the return value on empty index, 0 or 1? Document d = ir.getDocument(maxDoc); Would this correspond to the last committed document (at commit point where index reader was opened) Or last added document, including pending/uncommitted (I am not getting IndexReader from the IndexWriter, no nrt yet...) The problem I am trying to solve are incremental updates (there are no deletions). 
Having a unique, numerical uid stored in the index that increases with every add, I just need a way to find max(uid) at the last commit to get my delta from the database.

1. The solution above was one of the options.
2. The second would be to iterate a TermsEnum over the uid field until I hit the end, but this sounds slow (even if I start skipping around like a monkey)?
3. The third option would be to index a reversed uid (HUGE_CONSTANT - uid), so it ends up on top of the terms dictionary?
4. And finally, the last option I am thinking of would be to track max(uid) and write it as user data with IndexWriter.commit(Map...), so I could read it back easily (piggy-backing on the Lucene commit is as safe as it gets, better than persisting my own files...).

I like the last option, but have no idea how to create a beforeCommit listener in Solr. The most robust is 2/3, but maybe slow-ish (there are 100-200 million documents/uids). Any better ideas? (And no, the DIH wall-clock timestamp is not good enough.) I am talking about solr/lucene 4 trunk, we decided to take a risk :)

Thanks, eks
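Option 3 (the reversed uid) could be sketched as below. This is a self-contained illustration, not Lucene API: HUGE_CONSTANT and the zero-padding width are assumptions, and a TreeSet stands in for the lexicographically sorted terms dictionary, where the largest uid becomes the first term and could be found with a single seek instead of a full TermsEnum scan.

```java
import java.util.TreeSet;

// Sketch of the reversed-uid trick: index (HUGE_CONSTANT - uid), zero-padded
// so the terms sort numerically in the lexicographically ordered terms
// dictionary. The largest uid then maps to the smallest term, so it is the
// very first term of the field. HUGE_CONSTANT must exceed any possible uid.
class ReverseUid {
    static final long HUGE_CONSTANT = 9_999_999_999L; // illustrative bound

    static String toTerm(long uid) {
        return String.format("%010d", HUGE_CONSTANT - uid);
    }

    static long fromTerm(String term) {
        return HUGE_CONSTANT - Long.parseLong(term);
    }

    // The TreeSet simulates the sorted terms dictionary: the first term
    // corresponds to max(uid), no iteration over all terms needed.
    static long maxUid(TreeSet<String> termsDict) {
        return fromTerm(termsDict.first());
    }
}
```

With uids 3, 170 and 42 indexed, the first term of the simulated dictionary decodes back to 170.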
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080463#comment-13080463 ] Grant Ingersoll commented on LUCENE-2748:

Making some progress on this. Here's my intent: we start clean, with one website directory for all of our projects (Lucene, Solr, PyLucene and ORP). I'm more or less copying the layout of Mahout, which copied Open For Biz. It's a lot cleaner and a lot nicer to look at. I intend to move the old sites to an archive area and just link to them. We'll still need to figure out per-release docs, but I suspect it won't be that hard to convert that stuff to Markdown going forward and have our deploy/release mechanism just publish it to the CMS.

Convert all Lucene web properties to use the ASF CMS
Key: LUCENE-2748
URL: https://issues.apache.org/jira/browse/LUCENE-2748
Project: Lucene - Java
Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

The new CMS has a lot of nice features (and some kinks still to work out) and Forrest just doesn't cut it anymore, so we should move to the ASF CMS: http://apache.org/dev/cms.html

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080471#comment-13080471 ] Grant Ingersoll commented on LUCENE-2748:

If you wish to build/test locally, do the setup at: http://www.apache.org/dev/cmsref.html#local-build

Then run:

{quote}
path/build_site.pl --source-base [Path to Lucene CMS SVN checkout top dir] --target-base [OUTPUT]
{quote}
[jira] [Commented] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr
[ https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080474#comment-13080474 ] Eks Dev commented on SOLR-2701:

One hook for users to update the content of this map would be to add beforeCommit callbacks. This looks simple enough in the UpdateHandler2.commit() call, but there is a catch: we need to invoke the listeners before we close() for implicit commits... having decref-ed the IndexWriter, the question is whether we want to run beforeCommit listeners even if the IW does not really get closed (the user updates the map more often than needed). IMO this should not be a problem, invoking the callbacks a little more often than needed.

Another place where we have an implicit commit is newIndexWriter(); here we only need to add IndexWriterProvider.isIndexWriterNull() to check if we need the callbacks. A solution for close() would also be simple: add IndexWriterProvider.isIndexGoingToCloseOnNextDecref() before invoking decref() to condition the callbacks.

Any better solution? Are callbacks a good approach to provide user hooks for this?

---
Another approach is to get beforeCommit callbacks at the Lucene level and piggy-back the Solr callbacks there? We would only need to change IndexWriter.commit(Map...) and close(), but commit is final...

Notice: I am very rusty considering the solr/lucene codebase = any help would be appreciated. The last patch I made here is ages ago :)

Expose IndexWriter.commit(Map<String,String> commitUserData) to solr
Key: SOLR-2701
URL: https://issues.apache.org/jira/browse/SOLR-2701
Project: Solr
Issue Type: New Feature
Components: update
Affects Versions: 4.0
Reporter: Eks Dev
Priority: Minor
Labels: commit, update
Original Estimate: 8h
Remaining Estimate: 8h

At the moment, there is no feature that enables associating user information with a commit point. Lucene supports this possibility and it should be exposed to Solr as well, probably via a beforeCommit listener (analogous to prepareCommit in Lucene).
The most likely home for this Map to live is the UpdateHandler. An example use case would be atomic tracking of sequence numbers or timestamps for incremental updates.
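The beforeCommit-callback plus commit-time user data idea discussed above could be sketched as follows. This is a self-contained simulation under assumed names (BeforeCommitListener, UserDataWriter are hypothetical, not Solr API); in the real proposal the snapshot would be handed to IndexWriter.commit(Map<String,String>) so it lands atomically in the commit point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: listeners get a last chance to update a shared
// user-data map right before a commit; the commit then snapshots the map,
// keeping the key/value pairs in sync with the commit point.
interface BeforeCommitListener {
    void beforeCommit(Map<String, String> userData);
}

class UserDataWriter {
    private final Map<String, String> userData = new LinkedHashMap<>();
    private final List<BeforeCommitListener> listeners = new ArrayList<>();
    private Map<String, String> lastCommitted = new LinkedHashMap<>();

    void addBeforeCommitListener(BeforeCommitListener l) {
        listeners.add(l);
    }

    // In the real proposal this would end in IndexWriter.commit(Map<String,String>).
    void commit() {
        for (BeforeCommitListener l : listeners) l.beforeCommit(userData);
        lastCommitted = new LinkedHashMap<>(userData);
    }

    // close() performs an implicit commit, so listeners may fire a bit more
    // often than strictly needed - harmless, as argued in the comment above.
    void close() {
        commit();
    }

    // What a reader opened on the commit point would see as user data.
    Map<String, String> committedUserData() {
        return lastCommitted;
    }
}
```

A listener tracking max(uid) would put the current maximum into the map on each commit; reading it back from the last commit then gives the delta starting point without any term scanning.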
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080480#comment-13080480 ] Yonik Seeley commented on LUCENE-2748:

bq. I'm more or less copying the layout of Mahout, which copied Open For Biz.

+1, I like the looks of the Mahout site.
[jira] [Commented] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080525#comment-13080525 ] Mark Miller commented on SOLR-2654:

Hmmm... the problem has something to do with this new index stuff that replication uses - this thing always gets in my way :)

<lockType/> not used consistently in all places Directory objects are instantiated
Key: SOLR-2654
URL: https://issues.apache.org/jira/browse/SOLR-2654
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Critical
Fix For: 3.4
Attachments: SOLR-2654.patch

nipunb noted on the mailing list that when configuring solr to use an alternate <lockType/> (ie: simple), the stats for the SolrIndexSearcher list NativeFSLockFactory being used by the Directory. The problem seems to be that SolrIndexConfig is not consulted when constructing Directory objects used for IndexReaders (it is only used by SolrIndexWriter). I don't _think_ this is a problem in most cases (since the IndexReaders should all be readOnly in the core solr code), but plugins could attempt to use them in other ways. In general it seems like a really bad bug waiting to happen.
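The fix direction the issue implies could be sketched as routing every Directory instantiation through one place that always consults the configured lock type, so the reader path cannot silently fall back to a default lock factory. This is a simplified stand-in (plain maps instead of Lucene Directory/LockFactory objects), not Solr's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stub: one factory that both reader and writer paths use to
// obtain directories, so the configured lockType is applied consistently
// instead of only on the SolrIndexWriter path.
class DirectoryFactoryStub {
    private final String configuredLockType; // e.g. "simple", "native", "single"

    DirectoryFactoryStub(String lockType) {
        this.configuredLockType = lockType;
    }

    // Every Directory (represented here as a plain map of properties) gets
    // the configured lock factory stamped on it at creation time.
    Map<String, String> open(String path) {
        Map<String, String> dir = new HashMap<>();
        dir.put("path", path);
        dir.put("lockFactory", configuredLockType);
        return dir;
    }
}
```

The point of centralizing creation is that a searcher-side caller can no longer construct a Directory with the default NativeFSLockFactory while the writer uses the configured one.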
[jira] [Commented] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080532#comment-13080532 ] Mark Miller commented on SOLR-2654:

My initial suspicion is actually that these were bugs in trunk that were being hidden by the old behavior.
[jira] [Updated] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-2654:

Fix Version/s: 4.0