[jira] [Issue Comment Edited] (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080360#comment-13080360 ] Uwe Schindler edited comment on LUCENE-1768 at 8/6/11 5:57 AM: --- bq. I see some classes in Lucene use Version, but I don't know exactly how that works and why the standard query parser does not use it. Should it? Version is intended to be used for behavioural changes that affect index compatibility, so people can use new Lucene versions without reindexing. It does not help for API changes (it could sometimes, but only for cases where the API change amounts to: if versionA, call method a, else method b, where methods a and b trigger different APIs). Typical examples for Version are changes in tokenization (so most analyzers use it): when a bugfix in an analyzer produces different tokens than before, the version flag makes it possible to re-enable the buggy behaviour, so querying your index with the old tokens still works. The core query parser also uses it to change the behaviour of creating phrase queries (the flexible query parser is, as far as I know, still missing this). I am away this weekend; I will come back to you on Monday for the other questions. 
NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: core/queryparser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, TestNumericQueryParser-fix.patch, week-7.patch, week-8.patch, week1.patch, week2.patch, week3.patch, week4.patch, week5-6.patch It would be good to specify some type of schema for the query parser in the future, to automatically create a NumericRangeQuery for the different numeric types. It would then be possible to index a numeric value (double, float, long, int) using NumericField; the query parser would then know which type the field is and correctly create a NumericRangeQuery for strings like [1.567..*] or (1.787..19.5]. There is currently no way to tell from the index whether a field is numeric, so the user will have to configure the FieldConfig objects in the ConfigHandler. But once this is done, it will not be that difficult to implement the rest. The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and the conversion of the entered numeric values (a simple Number.valueOf(...)-style conversion of the user-entered numbers). 
Everything else is identical; NumericRangeQuery also supports the MTQ rewrite modes (as it is an MTQ). Another thing is a change in Date semantics: there are some strange flags in the current parser that tell it how to handle dates. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
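The Version pattern Uwe describes above can be sketched in a few lines. This is an illustration only, with hypothetical names (the real Lucene Version is a richer enum, and real analyzers are more involved): an analyzer keeps its old, buggy tokenization when constructed with an older Version constant, so existing indexes keep matching.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the Version pattern: a hypothetical tokenizer that re-enables old
// (buggy) behaviour when built with an older Version, preserving index
// compatibility. Names here are illustrative, not the actual Lucene API.
public class VersionedTokenizerSketch {
    enum Version { LUCENE_29, LUCENE_31 }

    private final Version matchVersion;

    VersionedTokenizerSketch(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    List<String> tokenize(String text) {
        if (matchVersion.compareTo(Version.LUCENE_31) >= 0) {
            // fixed behaviour: split on any run of whitespace
            return Arrays.asList(text.trim().split("\\s+"));
        }
        // old buggy behaviour, kept so queries against old indexes still match:
        // splits on single spaces only, producing empty tokens on double spaces
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        System.out.println(new VersionedTokenizerSketch(Version.LUCENE_31).tokenize("a  b")); // [a, b]
        System.out.println(new VersionedTokenizerSketch(Version.LUCENE_29).tokenize("a  b")); // [a, , b]
    }
}
```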
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 10015 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/10015/ 1 tests failed. REGRESSION: org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload Error Message: Number of registered MBeans is not the same as info registry size expected:51 but was:45 Stack Trace: junit.framework.AssertionFailedError: Number of registered MBeans is not the same as info registry size expected:51 but was:45 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.solr.core.TestJmxIntegration.testJmxOnCoreReload(TestJmxIntegration.java:134) Build Log (for compile errors): [...truncated 7951 lines...]
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it. Two questions remain on my side: * the one about discountOverlaps (see above) * what kind of index-time boosts do people usually use? Too big a boost might cause problems if we just divide the length by it. Maybe we should take the logarithm or something like that? Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring, core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc, gsoc2011 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_
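David's concern about large index-time boosts can be made concrete with a toy calculation. This is purely illustrative (neither the patch's code nor any Lucene API): dividing the document length by a large boost collapses the length signal, while a logarithmically dampened divisor keeps it in a sane range.

```java
// Illustration of the boost question above: a raw division by the boost lets
// an extreme boost (e.g. 1000) wipe out the length signal entirely, while a
// log-dampened divisor keeps the norm in a usable range. Hypothetical formulas.
public class BoostNormSketch {
    static double rawNorm(double docLength, double boost) {
        return docLength / boost;                   // boost 1000 makes every doc look tiny
    }
    static double dampenedNorm(double docLength, double boost) {
        return docLength / (1.0 + Math.log(boost)); // log dampens extreme boosts
    }
    public static void main(String[] args) {
        System.out.println(rawNorm(100, 1000));      // 0.1
        System.out.println(dampenedNorm(100, 1000)); // ~12.6
    }
}
```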
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Added a short explanation on the parameter for the Jelinek-Mercer method.
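For reference, the Jelinek-Mercer smoothing the parameter belongs to is conventionally written as a linear interpolation between the document language model and the collection model (standard formulation, not taken from the patch):

```latex
% Jelinek-Mercer smoothing: the parameter \lambda \in (0, 1) interpolates
% between the document model and the collection model.
p_\lambda(w \mid d) = (1 - \lambda)\,\frac{\mathrm{tf}(w, d)}{|d|} + \lambda\, p(w \mid C)
```

Small \(\lambda\) trusts the document; large \(\lambda\) smooths more heavily toward collection statistics.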
Re: Question about LUCENE-3097 - Post Group Faceting
The facet result for field productType will show the following counts: BOOK: 1 DVD: 0 So yes, because of post group faceting you'll miss the second facet. This is basically the same example I described in LUCENE-3097. I've also described three ways of calculating facet counts in combination with grouping. The third way, which I've named matrix counts (field value / group value combinations), would give the result that you expect. However, this isn't implemented yet. In Solr this would require changes in the FacetComponent. I hope this explains it a bit! Martijn On 5 August 2011 16:28, Joshua Harness jkharnes...@gmail.com wrote: Martijn - Thanks for the reply. I understand your answer about the segments. However, I'm still cloudy about faceting with respect to the group head. Perhaps an example will clarify my confusion. Suppose I have 3 order documents with the following data: orderNumber: 1 customerNumber: 1 totalInCents: 1500 productType: 'BOOK' orderNumber: 2 customerNumber: 1 totalInCents: 500 productType: 'BOOK' orderNumber: 3 customerNumber: 1 totalInCents: 1000 productType: 'DVD' Imagine I perform a search for items greater than or equal to 1000 cents, grouped by customer number. I would expect to get order numbers 1 and 3 back, grouped underneath the customer id. Let's assume that order number 1 is considered the most relevant document (in your scenario). Will the post group faceting miss that I actually have two facet values for productType: BOOK and DVD? Thanks! Josh On Fri, Aug 5, 2011 at 4:22 AM, Martijn v Groningen martijn.is.h...@gmail.com wrote: Hi Josh, For post grouping the documents don't need to reside in the same segment. Lucene's grouping module has a collector (TermAllGroupHeadsCollector) that can collect the most relevant document for each group (the group head). This collector can produce an int[] or a FixedBitSet that can be used during faceting to produce post group facets (the patch in SOLR-2665 uses this). 
During faceting only the group heads are known; because of this, field values that differ in documents less relevant than the most relevant document of a group aren't taken into account. This is the same as the example described in the description of LUCENE-3097. Hope this helps! Martijn On 4 August 2011 22:59, Joshua Harness jkharnes...@gmail.com wrote: Hello - Please let me know if this question is more appropriate for the user list. I had assumed the developer list was more appropriate since the ticket is still open. I was analyzing the comments on LUCENE-3097 https://issues.apache.org/jira/browse/LUCENE-3097 and had a couple of questions. A comment https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13033953&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13033953 started a small thread that mentioned that all documents in a given group would need to be contiguous and in the same segment. Also, a statement was made that 'The app would have to ensure this'. I was unclear on the result of this conversation. It sounded like maybe this could have turned out to not be the case. What is the status of this? Does my application have to ensure all the documents in the group are in the same segment? How would one accomplish this? Another comment https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13038297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13038297 mentioned that 'we pick only the head doc...as long as the head doc is guaranteed to have the same value for field X, it is safe to use that doc to represent the entire group for facet counting'. Does this mean that there is a restriction placed on me that the head document must have field values that match the rest of the documents in the same group? Or is this simply an implementation detail that uses the head document when this condition is the case or chooses another strategy when this is not the case? 
I am very interested in adopting this patch. However - I am attempting to understand any limitations/conditions so that I may use it correctly. Any advice would be greatly appreciated. Thanks! Josh Harness -- Met vriendelijke groet, Martijn van Groningen -- Met vriendelijke groet, Martijn van Groningen
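The exchange above can be simulated in a few lines: only the group head (the most relevant document per group) contributes to facet counts, so a value that appears only in a less relevant group member is missed. The data mirrors Josh's three orders; using the order total as the relevance score is an assumption made for the demo, and none of this is the actual grouping-module code.

```java
import java.util.*;

// Simulation of post-group (group-head) faceting: one head per customer is
// picked, and only heads are counted, so DVD ends up at 0 even though order 3
// matched the query. Pure illustration, not Lucene's TermAllGroupHeadsCollector.
public class GroupHeadFacetSketch {
    record Order(int orderNumber, int customerNumber, int totalInCents, String productType) {}

    static Map<String, Integer> groupHeadFacets(List<Order> hits) {
        // pick one head per customer: the hit with the highest total (assumed relevance)
        Map<Integer, Order> heads = new HashMap<>();
        for (Order o : hits) {
            heads.merge(o.customerNumber(), o,
                (a, b) -> a.totalInCents() >= b.totalInCents() ? a : b);
        }
        // facet over the heads only
        Map<String, Integer> counts = new TreeMap<>(Map.of("BOOK", 0, "DVD", 0));
        for (Order head : heads.values()) {
            counts.merge(head.productType(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Order> hits = List.of(                 // query: totalInCents >= 1000
            new Order(1, 1, 1500, "BOOK"),
            new Order(3, 1, 1000, "DVD"));
        System.out.println(groupHeadFacets(hits));  // {BOOK=1, DVD=0}
    }
}
```

Matrix counts, the third method Martijn mentions, would instead count every (field value, group) combination and report DVD: 1.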
[jira] [Updated] (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Tankovic updated LUCENE-2308: Attachment: LUCENE-2308-21.patch 21st patch :) Fixed Javadocs errors. Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, LUCENE-2308.patch This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. 
We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
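A minimal sketch of the refactoring Michael describes: the per-field flags move into a reusable FieldType, and Field holds only the name and value. These class shapes are hypothetical; the eventual Lucene API may differ.

```java
// Sketch of LUCENE-2308's idea: index/store/analyze flags factored out of
// Field into a FieldType that is shared across many Field instances. The
// field names and flag set here are illustrative only.
public class FieldTypeSketch {
    static class FieldType {
        final boolean indexed, stored, analyzed, omitNorms;
        FieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
            this.indexed = indexed; this.stored = stored;
            this.analyzed = analyzed; this.omitNorms = omitNorms;
        }
    }
    static class Field {
        final String name, value;
        final FieldType type;          // one FieldType instance, reused per field
        Field(String name, String value, FieldType type) {
            this.name = name; this.value = value; this.type = type;
        }
    }
    public static void main(String[] args) {
        FieldType textType = new FieldType(true, true, true, false);
        Field title = new Field("title", "Lucene in Action", textType);
        Field body  = new Field("body",  "...", textType);
        System.out.println(title.type == body.type); // true: the type is shared
    }
}
```

Note this is not a schema: nothing here would be serialized into the index, exactly as the issue description says.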
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3357: Attachment: LUCENE-3357.patch Integration tests added. There are two of them; however, ant test only runs one? Unit and integration test cases for the new Similarities Key: LUCENE-3357 URL: https://issues.apache.org/jira/browse/LUCENE-3357 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Priority: Minor Labels: gsoc, gsoc2011, test Fix For: flexscoring branch Attachments: LUCENE-3357.patch Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created: * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations; * integration tests, in which a small collection is indexed and then searched using the Similarities. Performance tests will be performed in a separate issue.
[jira] [Issue Comment Edited] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080416#comment-13080416 ] David Mark Nemeskey edited comment on LUCENE-3357 at 8/6/11 3:52 PM: - Integration tests added. There are two of them; however, ant test runs only one?
[jira] [Created] (SOLR-2700) transaction logging
transaction logging --- Key: SOLR-2700 URL: https://issues.apache.org/jira/browse/SOLR-2700 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.
[jira] [Updated] (SOLR-2700) transaction logging
[ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2700: --- Attachment: SOLR-2700.patch Here's a draft patch. There is a tlog.number file created for each commit. The javabin format is used to serialize SolrInputDocuments. An in-memory map of pointers into the log is kept for documents not yet soft-committed, and the realtime-get component checks that first before using SolrCore.getNewestSearcher(). Seems to work for getting documents not in the newest searcher so far. Tons of stuff left to do:
- the tlog files are currently in the CWD
- need to handle deletes
- need to handle flushes in a performant way
- need to implement optional fsync for durability on power-failure
- would be nice to make some of this multi-threaded for better performance
- need to implement durability (apply updates from logs on startup)
- need to implement some form of cleanup for transaction logs
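The realtime-get path Yonik describes can be modeled in miniature: updates are appended to a log, an in-memory map keeps a pointer (offset) into the log for every document not yet visible to the newest searcher, and lookups consult that map first. Everything below is an in-memory toy, not the patch: the real draft serializes SolrInputDocuments in javabin format to tlog files on disk.

```java
import java.util.*;

// Toy model of SOLR-2700's realtime-get path: id -> offset pointers into an
// append-only log, checked before falling back to the "searcher" view.
public class TransactionLogSketch {
    private final List<String> log = new ArrayList<>();            // append-only "tlog"
    private final Map<String, Integer> pointers = new HashMap<>(); // id -> log offset
    private final Map<String, String> searcher = new HashMap<>();  // docs already visible

    void add(String id, String doc) {
        pointers.put(id, log.size());
        log.add(doc);
    }
    void softCommit() {            // make pending docs visible; pointers no longer needed
        for (Map.Entry<String, Integer> e : pointers.entrySet()) {
            searcher.put(e.getKey(), log.get(e.getValue()));
        }
        pointers.clear();
    }
    String realtimeGet(String id) { // check the pointer map before the searcher
        Integer offset = pointers.get(id);
        return offset != null ? log.get(offset) : searcher.get(id);
    }
    public static void main(String[] args) {
        TransactionLogSketch tlog = new TransactionLogSketch();
        tlog.add("1", "doc-one");
        System.out.println(tlog.realtimeGet("1")); // doc-one, served from the log
        tlog.softCommit();
        System.out.println(tlog.realtimeGet("1")); // doc-one, now via the searcher
    }
}
```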
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 10036 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/10036/ 2 tests failed. REGRESSION: org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByDirName Error Message: No such core: core1 Stack Trace: org.apache.solr.common.SolrException: No such core: core1 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:104) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByDirName(MergeIndexesExampleTestBase.java:129) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1335) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1240) REGRESSION: org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByCoreName Error Message: No such core: core1 Stack Trace: org.apache.solr.common.SolrException: No such core: core1 at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:104) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90) at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByCoreName(MergeIndexesExampleTestBase.java:145) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1335) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1240) Build Log (for compile errors): [...truncated 14180 lines...]
[jira] [Created] (LUCENE-3364) Add score threshold into Scorer.score()
Add score threshold into Scorer.score() --- Key: LUCENE-3364 URL: https://issues.apache.org/jira/browse/LUCENE-3364 Project: Lucene - Java Issue Type: Improvement Components: core/query/scoring Reporter: John Wang This is an optimization for scoring. Consider a Scorer.score() implementation where features are gathered to calculate a score. Proposal: add a parameter to score, e.g. score(float threshold). This threshold is the minimum score to beat to make it into the current PriorityQueue. This could potentially save a great deal of wasted calculation in cases where recall is large. In our case specifically, some of the features needed for the calculation can be expensive to obtain; it would be nice to have a place to decide whether fetching these features is even necessary. Also, if we know the score would be low, the threshold can simply be returned. Let me know if this makes sense and I can work on a patch.
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080437#comment-13080437 ] Yonik Seeley commented on LUCENE-3364: -- Perhaps it would be easiest to just create a Collector that cuts off based on score?
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080439#comment-13080439 ] John Wang commented on LUCENE-3364: --- Hi Yonik: In a Collector, the decision to cut off a score would come too late, e.g. float score = Scorer.score(); // this is where the cost would occur boolean cutOff = decide(score); In my example, my score impl is: float s1 = cheapCalc(docid); float s2 = expensiveCalc(docid); return s1+s2; So now if I know expensiveCalc is bounded by N, and cheapCalc returns a very small number, I can simply skip the s2 calculation because this doc would be thrown out anyway. Hope I am making sense :) -John
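John's early-exit pattern can be sketched end to end. cheapCalc and expensiveCalc below are stand-ins for real feature computations, and the bound and threshold values are arbitrary; this is an illustration of the proposed score(float threshold) signature, not actual Lucene code.

```java
// Sketch of the proposed score(threshold): when the cheap feature plus a known
// upper bound on the expensive feature cannot beat the threshold, the
// expensive feature is never fetched and the threshold is returned instead.
public class ThresholdScorerSketch {
    static final float EXPENSIVE_BOUND = 1.0f;   // assumed upper bound on expensiveCalc
    static int expensiveCalls = 0;               // instrumentation for the demo

    static float cheapCalc(int docid)     { return (docid % 10) / 10.0f; }
    static float expensiveCalc(int docid) { expensiveCalls++; return 0.9f; }

    static float score(int docid, float threshold) {
        float s1 = cheapCalc(docid);
        if (s1 + EXPENSIVE_BOUND <= threshold) {
            return threshold;                    // can't make the queue: skip s2 entirely
        }
        return s1 + expensiveCalc(docid);
    }

    public static void main(String[] args) {
        System.out.println(score(1, 5.0f));      // 0.1 + 1.0 <= 5.0: expensive calc skipped
        System.out.println(score(9, 0.5f));      // 0.9 + 1.0 >  0.5: expensive calc runs
        System.out.println("expensive calls: " + expensiveCalls);
    }
}
```

A Collector sees the score only after Scorer.score() returns, which is why the threshold has to be pushed down into the scorer itself, as John argues.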
[jira] [Commented] (LUCENE-3364) Add score threshold into Scorer.score()
[ https://issues.apache.org/jira/browse/LUCENE-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080442#comment-13080442 ] Yonik Seeley commented on LUCENE-3364: -- Ah, gotcha - I see what you're saying now.
IndexReader.maxDoc() and other
Assuming there are no deletes, would the following work as a way to load the *last added document*, surviving optimize as well? The order of document ids in Lucene survives optimize, as far as I remember? IndexReader ir... int maxDoc = ir.maxDoc() - 1; if (maxDoc > 0) // ? What is the return value on an empty index, 0 or 1? Document d = ir.document(maxDoc); Would this correspond to the last committed document (at the commit point where the index reader was opened)? Or the last added document, including pending/uncommitted ones? (I am not getting the IndexReader from the IndexWriter, no NRT yet...) The problem I am trying to solve is incremental updates (there are no deletions). Having a unique, numerical uid stored in the index that increases with every add, I just need a way to find max(uid) at the last commit to get my delta from the database. The above solution was one of the options. 2. The second would be to iterate a TermsEnum for the uid field until I hit the end, but this sounds slow (even if I start skipping around like a monkey)? 3. The third option would be to index a reversed uid (HUGE_CONSTANT - uid), so it gets on top in the terms dictionary? 4. And finally, the last option I am thinking of would be to track max(uid) and write it as a user parameter with IndexWriter.commit(Map...), so I could read it easily (piggy-backing on the Lucene commit is as safe as it gets, better than persisting my own files...) I like the last option, but have no idea how to create a beforeCommitListener in solr? The most robust is 2/3, but maybe slow-ish (there are 100-200 million documents/UIDs). Any better ideas? (and no, the DIH wall clock timestamp is not good enough) I am talking about solr/lucene 4 trunk, we decided to take a risk :) Thanks, eks
Re: IndexReader.maxDoc() and other
On Sat, Aug 6, 2011 at 2:47 PM, eks dev eks...@yahoo.co.uk wrote: Assuming there are no deletes, would the following work as a way to load *last added document*, surviving optimize as well? Order of documentId-s in Lucene survives optimize as far as I remember? No longer... the default merge policy can now merge non-contiguous segments. You can of course still select a Log* merge policy, which never reorders ids with respect to each other. -Yonik http://www.lucidimagination.com
Re: IndexReader.maxDoc() and other
Thanks Yonik, assuming I am not going to index ID , than only an option 4. remains so far. I have no other ideas, and Log* merge policy would mean all 4 Indexing magic went to nothing :) Colud then the following do the job? clone DefaultIndexWriterProvider into my codebase (ugly, keep in sync , but doable) make it provide EnhancedSolrIndexWriter extends SolrIndexWriter @Override commit(...){ super.commit(MapString, String Core.getUserMap()); } the same with close(...) If yes, Is this feature something solr could use? MapString, String userParams somewhere in Core that gets committed with whatever it has at commit time. I could wrap up a patch by modifying SolrIndexWriter directly then? Nice thing about it, one could have possibility to keep small map of key value pairs in sync with commit points with all goods of TwoPhaseCommit... for no way for this to get out of sync things, like my use case below... I imagine DIH could use it as well - No longer... the default merge policy can now merge non-contiguous segments. You can of course still select a Log* merge policy, which never reorders ids with respect to each other. -Yonik http://www.lucidimagination.com From: eks dev eks...@yahoo.co.uk To: dev@lucene.apache.org Sent: Sat, 6 August, 2011 20:47:09 Subject: IndexReader.maxDoc() and other Assuming there are no deletes, would the following work as a way to load *last added document*, surviving optimize as well? Order of documentId-s in Lucene survives optimize as far as I remember? IndexReader ir... int maxDoc = ir.maxDoc() - 1; if(maxDoc0) //? What is the return value on empty index, 0 or 1? Document d = ir.getDocument(maxDoc); Would this correspond to the last committed document (at commit point where index reader was opened) Or last added document, including pending/uncommitted (I am not getting IndexReader from the IndexWriter, no nrt yet...) The problem I am trying to solve are incremental updates (there are no deletions). 
Having a unique, numerical uid stored in the index that increases with every add, I just need a way to find max(uid) at the last commit to get my delta from the database.

1. The solution above was one of the options.
2. The second would be to iterate a TermsEnum over the uid field until I hit the end, but this sounds slow (even if I start skipping around like a monkey)?
3. The third option would be to index a reversed uid (HUGE_CONSTANT - uid), so it ends up on top of the terms dictionary?
4. And finally, the last option I am thinking of would be to track max(uid) and write it as user data with IndexWriter.commit(Map...), so I could read it back easily (piggy-backing on the Lucene commit is as safe as it gets, better than persisting my own files...).

I like the last option, but have no idea how to create a beforeCommit listener in Solr. The most robust is 2/3, but maybe slow-ish (there are 100-200 million documents/uids). Any better ideas? (And no, the DIH wall-clock timestamp is not good enough.) I am talking about solr/lucene 4 trunk, we decided to take a risk :)

Thanks, eks
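Option 3 (the reversed uid) could be sketched as below. This is a self-contained illustration, not Lucene API: HUGE_CONSTANT and the zero-padding width are assumptions, and a TreeSet stands in for the lexicographically sorted terms dictionary, where the largest uid becomes the first term and could be found with a single seek instead of a full TermsEnum scan.

```java
import java.util.TreeSet;

// Sketch of the reversed-uid trick: index (HUGE_CONSTANT - uid), zero-padded
// so the terms sort numerically in the lexicographically ordered terms
// dictionary. The largest uid then maps to the smallest term, so it is the
// very first term of the field. HUGE_CONSTANT must exceed any possible uid.
class ReverseUid {
    static final long HUGE_CONSTANT = 9_999_999_999L; // illustrative bound

    static String toTerm(long uid) {
        return String.format("%010d", HUGE_CONSTANT - uid);
    }

    static long fromTerm(String term) {
        return HUGE_CONSTANT - Long.parseLong(term);
    }

    // The TreeSet simulates the sorted terms dictionary: the first term
    // corresponds to max(uid), no iteration over all terms needed.
    static long maxUid(TreeSet<String> termsDict) {
        return fromTerm(termsDict.first());
    }
}
```

With uids 3, 170 and 42 indexed, the first term of the simulated dictionary decodes back to 170.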
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080463#comment-13080463 ] Grant Ingersoll commented on LUCENE-2748:

Making some progress on this. Here's my intent: we start clean, with one website directory for all of our projects (Lucene, Solr, PyLucene and ORP). I'm more or less copying the layout of Mahout, which copied Open For Biz. It's a lot cleaner and a lot nicer to look at. I intend to move the old sites to an archive area and just link to them. We'll still need to figure out per-release docs, but I suspect it won't be that hard to convert that stuff to Markdown going forward and have our deploy/release mechanism just publish it to the CMS.

Convert all Lucene web properties to use the ASF CMS
Key: LUCENE-2748
URL: https://issues.apache.org/jira/browse/LUCENE-2748
Project: Lucene - Java
Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

The new CMS has a lot of nice features (and some kinks still to work out) and Forrest just doesn't cut it anymore, so we should move to the ASF CMS: http://apache.org/dev/cms.html

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080471#comment-13080471 ] Grant Ingersoll commented on LUCENE-2748:

If you wish to build/test locally, do the setup at: http://www.apache.org/dev/cmsref.html#local-build

Then run:

{quote}
path/build_site.pl --source-base [Path to Lucene CMS SVN checkout top dir] --target-base [OUTPUT]
{quote}
[jira] [Commented] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr
[ https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080474#comment-13080474 ] Eks Dev commented on SOLR-2701:

One hook for users to update the content of this map would be to add beforeCommit callbacks. This looks simple enough in the UpdateHandler2.commit() call, but there is a catch: we need to invoke the listeners before we close() for implicit commits... having decref-ed the IndexWriter, the question is whether we want to run beforeCommit listeners even if the IW does not really get closed (the user updates the map more often than needed). IMO this should not be a problem, invoking the callbacks a little more often than needed.

Another place where we have an implicit commit is newIndexWriter(); here we only need to add IndexWriterProvider.isIndexWriterNull() to check if we need the callbacks. A solution for close() would also be simple: add IndexWriterProvider.isIndexGoingToCloseOnNextDecref() before invoking decref() to condition the callbacks.

Any better solution? Are callbacks a good approach to provide user hooks for this?

---
Another approach is to get beforeCommit callbacks at the Lucene level and piggy-back the Solr callbacks there? We would only need to change IndexWriter.commit(Map...) and close(), but commit is final...

Notice: I am very rusty considering the solr/lucene codebase = any help would be appreciated. The last patch I made here is ages ago :)

Expose IndexWriter.commit(Map<String,String> commitUserData) to solr
Key: SOLR-2701
URL: https://issues.apache.org/jira/browse/SOLR-2701
Project: Solr
Issue Type: New Feature
Components: update
Affects Versions: 4.0
Reporter: Eks Dev
Priority: Minor
Labels: commit, update
Original Estimate: 8h
Remaining Estimate: 8h

At the moment, there is no feature that enables associating user information with a commit point. Lucene supports this possibility and it should be exposed to Solr as well, probably via a beforeCommit listener (analogous to prepareCommit in Lucene).
The most likely home for this Map to live is the UpdateHandler. An example use case would be atomic tracking of sequence numbers or timestamps for incremental updates.
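The beforeCommit-callback plus commit-time user data idea discussed above could be sketched as follows. This is a self-contained simulation under assumed names (BeforeCommitListener, UserDataWriter are hypothetical, not Solr API); in the real proposal the snapshot would be handed to IndexWriter.commit(Map<String,String>) so it lands atomically in the commit point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: listeners get a last chance to update a shared
// user-data map right before a commit; the commit then snapshots the map,
// keeping the key/value pairs in sync with the commit point.
interface BeforeCommitListener {
    void beforeCommit(Map<String, String> userData);
}

class UserDataWriter {
    private final Map<String, String> userData = new LinkedHashMap<>();
    private final List<BeforeCommitListener> listeners = new ArrayList<>();
    private Map<String, String> lastCommitted = new LinkedHashMap<>();

    void addBeforeCommitListener(BeforeCommitListener l) {
        listeners.add(l);
    }

    // In the real proposal this would end in IndexWriter.commit(Map<String,String>).
    void commit() {
        for (BeforeCommitListener l : listeners) l.beforeCommit(userData);
        lastCommitted = new LinkedHashMap<>(userData);
    }

    // close() performs an implicit commit, so listeners may fire a bit more
    // often than strictly needed - harmless, as argued in the comment above.
    void close() {
        commit();
    }

    // What a reader opened on the commit point would see as user data.
    Map<String, String> committedUserData() {
        return lastCommitted;
    }
}
```

A listener tracking max(uid) would put the current maximum into the map on each commit; reading it back from the last commit then gives the delta starting point without any term scanning.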
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080480#comment-13080480 ] Yonik Seeley commented on LUCENE-2748:

bq. I'm more or less copying the layout of Mahout, which copied Open For Biz.

+1, I like the looks of the Mahout site.
[jira] [Commented] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080525#comment-13080525 ] Mark Miller commented on SOLR-2654:

Hmmm... the problem has something to do with this new index stuff that replication uses - this thing always gets in my way :)

<lockType/> not used consistently in all places Directory objects are instantiated
Key: SOLR-2654
URL: https://issues.apache.org/jira/browse/SOLR-2654
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Critical
Fix For: 3.4
Attachments: SOLR-2654.patch

nipunb noted on the mailing list that when configuring solr to use an alternate <lockType/> (ie: simple), the stats for the SolrIndexSearcher list NativeFSLockFactory being used by the Directory. The problem seems to be that SolrIndexConfig is not consulted when constructing Directory objects used for IndexReaders (it is only used by SolrIndexWriter). I don't _think_ this is a problem in most cases (since the IndexReaders should all be readOnly in the core solr code), but plugins could attempt to use them in other ways. In general it seems like a really bad bug waiting to happen.
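The fix direction the issue implies could be sketched as routing every Directory instantiation through one place that always consults the configured lock type, so the reader path cannot silently fall back to a default lock factory. This is a simplified stand-in (plain maps instead of Lucene Directory/LockFactory objects), not Solr's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stub: one factory that both reader and writer paths use to
// obtain directories, so the configured lockType is applied consistently
// instead of only on the SolrIndexWriter path.
class DirectoryFactoryStub {
    private final String configuredLockType; // e.g. "simple", "native", "single"

    DirectoryFactoryStub(String lockType) {
        this.configuredLockType = lockType;
    }

    // Every Directory (represented here as a plain map of properties) gets
    // the configured lock factory stamped on it at creation time.
    Map<String, String> open(String path) {
        Map<String, String> dir = new HashMap<>();
        dir.put("path", path);
        dir.put("lockFactory", configuredLockType);
        return dir;
    }
}
```

The point of centralizing creation is that a searcher-side caller can no longer construct a Directory with the default NativeFSLockFactory while the writer uses the configured one.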
[jira] [Commented] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080532#comment-13080532 ] Mark Miller commented on SOLR-2654:

My initial suspicion is actually that these were bugs in trunk that were being hidden by the old behavior.
[jira] [Updated] (SOLR-2654) <lockType/> not used consistently in all places Directory objects are instantiated
[ https://issues.apache.org/jira/browse/SOLR-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-2654:

Fix Version/s: 4.0