[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 118 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/118/

1 tests failed.

REGRESSION: org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest.testMergeIndexesByCoreName

Error Message: No such core: core1

Stack Trace:
org.apache.solr.common.SolrException: No such core: core1
	at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:118)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104)
	at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.setupCores(MergeIndexesExampleTestBase.java:90)
	at org.apache.solr.client.solrj.MergeIndexesExampleTestBase.testMergeIndexesByCoreName(MergeIndexesExampleTestBase.java:145)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427)

Build Log (for compile errors):
[...truncated 12215 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer
[ https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076069#comment-13076069 ]

Christian Moen commented on LUCENE-3305:

Hello again, Simon. I've filed the paperwork and copied you on email. Hope you're enjoying your vacation!

Kuromoji code donation - a new Japanese morphological analyzer
--
Key: LUCENE-3305
URL: https://issues.apache.org/jira/browse/LUCENE-3305
Project: Lucene - Java
Issue Type: New Feature
Components: modules/analysis
Reporter: Christian Moen
Assignee: Simon Willnauer
Attachments: Kuromoji short overview .pdf, ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz

Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese morphological analyzer to the Apache Software Foundation in the hope that it will be useful to Lucene and Solr users in Japan and elsewhere. The project was started in 2010 because we couldn't find any high-quality, actively maintained and easy-to-use Java-based Japanese morphological analyzers, and these became many of our design goals for Kuromoji.

Kuromoji also has a segmentation mode that is particularly useful for search, which we hope will interest Lucene and Solr users. Compound nouns, such as 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token by most analyzers. As a result, a search for 空港 (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what you would want for search, and you'll get a hit.

We also wanted to make sure the technology has a license that makes it compatible with other Apache Software Foundation software to maximize its usefulness. Kuromoji has an Apache License 2.0 and all code is currently owned by Atilika Inc.
The software has been developed by my good friend and ex-colleague Masaru Hasegawa and myself. Kuromoji uses the so-called IPADIC for its dictionary/statistical model, and its license terms are described in NOTICE.txt.

I'll upload code distributions and their corresponding hashes, and I'd very much like to start the code grant process. I'm also happy to provide patches to integrate Kuromoji into the codebase, if you prefer that. Please advise on how you'd like me to proceed with this. Thank you.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
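The compound-splitting behavior described above can be illustrated with a toy greedy dictionary splitter. Kuromoji itself runs a Viterbi search over IPADIC with connection costs; the dictionary, class name, and longest-match algorithm below are stand-ins for illustration only.

```java
import java.util.*;

// Toy illustration of why search-mode segmentation matters: a greedy
// dictionary-based splitter that breaks a compound like 関西国際空港 into the
// pieces a user would actually query. Kuromoji itself uses a Viterbi search
// over IPADIC with costs; this dictionary and algorithm are stand-ins.
public class CompoundSplitDemo {
    static final Set<String> DICT = new HashSet<>(
        Arrays.asList("関西", "国際", "空港", "日本", "経済", "新聞"));

    public static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int best = i + 1;                        // fall back to a single character
            for (int j = text.length(); j > i; j--)  // longest dictionary match first
                if (DICT.contains(text.substring(i, j))) { best = j; break; }
            out.add(text.substring(i, best));
            i = best;
        }
        return out;
    }
}
```

With this split, an index on the pieces gives a hit for a query on 空港 alone, which is the point of the search-oriented mode.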
[jira] [Commented] (LUCENE-3343) Comparison operators <, <=, >, >= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076091#comment-13076091 ]

Olivier Favre commented on LUCENE-3343:
---
Great, thanks! No blockers for 3x?

Comparison operators <, <=, >, >= and = support as RangeQuery syntax in QueryParser
Key: LUCENE-3343
URL: https://issues.apache.org/jira/browse/LUCENE-3343
Project: Lucene - Java
Issue Type: New Feature
Components: modules/queryparser
Reporter: Olivier Favre
Assignee: Adriano Crestani
Priority: Minor
Labels: parser, query
Fix For: 3.4, 4.0
Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch
Original Estimate: 96h
Remaining Estimate: 96h

To offer better interoperability with other search engines and to provide an easier and more straightforward syntax, the operators <, <=, >, >= and = should be available to express an open range query. They should at least work for numeric queries. '=' can be made a synonym for ':'.
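The proposed semantics can be sketched as a mapping from comparison operators onto Lucene's existing open-range syntax. This is an illustrative translation only; the class and method names are hypothetical and not taken from the NumCompQueryParser patch, and the mixed [..} endpoints assume a parser that accepts mixed inclusive/exclusive brackets.

```java
// Illustrative sketch of the proposed operator semantics, expressed as a
// rewrite into classic range syntax. Hypothetical names, not the patch's code.
public class CompOpToRange {
    // Translate "field OP value" into an equivalent open-range query string.
    public static String toRangeSyntax(String field, String op, String value) {
        switch (op) {
            case ">":  return field + ":{" + value + " TO *]";  // exclusive lower bound
            case ">=": return field + ":[" + value + " TO *]";
            case "<":  return field + ":[* TO " + value + "}";  // exclusive upper bound
            case "<=": return field + ":[* TO " + value + "]";
            case "=":  return field + ":" + value;              // '=' as a synonym for ':'
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```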
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I could only upload this patch now but I didn't have time to work on Lucene the last week. As I see, all the problems you mentioned have been corrected, so maybe we can go on with the review? Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_
[jira] [Created] (LUCENE-3357) Unit and integration test cases for the new Similarities
Unit and integration test cases for the new Similarities Key: LUCENE-3357 URL: https://issues.apache.org/jira/browse/LUCENE-3357 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Priority: Minor Fix For: flexscoring branch Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created: * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations; * integration tests, in which a small collection is indexed and then searched using the Similarities. Performance tests will be performed in a separate issue.
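The unit-test style described above — mock statistics fed to a similarity, score checked against a hand calculation — can be sketched with the textbook BM25 formula. This is a self-contained stand-in, not the flexscoring branch's actual EasySimilarity code; the statistics and parameters below are made-up mock values.

```java
// Textbook BM25 with mock statistics, in the spirit of the unit tests
// described above: the computed score is validated against a hand calculation.
public class Bm25HandCheck {
    // Lucene-4-era BM25 idf variant: log(1 + (N - df + 0.5) / (df + 0.5))
    public static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Classic BM25: idf * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))
    public static double score(double tf, double docLen, double avgDocLen,
                               long docCount, long docFreq,
                               double k1, double b) {
        double lenNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + lenNorm);
    }
}
```

Hand check with mock stats N=100, df=10, tf=3, dl=avgdl=50, k1=1.2, b=0.75: idf = ln(9.619) ≈ 2.2638, and the tf part is 6.6/4.2 ≈ 1.5714, so the score should be about 3.557.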
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Component/s: core/query/scoring Labels: gsoc gsoc2011 (was: gsoc) Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring, core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc, gsoc2011 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3357: Labels: gsoc gsoc2011 test (was: gsoc gsoc2011) Unit and integration test cases for the new Similarities Key: LUCENE-3357 URL: https://issues.apache.org/jira/browse/LUCENE-3357 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Priority: Minor Labels: gsoc, gsoc2011, test Fix For: flexscoring branch Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created: * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations; * integration tests, in which a small collection is indexed and then searched using the Similarities. Performance tests will be performed in a separate issue.
[jira] [Updated] (LUCENE-3357) Unit and integration test cases for the new Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3357: Labels: gsoc gsoc2011 (was: ) Unit and integration test cases for the new Similarities Key: LUCENE-3357 URL: https://issues.apache.org/jira/browse/LUCENE-3357 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Priority: Minor Labels: gsoc, gsoc2011, test Fix For: flexscoring branch Write test cases to test the new Similarities added in [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of test cases will be created: * unit tests, in which mock statistics are provided to the Similarities and the score is validated against hand calculations; * integration tests, in which a small collection is indexed and then searched using the Similarities. Performance tests will be performed in a separate issue.
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076132#comment-13076132 ] Uwe Schindler commented on LUCENE-3335: --- @Shay: Sorry, I did not want to be too italian :-) I just wanted to ensure that such configurations, leading to bugs in JVMs, would be reported to us. It would help us to also respond quicker on such bug reports, like the one we already got 2 months ago (which nobody was able to reproduce, as we did not know that the user used aggressive opts). jrebug causes porter stemmer to sigsegv --- Key: LUCENE-3335 URL: https://issues.apache.org/jira/browse/LUCENE-3335 Project: Lucene - Java Issue Type: Bug Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, 3.3, 3.4, 4.0 Environment: - JDK 7 Preview Release, GA (may also affect update _1, targeted fix is JDK 1.7.0_2) - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts Reporter: Robert Muir Assignee: Robert Muir Labels: Java7 Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, patch-0uwe.patch happens easily on java7: ant test -Dtestcase=TestPorterStemFilter -Dtests.iter=100 might happen on 1.6.0_u26 too, a user reported something that looks like the same bug already: http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm
[jira] [Commented] (LUCENE-3335) jrebug causes porter stemmer to sigsegv
[ https://issues.apache.org/jira/browse/LUCENE-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076156#comment-13076156 ] Robert Muir commented on LUCENE-3335: - I don't think there is any sense in this, who cares? We reported this crash to Oracle in plenty of time, and the *worse* wrong-results bug has been open since May 13: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738, but Oracle decided not to fix that, too. jrebug causes porter stemmer to sigsegv --- Key: LUCENE-3335 URL: https://issues.apache.org/jira/browse/LUCENE-3335 Project: Lucene - Java Issue Type: Bug Affects Versions: 1.9, 1.9.1, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 2.9, 2.9.1, 2.9.2, 2.9.3, 2.9.4, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1, 3.2, 3.3, 3.4, 4.0 Environment: - JDK 7 Preview Release, GA (may also affect update _1, targeted fix is JDK 1.7.0_2) - JDK 1.6.0_20+ with -XX:+OptimizeStringConcat or -XX:+AggressiveOpts Reporter: Robert Muir Assignee: Robert Muir Labels: Java7 Attachments: LUCENE-3335.patch, LUCENE-3335_slow.patch, patch-0uwe.patch happens easily on java7: ant test -Dtestcase=TestPorterStemFilter -Dtests.iter=100 might happen on 1.6.0_u26 too, a user reported something that looks like the same bug already: http://www.lucidimagination.com/search/document/3beaa082c4d2fdd4/porterstemfilter_kills_jvm
[jira] [Commented] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076171#comment-13076171 ]

Robert Muir commented on LUCENE-3220:
-
Hi David, I was thinking for the norm, we could store it like DefaultSimilarity. This would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from smallfloat?

{noformat}
public byte computeNorm(FieldInvertState state) {
  final int numTerms;
  if (discountOverlaps)
    numTerms = state.getLength() - state.getNumOverlap();
  else
    numTerms = state.getLength();
  return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
}
{noformat}

For computations, you have to 'undo' the sqrt() to get the quantized length, but that's OK since it's only done up-front a single time and tableized, so it won't slow anything down.

Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring, core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc, gsoc2011 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_
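The decode-table idea Robert describes — quantize boost/sqrt(length) into one byte at index time, then "undo" the sqrt once up-front at search time by tableizing all 256 byte values — can be sketched as follows. The quantizer here is a deliberately simple fixed-point scheme, not Lucene's SmallFloat, and the SCALE constant is an assumption for illustration.

```java
// Sketch of the norm-table idea: store boost/sqrt(length) quantized to one
// byte, and at search time precompute, for all 256 byte values, the decoded
// norm and the "undone" length 1/norm^2. Simplified quantizer, not SmallFloat.
public class NormTableSketch {
    static final float SCALE = 64f;  // assumption: crude fixed-point step, not Lucene's encoding

    public static byte encodeNormValue(float f) {    // index time, once per field/doc
        int q = Math.round(f * SCALE);
        return (byte) Math.min(255, Math.max(1, q)); // clamp away from 0 so length stays finite
    }

    public static float decodeNormValue(byte b) {
        return (b & 0xFF) / SCALE;
    }

    // Search time: computed a single time and tableized, so per-hit scoring
    // is just an array lookup — this is why undoing the sqrt costs nothing.
    public static float[] buildLengthTable() {
        float[] lengthTable = new float[256];
        for (int i = 0; i < 256; i++) {
            float norm = decodeNormValue((byte) i);
            lengthTable[i] = 1f / (norm * norm);     // undo the sqrt: length = 1/norm^2
        }
        return lengthTable;
    }
}
```

For a field of length 16 with boost 1, the encoded norm is 1/sqrt(16) = 0.25, and the table recovers the quantized length 16 exactly under this scale.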
[jira] [Commented] (LUCENE-3030) Block tree terms dict index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076177#comment-13076177 ] Robert Muir commented on LUCENE-3030: - This is awesome, i really like adding the intersect() hook! Thanks for making a branch, I will check it out and try to dive in to help with some of this :) One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum. this way, if it can be accomplished with a simpler enum like just terms.iterator() or prefixtermsenum etc etc we get that optimization always. Block tree terms dict index - Key: LUCENE-3030 URL: https://issues.apache.org/jira/browse/LUCENE-3030 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch Our default terms index today breaks terms into blocks of fixed size (ie, every 32 terms is a new block), and then we build an index on top of that (holding the start term for each block). But, it should be better to instead break terms according to how they share prefixes. This results in variable sized blocks, but means within each block we maximize the shared prefix and minimize the resulting terms index. It should also be a speedup for terms dict intensive queries because the terms index becomes a true prefix trie, and can be used to fast-fail on term lookup (ie returning NOT_FOUND without having to seek/scan a terms block). Having a true prefix trie should also enable much faster intersection with automaton (but this will be a new issue). I've made an initial impl for this (called BlockTreeTermsWriter/Reader). It's still a work in progress... lots of nocommits, and hairy code, but tests pass (at least once!). 
I made two new codecs, temporarily called StandardTree, PulsingTree, that are just like their counterparts but use this new terms dict. I added a new exactOnly boolean to TermsEnum.seek. If that's true and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the enum is unpositioned (ie you should not call next(), docs(), etc.). In this approach the index and dict are tightly connected, so it does not support a pluggable index impl like BlockTermsWriter/Reader. Blocks are stored on certain nodes of the prefix trie, and can contain both terms and pointers to sub-blocks (ie, if the block is not a leaf block). So there are two trees, tied to one another -- the index trie, and the blocks. Only certain nodes in the trie map to a block in the block tree. I think this algorithm is similar to burst tries (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), except it allows terms to be stored on inner blocks (not just leaf blocks). This is important for Lucene because an [accidental] adversary could produce a terms dict with way too many blocks (way too much RAM used by the terms index). Still, with my current patch, an adversary can produce too-big blocks... which we may need to fix, by letting the terms index not be a true prefix trie on its leaf edges. Exactly how the blocks are picked can be factored out as its own policy (but I haven't done that yet). Then, burst trie is one policy, my current approach is another, etc. The policy can be tuned to the terms' expected distribution, eg if it's a primary key field and you only use base 10 for each character then you want block sizes of size 10. This can make a sizable difference on lookup cost. I modified the FST Builder to allow for a plugin that freezes the tail (changed suffix) of each added term, because I use this to find the blocks.
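The "break terms according to how they share prefixes" policy described above can be illustrated with a toy splitter over a sorted term list. The real BlockTreeTermsWriter works on an FST with min/max block sizes and nested sub-blocks; this stand-in only shows the flavor of the policy (variable-sized blocks, maximal shared prefix within each block).

```java
import java.util.*;

// Toy version of the variable-block policy: walk sorted terms and start a
// new block whenever the leading character changes. Real block tree blocks
// are chosen on FST nodes with size bounds; this is illustration only.
public class PrefixBlocks {
    public static List<List<String>> blocksByFirstChar(List<String> sortedTerms) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        char prev = 0;
        for (String t : sortedTerms) {
            char lead = t.isEmpty() ? 0 : t.charAt(0);
            if (!current.isEmpty() && lead != prev) {
                blocks.add(current);             // close the block: shared prefix ended
                current = new ArrayList<>();
            }
            current.add(t);
            prev = lead;
        }
        if (!current.isEmpty()) blocks.add(current);
        return blocks;
    }
}
```

On a primary-key-like field where every term starts with one of a few leading characters, this yields a handful of fat blocks; a policy tuned to the distribution (as the message suggests) would cut differently.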
[jira] [Commented] (LUCENE-3030) Block tree terms dict index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076178#comment-13076178 ] Robert Muir commented on LUCENE-3030: - Also, we should measure if a prefix automaton with intersect() is faster than PrefixTermsEnum (I suspect it could be!) If this is true, we might want to not rewrite to prefixtermsenum anymore, instead changing PrefixQuery to extend AutomatonQuery too. Block tree terms dict index - Key: LUCENE-3030 URL: https://issues.apache.org/jira/browse/LUCENE-3030 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch Our default terms index today breaks terms into blocks of fixed size (ie, every 32 terms is a new block), and then we build an index on top of that (holding the start term for each block). But, it should be better to instead break terms according to how they share prefixes. This results in variable sized blocks, but means within each block we maximize the shared prefix and minimize the resulting terms index. It should also be a speedup for terms dict intensive queries because the terms index becomes a true prefix trie, and can be used to fast-fail on term lookup (ie returning NOT_FOUND without having to seek/scan a terms block). Having a true prefix trie should also enable much faster intersection with automaton (but this will be a new issue). I've made an initial impl for this (called BlockTreeTermsWriter/Reader). It's still a work in progress... lots of nocommits, and hairy code, but tests pass (at least once!). I made two new codecs, temporarily called StandardTree, PulsingTree, that are just like their counterparts but use this new terms dict. I added a new exactOnly boolean to TermsEnum.seek. 
If that's true and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the enum is unpositioned (ie you should not call next(), docs(), etc.). In this approach the index and dict are tightly connected, so it does not support a pluggable index impl like BlockTermsWriter/Reader. Blocks are stored on certain nodes of the prefix trie, and can contain both terms and pointers to sub-blocks (ie, if the block is not a leaf block). So there are two trees, tied to one another -- the index trie, and the blocks. Only certain nodes in the trie map to a block in the block tree. I think this algorithm is similar to burst tries (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), except it allows terms to be stored on inner blocks (not just leaf blocks). This is important for Lucene because an [accidental] adversary could produce a terms dict with way too many blocks (way too much RAM used by the terms index). Still, with my current patch, an adversary can produce too-big blocks... which we may need to fix, by letting the terms index not be a true prefix trie on its leaf edges. Exactly how the blocks are picked can be factored out as its own policy (but I haven't done that yet). Then, burst trie is one policy, my current approach is another, etc. The policy can be tuned to the terms' expected distribution, eg if it's a primary key field and you only use base 10 for each character then you want block sizes of size 10. This can make a sizable difference on lookup cost. I modified the FST Builder to allow for a plugin that freezes the tail (changed suffix) of each added term, because I use this to find the blocks.
[jira] [Created] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
!frange with query($qq) sets score=1.0f for all returned documents
--
Key: SOLR-2689
URL: https://issues.apache.org/jira/browse/SOLR-2689
Project: Solr
Issue Type: Bug
Components: search
Affects Versions: 3.4
Reporter: Markus Jelsma
Fix For: 3.4, 4.0

Consider the following queries; both query the default field for 'test' and return the document digest and score (I don't seem to be able to get only the score, fl=score returns all fields).

This is a normal query and yields normal results with proper scores:

{code}
q=test&fl=digest,score
{code}

{code}
<result name="response" numFound="227763" start="0" maxScore="4.952673">
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
{code}

This query uses frange with query() to limit the number of returned documents. When using multiple search terms I can indeed cut off the result set, but in the end all returned documents have score=1.0f. The final result set cannot be sorted by score anymore. The result set seems to be returned in the order of Lucene docIds.

{code}
q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
{code}

{code}
<result name="response" numFound="227763" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
{code}
[jira] [Resolved] (SOLR-1032) CSV loader to support literal field values
[ https://issues.apache.org/jira/browse/SOLR-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Hatcher resolved SOLR-1032. Resolution: Fixed Fix Version/s: 4.0 Assignee: Erik Hatcher CSV loader to support literal field values -- Key: SOLR-1032 URL: https://issues.apache.org/jira/browse/SOLR-1032 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.3 Reporter: Erik Hatcher Assignee: Erik Hatcher Priority: Minor Fix For: 4.0 Attachments: SOLR-1032.patch, SOLR-1032.patch It would be very handy if the CSV loader could handle a literal field mapping, like the extracting request handler does. For example, in a scenario where you have multiple datasources (some data from a DB, some from file crawls, and some from CSV) it is nice to add a field to every document that specifies the data source. This is easily done with DIH with a template transformer, and Solr Cell with ext.literal.datasource=, but impossible currently with CSV.
[jira] [Commented] (SOLR-2525) Date Faceting or Range Faceting with offset doesn't convert timezone
[ https://issues.apache.org/jira/browse/SOLR-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076294#comment-13076294 ]

David commented on SOLR-2525:
-
Timezone needs to be taken into account when doing date math. Currently it isn't: the DateMathParser instances created are told to use UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight-savings changes the length of a month, so a facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created, and all three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved.

Date Faceting or Range Faceting with offset doesn't convert timezone
Key: SOLR-2525
URL: https://issues.apache.org/jira/browse/SOLR-2525
Project: Solr
Issue Type: Bug
Components: Schema and Analysis, search
Affects Versions: 3.1
Environment: Solr 3.1, Windows 2008 RC2 Server, Java 6, running on Jetty
Reporter: Rohit Gupta
Labels: date, facet

I am trying to facet based on a date field and apply the user timezone offset so that the faceted results are in the user's timezone. My faceted result is given below:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="q">icici</str>
      <str name="facet.range.start">2011-05-02T00:00:00Z+330MINUTES</str>
      <str name="facet.range">createdOnGMTDate</str>
      <str name="facet.range.end">2011-05-18T00:00:00Z</str>
      <str name="facet.range.gap">+1DAY</str>
    </lst>
  </lst>
  <lst name="facet_counts">
    <lst name="facet_ranges">
      <lst name="createdOnGMTDate">
        <lst name="counts">
          <int name="2011-05-02T05:30:00Z">4</int>
          <int name="2011-05-03T05:30:00Z">63</int>
          <int name="2011-05-04T05:30:00Z">0</int>
          <int name="2011-05-05T05:30:00Z">0</int>
          ..
        </lst>
        <str name="gap">+1DAY</str>
        <date name="start">2011-05-02T05:30:00Z</date>
        <date name="end">2011-05-18T05:30:00Z</date>
      </lst>
    </lst>
  </lst>
</response>

Now if you notice, the response shows 4 records for the 2nd of May 2011, which fall in the IST timezone (+330MINUTES), but when I try to get the results I see that there is only 1 result for the 2nd. Why is this happening?

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="sort">createdOnGMTDate asc</str>
      <str name="fl">createdOnGMT,createdOnGMTDate,twtText</str>
      <str name="fq">createdOnGMTDate:[2011-05-01T00:00:00Z+330MINUTES TO *]</str>
      <str name="q">icici</str>
    </lst>
  </lst>
  <result name="response" numFound="67" start="0">
    <doc>
      <str name="createdOnGMT">Mon, 02 May 2011 16:27:05+</str>
      <date name="createdOnGMTDate">2011-05-02T16:27:05Z</date>
      <str name="twtText">#TechStrat615. Infosys (business soln &amp; IT outsourcer) manages damages with new chairman K.Kamath (ex ICICI Bank chairman) to begin Aug 21.</str>
    </doc>
    <doc>
      <str name="createdOnGMT">Mon, 02 May 2011 19:00:44+</str>
      <date name="createdOnGMTDate">2011-05-02T19:00:44Z</date>
      <str name="twtText">how to get icici mobile banking</str>
    </doc>
    <doc>
      <str name="createdOnGMT">Tue, 03 May 2011 01:53:05+</str>
      <date name="createdOnGMTDate">2011-05-03T01:53:05Z</date>
      <str name="twtText">ICICI BANK LTD, L. M. MIRAJ branch in SANGLI, MAHARASHTRA. IFSC Code: ICIC0006537, MICR Code: ... http://bit.ly/fJCuWl #ifsc #micr #bank</str>
    </doc>
    <doc>
      <str name="createdOnGMT">Tue, 03 May 2011 01:53:05+</str>
      <date name="createdOnGMTDate">2011-05-03T01:53:05Z</date>
      <str name="twtText">ICICI BANK LTD, L. M. MIRAJ branch in SANGLI, MAHARASHTRA. IFSC Code: ICIC0006537, MICR Code: ...
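The DST point made in the comment can be demonstrated directly: "+1DAY" from local midnight is 23 wall-clock hours on a spring-forward day in a DST-observing zone, while date math pinned to UTC always advances exactly 24 hours, so UTC-based gap arithmetic drifts against local-midnight buckets. This is a self-contained illustration using java.util.Calendar, not Solr's DateMathParser; the date chosen (the 2011 US spring-forward day) is illustrative.

```java
import java.util.*;

// Measures the elapsed hours of a calendar-aware "+1DAY" from local midnight.
// On a DST transition day this is 23 (or 25) hours in a zone that observes
// DST, but always 24 in UTC — the root of the faceting gap complaint above.
public class DstGapDemo {
    public static long hoursInLocalDay(String tzId, int year, int month, int day) {
        Calendar c = Calendar.getInstance(TimeZone.getTimeZone(tzId));
        c.clear();
        c.set(year, month, day);          // local midnight
        long start = c.getTimeInMillis();
        c.add(Calendar.DAY_OF_MONTH, 1);  // calendar-aware "+1DAY": next local midnight
        return (c.getTimeInMillis() - start) / 3600000L;
    }
}
```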
[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile
[ https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076301#comment-13076301 ] Shawn Heisey commented on SOLR-1972: Hoss, the patch isn't my work; I just modified it to support a 100th percentile and reattached it. I am only just now beginning to learn Java. Although I have some idea of what you're saying about static methods, actually doing it properly within a larger work like Solr is something I won't be able to do yet.

Need additional query stats in admin interface - median, 95th and 99th percentile - Key: SOLR-1972 URL: https://issues.apache.org/jira/browse/SOLR-1972 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Shawn Heisey Priority: Minor Attachments: SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, elyograg-1972-trunk.patch, elyograg-1972-trunk.patch

I would like to see more detailed query statistics in the admin GUI. This is what you can get now:

requests : 809
errors : 0
timeouts : 0
totalTime : 70053
avgTimePerRequest : 86.59209
avgRequestsPerSecond : 0.8148785

I'd like to see more data on the time per request - median, 95th percentile, 99th percentile, and any other statistical function that makes sense to include. In my environment, the first bunch of queries after startup tends to take several seconds each. I find that the average tends to be useless until the server has several thousand queries under its belt and the caches are thoroughly warmed; the statistical functions mentioned above would quickly eliminate the influence of those initial slow queries. The system would have to store individual data about each query; I don't know if this is something Solr does already. It would be nice to have a configurable count of how many of the most recent data points are kept, to control the amount of memory the feature uses. The default value could be something like 1024 or 4096.
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
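The statistics Shawn asks for can be kept exactly as he suggests: retain only the most recent N response times in a ring buffer (bounding memory and letting the slow warm-up queries age out), and compute percentiles on demand. A rough sketch, with hypothetical names (QueryStats, record, percentile are not Solr APIs):

```java
import java.util.Arrays;

// Bounded ring buffer of recent query times with nearest-rank percentiles.
// Window size bounds memory use, as proposed in SOLR-1972.
public class QueryStats {
    private final long[] times;  // most recent response times, in ms
    private int count;           // total queries seen so far

    public QueryStats(int window) { times = new long[window]; }

    public void record(long elapsedMs) {
        times[count % times.length] = elapsedMs;  // overwrite the oldest slot
        count++;
    }

    /** Nearest-rank percentile over the retained window; p in (0, 100]. */
    public long percentile(double p) {
        int n = Math.min(count, times.length);
        long[] sorted = Arrays.copyOf(times, n);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * n / 100.0);  // 1-based nearest rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Median is percentile(50), and Shawn's added 100th percentile is simply the maximum of the window. Sorting a copy per call is O(n log n); a production version would likely cache the sorted view between requests.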
[jira] [Created] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David

Timezone needs to be taken into account when doing date math. Currently it isn't: DateMathParser instances are always constructed with UTC. This is a huge issue when it comes to faceting. Depending on the timezone, daylight-saving time changes the length of a month, so a facet gap of +1MONTH covers a different span depending on the timezone and the time of year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created, and all three are configured with UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved. Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of date facets. Another solution is to expand the syntax of the expressions DateMathParser understands. For example, we could allow (?timeZone=VALUE) to be added anywhere within an expression, where VALUE is the id of the timezone. When DateMathParser reads this, it sets the timezone on the Calendar it is using. Two examples:

- (?timeZone=America/Chicago)NOW/YEAR
- (?timeZone=America/Chicago)+1MONTH

I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used. Thanks! David

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
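The (?timeZone=VALUE) syntax David proposes could be recognized with a small amount of parsing before the rest of the expression is evaluated. A sketch, assuming the group appears as a prefix (TimeZonePrefix is a hypothetical helper, not Solr's DateMathParser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Strip a leading (?timeZone=VALUE) group from a date-math expression and
// report the zone id; a real parser would then build its Calendar with
// TimeZone.getTimeZone(zoneId) instead of hard-coded UTC.
public class TimeZonePrefix {
    private static final Pattern TZ = Pattern.compile("^\\(\\?timeZone=([^)]+)\\)");

    /** Returns {zoneId, remainderOfExpression}; defaults to UTC when absent. */
    public static String[] split(String expr) {
        Matcher m = TZ.matcher(expr);
        if (m.find()) {
            return new String[] { m.group(1), expr.substring(m.end()) };
        }
        return new String[] { "UTC", expr };
    }
}
```

For example, split("(?timeZone=America/Chicago)NOW/YEAR") yields the zone "America/Chicago" and the remaining expression "NOW/YEAR", preserving today's UTC behavior when no group is present.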
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13076307#comment-13076307 ] Yonik Seeley commented on SOLR-2690: Although this probably isn't a bug, I agree that handling timezones somehow would be nice. We just need to think very carefully about the API so we can support it long term. One immediate thought I had was that it would be a pain to specify the timezone everywhere. Even a simple range query would need to specify it twice: my_date:[(?timeZone=America/Chicago)NOW/YEAR TO (?timeZone=America/Chicago)+1MONTH] So one possible alternative that needs more thought is a TZ request parameter that would apply by default to things that are date related.

Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Original Estimate: 3h Remaining Estimate: 3h

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078292#comment-13078292 ] David commented on SOLR-2690: - Good point. Also, this isn't a bug, but if we want a complete solution we really need a way to specify times in other timezones. If I want midnight in the Central time zone I shouldn't have to write 2011-01-01T06:00:00Z (note I wrote 6:00, not 0:00). I believe only DateField would have to be modified to make it possible to specify a timezone. For a complete example, if I wanted to facet blog posts by the date posted, per month:

facet.date=blogPostDate
facet.date.start=2011-01-01T00:00:00
facet.date.end=2012-01-01T00:00:00
facet.date.gap=+1MONTH
timezone=America/Chicago

Currently you would need to do the following, which actually gives close-to-correct results, but not exact. Again, the problem is that the gap of +1MONTH doesn't take daylight saving into account, so blog posts on the edge of ranges are counted in the wrong range:

facet.date=blogPostDate
facet.date.start=2011-01-01T06:00:00Z
facet.date.end=2012-01-01T06:00:00Z
facet.date.gap=+1MONTH

Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Original Estimate: 3h Remaining Estimate: 3h

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
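The by-hand conversion David describes (local midnight in America/Chicago is 06:00Z in winter, 05:00Z in summer) is what users currently do manually for facet.date.start/end. A minimal sketch of that conversion in plain Java (LocalToUtc is a hypothetical helper, not part of Solr):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Convert a wall-clock timestamp in a named zone to the UTC form Solr's
// DateField expects, doing the offset (and DST) arithmetic once.
public class LocalToUtc {
    /** Parse "yyyy-MM-dd'T'HH:mm:ss" as local time in tzId; format as UTC "...Z". */
    public static String toUtc(String localStamp, String tzId) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        in.setTimeZone(TimeZone.getTimeZone(tzId));
        Date instant = in.parse(localStamp);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        return out.format(instant);
    }

    public static void main(String[] args) throws ParseException {
        // Central Standard Time is UTC-6 on Jan 1, so local midnight is 06:00Z.
        System.out.println(toUtc("2011-01-01T00:00:00", "America/Chicago"));  // 2011-01-01T06:00:00Z
    }
}
```

Note the offset is not constant: the same conversion for July 1 gives 05:00Z because CDT is UTC-5, which is why a fixed +6h adjustment to start/end can never make the +1MONTH gap line up all year.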
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 124 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/124/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:639) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:99) at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:174) Build Log (for compile errors): [...truncated 11177 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3030) Block tree terms dict index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3030: --- Attachment: BlockTree.png

The block tree terms dict seems to be working... all tests pass w/ StandardTree codec. There's still more to do (many nocommits), but I think the perf results should be close to what I finally commit:

||Task||QPS base||StdDev base||QPS blocktree||StdDev blocktree||Pct diff||
|IntNRQ|11.58|1.37|10.11|1.77|{color:red}35%{color}-{color:green}16%{color}|
|Term|106.65|3.24|98.84|4.97|{color:red}14%{color}-{color:green}0%{color}|
|Prefix3|30.83|1.36|28.64|2.42|{color:red}18%{color}-{color:green}5%{color}|
|OrHighHigh|5.85|0.15|5.44|0.28|{color:red}14%{color}-{color:green}0%{color}|
|OrHighMed|19.25|0.62|17.91|0.86|{color:red}14%{color}-{color:green}0%{color}|
|Phrase|9.37|0.42|8.87|0.10|{color:red}10%{color}-{color:green}0%{color}|
|TermBGroup1M|44.02|0.90|42.76|1.08|{color:red}7%{color}-{color:green}1%{color}|
|TermGroup1M|37.68|0.65|36.95|0.74|{color:red}5%{color}-{color:green}1%{color}|
|TermBGroup1M1P|47.16|2.94|46.36|0.16|{color:red}7%{color}-{color:green}5%{color}|
|SpanNear|5.60|0.35|5.55|0.29|{color:red}11%{color}-{color:green}11%{color}|
|SloppyPhrase|3.36|0.16|3.34|0.04|{color:red}6%{color}-{color:green}5%{color}|
|Wildcard|35.15|1.30|35.05|2.42|{color:red}10%{color}-{color:green}10%{color}|
|AndHighHigh|10.71|0.22|10.99|0.22|{color:red}1%{color}-{color:green}6%{color}|
|AndHighMed|51.15|1.44|54.31|1.84|{color:green}0%{color}-{color:green}12%{color}|
|Fuzzy1|31.63|0.55|66.15|1.35|{color:green}101%{color}-{color:green}117%{color}|
|PKLookup|40.00|0.75|84.93|5.49|{color:green}94%{color}-{color:green}130%{color}|
|Fuzzy2|33.78|0.82|89.59|2.46|{color:green}151%{color}-{color:green}179%{color}|
|Respell|23.56|1.15|70.89|1.77|{color:green}179%{color}-{color:green}224%{color}|

This is for a multi-segment index, 10 M wikipedia docs, using luceneutil.
These are huge speedups for the terms-dict intensive queries! The two FuzzyQuerys and Respell get the speedup from the directly implemented intersect method, and PKLookup gains because it can often avoid seeking, since block tree's terms index can sometimes rule out terms by their prefix (though this relies on the PK terms being predictable -- I use %09d w/ a counter now; if you instead used something more random looking (GUIDs) I don't think we'd see gains). Block tree terms dict index - Key: LUCENE-3030 URL: https://issues.apache.org/jira/browse/LUCENE-3030 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch Our default terms index today breaks terms into blocks of fixed size (ie, every 32 terms is a new block), and then we build an index on top of that (holding the start term for each block). But, it should be better to instead break terms according to how they share prefixes. This results in variable sized blocks, but means within each block we maximize the shared prefix and minimize the resulting terms index. It should also be a speedup for terms dict intensive queries because the terms index becomes a true prefix trie, and can be used to fast-fail on term lookup (ie returning NOT_FOUND without having to seek/scan a terms block). Having a true prefix trie should also enable much faster intersection with automaton (but this will be a new issue). I've made an initial impl for this (called BlockTreeTermsWriter/Reader). It's still a work in progress... lots of nocommits, and hairy code, but tests pass (at least once!). I made two new codecs, temporarily called StandardTree, PulsingTree, that are just like their counterparts but use this new terms dict. I added a new exactOnly boolean to TermsEnum.seek.
If that's true and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the enum is unpositioned (ie you should not call next(), docs(), etc.). In this approach the index and dict are tightly connected, so it does not support a pluggable index impl like BlockTermsWriter/Reader. Blocks are stored on certain nodes of the prefix trie, and can contain both terms and pointers to sub-blocks (ie, if the block is not a leaf block). So there are two trees, tied to one another -- the index trie, and the blocks. Only certain nodes in the trie map to a block in the block tree. I think this algorithm is similar to burst tries (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), except it allows terms to be stored on inner blocks (not just leaf blocks). This is important for Lucene because an [accidental] adversary could produce a terms dict with way too many
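The variable-sized, prefix-based blocks described above can be illustrated with a toy recursive splitter: keep splitting the sorted term list by the next character of the shared prefix until each block fits. This is only a sketch of the idea, assuming maxBlock >= 1; the real BlockTreeTermsWriter builds an FST-based index and a far more sophisticated block-picking policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy version of prefix-driven blocking: terms sharing `prefix` either fit
// into one block, or are bucketed by the next character and split again.
// The emitted prefixes are what the terms index (a trie) would hold.
public class PrefixBlocks {
    public static void build(List<String> sortedTerms, String prefix,
                             int maxBlock, List<String> blockPrefixes) {
        if (sortedTerms.size() <= maxBlock) {
            blockPrefixes.add(prefix);  // one block covers every term under this prefix
            return;
        }
        TreeMap<Character, List<String>> buckets = new TreeMap<>();
        boolean hasExact = false;
        for (String t : sortedTerms) {
            if (t.length() == prefix.length()) { hasExact = true; continue; } // term == prefix
            buckets.computeIfAbsent(t.charAt(prefix.length()), k -> new ArrayList<>()).add(t);
        }
        if (hasExact) blockPrefixes.add(prefix);  // inner block holding the term equal to the prefix
        for (Map.Entry<Character, List<String>> e : buckets.entrySet()) {
            build(e.getValue(), prefix + e.getKey(), maxBlock, blockPrefixes);
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        build(List.of("car", "cart", "cat", "dog", "dot"), "", 2, out);
        System.out.println(out);  // [car, cat, d]
    }
}
```

Note how the block sizes vary with the data: the dense "ca" region splits down to "car" and "cat", while everything under "d" fits in a single block, so the index stays small where terms are sparse.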
[jira] [Commented] (LUCENE-3030) Block tree terms dict index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078337#comment-13078337 ] Michael McCandless commented on LUCENE-3030: Here's the graph of the results: !BlockTree.png!

Block tree terms dict index - Key: LUCENE-3030 URL: https://issues.apache.org/jira/browse/LUCENE-3030 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
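The fast-fail lookup that LUCENE-3030 describes (returning NOT_FOUND from the terms index alone, without seeking or scanning a terms block) can be sketched with a deliberately simplified index: just a flat list of block prefixes. The real implementation walks an FST, but the contract is the same. FastFail and canRuleOut are illustrative names.

```java
import java.util.List;

// If no block prefix is a prefix of the sought term, no block can contain
// it, so the lookup is NOT_FOUND without any disk seek. Terms whose prefix
// does match still fall through to a (slower) block scan.
public class FastFail {
    /** true if the index alone proves `term` cannot exist in the dictionary. */
    public static boolean canRuleOut(List<String> blockPrefixes, String term) {
        for (String p : blockPrefixes) {
            if (term.startsWith(p)) return false;  // candidate block exists: must scan it
        }
        return true;  // no candidate block: fast NOT_FOUND
    }
}
```

This is why the PKLookup benchmark above improves so much: misses on predictable primary-key terms are usually rejected by the index without touching the dictionary at all.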
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078352#comment-13078352 ] David Schlotfeldt commented on SOLR-2690: - By extending FacetComponent (and having to resort to reflection) I added facet.date.gap.tz. The new parameter only affects the gap. The math done when processing the gap is the largest issue when it comes to date faceting, in my mind. I would be more than happy to provide a patch to add this feature. No, this doesn't address all timezone issues, but at least it would address the main issue that makes date faceting, in my eyes, completely useless. I bet there are 100s of people out there using date faceting who don't realize it does NOT give correct results :)

Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Schlotfeldt Original Estimate: 3h Remaining Estimate: 3h

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2688) switch solr 4.0 example to DirectSpellChecker
[ https://issues.apache.org/jira/browse/SOLR-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078353#comment-13078353 ] Robert Muir commented on SOLR-2688: --- I'll work up a patch, and I might tweak the example a bit for the time being; I'd like to err on the side of performance. Note: with LUCENE-3030, Mike has really sped this guy up again. switch solr 4.0 example to DirectSpellChecker - Key: SOLR-2688 URL: https://issues.apache.org/jira/browse/SOLR-2688 Project: Solr Issue Type: Improvement Components: spellchecker Affects Versions: 4.0 Reporter: Robert Muir For discussion: we might want to switch the Solr 4.0 example to use DirectSpellChecker, which doesn't need an extra index or building/rebuilding. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3343) Comparison operators <, <=, >, >= and = support as RangeQuery syntax in QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078372#comment-13078372 ] Adriano Crestani commented on LUCENE-3343: -- Hi Olivier, I was only able to make your patch work when I merged it with LUCENE-3338; however, LUCENE-3338 is only available for trunk, not 3x. I will need to figure out some other way to make it work on 3x. I plan to work on this soon, probably next weekend.

Comparison operators <, <=, >, >= and = support as RangeQuery syntax in QueryParser Key: LUCENE-3343 URL: https://issues.apache.org/jira/browse/LUCENE-3343 Project: Lucene - Java Issue Type: New Feature Components: modules/queryparser Reporter: Olivier Favre Assignee: Adriano Crestani Priority: Minor Labels: parser, query Fix For: 3.4, 4.0 Attachments: NumCompQueryParser-3x.patch, NumCompQueryParser.patch Original Estimate: 96h Remaining Estimate: 96h

To offer better interoperability with other search engines and to provide an easier and more straightforward syntax, the operators <, <=, >, >= and = should be available to express an open range query. They should at least work for numeric queries. '=' can be made a synonym for ':'.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
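The intended mapping from comparison operators to the existing range syntax can be shown with a simple string-level rewrite. This is only an illustration of the semantics; the actual patch extends the QueryParser grammar rather than pre-rewriting strings, and CompToRange is a hypothetical name.

```java
// Rewrite field/op/value comparisons into classic Lucene range syntax:
// [..] is inclusive, {..} exclusive; '=' behaves like ':'.
public class CompToRange {
    public static String rewrite(String field, String op, String value) {
        switch (op) {
            case "<":  return field + ":{* TO " + value + "}";
            case "<=": return field + ":[* TO " + value + "]";
            case ">":  return field + ":{" + value + " TO *}";
            case ">=": return field + ":[" + value + " TO *]";
            case "=":  return field + ":" + value;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

So "price >= 10" would behave like price:[10 TO *], and "price < 10" like price:{* TO 10}, which is the interoperability win the issue describes.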
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 125 - Still Failing
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/125/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1522) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1427) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:198) Build Log (for compile errors): [...truncated 11154 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078384#comment-13078384 ] James Dyer commented on SOLR-2382: -- Lance, I do not have any scientific benchmarks, but I can tell you how we use BerkleyBackedCache and how it performs for us. In our main app, we fully re-index all our data every night (13+ million records). It's basically a 2-step process. First we run ~50 DIH handlers, each of which builds a cache from databases, flat files, etc. The caches partition the data 8 ways. Then a master DIH script does all the joining, runs transformers on the data, etc. We have all 8 invocations of this same master DIH config running simultaneously, indexing to the same Solr core, so each DIH invocation is processing 1.6 million records directly out of caches, doing all the 1-many joins, running transformer code, indexing, etc. This takes 1-1/2 hours, so maybe 250-300 Solr records get added per second. We're using fast local disks configured with RAID-0 on an 8-core, 64 GB server. This app is running Solr 1.4, using the original version of this patch, prior to my forward-porting it to trunk. No doubt some of the time is spent contending for the Lucene index, as all 8 DIH invocations are indexing at the same time. We also have another app that uses Solr 4.0 with the patch I originally posted back in February, sharing hardware with the main app. This one has about 10 entities and uses a simple 1-dih-handler configuration. The parent entity drives directly off the database while all the child entities use SqlEntityProcessor with BerkleyBackedCache. There are only 25,000 fairly narrow records and we can re-index everything in about 10 minutes. This includes database time, indexing, running transformers, etc., in addition to the cache overhead. The inspiration for this was that we were converting off of Endeca, and we had been relying on Endeca's Forge program to join and denormalize all of the data.
Forge has a very fast disk-backed caching mechanism and I needed to match that performance with DIH. I'm pretty sure what we have here surpasses Forge. And we also get a big bonus in that it lets you persist caches and use them as a subsequent input. With Forge, we had to output the data into huge delimited text files and then use those as input for the next step... Hope this information gives you some idea of whether this will work for your use case.

DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-properties.patch, SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with Child Entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data
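The n+1 select problem named above reduces to a simple pattern: run the child query once, key its rows by the join column, and probe that map per parent instead of issuing a SELECT per parent. A sketch of the core idea (ChildEntityCache is illustrative, not the patch's actual DIHCache interface); the in-memory HashMap shown is precisely the CachedSqlEntityProcessor approach the issue says does not scale, and the patch's contribution is swapping disk-backed implementations in behind the same pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Multimap of child rows keyed by join-key value: one pass over the child
// query builds it, then each parent does an O(1) lookup instead of a SELECT.
public class ChildEntityCache {
    private final Map<String, List<Map<String, Object>>> byKey = new HashMap<>();

    /** Add one child row under its join-key value. */
    public void add(String joinKey, Map<String, Object> row) {
        byKey.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(row);
    }

    /** All child rows for a parent's key; used instead of a per-parent SELECT. */
    public List<Map<String, Object>> lookup(String joinKey) {
        return byKey.getOrDefault(joinKey, Collections.emptyList());
    }
}
```

For n parents this turns n+1 queries into 2 (one parent query, one child query) at the cost of holding the child rows in the cache, which is why a disk-backed backend matters at the 13-million-record scale described above.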
[jira] [Commented] (LUCENE-3030) Block tree terms dict index
[ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078400#comment-13078400 ] Simon Willnauer commented on LUCENE-3030: - bq. These are huge speedups for the terms-dict intensive queries! oh boy! This is awesome! Block tree terms dict index - Key: LUCENE-3030 URL: https://issues.apache.org/jira/browse/LUCENE-3030 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch Our default terms index today breaks terms into blocks of fixed size (ie, every 32 terms is a new block), and then we build an index on top of that (holding the start term for each block). But, it should be better to instead break terms according to how they share prefixes. This results in variable sized blocks, but means within each block we maximize the shared prefix and minimize the resulting terms index. It should also be a speedup for terms dict intensive queries because the terms index becomes a true prefix trie, and can be used to fast-fail on term lookup (ie returning NOT_FOUND without having to seek/scan a terms block). Having a true prefix trie should also enable much faster intersection with automaton (but this will be a new issue). I've made an initial impl for this (called BlockTreeTermsWriter/Reader). It's still a work in progress... lots of nocommits, and hairy code, but tests pass (at least once!). I made two new codecs, temporarily called StandardTree, PulsingTree, that are just like their counterparts but use this new terms dict. I added a new exactOnly boolean to TermsEnum.seek. If that's true and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the enum is unpositioned (ie you should not call next(), docs(), etc.). 
In this approach the index and dict are tightly connected, so it does not support a pluggable index impl like BlockTermsWriter/Reader. Blocks are stored on certain nodes of the prefix trie, and can contain both terms and pointers to sub-blocks (ie, if the block is not a leaf block). So there are two trees, tied to one another -- the index trie, and the blocks. Only certain nodes in the trie map to a block in the block tree. I think this algorithm is similar to burst tries (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), except it allows terms to be stored on inner blocks (not just leaf blocks). This is important for Lucene because an [accidental] adversary could produce a terms dict with way too many blocks (way too much RAM used by the terms index). Still, with my current patch, an adversary can produce too-big blocks... which we may need to fix, by letting the terms index not be a true prefix trie on its leaf edges. Exactly how the blocks are picked can be factored out as its own policy (but I haven't done that yet). Then, burst trie is one policy, my current approach is another, etc. The policy can be tuned to the terms' expected distribution, eg if it's a primary key field and you only use base 10 for each character then you want block sizes of size 10. This can make a sizable difference on lookup cost. I modified the FST Builder to allow for a plugin that freezes the tail (changed suffix) of each added term, because I use this to find the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
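The prefix-splitting idea above can be sketched in toy form. This is not the actual BlockTreeTermsWriter: the class name, method, and the single maxBlock knob are invented for illustration (the real writer uses min/max block-size bounds and an FST index), but it shows how variable-sized blocks fall out of splitting sorted terms on their next character only when a group is too large:

```java
import java.util.*;

public class PrefixBlocks {
    /**
     * Toy version of the idea in LUCENE-3030: instead of a new block every
     * fixed 32 terms, recursively split a group of sorted terms on its next
     * character whenever the group exceeds maxBlock, so each emitted block
     * maximizes its shared prefix.
     */
    static List<List<String>> assignBlocks(List<String> terms, int maxBlock, int depth) {
        List<List<String>> blocks = new ArrayList<>();
        if (terms.size() <= maxBlock) {
            blocks.add(terms);
            return blocks;
        }
        List<String> exhausted = new ArrayList<>();        // terms ending at this depth
        SortedMap<Character, List<String>> groups = new TreeMap<>();
        for (String t : terms) {
            if (t.length() > depth) {
                groups.computeIfAbsent(t.charAt(depth), k -> new ArrayList<>()).add(t);
            } else {
                exhausted.add(t);                           // like a term stored on an inner block
            }
        }
        if (!exhausted.isEmpty()) blocks.add(exhausted);
        for (List<String> g : groups.values()) {
            blocks.addAll(assignBlocks(g, maxBlock, depth + 1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("app", "apple", "apply", "apt",
                                           "bad", "bag", "bat", "cat");
        System.out.println(assignBlocks(terms, 4, 0));
        // [[app, apple, apply, apt], [bad, bag, bat], [cat]]
    }
}
```

With maxBlock=4, the eight terms end up in three blocks of sizes 4, 3 and 1, each with a maximal shared prefix, rather than fixed-size runs; that variable sizing is what makes the index trie small and lets lookups fast-fail.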
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch EasySimilarity now computes norms in the same way as DefaultSimilarity. Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right), what do you reckon? I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices. Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/query/scoring, core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc, gsoc2011 Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. Done: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. 
Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ * The so-called _Information-Based Models_ -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[Lucene.Net] [jira] [Resolved] (LUCENENET-404) Improve brand logo design
[ https://issues.apache.org/jira/browse/LUCENENET-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Troy Howard resolved LUCENENET-404. --- Resolution: Fixed Uploaded the artifacts in r1153264 Improve brand logo design - Key: LUCENENET-404 URL: https://issues.apache.org/jira/browse/LUCENENET-404 Project: Lucene.Net Issue Type: Sub-task Components: Project Infrastructure Reporter: Troy Howard Assignee: Troy Howard Priority: Minor Labels: branding, logo Attachments: lucene-alternates.jpg, lucene-medium.png, lucene-net-logo-display.jpg The existing Lucene.Net logo leaves a lot to be desired. We'd like a new logo that is modern and well designed. To implement this, Troy is coordinating with StackOverflow/StackExchange to manage a logo design contest, the results of which will be our new logo design. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078419#comment-13078419 ] David Schlotfeldt commented on SOLR-2690: - Okay, I've modified my code to now take facet.date.tz instead. The time zone now affects the facet's start, end and gap values. Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Bug Affects Versions: 3.3 Reporter: David Schlotfeldt Original Estimate: 3h Remaining Estimate: 3h Timezone needs to be taken into account when doing date math. Currently it isn't: DateMathParser instances are always constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight savings changes the length of a month. A facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created. All three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved. Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets. Another solution is to expand the syntax of the expressions DateMathParser understands. For example, we could allow (?timeZone=VALUE) to be added anywhere within an expression, where VALUE is the id of the timezone. When DateMathParser reads this in, it sets the timezone on the Calendar it is using. Two examples: - (?timeZone=America/Chicago)NOW/YEAR - (?timeZone=America/Chicago)+1MONTH I would be more than happy to modify DateMathParser and provide a patch. I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used. Thanks! David -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
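The claim that a +1MONTH gap depends on the time zone is easy to verify. Below is a standalone sketch using the modern java.time API purely for illustration (it is not Solr's DateMathParser, which in 3.3 is hard-wired to UTC): one month measured from March 1, 2011 spans 743 real hours in America/Chicago but 744 in UTC, because US daylight saving time began on March 13, 2011.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class MonthGap {
    public static void main(String[] args) {
        // The same "+1MONTH" step taken from March 1, 2011, in two time zones.
        ZonedDateTime chi = ZonedDateTime.of(2011, 3, 1, 0, 0, 0, 0,
                ZoneId.of("America/Chicago"));
        ZonedDateTime utc = ZonedDateTime.of(2011, 3, 1, 0, 0, 0, 0, ZoneOffset.UTC);

        // US DST began March 13, 2011, so the local March is one hour shorter.
        System.out.println(Duration.between(chi, chi.plusMonths(1)).toHours()); // 743
        System.out.println(Duration.between(utc, utc.plusMonths(1)).toHours()); // 744
    }
}
```

So bucket boundaries computed with UTC math land an hour off (or a day off at DST-shifted midnights) for any zone that observes daylight saving, which is exactly why the gap computation needs the user's time zone.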
[jira] [Assigned] (SOLR-2143) Add OpenSearch resources to the bundled example
[ https://issues.apache.org/jira/browse/SOLR-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-2143: - Assignee: (was: Grant Ingersoll) Add OpenSearch resources to the bundled example Key: SOLR-2143 URL: https://issues.apache.org/jira/browse/SOLR-2143 Project: Solr Issue Type: Wish Components: documentation Affects Versions: 4.0 Environment: N/A Reporter: Rich Cariens Fix For: 4.0 Attachments: SOLR-2143.patch Original Estimate: 2h Remaining Estimate: 2h Guidance samples on how to make a Solr instance OpenSearch-compliant feels like it would add value to the user community. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
[ https://issues.apache.org/jira/browse/LUCENE-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078434#comment-13078434 ] Grant Ingersoll commented on LUCENE-2748: - I wonder if the best thing to do here is to simply start fresh and clean and simply leave all existing content up as is and link to it as the old content. Convert all Lucene web properties to use the ASF CMS Key: LUCENE-2748 URL: https://issues.apache.org/jira/browse/LUCENE-2748 Project: Lucene - Java Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll The new CMS has a lot of nice features (and some kinks to still work out) and Forrest just doesn't cut it anymore, so we should move to the ASF CMS: http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078451#comment-13078451 ] Otis Gospodnetic commented on SOLR-2689: Markus - I can't even tell this frange call cuts off any of the hits - you have numFound=227763 in both examples above. Am I missing something? :) !frange with query($qq) sets score=1.0f for all returned documents -- Key: SOLR-2689 URL: https://issues.apache.org/jira/browse/SOLR-2689 Project: Solr Issue Type: Bug Components: search Affects Versions: 3.4 Reporter: Markus Jelsma Fix For: 3.4, 4.0 Consider the following queries; both query the default field for 'test' and return the document digest and score (I don't seem to be able to get only the score; fl=score returns all fields). This is a normal query and yields normal results with proper scores:
{code}
q=test&fl=digest,score
{code}
{code}
<result name="response" numFound="227763" start="0" maxScore="4.952673">
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">4.952673</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
  ...
</result>
{code}
This query uses frange with query() to limit the number of returned documents. When using multiple search terms I can indeed cut off the result set, but in the end all returned documents have score=1.0f. The final result set cannot be sorted by score anymore. The result set seems to be returned in the order of Lucene docIds.
{code}
q={!frange l=1.23}query($qq)&qq=test&fl=digest,score
{code}
{code}
<result name="response" numFound="227763" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="digest">c48e784f06a051d89f20b72194b0dcf0</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">7f78a504b8cbd86c6cdbf2aa2c4ae5e3</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="digest">0f7fefa6586ceda42fc1f095d460aa17</str>
  </doc>
  ...
</result>
{code}
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Did solr.xml persistence break?
Hi, With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. Can it be related to SOLR-2331? I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun I'm starting Solr with four cores listed in solr.xml:
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>
I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST And the solr.xml turns into:
<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-1692: - Assignee: (was: Grant Ingersoll) CarrotClusteringEngine produce summary does nothing --- Key: SOLR-1692 URL: https://issues.apache.org/jira/browse/SOLR-1692 Project: Solr Issue Type: Bug Components: contrib - Clustering Reporter: Grant Ingersoll Fix For: 3.4, 4.0 Attachments: SOLR-1692.patch In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078467#comment-13078467 ] Markus Jelsma commented on SOLR-2689: - You are right; it's because both examples use one search term and thus all hits have the same score. It shows when not all scores are identical, i.e. when you use multiple terms. I'll provide a better description and example next week when I get back. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2689) !frange with query($qq) sets score=1.0f for all returned documents
[ https://issues.apache.org/jira/browse/SOLR-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078522#comment-13078522 ] Hoss Man commented on SOLR-2689: I don't really understand why this is a bug? frange is the FunctionRangeQParserPlugin which produces ConstantScoreRangeQueries -- it doesn't matter when/how/why it's used (or that the function it's wrapping comes from an arbitrary query), it always produces range queries that generate constant scores. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
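Since frange always produces constant scores, the usual way to get both a score cut-off and normal relevancy ranking (a suggestion, not something stated in this thread) is to keep the scoring query in q and move the frange into a filter query, because fq restricts the result set without contributing to scores:

```
q=test&fl=digest,score&fq={!frange l=1.23}query($qq)&qq=test
```

Here the documents are ranked by the score of q=test, while the frange filter drops everything whose query($qq) value is below 1.23.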
[jira] [Created] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
solr.xml persistence is broken for multicore (by SOLR-2331) --- Key: SOLR-2691 URL: https://issues.apache.org/jira/browse/SOLR-2691 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 4.0 Reporter: Yury Kats Priority: Critical With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. It appears to have been introduced by SOLR-2331: CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list. I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun I'm starting Solr with four cores listed in solr.xml:
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>
I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST And the solr.xml turns into:
<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
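The root cause described above is a classic aliasing bug: one mutable map created outside the loop and shared across iterations. A minimal standalone sketch (not the actual CoreContainer code; the class and method names are invented) reproduces it and shows the one-line fix:

```java
import java.util.*;

public class PersistBug {
    // Buggy shape: one map created before the loop, so the list ends up
    // holding N references to the same map, carrying the last core's values.
    static List<Map<String, String>> buggy(String[] cores) {
        List<Map<String, String>> all = new ArrayList<>();
        Map<String, String> coreAttribs = new HashMap<>(); // created once -- the bug
        for (String core : cores) {
            coreAttribs.put("name", core);                 // overwrites the shared map
            all.add(coreAttribs);
        }
        return all;
    }

    // Fixed shape (what moving the map inside the loop accomplishes).
    static List<Map<String, String>> fixed(String[] cores) {
        List<Map<String, String>> all = new ArrayList<>();
        for (String core : cores) {
            Map<String, String> coreAttribs = new HashMap<>(); // fresh map per core
            coreAttribs.put("name", core);
            all.add(coreAttribs);
        }
        return all;
    }

    public static void main(String[] args) {
        String[] cores = {"master1", "master2", "slave1", "slave2"};
        System.out.println(PersistBug.buggy(cores).get(0).get("name")); // slave2
        System.out.println(PersistBug.fixed(cores).get(0).get("name")); // master1
    }
}
```

In the buggy version every entry reports the last core's name, which is exactly the "last core repeated four times" symptom seen in the persisted solr.xml.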
[jira] [Commented] (SOLR-2331) Refactor CoreContainer's SolrXML serialization code and improve testing
[ https://issues.apache.org/jira/browse/SOLR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078526#comment-13078526 ] Yury Kats commented on SOLR-2331: - Looks like this introduced a regression in solr.xml persistence. See SOLR-2691. Refactor CoreContainer's SolrXML serialization code and improve testing --- Key: SOLR-2331 URL: https://issues.apache.org/jira/browse/SOLR-2331 Project: Solr Issue Type: Improvement Components: multicore Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 4.0 Attachments: SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331-fix-windows-file-deletion-failure.patch, SOLR-2331.patch CoreContainer has enough code in it - I'd like to factor out the solr.xml serialization code into SolrXMLSerializer or something - which should make testing it much easier and lightweight. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yury Kats updated SOLR-2691: Attachment: jira2691.patch Patch. Create map of attributes inside the loop. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Did solr.xml persistence break?
On 8/2/2011 5:42 PM, Yury Kats wrote: It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. Can it be related to SOLR-2331? Looking at the code, it does seem like a regression from SOLR-2331. CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list. I opened SOLR-2691 to track and attached a patch. Would appreciate a quick look from a committer. Thanks! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Did solr.xml persistence break?
: I opened SOLR-2691 to track and attached a patch. : : Would appreciate a quick look from a committer. Thanks! I'm not too familiar with that code, but I can definitely reproduce the bug ... I'll take a look at the existing tests and see if I can help out with your patch. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078538#comment-13078538 ] David Schlotfeldt commented on SOLR-2690: - Being able to specify dates in timezones other than GMT+0 isn't a problem. It would just be nice, but we can ignore that. The time zone the DateMathParser is configured with is the issue (which it sounds like you understand). My solution, which changes the timezone DateMathParser is constructed with in SimpleFacet to parse start, end and gap, isn't ideal. I went this route because I don't want to run a custom-built Solr -- my solution allowed me to fix the bug by simply replacing the facet SearchComponent. Affecting all DateMathParsers created for the length of the request is what is really needed (which is what you said). I like your approach. It sounds like we are on the same page. So, can we get this added? :) Without the time zone affecting DateMathParser, the date faceting is useless (at least for 100% of the situations I would use it for). By the way, I'm glad to see how many responses there have been. I'm happy to see how active this project is. :) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
CHANGES.txt for modules
I can see that the descriptions of changes made to the modules are still in contrib/CHANGES.txt. Are they going to be moved to a modules/CHANGES.txt in the future?
[jira] [Commented] (LUCENE-2979) Simplify configuration API of contrib Query Parser
[ https://issues.apache.org/jira/browse/LUCENE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13078549#comment-13078549 ] Adriano Crestani commented on LUCENE-2979: -- Hi Phillipe, Thanks for the patch. I just applied your patch for 3x. It looks good. As you removed TestAttributes, can you create another JUnit test to check whether the configuration is updated when an attribute (like CharTermAttribute) is updated? That is basically the new functionality of the newly deprecated query parser attributes. Simplify configuration API of contrib Query Parser -- Key: LUCENE-2979 URL: https://issues.apache.org/jira/browse/LUCENE-2979 Project: Lucene - Java Issue Type: Improvement Components: modules/other Affects Versions: 2.9, 3.0 Reporter: Adriano Crestani Assignee: Adriano Crestani Labels: api-change, gsoc, gsoc2011, lucene-gsoc-11, mentor Fix For: 3.4, 4.0 Attachments: LUCENE-2979_phillipe_ramalho_2.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_3.patch, LUCENE-2979_phillipe_ramalho_4_3x.patch, LUCENE-2979_phillipe_ramalho_4_trunk.patch, LUCENE-2979_phillipe_reamalho.patch The current configuration API is very complicated and inherits the concept used by the Attribute API to store token information in token streams. However, the requirements for both (QP config and token stream) are not the same, so they shouldn't be using the same thing. I propose to simplify the QP config and make it less scary for people intending to use the contrib QP. The task is not difficult; it will just require a lot of code change and figuring out the best way to do it. That's why it's a good candidate for a GSoC project. I would like to hear good proposals about how to make the API more friendly and less scary :) -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-2691: -- Fix Version/s: 4.0 Assignee: Mark Miller solr.xml persistence is broken for multicore (by SOLR-2331) --- Key: SOLR-2691 URL: https://issues.apache.org/jira/browse/SOLR-2691 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 4.0 Reporter: Yury Kats Assignee: Mark Miller Priority: Critical Fix For: 4.0 Attachments: jira2691.patch With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. It appears to have been introduced by SOLR-2331: CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list. 
I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

{code:xml}
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>
{code}

I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:

{code:xml}
<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>
{code}

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
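The root cause described above (an attribute map allocated outside the core loop, so every core ends up sharing the last core's values) can be illustrated with a minimal sketch. This is not Solr's actual CoreContainer code; the class and method names are hypothetical, and only the bug's shape is reproduced:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PersistSketch {
    // Buggy shape: one attribute map is allocated outside the loop and reused,
    // so every persisted entry points at the same map, which holds the last
    // core's values by the time persistence writes them out.
    static List<Map<String, String>> persistBuggy(List<String> coreNames) {
        List<Map<String, String>> persisted = new ArrayList<>();
        Map<String, String> coreAttribs = new HashMap<>(); // shared across iterations
        for (String name : coreNames) {
            coreAttribs.put("name", name);
            persisted.add(coreAttribs); // same map object added every time
        }
        return persisted;
    }

    // Fixed shape: a fresh map per iteration preserves each core's attributes.
    static List<Map<String, String>> persistFixed(List<String> coreNames) {
        List<Map<String, String>> persisted = new ArrayList<>();
        for (String name : coreNames) {
            Map<String, String> coreAttribs = new HashMap<>();
            coreAttribs.put("name", name);
            persisted.add(coreAttribs);
        }
        return persisted;
    }

    public static void main(String[] args) {
        List<String> cores = Arrays.asList("master1", "master2", "slave1", "slave2");
        System.out.println(persistBuggy(cores)); // all four entries show the last core
        System.out.println(persistFixed(cores)); // each entry keeps its own name
    }
}
```

The buggy output mirrors the solr.xml above: four identical entries, all carrying the last core's attributes.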
[jira] [Updated] (SOLR-2691) solr.xml persistence is broken for multicore (by SOLR-2331)
[ https://issues.apache.org/jira/browse/SOLR-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-2691: --- Attachment: SOLR-2691.patch Patch of persistence tests at the CoreContainer level (since that's where the bug was) that incorporates Yury's fix. The assertions could definitely be beefed up to sanity-check more aspects of the serialization, and we should really also be testing that load works and parses all of these things back in the expected way, but it's a start. The thing that's currently hanging me up is that somehow the test is leaking a SolrIndexSearcher reference. I thought maybe it was because of the SolrCores I was creating+registering and then ignoring, but if I try to close them I get an error about too many decrefs instead. I'll let miller figure it out. solr.xml persistence is broken for multicore (by SOLR-2331) --- Key: SOLR-2691 URL: https://issues.apache.org/jira/browse/SOLR-2691 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 4.0 Reporter: Yury Kats Assignee: Mark Miller Priority: Critical Fix For: 4.0 Attachments: SOLR-2691.patch, jira2691.patch With the trunk build, running SolrCloud, if I issue a PERSIST CoreAdmin command, the solr.xml gets overwritten with only the last core, repeated as many times as there are cores. It used to work fine with a trunk build from a couple of months ago, so it looks like something broke solr.xml persistence. It appears to have been introduced by SOLR-2331: CoreContainer#persistFile creates the map for core attributes (coreAttribs) outside of the loop that iterates over cores. Therefore, all cores reuse the same map of attributes and hence only the values from the last core are preserved and used for all cores in the list. 
I'm running SolrCloud, using: -Dbootstrap_confdir=/opt/solr/solr/conf -Dcollection.configName=hcpconf -DzkRun

I'm starting Solr with four cores listed in solr.xml:

{code:xml}
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="master1">
    <core name="master1" instanceDir="master1" shard="shard1" collection="hcpconf" />
    <core name="master2" instanceDir="master2" shard="shard2" collection="hcpconf" />
    <core name="slave1" instanceDir="slave1" shard="shard1" collection="hcpconf" />
    <core name="slave2" instanceDir="slave2" shard="shard2" collection="hcpconf" />
  </cores>
</solr>
{code}

I then issue a PERSIST request: http://localhost:8983/solr/admin/cores?action=PERSIST

And the solr.xml turns into:

{code:xml}
<solr persistent="true">
  <cores defaultCoreName="master1" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983" hostContext="solr">
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
    <core shard="shard2" instanceDir="slave2/" name="slave2" collection="hcpconf"/>
  </cores>
</solr>
{code}

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2690) Date Faceting or Range Faceting doesn't take timezone into account.
[ https://issues.apache.org/jira/browse/SOLR-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078578#comment-13078578 ] David Smiley commented on SOLR-2690: Hoss, thanks for elaborating on the distinction between the date literal and the DateMath timezone. I was conflating these issues in my mind -- silly me. Date Faceting or Range Faceting doesn't take timezone into account. --- Key: SOLR-2690 URL: https://issues.apache.org/jira/browse/SOLR-2690 Project: Solr Issue Type: Improvement Affects Versions: 3.3 Reporter: David Schlotfeldt Original Estimate: 3h Remaining Estimate: 3h Timezone needs to be taken into account when doing date math; currently it isn't. DateMathParser instances are always constructed with UTC. This is a huge issue when it comes to faceting. Depending on your timezone, daylight-savings changes the length of a month, so a facet gap of +1MONTH is different depending on the timezone and the time of the year. I believe the issue is very simple to fix. There are three places in the code where DateMathParser is created, and all three are configured with the timezone being UTC. If a user could specify the TimeZone to pass into DateMathParser, this faceting issue would be resolved. Though it would be nice if we could always specify the timezone DateMathParser uses (since date math DOES depend on timezone), it's really only essential that we can affect the DateMathParser that SimpleFacets uses when dealing with the gap of the date facets. Another solution is to expand the syntax of the expressions DateMathParser understands. For example, we could allow (?timeZone=VALUE) to be added anywhere within an expression, where VALUE is the id of the timezone. When DateMathParser reads this in, it sets the timezone on the Calendar it is using. Two examples: - (?timeZone=America/Chicago)NOW/YEAR - (?timeZone=America/Chicago)+1MONTH I would be more than happy to modify DateMathParser and provide a patch. 
I just need a committer to agree this needs to be resolved, and a decision needs to be made on the syntax used. Thanks! David -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
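The timezone dependence described above can be demonstrated with a minimal sketch. This is not DateMathParser itself; the class and method names below are hypothetical, and only the "round down to /DAY in a given zone" behaviour is mimicked using java.util.Calendar:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class DateMathTzSketch {
    // Round a timestamp down to the start of its day in the given zone,
    // mimicking what a timezone-aware "/DAY" operation in date math would do.
    static long startOfDay(long epochMillis, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTimeInMillis(epochMillis);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        long t = 1312934400000L; // 2011-08-10T00:00:00Z
        long utcDay = startOfDay(t, TimeZone.getTimeZone("UTC"));
        long chicagoDay = startOfDay(t, TimeZone.getTimeZone("America/Chicago"));
        // Chicago is on CDT (UTC-5) in August, so the two "same day" boundaries
        // fall at different instants; with UTC hardcoded, facet buckets would
        // be misaligned for Chicago users.
        System.out.println((utcDay - chicagoDay) / 3600000L + " hours apart");
    }
}
```

The same effect is what makes a +1MONTH gap timezone-sensitive: crossing a DST transition changes the month's length in wall-clock terms.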
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 129 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/129/ 2 tests failed. REGRESSION: org.apache.solr.client.solrj.embedded.MultiCoreExampleJettyTest.testMultiCore Error Message: Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.FileNotFoundException: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock (No such file or directory) at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392) at org.apache.solr.core.SolrCore.init(SolrCore.java:562) at org.apache.solr.core.SolrCore.init(SolrCore.java:509) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104) at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140) at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940) at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895) at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207) at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98) at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.FileNotFoundException: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk-java7/checkout/solr/example/multicore/core0/data/index/org.apache.solr.core.RefCntRamDirectory@7e96f890 lockFactory=org.apache.lucene.store.simplefslockfact...@4b905345-write.lock (No such file or directory) at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:392) at org.apache.solr.core.SolrCore.init(SolrCore.java:562) at org.apache.solr.core.SolrCore.init(SolrCore.java:509) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104) at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140) at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940) at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895) at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207) at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98) at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52) at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(Jet request: http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2 Stack Trace: request: http://localhost:27238/example/core0/update?commit=true&waitSearcher=true&wt=javabin&version=2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:434) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:104) at
[jira] [Created] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to Key: LUCENE-3358 URL: https://issues.apache.org/jira/browse/LUCENE-3358 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.3 Reporter: Trejkaz Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour for tokenising hiragana if combining marks are in use. Here's a unit test:

{code}
@Test
public void testHiraganaWithCombiningMarkDakuten() throws Exception {
    // Hiragana 'S' followed by the combining mark dakuten
    TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));

    // Should be kept together.
    List<String> expectedTokens = Arrays.asList("\u3055\u3099");
    List<String> actualTokens = new LinkedList<String>();
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        actualTokens.add(term.toString());
    }
    assertEquals("Wrong tokens", expectedTokens, actualTokens);
}
{code}

This code fails with:

{noformat}
java.lang.AssertionError: Wrong tokens expected:[ざ] but was:[さ]
{noformat}

It seems as if the tokeniser is throwing away the combining mark entirely. 3.0's behaviour was also undesirable:

{noformat}
java.lang.AssertionError: Wrong tokens expected:[ざ] but was:[さ, ゙]
{noformat}

But at least the token was there, so it was possible to write a filter to work around the issue. Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run seem to be lumped into a single token (this is a problem in its own right, but I'm not sure if it's really a bug). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
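One possible workaround (my suggestion, not something proposed in the issue): canonically composing the input before tokenization turns さ + combining dakuten (U+3055 U+3099) into the single precomposed character ざ (U+3056), so the tokenizer never sees a bare combining mark. A minimal sketch using the JDK's java.text.Normalizer, with a hypothetical class name:

```java
import java.text.Normalizer;

public class DakutenComposeSketch {
    // NFC-compose combining marks onto the preceding kana where a precomposed
    // form exists; run this over input text before handing it to the tokenizer.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "\u3055\u3099";     // さ + combining dakuten, 2 code points
        String composed = compose(decomposed);  // ざ (U+3056), 1 code point
        System.out.println(composed.length());  // prints 1
    }
}
```

In a Lucene analysis chain this would naturally live in a CharFilter applied before the tokenizer, though it only helps where Unicode defines a precomposed form.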