[jira] [Created] (LUCENE-3968) Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens
Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens

Key: LUCENE-3968
URL: https://issues.apache.org/jira/browse/LUCENE-3968
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

MockGraphTokenFilter is rather hairy... I've managed to simplify it (I think!) by breaking apart its two functions... I think LookaheadTokenFilter can be used in the future for other graph-aware filters.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3966) smokeTestRelease should accept a local (file://) staging URL
smokeTestRelease should accept a local (file://) staging URL

Key: LUCENE-3966
URL: https://issues.apache.org/jira/browse/LUCENE-3966
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless

I'll also fix buildAndPushRelease so it can push to a local URL; this way at any time we can build, push to local staging, and run the smoke tester on it, and hopefully nothing fails... But really any tests in the smoke tester should ideally be pushed back earlier in our dev process (into jenkins, into ant test).
[jira] [Created] (LUCENE-3942) SynonymFilter should set pos length att
SynonymFilter should set pos length att

Key: LUCENE-3942
URL: https://issues.apache.org/jira/browse/LUCENE-3942
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

Tokenizers/Filters can now produce graphs instead of a single linear chain of tokens, by setting the PositionLengthAttribute, expressing where (how many positions ahead) this token ends. The default is 1, meaning it ends at the next position, to be backwards compatible. SynonymFilter produces graph output tokens, as long as the output is a single token, but currently never sets the pos length to express this. EG for the rule wifi network -> hotspot, the hotspot token should have pos length = 2. With LUCENE-3940 this will allow us to verify that the offsets for such tokens are correct...
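The bookkeeping described above can be illustrated with a tiny standalone model (hypothetical types, not Lucene's attribute API): the synonym output starts at the same position as the first input token (position increment 0) and spans two positions (position length 2).

```java
import java.util.List;

// Toy model (NOT Lucene's attribute API) of the token graph for the rule
// "wifi network -> hotspot": posInc says where a token starts relative to
// the previous token, posLen says how many positions it spans.
public class PosLengthDemo {
    record Token(String term, int posInc, int posLen) {}

    static List<Token> wifiNetworkGraph() {
        return List.of(
            new Token("wifi", 1, 1),    // starts a new position (0), spans 1
            new Token("hotspot", 0, 2), // same position as "wifi", spans 2 positions
            new Token("network", 1, 1)  // next position (1), spans 1
        );
    }

    public static void main(String[] args) {
        for (Token t : wifiNetworkGraph()) {
            System.out.println(t.term() + " posInc=" + t.posInc() + " posLen=" + t.posLen());
        }
    }
}
```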
[jira] [Created] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Key: LUCENE-3940
URL: https://issues.apache.org/jira/browse/LUCENE-3940
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 4.0

I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the decompounding of a compound token has a punctuation token that's dropped. In this case we should leave hole(s) so that the graph is intact, ie, the graph should look the same as if the punctuation tokens were not initially removed, but then a StopFilter had removed them. This also affects tokens that have no compound over them, ie we fail to leave a hole today when we remove the punctuation tokens. I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute...
[jira] [Created] (LUCENE-3912) Improved the checked-in tiny line file docs
Improved the checked-in tiny line file docs

Key: LUCENE-3912
URL: https://issues.apache.org/jira/browse/LUCENE-3912
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

I think it may not have any surrogate pairs (it was derived from Europarl).
[jira] [Created] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
HTMLStripCharFilter produces invalid final offset

Key: LUCENE-3913
URL: https://issues.apache.org/jira/browse/LUCENE-3913
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.
[jira] [Created] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
BaseTokenStreamTestCase should test analyzers on real-ish content

Key: LUCENE-3905
URL: https://issues.apache.org/jira/browse/LUCENE-3905
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well.
[jira] [Created] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
Improve the Edge/NGramTokenizer/Filters

Key: LUCENE-3907
URL: https://issues.apache.org/jira/browse/LUCENE-3907
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

Our ngram tokenizers/filters could use some love. EG, they output ngrams in multiple passes, instead of stacked, which messes up offsets/positions and requires too much buffering (can hit OOME for long tokens). They clip at 1024 chars (tokenizers) but the token filters don't. They split up surrogate pairs incorrectly.
[jira] [Created] (LUCENE-3890) GroupFacetCollectorTest nightly build failure
GroupFacetCollectorTest nightly build failure

Key: LUCENE-3890
URL: https://issues.apache.org/jira/browse/LUCENE-3890
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 4.0

Failure from nightly build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/2022/testReport/junit/org.apache.lucene.search.grouping/GroupFacetCollectorTest/testRandom/

It reproduces for me with:

{noformat}
ant test -Dtestcase=GroupFacetCollectorTest -Dtestmethod=testRandom -Dtests.seed=7d227aa075b7bfb8:550d2a0828ce2537:-3553c99f6a4d293e -Dtests.multiplier=3 -Dargs="-Dfile.encoding=US-ASCII"
{noformat}
[jira] [Created] (LUCENE-3891) Documents loaded at search time (IndexReader.document) should be a different class from the index-time Document
Documents loaded at search time (IndexReader.document) should be a different class from the index-time Document

Key: LUCENE-3891
URL: https://issues.apache.org/jira/browse/LUCENE-3891
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

The fact that the Document you can load at search time is the same Document class you had indexed is horribly trappy in Lucene, because the loaded document necessarily loses information like field boost, whether a field was tokenized, etc. (See LUCENE-3854 for a recent example). We should fix this, statically, so that it's an entirely different class at search time vs index time.
[jira] [Created] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Key: LUCENE-3892
URL: https://issues.apache.org/jira/browse/LUCENE-3892
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0

On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
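As a rough standalone sketch of the frame-of-reference (FOR) family named above (my own toy code, not any of the attached patches): each value in a block is packed with a single fixed bit width; PFOR-style variants would additionally store a few oversized values as patched exceptions so the common width stays small.

```java
// Toy FOR bit packing, assuming bits < 64 and that every value fits in
// 'bits' bits. Real PFOR/PFORDelta variants patch oversized "exception"
// values separately instead of widening the whole block.
public class ForBlock {
    static long[] pack(long[] values, int bits) {
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6, off = bitPos & 63;
            packed[word] |= values[i] << off;
            if (off + bits > 64) {                   // value straddles two words
                packed[word + 1] |= values[i] >>> (64 - off);
            }
        }
        return packed;
    }

    static long unpack(long[] packed, int bits, int i) {
        int bitPos = i * bits;
        int word = bitPos >>> 6, off = bitPos & 63;
        long v = packed[word] >>> off;
        if (off + bits > 64) {                       // gather the spilled high bits
            v |= packed[word + 1] << (64 - off);
        }
        return v & ((1L << bits) - 1);
    }
}
```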
[jira] [Created] (LUCENE-3893) TermsFilter should use AutomatonQuery
TermsFilter should use AutomatonQuery

Key: LUCENE-3893
URL: https://issues.apache.org/jira/browse/LUCENE-3893
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless

I think we could see perf gains if TermsFilter sorted the terms, built a minimal automaton, and used TermsEnum.intersect to visit the terms... This idea came up on the dev list recently.
[jira] [Created] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
Make BaseTokenStreamTestCase a bit more evil

Key: LUCENE-3894
URL: https://issues.apache.org/jira/browse/LUCENE-3894
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader...
[jira] [Created] (LUCENE-3877) Lucene should not call System.out.println
Lucene should not call System.out.println

Key: LUCENE-3877
URL: https://issues.apache.org/jira/browse/LUCENE-3877
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...?
[jira] [Created] (LUCENE-3872) Index changes are lost if you call prepareCommit() then close()
Index changes are lost if you call prepareCommit() then close()

Key: LUCENE-3872
URL: https://issues.apache.org/jira/browse/LUCENE-3872
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

You are supposed to call commit() after calling prepareCommit(), but... if you forget, and call close() after prepareCommit() without calling commit(), then any changes done after the prepareCommit() are silently lost (including adding/deleting docs, but also any completed merges). Spinoff from java-user thread "lots of .cfs (compound files) in the index directory" from Tim Bogaert. I think to fix this, IW.close should throw an IllegalStateException if prepareCommit() was called with no matching call to commit().
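The proposed guard could look roughly like this (a minimal standalone sketch, not Lucene's actual IndexWriter): close() refuses to proceed while a prepared commit is still pending.

```java
// Minimal sketch of the proposed close() guard: a pending prepared commit
// must be resolved by commit() (or rollback()) before close() is allowed,
// so changes can no longer be silently dropped.
public class CommitGuard {
    private boolean commitPending = false;

    public void prepareCommit() { commitPending = true;  /* flush + write pending segments */ }
    public void commit()        { commitPending = false; /* make the prepared commit visible */ }
    public void rollback()      { commitPending = false; /* discard the prepared commit */ }

    public void close() {
        if (commitPending) {
            throw new IllegalStateException(
                "cannot close: prepareCommit() was not followed by commit() or rollback()");
        }
    }
}
```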
[jira] [Created] (LUCENE-3870) VarDerefBytesImpl doc values prefix length may fall across two pages
VarDerefBytesImpl doc values prefix length may fall across two pages

Key: LUCENE-3870
URL: https://issues.apache.org/jira/browse/LUCENE-3870
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 4.0
Reporter: Michael McCandless
Fix For: 4.0

The VarDerefBytesImpl doc values encodes the unique byte[] with prefix (1 or 2 bytes) first, followed by bytes, so that it can use PagedBytes.fillSliceWithPrefix. It does this itself rather than using PagedBytes.copyUsingLengthPrefix... The problem is, it can write an invalid 2 byte prefix spanning two blocks (ie, last byte of block N and first byte of block N+1), which fillSliceWithPrefix won't decode correctly.
[jira] [Created] (LUCENE-3846) Fuzzy suggester
Fuzzy suggester

Key: LUCENE-3846
URL: https://issues.apache.org/jira/browse/LUCENE-3846
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Would be nice to have a suggester that can handle some fuzziness (like spell correction) so that it's able to suggest completions that are near what you typed. As a first go at this, I implemented 1T (ie up to 1 edit, including a transposition), except the first letter must be correct. But there is a penalty, ie, the corrected suggestion needs to have a much higher freq than the exact match suggestion before it can compete. Still tons of nocommits, and somehow we should merge this / make it work with analyzing suggester too (LUCENE-3842).
[jira] [Created] (LUCENE-3829) Lucene40 codec's DocValues DirectSource impls aren't thread-safe
Lucene40 codec's DocValues DirectSource impls aren't thread-safe

Key: LUCENE-3829
URL: https://issues.apache.org/jira/browse/LUCENE-3829
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 4.0

Our DirectSource impls hold IndexInput(s) open against the dat/idx files, which we then seek + read when loading a specific document's value. But this is in no way protected against multiple threads I think...?
[jira] [Created] (LUCENE-3824) TermOrdVal/DocValuesComparator does too much work in compareBottom
TermOrdVal/DocValuesComparator does too much work in compareBottom

Key: LUCENE-3824
URL: https://issues.apache.org/jira/browse/LUCENE-3824
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.6, 4.0

We now have logic to fall back to by-value comparison, when the bottom slot is not from the current reader. But this is silly, because if the bottom slot is from a different reader, it means the tie-break case is not possible (since the current reader didn't have the bottom value), so when the incoming ord equals the bottom ord we should always return a non-zero result. I added a new random string sort test case to TestSort... I also renamed DocValues.SortedSource.getByValue -> getOrdByValue and cleaned up some whitespace.
[jira] [Created] (LUCENE-3769) Simplify NRTManager
Simplify NRTManager

Key: LUCENE-3769
URL: https://issues.apache.org/jira/browse/LUCENE-3769
Project: Lucene - Java
Issue Type: Improvement
Components: core/search
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

NRTManager is hairy now, because the applyDeletes is separately passed to ctor, passed to maybeReopen, passed to getSearcherManager, etc. I think, instead, you should pass it only to the ctor, and if you have some cases needing deletes and others not then you can make two NRTManagers. This should be no less efficient than we have today, just simpler. I think it will also enable NRTManager to subclass ThingyManager (LUCENE-3761).
[jira] [Created] (LUCENE-3766) Remove/deprecate Tokenizer's default ctor
Remove/deprecate Tokenizer's default ctor

Key: LUCENE-3766
URL: https://issues.apache.org/jira/browse/LUCENE-3766
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 3.6, 4.0

I was working on a new Tokenizer... and I accidentally forgot to call super(input) (and super.reset(input) from my reset method)... which then meant my correctOffset() calls were silently a no-op; this is very trappy. Fortunately the awesome BaseTokenStreamTestCase caught this (I hit failures because the offsets were not in fact being corrected). One minimal thing we can do (but it sounds like from Robert there may be reasons why we can't) is add {{assert input != null}} in Tokenizer.correctOffset:

{noformat}
Index: lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(revision 1242316)
+++ lucene/core/src/java/org/apache/lucene/analysis/Tokenizer.java	(working copy)
@@ -82,6 +82,7 @@
    * @see CharStream#correctOffset
    */
   protected final int correctOffset(int currentOff) {
+    assert input != null : "subclass failed to call super(Reader) or super.reset(Reader)";
     return (input instanceof CharStream) ? ((CharStream) input).correctOffset(currentOff) : currentOff;
   }
{noformat}

But best would be to remove the default ctor that leaves input null...
[jira] [Created] (LUCENE-3767) Explore streaming Viterbi search in Kuromoji
Explore streaming Viterbi search in Kuromoji

Key: LUCENE-3767
URL: https://issues.apache.org/jira/browse/LUCENE-3767
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

I've been playing with the idea of changing the Kuromoji viterbi search to be 2 passes (intersect, backtrace) instead of 4 passes (break into sentences, intersect, score, backtrace)... this is very much a work in progress, so I'm just getting my current state up. It's got tons of nocommits, doesn't properly handle the user dict nor extended modes yet, etc. One thing I'm playing with is to add a double backtrace for the long compound tokens, ie, instead of penalizing these tokens so that shorter tokens are picked, leave the scores unchanged but on backtrace take that penalty and use it as a threshold for a 2nd best segmentation...
[jira] [Created] (LUCENE-3760) Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()
Cleanup DR.getCurrentVersion/DR.getUserData/DR.getIndexCommit().getUserData()

Key: LUCENE-3760
URL: https://issues.apache.org/jira/browse/LUCENE-3760
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Spinoff from Ryan's dev thread "DR.getCommitUserData() vs DR.getIndexCommit().getUserData()"... these methods are confusing/dups right now.
[jira] [Created] (LUCENE-3756) Don't allow IndexWriterConfig setters to chain
Don't allow IndexWriterConfig setters to chain

Key: LUCENE-3756
URL: https://issues.apache.org/jira/browse/LUCENE-3756
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless

Spinoff from LUCENE-3736. I don't like that IndexWriterConfig's setters are chainable; it results in code in our tests like this:

{noformat}
IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
    TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMaxBufferedDocs(2).setMergePolicy(newLogMergePolicy()));
{noformat}

I think in general we should avoid chaining since it encourages hard to read code (code is already hard enough to read!).
[jira] [Created] (LUCENE-3738) Be consistent about negative vInt/vLong
Be consistent about negative vInt/vLong

Key: LUCENE-3738
URL: https://issues.apache.org/jira/browse/LUCENE-3738
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0

Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had this assert been in since the beginning we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!).
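For reference, a self-contained sketch of the standard vInt scheme (7 payload bits per byte, high bit as a continuation flag) shows the trap described above: the unsigned shift keeps a sign-extended negative int "large", so it always costs the worst-case 5 bytes.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the vInt encoding discussed above: emit the low 7 bits per
// byte, setting the high bit while more significant bits remain. A
// negative int's sign-extended high bits force the 5-byte worst case.
public class VIntDemo {
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {        // more than 7 significant bits left
            out.write((i & 0x7F) | 0x80); // low 7 bits + continuation flag
            i >>>= 7;                     // unsigned shift: negatives stay "large"
        }
        out.write(i);
        return out.toByteArray();
    }
}
```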
[jira] [Created] (LUCENE-3742) SynFilter doesn't set offsets for outputs that hang off the end of the input tokens
SynFilter doesn't set offsets for outputs that hang off the end of the input tokens

Key: LUCENE-3742
URL: https://issues.apache.org/jira/browse/LUCENE-3742
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0
Attachments: LUCENE-3742.patch

If you have syn rule a -> x y and input a then output is a/x y but... what should y's offsets be? Right now we set to 0/0.
[jira] [Created] (LUCENE-3729) Allow using FST to hold terms data in DocValues.BYTES_*_SORTED
Allow using FST to hold terms data in DocValues.BYTES_*_SORTED

Key: LUCENE-3729
URL: https://issues.apache.org/jira/browse/LUCENE-3729
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
[jira] [Created] (LUCENE-3725) Add optional packing to FST building
Add optional packing to FST building

Key: LUCENE-3725
URL: https://issues.apache.org/jira/browse/LUCENE-3725
Project: Lucene - Java
Issue Type: Improvement
Components: core/FSTs
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

The FSTs produced by Builder can be further shrunk if you are willing to spend highish transient RAM to do so... our Builder today tries hard not to use much RAM (and has options to tweak down the RAM usage, in exchange for somewhat larger FST), even when building immense FSTs. But for apps that can afford highish transient RAM to get a smaller net FST, I think we should offer packing.
[jira] [Created] (LUCENE-3685) Add top-down version of BlockJoinQuery
Add top-down version of BlockJoinQuery

Key: LUCENE-3685
URL: https://issues.apache.org/jira/browse/LUCENE-3685
Project: Lucene - Java
Issue Type: Improvement
Components: modules/join
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.6, 4.0

Today, BlockJoinQuery can join from child docIDs up to parent docIDs. EG this works well for product (parent) + many SKUs (child) search. But the reverse, which BJQ cannot do, is also useful in some cases. EG say you index songs (child) within albums (parent), but you want to search and present by song not album while involving some fields from the album in the query. In this case you want to wrap a parent query (against album), joining down to the child document space.
[jira] [Created] (LUCENE-3684) Add offsets to postings (DPEnum)
Add offsets to postings (DPEnum)

Key: LUCENE-3684
URL: https://issues.apache.org/jira/browse/LUCENE-3684
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0

I think we should explore making start/end offsets a first-class attr in the postings APIs, and fixing the indexer to index them into postings. This will make term vector access cleaner (we now have to jump through hoops w/ non-first-class offset attr). It can also enable efficient highlighting without term vectors / reanalyzing, if the app indexes offsets into the postings.
[jira] [Created] (LUCENE-3681) FST.BYTE2 should save as fixed 2 byte not as vInt
FST.BYTE2 should save as fixed 2 byte not as vInt - Key: LUCENE-3681 URL: https://issues.apache.org/jira/browse/LUCENE-3681 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 We currently write BYTE1 as a single byte, but BYTE2/4 as vInt, which I think is confusing. Also, for the FST for the new Kuromoji analyzer (LUCENE-3305), writing as 2 bytes instead shrank the FST and ran faster, presumably because more values were >= 16384 than were < 128. Separately, the whole INPUT_TYPE is very confusing... really all it's doing is declaring the allowed range of the characters of the input alphabet, and then the only thing that uses that is the write/readLabel methods (well, and some confusing sugar methods in Builder!). Not sure how to fix that yet... It's a simple change but it changes the FST binary format, so any users w/ FSTs out there will have to rebuild (FST is marked experimental...).
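The size tradeoff follows directly from the vInt format (7 payload bits per byte: 1 byte below 128, 2 bytes below 16384, 3 bytes beyond). A small sketch with a hypothetical byte-counting helper:

```java
public class VIntSizeDemo {
  // Bytes a value takes as a vInt: 7 payload bits per byte, high bit = "more".
  static int vIntSize(int v) {
    int bytes = 1;
    while ((v & ~0x7F) != 0) { // more than 7 bits remaining?
      bytes++;
      v >>>= 7;
    }
    return bytes;
  }

  public static void main(String[] args) {
    System.out.println(vIntSize(100));   // 1 byte  (< 128)
    System.out.println(vIntSize(1000));  // 2 bytes (< 16384)
    System.out.println(vIntSize(40000)); // 3 bytes -- a fixed 2-byte label wins here
  }
}
```

So for label distributions skewed toward large values (as in the Kuromoji FST), a fixed 2-byte encoding is both smaller and cheaper to decode.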
[jira] [Created] (LUCENE-3679) Replace IndexReader.getFieldNames with IndexReader.getFieldInfos
Replace IndexReader.getFieldNames with IndexReader.getFieldInfos Key: LUCENE-3679 URL: https://issues.apache.org/jira/browse/LUCENE-3679 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0
[jira] [Created] (LUCENE-3658) NRTCachingDir has invalid asserts (if same file name is written twice)
NRTCachingDir has invalid asserts (if same file name is written twice) -- Key: LUCENE-3658 URL: https://issues.apache.org/jira/browse/LUCENE-3658 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3658.patch Normally Lucene is write-once (except for segments.gen file, which NRTCachingDir never caches), but in some tests (TestDoc, TestCrash) we can write the same file more than once. I don't think NRTCachingDir should have these asserts, and I think on createOutput it should remove any old file if present. I also found and fixed a possible concurrency issue (if more than one thread syncs at the same time; IndexWriter doesn't ever do this today but it has in the past).
[jira] [Created] (LUCENE-3639) Add test case support for shard searching
Add test case support for shard searching - Key: LUCENE-3639 URL: https://issues.apache.org/jira/browse/LUCENE-3639 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0, 3.5 New test case that helps stress test the APIs to support sharding
[jira] [Created] (LUCENE-3640) remove IndexSearcher.close
remove IndexSearcher.close -- Key: LUCENE-3640 URL: https://issues.apache.org/jira/browse/LUCENE-3640 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Now that IS is never heavy (since you have to pass in your own IR), IS.close is truly a no-op... I think we should remove it.
[jira] [Created] (LUCENE-3634) remove old static main methods in core
remove old static main methods in core -- Key: LUCENE-3634 URL: https://issues.apache.org/jira/browse/LUCENE-3634 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 3.6, 4.0 We have a few random static main methods that I think are very rarely used... we should remove them (IndexReader, UTF32ToUTF8, English). The IndexReader main lets you list / extract the sub-files from a CFS... I think we should move this to a new tool in contrib/misc.
[jira] [Created] (LUCENE-3618) FST suggester should allow saving to Directory (not just File)
FST suggester should allow saving to Directory (not just File) -- Key: LUCENE-3618 URL: https://issues.apache.org/jira/browse/LUCENE-3618 Project: Lucene - Java Issue Type: Improvement Components: modules/spellchecker Reporter: Michael McCandless Currently FSTCompletionLookup has a store method, taking File storeDir, which it treats as a directory and then saves the FST to file fst.bin inside there. I think we should also add a store method taking a Lucene Directory? Eg then I can store my suggest FST in a RAMDir.
[jira] [Created] (PYLUCENE-15) Add spellchecker JAR
Add spellchecker JAR Key: PYLUCENE-15 URL: https://issues.apache.org/jira/browse/PYLUCENE-15 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless 3.x's lucene/contrib/spellchecker has the spellchecker and suggest packages... would be nice to have PyLucene wrap these by default.
[jira] [Created] (PYLUCENE-14) Add PythonIndexDeletionPolicy so we can implement IndexDeletionPolicy in Python
Add PythonIndexDeletionPolicy so we can implement IndexDeletionPolicy in Python Key: PYLUCENE-14 URL: https://issues.apache.org/jira/browse/PYLUCENE-14 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless
[jira] [Created] (PYLUCENE-12) Add PythonReusableAnalyzerBase, so we can create analyzers in Python
Add PythonReusableAnalyzerBase, so we can create analyzers in Python Key: PYLUCENE-12 URL: https://issues.apache.org/jira/browse/PYLUCENE-12 Project: PyLucene Issue Type: Improvement Reporter: Michael McCandless Lucene now has a useful helper class, ReusableAnalyzerBase; you subclass it and override one method, to create an analyzer that provides reusableTokenStream impl. I think we should expose it in Python... patch is simple.
[jira] [Created] (LUCENE-3572) MultiIndexDocValues pretends it can merge sorted sources
MultiIndexDocValues pretends it can merge sorted sources Key: LUCENE-3572 URL: https://issues.apache.org/jira/browse/LUCENE-3572 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 4.0 Nightly build hit this failure: {noformat} ant test-core -Dtestcase=TestSort -Dtestmethod=testReverseSort -Dtests.seed=791b126576b0cfab:-48895c7243ecc5d0:743c683d1c9f7768 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=ISO8859-1 [junit] Testcase: testReverseSort(org.apache.lucene.search.TestSort): Caused an ERROR [junit] expected:[CEGIA] but was:[ACEGI] [junit] at org.apache.lucene.search.TestSort.assertMatches(TestSort.java:1248) [junit] at org.apache.lucene.search.TestSort.assertMatches(TestSort.java:1216) [junit] at org.apache.lucene.search.TestSort.testReverseSort(TestSort.java:759) [junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:523) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51) {noformat} It's happening in the test for reverse-sort of a string field with DocValues, when the test had gotten SlowMultiReaderWrapper. I committed a fix to the test to avoid testing this case, but we need a better fix to the underlying bug. MultiIndexDocValues cannot merge sorted sources (I think?), yet somehow it's pretending it can (in the above test, the three subs had BYTES_FIXED_SORTED type, and the TypePromoter happily claims to merge these to BYTES_FIXED_SORTED). I think MultiIndexDocValues should return null for the sorted source in this case?
[jira] [Created] (LUCENE-3575) Field names can be wrong for stored fields / term vectors after merging
Field names can be wrong for stored fields / term vectors after merging --- Key: LUCENE-3575 URL: https://issues.apache.org/jira/browse/LUCENE-3575 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 The good news is this bug only exists in trunk... the bad news is it's been here for some time (created by accident in LUCENE-2881). But the good news is it should strike fairly rarely. SegmentMerger sometimes incorrectly thinks it can bulk-copy TVs/stored fields when it cannot (because field numbers don't map to the same names across segments). I think it happens only with addIndexes, or indexes that have pre-trunk segments, and then SM falsely thinks it can bulk-merge only when the last field number has the same field name across segments.
[jira] [Created] (LUCENE-3564) rename IndexWriter.rollback to .rollbackAndClose
rename IndexWriter.rollback to .rollbackAndClose Key: LUCENE-3564 URL: https://issues.apache.org/jira/browse/LUCENE-3564 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from LUCENE-3454, where Shai noticed that rollback is trappy since it [unexpectedly] closes the IW. I think we should rename it to rollbackAndClose.
[jira] [Created] (LUCENE-3562) Stop storing TermsEnum in CloseableThreadLocal inside Terms instance
Stop storing TermsEnum in CloseableThreadLocal inside Terms instance Key: LUCENE-3562 URL: https://issues.apache.org/jira/browse/LUCENE-3562 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 We have sugar methods in Terms.java (docFreq, totalTermFreq, docs, docsAndPositions) that use a saved thread-private TermsEnum to do the lookups. But on apps that send many threads through Lucene, and/or have many segments, this can add up to a lot of RAM, especially if the codecs impl holds onto stuff. Also, Terms has a close method (closes the CloseableThreadLocal) which must be called, but we fail to do so in some places. These saved enums are the cause of the recent OOME in TestNRTManager (TestNRTManager.testNRTManager -seed 2aa27e1aec20c4a2:-4a5a5ecf46837d0e:-7c4f651f1f0b75d7 -mult 3 -nightly). Really sharing these enums is a holdover from before Lucene queries would share state (ie, save the TermState from the first pass, and use it later to pull enums, get docFreq, etc.). It's not helpful anymore, and it can use gobs of RAM, so I'd like to remove it.
[jira] [Created] (LUCENE-3539) IndexFormatTooOld/NewExc should try to include fileName + directory when possible
IndexFormatTooOld/NewExc should try to include fileName + directory when possible - Key: LUCENE-3539 URL: https://issues.apache.org/jira/browse/LUCENE-3539 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 (Spinoff from http://markmail.org/thread/t6s7nn3ve765nojc ) When we throw a too old/new exc we should try to include the full path to the offending file, if possible.
[jira] [Created] (LUCENE-3524) Add direct PackedInts.Reader impl, that reads directly from disk on each get
Add direct PackedInts.Reader impl, that reads directly from disk on each get -- Key: LUCENE-3524 URL: https://issues.apache.org/jira/browse/LUCENE-3524 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Spinoff from LUCENE-3518. If we had a direct PackedInts.Reader impl we could use that instead of the RandomAccessReaderIterator.
[jira] [Created] (LUCENE-3520) If the NRT reader hasn't changed then IndexReader.openIfChanged should return null
If the NRT reader hasn't changed then IndexReader.openIfChanged should return null -- Key: LUCENE-3520 URL: https://issues.apache.org/jira/browse/LUCENE-3520 Project: Lucene - Java Issue Type: Bug Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 I hit a failure in TestSearcherManager (NOTE: doesn't always fail): {noformat} ant test -Dtestcase=TestSearcherManager -Dtestmethod=testSearcherManager -Dtests.seed=459ac99a4256789c:-29b8a7f52497c3b4:145ae632ae9e1ecf {noformat} It was tripping the assert inside SearcherLifetimeManager.record, because two different IndexSearcher instances had different IR instances sharing the same version. This was happening because IW.getReader always returns a new reader even when there are no changes. I think we should fix that... Separately I found a deadlock in TestSearcherManager.testIntermediateClose, if the test gets SerialMergeScheduler and needs to merge during the second commit.
[jira] [Created] (LUCENE-3518) Add sort-by-term with DocValues
Add sort-by-term with DocValues --- Key: LUCENE-3518 URL: https://issues.apache.org/jira/browse/LUCENE-3518 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 There are two sorted byte[] types with DocValues (BYTES_VAR_SORTED, BYTES_FIXED_SORTED), so you can index this type, but you can't yet sort by it. So I added a FieldComparator just like TermOrdValComparator, except it pulls from the doc values instead. There are some small diffs, eg with doc values there are never null values (see LUCENE-3504).
[jira] [Created] (LUCENE-3519) BlockJoinCollector only allows retrieving groups for only one BlockJoinQuery
BlockJoinCollector only allows retrieving groups for only one BlockJoinQuery Key: LUCENE-3519 URL: https://issues.apache.org/jira/browse/LUCENE-3519 Project: Lucene - Java Issue Type: Bug Components: modules/join Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from Mark Harwood's email (subject BlockJoin concerns) to dev list. It's fine to use multiple nested joins in a single query, and BlockJoinCollector should let you retrieve the top groups for all of them. But currently it always returns null after the first query's groups have been retrieved, because of a silly bug.
[jira] [Created] (LUCENE-3515) Possible slowdown of indexing/merging on 3.x vs trunk
Possible slowdown of indexing/merging on 3.x vs trunk - Key: LUCENE-3515 URL: https://issues.apache.org/jira/browse/LUCENE-3515 Project: Lucene - Java Issue Type: Bug Components: core/index Reporter: Michael McCandless Fix For: 3.5, 4.0 Opening an issue to pursue the possible slowdown Marc Sturlese uncovered.
[jira] [Created] (LUCENE-3510) BooleanScorer should not limit number of prohibited clauses
BooleanScorer should not limit number of prohibited clauses --- Key: LUCENE-3510 URL: https://issues.apache.org/jira/browse/LUCENE-3510 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Today it's limited to 32, because it uses a separate bit in the mask for each clause. But I don't understand why it does this; I think all prohibited clauses can share a single boolean/bit? Any match on a prohibited clause sets this bit and the doc is not collected; we don't need each prohibited clause to have a dedicated bit? We also use the mask for required clauses, but this code is now commented out (we always use BS2 if there are any required clauses); if we re-enable this code (and I think we should, at least in certain cases: I suspect it'd be faster than BS2 in many cases), I think we can cutover to an int count instead of bit masks, and then have no limit on the required clauses sent to BooleanScorer also. Separately I cleaned a few things up about BooleanScorer: all of the embedded scorer methods (nextDoc, docID, advance, score) now throw UOE; pre-allocate the buckets instead of doing it lazily per-sub-collect.
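The proposed change can be sketched in isolation: instead of a dedicated mask bit per prohibited clause (hence the 32-clause limit), every prohibited clause sets the same flag, since a doc is rejected as soon as any of them matches. The clause sets below are hypothetical stand-ins for real scorers:

```java
import java.util.Set;

public class SharedProhibitedBit {
  static final int PROHIBITED = 1; // one shared bit, regardless of clause count

  // Returns true if the doc should be collected, ie no prohibited clause matched.
  static boolean collect(int doc, Set<Integer>[] prohibitedClauses) {
    int bits = 0;
    for (Set<Integer> clause : prohibitedClauses) {
      if (clause.contains(doc)) bits |= PROHIBITED; // any match sets the same bit
    }
    return (bits & PROHIBITED) == 0;
  }

  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    Set<Integer>[] clauses = new Set[] {Set.of(1, 2), Set.of(7)};
    System.out.println(collect(3, clauses)); // true: no prohibited clause matches doc 3
    System.out.println(collect(7, clauses)); // false: doc 7 is rejected
  }
}
```

Since per-clause identity is never needed for rejection, this removes the 32-clause ceiling entirely.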
[jira] [Created] (LUCENE-3502) Packed ints: move .getArray into Reader API
Packed ints: move .getArray into Reader API --- Key: LUCENE-3502 URL: https://issues.apache.org/jira/browse/LUCENE-3502 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 This is a simple code cleanup... it's messy that a consumer of PackedInts.Reader must check whether the impl is Direct8/16/32/64 in order to get an array; it's better to move up the .getArray into the Reader interface and then make the DirectN impls package private.
[jira] [Created] (LUCENE-3503) DisjunctionSumScorer gives slightly (float iotas) different scores when you .nextDoc vs .advance
DisjunctionSumScorer gives slightly (float iotas) different scores when you .nextDoc vs .advance Key: LUCENE-3503 URL: https://issues.apache.org/jira/browse/LUCENE-3503 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Attachments: LUCENE-3503.patch Spinoff from LUCENE-1536. I dug into why we hit a score diff when using luceneutil to benchmark the patch. At first I thought it was BS1/BS2 difference, but because of a bug in the patch it was still using BS2 (but should be BS1) -- Robert's last patch fixes that. But it's actually a diff in BS2 itself, whether you next or advance through the docs. It's because DisjunctionSumScorer, when summing the float scores for a given doc that matches multiple sub-scorers, might sum in a different order, when you had .nextDoc'd to that doc than when you had .advance'd to it. This in turn is because the PQ used by that scorer (ScorerDocQueue) makes no effort to break ties. So, when the top N scorers are on the same doc, the PQ doesn't care what order they are in. Fixing ScorerDocQueue to break ties will likely be a non-trivial perf hit, though, so I'm not sure whether we should do anything here...
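That float sums depend on evaluation order is easy to demonstrate in isolation: float addition is not associative once the summands' magnitudes differ enough, which is exactly why the queue's pop order leaks into the score:

```java
public class FloatOrderDemo {
  public static void main(String[] args) {
    float big = 16777216f;     // 2^24: the point where 1.0f falls below one ulp
    float a = (big + 1f) + 1f; // each 1f is rounded away separately
    float b = big + (1f + 1f); // the combined 2f survives
    System.out.println(a);       // 1.6777216E7
    System.out.println(b);       // 1.6777218E7
    System.out.println(a == b);  // false: same operands, different grouping
  }
}
```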
[jira] [Created] (LUCENE-3504) DocValues: deref/sorted bytes types shouldn't return empty byte[] when doc didn't have a value
DocValues: deref/sorted bytes types shouldn't return empty byte[] when doc didn't have a value -- Key: LUCENE-3504 URL: https://issues.apache.org/jira/browse/LUCENE-3504 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 I'm looking at making a FieldComparator that uses DV's SortedSource to sort by string field (ie just like TermOrdValComparator, except using DV instead of FieldCache). We already have comparators for DV int and float fields. But one thing I noticed is we can't detect documents that didn't have any value indexed vs documents that had empty byte[] indexed. This is easy to fix (and we used to do this): because these types are deref'd (ie, each doc stores an address, and then separately looks up the byte[] at that address), we can reserve ord/address 0 to mean the doc didn't have the field. Then we should return null when you retrieve the BytesRef value for that field.
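A toy sketch of the reserved-address idea (the arrays here are hypothetical stand-ins for the real DocValues storage):

```java
public class ReservedOrdDemo {
  // values[0] is the reserved "missing" slot; real values start at address 1.
  static final byte[][] values = {null, "apple".getBytes(), "banana".getBytes()};
  // Per-doc addresses into the values table: doc 1 had no value indexed.
  static final int[] docToAddress = {1, 0, 2};

  static byte[] get(int doc) {
    return values[docToAddress[doc]]; // null when the doc had no value
  }

  public static void main(String[] args) {
    System.out.println(new String(get(0))); // apple
    System.out.println(get(1));             // null: distinguishable from empty byte[]
  }
}
```

Because every doc already pays for an address, reserving address 0 costs nothing and makes "missing" unambiguous.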
[jira] [Created] (LUCENE-3486) Add SearcherLifetimeManager, so you can retrieve the same searcher you previously used
Add SearcherLifetimeManager, so you can retrieve the same searcher you previously used -- Key: LUCENE-3486 URL: https://issues.apache.org/jira/browse/LUCENE-3486 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Attachments: LUCENE-3486.patch The idea is similar to SOLR-2809 (adding searcher leases to Solr). This utility class sits above whatever your source is for the current searcher (eg NRTManager, SearcherManager, etc.), and records (holds a reference to) each searcher in recent history. The idea is to ensure that when a user does a follow-on action (clicks next page, drills down/up), or when two or more searcher invocations within a single user search need to happen against the same searcher (eg in distributed search), you can retrieve the same searcher you used last time. I think with the new searchAfter API (LUCENE-2215), doing follow-on searches on the same searcher is more important, since the bottom (score/docID) held for that API can easily shift when a new searcher is opened. When you do a new search, you record the searcher you used with the manager, and it returns to you a long token (currently just the IR.getVersion()), which you can later use to retrieve the same searcher. Separately you must periodically call prune(), to prune the old searchers, ideally from the same thread / at the same time that you open a new searcher.
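The record/acquire/prune lifecycle described above can be sketched with a plain map keyed by the version token; the real class manages actual IndexSearcher instances and reference counting, so this is only a shape sketch with simplified names:

```java
import java.util.HashMap;
import java.util.Map;

public class LifetimeSketch {
  record Searcher(long version) {} // stand-in for an IndexSearcher

  private final Map<Long, Searcher> byVersion = new HashMap<>();

  long record(Searcher s) {       // called after each search
    byVersion.put(s.version(), s);
    return s.version();           // token the client holds onto (eg in session state)
  }

  Searcher acquire(long token) {  // follow-on page: same searcher, or null if pruned
    return byVersion.get(token);
  }

  void prune(long minVersionToKeep) { // called periodically to drop old searchers
    byVersion.values().removeIf(s -> s.version() < minVersionToKeep);
  }

  public static void main(String[] args) {
    LifetimeSketch mgr = new LifetimeSketch();
    long token = mgr.record(new Searcher(41));
    mgr.record(new Searcher(42));              // a refresh opened a newer searcher
    System.out.println(mgr.acquire(token).version()); // 41: page 2 sees page 1's view
    mgr.prune(42);
    System.out.println(mgr.acquire(token));    // null: too old, client must re-search
  }
}
```

A null from acquire is the signal that the lease expired and the follow-on action must fall back to the current searcher.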
[jira] [Created] (SOLR-2807) Upgrade to Tika 0.10
Upgrade to Tika 0.10 Key: SOLR-2807 URL: https://issues.apache.org/jira/browse/SOLR-2807 Project: Solr Issue Type: Improvement Reporter: Michael McCandless Tika 0.10 was recently released... seems like we should upgrade?
[jira] [Created] (LUCENE-3477) Fix JFlex tokenizer compiler warnings
Fix JFlex tokenizer compiler warnings - Key: LUCENE-3477 URL: https://issues.apache.org/jira/browse/LUCENE-3477 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3477.patch We get lots of distracting fallthrough warnings running ant compile in modules/analysis, from the tokenizers generated from JFlex. Digging a bit, they actually do look spooky. So I managed to edit the JFlex inputs to insert a bunch of break statements in our rules, but I have no idea if this is right/dangerous, and it seems a bit weird having to do such insertions of naked breaks. But, this does fix all the warnings, and all tests pass...
[jira] [Created] (LUCENE-3478) TestSimpleExplanations failure
TestSimpleExplanations failure

Key: LUCENE-3478
URL: https://issues.apache.org/jira/browse/LUCENE-3478
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Reporter: Michael McCandless
Fix For: 4.0

{noformat}
ant test -Dtestcase=TestSimpleExplanations -Dtestmethod=testDMQ8 -Dtests.seed=144152895b276837:eb7ba4953db943f:33373b79a971db02
{noformat}

fails w/ this on current trunk... looks like a silly floating point precision issue:

{noformat}
[junit] Testsuite: org.apache.lucene.search.TestSimpleExplanations
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] 145085.95 = (MATCH) weight(field:xx^10.0 in 2) [DefaultSimilarity], result of:
[junit] 145085.95 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 128768.2 = queryWeight, product of:
[junit] 10.0 = boost
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] expected:145086.66 but was:145086.69)
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.544 sec
[junit]
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestSimpleExplanations -Dtestmethod=testDMQ8 -Dtests.seed=144152895b276837:eb7ba4953db943f:33373b79a971db02
[junit] NOTE: test params are: codec=PreFlex, sim=RandomSimilarityProvider(queryNorm=false,coord=false): {field=DefaultSimilarity, alt=DFR I(ne)LZ(0.3), KEY=IB LL-D2}, locale=en_IN, timezone=Pacific/Samoa
[junit] NOTE: all tests run in this JVM:
[junit] [TestSimpleExplanations]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=130426744,total=189988864
[junit] - ---
[junit] Testcase: testDMQ8(org.apache.lucene.search.TestSimpleExplanations): FAILED
[junit] ((field:yy field:w5^100.0) | field:xx^10.0)~0.5: score(doc=2)=145086.66 != explanationScore=145086.69 Explanation: 145086.69 = (MATCH) max plus 0.5 times others of:
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] 145085.95 = (MATCH) weight(field:xx^10.0 in 2) [DefaultSimilarity], result of:
[junit] 145085.95 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 128768.2 = queryWeight, product of:
[junit] 10.0 = boost
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit] 1.1267219 = fieldWeight in 2, product of:
[junit] 1.0 = tf(freq=1.0), with freq of:
[junit] 1.0 = termFreq=1
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 0.875 = fieldNorm(doc=2)
[junit] expected:145086.66 but was:145086.69
[junit] junit.framework.AssertionFailedError: ((field:yy field:w5^100.0) | field:xx^10.0)~0.5: score(doc=2)=145086.66 != explanationScore=145086.69 Explanation: 145086.69 = (MATCH) max plus 0.5 times others of:
[junit] 1.4508595 = (MATCH) sum of:
[junit] 1.4508595 = (MATCH) weight(field:yy in 2) [DefaultSimilarity], result of:
[junit] 1.4508595 = score(doc=2,freq=1.0 = termFreq=1
[junit] ), product of:
[junit] 1.287682 = queryWeight, product of:
[junit] 1.287682 = idf(docFreq=2, maxDocs=4)
[junit] 1.0 = queryNorm
[junit]
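The size of the discrepancy (145086.66 vs. 145086.69) is consistent with single-precision rounding: near 145086, one float ulp is 2^-6 = 0.015625, so mathematically equivalent computations done in different orders can disagree in the second decimal place. A minimal sketch of how little resolution float has at that magnitude (plain Java, not Lucene code; the values merely echo the failing score):

```java
public class FloatUlpDemo {
    public static void main(String[] args) {
        // Near the failing score (~145086) float has a 24-bit significand,
        // so one ulp is 2^-6 = 0.015625; anything below ~0.008 rounds away.
        float score = 145085.95f;
        System.out.println(Math.ulp(score)); // prints 0.015625

        // Adding 0.001f a thousand times changes nothing: each addend is
        // under half an ulp, so every addition rounds back to the old value.
        float accumulated = score;
        for (int i = 0; i < 1000; i++) {
            accumulated += 0.001f;
        }
        System.out.println(accumulated == score); // prints true

        // Accumulating in double first preserves the ~1.0 total contribution.
        double precise = score;
        for (int i = 0; i < 1000; i++) {
            precise += 0.001f;
        }
        System.out.println((float) precise > score); // prints true
    }
}
```

This is why score-vs-explain comparisons at large boosts need an epsilon proportional to the score, not an absolute one.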
[jira] [Created] (LUCENE-3479) TestGrouping failure
TestGrouping failure

Key: LUCENE-3479
URL: https://issues.apache.org/jira/browse/LUCENE-3479
Project: Lucene - Java
Issue Type: Bug
Components: modules/grouping
Reporter: Michael McCandless
Assignee: Michael McCandless

{noformat}
ant test -Dtestcase=TestGrouping -Dtestmethod=testRandom -Dtests.seed=295cdb78b4a442d4:-4c5d64ef4d698c27:-425d4c1eb87211ba
{noformat}

fails with this on current trunk:

{noformat}
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestGrouping -Dtestmethod=testRandom -Dtests.seed=295cdb78b4a442d4:-4c5d64ef4d698c27:-425d4c1eb87211ba
[junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, content=MockSep, sort2=SimpleText, groupend=Pulsing(freqCutoff=3 minBlockSize=65 maxBlockSize=132), sort1=Memory, group=Memory}, sim=RandomSimilarityProvider(queryNorm=true,coord=false): {id=DFR I(F)L2, content=DFR BeL3(800.0), sort2=DFR GL3(800.0), groupend=DFR G2, sort1=DFR GB3(800.0), group=LM Jelinek-Mercer(0.70)}, locale=zh_TW, timezone=America/Indiana/Indianapolis
[junit] NOTE: all tests run in this JVM:
[junit] [TestGrouping]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=143246344,total=281804800
[junit] - ---
[junit] Testcase: testRandom(org.apache.lucene.search.grouping.TestGrouping): FAILED
[junit] expected:11 but was:7
[junit] junit.framework.AssertionFailedError: expected:11 but was:7
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:148)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
[junit] at org.apache.lucene.search.grouping.TestGrouping.assertEquals(TestGrouping.java:980)
[junit] at org.apache.lucene.search.grouping.TestGrouping.testRandom(TestGrouping.java:865)
[junit] at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:611)
[junit]
[junit]
{noformat}

I dug for a while... the test is a bit sneaky because it compares docs sorted by score across two indexes. Index #1 has no deletions; index #2 has the same docs, but organized into doc blocks by group, and has some deletions. In theory (I think) even though the deletions will cause scores to differ across the two indices, they should not alter the sort order of the docs. Here is the explain output of the docs that sorted differently:

{noformat}
#1: top hit in the has-deletes doc-block index (id=239):
explain: 2.394486 = (MATCH) weight(content:real1 in 292) [DFRSimilarity], result of:
2.394486 = score(DFRSimilarity, doc=292, freq=1.0), computed from:
1.0 = termFreq=1
41.944084 = NormalizationH3, computed from:
1.0 = tf
5.3102274 = avgFieldLength
2.56 = len
102.829 = BasicModelBE, computed from:
41.944084 = tfn
880.0 = numberOfDocuments
239.0 = totalTermFreq
0.023286095 = AfterEffectL, computed from:
41.944084 = tfn

#2: hit in the no-deletes normal index (id=229)
ID=229 explain=2.382285 = (MATCH) weight(content:real1 in 225) [DFRSimilarity], result of:
2.382285 = score(DFRSimilarity, doc=225, freq=1.0), computed from:
1.0 = termFreq=1
41.765594 = NormalizationH3, computed from:
1.0 = tf
5.3218827 = avgFieldLength
10.24 = len
101.879845 = BasicModelBE, computed from:
41.765594 = tfn
786.0 = numberOfDocuments
215.0 = totalTermFreq
0.023383282 = AfterEffectL, computed from:
41.765594 = tfn

Then I went and called explain on the no-deletes normal index for the top doc (id=239):
explain: 2.3822558 = (MATCH) weight(content:real1 in 17) [DFRSimilarity], result of:
2.3822558 = score(DFRSimilarity, doc=17, freq=1.0), computed from:
1.0 = termFreq=1
42.165264 = NormalizationH3, computed from:
1.0 = tf
5.3218827 = avgFieldLength
2.56 = len
102.8307 = BasicModelBE, computed from:
42.165264 = tfn
786.0 = numberOfDocuments
215.0 = totalTermFreq
0.023166776 = AfterEffectL, computed from:
42.165264 = tfn
{noformat}

-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
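One detail worth noting in the explain output above: in the no-deletes index, id=229 (2.382285) edges out id=239 (2.3822558) by only ~3e-5, while the has-deletes index's shifted collection statistics (numberOfDocuments 880 vs. 786, avgFieldLength 5.3102274 vs. 5.3218827) put id=239 on top. A tiny sketch of how such a near-tie flips under a small stat-driven change (plain Java; the two scores are copied from the explains, the perturbation size is hypothetical):

```java
public class NearTieDemo {
    public static void main(String[] args) {
        // Scores taken from the explain output of the no-deletes index.
        double scoreId229 = 2.382285;
        double scoreId239 = 2.3822558;
        System.out.println(scoreId229 > scoreId239); // prints true: id=229 ranks first

        // A relative shift of ~1e-4, roughly the size of the effect the
        // changed collection statistics have, flips the ordering.
        double shifted239 = scoreId239 * (1 + 1e-4); // hypothetical stat-driven shift
        System.out.println(shifted239 > scoreId229); // prints true: id=239 now ranks first
    }
}
```

So even if deletions "should" preserve relative order, docs whose scores differ by less than the stat-induced perturbation can legitimately swap ranks between the two indexes.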
[jira] [Created] (LUCENE-3465) IndexSearcher fails to pass docBase to Collector when using ExecutorService
IndexSearcher fails to pass docBase to Collector when using ExecutorService

Key: LUCENE-3465
URL: https://issues.apache.org/jira/browse/LUCENE-3465
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.5

This bug is causing the failure in TestSearchAfter. When you use ExecutorService with IndexSearcher, we now always pass docBase 0 to the Collector. This doesn't affect trunk (AtomicReaderContext carries the right docBase); only 3.x.
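To see why a wrong docBase matters (a toy model of the per-segment collection protocol, not the actual Lucene 3.x API): a collector receives segment-local doc ids plus the segment's docBase (via setNextReader in 3.x), and adds the two to recover index-wide ids. If the searcher always hands it 0, every segment's local ids collide. The segment sizes below are hypothetical:

```java
public class DocBaseDemo {
    // Hypothetical index with three segments of these sizes.
    static final int[] SEGMENT_SIZES = {100, 50, 25};

    // A collector maps a segment-local docid to a global docid by
    // adding the docBase it was handed for that segment.
    static int globalId(int docBase, int localDoc) {
        return docBase + localDoc;
    }

    // The correct docBase is the sum of all earlier segments' maxDocs.
    static int docBaseOf(int segment) {
        int base = 0;
        for (int s = 0; s < segment; s++) base += SEGMENT_SIZES[s];
        return base;
    }

    public static void main(String[] args) {
        // Correct: local doc 7 in segment 2 is global doc 150 + 7.
        System.out.println(globalId(docBaseOf(2), 7)); // prints 157

        // The bug: the ExecutorService path always passes docBase 0,
        // so doc 7 of every segment collapses onto "global" id 7.
        System.out.println(globalId(0, 7)); // prints 7
    }
}
```

Any collector that records global ids (e.g. to fetch stored fields later, or to dedupe hits across segments) silently returns wrong documents under the bug, which is exactly the kind of mismatch TestSearchAfter trips on.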