[jira] [Commented] (LUCENE-3997) join module should not depend on grouping module
[ https://issues.apache.org/jira/browse/LUCENE-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257191#comment-13257191 ]

Michael McCandless commented on LUCENE-3997:
--------------------------------------------

+1

> join module should not depend on grouping module
> ------------------------------------------------
>
> Key: LUCENE-3997
> URL: https://issues.apache.org/jira/browse/LUCENE-3997
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 4.0
> Reporter: Robert Muir
> Fix For: 4.0
> Attachments: LUCENE-3997.patch, LUCENE-3997.patch
>
> I think TopGroups/GroupDocs should simply be in core? Both the grouping and
> join modules use these trivial classes, but join depends on grouping just for
> them. I think it's better that we try to minimize these inter-module
> dependencies. Of course, another option is to combine grouping and join into
> one module, but last time I brought that up nobody could agree on a name.
> Anyway, I think the change is pretty clean: it's similar to having basic stuff
> like Analyzer.java in core, so other things can work with Analyzer without
> depending on any specific implementing modules.
[jira] [Commented] (LUCENE-3972) Improve AllGroupsCollector implementations
[ https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252315#comment-13252315 ]

Michael McCandless commented on LUCENE-3972:
--------------------------------------------

Curious that it's so much faster ... BytesRefHash operates on the byte[] term
while the current approach operates on int ord. How large was the index? If it
was smallish, maybe the time was dominated by re-ord'ing after each reader...?

> Improve AllGroupsCollector implementations
> ------------------------------------------
>
> Key: LUCENE-3972
> URL: https://issues.apache.org/jira/browse/LUCENE-3972
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/grouping
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3972.patch
>
> I think that the performance of TermAllGroupsCollector, DVAllGroupsCollector.BR
> and DVAllGroupsCollector.SortedBR can be improved by using BytesRefHash to
> store the groups instead of an ArrayList.
[jira] [Commented] (LUCENE-3972) Improve AllGroupsCollector implementations
[ https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252518#comment-13252518 ]

Michael McCandless commented on LUCENE-3972:
--------------------------------------------

Actually, we are storing term ords here, not docIDs. I think the high number of
unique groups explains why the new patch is faster: the time is likely dominated
by re-ord'ing for each segment? If you have fewer unique groups (and as the
number of docs collected goes up), I think the current impl should be
faster...?

> Improve AllGroupsCollector implementations
> ------------------------------------------
>
> Key: LUCENE-3972
> URL: https://issues.apache.org/jira/browse/LUCENE-3972
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/grouping
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3972.patch, LUCENE-3972.patch
>
> I think that the performance of TermAllGroupsCollector, DVAllGroupsCollector.BR
> and DVAllGroupsCollector.SortedBR can be improved by using BytesRefHash to
> store the groups instead of an ArrayList.
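A minimal sketch of the BytesRefHash approach under discussion (assuming Lucene 4.x's org.apache.lucene.util.BytesRefHash API; the surrounding collector class is hypothetical, not the actual patch):

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

// Hypothetical collector fragment: dedupe group values by their raw
// term bytes.  Because BytesRefHash hashes the bytes directly, nothing
// needs to be re-ord'ed when collection advances to the next segment.
class GroupValueSketch {
  private final BytesRefHash groups = new BytesRefHash();

  void collectGroup(BytesRef groupValue) {
    // add() copies the bytes and returns the new ord, or (-ord - 1) if
    // the value was already present -- either way it is stored once.
    groups.add(groupValue);
  }

  int uniqueGroupCount() {
    return groups.size();
  }
}
{code}

The ord-based approach instead compares small ints within a segment but must rebuild its ord mapping at every segment boundary, which is why it should win only when there are few unique groups and many collected docs.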
[jira] [Commented] (LUCENE-3970) Rename getUnique[Field/Terms]Count() into size()
[ https://issues.apache.org/jira/browse/LUCENE-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251004#comment-13251004 ]

Michael McCandless commented on LUCENE-3970:
--------------------------------------------

Thanks Iulius, this looks good ... I'll commit shortly.

> Rename getUnique[Field/Terms]Count() into size()
> ------------------------------------------------
>
> Key: LUCENE-3970
> URL: https://issues.apache.org/jira/browse/LUCENE-3970
> Project: Lucene - Java
> Issue Type: Task
> Components: core/index
> Reporter: Iulius Curt
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3970.patch
>
> Like Robert Muir said in LUCENE-3109:
> {quote}Also I think there are other improvements we can do here that would be
> more natural:
> Fields.getUniqueFieldCount() -> Fields.size()
> Terms.getUniqueTermCount() -> Terms.size(){quote}
> I believe this dramatically improves understandability (way less 'scary',
> actually beautiful).
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249512#comment-13249512 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Thanks Iulius, looks great! I'll commit...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3963) improve smoketester to work on windows
[ https://issues.apache.org/jira/browse/LUCENE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249516#comment-13249516 ]

Michael McCandless commented on LUCENE-3963:
--------------------------------------------

+1

> improve smoketester to work on windows
> ---------------------------------------
>
> Key: LUCENE-3963
> URL: https://issues.apache.org/jira/browse/LUCENE-3963
> Project: Lucene - Java
> Issue Type: Task
> Reporter: Robert Muir
> Attachments: LUCENE-3963.patch
>
> After the changes in SOLR-3331, the smoketester won't work on Windows (things
> like path separators of : or ;). Not really critical; people will just have to
> smoketest on unix-like machines. But it would be more convenient for testers
> on Windows machines if it worked there too.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249520#comment-13249520 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

bq. We need to change CHANGES.txt and MIGRATE.txt to the new API, it's now heavily outdated.

Thanks Uwe, you're right, my bad.

bq. Should we change AtomicReader to have invertedField() instead of fields()?

+1

bq. Also the name FieldsEnum is now inconsistent. I think it should be InvertedFieldsEnum?

Iulius, do you want to make these changes? Or I can... let me know.

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249546#comment-13249546 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

OK I'll revert so we can discuss more...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3967) nuke AtomicReader.termDocsEnum(termState) and termPositionsEnum(termState)
[ https://issues.apache.org/jira/browse/LUCENE-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249570#comment-13249570 ]

Michael McCandless commented on LUCENE-3967:
--------------------------------------------

+1

> nuke AtomicReader.termDocsEnum(termState) and termPositionsEnum(termState)
> ---------------------------------------------------------------------------
>
> Key: LUCENE-3967
> URL: https://issues.apache.org/jira/browse/LUCENE-3967
> Project: Lucene - Java
> Issue Type: Task
> Reporter: Robert Muir
> Attachments: LUCENE-3967.patch
>
> These are simply sugar methods anyway, and so expert that I don't think we
> need sugar here at all. If someone wants to get DocsEnum via a saved TermState
> they can just use TermsEnum! But having these public in AtomicReader I think
> is pretty confusing and overwhelming. In fact, nothing in Lucene even uses
> these methods, except a sole assert statement in PhraseQuery, which I think
> can be written more clearly anyway:
> {noformat}
>       // PhraseQuery on a field that did not index
>       // positions.
>       if (postingsEnum == null) {
> -       assert reader.termDocsEnum(liveDocs, t.field(), t.bytes(), state, false) != null: "termstate found but no term exists in reader";
> +       assert te.seekExact(t.bytes(), false) : "termstate found but no term exists in reader";
> {noformat}
[jira] [Commented] (LUCENE-3965) consolidate all api modules in one place and un!@$# packaging for 4.0
[ https://issues.apache.org/jira/browse/LUCENE-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249284#comment-13249284 ]

Michael McCandless commented on LUCENE-3965:
--------------------------------------------

+1 to moving/merging modules/* and lucene/contrib/* under lucene. This is much
cleaner.

> consolidate all api modules in one place and un!@$# packaging for 4.0
> ----------------------------------------------------------------------
>
> Key: LUCENE-3965
> URL: https://issues.apache.org/jira/browse/LUCENE-3965
> Project: Lucene - Java
> Issue Type: Task
> Components: general/build
> Affects Versions: 4.0
> Reporter: Robert Muir
>
> I think users get confused about how svn/source is structured, when in fact we
> are just producing a modular build. I think it would be more clear if the
> lucene stuff was underneath modules/; that's where our modular API is. We
> could still package this up as lucene.tar.gz if we want, and even name
> modules/core lucene-core.jar, but I think this would be a lot better organized
> than the current confusion of:
> * lucene
> * lucene/contrib
> * modules
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249390#comment-13249390 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Thanks for the fast turnaround Iulius!

Did you use svn mv to rename the sources? (I'm guessing not -- I don't see the
removed original sources). But it's fine: I got this to apply quite easily.
Thanks! I'll commit shortly...

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249398#comment-13249398 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Hmm, one thing: I noticed the imports got changed into wildcards, eg:
{noformat}
+import org.apache.lucene.index.*;
 import org.apache.lucene.util.LuceneTestCase;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.TextField;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.TermsEnum;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.MultiFields;
+import org.apache.lucene.index.MultiInvertedFields;
{noformat}
In general I prefer seeing each import (not the wildcard)... can you redo the
patch putting them back? Thanks! (I'm assuming/hoping this is a simple setting
in your IDE?).

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch, LUCENE-3109.patch,
> LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (SOLR-3331) solr NOTICE.txt is missing information
[ https://issues.apache.org/jira/browse/SOLR-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248263#comment-13248263 ]

Michael McCandless commented on SOLR-3331:
------------------------------------------

I'll fix smoke tester... I already have a bunch of mods to add other checks to
it...

> solr NOTICE.txt is missing information
> ---------------------------------------
>
> Key: SOLR-3331
> URL: https://issues.apache.org/jira/browse/SOLR-3331
> Project: Solr
> Issue Type: Bug
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Priority: Blocker
> Fix For: 3.6
>
> Solr depends on some modules from lucene, and is released separately (as a
> source release including lucene), thus its NOTICE.txt has a lucene section
> which includes notices from lucene:
> {noformat}
> =========================================================================
> ==  Apache Lucene Notice                                               ==
> =========================================================================
> {noformat}
> However, it's missing the IPADIC (which is required to be there). Furthermore,
> there is no way to check this, except via manual inspection. This gets
> complicated in 4.0 because of modularization, but we need to fix the 3.6
> situation in order to release (hence, this issue is set to 3.6 only and we can
> open a separate issue for 4.0 and discuss things like modules there; it's
> irrelevant here). My proposal for *3.6* is:
> 1. add the IPADIC notice
> 2. have smoketester.py look for this specific block of text indicating the
>    notices from lucene, and cross check them to ensure everything is
>    consistent.
[jira] [Commented] (SOLR-3316) Distributed Grouping fails in some scenarios.
[ https://issues.apache.org/jira/browse/SOLR-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248343#comment-13248343 ]

Michael McCandless commented on SOLR-3316:
------------------------------------------

Patch looks good! I guess it's OK to make the hard change to the
EndResultTransformer interface... (it's marked @experimental).

> Distributed Grouping fails in some scenarios.
> ----------------------------------------------
>
> Key: SOLR-3316
> URL: https://issues.apache.org/jira/browse/SOLR-3316
> Project: Solr
> Issue Type: Bug
> Components: SearchComponents - other
> Affects Versions: 3.4, 3.5
> Environment: Windows 7, JDK 6u26
> Reporter: Cody Young
> Assignee: Martijn van Groningen
> Priority: Blocker
> Labels: distributed, grouping
> Fix For: 4.0
> Attachments: SOLR-3316-3x.patch, SOLR-3316-3x.patch, SOLR-3316.patch,
> TestDistributedGrouping.java.patch
>
> During a distributed grouping request, if rows is set to 0 a 500 error is
> returned. If groups are unique to a shard and the row count is set to 1, then
> the matches number is only the matches from one shard. I've put together a
> failing test.
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248604#comment-13248604 ]

Michael McCandless commented on LUCENE-3109:
--------------------------------------------

Hi Iulius, this patch is great: this rename is badly needed...

I was able to apply the patch (resolving a few conflicts since the code has
shifted since it was created), but... some things seem to be missing (eg the
InvertedFieldsProducer rename). How did you generate the patch?

> Rename FieldsConsumer to InvertedFieldsConsumer
> -----------------------------------------------
>
> Key: LUCENE-3109
> URL: https://issues.apache.org/jira/browse/LUCENE-3109
> Project: Lucene - Java
> Issue Type: Task
> Components: core/codecs
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3109.patch, LUCENE-3109.patch
>
> The name FieldsConsumer is misleading: here it really is an
> InvertedFieldsConsumer, and since we are extending codecs to consume
> non-inverted Fields we should be clear here. Same applies to Fields.java as
> well as FieldsProducer.
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247239#comment-13247239 ]

Michael McCandless commented on LUCENE-3932:
--------------------------------------------

OK I committed this to trunk (thanks Sean!).

{quote}
dataPagedBytes.getPointer() == 124973970
On disk the .tii file is 69508193 bytes
{quote}

OK, ~80% bigger... but in the overall index it's a minor increase (~0.1%). But
I think we should hold off on any more 3.x work until/unless we decide to do
another release off of it.

> Improve load time of .tii files
> --------------------------------
>
> Key: LUCENE-3932
> URL: https://issues.apache.org/jira/browse/LUCENE-3932
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.5
> Environment: Linux
> Reporter: Sean Bridges
> Attachments: LUCENE-3932.trunk.patch, perf.csv
>
> We have a large 50 gig index which is optimized as one segment, with a 66 MEG
> .tii file. This index has no norms, and no field cache. It takes about 5
> seconds to load this index; profiling reveals that 60% of the time is spent in
> GrowableWriter.set(index, value), and most of the time in set(...) is spent
> resizing PackedInts.Mutable current.
> In the constructor for TermInfosReaderIndex, you initialize the writer with
> the line:
> {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote}
> For our index, using four as the bit estimate results in 27 resizes. The last
> value in indexToTerms is going to be ~ tiiFileLength, and if instead you use:
> {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2));
> GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote}
> Load time improves to ~ 2 seconds.
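A minimal sketch of the committed idea (assuming the era's org.apache.lucene.util.packed APIs; the wrapping method and its arguments are illustrative, with tiiFileLength and indexSize supplied by the caller):

{code:java}
import org.apache.lucene.util.packed.GrowableWriter;
import org.apache.lucene.util.packed.PackedInts;

// Sketch: size the GrowableWriter from the largest value it will ever
// hold -- an offset bounded by the .tii file length -- instead of a
// fixed 4-bit guess, so loading the term index never triggers the
// repeated resizes that dominated the profile.
static GrowableWriter newIndexToTerms(long tiiFileLength, int indexSize) {
  int bitEstimate = PackedInts.bitsRequired(tiiFileLength);
  return new GrowableWriter(bitEstimate, indexSize, false);
}
{code}

PackedInts.bitsRequired does the same computation as the reporter's log10 arithmetic, without the floating-point detour.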
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246465#comment-13246465 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

Shawn, I'm not certain this is the same issue (it talks about an extra trailing
/ on ANT_HOME, but that didn't help me...), but it seems related:
https://bugzilla.redhat.com/show_bug.cgi?id=490542

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (LUCENE-3943) Use ivy cachepath and cachefileset instead of ivy retrieve
[ https://issues.apache.org/jira/browse/LUCENE-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245123#comment-13245123 ]

Michael McCandless commented on LUCENE-3943:
--------------------------------------------

Would this also mean ivy doesn't have to copy the JARs from its cache into each
checkout? So this will save disk space for devs w/ multiple checkouts...

> Use ivy cachepath and cachefileset instead of ivy retrieve
> -----------------------------------------------------------
>
> Key: LUCENE-3943
> URL: https://issues.apache.org/jira/browse/LUCENE-3943
> Project: Lucene - Java
> Issue Type: Improvement
> Components: general/build
> Reporter: Chris Male
>
> In LUCENE-3930 we moved to resolving all external dependencies using
> ivy:retrieve. This process places the dependencies into the lib/ folder of the
> respective modules, which was ideal since it replicated the existing build
> process and limited the number of changes to be made to the build. However it
> can lead to multiple jars for the same dependency in the lib folder when the
> dependency is upgraded, and just isn't the most efficient way to use Ivy.
> Uwe pointed out that we can remove the ivy:retrieve calls and make use of
> ivy:cachepath and ivy:cachefileset to build our classpaths and packages
> respectively, which will go some way to addressing these limitations.
[jira] [Commented] (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245125#comment-13245125 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

Is there anyone who can volunteer to be a mentor for this issue...?

> Refactoring of IndexWriter
> ---------------------------
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Labels: gsoc2011, gsoc2012, lucene-gsoc-11, lucene-gsoc-12, mentor
> Fix For: 4.0
>
> I've been thinking for a while about refactoring the IndexWriter into two main
> components.
> One could be called a SegmentWriter and, as the name says, its job would be to
> write one particular index segment. The default one, just as today, will
> provide methods to add documents and flushes when its buffer is full. Other
> SegmentWriter implementations would do things like e.g. appending or copying
> external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments file and
> merging/deleting segments. It would know about DeletionPolicy, MergePolicy and
> MergeScheduler. Ideally it would provide hooks that allow users to manage
> external data structures and keep them in sync with Lucene's data during
> segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part affects all
> segments, whereas the new document is only being added to the new segment.
> Of course these should be lower level APIs for things like parallel indexing
> and related use cases. That's why we should still provide easy to use APIs
> like today for people who don't need to care about per-segment ops during
> indexing. So the current IndexWriter could probably keep most of its APIs and
> delegate to the new classes.
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245343#comment-13245343 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

Hi Iulius,

The basic idea is to replace the fixed int[] that we now have (in
oal.index.MergeState's docMaps array) with a PackedInts store (see
oal.util.packed.PackedInts.getMutable). This should be fairly simple, since a
PackedInts store is conceptually just like an int[]. I think that (a rote swap)
would be phase one.

After that, we can save more RAM by storing either the new docID (what we do
today), or, inverting that, the number of del docs seen so far, depending on
which requires fewer bits. EG if we are merging 1M docs but only 100K are
deleted, it's cheaper to store the number of deletes...

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Minor
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
> We allocate this int[] to remap docIDs due to compaction of deleted ones. This
> uses a lot of RAM for large segment merges, and can fail to allocate due to
> fragmentation on 32 bit JREs.
> Now that we have packed ints, a simple fix would be to use a packed int
> array... and maybe instead of storing abs docID in the mapping, we could store
> the number of del docs seen so far (so the remap would do a lookup then a
> subtract). This may add some CPU cost to merging but should bring down
> transient RAM usage quite a bit.
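A minimal sketch of the phase-two idea (assuming trunk's packed-ints API of the time; buildDelCountMap and its arguments are illustrative, not MergeState's actual code):

{code:java}
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.packed.PackedInts;

// Sketch: store, per old docID, the count of deleted docs seen so far.
// That value is bounded by the segment's delete count, so it often
// needs far fewer bits than the new docID would.  For a live doc the
// new docID is then oldDocID - map.get(oldDocID).
static PackedInts.Mutable buildDelCountMap(int maxDoc, int delCount, Bits liveDocs) {
  PackedInts.Mutable map =
      PackedInts.getMutable(maxDoc, PackedInts.bitsRequired(Math.max(1, delCount)));
  int del = 0;
  for (int doc = 0; doc < maxDoc; doc++) {
    if (liveDocs != null && !liveDocs.get(doc)) {
      del++;  // this doc is deleted; it has no new docID
    }
    map.set(doc, del);
  }
  return map;
}
{code}

In the 1M docs / 100K deletes example above, the map needs only bitsRequired(100000) = 17 bits per entry instead of the ~20 bits a new-docID encoding would need, and the gap widens as the delete ratio shrinks.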
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245374#comment-13245374 ]

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

The proposal at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1
looks great! Some initial feedback:

* There are actually more than 2 codecs (eg we also have Lucene3x, SimpleText,
  sep/intblock (abstract), random codecs/postings formats for testing...), but
  our default codec now is Lucene40.
* I think you can use the existing abstract sep/intblock classes (ie, they
  implement layers like FieldsProducer/Consumer...), and then you can just
  implement the required methods (eg to encode/decode one int[] block).
* We may need to tune the skipper settings, based on profiling results from
  skip-intensive (Phrase, And) queries... since it's currently geared towards
  single-doc-at-once encoding. I don't think we should try to make a new
  skipper impl here... (there is a separate issue for that).
* Maybe explore the combination of pulsing and PForDelta codecs; seems like the
  combination of those two could be important, since for low docFreq terms,
  retrieving the docs is now more expensive...

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
> On the flex branch we explored a number of possible intblock encodings, but
> for whatever reason never brought them to completion. There are still a number
> of issues opened with patches in different states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.
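For readers unfamiliar with the encodings named in the title, here is a toy frame-of-reference (FOR) encoder for one int[] block -- the core operation such a postings format must implement. It is purely illustrative (the method name and header layout are invented here, not the codec API this issue would add):

{code:java}
// Toy FOR encoder: subtract the block minimum, then bit-pack the
// deltas with just enough bits per value.  Decoding reverses the two
// steps.  header[0] receives the frame of reference (the minimum),
// header[1] the bits used per packed value.
static long[] forEncode(int[] block, int[] header) {
  int min = Integer.MAX_VALUE;
  for (int v : block) min = Math.min(min, v);
  long maxDelta = 0;
  for (int v : block) maxDelta = Math.max(maxDelta, (long) v - min);
  int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
  header[0] = min;
  header[1] = bits;
  long[] packed = new long[(block.length * bits + 63) >>> 6];
  long bitPos = 0;
  for (int v : block) {
    long delta = (long) v - min;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    packed[word] |= delta << shift;
    if (shift + bits > 64) {
      packed[word + 1] |= delta >>> (64 - shift);  // spill into next word
    }
    bitPos += bits;
  }
  return packed;
}
{code}

PFOR extends this by storing a few outlier values as exceptions so one large value doesn't inflate the bit width for the whole block.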
[jira] [Commented] (LUCENE-3930) nuke jars from source tree and use ivy
[ https://issues.apache.org/jira/browse/LUCENE-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245799#comment-13245799 ]

Michael McCandless commented on LUCENE-3930:
--------------------------------------------

Shawn, I have similar problems with the builtin ant on Fedora 13, and when I
add that same echo line, I can see that the ~/.ant/lib/ivy-2.2.0.jar is on the
CLASSPATH... yet it fails the ivy-availability-check. I never got to the bottom
of it ... but installing ant myself (1.8.2) and using that version instead
worked around it...

> nuke jars from source tree and use ivy
> ---------------------------------------
>
> Key: LUCENE-3930
> URL: https://issues.apache.org/jira/browse/LUCENE-3930
> Project: Lucene - Java
> Issue Type: Task
> Components: general/build
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Blocker
> Fix For: 3.6, 4.0
> Attachments: LUCENE-3930-skip-sources-javadoc.patch,
> LUCENE-3930-solr-example.patch, LUCENE-3930-solr-example.patch,
> LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930.patch,
> LUCENE-3930__ivy_bootstrap_target.patch,
> LUCENE-3930_includetestlibs_excludeexamplexml.patch,
> ant_-verbose_clean_test.out.txt, langdetect-1.1.jar, noggit-commons-csv.patch,
> patch-jetty-build.patch, pom.xml
>
> As mentioned on the ML thread "switch jars to ivy mechanism?".
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245900#comment-13245900 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

Patch works -- I see lots of JARs on the classpath:
{noformat}
[echo] Current Classpath:
[echo] /usr/share/java/ant.jar
[echo] /usr/share/java/ant-launcher.jar
[echo] /usr/share/java/jaxp_parser_impl.jar
[echo] /usr/share/java/xml-commons-apis.jar
[echo] /usr/share/java/antlr.jar
[echo] /usr/share/java/ant/ant-antlr.jar
[echo] /usr/share/java/bcel.jar
[echo] /usr/share/java/ant/ant-apache-bcel.jar
[echo] /usr/share/java/oro.jar
[echo] /usr/share/java/ant/ant-apache-oro.jar
[echo] /usr/share/java/regexp.jar
[echo] /usr/share/java/ant/ant-apache-regexp.jar
[echo] /usr/share/java/xml-commons-resolver.jar
[echo] /usr/share/java/ant/ant-apache-resolver.jar
[echo] /usr/share/java/jakarta-commons-logging.jar
[echo] /usr/share/java/ant/ant-commons-logging.jar
[echo] /usr/share/java/javamail.jar
[echo] /usr/share/java/jaf.jar
[echo] /usr/share/java/ant/ant-javamail.jar
[echo] /usr/share/java/jdepend.jar
[echo] /usr/share/java/ant/ant-jdepend.jar
[echo] /usr/share/java/junit.jar
[echo] /usr/share/java/ant/ant-junit.jar
[echo] /usr/share/java/ant/ant-nodeps.jar
[echo] /usr/share/java/ant/ant-swing.jar
[echo] /usr/share/java/jaxp_transform_impl.jar
[echo] /usr/share/java/ant/ant-trax.jar
[echo] /usr/share/java/xalan-j2-serializer.jar
[echo] /usr/local/src/jdk1.6.0_21/lib/tools.jar
[echo] /home/mike/.ant/lib/maven-ant-tasks-2.1.3.jar
[echo] /home/mike/.ant/lib/ivy-2.2.0.jar
[echo] /usr/share/ant/lib/ant-swing.jar
[echo] /usr/share/ant/lib/ant-launcher.jar
[echo] /usr/share/ant/lib/ant-junit.jar
[echo] /usr/share/ant/lib/ant-bootstrap.jar
[echo] /usr/share/ant/lib/ant-apache-bcel.jar
[echo] /usr/share/ant/lib/ant-apache-oro.jar
[echo] /usr/share/ant/lib/ant-nodeps.jar
[echo] /usr/share/ant/lib/ant-apache-resolver.jar
[echo] /usr/share/ant/lib/ant-trax.jar
[echo] /usr/share/ant/lib/ant-apache-log4j.jar
[echo] /usr/share/ant/lib/ant-antlr.jar
[echo] /usr/share/ant/lib/ant-javamail.jar
[echo] /usr/share/ant/lib/ant-jdepend.jar
[echo] /usr/share/ant/lib/ant-apache-regexp.jar
[echo] /usr/share/ant/lib/ant-commons-logging.jar
{noformat}

That's just running ant, and it fails... ant --noconfig works (fortunately I
don't have/need ~/.antrc). Here's my /etc/ant.conf:
{noformat}
# ant.conf (Ant 1.7.x)
# JPackage Project <http://www.jpackage.org/>

# Validate --noconfig setting in case being invoked
# from pre Ant 1.6.x environment
if [ -z $no_config ] ; then
  no_config=true
fi

# Setup ant configuration
if $no_config ; then
  # Disable RPM layout
  rpm_mode=false
else
  # Use RPM layout
  rpm_mode=true
  # ANT_HOME for rpm layout
  ANT_HOME=/usr/share/ant
fi
{noformat}

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (LUCENE-3946) improve docs ivy verification output to explain classpath problems and mention --noconfig
[ https://issues.apache.org/jira/browse/LUCENE-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245902#comment-13245902 ]

Michael McCandless commented on LUCENE-3946:
--------------------------------------------

I passed --execdebug to ant, and when it fails (w/ the builtin Fedora ant) I
get this:
{noformat}
exec /usr/local/src/jdk1.6.0_21/bin/java -classpath /usr/share/java/ant.jar:/usr/share/java/ant-launcher.jar:/usr/share/java/jaxp_parser_impl.jar:/usr/share/java/xml-commons-apis.jar:/usr/share/java/antlr.jar:/usr/share/java/ant/ant-antlr.jar:/usr/share/java/bcel.jar:/usr/share/java/ant/ant-apache-bcel.jar:/usr/share/java/ant.jar:/usr/share/java/oro.jar:/usr/share/java/ant/ant-apache-oro.jar:/usr/share/java/regexp.jar:/usr/share/java/ant/ant-apache-regexp.jar:/usr/share/java/xml-commons-resolver.jar:/usr/share/java/ant/ant-apache-resolver.jar:/usr/share/java/jakarta-commons-logging.jar:/usr/share/java/ant/ant-commons-logging.jar:/usr/share/java/javamail.jar:/usr/share/java/jaf.jar:/usr/share/java/ant/ant-javamail.jar:/usr/share/java/jdepend.jar:/usr/share/java/ant/ant-jdepend.jar:/usr/share/java/junit.jar:/usr/share/java/ant/ant-junit.jar:/usr/share/java/ant/ant-nodeps.jar:/usr/share/java/ant/ant-swing.jar:/usr/share/java/jaxp_transform_impl.jar:/usr/share/java/ant/ant-trax.jar:/usr/share/java/xalan-j2-serializer.jar:/usr/local/src/jdk1.6.0_21/lib/tools.jar -Dant.home=/usr/share/ant -Dant.library.dir=/usr/share/ant/lib org.apache.tools.ant.launch.Launcher -cp
{noformat}

and then when I switch to the working ant:
{noformat}
exec /usr/local/src/jdk1.6.0_21/jre/bin/java -classpath /usr/local/src/apache-ant-1.8.2//lib/ant-launcher.jar -Dant.home=/usr/local/src/apache-ant-1.8.2/ -Dant.library.dir=/usr/local/src/apache-ant-1.8.2//lib org.apache.tools.ant.launch.Launcher -cp
{noformat}

> improve docs ivy verification output to explain classpath problems and mention --noconfig
> ------------------------------------------------------------------------------------------
>
> Key: LUCENE-3946
> URL: https://issues.apache.org/jira/browse/LUCENE-3946
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 3.6
> Reporter: Hoss Man
> Assignee: Hoss Man
> Fix For: 4.0
> Attachments: LUCENE-3946.patch
>
> Offshoot of LUCENE-3930, where Shawn reported...
> {quote}
> I can't get either branch_3x or trunk to build now, on a system that used to
> build branch_3x without complaint. It says that ivy is not available, even
> after doing ant ivy-bootstrap to download ivy into the home directory.
> Specifically I am trying to build solrj from trunk, but I can't even get ant
> in the root directory of the checkout to work. I'm on CentOS 6 with oracle
> jdk7 built using the city-fan.org SRPMs. Ant (1.7.1) and junit are installed
> from package repositories. Building a checkout of lucene_solr_3_5 on the same
> machine works fine.
> {quote}
> The root cause is that ant's global configs can be set up to ignore the user's
> personal lib dir. The suggested workaround is to run ant --noconfig, but we
> should also try to give the user feedback in our failure about exactly what
> classpath ant is currently using (because apparently ${java.class.path} is not
> actually it).
[jira] [Commented] (SOLR-3296) Explore alternatives to Commons CSV
[ https://issues.apache.org/jira/browse/SOLR-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244502#comment-13244502 ]

Michael McCandless commented on SOLR-3296:
------------------------------------------

bq. wrt commons-csv alternatives, it's too risky for little/no gain.

This confuses me: commons-csv is unreleased, while there are other
license-friendly packages (eg opencsv) that have been released for some time
(multiple releases), been tested in the field, had bugs found and fixed, etc.
Why use an unreleased package when released alternatives are available?

bq. I put a lot of effort into getting commons-csv up to snuff,

Wait: "a lot of effort" doing what? Did you have to modify commons-csv sources?
Or do you mean open issues w/ the commons devs to fix things/add test cases to
commons-csv sources (great!)...?

bq. Switching implementations would most likely result in a lot of regressions that we don't have tests for.

I'd expect the reverse, ie, it's more likely there are bugs in commons-csv
(it's not released and thus not heavily tested) than eg in opencsv. And if
somehow that's really the case (eg we have particular/unusual CSV parsing
requirements), we should have our own tests asserting so?

> Explore alternatives to Commons CSV
> ------------------------------------
>
> Key: SOLR-3296
> URL: https://issues.apache.org/jira/browse/SOLR-3296
> Project: Solr
> Issue Type: Improvement
> Components: Build
> Reporter: Chris Male
>
> In LUCENE-3930 we're implementing some less than ideal solutions to make
> available the unreleased version of commons-csv. We could remove these
> solutions if we didn't rely on this lib. So I think we should explore
> alternatives.
> I think [opencsv|http://opencsv.sourceforge.net/] is an alternative to
> consider; I've used it in many commercial projects. Bizarrely Commons-CSV's
> website says that Opencsv uses a BSD license, but this isn't the case: OpenCSV
> uses ASL2.
[jira] [Commented] (LUCENE-3939) ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243727#comment-13243727 ]

Michael McCandless commented on LUCENE-3939:
--------------------------------------------

bq. For example, if the first invocation of the method map is commented out (as below), then there is no exception thrown. In this case, the Comparator is still null.

This is because of sneakiness/trapiness in TreeSet (and maybe Java's type
erasure for generics), I think. Ie, on inserting only one object into it, it
does not need to cast that object to Comparable (there's nothing to compare
to). But on adding a 2nd object, it will try to cast.

> ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-3939
> URL: https://issues.apache.org/jira/browse/LUCENE-3939
> Project: Lucene - Java
> Issue Type: Bug
> Components: core/index
> Affects Versions: 3.0.2, 3.1, 3.4, 3.5
> Reporter: SHIN HWEI TAN
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> The method map in the SortedTermVectorMapper class does not check the
> parameter term for valid values. It throws ClassCastException when called with
> an invalid string for the parameter term (i.e., var3.map("*", (-1), null,
> null)). The exception thrown is due to an explicit cast (i.e., casting the
> return value of termToTVE.get(term) to type TermVectorEntry).
> Suggested fix -- replace the beginning of the method body for the class
> SortedTermVectorMapper by changing it like this:
> {noformat}
> public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
>   if (termToTVE.get(term) instanceof TermVectorEntry) {
>     TermVectorEntry entry = (TermVectorEntry) termToTVE.get(term);
>     ...
>   }
> }
> {noformat}
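The trap is easy to reproduce outside Lucene. A standalone illustration (not Lucene code; it shows the JDK 6 behavior the comment describes -- later JDKs type-check the very first key as well):

{code:java}
import java.util.TreeMap;

// With no Comparator supplied, TreeMap falls back to casting keys to
// Comparable.  On JDK 6 the first insert has nothing to compare
// against, so the ClassCastException only surfaces on the second.
public class TreeMapTrap {
  public static void main(String[] args) {
    TreeMap<Object, String> map = new TreeMap<Object, String>();
    map.put(new Object(), "first");   // JDK 6: succeeds, no comparison happens
    map.put(new Object(), "second");  // throws ClassCastException: Object is not Comparable
  }
}
{code}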
[jira] [Commented] (LUCENE-3938) Add query time parent child search
[ https://issues.apache.org/jira/browse/LUCENE-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243763#comment-13243763 ]

Michael McCandless commented on LUCENE-3938:
--------------------------------------------

I don't fully grok this yet :) ... but some initial questions:

I'm confused: when you say "parent child document", what does that mean...? I
thought there are parent documents and child documents, in the context of a
given join? Or do you mean "parent or child document"...?

Ie, it looks like your Query is free to match both parent and child
documents...? (Unlike index-time joins). But then you also have a
childrenQuery, which is only allowed to match docs in the child space...?

Minor: there's an @author tag in ParentChildCommand

Minor: maybe break out ParentChildHit into its own source file...?

> Add query time parent child search
> -----------------------------------
>
> Key: LUCENE-3938
> URL: https://issues.apache.org/jira/browse/LUCENE-3938
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/join
> Reporter: Martijn van Groningen
> Attachments: LUCENE-3938.patch
>
> At the moment there is support for index time parent child search with two
> query implementations and a collector. The index time parent child search
> requires that documents are indexed in a block, which isn't ideal for
> updatability. For example, in the case of tv content and subtitles (both being
> separate documents), updating already indexed tv content with subtitles would
> then require re-indexing the subtitles as well.
> This issue focuses on the collector part for query time parent child search.
> I started a while back with implementing this. Basically a two pass search
> performs a parent child search. In the first pass the top N parent child
> documents are resolved. In the second pass the parent or top N children are
> resolved (depending on whether the hit is a parent or child) and are
> associated with the top N parent child relation documents. Patch will follow
> soon.
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243782#comment-13243782 ]

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

Here's an example where we create a compound token with punctuation. I got this
from the Japanese Wikipedia export, with our MockCharFilter sometimes doubling
characters: we are at a position that has the characters 〇〇'''、''' after
it... that 〇 is this Unicode character:
http://www.fileformat.info/info/unicode/char/3007/index.htm

When Kuromoji extends from this position, both 〇 and 〇〇 are KNOWN, but then
we also extend by unknown 〇〇'''、''' (ie, 〇〇 plus only punctuation). Note
that 〇 is not considered punctuation by Kuromoji's isPunctuation method...
{noformat}
  + UNKNOWN word 〇〇'''、''' toPos=41 cost=21223 penalty=3400 toPos.idx=0
  + KNOWN word 〇〇 toPos=34 cost=9895 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=2766 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=5256 penalty=0 toPos.idx=1
{noformat}
And then on backtrace we make a compound token (UNKNOWN) for all of
〇〇'''、''', while the decompounded path keeps two separate 〇 tokens but
drops the '''、''' since it's all punctuation, thus creating inconsistent
offsets.

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3940
> URL: https://issues.apache.org/jira/browse/LUCENE-3940
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
> I modified BaseTokenStreamTestCase to assert that the start/end offsets match
> for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the
> decompounding of a compound token has a punctuation token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie, the
> graph should look the same as if the punctuation tokens were not initially
> removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail to leave
> a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the last
> minute...
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243785#comment-13243785 ]

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

OK here's one possible fix...

Right now, when we are glomming up an UNKNOWN token, we glom only as long as
the character class of each character is the same as the first character. What
if we also require that isPunct-ness is the same? That way we would never
create an UNKNOWN token mixing punct and non-punct...

I implemented that and the tests seem to pass w/ offset checking fully turned
on again...

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-3940
> URL: https://issues.apache.org/jira/browse/LUCENE-3940
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
> Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
> I modified BaseTokenStreamTestCase to assert that the start/end offsets match
> for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the
> decompounding of a compound token has a punctuation token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie, the
> graph should look the same as if the punctuation tokens were not initially
> removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail to leave
> a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the last
> minute...
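A minimal sketch of the proposed rule (all helper names here are hypothetical, not Kuromoji's actual internals):

{code:java}
// Sketch: when glomming characters into an UNKNOWN token, require each
// character to match the first one in *both* character class and
// punctuation-ness, so a single token never mixes punct and non-punct.
static int unknownTokenLength(char[] text, int start, int end) {
  int firstClass = characterClass(text[start]);        // hypothetical helper
  boolean firstIsPunct = isPunctuation(text[start]);   // hypothetical helper
  int pos = start + 1;
  while (pos < end
         && characterClass(text[pos]) == firstClass
         && isPunctuation(text[pos]) == firstIsPunct) {
    pos++;
  }
  return pos - start;  // length of the UNKNOWN token to emit
}
{code}

With the extra condition, a run like 〇〇'''、''' splits at the 〇〇/punctuation boundary instead of becoming one mixed token whose decompounded path has inconsistent offsets.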
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243102#comment-13243102 ] Michael McCandless commented on LUCENE-3932: bq. Is the space savings of delta encoding worth the processing time? You could write the .tii file to disk such that on open you could read it straight into a byte[]. This is actually what we do in 4.0's default codec (the index is an FST). It is tempting to do that in 3.x (if we were to do another 3.x release after 3.6) ... we'd need to alter other things as well, eg the term bytes are also delta-coded in the file but not in RAM. I'm curious how much larger it'd be if we stopped delta coding... for your case, how large is the byte[] in RAM (just call dataPagedBytes.getPointer(), just before we freeze it, and print that result) vs the tii on disk...? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges Attachments: LUCENE-3932.trunk.patch, perf.csv We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
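As an aside, the bit estimate suggested in the description can also be computed with PackedInts.bitsRequired, which avoids getting the log10 arithmetic wrong; a sketch:
{noformat}
// Since the stored offsets are bounded by the .tii file length, size the
// writer for that up front instead of starting at 4 bits and resizing.
int bitEstimate = PackedInts.bitsRequired(tiiFileLength);
GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);
{noformat}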
[jira] [Commented] (LUCENE-3939) ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243108#comment-13243108 ] Michael McCandless commented on LUCENE-3939: I'm confused on how something that's not a TermVectorEntry can get into the termToTVE map... can you post a small test case showing this problem? ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper -- Key: LUCENE-3939 URL: https://issues.apache.org/jira/browse/LUCENE-3939 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0.2, 3.1, 3.4, 3.5 Reporter: SHIN HWEI TAN Original Estimate: 0.05h Remaining Estimate: 0.05h The method map in the SortedTermVectorMapper class does not check the parameter term for valid values. It throws ClassCastException when called with an invalid string for the parameter term (i.e., var3.map(*, (-1), null, null)). The exception thrown is due to an explicit cast (i.e., casting the return value of termToTVE.get(term) to type TermVectorEntry). Suggested Fixes: Replace the beginning of the method body for the class SortedTermVectorMapper by changing it like this: public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) { if(termToTVE.get(term) instanceof TermVectorEntry){ TermVectorEntry entry = (TermVectorEntry) termToTVE.get(term); ... } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243125#comment-13243125 ] Michael McCandless commented on LUCENE-3738: Thanks Uwe, I'll test! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243138#comment-13243138 ] Michael McCandless commented on LUCENE-3738: Alas, the results are now all over the place! And I went back to the prior patch and tried to reproduce the above results... and the results are still all over the place. I think we are chasing Java ghosts at this point... Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243293#comment-13243293 ] Michael McCandless commented on LUCENE-3738: Sorry Uwe, that was exactly it: I don't know what to conclude from the perf runs anymore. But +1 for your new patch: it ought to be better since the code is simpler. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Blocker Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738-improvement.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243299#comment-13243299 ] Michael McCandless commented on LUCENE-3940: bq. StandardTokenizer doesnt leave holes when it drops punctuation, But is that really good? This means a PhraseQuery will match across end-of-sentence (.), semicolon, colon, comma, etc. (English examples..). I think tokenizers should throw away as little information as possible... we can always filter out such tokens in a later stage? For example, if a tokenizer created punct tokens (instead of silently discarding them), other token filters could make use of them in the mean time, eg a synonym rule for u.s.a. -> usa or maybe a dedicated English acronyms filter. We could then later filter them out, even not leaving holes, and have the same behavior that we have now? Are there non-English examples where you would want the PhraseQuery to match over punctuation...? EG, for Japanese, I assume we don't want PhraseQuery applying across periods/commas, like it will now? (Not sure about middle dot...? Others...?). When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole - Key: LUCENE-3940 URL: https://issues.apache.org/jira/browse/LUCENE-3940 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: LUCENE-3940.patch, LUCENE-3940.patch I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the decompounding of a compound token has a punctuation token that's dropped. In this case we should leave hole(s) so that the graph is intact, ie, the graph should look the same as if the punctuation tokens were not initially removed, but then a StopFilter had removed them. This also affects tokens that have no compound over them, ie we fail to leave a hole today when we remove the punctuation tokens. I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3930) nuke jars from source tree and use ivy
[ https://issues.apache.org/jira/browse/LUCENE-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242520#comment-13242520 ] Michael McCandless commented on LUCENE-3930: bq. In my opinion this is ready to go into trunk. I'll wait a bit for any feedback though. +1 ant test passes, after the one-time ant ivy-bootstrap. Thanks everyone! nuke jars from source tree and use ivy -- Key: LUCENE-3930 URL: https://issues.apache.org/jira/browse/LUCENE-3930 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir Assignee: Robert Muir Priority: Blocker Fix For: 3.6 Attachments: LUCENE-3930-skip-sources-javadoc.patch, LUCENE-3930-solr-example.patch, LUCENE-3930-solr-example.patch, LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930.patch, LUCENE-3930__ivy_bootstrap_target.patch, ant_-verbose_clean_test.out.txt, noggit-commons-csv.patch, patch-jetty-build.patch As mentioned on the ML thread: switch jars to ivy mechanism?. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242541#comment-13242541 ] Michael McCandless commented on LUCENE-3932:
{quote}
utf8 -> utf16 is 7% of the time
utf16 -> utf8 is 16% of the time
writing vlong's is also 16% of the time
TermBuffer.read() is 17% of the time (24% if you include the call to utf8ToUtf16)
{quote}
Seems like if we decoded the .tii file and wrote the in-memory format directly (instead of going through SegmentTermEnum), we could get some of this back. The vLongs unfortunately need to be decoded/re-encoded because they are deltas in the file but absolutes in memory. But, eg the vInt docFreq could be a copyVInt method instead of readVInt then writeVInt, which should save a bit. bq. Trying with 3.4 gives a 4 second load time, most of the time spent in SegmentTermEnum.next(). OK, a bit faster than 3.5. But presumably 3.4 uses much more RAM after startup...? bq. Using the patch on trunk, load time goes from ~5 to ~2 seconds. Awesome, thanks for testing! Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges Attachments: LUCENE-3932.trunk.patch, perf.csv We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
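A sketch of the copyVInt idea (a hypothetical helper, not an existing DataInput/DataOutput method): since the high bit of each byte marks continuation, a vInt can be copied byte-for-byte without decoding and re-encoding:
{noformat}
static void copyVInt(DataInput in, DataOutput out) throws IOException {
  byte b;
  do {
    b = in.readByte();
    out.writeByte(b);
  } while ((b & 0x80) != 0); // high bit set means more bytes follow
}
{noformat}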
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242550#comment-13242550 ] Michael McCandless commented on LUCENE-3738: Removing the asserts apparently didn't change the perf... I can reproduce the slowdown in a separate test (before/after this commit):
{noformat}
            Task    QPS base  StdDev base    QPS vInt  StdDev vInt    Pct diff
          IntNRQ        7.11         0.89        6.73         0.58   -23% - 17%
         Prefix3       16.07         0.96       15.65         0.72   -12% -  8%
        Wildcard       20.14         0.91       19.67         0.77   -10% -  6%
        PKLookup      154.62         5.08      151.11         2.82    -7% -  2%
          Fuzzy1       85.24         1.53       83.87         1.18    -4% -  1%
          Fuzzy2       44.11         1.03       43.96         0.44    -3% -  3%
        SpanNear        3.23         0.11        3.22         0.07    -5% -  5%
  TermBGroup1M1P       42.35         0.49       42.43         1.43    -4% -  4%
         Respell       65.11         1.91       65.27         1.27    -4% -  5%
      AndHighMed       54.18         4.04       54.50         2.27   -10% - 13%
     TermGroup1M       31.27         0.35       31.46         0.63    -2% -  3%
    TermBGroup1M       45.01         0.33       45.37         1.42    -3% -  4%
     AndHighHigh       13.35         0.71       13.46         0.50    -7% - 10%
            Term       82.71         3.12       83.56         2.33    -5% -  7%
       OrHighMed       10.66         0.67       10.78         0.44    -8% - 12%
      OrHighHigh        7.08         0.42        7.19         0.26    -7% - 11%
    SloppyPhrase        5.11         0.24        5.20         0.31    -8% - 13%
          Phrase       11.14         0.75       11.40         0.50    -8% - 14%
{noformat}
But then Uwe made a patch (I'll attach) reducing the byte code for the unrolled methods:
{noformat}
            Task    QPS base  StdDev base    QPS vInt  StdDev vInt    Pct diff
        SpanNear        3.24         0.13        3.18         0.07    -7% -  4%
          Phrase       11.34         0.68       11.13         0.38   -10% -  7%
    SloppyPhrase        5.17         0.23        5.08         0.18    -9% -  6%
  TermBGroup1M1P       41.92         0.80       41.57         0.94    -4% -  3%
     TermGroup1M       30.74         0.68       30.81         0.96    -5% -  5%
            Term       80.87         3.52       81.29         2.05    -6% -  7%
    TermBGroup1M       43.94         0.93       44.17         1.32    -4% -  5%
      AndHighMed       53.71         2.62       54.21         1.97    -7% -  9%
     AndHighHigh       13.20         0.42       13.41         0.41    -4% -  8%
         Respell       65.37         2.70       66.53         3.29    -7% - 11%
          Fuzzy1       84.29         2.11       86.44         3.36    -3% -  9%
        PKLookup      149.81         4.20      153.87         9.46    -6% - 12%
      OrHighHigh        7.19         0.28        7.40         0.48    -7% - 13%
       OrHighMed       10.82         0.43       11.16         0.73    -7% - 14%
          Fuzzy2       43.72         0.96       45.24         2.03    -3% - 10%
        Wildcard       18.96         1.00       20.05         0.39    -1% - 13%
         Prefix3       14.96         0.83       15.89         0.27    -1% - 14%
          IntNRQ        5.89         0.58        6.95         0.17     4% - 34%
{noformat}
So... I think we should commit it! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
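For reference, this is the compact loop form of vInt decoding that the unrolled methods replace; the unrolling trades this loop for straight-line code, which is why the size of the generated byte code matters for inlining:
{noformat}
public int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}
{noformat}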
[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
[ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241328#comment-13241328 ] Michael McCandless commented on LUCENE-3935: +1 Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method --- Key: LUCENE-3935 URL: https://issues.apache.org/jira/browse/LUCENE-3935 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Attachments: LUCENE-3935.patch I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected. This method is currently backed by a {{short[][]}}. The data stored here is a two-dimensional array with both dimensions fixed at 1316 elements. (The data is {{matrix.def}} in MeCab-IPADIC.) We can rewrite this to use a single one-dimensional array instead, and we will save at least one bounds check and a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache. I think this will be a nice optimization. Working on it... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
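A sketch of the proposed rewrite, with hypothetical field names: flattening the fixed 1316x1316 matrix into a single short[] drops one bounds check and one pointer dereference per lookup, and keeps the hot data contiguous:
{noformat}
private final short[] costs;      // forwardSize * backwardSize entries
private final int backwardSize;   // 1316 for MeCab-IPADIC

public short get(int forwardId, int backwardId) {
  // row-major layout: costs[forwardId][backwardId] becomes a single index
  return costs[forwardId * backwardSize + backwardId];
}
{noformat}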
[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241354#comment-13241354 ] Michael McCandless commented on LUCENE-3312: Hi Nikola, I think this plus LUCENE-3891 sounds great! The challenge is... we need a mentor for this project... volunteers? Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: Field Type branch In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null. One downside is possible perf hit for fields that are both indexed and stored (ie, we visit them twice, lookup their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
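A hypothetical sketch of what the split could look like at the indexing boundary (names taken from this issue; this is not a real API yet):
{noformat}
// Indexing-time view and storage-time view become separate narrow APIs;
// a document is then just two iterables, either of which may be null.
interface DocConsumer {
  void addDocument(Iterable<? extends IndexableField> indexedFields,
                   Iterable<? extends StorableField> storedFields) throws IOException;
}
{noformat}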
[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
[ https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241357#comment-13241357 ] Michael McCandless commented on LUCENE-3907: Awesome! We just need a possible mentor here... volunteers...? Improve the Edge/NGramTokenizer/Filters --- Key: LUCENE-3907 URL: https://issues.apache.org/jira/browse/LUCENE-3907 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.0 Our ngram tokenizers/filters could use some love. EG, they output ngrams in multiple passes, instead of stacked, which messes up offsets/positions and requires too much buffering (can hit OOME for long tokens). They clip at 1024 chars (tokenizers) but don't (token filters). They split up surrogate pairs incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
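On the surrogate-pair point, the usual fix is to walk code points rather than chars when choosing gram boundaries; a small self-contained sketch:
{noformat}
// Advancing by Character.charCount(cp) never splits a surrogate pair.
static int countCodePoints(CharSequence s) {
  int count = 0;
  for (int i = 0; i < s.length(); ) {
    int cp = Character.codePointAt(s, i);
    i += Character.charCount(cp);
    count++;
  }
  return count;
}
{noformat}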
[jira] [Commented] (LUCENE-3936) Rename StringIndexDocValues to DocTermsIndexDocValues
[ https://issues.apache.org/jira/browse/LUCENE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241359#comment-13241359 ] Michael McCandless commented on LUCENE-3936: +1 Rename StringIndexDocValues to DocTermsIndexDocValues - Key: LUCENE-3936 URL: https://issues.apache.org/jira/browse/LUCENE-3936 Project: Lucene - Java Issue Type: Improvement Components: modules/other Reporter: Martijn van Groningen Fix For: 4.0 Attachments: LUCENE-3936.patch StringIndex doesn't exist any more in trunk, so the name DocTermsIndex should be used, and this is also what the class actually uses. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2000) Use covariant clone() return types
[ https://issues.apache.org/jira/browse/LUCENE-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241364#comment-13241364 ] Michael McCandless commented on LUCENE-2000: We now get a bunch of redundant cast warnings from this ... are there plans to fix that...? Use covariant clone() return types -- Key: LUCENE-2000 URL: https://issues.apache.org/jira/browse/LUCENE-2000 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Ryan McKinley Priority: Minor Fix For: 4.0 Attachments: LUCENE-2000-clone_covariance.patch, LUCENE-2000-clone_covariance.patch *Paul Cowan wrote in LUCENE-1257:* OK, thought I'd jump in and help out here with one of my Java 5 favourites. Haven't seen anyone discuss this, and don't believe any of the patches address this, so thought I'd throw a patch out there (against SVN HEAD @ revision 827821) which uses Java 5 covariant return types for (almost) all of the Object#clone() implementations in core. i.e. this: public Object clone() { changes to: public SpanNotQuery clone() { which lets us get rid of a whole bunch of now-unnecessary casts, so e.g. if (clone == null) clone = (SpanNotQuery) this.clone(); becomes if (clone == null) clone = this.clone(); Almost everything has been done and all downcasts removed, in core, with the exception of:
- Some SpanQuery stuff, where it's assumed that it's safe to cast the clone() of a SpanQuery to a SpanQuery - this can't be made covariant without declaring abstract SpanQuery clone() in SpanQuery itself, which breaks those SpanQuerys that don't declare their own clone()
- Some IndexReaders, e.g. DirectoryReader - we can't be more specific than changing .clone() to return IndexReader, because it returns the result of IndexReader.clone(boolean). We could use covariant types for THAT, which would work fine, but that didn't follow the pattern of the others so that could be a later commit.
Two changes were also made in contrib/, where not making the changes would have broken code by trying to widen IndexInput#clone() back out to returning Object, which is not permitted. contrib/ was otherwise left untouched. Let me know what you think, or if you have any other questions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241386#comment-13241386 ] Michael McCandless commented on LUCENE-3932: I agree net/net that change is good; we know the in-RAM image will be at least as large as the tii file so we should make a better guess up front. 3.x is currently in code freeze (for the 3.6.0 release), but I'll commit to trunk's preflex codec. Can you describe more about your index...? If your tii file is 66 MB, how many terms do you have...? 5 seconds is also a long startup time... what's the IO system like? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2000) Use covariant clone() return types
[ https://issues.apache.org/jira/browse/LUCENE-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241417#comment-13241417 ] Michael McCandless commented on LUCENE-2000: Thanks Ryan! Use covariant clone() return types -- Key: LUCENE-2000 URL: https://issues.apache.org/jira/browse/LUCENE-2000 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Ryan McKinley Priority: Minor Fix For: 4.0 Attachments: LUCENE-2000-clone_covariance.patch, LUCENE-2000-clone_covariance.patch *Paul Cowan wrote in LUCENE-1257:* OK, thought I'd jump in and help out here with one of my Java 5 favourites. Haven't seen anyone discuss this, and don't believe any of the patches address this, so thought I'd throw a patch out there (against SVN HEAD @ revision 827821) which uses Java 5 covariant return types for (almost) all of the Object#clone() implementations in core. i.e. this: public Object clone() { changes to: public SpanNotQuery clone() { which lets us get rid of a whole bunch of now-unnecessary casts, so e.g. if (clone == null) clone = (SpanNotQuery) this.clone(); becomes if (clone == null) clone = this.clone(); Almost everything has been done and all downcasts removed, in core, with the exception of Some SpanQuery stuff, where it's assumed that it's safe to cast the clone() of a SpanQuery to a SpanQuery - this can't be made covariant without declaring abstract SpanQuery clone() in SpanQuery itself, which breaks those SpanQuerys that don't declare their own clone() Some IndexReaders, e.g. DirectoryReader - we can't be more specific than changing .clone() to return IndexReader, because it returns the result of IndexReader.clone(boolean). We could use covariant types for THAT, which would work fine, but that didn't follow the pattern of the others so that could be a later commit. Two changes were also made in contrib/, where not making the changes would have broken code by trying to widen IndexInput#clone() back out to returning Object, which is not permitted. contrib/ was otherwise left untouched. Let me know what you think, or if you have any other questions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1591) Enable bzip compression in benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241428#comment-13241428 ] Michael McCandless commented on LUCENE-1591: Note that enwiki-20110115-pages-articles.xml.bz2 also hits XERCESJ-1257 ... Enable bzip compression in benchmark Key: LUCENE-1591 URL: https://issues.apache.org/jira/browse/LUCENE-1591 Project: Lucene - Java Issue Type: Improvement Components: modules/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9, 3.1, 4.0 Attachments: LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, commons-compress-dev20090413.jar, commons-compress-dev20090413.jar bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and in the relevant tasks either decompress the input file or compress the output file using the bzip streams. It will add a dependency on ant.jar which contains two classes similar to GZIPOutputStream and GZIPInputStream which compress/decompress files using the bzip algorithm. bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower. I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241438#comment-13241438 ] Michael McCandless commented on LUCENE-3937: LUCENE-1591 is when we first tripped on the XERCESJ-1257 bug... and the bug also happens on the enwiki-20110115-pages-articles.xml.bz2 export. Great idea to workaround Xercesj's bug by using the JVM to decode UTF8, instead of Xercesj... I'll test this patch now! Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
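A sketch of the workaround (variable names are hypothetical): let the JVM decode UTF-8, configured to throw on malformed input, and hand the parser a Reader so Xerces' own UTF8Reader is never used:
{noformat}
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of
    .onUnmappableCharacter(CodingErrorAction.REPORT); // silently replacing
Reader reader = new BufferedReader(new InputStreamReader(is, decoder));
// the parser can no longer switch encodings, so the input must be UTF-8
saxParser.parse(new InputSource(reader), handler);
{noformat}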
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241445#comment-13241445 ] Michael McCandless commented on LUCENE-3937: Note: I just ran benchmark's conf/extractWikipedia.alg task on the XML export... when XERCESJ-1257 strikes you get this:
{noformat}
...
[java] 936.83 sec -- main Wrote 2801000 line docs
[java] 937.04 sec -- main Wrote 2802000 line docs
[java] 937.27 sec -- main Wrote 2803000 line docs
[java] 937.53 sec -- main Wrote 2804000 line docs
[java] 937.79 sec -- main Wrote 2805000 line docs
[java] 938.04 sec -- main Wrote 2806000 line docs
[java] 938.35 sec -- main Wrote 2807000 line docs
[java] 938.65 sec -- main Wrote 2808000 line docs
[java] 938.88 sec -- main Wrote 2809000 line docs
[java] 939.09 sec -- main Wrote 2810000 line docs
[java] 939.09 sec -- main Wrote 2810000 line docs
[java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java] at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:198)
[java] at java.lang.Thread.run(Thread.java:619)
[java]
[java] ### D O N E !!! ###
[java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java]
[java] at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
[java] at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
[java] at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
[java] at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[java] at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[java] at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
[java] at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[java] at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
[java] at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:175)
[java] ... 1 more
[java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
[java] at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
[java] at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
[java] at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[java] at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
[java] ... 8 more
{noformat}
Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3937) Workaround the XERCES-J bug in Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241486#comment-13241486 ] Michael McCandless commented on LUCENE-3937: OK with this patch the decode of enwiki-20110115 finished! I agree we should tell the decoder to throw an exception on any problems... Workaround the XERCES-J bug in Benchmark Key: LUCENE-3937 URL: https://issues.apache.org/jira/browse/LUCENE-3937 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Attachments: LUCENE-3937.patch In benchmark we have a patched version of XERCES which is hard to compile from source. Looking at the patched code and the source of EnwikiContentSource, we can simply provide the XML parser a Reader instead of an InputStream, so the broken code is not triggered. This assumes that the XML file is always UTF-8. If not, it will no longer work (because the XML parser cannot switch encoding if it only has a Reader). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3932) Improve load time of .tii files
[ https://issues.apache.org/jira/browse/LUCENE-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241497#comment-13241497 ] Michael McCandless commented on LUCENE-3932: Nice. I'd love to know how trunk handles all these terms (we have a more memory efficient terms dict/index in 4.0). bq. After the change the big time waste is converting the terms from utf8 to utf16 when reading from the .tii file, and then back to utf8 when writing to the in memory store. What percentage of the time is spent on the decode/encode (after fixing the initial bitEstimate)? That is very silly... fixing that is a somewhat deeper change though. I guess we'd need to read the .tii file directly (not use SegmentTermEnum), and then copy the UTF8 bytes straight without going through UTF16... Do you have comparisons with pre-3.5 (before we cutover to this more RAM-efficient (but CPU heavy on load) terms index)? Presumably that used less CPU on init, but more RAM held for the lifetime of the reader...? Improve load time of .tii files --- Key: LUCENE-3932 URL: https://issues.apache.org/jira/browse/LUCENE-3932 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.5 Environment: Linux Reporter: Sean Bridges We have a large 50 gig index which is optimized as one segment, with a 66 MEG .tii file. This index has no norms, and no field cache. It takes about 5 seconds to load this index, profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing PackedInts.Mutable current. In the constructor for TermInfosReaderIndex, you initialize the writer with the line, {quote}GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);{quote} For our index using four as the bit estimate results in 27 resizes. The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use, {quote}int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2)); GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);{quote} Load time improves to ~ 2 seconds. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
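A sketch of skipping the UTF16 round trip for the term bytes (hypothetical; the real change would have to read the .tii directly rather than via SegmentTermEnum):
{noformat}
// in: DataInput over the .tii file; out: DataOutput building the in-RAM image.
// Copy the term suffix bytes as-is instead of utf8ToUtf16 then utf16ToUtf8.
int suffixLength = in.readVInt();
out.writeVInt(suffixLength);
out.copyBytes(in, suffixLength);
{noformat}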
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240527#comment-13240527 ] Michael McCandless commented on LUCENE-3892: That's great Han, I'll have a look. I can be a mentor for this... Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) - Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.0 On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
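For anyone picking this up, a minimal frame-of-reference (FOR) encode sketch; PFOR adds exception slots for outliers on top of this. The PackedInts calls are assumptions about the trunk API of the time:
{noformat}
// Encode a block of non-negative ints (docID deltas would be computed
// before this step): find the max, then pack everything with just enough bits.
static void forEncode(int[] block, DataOutput out) throws IOException {
  int max = 0;
  for (int v : block) max = Math.max(max, v);
  int bits = PackedInts.bitsRequired(max);
  out.writeVInt(bits);
  PackedInts.Writer writer = PackedInts.getWriter(out, block.length, bits);
  for (int v : block) writer.add(v);
  writer.finish();
}
{noformat}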
[jira] [Commented] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239447#comment-13239447 ] Michael McCandless commented on SOLR-3282: -- This sounds like a fabulous test! I wonder if we can somehow make this easily runnable on demand (eg, like Test2BTerms), assuming you have the prereqs installed locally (eg Japanese Wikipedia export). Perform Kuromoji/Japanese stability test before 3.6 freeze -- Key: SOLR-3282 URL: https://issues.apache.org/jira/browse/SOLR-3282 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Assignee: Christian Moen Kuromoji might be used by many and also in mission critical systems. I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows: # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a never ending loop # Simultaneously run many tens of thousands typical Japanese queries against the index at 3-5 queries per second with highlighting turned on While Solr is indexing and searching, I'd like to verify that: * Indexing and queries are working as expected * Memory and heap usage looks stable over time * Garbage collection is overall low over time -- no Full-GC issues I'll post findings and results to this JIRA. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239667#comment-13239667 ] Michael McCandless commented on LUCENE-3738: +1 to remove those asserts... let's see if this fixes the slowdown the nightly builds hit on 3/18: http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: ByteArrayDataInput.java.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3076) Solr should support block joins
[ https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238239#comment-13238239 ] Michael McCandless commented on SOLR-3076: -- {quote} 2. Do you agree with overall approach to deliver straightforward QP with explicit joining syntax? Or you object and insist on entity-relationship-schema approach? 3. What's is the level of uncertainty you have about the current QP syntax? What's your main concern and what's the way to improve it? {quote} Well, stepping back, my concern is still that I don't think there should be any QP syntax to express block joins. These are joins determined at indexing time, and compiled into the index, and so the only remaining query-time freedom is which fields you want to search against (something QP can already understand, ie field:text syntax). From that fields list the required joins are implied. I can't imagine users learning/typing the sort of syntax we are discussing here. It's true there are exceptional cases (Hoss's size field that's on both parent and child docs), but, that's the exception not the rule; I don't think we should design things (APIs, QP syntax) around exceptional cases. And, I think such an exception should be handled by some sort of field aliasing (book_page_count vs chapter_page_count). For query-time join, which is fully flexible, I agree the QP must (and already does) include join syntax, ie be more like SQL, where you can express arbitrary on-the-fly joins. But, at the same time, the 'users' of Solr's QP syntax may not be the end user, ie, the app's front end may very well construct these complex join expressions and so it's really the developers of that search app writing these join queries. So perhaps it's fine to add crazy-expert syntax that end users would rarely use but search app developers might...? All this being said, I defer to Hoss (and other committers more experienced w/ Solr QP issues) here... if they all feel this added QP syntax makes sense then let's do it! Solr should support block joins --- Key: SOLR-3076 URL: https://issues.apache.org/jira/browse/SOLR-3076 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Attachments: SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, child-bjqparser.patch, parent-bjq-qparser.patch, parent-bjq-qparser.patch, solrconf-bjq-erschema-snippet.xml, tochild-bjq-filtered-search-fix.patch Lucene has the ability to do block joins, we should add it to Solr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3923) fail the build on wrong svn:eol-style
[ https://issues.apache.org/jira/browse/LUCENE-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238355#comment-13238355 ] Michael McCandless commented on LUCENE-3923: +1 And, ideally, ant test as well... fail the build on wrong svn:eol-style - Key: LUCENE-3923 URL: https://issues.apache.org/jira/browse/LUCENE-3923 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir I'm tired of fixing this before releases. Jenkins should detect and fail on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3873) tie MockGraphTokenFilter into all analyzers tests
[ https://issues.apache.org/jira/browse/LUCENE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238398#comment-13238398 ] Michael McCandless commented on LUCENE-3873: LUCENE-3848 has the MockGraphTokenFilter patch... tie MockGraphTokenFilter into all analyzers tests - Key: LUCENE-3873 URL: https://issues.apache.org/jira/browse/LUCENE-3873 Project: Lucene - Java Issue Type: Task Components: modules/analysis Reporter: Robert Muir Mike made a MockGraphTokenFilter on LUCENE-3848. Many filters currently aren't tested with anything but a simple tokenstream. We should test them with this, too, it might find bugs (zero-length terms, stacked terms/synonyms, etc) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3659) Improve Javadocs of RAMDirectory to document its limitations and add improvements to make it more GC friendly on large indexes
[ https://issues.apache.org/jira/browse/LUCENE-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238402#comment-13238402 ] Michael McCandless commented on LUCENE-3659: This looks great Uwe! I'm a little worried about the tiny file case; you're checking for SEGMENTS_* now, but many other files can be much smaller than 1/64th of the estimated segment size. I wonder if we should improve IOContext to hold the [rough] estimated file size (not just overall segment size)... the thing is that's sort of a hassle on codec impls. Or: maybe, on closing the ROS/RAMFile, we can downsize the final buffer (yes, this means copying the bytes, but that cost is vanishingly small as the RAMDir grows). Then tiny files stay tiny, though they are still [relatively] costly to create... I don't think RAMDir.createOutput should publish the RAMFile until the ROS is closed? Ie, you are not allowed to openInput on something still opened with createOutput in any Lucene Dir impl..? This would allow us to make RAMFile frozen (eg if ROS holds its own buffers and then creates RAMFile on close), which requires no sync when reading? I also don't think RAMFile should be public, ie, the only way to make changes to a file stored in a RAMDir is via RAMOutputStream. We can do this separately... Maybe we should pursue a growing buffer size...? Ie, where each newly added buffer is bigger than the one before (like ArrayUtil.oversize's growth function)... I realize that adds complexity (RAMInputStream.seek is more fun), but this would let tiny files use tiny RAM and huge files use few buffers. Ie, RAMDir would scale up and scale down well. Separately: I noticed we still have IndexOutput.setLength, but, nobody calls it anymore I think? (In 3.x we call this when creating a CFS). Maybe we should remove it... Improve Javadocs of RAMDirectory to document its limitations and add improvements to make it more GC friendly on large indexes -- Key: LUCENE-3659 URL: https://issues.apache.org/jira/browse/LUCENE-3659 Project: Lucene - Java Issue Type: Task Affects Versions: 3.5, 4.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3659.patch, LUCENE-3659.patch, LUCENE-3659.patch Spinoff from several dev@lao issues: - [http://mail-archives.apache.org/mod_mbox/lucene-dev/201112.mbox/%3C001001ccbf1c%2471845830%24548d0890%24%40thetaphi.de%3E] - issue LUCENE-3653 The use cases for RAMDirectory are very limited and to prevent users from using it for e.g. loading a 50 Gigabyte index from a file on disk, we should improve the javadocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
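A sketch of the growing-buffer idea (the growth factor here is a placeholder; ArrayUtil.oversize uses a more refined function):
{noformat}
// Each newly added buffer is larger than the previous one, so tiny files
// stay tiny while huge files need only a logarithmic number of buffers.
static int nextBufferSize(int current, int maxBufferSize) {
  int next = current + (current >> 1); // grow by ~1.5x
  return Math.min(next, maxBufferSize);
}
{noformat}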
[jira] [Commented] (LUCENE-3873) tie MockGraphTokenFilter into all analyzers tests
[ https://issues.apache.org/jira/browse/LUCENE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238422#comment-13238422 ] Michael McCandless commented on LUCENE-3873: I agree we can use it in specific places for starters... The patch on LUCENE-3848 mixes in TokenStream to Automaton and MockGraphTokenFilter; I'll split that apart and only commit MockGraphTokenFilter here. One problem is... MockGraphTokenFilter isn't setting offsets currently. I think to do this correctly it needs to buffer up pending input tokens until it's reached the posLength it wants to output for a random token, and then set the offsets accordingly. tie MockGraphTokenFilter into all analyzers tests - Key: LUCENE-3873 URL: https://issues.apache.org/jira/browse/LUCENE-3873 Project: Lucene - Java Issue Type: Task Components: modules/analysis Reporter: Robert Muir Assignee: Michael McCandless Mike made a MockGraphTokenFilter on LUCENE-3848. Many filters currently aren't tested with anything but a simple tokenstream. We should test them with this, too; it might find bugs (zero-length terms, stacked terms/synonyms, etc.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
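The buffering Mike describes could look roughly like the following; purely a sketch with a hypothetical helper, not the committed fix. The point is that coverage is decided by position increments (not token counts), and the synthetic token's offsets must come from the real input tokens it spans:

{code}
import java.util.List;
import org.apache.lucene.analysis.Token;

// Sketch only: given input tokens already buffered while generating a
// synthetic token, compute the offsets of the span that token covers.
static int[] offsetsForSyntheticToken(List<Token> buffered, int posLength) {
  int covered = 0;
  int end = -1;
  for (Token t : buffered) {
    covered += t.getPositionIncrement();
    end = t.endOffset();
    if (covered >= posLength) {
      break;  // buffered span now covers the synthetic token's posLength
    }
  }
  return new int[] { buffered.get(0).startOffset(), end };
}
{code}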
[jira] [Commented] (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237893#comment-13237893 ] Michael McCandless commented on LUCENE-1410: bq. Out of curiosity, is the PFOR effort dead? Nothing in open source is ever dead! (Well, rarely...). It's just that nobody has picked this up again and pushed it to a committable state. I think now that we have no more bulk API in trunk, it may not be that much work to finish... though there could easily be surprises. I opened LUCENE-3892 to do exactly this, as a Google Summer of Code project. PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: core/index Reporter: Paul Elschot Priority: Minor Fix For: Bulk Postings branch Attachments: LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java, autogen.tgz, for-summary.txt Original Estimate: 21,840h Remaining Estimate: 21,840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
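For reference, the core idea of Patched Frame of Reference is simple even if a committable implementation is not: pack each value of a block into b bits, and store the values that don't fit as "patch" exceptions. A toy sketch (real implementations do actual bit packing and chain exceptions through the placeholder slots):

{code}
import java.util.ArrayList;
import java.util.List;

// Toy PFOR encoder: values that fit in b bits go into the packed stream;
// overflowing values become (position, value) exceptions patched back in
// at decode time.  Assumes non-negative input values.
static List<int[]> pforEncode(int[] block, int b, int[] packed) {
  final int max = (1 << b) - 1;
  List<int[]> exceptions = new ArrayList<int[]>();
  for (int i = 0; i < block.length; i++) {
    if (block[i] <= max) {
      packed[i] = block[i];
    } else {
      packed[i] = 0;                              // placeholder
      exceptions.add(new int[] { i, block[i] });  // the "patch"
    }
  }
  return exceptions;
}
{code}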
[jira] [Commented] (LUCENE-3581) IndexReader#isCurrent() should return true on a NRT reader if no deletes are applied and only deletes are present in IW
[ https://issues.apache.org/jira/browse/LUCENE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237926#comment-13237926 ] Michael McCandless commented on LUCENE-3581: This need not block 3.6.0, right? We are returning false when we could return true from isCurrent, but this just means the app will go through the reopen when it didn't have to...? Ie, relatively minor? IndexReader#isCurrent() should return true on a NRT reader if no deletes are applied and only deletes are present in IW --- Key: LUCENE-3581 URL: https://issues.apache.org/jira/browse/LUCENE-3581 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.5, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 3.6, 4.0 I keep forgetting about this; I better open an issue. If you have a NRT reader without deletes applied it should in fact return true on IR#isCurrent() if the IW only has deletes in its buffer, ie. no documents were updated/added since the NRT reader was opened. Currently if there is a delete coming in we force a reopen which does nothing since deletes are not applied anyway. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
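The reopen pattern affected here, roughly (a sketch against the 3.x-era API): a false negative from isCurrent() only costs an openIfChanged call that may hand back an equivalent reader, which is why this is relatively minor:

{code}
// Sketch: NRT reader refresh loop.  If isCurrent() wrongly returns false,
// the only penalty is an unnecessary openIfChanged.
IndexReader reader = IndexReader.open(writer, true);  // NRT, applyAllDeletes
// ... later, on refresh:
if (!reader.isCurrent()) {  // may be a false negative per this issue
  IndexReader newReader = IndexReader.openIfChanged(reader);
  if (newReader != null) {
    reader.close();
    reader = newReader;
  }
}
{code}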
[jira] [Commented] (LUCENE-3919) more thorough testing of analysis chains
[ https://issues.apache.org/jira/browse/LUCENE-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237994#comment-13237994 ] Michael McCandless commented on LUCENE-3919: Awesome! more thorough testing of analysis chains Key: LUCENE-3919 URL: https://issues.apache.org/jira/browse/LUCENE-3919 Project: Lucene - Java Issue Type: Task Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3919.patch In Lucene we essentially test each analysis component separately. We also give some good testing to the example Analyzers we provide that combine them. But we don't test various combinations that are possible, which is bad because it doesn't test possibilities for custom analyzers (especially since lots of Solr users etc. define their own). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3909) Move Kuromoji to analysis.ja and introduce Japanese* naming
[ https://issues.apache.org/jira/browse/LUCENE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237494#comment-13237494 ] Michael McCandless commented on LUCENE-3909: +1 Move Kuromoji to analysis.ja and introduce Japanese* naming --- Key: LUCENE-3909 URL: https://issues.apache.org/jira/browse/LUCENE-3909 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Lucene/Solr 3.6 and 4.0 will get out-of-the-box Japanese language support through {{KuromojiAnalyzer}}, {{KuromojiTokenizer}} and various other filters. These filters currently live in {{org.apache.lucene.analysis.kuromoji}}. I'm proposing that we move Kuromoji to a new Japanese package {{org.apache.lucene.analysis.ja}} in line with how other languages are organized. As part of this, I also think we should rename {{KuromojiAnalyzer}} to {{JapaneseAnalyzer}}, etc. to further align naming to our conventions by making it very clear that these analyzers are for Japanese. (As much as I like the name Kuromoji, I think Japanese is more fitting.) A potential issue I see with this that I'd like to raise and get feedback on, is that end-users in Japan and elsewhere who use lucene-gosen could have issues after an upgrade since lucene-gosen is in fact releasing its analyzers under the {{org.apache.lucene.analysis.ja}} namespace (and we'd have a name clash). I believe users should have the freedom to choose whichever Japanese analyzer, filter, etc. they'd like to use, and I don't want to propose a name change that just creates unnecessary problems for users, but I think the naming proposed above is most fitting for a Lucene/Solr release. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237545#comment-13237545 ] Michael McCandless commented on LUCENE-3913: I forgot to say: patch is against 3.x. HTMLStripCharFilter produces invalid final offset - Key: LUCENE-3913 URL: https://issues.apache.org/jira/browse/LUCENE-3913 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3913.patch Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237547#comment-13237547 ] Michael McCandless commented on LUCENE-3911: Looks great! improve BaseTokenStreamTestCase random string generation Key: LUCENE-3911 URL: https://issues.apache.org/jira/browse/LUCENE-3911 Project: Lucene - Java Issue Type: Task Components: general/test Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3911.patch, LUCENE-3911.patch Most analysis tests use MockTokenizer (which splits on whitespace), but it's rare that we generate a string with 'many tokens'. So I think we should try to generate more realistic test strings. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237548#comment-13237548 ] Michael McCandless commented on LUCENE-3913: Good idea! I'll fix that test case. Here's the failure output: {noformat} [junit] - Standard Error - [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs=-Dfile.encoding=UTF-8 [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs=-Dfile.encoding=UTF-8 [junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk [junit] NOTE: all tests run in this JVM: [junit] [HTMLStripCharFilterTest] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864 [junit] - --- [junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest): FAILED [junit] finalOffset expected:20 but was:19 [junit] junit.framework.AssertionFailedError: finalOffset expected:20 but was:19 [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975) [junit] at junit.framework.TestResult.addError(TestResult.java:38) [junit] at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51) [junit] at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100) [junit] at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41) [junit] at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97) [junit] at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267) [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50) [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) [junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) [junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) [junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) [junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74) [junit] at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:300) [junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911) [junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768) [junit] Caused by: java.lang.AssertionError: finalOffset expected:20 but was:19 [junit] at org.junit.Assert.fail(Assert.java:93) [junit] at org.junit.Assert.failNotEquals(Assert.java:647) [junit] at org.junit.Assert.assertEquals(Assert.java:128) [junit] at org.junit.Assert.assertEquals(Assert.java:472) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574) [junit] at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237652#comment-13237652 ] Michael McCandless commented on LUCENE-3913: Awesome, thanks Steve! HTMLStripCharFilter produces invalid final offset - Key: LUCENE-3913 URL: https://issues.apache.org/jira/browse/LUCENE-3913 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Steven Rowe Fix For: 3.6, 4.0 Attachments: LUCENE-3913.patch, LUCENE-3913.patch Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3268) remove write access to source tree (chmod 555) when running tests in Jenkins
[ https://issues.apache.org/jira/browse/SOLR-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236500#comment-13236500 ] Michael McCandless commented on SOLR-3268: -- +1 remove write access to source tree (chmod 555) when running tests in Jenkins --- Key: SOLR-3268 URL: https://issues.apache.org/jira/browse/SOLR-3268 Project: Solr Issue Type: Bug Reporter: Robert Muir Fix For: 3.6, 4.0 Some tests are currently creating files under the source tree. This causes a lot of problems: it makes my checkout look dirty after running 'ant test' and I have to clean up. I opened an issue for this a month and a half ago for solrj/src/test-files/solrj/solr/shared/test-solr.xml (SOLR-3112), but now we have a second file (core/src/test-files/solr/conf/elevate-data-distrib.xml). So I think Hudson needs to chmod these src directories to 555, so that Solr tests that do this will fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
[ https://issues.apache.org/jira/browse/LUCENE-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236806#comment-13236806 ] Michael McCandless commented on LUCENE-3905: The ngram filters are unfortunately not OK: they use up tons of RAM when you send random/big tokens through them, because they don't have the same 1024 character limit... I think we should open a new issue for them... in fact I think repairing them could make a good GSoC! BaseTokenStreamTestCase should test analyzers on real-ish content - Key: LUCENE-3905 URL: https://issues.apache.org/jira/browse/LUCENE-3905 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3905.patch We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3905) BaseTokenStreamTestCase should test analyzers on real-ish content
[ https://issues.apache.org/jira/browse/LUCENE-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236820#comment-13236820 ] Michael McCandless commented on LUCENE-3905: OK I opened LUCENE-3907 for ngram love... BaseTokenStreamTestCase should test analyzers on real-ish content - Key: LUCENE-3905 URL: https://issues.apache.org/jira/browse/LUCENE-3905 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3905.patch We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3897) KuromojiTokenizer fails with large docs
[ https://issues.apache.org/jira/browse/LUCENE-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235493#comment-13235493 ] Michael McCandless commented on LUCENE-3897: Thanks Christian! KuromojiTokenizer fails with large docs --- Key: LUCENE-3897 URL: https://issues.apache.org/jira/browse/LUCENE-3897 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Assignee: Christian Moen Fix For: 3.6, 4.0 Attachments: LUCENE-3897.patch just shoving largeish random docs triggers asserts like: {noformat} [junit] Caused by: java.lang.AssertionError: backPos=4100 vs lastBackTracePos=5120 [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.backtrace(KuromojiTokenizer.java:907) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.parse(KuromojiTokenizer.java:756) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.incrementToken(KuromojiTokenizer.java:403) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:404) {noformat} But, you get no seed... I'll commit the test case and @Ignore it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235574#comment-13235574 ] Michael McCandless commented on LUCENE-3887: You can also just run the javadoc checker directly in a source checkout, like this: {noformat} python -u dev-tools/scripts/checkJavaDocs.py /lucene/3x/lucene/build {noformat} You have to ant javadocs first yourself. Right now it only checks for missing sentences in the package-summary.html... I'll see if I can fix it to also detect missing package.html's... Here's what it reports on 3.x right now: {noformat} /lucene/3x/lucene/build/docs/api/contrib-highlighter/org/apache/lucene/search/highlight/package-summary.html missing: TokenStreamFromTermPositionVector /lucene/3x/lucene/build/docs/api/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html missing: BoundaryScanner missing: BaseFragmentsBuilder missing: FieldFragList.WeightedFragInfo missing: FieldFragList.WeightedFragInfo.SubInfo missing: FieldPhraseList.WeightedPhraseInfo missing: FieldPhraseList.WeightedPhraseInfo.Toffs missing: FieldQuery.QueryPhraseMap missing: FieldTermStack.TermInfo missing: ScoreOrderFragmentsBuilder.ScoreComparator missing: SimpleBoundaryScanner /lucene/3x/lucene/build/docs/api/contrib-spatial/org/apache/lucene/spatial/tier/package-summary.html missing: DistanceHandler.Precision /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/package-summary.html missing: Lookup.LookupPriorityQueue /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/jaspell/package-summary.html missing: JaspellLookup /lucene/3x/lucene/build/docs/api/contrib-spellchecker/org/apache/lucene/search/suggest/tst/package-summary.html missing: TSTAutocomplete missing: TSTLookup /lucene/3x/lucene/build/docs/api/contrib-pruning/org/apache/lucene/index/pruning/package-summary.html missing: CarmelTopKTermPruningPolicy.ByDocComparator missing: CarmelUniformTermPruningPolicy.ByDocComparator /lucene/3x/lucene/build/docs/api/contrib-facet/org/apache/lucene/facet/taxonomy/writercache/lru/package-summary.html missing: LruTaxonomyWriterCache.LRUType /lucene/3x/lucene/build/docs/api/contrib-facet/org/apache/lucene/facet/index/package-summary.html missing: FacetsPayloadProcessorProvider.FacetsDirPayloadProcessor /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/store/package-summary.html missing: FSDirectory.FSIndexOutput missing: NIOFSDirectory.NIOFSIndexInput missing: RAMFile missing: SimpleFSDirectory.SimpleFSIndexInput missing: SimpleFSDirectory.SimpleFSIndexInput.Descriptor /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/index/package-summary.html missing: MergePolicy.MergeAbortedException /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/search/package-summary.html missing: FieldCache.CreationPlaceholder missing: FieldComparator.NumericComparator<T extends Number> missing: FieldValueHitQueue.Entry missing: QueryTermVector missing: ScoringRewrite<Q extends Query> missing: SpanFilterResult.PositionInfo missing: SpanFilterResult.StartEnd missing: TimeLimitingCollector.TimerThread /lucene/3x/lucene/build/docs/api/core/org/apache/lucene/util/package-summary.html missing: ByteBlockPool.Allocator missing: ByteBlockPool.DirectAllocator missing: ByteBlockPool.DirectTrackingAllocator missing: BytesRefHash.BytesStartArray missing: BytesRefHash.DirectBytesStartArray missing: BytesRefIterator.EmptyBytesRefIterator missing: 
DoubleBarrelLRUCache.CloneableKey missing: OpenBitSetDISI missing: PagedBytes.Reader missing: UnicodeUtil.UTF16Result missing: UnicodeUtil.UTF8Result /lucene/3x/lucene/build/docs/api/contrib-analyzers/org/tartarus/snowball/package-summary.html missing: Among missing: TestApp /lucene/3x/lucene/build/docs/api/contrib-xml-query-parser/org/apache/lucene/xmlparser/package-summary.html missing: FilterBuilder missing: CorePlusExtensionsParser missing: DOMUtils missing: FilterBuilderFactory missing: QueryBuilderFactory missing: ParserException /lucene/3x/lucene/build/docs/api/contrib-xml-query-parser/org/apache/lucene/xmlparser/builders/package-summary.html missing: SpanQueryBuilder missing: BooleanFilterBuilder missing: BooleanQueryBuilder missing: BoostingQueryBuilder missing: BoostingTermBuilder missing: ConstantScoreQueryBuilder missing: DuplicateFilterBuilder missing: FilteredQueryBuilder missing: FuzzyLikeThisQueryBuilder missing: LikeThisQueryBuilder missing: MatchAllDocsQueryBuilder missing: RangeFilterBuilder missing: SpanBuilderBase missing: SpanFirstBuilder missing: SpanNearBuilder missing: SpanNotBuilder missing: SpanOrBuilder missing: SpanOrTermsBuilder missing: SpanQueryBuilderFactory missing: SpanTermBuilder
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235635#comment-13235635 ] Michael McCandless commented on LUCENE-3887: OK I committed the basic checking for smoke tester... I'll leave this open for having ant javadocs fail when things are missing... 'ant javadocs' should fail if a package is missing a package.html - Key: LUCENE-3887 URL: https://issues.apache.org/jira/browse/LUCENE-3887 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir Attachments: LUCENE-3887.patch, LUCENE-3887.patch While reviewing the javadocs I noticed many packages are missing a basic package.html. For 3.x I committed some package.html files where they were missing (I will port forward to trunk). I think all packages should have this... really all public/protected classes/methods/constants, but this would be a good step. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3898) possible SynonymFilter bug: hudson fail
[ https://issues.apache.org/jira/browse/LUCENE-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234263#comment-13234263 ] Michael McCandless commented on LUCENE-3898: I can't provoke this failure yet... (just beasting the test). possible SynonymFilter bug: hudson fail --- Key: LUCENE-3898 URL: https://issues.apache.org/jira/browse/LUCENE-3898 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Assignee: Michael McCandless See https://builds.apache.org/job/Lucene-trunk/1867/consoleText (no seed) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3896) CharTokenizer has bugs for large documents.
[ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234397#comment-13234397 ] Michael McCandless commented on LUCENE-3896: Thanks Rob! CharTokenizer has bugs for large documents. --- Key: LUCENE-3896 URL: https://issues.apache.org/jira/browse/LUCENE-3896 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Priority: Blocker Fix For: 3.6, 4.0 Attachments: LUCENE-3896.patch, LUCENE-3896.patch, LUCENE-3896.patch Initially found by hudson from additional testing added in LUCENE-3894, but currently not reproducable (see LUCENE-3895). But its easy to reproduce for a simple single-threaded case in TestDuelingAnalyzers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3899) Evil up MockDirectoryWrapper.checkIndexOnClose
[ https://issues.apache.org/jira/browse/LUCENE-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234400#comment-13234400 ] Michael McCandless commented on LUCENE-3899: +1 More evilness! Evil up MockDirectoryWrapper.checkIndexOnClose -- Key: LUCENE-3899 URL: https://issues.apache.org/jira/browse/LUCENE-3899 Project: Lucene - Java Issue Type: Test Reporter: Robert Muir Fix For: 3.6, 4.0 Attachments: LUCENE-3899.patch MockDirectoryWrapper checks any indexes tests create on close(), if they exist. The problem is the logic it uses to determine if an index exists could mask real bugs (e.g. segments file corruption): {code} if (DirectoryReader.indexExists(this)) { ... // evil stuff like crash() ... _TestUtil.checkIndex(this); } {code} and for reference DirectoryReader.indexExists is: {code} try { new SegmentInfos().read(directory); return true; } catch (IOException ioe) { return false; } {code} So if there are segments file problems, we just silently do no checkIndex. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3778) Create a grouping convenience class
[ https://issues.apache.org/jira/browse/LUCENE-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234445#comment-13234445 ] Michael McCandless commented on LUCENE-3778: +1 Create a grouping convenience class --- Key: LUCENE-3778 URL: https://issues.apache.org/jira/browse/LUCENE-3778 Project: Lucene - Java Issue Type: Improvement Components: modules/grouping Reporter: Martijn van Groningen Fix For: 4.0 Attachments: LUCENE-3778.patch, LUCENE-3778.patch, LUCENE-3778.patch, LUCENE-3778.patch Currently the grouping module has many collector classes with a lot of different options per class. I think it would be a good idea to have a GroupUtil (Or another name?) convenience class. I think this could be a builder, because of the many options (sort,sortWithinGroup,groupOffset,groupCount and more) and implementations (term/dv/function) grouping has. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3900) Make BaseTokenStreamTestCase.checkRandomData more debuggable
[ https://issues.apache.org/jira/browse/LUCENE-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234446#comment-13234446 ] Michael McCandless commented on LUCENE-3900: +1! Make BaseTokenStreamTestCase.checkRandomData more debuggable Key: LUCENE-3900 URL: https://issues.apache.org/jira/browse/LUCENE-3900 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir This thing has gotten meaner recently, but if it fails, it can be tough to debug. I feel like usually we just look at whatever analyzer failed, and completely review the code and look for any smells until it passes :) So I think instead we can possibly make this easier if this does something like: {code} try { ...checks... } catch (Throwable t) { BaseTokenException e = new BaseTokenException(randomInputUsed, randomParameter1, randomParameter2); e.initCause(t); throw e; } {code} Then you could have a useful exception with the input string that caused the fail, information about whether or not charfilter/mockreaderwrapper/whatever were used, etc, as well as the initial problem as root cause. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3897) KuromojiTokenizer fails with large docs
[ https://issues.apache.org/jira/browse/LUCENE-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234455#comment-13234455 ] Michael McCandless commented on LUCENE-3897: I think the problem is when we force a backtrace (if it's >= 1024 chars since the last backtrace)... I think we are not correctly pruning all paths in this case. Unlike the natural backtrace, which happens whenever there is only 1 path (ie the parsing is unambiguous from that point backwards), the forced backtrace may have more than one live path. Have to mull how to fix... KuromojiTokenizer fails with large docs --- Key: LUCENE-3897 URL: https://issues.apache.org/jira/browse/LUCENE-3897 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Fix For: 3.6, 4.0 just shoving largeish random docs triggers asserts like: {noformat} [junit] Caused by: java.lang.AssertionError: backPos=4100 vs lastBackTracePos=5120 [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.backtrace(KuromojiTokenizer.java:907) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.parse(KuromojiTokenizer.java:756) [junit] at org.apache.lucene.analysis.kuromoji.KuromojiTokenizer.incrementToken(KuromojiTokenizer.java:403) [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:404) {noformat} But, you get no seed... I'll commit the test case and @Ignore it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2788) Make CharFilter reusable
[ https://issues.apache.org/jira/browse/LUCENE-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234556#comment-13234556 ] Michael McCandless commented on LUCENE-2788: +1 I really like the approach here (just using FilterReader instead of our own new class). Since the back-compat is going to be tricky... maybe we should first commit this patch to trunk? Make CharFilter reusable Key: LUCENE-2788 URL: https://issues.apache.org/jira/browse/LUCENE-2788 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Robert Muir Priority: Minor Attachments: LUCENE-2788.patch The CharFilter API lets you wrap a Reader, altering the contents before the Tokenizer sees them. It also allows you to correct the offsets so this is transparent to highlighting. One problem is that the API isn't reusable; if you have a lot of short documents it's going to be inefficient. Additionally there is some unnecessary wrapping in Tokenizer (see the CharReader.get in the ctor, but *not* in reset(Reader)!!!) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
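To illustrate why FilterReader is a natural fit (a minimal stand-in, not the actual patch): a char filter is just a Reader wrapping another Reader, so once reuse works at the Reader level no extra machinery is needed. Offset correction, which real CharFilters also do, is omitted here:

{code}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Minimal char-filter-like Reader: rewrites chars before the Tokenizer
// sees them.  A real CharFilter would also implement offset correction.
class UpperCasingReader extends FilterReader {
  UpperCasingReader(Reader in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int c = in.read();
    return c == -1 ? -1 : Character.toUpperCase(c);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = in.read(cbuf, off, len);
    for (int i = 0; i < n; i++) {
      cbuf[off + i] = Character.toUpperCase(cbuf[off + i]);
    }
    return n;
  }
}
{code}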
[jira] [Commented] (LUCENE-3893) TermsFilter should use AutomatonQuery
[ https://issues.apache.org/jira/browse/LUCENE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233522#comment-13233522 ] Michael McCandless commented on LUCENE-3893: LUCENE-3832 should also be done for this... TermsFilter should use AutomatonQuery - Key: LUCENE-3893 URL: https://issues.apache.org/jira/browse/LUCENE-3893 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 I think we could see perf gains if TermsFilter sorted the terms, built a minimal automaton, and used TermsEnum.intersect to visit the terms... This idea came up on the dev list recently. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
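The shape of the idea, hedged (the names approximate the trunk automaton APIs of the time, and filterTerms/field/reader are assumed inputs): sort the terms, build a minimal automaton accepting exactly that set, and let the codec drive the enum so non-matching terms are skipped:

{code}
// Sketch: intersect a sorted term set with the terms dictionary.
List<BytesRef> sorted = new ArrayList<BytesRef>(filterTerms);
Collections.sort(sorted);                              // makeStringUnion wants sorted input
Automaton a = BasicAutomata.makeStringUnion(sorted);   // minimal automaton over the set
CompiledAutomaton compiled = new CompiledAutomaton(a);
TermsEnum te = compiled.getTermsEnum(reader.terms(field));
BytesRef term;
while ((term = te.next()) != null) {
  // visit only the matching terms; collect their docs into the filter's bits
}
{code}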
[jira] [Commented] (LUCENE-3887) 'ant javadocs' should fail if a package is missing a package.html
[ https://issues.apache.org/jira/browse/LUCENE-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233544#comment-13233544 ] Michael McCandless commented on LUCENE-3887: +1 It shouldn't be the RM who must do this on release... 'ant javadocs' should fail if a package is missing a package.html - Key: LUCENE-3887 URL: https://issues.apache.org/jira/browse/LUCENE-3887 Project: Lucene - Java Issue Type: Task Components: general/build Reporter: Robert Muir While reviewing the javadocs I noticed many packages are missing a basic package.html. For 3.x I committed some package.html files where they were missing (I will port forward to trunk). I think all packages should have this... really all public/protected classes/methods/constants, but this would be a good step. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3889) Remove/Uncommit SegmentingTokenizerBase
[ https://issues.apache.org/jira/browse/LUCENE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233545#comment-13233545 ] Michael McCandless commented on LUCENE-3889: +1 Remove/Uncommit SegmentingTokenizerBase --- Key: LUCENE-3889 URL: https://issues.apache.org/jira/browse/LUCENE-3889 Project: Lucene - Java Issue Type: Task Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3889.patch I added this class in LUCENE-3305 to support analyzers like Kuromoji, but Kuromoji no longer needs it as of LUCENE-3767. So now nothing uses it. I think we should uncommit before releasing, svn doesn't forget so we can add this back if we want to refactor something like Thai or Smartcn to use it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
[ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233886#comment-13233886 ] Michael McCandless commented on LUCENE-3894: I think that new read method needs to use the incoming offset (ie, pass location + offset, not location, as 2nd arg to input.read)? Does testHugeDoc then pass? Make BaseTokenStreamTestCase a bit more evil Key: LUCENE-3894 URL: https://issues.apache.org/jira/browse/LUCENE-3894 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3894.patch, LUCENE-3894.patch, LUCENE-3894.patch Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
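The offset bug reads clearly in a tiny spoon-feeding Reader (a simplified stand-in for the test wrapper, not the actual patch): the second argument when delegating to the wrapped read must honor the caller's offset:

{code}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Spoon-feeds at most one char per read call, as the evil tests do.
class SpoonFeedingReader extends FilterReader {
  SpoonFeedingReader(Reader in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int offset, int len) throws IOException {
    // The fix under discussion: write into cbuf at the caller's offset,
    // not at some internal location, when delegating.
    return in.read(cbuf, offset, Math.min(1, len));
  }
}
{code}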
[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil
[ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233945#comment-13233945 ] Michael McCandless commented on LUCENE-3894: Thanks Rob! Make BaseTokenStreamTestCase a bit more evil Key: LUCENE-3894 URL: https://issues.apache.org/jira/browse/LUCENE-3894 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3894.patch, LUCENE-3894.patch, LUCENE-3894.patch Throw an exception from the Reader while tokenizing, stop after not consuming all tokens, sometimes spoon-feed chars from the reader... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3076) Solr should support block joins
[ https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232938#comment-13232938 ] Michael McCandless commented on SOLR-3076: -- Hi Mikhail, I've committed fixes for the filtering issues you found in ToChildBJQ, I think...? Are you still seeing issues? I'm unsure of the QP syntax for BJQ... Solr should support block joins --- Key: SOLR-3076 URL: https://issues.apache.org/jira/browse/SOLR-3076 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Attachments: SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, child-bjqparser.patch, parent-bjq-qparser.patch, parent-bjq-qparser.patch, solrconf-bjq-erschema-snippet.xml, tochild-bjq-filtered-search-fix.patch Lucene has the ability to do block joins, we should add it to Solr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232272#comment-13232272 ] Michael McCandless commented on LUCENE-3738: bq. In my opinion, we should unroll all readVInt/readVLong loops so all behave 100% identical! +1 Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232336#comment-13232336 ] Michael McCandless commented on LUCENE-3738: +1 Looks awesome Uwe! Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231962#comment-13231962 ] Michael McCandless commented on LUCENE-3738: bq. The check is only omitted in the unrolled loop, the for-loop still contains the check. I'm confused... I don't see how/where BufferedIndexInput.readVLong is checking for negative result now...? Are you proposing adding an if into that method? That's what I don't want to do... eg, readVLong is called 3 times per term we decode (Lucene40 codec); it's a very low level API... other codecs may very well call it more often. I don't think we should add an if inside BII.readVLong. Or maybe you are saying you just want the unrolled code to handle the negative vLong case (ie, unroll the currently missing 10th cycle), and not add an if to BufferedIndexInput.readVLong? And then for free we can add a real if (not assert) if that 10th cycle is hit? (ie, if we get to that 10th byte, throw an exception). I think that makes sense! bq. there are other asserts in the index reading code at places completely outside any loops, executed only once when the index is opened. +1 to make those real checks, as long as the cost is vanishingly small. bq. which is also a security issue when you e.g. download indexes through network connections and a man in the middle modifies the stream. I don't think it's our job to protect against / detect that. bq. Disk IO can produce wrong data. True, but all bets are off if that happens: you're gonna get all sorts of crazy exceptions out of Lucene. We are not a filesystem. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in 'since the beginning' we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
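For reference, the loop form under discussion (matching DataInput's readVLong of the time to the best of my knowledge, minus the assert): 7 payload bits per byte, high bit as continuation. Nine bytes cover 63 bits; only a negative long would need a 10th byte, which is where the proposed real check can live:

{code}
// Loop form of readVLong; the unrolled version can either add the missing
// 10th step or turn reaching that 10th byte into a real exception for free.
public long readVLong() throws IOException {
  byte b = readByte();
  long i = b & 0x7FL;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7FL) << shift;
  }
  return i;
}
{code}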
[jira] [Commented] (LUCENE-3870) VarDerefBytesImpl doc values prefix length may fall across two pages
[ https://issues.apache.org/jira/browse/LUCENE-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231033#comment-13231033 ] Michael McCandless commented on LUCENE-3870: +1, looks good Simon! Just remember to remove that sop... VarDerefBytesImpl doc values prefix length may fall across two pages Key: LUCENE-3870 URL: https://issues.apache.org/jira/browse/LUCENE-3870 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3870.patch, LUCENE-3870.patch The VarDerefBytesImpl doc values impl encodes the unique byte[] with a prefix (1 or 2 bytes) first, followed by the bytes, so that it can use PagedBytes.fillSliceWithPrefix. It does this itself rather than using PagedBytes.copyUsingLengthPrefix... The problem is, it can write an invalid 2-byte prefix spanning two blocks (ie, last byte of block N and first byte of block N+1), which fillSliceWithPrefix won't decode correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
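The 1-or-2-byte prefix at issue, sketched (paraphrasing the scheme, not the actual code): the high bit of the first byte flags the 2-byte form, which is exactly what breaks when those two bytes land in different pages:

{code}
// Sketch of the length prefix written before each unique byte[]:
static void writeLengthPrefix(byte[] page, int upto, int length) {
  if (length < 128) {
    page[upto] = (byte) length;                  // 1-byte form, high bit clear
  } else {
    page[upto] = (byte) (0x80 | (length >> 8));  // 2-byte form, high bit set
    page[upto + 1] = (byte) length;              // low 8 bits
    // The bug: if only one byte is left in this page, these two bytes
    // straddle a page boundary and fillSliceWithPrefix mis-reads the length.
  }
}
{code}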
[jira] [Commented] (LUCENE-3876) TestIndexWriterExceptions fails (reproducible)
[ https://issues.apache.org/jira/browse/LUCENE-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231035#comment-13231035 ] Michael McCandless commented on LUCENE-3876: Hmm, I think we need a separate check in FreqProxTermsWriterPerField? Ie, that class is private to the indexing chain; it's like a codec, that's used to buffer postings in RAM until we write them to the real codec, and in theory an app could swap in a different indexing chain that didn't steal a bit from the posDelta... TestIndexWriterExceptions fails (reproducible) -- Key: LUCENE-3876 URL: https://issues.apache.org/jira/browse/LUCENE-3876 Project: Lucene - Java Issue Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 4.0 {noformat} ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testIllegalPositions -Dtests.seed=-228094d3d2f35cf2:-496e33eec9bbd57c:36a1c54f4e1bb32 -Dargs=-Dfile.encoding=UTF-8 [junit] junit.framework.AssertionFailedError: position=-2 lastPosition=0 [junit] at org.apache.lucene.codecs.lucene40.Lucene40PostingsWriter.addPosition(Lucene40PostingsWriter.java:215) [junit] at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:519) [junit] at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:92) [junit] at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117) [junit] at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53) [junit] at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81) [junit] at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:475) [junit] at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422) [junit] at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553) [junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2640) [junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2616) [junit] at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:851) [junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:810) [junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:774) [junit] at org.apache.lucene.index.TestIndexWriterExceptions.testIllegalPositions(TestIndexWriterExceptions.java:1517) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) [junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) [junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) [junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) [junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:729) [junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:645) [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22) [junit] at 
org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:556) [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51) [junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:618) [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18) [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:164) [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) [junit] at
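The separate check Mike suggests would live in the in-RAM buffering layer, so a broken analyzer fails immediately instead of producing the flush-time assertion above. A minimal, self-contained sketch of that idea, with hypothetical names rather than the actual FreqProxTermsWriterPerField internals:
{code:java}
// Illustrative sketch only: validate positions while buffering postings in RAM,
// before they ever reach the real codec (Lucene40PostingsWriter above).
final class BufferedPostingsSketch {
  private int lastPosition;

  void addPosition(int positionIncrement) {
    int position = lastPosition + positionIncrement;  // may overflow negative
    if (positionIncrement < 0 || position < 0) {
      throw new IllegalArgumentException(
          "position=" + position + " (increment=" + positionIncrement
          + ", lastPosition=" + lastPosition
          + "): the analyzer produced a broken positionIncrement");
    }
    lastPosition = position;
    // ... buffer the position delta in RAM, as the indexing chain does ...
  }
}
{code}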
[jira] [Commented] (LUCENE-3877) Lucene should not call System.out.println
[ https://issues.apache.org/jira/browse/LUCENE-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231090#comment-13231090 ] Michael McCandless commented on LUCENE-3877: I think it's fine if tests write to the std streams, but not core Lucene code (lucene/core/src/java/*)? Lucene should not call System.out.println - Key: LUCENE-3877 URL: https://issues.apache.org/jira/browse/LUCENE-3877 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
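One possible way to "somehow detect" this, sketched as a standalone scanner over lucene/core/src/java (the path and the patterns are assumptions; this is not an existing build rule):
{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Walks the core source tree and reports stray std-stream calls. As a
// command-line tool it deliberately prints its findings to stdout.
public class FindStraySops {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get("lucene/core/src/java");
    try (Stream<Path> files = Files.walk(root)) {
      files.filter(p -> p.toString().endsWith(".java"))
           .forEach(FindStraySops::check);
    }
  }

  static void check(Path file) {
    try {
      int lineNo = 0;
      for (String line : Files.readAllLines(file)) {
        lineNo++;
        if (line.contains("System.out.print") || line.contains("System.err.print")) {
          System.out.println(file + ":" + lineNo + ": " + line.trim());
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
{code}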
[jira] [Commented] (LUCENE-3876) TestIndexWriterExceptions fails (reproducible)
[ https://issues.apache.org/jira/browse/LUCENE-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231107#comment-13231107 ] Michael McCandless commented on LUCENE-3876: +1 TestIndexWriterExceptions fails (reproducible) -- Key: LUCENE-3876 URL: https://issues.apache.org/jira/browse/LUCENE-3876 Project: Lucene - Java Issue Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 3.6, 4.0 Attachments: LUCENE-3876.patch, LUCENE-3876_test.patch
{noformat}
ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testIllegalPositions -Dtests.seed=-228094d3d2f35cf2:-496e33eec9bbd57c:36a1c54f4e1bb32 -Dargs=-Dfile.encoding=UTF-8
[junit] junit.framework.AssertionFailedError: position=-2 lastPosition=0
[junit] at org.apache.lucene.codecs.lucene40.Lucene40PostingsWriter.addPosition(Lucene40PostingsWriter.java:215)
[junit] at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:519)
[junit] at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:92)
[junit] at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
[junit] at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
[junit] at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
[junit] at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:475)
[junit] at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
[junit] at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553)
[junit] at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2640)
[junit] at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2616)
[junit] at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:851)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:810)
[junit] at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:774)
[junit] at org.apache.lucene.index.TestIndexWriterExceptions.testIllegalPositions(TestIndexWriterExceptions.java:1517)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
[junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
[junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
[junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
[junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:729)
[junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:645)
[junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22)
[junit] at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:556)
[junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51)
[junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:618)
[junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:164)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
[junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
[junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
[junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
[junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
[junit] at
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231295#comment-13231295 ] Michael McCandless commented on LUCENE-3738: Hmm... I think we should think about it more. Ie, we apparently never write a negative vLong today... and I'm not sure we should start allowing it...? Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
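For context on the "5 bytes" trap: a varint emits 7 payload bits per byte and keeps going while high bits remain set, and a negative number always has its sign bit set, so every group gets written. A small demo of the size logic (simplified from the shape of writeVInt, not the actual Lucene source):
{code:java}
// Counts how many bytes the standard vInt loop would emit.
public class VIntSizeDemo {
  static int vIntByteCount(int i) {
    int count = 1;
    while ((i & ~0x7F) != 0) {  // more than 7 significant bits remain
      i >>>= 7;                 // unsigned shift; a negative int needs all 5 groups
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(vIntByteCount(127));  // 1 byte
    System.out.println(vIntByteCount(128));  // 2 bytes
    System.out.println(vIntByteCount(-1));   // 5 bytes: the trap
    // A negative long balloons the same way: the issue's "even more trappy" 9 bytes.
  }
}
{code}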
[jira] [Commented] (LUCENE-3878) CheckIndex should check deleted documents too
[ https://issues.apache.org/jira/browse/LUCENE-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231296#comment-13231296 ] Michael McCandless commented on LUCENE-3878: +1 CheckIndex should check deleted documents too - Key: LUCENE-3878 URL: https://issues.apache.org/jira/browse/LUCENE-3878 Project: Lucene - Java Issue Type: Task Affects Versions: 4.0 Reporter: Robert Muir Fix For: 4.0 In 4.0 livedocs are passed down to the enums, thus deleted docs are not so special. So I think CheckIndex should not pass the livedocs down to the enums when checking; it should pass livedocs=null and check all the postings. It already does this separately to collect stats, I think, to compare against the term/collection statistics? But we should just clean this up and only use one enum. For example LUCENE-3876 is a case where we were actually making a corrupt index (a position was negative), but because the document in question was deleted, CheckIndex didn't detect this. This could have caused problems if someone just passed null for livedocs (maybe they are doing something where it's not so important to take deletions into account) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
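The "only use one enum" idea, sketched against the Lucene 4.x flex API (Fields/Terms/TermsEnum/DocsEnum); treat this as illustrative, not as CheckIndex's actual code:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;

final class CheckAllPostingsSketch {
  static void checkPostings(AtomicReader reader) throws IOException {
    Fields fields = reader.fields();
    for (String field : fields) {
      TermsEnum termsEnum = fields.terms(field).iterator(null);
      while (termsEnum.next() != null) {
        // Pass liveDocs=null so postings of deleted docs are verified too.
        DocsEnum docs = termsEnum.docs(null, null);
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          // ... verify docID/freq (and positions, via docsAndPositions) here ...
        }
      }
    }
  }
}
{code}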
[jira] [Commented] (LUCENE-3877) Lucene should not call System.out.println
[ https://issues.apache.org/jira/browse/LUCENE-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231307#comment-13231307 ] Michael McCandless commented on LUCENE-3877: I removed the std prints in lucene/core/src/java that I could find on quick grepping. I'll leave this open so we can somehow automatically catch this... Lucene should not call System.out.println - Key: LUCENE-3877 URL: https://issues.apache.org/jira/browse/LUCENE-3877 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 We seem to have accumulated a few random sops... Eg, PairOutputs.java (oal.util.fst) and MultiDocValues.java, at least. Can we somehow detect (eg, have a test failure) if we accidentally leave errant System.out.println's (leftover from debugging)...? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231309#comment-13231309 ] Michael McCandless commented on LUCENE-3738: {quote} I don't see how we can avoid negative vInts. I think it's ok to be inconsistent with vLong, but it should not be something we assert only at read-time. It should be asserted on write so that problems are found immediately. {quote} +1 I think we are stuck with negative vInts, as trappy as they are (5 bytes!!). Let's not make it worse by allowing negative vLongs. But let's assert that at write time (and read time)... I think inconsistency here is the lesser evil. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
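The write-time assert being agreed on here, in a simplified standalone form (modeled on the shape of DataOutput.writeVLong, not the actual source):
{code:java}
import java.io.ByteArrayOutputStream;

public class VarLongWriterSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  public void writeVLong(long i) {
    // Catch the problem immediately at write time, per the quoted comment:
    assert i >= 0 : "negative vLongs are not supported: " + i;
    while ((i & ~0x7FL) != 0L) {
      out.write((byte) ((i & 0x7FL) | 0x80L));  // 7 payload bits + continuation bit
      i >>>= 7;
    }
    out.write((byte) i);
  }
}
{code}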
[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong
[ https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231470#comment-13231470 ] Michael McCandless commented on LUCENE-3738: bq. If we disallow, it should be a hard check (no assert), as the data is coming from a file (and somebody could use a hex editor). The reader will crash later... Hmm, I don't think we should do that. If you go and edit your index with a hex editor... there are no guarantees on what may ensue! bq. Mike: If you fix the unrolled loops, please also add the checks to the other implementations in Buffered* and so on. I don't think the unrolled loops or other impls of write/readVLong are wrong? The javadocs state clearly that negatives are not supported. All we're doing here is adding an assert to back up that javadoc statement. Be consistent about negative vInt/vLong --- Key: LUCENE-3738 URL: https://issues.apache.org/jira/browse/LUCENE-3738 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3738.patch, LUCENE-3738.patch, LUCENE-3738.patch Today, write/readVInt allows a negative int, in that it will encode and decode correctly, just horribly inefficiently (5 bytes). However, read/writeVLong fails (trips an assert). I'd prefer that both vInt/vLong trip an assert if you ever try to write a negative number... it's badly trappy today. But, unfortunately, we sometimes rely on this... had we had this assert in place since the beginning, we could have avoided that. So, if we can't add that assert in today, I think we should at least fix readVLong to handle negative longs... but then you quietly spend 9 bytes (even more trappy!). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3872) Index changes are lost if you call prepareCommit() then close()
[ https://issues.apache.org/jira/browse/LUCENE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230307#comment-13230307 ] Michael McCandless commented on LUCENE-3872: Well, we could also easily allow skipping the call to commit... in this case IW.close would detect the missing call to commit, call commit to finish the prepared commit, and then commit again to save any changes done after the prepareCommit and before close. Index changes are lost if you call prepareCommit() then close() --- Key: LUCENE-3872 URL: https://issues.apache.org/jira/browse/LUCENE-3872 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.6, 4.0 Attachments: LUCENE-3872.patch, LUCENE-3872.patch You are supposed to call commit() after calling prepareCommit(), but... if you forget, and call close() after prepareCommit() without calling commit(), then any changes done after the prepareCommit() are silently lost (including adding/deleting docs, but also any completed merges). Spinoff from java-user thread "lots of .cfs (compound files) in the index directory" from Tim Bogaert. I think to fix this, IW.close should throw an IllegalStateException if prepareCommit() was called with no matching call to commit(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
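For reference, the two-phase contract under discussion, shown as plain usage of the public IndexWriter API (illustrative; error handling simplified):
{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

final class TwoPhaseCommitSketch {
  static void commitAndClose(IndexWriter writer) throws IOException {
    writer.prepareCommit();  // phase 1: write, but do not yet publish, the commit
    try {
      // ... coordinate with any other transactional resources here ...
      writer.commit();       // phase 2: publish the prepared commit
    } catch (Throwable t) {
      writer.rollback();     // abandon the prepared commit (this also closes the writer)
      throw t;
    }
    // Safe: the prepared commit was completed, so close() cannot silently drop
    // it -- the trap this issue describes is skipping the commit() call above.
    writer.close();
  }
}
{code}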
[jira] [Commented] (LUCENE-3848) BaseTokenStreamTestCase should fail if TokenStream starts with posinc=0
[ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230341#comment-13230341 ] Michael McCandless commented on LUCENE-3848: +1 BaseTokenStreamTestCase should fail if TokenStream starts with posinc=0 --- Key: LUCENE-3848 URL: https://issues.apache.org/jira/browse/LUCENE-3848 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3848-MockGraphTokenFilter.patch, LUCENE-3848.patch, LUCENE-3848.patch It is meaningless for a tokenstream to start with posinc=0. It's also caused problems and hairiness in the indexer (LUCENE-1255, LUCENE-1542), and it makes senseless tokenstreams. We should add a check and fix any that do this. Furthermore the same bug can exist in removing-filters if they have enablePositionIncrements=false. I think this option is useful: but it shouldn't mean 'allow broken tokenstream', it just means we don't add gaps. If you remove tokens with enablePositionIncrements=false it should not cause the TS to start with positionIncrement=0, and it shouldn't 'restructure' the tokenstream (e.g. moving synonyms on top of a different word). It should just not add any 'holes'. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
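The proposed check, sketched in the style of a test helper (not the actual BaseTokenStreamTestCase code):
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class FirstPosIncCheck {
  // Fails if the very first token claims posinc=0: there is no previous
  // position for it to stack onto, so a leading 0 is meaningless.
  static void assertFirstPositionIncrement(TokenStream ts) throws IOException {
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    if (ts.incrementToken() && posIncAtt.getPositionIncrement() < 1) {
      throw new AssertionError("first token must have positionIncrement >= 1, got "
          + posIncAtt.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}
{code}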
[jira] [Commented] (LUCENE-3874) bogus positions create a corrupt index
[ https://issues.apache.org/jira/browse/LUCENE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230402#comment-13230402 ] Michael McCandless commented on LUCENE-3874: +1 Crazy we don't catch this already... bogus positions create a corrupt index --- Key: LUCENE-3874 URL: https://issues.apache.org/jira/browse/LUCENE-3874 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.6, 4.0 Reporter: Robert Muir Attachments: LUCENE-3874.patch, LUCENE-3874_test.patch It's pretty common for positionIncrement to overflow; this happens really easily if people write analyzers that don't clearAttributes(). It used to be the case (and perhaps still is in 3.x, I didn't check) that IW would throw an exception if this happened. But I couldn't find the code checking this; I wrote a test and it makes a corrupt index... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
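How the overflow typically arises, as a deliberately broken example (hypothetical, not the test from this issue): without clearAttributes(), attribute state leaks from one token to the next, so a large increment accumulates until the position wraps negative, much like the position=-2 in the LUCENE-3876 trace above:
{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class BrokenTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int count;

  @Override
  public boolean incrementToken() {
    if (count++ >= 2) return false;
    // BUG: no clearAttributes() call here, so the huge increment set for the
    // first token is still in effect for the second; the accumulated position
    // (Integer.MAX_VALUE + Integer.MAX_VALUE) overflows int and goes negative.
    termAtt.setEmpty().append("tok").append(Integer.toString(count));
    if (count == 1) {
      posIncAtt.setPositionIncrement(Integer.MAX_VALUE);
    }
    return true;
  }
}
{code}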