[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526535#comment-13526535 ]

Robert Muir commented on LUCENE-4599:
-------------------------------------

{quote}
If you have ideas to efficiently compress term vectors, you're welcome!
{quote}

I think we waste space with the terms, especially the prefix/suffix lengths (so much so that the prefix encoding probably hurts in general for many people). These should likely be bulk-compressed.

As you already noticed in the patch, frequencies are a waste too.

Flags are wasteful and stupid, but it seems like you already tried to address that to some extent. If we compress chunks of docs, we should optimize the case where the flags are the same. It's crazy that someone would have just positions for "body field" of document 2, but positions and offsets for "body field" of document 3.

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
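As a rough illustration of the prefix/suffix-length point: the sketch below front-codes a sorted term list and keeps the prefix lengths, suffix lengths, and suffix bytes in three separate streams, which is the layout that would let the two length streams be bulk-compressed on their own. This is not the format used by the patch; the function names and the sample terms are hypothetical.

```python
def prefix_code(sorted_terms):
    """Front-code sorted byte terms into three separate streams."""
    prefix_lens, suffix_lens, suffixes = [], [], bytearray()
    prev = b""
    for term in sorted_terms:
        # Count the bytes shared with the previous term.
        shared = 0
        limit = min(len(prev), len(term))
        while shared < limit and prev[shared] == term[shared]:
            shared += 1
        prefix_lens.append(shared)
        suffix_lens.append(len(term) - shared)
        suffixes += term[shared:]
        prev = term
    return prefix_lens, suffix_lens, bytes(suffixes)


def prefix_decode(prefix_lens, suffix_lens, suffixes):
    """Rebuild the terms from the three streams."""
    terms, prev, pos = [], b"", 0
    for p, s in zip(prefix_lens, suffix_lens):
        term = prev[:p] + suffixes[pos:pos + s]
        pos += s
        terms.append(term)
        prev = term
    return terms


# Hypothetical sample terms, already in sorted order.
terms = [b"compress", b"compressed", b"compression", b"term", b"terms", b"vector"]
prefix_lens, suffix_lens, suffixes = prefix_code(terms)
decoded = prefix_decode(prefix_lens, suffix_lens, suffixes)
```

Both length streams are small integers, so a bit-packed or otherwise bulk encoding would presumably store them in far less than one byte-aligned value per term.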
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526557#comment-13526557 ]

Adrien Grand commented on LUCENE-4599:
--------------------------------------

bq. I think we waste space with the terms, especially prefix/suffix lengths [..] these should likely be bulk-compressed

Good point.

bq. flags are wasteful and stupid, but it seems like you already tried to address that to some extent

I'm storing them in a packed ints array with 3 bits per value. I'll try to optimize the case where a field always has the same flags.
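A minimal sketch of what "3 bits per value" plus a same-flags shortcut could look like. The bit assignments, function names, and byte layout below are assumptions made for illustration; they are not taken from the patch or from Lucene's PackedInts.

```python
# Hypothetical bit assignments for the three per-field options.
POSITIONS, OFFSETS, PAYLOADS = 1, 2, 4


def pack3(values):
    """Pack integers in [0, 8) using 3 bits each, little-endian bit order."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < 8:
            raise ValueError("flag out of range")
        acc |= v << (3 * i)
    nbytes = (3 * len(values) + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack3(data, count):
    """Inverse of pack3 for a known number of values."""
    acc = int.from_bytes(data, "little")
    return [(acc >> (3 * i)) & 7 for i in range(count)]


def encode_flags(flags):
    """Return (all_same, payload): one byte when every entry is identical."""
    if flags and all(f == flags[0] for f in flags):
        return True, bytes([flags[0]])
    return False, pack3(flags)


mixed = [1, 3, 3, 7, 0, 1, 2, 5, 4]
packed = pack3(mixed)
same = encode_flags([POSITIONS | OFFSETS] * 100)
```

With the shortcut, a chunk in which every occurrence of a field has the same options costs one byte for the flags instead of 3 bits per entry.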
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527075#comment-13527075 ]

David Smiley commented on LUCENE-4599:
--------------------------------------

Does it make sense to put this in an FST where the key is the term bytes and the value is what you're doing now for the positions, offsets, and payloads in a byte array? The point of this is that a term dictionary is going to use much less space by sharing the prefixes and suffixes of words.

Or... can we simply reference the terms by ord (an int) instead of writing each term's bytes?
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527167#comment-13527167 ]

Adrien Grand commented on LUCENE-4599:
--------------------------------------

I think a FST would not compress as much as what LZ4 or Deflate can do? But maybe it could speed up TermsEnum.seekCeil on large documents so it might be an interesting idea regarding random access speed?

bq. can we simply reference the terms by ord (an int) instead of writing each term bytes?

Do you mean their ords in the terms dictionary? Is that information available somewhere when writing/merging term vectors?
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527180#comment-13527180 ]

Michael McCandless commented on LUCENE-4599:
--------------------------------------------

bq. Does it make sense to put this in an FST where the key is the term bytes and the value is what you're doing now for the positions, offsets, and payloads in a byte array?

That's a neat idea :) We should [almost] just be able to use MemoryPostingsFormat, since it already stores all postings in an FST.

bq. I think a FST would not compress as much as what LZ4 or Deflate can do? But maybe it could speed up TermsEnum.seekCeil on large documents so it might be an interesting idea regarding random access speed?

Likely it would not compress as well, since LZ4/Deflate are able to share common infix fragments too, but FST only shares prefix/suffix. It'd be interesting to test ... but we should explore this (FST-backed TermVectorsFormat) in a new issue I think ... this issue seems awesome enough already :)

bq. Or... can we simply reference the terms by ord (an int) instead of writing each term bytes?

Using ords matching the main terms dict is a neat idea too! It would be much more compact ... but, when reading the term vectors we'd need to resolve-by-ord against the main terms dictionary (not all postings formats support that: it's optional, and eg our default PF doesn't), which would likely be slower than today.

bq. Is that information available somewhere when writing/merging term vectors?

Unfortunately, no. We only assign ords when it's time to flush the segment ... but we write term vectors "live" as we index each document. If we changed that, eg buffered up term vectors, then we could get the ords when we wrote them.
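To make the timing problem concrete: ords are positions in the sorted, segment-wide term dictionary, so they only exist once every document in the segment has been seen. The sketch below buffers per-document terms and resolves them to ords at flush time. It is a toy model under that assumption, with a hypothetical function name and sample data, and it does not reflect how Lucene's indexing chain is actually structured.

```python
def flush_with_ords(buffered_docs):
    """Assign ords at flush time to term vectors that were buffered in memory.

    buffered_docs is a list of per-document term lists.
    """
    # The segment-wide dictionary is only complete once all docs are known.
    dictionary = sorted({term for doc in buffered_docs for term in doc})
    ord_of = {term: i for i, term in enumerate(dictionary)}
    # Each document's vector becomes a sorted list of ords.
    vectors = [sorted({ord_of[t] for t in doc}) for doc in buffered_docs]
    return dictionary, vectors


# Hypothetical buffered documents.
docs = [["term", "vector", "compressed"], ["vector", "lucene"], ["term"]]
dictionary, vectors = flush_with_ords(docs)
```

Writing vectors "live" would mean emitting the first document before "lucene" has been seen, at which point the ord of "term" is not yet known.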
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527203#comment-13527203 ]

David Smiley commented on LUCENE-4599:
--------------------------------------

The ord reference approach seems most interesting to me, even if it's not workable at the moment (based on Mike's comment). If things were changed to make ords possible then there wouldn't even need to be any term information in term vectors whatsoever; right? Not even the ord (integer) itself because the array of each term vector is intrinsically in ord-order and aligned exactly to each ord; right?

Does anyone know roughly what % of term-vector storage is currently for the term?
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527215#comment-13527215 ]

Robert Muir commented on LUCENE-4599:
-------------------------------------

I'm not sure how much more compact ords would really be? Start thinking about average word length, shared prefixes and so on, and long references don't seem to save a lot (even though they could be delta-encoded since they are in order, I still imagine 3 or 4 bytes on average if you assume a large terms dict).

I think it's way more important to bulk encode the prefix/suffix lengths.
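A back-of-the-envelope check of the "3 or 4 bytes" estimate, assuming the deltas are written as variable-length ints with 7 payload bits per byte. The dictionary size and terms-per-document figures are hypothetical, chosen only to show the order of magnitude, and the evenly spaced ords are a simplification.

```python
def vint_len(value):
    """Bytes needed for an unsigned variable-length int, 7 payload bits per byte."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def delta_encoded_size(sorted_ords):
    """Total bytes when each ord is stored as a vint delta from the previous one."""
    prev, total = 0, 0
    for o in sorted_ords:
        total += vint_len(o - prev)
        prev = o
    return total


# Hypothetical numbers: 50M unique terms in the segment, 100 unique terms in
# the document, ords spread evenly across the dictionary.
dict_size, terms_per_doc = 50_000_000, 100
gap = dict_size // terms_per_doc
ords = [i * gap for i in range(1, terms_per_doc + 1)]
avg_bytes = delta_encoded_size(ords) / terms_per_doc
```

Under these assumptions each delta is about 500,000 and needs 3 bytes, which is in the same range as a prefix-coded suffix for a typical word.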
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535030#comment-13535030 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

With the 4.1 release triage likely coming soon, I am wondering if this is ready to make the cut or if it needs more work.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535079#comment-13535079 ]

Adrien Grand commented on LUCENE-4599:
--------------------------------------

Hey Shawn, I'm still working actively on this issue. I made good progress regarding compression ratio, but term vectors are more complicated than stored fields (with lots of corner cases like negative start offsets, negative lengths, fields that don't always have the same options, etc.), so I will need time and lots of Jenkins builds to feel comfortable making it the default term vectors impl.

It will depend on the 4.1 release schedule, but given that it's likely to come rather soon and that I will have very little time to work on this issue until next month, it will probably only make it to 4.2.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535126#comment-13535126 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

bq. it will probably only make it to 4.2.

I'm not surprised. I had hoped it would make it, but there will be enough to do for release without working on half-baked features. I might need to continue to use Solr from branch_4x even after 4.1 gets released.

Thank you for everything you've done for me personally and the entire project.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535295#comment-13535295 ]

Robert Muir commented on LUCENE-4599:
-------------------------------------

{quote}
but term vectors are more complicated than stored fields (with lots of corner cases like negative start offsets, negative lengths, fields that don't always have the same options, etc.)
{quote}

And all of these corner cases are completely bogus with no real use cases.

We definitely need to make the long-term investment to fix this. It's so sad this kinda nonsense bullshit is slowing down Adrien here.

It's hard to fix... I know I've wasted a lot of brain cycles on trying to come up with perfect solutions. But we have to make some progress somehow.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557849#comment-13557849 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

bq. If someone with very large term vector files wanted to test this new format, this would be great! I'll try on my side to perform more indexing/highlighting benchmarks..

My indexes are pretty big, with termvectors taking up a lot of that. The 3.5.0 version of each of my shards is about 21GB. The same index in 4.1 with compressed stored fields is a little less than 17 GB. I will give this patch a try on branch_4x. The full import will take 7-8 hours.

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557889#comment-13557889 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

I should ask - will this be on by default in Solr with the patch? I just applied the patch to 4.1 because I already had it, and decided to try it there before branch_4x. It has occurred to me that as a LUCENE issue, it might not be turned on for Solr.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558057#comment-13558057 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

In the Compressing41Codec class from the solr patch, there is a protected member. I had to make that public or hundreds of tests were failing. I don't know if that was the right thing to do, but it allowed the tests to proceed without immediate failures. If they all pass after that change, I'll do another full-import.

{noformat}
[junit4:junit4]> Throwable #1: java.util.ServiceConfigurationError: Cannot instantiate SPI class: org.apache.solr.core.Compressing41Codec
[junit4:junit4]> Caused by: java.lang.IllegalAccessException: Class org.apache.lucene.util.NamedSPILoader can not access a member of class org.apache.solr.core.Compressing41Codec with modifiers "protected"
[junit4:junit4]> Throwable #1: java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.codecs.Codec
[junit4:junit4]>    at org.apache.lucene.util.TestRuleSetupAndRestoreClassEnv.before(TestRuleSetupAndRestoreClassEnv.java:137)
{noformat}

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558063#comment-13558063 ]

Uwe Schindler commented on LUCENE-4599:
---------------------------------------

Yes, the constructor must be public :-)
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558079#comment-13558079 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

Another test run produced another dataimport failure. I went ahead and put the new version into place and updated my solrconfig.xml in the same way that the patch updated the example, and now I have begun a full-import. I'm not sure the new format took - Solr didn't complain about the format of my existing indexes like I expected.

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, solr.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558089#comment-13558089 ]

Uwe Schindler commented on LUCENE-4599:
---------------------------------------

bq. Another test run produced another dataimport failure.

Could be the DST change in Fiji today, see mailing list. I disabled the test.

bq. I'm not sure the new format took - Solr didn't complain about the format of my existing indexes like I expected.

The CodecFactory in the patch produces a new index format, which is marked by another header ("Compressing41"). An existing index has the codec id "Lucene41" and is still readable (Lucene will use the default codec to read it).
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558140#comment-13558140 ]

Adrien Grand commented on LUCENE-4599:
--------------------------------------

I tried to reproduce the Solr cloud error (ant test -Dtestcase=ClusterStateUpdateTest -Dtests.method=testCoreRegistration -Dtests.seed=F72E8E946F6EBEAF -Dtests.nightly=true -Dtests.weekly=true -Dtests.slow=true -Dtests.locale=sr_RS -Dtests.timezone=Africa/Bamako -Dtests.file.encoding=UTF-8) but it succeeded, so I assume it's a random Solr cloud failure due to the fact that your machine was very busy?
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558148#comment-13558148 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

bq. so I assume it's a random Solr cloud failure due to the fact that your machine was very busy?

That's my assumption too. My worry is that a real SolrCloud might fail in a similar way if the machines involved become really busy. If the test is designed to have much tighter constraints than a typical production cloud, then it might not be something I have to worry about.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558157#comment-13558157 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

My full import just finished. The indexes built with a patched Solr from lucene_solr_4_1 are pretty much the same size as they were before. The indexes below that end in 0 are the recently built cores that are now live. The ones that end in 1 are the ones built with an unmodified 4.1.

{noformat}
ncindex@bigindy5 /index/solr4/data $ du -sc *
762820    inc_0
742984    inc_1
24        ncmain
17210772  s0_0
17214504  s0_1
17211784  s1_0
17143900  s1_1
17191632  s2_0
17190108  s2_1
17192292  s3_0
17188164  s3_1
17198920  s4_0
1728      s4_1
17205092  s5_0
17205800  s5_1
207858804 total
{noformat}
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558164#comment-13558164 ]

Shawn Heisey commented on LUCENE-4599:
--------------------------------------

You can ignore the previous comment. I thought I had added the codec line to solrconfig.xml -- turns out that I edited the solrconfig from the 3.5.0 directory, not the 4.1 directory. On my dev server, the 3.5 solr isn't even running. Now that I've got the right config changed, a new full-import looks like it's probably working - there are now two termvector files per segment instead of three.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558245#comment-13558245 ]

Shawn Heisey commented on LUCENE-4599:

New files are 53.2% of the old size. New TV files total 3890809739. Old TV files total 7311612548.

{noformat}
Unmodified Solr 4.1:
total 17154140
drwxr-xr-x 2 ncindex ncindex      45056 Jan 19 21:35 ./
drwxr-xr-x 4 ncindex ncindex       4096 Jan 18 20:15 ../
-rw-r--r-- 1 ncindex ncindex         99 Jan 19 21:28 segments_dt
-rw-r--r-- 1 ncindex ncindex         20 Jan 19 21:28 segments.gen
-rw-r--r-- 1 ncindex ncindex 3220362314 Jan 19 21:11 _uk.fdt
-rw-r--r-- 1 ncindex ncindex    1796091 Jan 19 21:11 _uk.fdx
-rw-r--r-- 1 ncindex ncindex       3291 Jan 19 21:28 _uk.fnm
-rw-r--r-- 1 ncindex ncindex 2712855241 Jan 19 21:23 _uk_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 2641242950 Jan 19 21:23 _uk_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1605874308 Jan 19 21:23 _uk_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   35091811 Jan 19 21:23 _uk_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex        115 Jan 19 21:28 _uk_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   36874222 Jan 19 21:28 _uk_nrm.cfs
-rw-r--r-- 1 ncindex ncindex        473 Jan 19 21:28 _uk.si
-rw-r--r-- 1 ncindex ncindex   24581897 Jan 19 21:28 _uk.tvd
-rw-r--r-- 1 ncindex ncindex 7090368538 Jan 19 21:28 _uk.tvf
-rw-r--r-- 1 ncindex ncindex  196662113 Jan 19 21:28 _uk.tvx

Solr 4.1 with patch:
total 13812100
drwxr-xr-x 2 ncindex ncindex      53248 Jan 20 06:10 ./
drwxr-xr-x 4 ncindex ncindex       4096 Jan 18 20:15 ../
-rw-r--r-- 1 ncindex ncindex 3220492130 Jan 20 05:54 _1oy.fdt
-rw-r--r-- 1 ncindex ncindex    1790533 Jan 20 05:54 _1oy.fdx
-rw-r--r-- 1 ncindex ncindex       3291 Jan 20 06:10 _1oy.fnm
-rw-r--r-- 1 ncindex ncindex 2713448546 Jan 20 06:08 _1oy_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 2640844965 Jan 20 06:08 _1oy_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1604289094 Jan 20 06:08 _1oy_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   34910618 Jan 20 06:08 _1oy_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex        115 Jan 20 06:10 _1oy_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   36874183 Jan 20 06:10 _1oy_nrm.cfs
-rw-r--r-- 1 ncindex ncindex        477 Jan 20 06:10 _1oy.si
-rw-r--r-- 1 ncindex ncindex 3889805695 Jan 20 06:10 _1oy.tvd
-rw-r--r-- 1 ncindex ncindex    1004044 Jan 20 06:10 _1oy.tvx
-rw-r--r-- 1 ncindex ncindex         20 Jan 20 06:10 segments.gen
-rw-r--r-- 1 ncindex ncindex        105 Jan 20 06:10 segments_ul
-rw-r--r-- 1 ncindex ncindex          0 Jan 19 21:39 write.lock
{noformat}

For this listing, the _0 and _1 indexes have been swapped - now the _1 indexes are live.

{noformat}
ncindex@bigindy5 /index/solr4/data $ du -sc *
      492  inc_0
   609212  inc_1
       24  ncmain
 17154980  s0_0
 13840212  s0_1
 17211000  s1_0
 13913260  s1_1
 17191660  s2_0
 13895536  s2_1
 17192320  s3_0
 13889920  s3_1
 17198940  s4_0
 13897380  s4_1
 17205112  s5_0
 13918936  s5_1
187118984  total
{noformat}

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, solr.patch
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.
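The 53.2% figure in Shawn Heisey's comment can be rechecked from the listing itself. The short Python sketch below only redoes the arithmetic on the byte counts shown above; the variable names are made up for illustration, and the whole-directory comparison uses the two `total` lines of the listing (in the units reported by `ls`).

```python
# Term vector file sizes in bytes, copied from the two directory listings.
old_tv = {"_uk.tvd": 24581897, "_uk.tvf": 7090368538, "_uk.tvx": 196662113}
new_tv = {"_1oy.tvd": 3889805695, "_1oy.tvx": 1004044}

old_total = sum(old_tv.values())
new_total = sum(new_tv.values())
ratio = new_total / old_total

# "total" lines of the two listings (whole index directory, in ls blocks).
dir_ratio = 13812100 / 17154140

print(f"old TV files: {old_total}")
print(f"new TV files: {new_total}")
print(f"TV files are {ratio:.1%} of the old size")
print(f"whole directory is {dir_ratio:.1%} of the old size")
```

The term vector files shrink to roughly half, while the directory as a whole shrinks less because the stored fields, postings and terms files are essentially unchanged.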
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558401#comment-13558401 ]

Adrien Grand commented on LUCENE-4599:

OK, I think I understood: I had forgotten to turn debug off, and although documents in this collection are rather big, queries tend to favor small docs, whose chunks contain more documents (up to 30). I ran the benchmark again with a very small chunk size (128) so that chunks would likely contain a single doc and results got better:

{noformat}
Fuzzy2             94.39   (7.8%)   88.33   (7.5%)  -6.4% ( -20% -   9%)
MedTerm           292.09   (2.7%)  279.01   (2.6%)  -4.5% (  -9% -   0%)
OrHighHigh         76.84   (7.4%)   73.58   (5.8%)  -4.2% ( -16% -   9%)
Fuzzy1             93.07   (4.8%)   89.59   (4.4%)  -3.7% ( -12% -   5%)
OrHighMed          69.23   (6.4%)   67.17   (4.9%)  -3.0% ( -13% -   8%)
HighPhrase          8.54   (9.4%)    8.36  (11.6%)  -2.1% ( -21% -  20%)
LowPhrase         125.02   (2.5%)  122.91   (3.4%)  -1.7% (  -7% -   4%)
MedPhrase          39.97   (5.3%)   39.58   (7.6%)  -1.0% ( -13% -  12%)
HighTerm          177.70   (2.4%)  176.21   (2.2%)  -0.8% (  -5% -   3%)
LowTerm           370.26   (3.7%)  367.36   (2.8%)  -0.8% (  -7% -   5%)
OrHighLow         106.08   (5.2%)  105.41   (4.7%)  -0.6% ( -10% -   9%)
LowSloppyPhrase    71.29   (5.2%)   70.95   (5.3%)  -0.5% ( -10% -  10%)
HighSloppyPhrase   30.52   (5.6%)   30.39   (5.2%)  -0.4% ( -10% -  10%)
PKLookup          339.12   (3.0%)  338.09   (3.1%)  -0.3% (  -6% -   5%)
MedSloppyPhrase    71.13   (4.2%)   70.95   (4.4%)  -0.3% (  -8% -   8%)
AndHighLow        259.19   (3.8%)  258.54   (5.1%)  -0.2% (  -8% -   8%)
Respell            69.04   (3.7%)   68.92   (3.2%)  -0.2% (  -6% -   6%)
AndHighHigh        74.49   (1.5%)   74.47   (1.8%)  -0.0% (  -3% -   3%)
Wildcard          157.16   (2.0%)  157.21   (1.9%)   0.0% (  -3% -   3%)
AndHighMed         79.81   (2.1%)   80.16   (1.6%)   0.4% (  -3% -   4%)
MedSpanNear        14.09   (3.6%)   14.16   (4.4%)   0.5% (  -7% -   8%)
Prefix3           281.17   (2.7%)  282.85   (2.5%)   0.6% (  -4% -   5%)
HighSpanNear        7.73   (3.9%)    7.79   (2.8%)   0.8% (  -5% -   7%)
IntNRQ            143.14   (3.0%)  144.45   (3.2%)   0.9% (  -5% -   7%)
LowSpanNear        23.85   (6.6%)   24.36   (6.0%)   2.2% (  -9% -  15%)
{noformat}

(Decreasing the chunk size from 16KB to 128 made the compression ratio increase from 66% to 68%.)

> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.2
>
> Attachments: 4599-dataimport-fail.log, 4599-zookeer-fail.log, CompressingTVF_ingest_rate.png, highlightNoStop.tasks, Lucene40TVF_ingest_rate.png, LUCENE-4599.patch, LUCENE-4599.patch, LUCENE-4599.patch, solr.patch
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.
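To illustrate why the chunk size changes how many documents share a chunk, here is a rough Python simulation. It is not the patch's writer: it assumes (an assumption, not something stated in this thread) that a chunk is closed as soon as the bytes buffered so far reach the chunk size, and the document sizes are made up. `docs_per_chunk` is a hypothetical helper name.

```python
def docs_per_chunk(doc_sizes, chunk_size):
    """Greedily pack documents into chunks.

    Assumed flush rule: a chunk is closed once the bytes buffered so far
    reach chunk_size. Returns the number of documents in each chunk.
    """
    chunks, count, buffered = [], 0, 0
    for size in doc_sizes:
        count += 1
        buffered += size
        if buffered >= chunk_size:
            chunks.append(count)
            count, buffered = 0, 0
    if count:  # trailing, partially filled chunk
        chunks.append(count)
    return chunks


small_docs = [550] * 90  # hypothetical term vector sizes in bytes

print(docs_per_chunk(small_docs, 16 * 1024))    # many small docs per chunk
print(docs_per_chunk(small_docs, 128)[:5])      # one doc per chunk
print(docs_per_chunk([20000, 30000], 16 * 1024))  # big docs sit alone
```

Under this assumed rule, small documents are packed dozens to a chunk at 16KB, so fetching the vectors of one small document presumably means decoding its neighbours too, whereas at 128 bytes nearly every document is alone in its chunk.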
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558418#comment-13558418 ]

Adrien Grand commented on LUCENE-4599:

I tried to compute the compression ratio of the term vector files compared to Lucene40TVF for small docs (the wikipedia 1K docs) based on the chunk size (the patch has 2^14 as a default chunk size):

|| Chunk size || no options || positions + offsets ||
| 2^7  | 0.79 | 0.68 |
| 2^8  | 0.79 | 0.68 |
| 2^9  | 0.75 | 0.66 |
| 2^10 | 0.73 | 0.65 |
| 2^11 | 0.70 | 0.63 |
| 2^12 | 0.68 | 0.62 |
| 2^13 | 0.65 | 0.60 |
| 2^14 | 0.63 | 0.59 |
| 2^15 | 0.62 | 0.58 |
| 2^16 | 0.62 | 0.59 |
| 2^17 | 0.62 | 0.58 |

Interestingly, raising the chunk size above 2^14 doesn't bring much. 2^11 or 2^12 look like good candidates for the default size if we were to make this TVF the default one (making big documents likely to be alone in their chunks and preventing small docs from raising the compression ratio).
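The observation that chunk sizes above 2^14 bring little can be read off the table by looking at the gain from each doubling. The Python sketch below only reprocesses the numbers from the comment above; the `gains` helper is a hypothetical name introduced for illustration.

```python
# (log2 of chunk size, ratio with no options, ratio with positions + offsets),
# copied from the table in the comment.
rows = [
    (7, 0.79, 0.68), (8, 0.79, 0.68), (9, 0.75, 0.66), (10, 0.73, 0.65),
    (11, 0.70, 0.63), (12, 0.68, 0.62), (13, 0.65, 0.60), (14, 0.63, 0.59),
    (15, 0.62, 0.58), (16, 0.62, 0.59), (17, 0.62, 0.58),
]


def gains(column):
    """Drop in compression ratio obtained by each doubling of the chunk size.

    column is 1 for "no options" and 2 for "positions + offsets".
    A positive value means the files got smaller.
    """
    return {
        f"2^{lo[0]}->2^{hi[0]}": round(lo[column] - hi[column], 2)
        for lo, hi in zip(rows, rows[1:])
    }


for label, gain in gains(1).items():
    print(f"{label:>12}  {gain:+.2f}")
```

For the "no options" column, each doubling between 2^8 and 2^14 buys 0.02 to 0.04, while the three doublings after 2^14 buy 0.01 in total, which matches the remark in the comment.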
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558435#comment-13558435 ]

Adrien Grand commented on LUCENE-4599:

If there's no objection, I plan to commit soon.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559058#comment-13559058 ]

Commit Tag Bot commented on LUCENE-4599:

[branch_4x commit] Adrien Grand
http://svn.apache.org/viewvc?view=revision&revision=1436584

LUCENE-4599: New compressed TVF impl: CompressingTermVectorsFormat (merged from r1436556).
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559059#comment-13559059 ]

Commit Tag Bot commented on LUCENE-4599:

[trunk commit] Adrien Grand
http://svn.apache.org/viewvc?view=revision&revision=1436556

LUCENE-4599: New compressed TVF impl: CompressingTermVectorsFormat.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559432#comment-13559432 ]

Commit Tag Bot commented on LUCENE-4599:

[branch_4x commit] Robert Muir
http://svn.apache.org/viewvc?view=revision&revision=1436764

LUCENE-4599: fix Compressing vectors to not return a docsAndPositions when it has no prox
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559435#comment-13559435 ]

Commit Tag Bot commented on LUCENE-4599:

[trunk commit] Robert Muir
http://svn.apache.org/viewvc?view=revision&revision=1436765

LUCENE-4599: fix Compressing vectors to not return a docsAndPositions when it has no prox
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559524#comment-13559524 ]

Markus Jelsma commented on LUCENE-4599:

Great reduction! Is this going to be enabled in the default Lucene 41 codec that already compresses stored fields?
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559527#comment-13559527 ]

Adrien Grand commented on LUCENE-4599:

Unfortunately it is too late for Lucene 4.1, and anyway this new format still requires a lot of testing. But I plan to propose making it the default term vectors format for Lucene 4.2, so yes, Lucene 4.2 might compress both stored fields and term vectors.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559528#comment-13559528 ]

Markus Jelsma commented on LUCENE-4599:

Alright. Do you already have an issue filed for making it the default in trunk, so I can watch it? We use recent Solr/Lucene trunk checkouts and are interested in how this affects stuff.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559530#comment-13559530 ]

Adrien Grand commented on LUCENE-4599:

Not yet. I'm leaving some time for the Jenkins instances to find bugs (for example, one of them found a little bug last night that Robert had to fix) and for people to criticize/fix/improve the format.
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559540#comment-13559540 ]

Markus Jelsma commented on LUCENE-4599:

Alright. Looking forward to that. Thanks Adrien!