[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222914#comment-13222914 ] Robert Muir commented on LUCENE-3003: - 3.6 pruning: can we push this out to 4.0 (mark resolved?) Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.6, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch, byte_size_32-bit-openjdk6.txt Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014703#comment-13014703 ] Yonik Seeley commented on LUCENE-3003: -- bq. Attached: 32-bit results Ah, bummer. It's every 8 bytes, but with a 4 byte offset! I guess we could make it based on if we detect 32 vs 64 bit jvm... but maybe first see if anyone has any ideas about how to use something like pagedbytes instead. Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch, byte_size_32-bit-openjdk6.txt Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014723#comment-13014723 ] Michael McCandless commented on LUCENE-3003: bq. It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting). Lucene's indexer (TermsHashPerField) has precisely this same problem -- every unique term must point to two (well, one if omitTFAP) growable byte arrays. We use slices into a single big (paged) byte[], where first slice is tiny and can only hold like 5 bytes, but then points to the next slice which is a bit bigger, etc. We could look @ refactoring that for this use too... Though this is just the one-time startup cost. bq. Another small easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable. I did make it configurable to the Lucene class (you can pass it in to ctor), but for Solr I left it using every 128th term. {quote} Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes. {quote} Both FieldCache and docvalues (branch) store an array-of-terms like this (the array of offsets is packed ints). We should also look at using an FST, which'd be the most compact but the ord - term lookup cost goes up. Anyway I think we can pursue these cool ideas on new [future] issues... Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch, byte_size_32-bit-openjdk6.txt Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013869#comment-13013869 ] Dawid Weiss commented on LUCENE-3003: - For what it's worth, the instrumentation interface allows one to get exact allocation sizes of objects. I put together a small spike at https://github.com/dweiss/poligon/tree/master/instrumenter that measures the actual allocation size of byte[]. On my hotspot, 64-bit, this yields: {noformat} byte[0] takes 24 bytes. byte[1] takes 32 bytes. byte[2] takes 32 bytes. byte[3] takes 32 bytes. byte[4] takes 32 bytes. byte[5] takes 32 bytes. byte[6] takes 32 bytes. byte[7] takes 32 bytes. byte[8] takes 32 bytes. byte[9] takes 40 bytes. byte[10] takes 40 bytes. byte[11] takes 40 bytes. ... {noformat} IBM's VM yields the same (64-bit), but the version of jrockit that I have (which may be an old one, but is 64-bit!) yields: {noformat} byte[0] takes 16 bytes. byte[1] takes 24 bytes. byte[2] takes 24 bytes. byte[3] takes 24 bytes. byte[4] takes 24 bytes. byte[5] takes 24 bytes. byte[6] takes 24 bytes. byte[7] takes 24 bytes. byte[8] takes 24 bytes. byte[9] takes 32 bytes. byte[10] takes 32 bytes. byte[11] takes 32 bytes. byte[12] takes 32 bytes. byte[13] takes 32 bytes. byte[14] takes 32 bytes. byte[15] takes 32 bytes. byte[16] takes 32 bytes. byte[17] takes 40 bytes. {noformat} Don't have access to a 32-bit system right now, but if you're keen on checking, checkout that github repo and run: {noformat} cd instrumenter mvn package java -javaagent:target/instrumenter-0.1.0-SNAPSHOT.jar -version {noformat} Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013986#comment-13013986 ] Yonik Seeley commented on LUCENE-3003: -- Thanks Dawid, this suggests that we could round up to the 8 byte boundary for free. Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013731#comment-13013731 ] Yonik Seeley commented on LUCENE-3003: -- bq. I'd guess that pulsing should be 'good enough' most of the time? This already really pulses I think? If the bytes can fit in an int, they are inlined right in the pointer that would normally point out to the byte array. bq. But for this case (faceting), usually, you have relatively few terms and many docs per term? We see everything. But this structure was more optimized for a high number of unique terms, but relatively few per document. This will perform well on a multi-valued author field, but relatively poorly on a large full-text field. Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013747#comment-13013747 ] Yonik Seeley commented on LUCENE-3003: -- bq. The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting). Another small easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable. Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes. I'd also love to hear others thoughts on this memory optimization for many small byte arrays: {code} // We avoid a doubling strategy to lower memory usage. // this faceting method isn't for docs with many terms. // In hotspot, objects have 2 words of overhead, then fields, rounded up to a 64-bit boundary. // TODO: figure out what array lengths we can round up to w/o actually using more memory // (how much space does a byte[] take up? Is data preceded by a 32 bit length only? // It should be safe to round up to the nearest 32 bits in any case. int newLen = (newend + 3) 0xfffc; // 4 byte alignment {code} Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch, LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012659#comment-13012659 ] Jason Rutherglen commented on LUCENE-3003: -- {quote}Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting{quote} I'd guess that pulsing should be 'good enough' most of the time? It seems like there'll be some overlap in terms of the gains from pulsing vis-à-vis DocValues? Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012683#comment-13012683 ] Michael McCandless commented on LUCENE-3003: bq. I'd guess that pulsing should be 'good enough' most of the time? It seems like there'll be some overlap in terms of the gains from pulsing vis-à-vis DocValues? I think Pulsing codec probably doesn't help much here? Ie Pulsing is good for terms that have only 1 or 2 docs. But for this case (faceting), usually, you have relatively few terms and many docs per term? Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012687#comment-13012687 ] Jason Rutherglen commented on LUCENE-3003: -- bq. Ie Pulsing is good for terms that have only 1 or 2 docs I thought the default is 16 docs? If there are more then seek'ing to the postings should be negligible (in comparison to a larger aggregate index size when using CSF/DocValues, which'll consume more of the system IO cache)? Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012810#comment-13012810 ] Chris Male commented on LUCENE-3003: +1 to committing this change then tackling the improvements separately. Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly lookup all terms ords for a given doc/field. Like, FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and, it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord - term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org