[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880483#action_12880483 ]

Yonik Seeley commented on LUCENE-2380:

It was really tricky performance testing this. If I started Solr and tested one type of faceting exclusively, the performance impact of going through the new FieldCache interfaces (PackedInts for ord lookup) was relatively minimal.

However, I had a simple script that tested the different variants (the 4 in the table above)... and using that resulted in the bigger slowdowns. The script would do the following:

{code}
1) test 100 iterations of facet.method=fc on the 100,000 term field
2) test 10 iterations of facet.method=fcs on the 100,000 term field
3) test 100 iterations of facet.method=fc on the 100 term field
4) test 10 iterations of facet.method=fcs on the 100 term field
{code}

I would run the script a few times, making sure the numbers stabilized and were repeatable.

Testing #1 alone resulted in trunk slowing down ~4%
Testing #1 along with any single other test: same small slowdown of ~4%
Running the complete script: slowdown of 33-38% for #1 (as well as others)

When running the complete script, the first run of Test #1 was always the best... as if the JVM correctly specialized it, but then discarded it later, never to return.

So: you can't always depend on the JVM being able to inline stuff for you, and it seems very hard to determine when it can. This obviously has implications for the Lucene benchmarker too.
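One possible explanation for the pattern described above, which the comment itself does not confirm, is that a hot call site sees only one implementation while a single test runs and several once the whole script runs, so the JIT can no longer inline it as aggressively. The sketch below is only an illustration of that idea: `OrdLookup`, `ByteOrds`, `ShortOrds`, `IntOrds` and `CallSiteDemo` are hypothetical names, not Lucene or Solr classes, and the timings it prints will vary by JVM and hardware.

```java
// Hypothetical stand-in for a doc -> ord lookup behind an interface.
interface OrdLookup {
    int ord(int doc);
}

final class ByteOrds implements OrdLookup {
    private final byte[] values;
    ByteOrds(byte[] values) { this.values = values; }
    public int ord(int doc) { return values[doc] & 0xFF; }
}

final class ShortOrds implements OrdLookup {
    private final short[] values;
    ShortOrds(short[] values) { this.values = values; }
    public int ord(int doc) { return values[doc] & 0xFFFF; }
}

final class IntOrds implements OrdLookup {
    private final int[] values;
    IntOrds(int[] values) { this.values = values; }
    public int ord(int doc) { return values[doc]; }
}

final class CallSiteDemo {
    // The lookup.ord(doc) call below is the call site whose type profile changes
    // once more than one implementation has passed through this method.
    static long sumOrds(OrdLookup lookup, int maxDoc) {
        long sum = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            sum += lookup.ord(doc);
        }
        return sum;
    }

    // Best-of-N wall-clock time for one implementation, in nanoseconds.
    static long bestNanos(OrdLookup lookup, int maxDoc, int iterations) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            long sum = sumOrds(lookup, maxDoc);
            long elapsed = System.nanoTime() - start;
            if (sum < 0) throw new IllegalStateException("unexpected negative sum");
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        int maxDoc = 1 << 20;
        byte[] b = new byte[maxDoc];
        short[] s = new short[maxDoc];
        int[] n = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            b[i] = (byte) (i & 0x7F);
            s[i] = (short) (i & 0x7FFF);
            n[i] = i;
        }
        OrdLookup bytes = new ByteOrds(b);
        long alone = bestNanos(bytes, maxDoc, 200);      // only one type seen so far
        bestNanos(new ShortOrds(s), maxDoc, 200);        // second type at the call site
        bestNanos(new IntOrds(n), maxDoc, 200);          // third type at the call site
        long afterOthers = bestNanos(bytes, maxDoc, 200);
        System.out.println("bytes alone:        " + alone + " ns");
        System.out.println("bytes after others: " + afterOthers + " ns");
    }
}
```

Whether the second number is actually worse depends on the JVM version and flags, so a result from this sketch says nothing definite about the Solr faceting numbers in this thread.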
> Add FieldCache.getTermBytes, to load term data as byte[]
>
> Key: LUCENE-2380
> URL: https://issues.apache.org/jira/browse/LUCENE-2380
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2380.patch, LUCENE-2380.patch, LUCENE-2380.patch,
> LUCENE-2380.patch, LUCENE-2380_direct_arr_access.patch,
> LUCENE-2380_enum.patch, LUCENE-2380_enum.patch
>
> With flex, a term is now an opaque byte[] (typically, utf8 encoded unicode
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding
> methods to load terms as native byte[], since in general they may not be
> representable as String. This should be quite a bit more RAM efficient too,
> for US ascii content since each character would then use 1 byte not 2.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879712#action_12879712 ]

Michael McCandless commented on LUCENE-2380:

The above commit was actually for LUCENE-2378.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878974#action_12878974 ]

Yonik Seeley commented on LUCENE-2380:

||terms in field||facet method||pre-bytes ms||trunk+patch ms||new/old||
|100,000|fc|27|36|1.33|
|100,000|fcs|333|325|0.98|
|100|fc|20|22|1.10|
|100|fcs|24|25|1.04|

OK - so the biggest problem area initially (bottlenecked by field cache merging) that was 55% slower is now 2% faster.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878907#action_12878907 ]

Michael McCandless commented on LUCENE-2380:

Patch looks good Yonik!
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875913#action_12875913 ]

Yonik Seeley commented on LUCENE-2380:

FYI, while trying to implement an iterator over the fieldcache terms, I ran into a bug where each term is written twice. This causes double the memory usage for the bytes (but no functionality bugs). I'll fix shortly, and anyone who has done performance tests might want to redo them (cache effects, GC differences, and bigger entry build times).
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875368#action_12875368 ]

Yonik Seeley commented on LUCENE-2380:

I just committed a patch that helps... when merging the fieldcaches, instead of looking up the term for each comparison, it's now stored in the segment data structure.

Per-segment faceting is now 26% slower for the 100,000 term field, and 17% slower for the 100 term field.

One way to regain more performance is to implement some kind of stateful iterator over the values in the field cache entry instead of looking up by ord each time.
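The idea of keeping the current term in the per-segment structure, rather than resolving ord to term on every comparison, can be sketched roughly as follows. This is not the committed Solr code: `SegmentFacets` and `FacetMerger` are hypothetical names, and `String` stands in for the byte[] term data to keep the example short.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical per-segment cursor: terms are sorted and indexed by ord.
final class SegmentFacets {
    final String[] terms;
    final int[] counts;
    int ord;          // stateful position within this segment
    String current;   // cached term for the current ord, used by every comparison

    SegmentFacets(String[] terms, int[] counts) {
        this.terms = terms;
        this.counts = counts;
        this.ord = 0;
        this.current = terms.length > 0 ? terms[0] : null;
    }

    // Advances to the next ord and refreshes the cached term once per step.
    boolean next() {
        ord++;
        if (ord < terms.length) {
            current = terms[ord];
            return true;
        }
        current = null;
        return false;
    }
}

final class FacetMerger {
    // Merges per-segment counts in term order; the queue compares cached terms only.
    static Map<String, Integer> merge(List<SegmentFacets> segments) {
        PriorityQueue<SegmentFacets> queue =
            new PriorityQueue<>(Comparator.comparing((SegmentFacets s) -> s.current));
        for (SegmentFacets segment : segments) {
            if (segment.current != null) {
                queue.add(segment);
            }
        }
        Map<String, Integer> merged = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            SegmentFacets top = queue.poll();
            merged.merge(top.current, top.counts[top.ord], Integer::sum);
            if (top.next()) {
                queue.add(top);   // re-insert only after the cached term was updated
            }
        }
        return merged;
    }
}
```

The cursor is mutated only while it is outside the queue, so the heap ordering stays valid; the lookup from ord to term happens once per advance instead of once per comparison.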
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875360#action_12875360 ]

Yonik Seeley commented on LUCENE-2380:

bq. What do the numbers mean?

Time to do the faceting (roughly). FieldCache build time is not included. Given that the degradation is much worse for a higher number of unique values, this points to the increased cost of going from ord->value.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875302#action_12875302 ]

Michael McCandless commented on LUCENE-2380:

Hmm. Can you try adding ", true" to FieldCache.DEFAULT.getTermsIndex? That'll use more RAM but should be faster.

Also, could the fix for executor have changed the performance?
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875291#action_12875291 ]

Uwe Schindler commented on LUCENE-2380:

What do the numbers mean? Time to build cache or time for sorting something? That's unclear to me.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875287#action_12875287 ]

Yonik Seeley commented on LUCENE-2380:

I did some performance testing on faceting using the field cache (single valued field with facet.method fc and fcs).

field=100,000 unique values
fc: 5% slower
fcs: 55% slower

field=100 unique values
fc: 2.5% slower
fcs: 26% slower

I'll look into it to see how we can regain some of that lost performance.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875256#action_12875256 ]

Yonik Seeley commented on LUCENE-2380:

Whew, that's one involved patch! I didn't get to it before, but I'll start looking over the Solr changes now.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875233#action_12875233 ]

Michael McCandless commented on LUCENE-2380:

I opened LUCENE-2483 for the future improvements.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871342#action_12871342 ]

Michael McCandless commented on LUCENE-2380:

I did some rough estimates of RAM usage for StringIndex (trunk) vs TermIndex (patch).

Java String is an object, so estimate 8 byte object header in the JRE. It seems to have 3 int fields (offset, count, hashCode), from OpenJDK's sources, plus ref to char[]. The char[] has 8 byte object header, int length, and actual array data.

So in trunk's StringIndex:

  per-unique-term: 40 bytes (48 on 64bit jre) + 2*length-of-string-in-UTF16
  per-doc: 4 bytes (8 bytes on 64 bit)

In the patch:

  per-unique-term: ceil(log2(totalUTF8BytesTermData)) bits + utf8 bytes + 1 or 2 bytes (vInt, for term length)
  per-doc: ceil(log2(numUniqueTerm)) bits

So eg say you have an English title field, avg length 40 chars, and assume always unique. On a 5M doc index, trunk would take ~591MB and patch would take ~226 MB (32bit JRE) = 62% less.

But if you have a CJK title field, avg 10 chars (may be highish), it's less savings because UTF8 takes 50% more RAM than UTF16 does for CJK (and others). Trunk would take ~305MB and patch ~178MB (32bit JRE) = 42% less.

Also don't forget the GC load of having 5M String & char[] objects...
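The per-term and per-doc formulas in the RAM estimate can be turned into a small calculator to reproduce the quoted figures. This is a sketch under stated assumptions: `FieldCacheRamEstimate` and its method names are hypothetical, MB means 1024*1024 bytes, every term is assumed to have the average length, and the length prefix is counted as part of the term data when sizing the offsets.

```java
final class FieldCacheRamEstimate {
    // ceil(log2(n)) for n >= 2; returns at least 1 bit.
    static int bitsRequired(long n) {
        return n <= 1 ? 1 : 64 - Long.numberOfLeadingZeros(n - 1);
    }

    // Trunk StringIndex: one String + char[] per unique term, one reference per doc.
    static long trunkBytes(long numDocs, long uniqueTerms, int avgChars, boolean jre64) {
        long perTerm = (jre64 ? 48 : 40) + 2L * avgChars;
        long perDoc = jre64 ? 8 : 4;
        return uniqueTerms * perTerm + numDocs * perDoc;
    }

    // Patch: packed offset per term + UTF-8 bytes + 1 or 2 byte length, packed ord per doc.
    static long patchBytes(long numDocs, long uniqueTerms, int avgUtf8Bytes) {
        int lengthPrefix = avgUtf8Bytes < 128 ? 1 : 2;
        long termDataBytes = uniqueTerms * (avgUtf8Bytes + lengthPrefix);
        long packedBits = uniqueTerms * bitsRequired(termDataBytes)
                        + numDocs * bitsRequired(uniqueTerms);
        return termDataBytes + (packedBits + 7) / 8;
    }

    static long toMB(long bytes) {
        return Math.round(bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;
        // English title: 40 chars, 1 UTF-8 byte per char, all values unique.
        System.out.println("English trunk MB: " + toMB(trunkBytes(docs, docs, 40, false)));
        System.out.println("English patch MB: " + toMB(patchBytes(docs, docs, 40)));
        // CJK title: 10 chars, assumed 3 UTF-8 bytes per char.
        System.out.println("CJK trunk MB: " + toMB(trunkBytes(docs, docs, 10, false)));
        System.out.println("CJK patch MB: " + toMB(patchBytes(docs, docs, 30)));
    }
}
```

With these assumptions the four results come out at 591, 226, 305 and 178 MB, matching the estimates in the comment.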
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871308#action_12871308 ]

Michael McCandless commented on LUCENE-2380:

OK I ran some sort perf tests. I picked the worst case -- trivial query (TermQuery) matching all docs, sorting by either a highly unique string field (random string) or enumerated field (country ~ a couple hundred values), from benchmark's SortableSingleDocSource.

Index has 5M docs. Each run is best of 3.

Results:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|5.64|{color:red}-27.2%{color}|
|country|8.05|7.62|{color:red}-5.3%{color}|

So the packed ints lookups are more costly than trunk today (but, at a large reduction in RAM used).

Then I tried another test, asking packed ints to upgrade to an array of the nearest native type (ie byte[], short[], int[], long[]) for the doc -> ord map. This is faster since lookups don't require shift/mask, but, wastes some space since you have unused bits:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|7.89|{color:green}1.8%{color}|
|country|8.05|7.64|{color:red}-5.1%{color}|

The country case didn't get any better (noise) because it happened to already be using 8 bits (byte[]) for doc->ord map.

Remember this is a worst case test -- if your query matches fewer results than your entire index, or your query is more costly to evaluate than the simple single TermQuery, this FieldCache lookup cost will be relatively smaller.

So... I think we should expose in the new FieldCache methods an optional param to control time/space tradeoff; I'll add this, defaulting to upgrading to nearest native type. I think the 5.3% slowdown on the country field is acceptable given the large reduction in RAM used...
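To make the shift/mask cost concrete, here is a minimal sketch of a bit-packed doc -> ord array next to the "nearest native type" rounding. It is not Lucene's PackedInts implementation; `PackedOrds` and `roundUpToNativeBits` are hypothetical names, and `bitsPerValue` is assumed to be between 1 and 32.

```java
final class PackedOrds {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    // Packs each value into bitsPerValue bits (1..32 assumed), low bits first.
    PackedOrds(int[] values, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) values.length * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = values[i] & mask;
            blocks[block] |= value << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two longs; store the high bits in the next block.
                blocks[block + 1] |= value >>> (64 - shift);
            }
        }
    }

    // Every lookup pays for a multiply, two shifts, a mask and sometimes a second read.
    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return (int) (value & mask);
    }

    // Width of the nearest native array type (byte[], short[], int[], long[]).
    static int roundUpToNativeBits(int bitsPerValue) {
        if (bitsPerValue <= 8) return 8;
        if (bitsPerValue <= 16) return 16;
        if (bitsPerValue <= 32) return 32;
        return 64;
    }
}
```

For the 5M-doc random field, 23 bits per doc rounds up to an int[] (32 bits), which is the space given up in exchange for a plain array read; the country field already fits in 8 bits, consistent with it showing no change.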
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868776#action_12868776 ]

Yonik Seeley commented on LUCENE-2380:

bq. would love to return empty string (not null) if ord 0 comes in, and require caller to specifically handle ord 0 if they need to differentiate... I had started down that path but got spooked by it

Yeah... I guess I could see how it could cause a loss of info if you go through a few layers and you only have a BytesRef w/o an ord to tell you the value was missing.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868767#action_12868767 ]

Michael McCandless commented on LUCENE-2380:

bq. Ahhh, fun stuff! I'm packing for Prague though - prob won't be able to look at this for a week.

OK no prob... have fun!

bq. 1 or 2? a max len of 2**15? (I know... a term bigger than 32K would be horrible, but so are limits that aren't necessary)

Indexer already has this limit, during indexing (these large terms are skipped).

bq. re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): do we need to do this? We could record 0 as zero length in the FieldCache and hence avoid the special-case code. We could require the user to check for 0 if they care to know the difference between zero length and missing.

I would love to return empty string (not null) if ord 0 comes in, and require caller to specifically handle ord 0 if they need to differentiate... I had started down that path but got spooked by it :) I think we can revisit it, but maybe separately.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868762#action_12868762 ]

Yonik Seeley commented on LUCENE-2380:

Ahhh, fun stuff! I'm packing for Prague though - prob won't be able to look at this for a week.

bq. 1 or 2 byte vInt prefix

1 or 2? a max len of 2**15? (I know... a term bigger than 32K would be horrible, but so are limits that aren't necessary). We could also do 1 or 4 (or 1 or 5), but as long as we make sure the single-byte case is optimized, it shouldn't matter.

re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): do we need to do this? We could record 0 as zero length in the FieldCache and hence avoid the special-case code. We could require the user to check for 0 if they care to know the difference between zero length and missing.
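For readers wondering how a 1 or 2 byte length prefix leads to a limit just under 2**15, here is one possible layout. The thread does not show the byte layout used by the patch, so this is an assumption for illustration only, and `TermLengthPrefix` is a hypothetical name: lengths below 128 take one byte, and longer lengths set the high bit of the first byte and spread 15 bits over two bytes.

```java
final class TermLengthPrefix {
    // Largest length that fits in 15 bits.
    static final int MAX_LENGTH = (1 << 15) - 1;

    // Writes the length at dest[pos] and returns the number of bytes used (1 or 2).
    static int write(byte[] dest, int pos, int length) {
        if (length < 0 || length > MAX_LENGTH) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        if (length < 128) {
            dest[pos] = (byte) length;                  // common case: a single byte
            return 1;
        }
        dest[pos] = (byte) (0x80 | (length >>> 8));     // high bit flags the 2-byte form
        dest[pos + 1] = (byte) (length & 0xFF);
        return 2;
    }

    // Reads a length written by write().
    static int read(byte[] src, int pos) {
        int first = src[pos];
        if ((first & 0x80) == 0) {
            return first;                               // single-byte fast path
        }
        return ((first & 0x7F) << 8) | (src[pos + 1] & 0xFF);
    }

    // Number of prefix bytes at src[pos], so the caller can skip to the term bytes.
    static int prefixSize(byte[] src, int pos) {
        return (src[pos] & 0x80) == 0 ? 1 : 2;
    }
}
```

The single-byte branch is checked first in `read`, which is in the spirit of keeping the short-term case cheap; a 1-or-4 byte variant would only change the second branch.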
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867535#action_12867535 ]

Michael McCandless commented on LUCENE-2380:

I agree, let's pass the reused BytesRef in.
[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]
[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867534#action_12867534 ]

Yonik Seeley commented on LUCENE-2380:

One thing to keep in mind is that the current way of returning shared BytesRef objects often forces one to make a copy. We should perhaps consider allowing a BytesRef to be passed in.

{code}
// returning shared BytesRef forces a copy
for(;;) {
  BytesRef val1 = new BytesRef(getValue(doc1)); // make a copy
  BytesRef val2 = getValue(doc2);
  int cmp = val1.compareTo(val2);
}

// allowing BytesRef to be passed in means no copy
BytesRef val1 = new BytesRef();
BytesRef val2 = new BytesRef();
for(;;) {
  getValue(doc1, val1);
  getValue(doc2, val2);
  int cmp = val1.compareTo(val2);
}
{code}