[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880483#action_12880483
 ] 

Yonik Seeley commented on LUCENE-2380:
--

It was really tricky performance testing this.

If I started Solr and tested one type of faceting exclusively, the performance 
impact of going through the new FieldCache interfaces (PackedInts for ord 
lookup) was relatively minimal.

However, I had a simple script that tested the different variants (the 4 in the 
table above)... and using that resulted in the bigger slowdowns.

The script would do the following:
{code}
1) test 100 iterations of facet.method=fc on the 100,000 term field
2) test 10 iterations of facet.method=fcs on the 100,000 term field
3) test 100 iterations of facet.method=fc on the 100 term field
4) test 10 iterations of facet.method=fcs on the 100 term field
{code}

I would run the script a few times, making sure the numbers stabilized and were 
repeatable.

Testing #1 alone resulted in trunk slowing down ~ 4%
Testing #1 along with any single other test: same small slowdown of ~4%
Running the complete script: slowdown of 33-38% for #1 (as well as others)
When running the complete script, the first run of Test #1 was always the 
best... as if the JVM correctly specialized it, but then discarded it later, 
never to return.

So: you can't always depend on the JVM being able to inline stuff for you, and 
it seems very hard to determine when it can.
This obviously has implications for the Lucene benchmarker too.
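
A minimal sketch of a harness that times every variant on every pass of the complete script, so a variant's first pass can be compared with its later passes. This is not the script used above; InterleavedBench, Step and run are hypothetical names, and the workload is whatever IntSupplier the caller provides.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical harness: runs the named variants in script order and repeats
// the whole script, recording the elapsed time of each variant on each pass.
final class InterleavedBench {

  static final class Step {
    final String name;
    final int iterations;
    final IntSupplier work;

    Step(String name, int iterations, IntSupplier work) {
      this.name = name;
      this.iterations = iterations;
      this.work = work;
    }
  }

  // Returns elapsed nanoseconds per variant name, one array slot per pass.
  static Map<String, long[]> run(List<Step> script, int passes) {
    Map<String, long[]> timings = new LinkedHashMap<>();
    int sink = 0;
    for (int pass = 0; pass < passes; pass++) {
      for (Step step : script) {
        long start = System.nanoTime();
        for (int i = 0; i < step.iterations; i++) {
          sink += step.work.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        timings.computeIfAbsent(step.name, k -> new long[passes])[pass] = elapsed;
      }
    }
    // Use the accumulated result so the timed work cannot be discarded.
    if (sink == Integer.MIN_VALUE) {
      System.out.println(sink);
    }
    return timings;
  }
}
```

Comparing slot 0 with the later slots for one variant would show whether its first pass was the fastest, as described above for Test #1.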


> Add FieldCache.getTermBytes, to load term data as byte[]
> 
>
> Key: LUCENE-2380
> URL: https://issues.apache.org/jira/browse/LUCENE-2380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2380.patch, LUCENE-2380.patch, LUCENE-2380.patch, 
> LUCENE-2380.patch, LUCENE-2380_direct_arr_access.patch, 
> LUCENE-2380_enum.patch, LUCENE-2380_enum.patch
>
>
> With flex, a term is now an opaque byte[] (typically, utf8 encoded unicode 
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding 
> methods to load terms as native byte[], since in general they may not be 
> representable as String.  This should be quite a bit more RAM efficient too, 
> for US ascii content since each character would then use 1 byte not 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879712#action_12879712
 ] 

Michael McCandless commented on LUCENE-2380:


The above commit was actually for LUCENE-2378.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878974#action_12878974
 ] 

Yonik Seeley commented on LUCENE-2380:
--


||terms in field||facet method||pre-bytes ms||trunk+patch ms||new/old||
|100,000|fc|27|36|1.33
|100,000|fcs|333|325|0.98
|100|fc|20|22|1.10
|100|fcs|24|25|1.04

OK - so the biggest problem area initially (bottlenecked by field cache 
merging) that was 55% slower is now 2% faster.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878907#action_12878907
 ] 

Michael McCandless commented on LUCENE-2380:


Patch looks good Yonik!




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875913#action_12875913
 ] 

Yonik Seeley commented on LUCENE-2380:
--

FYI, while trying to implement an iterator over the fieldcache terms, I ran 
into a bug where each term is written twice. This causes double the memory 
usage for the bytes (but no functionality bugs). I'll fix shortly, and anyone 
who has done performance tests might want to redo them again (cache effects, GC 
differences, and bigger entry build times). 




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875368#action_12875368
 ] 

Yonik Seeley commented on LUCENE-2380:
--

I just committed a patch that helps... when merging the fieldcaches,  instead 
of looking up the term for each comparison, it's now stored in the segment data 
structure.

Per-segment faceting is now 26% slower for the 100,000 term field, and 17% 
slower for the 100 term field.
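
A rough sketch of the idea of keeping the current term in the per-segment structure, so that queue comparisons during the merge read a cached field instead of doing an ord lookup each time. This is not the committed Solr code; SegmentFacetMerge and Segment are hypothetical names, and plain String arrays stand in for the field cache entry.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical merge of per-segment facet counts, ordered by term.
final class SegmentFacetMerge {

  static final class Segment {
    final String[] terms; // sorted unique terms of this segment (index = ord)
    final int[] counts;   // facet count per ord
    int ord = 0;
    String current;       // cached term for the current ord

    Segment(String[] terms, int[] counts) {
      this.terms = terms;
      this.counts = counts;
      this.current = terms.length > 0 ? terms[0] : null;
    }

    // Moves to the next ord; the ord -> term lookup happens once per advance.
    boolean advance() {
      ord++;
      if (ord >= terms.length) {
        return false;
      }
      current = terms[ord];
      return true;
    }
  }

  static Map<String, Integer> merge(List<Segment> segments) {
    // The comparator only reads the cached term, never the term table.
    PriorityQueue<Segment> queue =
        new PriorityQueue<>(Comparator.comparing((Segment s) -> s.current));
    for (Segment s : segments) {
      if (s.current != null) {
        queue.add(s);
      }
    }
    Map<String, Integer> merged = new LinkedHashMap<>();
    while (!queue.isEmpty()) {
      Segment top = queue.poll();
      merged.merge(top.current, top.counts[top.ord], Integer::sum);
      if (top.advance()) {
        queue.add(top);
      }
    }
    return merged;
  }
}
```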

One way to regain more performance is to implement some kind of stateful 
iterator over the values in the field cache entry instead of looking up by ord 
each time.
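
To illustrate the difference, here is a sketch that assumes a layout where terms sit back to back in one byte[], each behind a single length byte. TermBlock and Cursor are hypothetical names, not the FieldCache API. lookup(ord) pays for an offset lookup on every call, while the cursor just remembers where the previous term ended.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical term storage: length-prefixed UTF-8 terms in ord order.
// Terms are limited to 127 bytes to keep the prefix at one byte.
final class TermBlock {
  private final byte[] data;
  private final int[] offsets; // ord -> position of the term's length byte

  TermBlock(String... sortedTerms) {
    offsets = new int[sortedTerms.length];
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int ord = 0; ord < sortedTerms.length; ord++) {
      byte[] utf8 = sortedTerms[ord].getBytes(StandardCharsets.UTF_8);
      if (utf8.length > 127) {
        throw new IllegalArgumentException("term too long for this sketch");
      }
      offsets[ord] = out.size();
      out.write(utf8.length);
      out.write(utf8, 0, utf8.length);
    }
    data = out.toByteArray();
  }

  // Random access: one offsets[] lookup per call.
  String lookup(int ord) {
    int pos = offsets[ord];
    return new String(data, pos + 1, data[pos], StandardCharsets.UTF_8);
  }

  // Stateful access: walks the bytes in ord order without touching offsets[].
  final class Cursor {
    private int pos = 0;

    boolean hasNext() {
      return pos < data.length;
    }

    String next() {
      int len = data[pos];
      String term = new String(data, pos + 1, len, StandardCharsets.UTF_8);
      pos += 1 + len;
      return term;
    }
  }

  Cursor cursor() {
    return new Cursor();
  }
}
```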




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875360#action_12875360
 ] 

Yonik Seeley commented on LUCENE-2380:
--

bq. What do the numbers mean?

Time to do the faceting (roughly).  FieldCache build time is not included.  
Given that the degradation is much worse for a higher number of unique values, 
this points to the increased cost of going from ord->value.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875302#action_12875302
 ] 

Michael McCandless commented on LUCENE-2380:


H.

Can you try adding ", true" to FieldCache.DEFAULT.getTermsIndex?  That'll use 
more RAM but should be faster.

Also, could the fix for executor have changed the performance?




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875291#action_12875291
 ] 

Uwe Schindler commented on LUCENE-2380:
---

What do the numbers mean? Time to build cache or time for sorting something? 
That's unclear to me.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875287#action_12875287
 ] 

Yonik Seeley commented on LUCENE-2380:
--

I did some performance testing on faceting using the field cache (single valued 
field with facet.method fc and fcs).

field=100,000 unique values
fc: 5% slower
fcs: 55% slower

field=100 unique values
fc: 2.5% slower
fcs: 26% slower

I'll look into it to see how we can regain some of that lost performance.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875256#action_12875256
 ] 

Yonik Seeley commented on LUCENE-2380:
--

Whew, that's one involved patch!
I didn't get to it before, but I'll start looking over the Solr changes now.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-06-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875233#action_12875233
 ] 

Michael McCandless commented on LUCENE-2380:


I opened LUCENE-2483 for the future improvements.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871342#action_12871342
 ] 

Michael McCandless commented on LUCENE-2380:


I did some rough estimates of RAM usage for StringIndex (trunk) vs
TermIndex (patch).

Java String is an object, so estimate 8 byte object header in the JRE.
It seems to have 3 int fields (offset, count, hashCode), from
OpenJDK's sources, plus ref to char[].

The char[] has 8 byte object header, int length, and actual array
data.

So in trunk's StringIndex:

  per-unique-term: 40 bytes (48 on 64bit jre) + 2*length-of-string-in-UTF16
  per-doc: 4 bytes (8 bytes on 64 bit)

In the patch:

  per-unique-term: ceil(log2(totalUTF8BytesTermData)) bits + utf8 bytes + 1 or 2 bytes (vInt, for term length)
  per-doc: ceil(log2(numUniqueTerm)) bits

So eg say you have an English title field, avg length 40 chars, and
assume always unique.  On a 5M doc index, trunk would take ~591MB and
patch would take ~226 MB (32bit JRE) = 62% less.

But if you have a CJK title field, avg 10 chars (may be highish), it's
less savings because UTF8 takes 50% more RAM than UTF16 does for CJK
(and others).  Trunk would take ~305MB and patch ~178MB (32bit JRE) =
42% less.

Also don't forget the GC load of having 5M String & char[] objects...
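
The arithmetic behind these estimates can be sketched as below. The class and method names are hypothetical, and the sketch assumes the 32-bit JRE costs listed above, a one-byte length prefix for terms under 128 bytes, 3 UTF-8 bytes per CJK character, and MB meaning 2^20 bytes. Under those assumptions it lands on roughly the same figures.

```java
// Rough FieldCache RAM estimate using the per-term and per-doc costs quoted
// above (32-bit JRE figures). Class and method names are hypothetical.
final class FieldCacheRamEstimate {

  // ceil(log2(n)): the number of bits needed to address n distinct values.
  static int bitsRequired(long n) {
    return n <= 1 ? 1 : 64 - Long.numberOfLeadingZeros(n - 1);
  }

  // Trunk StringIndex: 40 bytes of String + char[] overhead plus 2 bytes per
  // UTF-16 char for each unique term, and a 4-byte reference per document.
  static long trunkBytes(long numDocs, long numTerms, int charsPerTerm) {
    return numTerms * (40L + 2L * charsPerTerm) + numDocs * 4L;
  }

  // Patch: UTF-8 bytes plus a 1 or 2 byte length prefix per term, a packed
  // offset per term, and a packed ord per document.
  static long patchBytes(long numDocs, long numTerms, int utf8BytesPerTerm) {
    int prefixBytes = utf8BytesPerTerm < 128 ? 1 : 2;
    long termDataBytes = numTerms * (utf8BytesPerTerm + prefixBytes);
    long offsetBits = numTerms * bitsRequired(termDataBytes);
    long ordBits = numDocs * bitsRequired(numTerms);
    return termDataBytes + (offsetBits + ordBits + 7) / 8;
  }

  static long toMB(long bytes) {
    return Math.round(bytes / (1024.0 * 1024.0));
  }

  public static void main(String[] args) {
    long docs = 5_000_000L; // every title assumed unique
    // English titles: 40 chars, 1 UTF-8 byte per char.
    System.out.println(toMB(trunkBytes(docs, docs, 40)) + " MB vs "
        + toMB(patchBytes(docs, docs, 40)) + " MB");
    // CJK titles: 10 chars, assumed 3 UTF-8 bytes per char.
    System.out.println(toMB(trunkBytes(docs, docs, 10)) + " MB vs "
        + toMB(patchBytes(docs, docs, 30)) + " MB");
  }
}
```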





[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871308#action_12871308
 ] 

Michael McCandless commented on LUCENE-2380:


OK I ran some sort perf tests.  I picked the worst case -- trivial
query (TermQuery) matching all docs, sorting by either a highly unique
string field (random string) or enumerated field (country ~ a couple
hundred values), from benchmark's SortableSingleDocSource.

Index has 5M docs.  Each run is best of 3.

Results:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|5.64|{color:red}-27.2%{color}
|country|8.05|7.62|{color:red}-5.3%{color}

So the packed ints lookups are more costly than trunk today (but,
at a large reduction in RAM used).

Then I tried another test, asking packed ints to upgrade to an array
of the nearest native type (ie byte[], short[], int[], long[]) for the
doc -> ord map.  This is faster since lookups don't require
shift/mask, but, wastes some space since you have unused bits:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|7.89|{color:green}1.8%{color}
|country|8.05|7.64|{color:red}-5.1%{color}

The country case didn't get any better (noise) because it happened to
already be using 8 bits (byte[]) for doc->ord map.
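
For readers unfamiliar with the tradeoff, here is a minimal sketch of a bit-packed doc -> ord lookup next to the rounding rule for the native-array upgrade. It is not Lucene's PackedInts implementation; PackedOrds and nativeBits are hypothetical names, and bit widths are limited to 1..63 to keep it short.

```java
// Hypothetical bit-packed int storage; each get() needs a shift and a mask,
// and sometimes a second read when a value straddles two longs.
final class PackedOrds {
  private final long[] blocks;
  private final int bitsPerValue; // 1..63 in this sketch
  private final long mask;

  PackedOrds(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
  }

  void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    int spill = shift + bitsPerValue - 64; // bits that overflow into the next long
    if (spill > 0) {
      long highMask = (1L << spill) - 1;
      blocks[block + 1] = (blocks[block + 1] & ~highMask)
          | ((value & mask) >>> (bitsPerValue - spill));
    }
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      value |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return value & mask;
  }

  // Width of the nearest native array type (byte[], short[], int[], long[]).
  static int nativeBits(int bitsPerValue) {
    if (bitsPerValue <= 8) return 8;
    if (bitsPerValue <= 16) return 16;
    if (bitsPerValue <= 32) return 32;
    return 64;
  }
}
```

With 5M unique values the ords need 23 bits, which rounds up to an int[]; a field with a couple hundred values already fits in 8 bits, so the upgrade changes nothing for it.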

Remember this is a worst case test -- if your query matches fewer
results than your entire index, or your query is more costly to
evaluate than the simple single TermQuery, this FieldCache lookup cost
will be relatively smaller.

So... I think we should expose in the new FieldCache methods an
optional param to control time/space tradeoff; I'll add this,
defaulting to upgrading to nearest native type.  I think the 5.3%
slowdown on the country field is acceptable given the large reduction
in RAM used...





[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-18 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868776#action_12868776
 ] 

Yonik Seeley commented on LUCENE-2380:
--

bq. would love to return empty string (not null) if ord 0 comes in, and require 
caller to specifically handle ord 0 if they need to differentiate... I had 
started down that path but got spooked by it

Yeah... I guess I could see how it could cause a loss of info if you go through 
a few layers and you only have a BytesRef w/o an ord to tell you the value was 
missing.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868767#action_12868767
 ] 

Michael McCandless commented on LUCENE-2380:


bq. Ahhh, fun stuff! I'm packing for Prague though - prob won't be able to look 
at this for a week.

OK no prob... have fun!

bq. 1 or 2? a max len of 2**15? (I know... a term bigger than 32K would be 
horrible, but so are limits that aren't necessary)

Indexer already has this limit, during indexing (these large terms are skipped).

bq. re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): 
do we need to do this? We could record 0 as zero length in the FieldCache and 
hence avoid the special-case code. We could require the user to check for 0 if 
they care to know the difference between zero length and missing.

I would love to return empty string (not null) if ord 0 comes in, and require 
caller to specifically handle ord 0 if they need to differentiate... I had 
started down that path but got spooked by it :)  I think we can revisit it, but 
maybe separately.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-18 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868762#action_12868762
 ] 

Yonik Seeley commented on LUCENE-2380:
--

Ahhh, fun stuff!  I'm packing for Prague though - prob won't be able to look at 
this for a week.

bq. 1 or 2 byte vInt prefix

1 or 2? a max len of 2**15?  (I know... a term bigger than 32K would be 
horrible, but so are limits that aren't necessary).  We could also do 1 or 4 
(or 1 or 5), but as long as we make sure the single-byte case is optimized, it 
shouldn't matter.
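
As an illustration of where the 2**15 limit comes from, here is one possible 1 or 2 byte length prefix with the single-byte case as the fast path. The exact encoding in the patch may differ; TermLengthPrefix is a hypothetical name, and the layout (7 bits plus a continuation flag in the first byte, 8 more bits in the second) is an assumption.

```java
// Hypothetical 1-or-2 byte length prefix: lengths 0..127 take one byte,
// lengths up to 2^15 - 1 take two.
final class TermLengthPrefix {

  static final int MAX_LENGTH = 0x7FFF; // 7 + 8 = 15 bits

  // Writes the prefix at dest[pos] and returns the number of bytes used.
  static int write(byte[] dest, int pos, int length) {
    if (length < 0 || length > MAX_LENGTH) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    if (length < 128) {
      dest[pos] = (byte) length;
      return 1;
    }
    dest[pos] = (byte) (0x80 | (length & 0x7F)); // high bit flags a second byte
    dest[pos + 1] = (byte) (length >>> 7);
    return 2;
  }

  // Reads the length stored at src[pos].
  static int read(byte[] src, int pos) {
    int b = src[pos];
    if (b >= 0) {
      return b; // single-byte fast path
    }
    return (b & 0x7F) | ((src[pos + 1] & 0xFF) << 7);
  }
}
```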

re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): do 
we need to do this?  We could record 0 as zero length in the FieldCache and 
hence avoid the special-case code.  We could require the user to check for 0 if 
they care to know the difference between zero length and missing.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867535#action_12867535
 ] 

Michael McCandless commented on LUCENE-2380:


I agree, let's pass the reused BytesRef in.




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-05-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867534#action_12867534
 ] 

Yonik Seeley commented on LUCENE-2380:
--

One thing to keep in mind is that the current way of returning shared BytesRef 
objects often forces one to make a copy.  We should perhaps consider allowing a 
BytesRef to be passed in.

{code}
// returning a shared BytesRef forces a copy
for (;;) {
  BytesRef val1 = new BytesRef(getValue(doc1));  // make a copy
  BytesRef val2 = getValue(doc2);
  int cmp = val1.compareTo(val2);
}

// allowing a BytesRef to be passed in means no copy
BytesRef val1 = new BytesRef();
BytesRef val2 = new BytesRef();
for (;;) {
  getValue(doc1, val1);
  getValue(doc2, val2);
  int cmp = val1.compareTo(val2);
}
{code}
