[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-08 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854853#action_12854853
 ] 

Toke Eskildsen commented on LUCENE-2380:


While working on LUCENE-2369, I essentially had to re-implement the FieldCache 
because of the hardwired arrays. Switching to accessor methods seems like the 
right direction to go.

> Add FieldCache.getTermBytes, to load term data as byte[]
> 
>
> Key: LUCENE-2380
> URL: https://issues.apache.org/jira/browse/LUCENE-2380
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 3.1
>
>
> With flex, a term is now an opaque byte[] (typically a UTF-8 encoded Unicode 
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding 
> methods to load terms as native byte[], since in general they may not be 
> representable as String.  This should also be quite a bit more RAM efficient 
> for US-ASCII content, since each character would then use 1 byte, not 2.
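
The RAM argument in the description can be checked with a small sketch. The class and method names here are made up for illustration. It counts character payload only, ignoring object headers, and it assumes a String backed by 2-byte chars, which newer JVMs may store more compactly for Latin-1 content.

```java
import java.nio.charset.StandardCharsets;

final class TermSizeDemo {
    /** Payload bytes when the term is held as UTF-16 code units (2 bytes each). */
    static int utf16Bytes(String term) {
        return term.length() * 2;
    }

    /** Payload bytes when the same term is held as UTF-8 encoded bytes. */
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String term = "lucene";
        System.out.println(utf16Bytes(term) + " bytes as chars, "
                + utf8Bytes(term) + " bytes as UTF-8");
    }
}
```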

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854639#action_12854639
 ] 

Uwe Schindler commented on LUCENE-2380:

This again points in the direction of no longer having arrays in FieldCache, 
but instead having accessor methods that take a docid and give back the data 
(possibly as a reference). So getBytes(docid) returns a reused BytesRef with 
the offset and length of the requested term. For native types we should also 
move away from arrays and only provide accessor methods. Java is fast and will 
possibly inline the method call. So for native types we could also use a 
FloatBuffer or ByteBuffer or whatever from java.nio.
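
A minimal sketch of the accessor idea, assuming a hypothetical FloatValues interface (not an existing FieldCache type). It shows that once callers go through an accessor, the backing store can be a plain array or a java.nio.FloatBuffer without the caller noticing.

```java
import java.nio.FloatBuffer;

final class AccessorSketch {
    /** Hypothetical accessor-style view of per-document float values. */
    interface FloatValues {
        float get(int docId);
    }

    /** Backs the accessor with a plain float[]. */
    static FloatValues fromArray(float[] values) {
        return docId -> values[docId];
    }

    /** Backs the same accessor with a FloatBuffer. */
    static FloatValues fromBuffer(FloatBuffer buffer) {
        // Absolute get: reads by index and does not move the buffer position.
        return docId -> buffer.get(docId);
    }
}
```

Whether the JIT actually inlines such a call depends on the JVM and the call site, so any performance claim would need measuring.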




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-07 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854621#action_12854621
 ] 

Yonik Seeley commented on LUCENE-2380:

> We could also do shared byte[] blocks (private), with a public method to 
> retrieve the BytesRef for a given doc?

Absolutely!  Now that we are in control, it would be a crime not to share the 
byte[].
Seems like one should pass in a BytesRef to be filled in... that would be most 
efficient for people doing simple stuff like comparing docid1 to docid2.  
Returning a reused BytesRef could also work (as TermsEnum does) but it's less 
efficient for anything needing a state of more than 1 BytesRef since it then 
requires copying.
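
A rough sketch of the "pass in a ref to be filled" style over one shared byte[]. The Ref class is a hypothetical stand-in for BytesRef, and SharedTermBytes is an invented name, not a proposed API. Offsets are kept in a plain int[] here for simplicity.

```java
import java.nio.charset.StandardCharsets;

final class SharedTermBytes {
    /** Hypothetical stand-in for a bytes reference: a slice of some byte[]. */
    static final class Ref {
        byte[] bytes;
        int offset;
        int length;
    }

    private final byte[] block;   // all terms, concatenated
    private final int[] starts;   // starts[doc] .. starts[doc + 1] delimits the term of doc

    SharedTermBytes(String[] termPerDoc) {
        starts = new int[termPerDoc.length + 1];
        byte[][] encoded = new byte[termPerDoc.length][];
        int total = 0;
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            encoded[doc] = termPerDoc[doc].getBytes(StandardCharsets.UTF_8);
            starts[doc] = total;
            total += encoded[doc].length;
        }
        starts[termPerDoc.length] = total;
        block = new byte[total];
        for (int doc = 0; doc < encoded.length; doc++) {
            System.arraycopy(encoded[doc], 0, block, starts[doc], encoded[doc].length);
        }
    }

    /** Points the caller-owned ref at the term of docId; no bytes are copied. */
    Ref fill(int docId, Ref ref) {
        ref.bytes = block;
        ref.offset = starts[docId];
        ref.length = starts[docId + 1] - starts[docId];
        return ref;
    }

    /** Compares the terms of two documents as unsigned bytes. */
    int compare(int docA, int docB) {
        // A real caller would keep these two refs around instead of allocating per call.
        Ref a = fill(docA, new Ref());
        Ref b = fill(docB, new Ref());
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a.bytes[a.offset + i] & 0xFF) - (b.bytes[b.offset + i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With a single reused ref, the compare above would need to copy the first term before fetching the second, which is the extra cost mentioned for the reused-ref variant.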

We can further save space by putting the length as a vInt in the byte[] - most 
would be a single byte.
Then we just need an int[] as an index into the byte[]... or potentially packed 
ints.
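
The length-prefix layout could look roughly like this. It is a sketch with invented names, using the common 7-bits-per-byte variable-length int encoding. The term() method builds a String purely so the result is easy to inspect, which a real accessor would avoid.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class VIntPrefixedTerms {
    private final byte[] data;   // for each doc: vInt length, then the term bytes
    private final int[] index;   // index[doc] = position of that doc's vInt in data

    VIntPrefixedTerms(String[] termPerDoc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        index = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            byte[] term = termPerDoc[doc].getBytes(StandardCharsets.UTF_8);
            index[doc] = out.size();
            int len = term.length;
            // 7 payload bits per byte; a set high bit means another byte follows.
            while ((len & ~0x7F) != 0) {
                out.write((len & 0x7F) | 0x80);
                len >>>= 7;
            }
            out.write(len);
            out.write(term, 0, term.length);
        }
        data = out.toByteArray();
    }

    /** Decodes the length prefix and returns the term (as a String, for illustration only). */
    String term(int docId) {
        int pos = index[docId];
        int len = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            len |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new String(data, pos, len, StandardCharsets.UTF_8);
    }

    int totalBytes() {
        return data.length;
    }
}
```

Terms shorter than 128 bytes get a one-byte prefix, which matches the remark that most lengths would fit in a single byte.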

We'll also need an implementation that can span multiple byte[]s to support 
more than 2GB.  The correct byte[] to look into is then simply a function of 
the docid (as is done in Solr's UnInvertedField).
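
One possible reading of "a function of the docid", sketched with invented names: consecutive docids are grouped into fixed-size ranges, and each range gets its own byte[]. This does not reproduce what Solr's UnInvertedField does. It only illustrates picking the block by shifting the docid.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class BlockedTermBytes {
    private final int shift;        // each block holds the terms of 2^shift consecutive docs
    private final byte[][] blocks;
    private final int[] starts;     // start of each doc's term within its own block
    private final int[] lengths;    // kept explicit to keep the sketch short

    BlockedTermBytes(String[] termPerDoc, int shift) {
        this.shift = shift;
        int docsPerBlock = 1 << shift;
        int numBlocks = (termPerDoc.length + docsPerBlock - 1) >>> shift;
        blocks = new byte[numBlocks][];
        starts = new int[termPerDoc.length];
        lengths = new int[termPerDoc.length];
        for (int block = 0; block < numBlocks; block++) {
            int firstDoc = block << shift;
            int endDoc = Math.min(firstDoc + docsPerBlock, termPerDoc.length);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int doc = firstDoc; doc < endDoc; doc++) {
                byte[] term = termPerDoc[doc].getBytes(StandardCharsets.UTF_8);
                starts[doc] = out.size();
                lengths[doc] = term.length;
                out.write(term, 0, term.length);
            }
            blocks[block] = out.toByteArray();
        }
    }

    /** The block to read is derived from the docid alone. */
    String term(int docId) {
        byte[] block = blocks[docId >>> shift];
        return new String(block, starts[docId], lengths[docId], StandardCharsets.UTF_8);
    }

    int blockCount() {
        return blocks.length;
    }
}
```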

We could possibly play games with the offsets into the byte[] too - encode as a 
delta against the average instead of an absolute offset.  So offset = 
average_size * ord + get_delta(ord)
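
The offset formula can be tried out with a few lines. The class below is an illustrative sketch with invented names. The deltas are stored in a plain int[] here, whereas the point of the idea is that they stay small and could go into packed ints.

```java
final class AverageDeltaOffsets {
    private final int averageSize;
    private final int[] deltas;   // offset minus (averageSize * ord); small when term sizes are similar

    AverageDeltaOffsets(int[] offsets, int totalBytes) {
        averageSize = offsets.length == 0 ? 0 : totalBytes / offsets.length;
        deltas = new int[offsets.length];
        for (int ord = 0; ord < offsets.length; ord++) {
            deltas[ord] = offsets[ord] - averageSize * ord;
        }
    }

    /** offset = average_size * ord + get_delta(ord) */
    int offset(int ord) {
        return averageSize * ord + deltas[ord];
    }

    /** Largest magnitude among the deltas; determines how many bits each delta needs. */
    int maxAbsDelta() {
        int max = 0;
        for (int delta : deltas) {
            max = Math.max(max, Math.abs(delta));
        }
        return max;
    }
}
```

For offsets 0, 5, 9, 15, 20 over 25 bytes, the average size is 5 and the only non-zero delta is -1, so each delta needs far fewer bits than an absolute offset would.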




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854615#action_12854615
 ] 

Michael McCandless commented on LUCENE-2380:


We could also do shared byte[] blocks (private), with a public method to 
retrieve the BytesRef for a given doc?  Standard codec's terms index does this 
-- we could share it I think.

A new byte[] per doc adds a lot of RAM overhead and GC load.  (Of course, so 
does the String solution we use today, so it'd at least be no worse...).




[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

2010-04-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854594#action_12854594
 ] 

Uwe Schindler commented on LUCENE-2380:

The structure should look like String and StringIndex, but I am not sure if we 
need real BytesRefs. In my opinion, it should be an array of byte[], where each 
byte[] is allocated with the term size from the enum's BytesRef and copied 
over. This is no problem, as the terms need to be replicated either way, since 
the BytesRef from the enum is reused. The only problem is that byte[] is missing 
the cool BytesRef methods like utf8ToString() that may be needed by consumers.

getStrings and getStringIndex should be deprecated. We cannot emulate them 
using BytesRef.utf8ToString, as the String[] arrays are raw and allow no 
wrapping. If FieldCache used accessor methods and not raw arrays, we would 
not have that problem...
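
The need to copy can be illustrated with a small sketch. ReusingEnum is a hypothetical stand-in for an enum that hands out the same buffer for every term. It is not Lucene's TermsEnum, and the fixed 16-byte scratch buffer is an arbitrary simplification.

```java
import java.util.Arrays;

final class ReusedRefCopyDemo {
    /** Hypothetical enum that reuses one scratch buffer for every term it returns. */
    static final class ReusingEnum {
        private final byte[][] terms;                  // each term must fit in 16 bytes
        private final byte[] scratch = new byte[16];
        private int next = 0;
        int length;                                    // valid bytes in scratch after next()

        ReusingEnum(byte[][] terms) {
            this.terms = terms;
        }

        /** Returns the shared scratch buffer holding the next term, or null when exhausted. */
        byte[] next() {
            if (next == terms.length) {
                return null;
            }
            byte[] term = terms[next++];
            System.arraycopy(term, 0, scratch, 0, term.length);
            length = term.length;
            return scratch;
        }
    }

    /** Copies each term into its own exactly sized byte[]; count must not exceed the term count. */
    static byte[][] loadAll(ReusingEnum termsEnum, int count) {
        byte[][] result = new byte[count][];
        for (int i = 0; i < count; i++) {
            byte[] shared = termsEnum.next();
            // Storing 'shared' directly would make every entry alias the same buffer.
            result[i] = Arrays.copyOf(shared, termsEnum.length);
        }
        return result;
    }
}
```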
