Re: BytesRef violates the principle of least astonishment

András Péteri Wed, 20 May 2015 00:19:28 -0700

As Olivier wrote, multiple BytesRef instances can share the underlying byte
array when representing slices of existing data, for performance reasons.


BytesRef#clone()'s javadoc comment says that the result will be a shallow
clone, sharing the backing array with the original instance, and points
to another utility method for deep cloning: BytesRef#deepCopyOf(BytesRef).
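
To make the difference concrete, here is a minimal sketch in plain Java. It
does not use Lucene itself: Slice and ReusingEnum are hypothetical stand-ins
for BytesRef and for an enumerator that reuses one internal buffer (the
fixed 32-byte buffer is an assumption made only for this illustration, and
real readers and codecs may behave differently). It compares holding the
returned reference, a shallow copy, and a deep copy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SliceDemo {

    /** Hypothetical stand-in for BytesRef: an (offset, length) view over a byte array. */
    static final class Slice {
        final byte[] bytes;
        final int offset;
        int length; // mutable, so an enumerator can reuse a single instance

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** Shallow copy: new view object, same backing array. */
        Slice shallowCopy() {
            return new Slice(bytes, offset, length);
        }

        /** Deep copy: private array holding only the bytes currently in view. */
        Slice deepCopy() {
            return new Slice(Arrays.copyOfRange(bytes, offset, offset + length), 0, length);
        }

        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical enumerator that hands out the same Slice over the same buffer on every call. */
    static final class ReusingEnum {
        private final byte[] buffer = new byte[32]; // assumption: terms fit in 32 bytes
        private final Slice current = new Slice(buffer, 0, 0);
        private final String[] terms;
        private int pos = 0;

        ReusingEnum(String... terms) {
            this.terms = terms;
        }

        Slice next() {
            if (pos == terms.length) {
                return null;
            }
            byte[] src = terms[pos++].getBytes(StandardCharsets.UTF_8);
            // Overwrite in place; bytes past src.length keep their old values.
            System.arraycopy(src, 0, buffer, 0, src.length);
            current.length = src.length;
            return current;
        }
    }

    /** Captures the first term three ways, advances the enumerator, then reads all three back. */
    static String[] capture(String first, String second) {
        ReusingEnum termsEnum = new ReusingEnum(first, second);
        Slice text = termsEnum.next();
        Slice sameReference = text;
        Slice shallow = text.shallowCopy();
        Slice deep = text.deepCopy();
        termsEnum.next(); // the shared buffer now holds the second term
        return new String[] {
            sameReference.utf8ToString(), shallow.utf8ToString(), deep.utf8ToString()
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(capture("term", "testing")));
        System.out.println(Arrays.toString(capture("extractor", "for")));
    }
}
```

Only the deep copy still reads "term" and "extractor" after the enumerator
has moved on; the other two show the same kind of results as the quoted
output below. In the original loop, the corresponding change would be
new Term(field, BytesRef.deepCopyOf(text)).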

On Wed, May 20, 2015 at 7:21 AM, Olivier Binda <olivier.bi...@wanadoo.fr>
wrote:

> My take:
> Indeed, BytesRef is mutable.
> This is for performance reasons: it avoids unnecessary object
> creation and unnecessary copying. It also works around
> the Java "issue" that, most of the time, you need to pass an array with an
> offset and length to methods for performance, but you don't want to create
> a new array every time you do that.
>
>
> In your case, you are supposed to copy your bytes because, indeed, the
> BytesRef will change every time you call a Lucene method on it
> (it is mutable), and the array it points to will change too, because these
> might be internal arrays of readers/buffers/codecs
> (and you don't know the internal workings of those)...
>
>
> Also, in my opinion,
> Lucene rocks
>
>
>
> On 05/20/2015 06:19 AM, Trejkaz wrote:
>
>> Hi all.
>>
>> The Lucene 4 migration guide "helpfully" suggests to work with
>> BytesRef directly rather than converting to string, but I disagree.
>> Take the following example of building up a List<Term> by iterating a
>> TermsEnum. I think it is written in a fairly straightforward fashion.
>> I added some println which aren't really there, to illustrate the
>> place I have my breakpoints.
>>
>>      protected List<Term> toList(String field, TermsEnum termsEnum) throws IOException {
>>          List<Term> terms = new LinkedList<>();
>>          BytesRef text;
>>          //noinspection NestedAssignment
>>          while((text = termsEnum.next()) != null) {
>>              Term term = new Term(field, text);
>>              System.out.println("in loop: " + term);
>>              terms.add(term);
>>          }
>>          System.out.println("at end: " + terms);
>>          return terms;
>>      }
>>
>> When you actually try to call this, weird shit happens.
>>
>>      in loop: content:term
>>      at end: [content:testing]
>>      in loop: content:extractor
>>      at end: [content:for]
>>
>> Basically, by the time you exit the while loop, the BytesRef you put
>> into the Term has changed to point to the next term in the index. So
>> okay, so BytesRef is mutable. I hate mutable stuff, but luckily we
>> have clone() on this class, so I'll just clone it when creating the
>> term:
>>
>>              Term term = new Term(field, text.clone());
>>
>> Now the output is:
>>
>>      in loop: content:term
>>      at end: [content:test]
>>      in loop: content:extractor
>>      at end: [content:forractor]
>>
>> WTF?
>>
>> Now it seems like it clones the length of the slice but not the actual
>> data, and the actual data has still changed underneath it. Great. So
>> basically, the only safe way to use BytesRef is to treat it like a hot
>> potato and immediately call utf8ToString() to get hold of an object
>> you can trust.
>>
>>              Term term = new Term(field, text.utf8ToString());
>>
>> And then finally you get:
>>
>>      in loop: content:term
>>      at end: [content:term]
>>      in loop: content:extractor
>>      at end: [content:extractor]
>>
>> I will probably eventually formalise this in our code by making
>> utility wrappers which don't expose BytesRef to the caller, since it's
>> so easy to do the wrong thing with it.
>>
>> They say a good measure of the quality of a library is the number of
>> times you say "WTF" while trying to figure out how to use it. I have
>> already lost count.
>>
>> TX
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>


-- 
Péteri András
