As Olivier wrote, multiple BytesRef instances can share the underlying byte array when representing slices of existing data, for performance reasons.
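
To make the sharing concrete, here is a minimal sketch of the effect. It deliberately does not use Lucene: `SliceDemo`, `Slice`, `shallowCopy`, `deepCopy` and `write` are hypothetical names invented for this illustration, standing in for a (bytes, offset, length) view over a buffer that its owner reuses. It is only meant to show why a copy that shares the backing array still sees later writes, assuming the producer overwrites its buffer in place.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for illustration only; this is NOT Lucene's BytesRef.
final class SliceDemo {

    /** A view over part of a byte array: (bytes, offset, length). */
    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** New object, same backing array (a shallow copy). */
        Slice shallowCopy() {
            return new Slice(bytes, offset, length);
        }

        /** New object with its own private copy of the viewed bytes (a deep copy). */
        Slice deepCopy() {
            return new Slice(Arrays.copyOfRange(bytes, offset, offset + length), 0, length);
        }

        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Overwrites the start of the buffer, as a reused internal buffer might; returns bytes written. */
    static int write(byte[] buffer, String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(encoded, 0, buffer, 0, encoded.length);
        return encoded.length;
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[16];
        Slice current = new Slice(buffer, 0, write(buffer, "term"));
        Slice shallow = current.shallowCopy();
        Slice deep = current.deepCopy();

        // The producer moves on to the next value and reuses its buffer.
        write(buffer, "testing");

        System.out.println("shallow: " + shallow.utf8ToString()); // prints "shallow: test"
        System.out.println("deep:    " + deep.utf8ToString());    // prints "deep:    term"
    }
}
```

The shallow copy keeps the old length of 4 but reads the new bytes, which is the same pattern as the `content:test` output in the quoted message; the deep copy is unaffected.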
BytesRef#clone()'s javadoc comment says that the result will be a shallow clone, sharing the backing array with the original instance, and points to another utility method for deep cloning: BytesRef#deepCopyOf(BytesRef).

On Wed, May 20, 2015 at 7:21 AM, Olivier Binda <olivier.bi...@wanadoo.fr> wrote:

> My take:
> Indeed, BytesRef is mutable.
> This happens for performance reasons, to avoid unnecessary object
> creations and unnecessary copying, and also to work around
> the Java "issue" that most of the time you need to pass an array with an
> offset and length in methods for performance but you don't want to create
> an array every time you have to do that.
>
> In your case, you are supposed to copy your bytes because, indeed, the
> BytesRef will change every time you call a Lucene method on it
> (it is mutable) and the array it points to will change too because these
> might be internal arrays of readers/buffers/codecs
> (and you don't know the internal working of those)...
>
> Also, in my opinion,
> Lucene rocks
>
> On 05/20/2015 06:19 AM, Trejkaz wrote:
>
>> Hi all.
>>
>> The Lucene 4 migration guide "helpfully" suggests working with
>> BytesRef directly rather than converting to string, but I disagree.
>> Take the following example of building up a List<Term> by iterating a
>> TermsEnum. I think it is written in a fairly straightforward fashion.
>> I added some println calls which aren't really there, to illustrate the
>> place I have my breakpoints.
>>
>>     protected List<Term> toList(String field, TermsEnum termsEnum)
>>             throws IOException {
>>         List<Term> terms = new LinkedList<>();
>>         BytesRef text;
>>         //noinspection NestedAssignment
>>         while((text = termsEnum.next()) != null) {
>>             Term term = new Term(field, text);
>>             System.out.println("in loop: " + term);
>>             terms.add(term);
>>         }
>>         System.out.println("at end: " + terms);
>>         return terms;
>>     }
>>
>> When you actually try to call this, weird shit happens.
>>
>>     in loop: content:term
>>     at end: [content:testing]
>>     in loop: content:extractor
>>     at end: [content:for]
>>
>> Basically, by the time you exit the while loop, the BytesRef you put
>> into the Term has changed to point to the next term in the index. So
>> okay, BytesRef is mutable. I hate mutable stuff, but luckily we
>> have clone() on this class, so I'll just clone it when creating the
>> term:
>>
>>     Term term = new Term(field, text.clone());
>>
>> Now the output is:
>>
>>     in loop: content:term
>>     at end: [content:test]
>>     in loop: content:extractor
>>     at end: [content:forractor]
>>
>> WTF?
>>
>> Now it seems like it clones the length of the slice but not the actual
>> data, and the actual data has still changed underneath it. Great. So
>> basically, the only safe way to use BytesRef is to treat it like a hot
>> potato and immediately call utf8ToString() to get hold of an object
>> you can trust.
>>
>>     Term term = new Term(field, text.utf8ToString());
>>
>> And then finally you get:
>>
>>     in loop: content:term
>>     at end: [content:term]
>>     in loop: content:extractor
>>     at end: [content:extractor]
>>
>> I will probably eventually formalise this in our code and make
>> utility wrappers which don't expose BytesRef to the caller, since it's
>> so easy to do the wrong thing with it.
>>
>> They say a good measure of the quality of a library is the number of
>> times you say "WTF" while trying to figure out how to use it. I have
>> already lost count.
>>
>> TX
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Péteri András