Hi Amey,
You created a byte array from the original string, which may contain surrogate
chars (and, after `substring`, possibly an unpaired one at the cut point). You
then created the copy with `final String copy = new String(bytes, charset);`.
That is a round trip through UTF-8, and an unpaired surrogate cannot be encoded
in UTF-8, so `getBytes` substitutes a replacement byte for it. I suspect that
is what leads to the error you reported.
If you try `final String copy = new String(bytes);` the bytes will still be
decoded, just with the default system charset, so that does not avoid the
problem either.
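To make the failure mode concrete, here is a minimal sketch of the round trip on a hand-built string instead of a random one. The class and method names (`SurrogateRoundTrip`, `roundTrip`) are made up for illustration and are not part of commons-text; the only assumption is that `substring` happened to cut a surrogate pair in half.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateRoundTrip {

    // Encode to UTF-8 and decode again, as the original test did.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair \uD83D \uDE00.
        String pair = "a\uD83D\uDE00";
        // Cutting after the high surrogate leaves it unpaired,
        // which is what substring(0, 2500) can do to a random string.
        String cut = pair.substring(0, 2);

        System.out.println(roundTrip(pair).equals(pair));  // true
        System.out.println(roundTrip(cut).equals(cut));    // false
        // The unpaired surrogate came back as '?' (code 63).
        System.out.println((int) roundTrip(cut).charAt(1)); // 63
    }
}
```

The intact pair survives the round trip, while the cut string comes back with `?` in place of the lone surrogate, which matches the intermittent mismatch in the test.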
So I think the safest is to compare codepoints. Perhaps with something like
this:
@Test
public void testSubStringWithSurrogatePair() {
    for (int j = 0; j < 10; j++) {
        final int size = 5000;
        RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
        String orig = generator.generate(size).substring(0, 2500);
        final String copy = new String(orig);
        for (int i = 0; i < orig.length() && i < copy.length(); i++) {
            final int o = orig.codePointAt(i);
            final int c = copy.codePointAt(i);
            assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
        }
    }
}
Running your original test 10 times, I was able to consistently reproduce the
initial issue: about 4 runs out of 10 would fail. Hence the outer loop. I seem
to remember seeing unit tests for randomly generated values that use loops in
[rng] or another commons component, but I may be mistaken (I don't trust my own
memory), so feel free to leave that part out if you prefer. I also tried the
code above with j going up to 1000. It took a few seconds, and the test still
passed.
With `final String copy = new String(orig);` the value of the original string
is copied into the new string as-is, with no charset conversion. So comparing
the codepoints should do the trick. We may even want to add another assert
statement before the for loop to confirm both strings have the same length?
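One possible shape for that check, as a rough sketch only. `CodePointCompare` and `sameCodePoints` are names invented for illustration, not part of commons-text. It relies on `String.codePoints()` (Java 8+), which walks the string one code point at a time and passes unpaired surrogates through unchanged, so comparing the resulting arrays covers both the length and the content.

```java
import java.util.Arrays;

public class CodePointCompare {

    // True when both strings have the same number of code points
    // and the same code point at every position.
    static boolean sameCodePoints(String a, String b) {
        int[] left = a.codePoints().toArray();
        int[] right = b.codePoints().toArray();
        return Arrays.equals(left, right);
    }

    public static void main(String[] args) {
        // A string ending in an unpaired high surrogate, as a cut might leave it.
        String orig = "a\uD83D\uDE00".substring(0, 2);
        String copy = new String(orig);

        System.out.println(orig.length() == copy.length()); // true
        System.out.println(sameCodePoints(orig, copy));      // true
        System.out.println(sameCodePoints("ab", "ac"));      // false
    }
}
```

In a JUnit test the two checks could become an `assertEquals` on the lengths followed by an `assertTrue` (or `assertArrayEquals` on the code point arrays) in place of the inner loop.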
Hope that helps,
Bruno
________________________________
From: Amey Jadiye <[email protected]>
To: Commons Developers List <[email protected]>
Sent: Monday, 11 September 2017 12:15 AM
Subject: [text] Invalid unicode sequences on .substring of RandomStringGenerator
Hi Folks,
While working on RandomStringGenerator I found that when I call .substring on
a generated random string, it fails intermittently with a sequence of
surrogate pairs.
The same bug was raised in commons-lang:
https://issues.apache.org/jira/browse/LANG-100
Is this a possible bug in RandomStringGenerator, or is this expected?
@Test
public void testSubStringWithSurrogatePair() {
    final int size = 5000;
    final Charset charset = Charset.forName("UTF-8");
    RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
    String orig = generator.generate(size).substring(0, 2500);
    final byte[] bytes = orig.getBytes(charset);
    final String copy = new String(bytes, charset);
    for (int i = 0; i < orig.length() && i < copy.length(); i++) {
        final char o = orig.charAt(i);
        final char c = copy.charAt(i);
        assertEquals("differs at " + i + "("
                + Integer.toHexString(new Character(o).hashCode()) + ","
                + Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
    }
}
Regards,
Amey