Thanks much for checking this, Bruno.

On Mon, Sep 11, 2017 at 3:05 PM, Bruno P. Kinoshita <brunodepau...@yahoo.com.br.invalid> wrote:
> Hi Amey,
>
> You created a byte array from the original string (which may contain
> surrogate chars). But then you created a copy string with `final String
> copy = new String(bytes, charset);`. There will be encoding to UTF-8, which
> may fail to encode some values, leading to the error you reported, I suspect.
>

I did this on purpose to check the LANG-100 issue, which was about converting a string from UTF-16 (the default) to UTF-8 and back. My expectation was that this test should pass cleanly.

> If you try `final String copy = new String(bytes);` there will still be
> encoding to the default system charset as well.
>
> So I think the safest is to compare codepoints. Perhaps with something
> like this:
>
> @Test
> public void testSubStringWithSurrogatePair() {
>     for (int j = 0; j < 10; j++) {
>         final int size = 5000;
>         RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
>         String orig = generator.generate(size).substring(0, 2500);
>         final String copy = new String(orig);
>         for (int i = 0; i < orig.length() && i < copy.length(); i++) {
>             final int o = orig.codePointAt(i);
>             final int c = copy.codePointAt(i);
>             assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
>         }
>     }
> }
>
> Running it 10 times, I was able to consistently reproduce the initial
> issue. It would always fail, about 4 out of 10. I think [rng] or somewhere
> in another commons component I kind of remember seeing unit tests for
> random generated values using loops? But I may be mistaken (I don't trust my
> own memory). So feel free to leave that part out if you prefer. I tried the
> code above with j going up to 1000. After a few seconds, the test passed
> too.
>

Yeah, I did the same kind of testing and found that it happens intermittently, whenever a surrogate pair falls on the position where I cut the string (0 to 2500). If a pair straddles index 2500 it gets cut in half and the issue appears, which seems obvious to me now.
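To make the half-cut pair concrete, here is a minimal sketch that uses only the JDK, with no RandomStringGenerator involved. The class and method names (SurrogateCut, splitsPair, safePrefix, utf8RoundTrip) are made up for illustration and are not part of commons-text or commons-lang. It assumes the failure comes from a lone high surrogate at the end of the substring, which cannot be encoded as UTF-8 and so does not survive the round trip.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not an existing commons API.
final class SurrogateCut {

    // True if cutting s at char index `end` would separate a high surrogate
    // from the low surrogate that immediately follows it.
    static boolean splitsPair(String s, int end) {
        return end > 0 && end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end));
    }

    // Like s.substring(0, end), but backs off by one char when the cut would
    // land inside a surrogate pair, so the result may be one char shorter.
    static String safePrefix(String s, int end) {
        return s.substring(0, splitsPair(s, end) ? end - 1 : end);
    }

    // Encode to UTF-8 and decode again, as the original test did.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ab" followed by U+1F600, which takes two chars in UTF-16.
        String s = "ab" + new String(Character.toChars(0x1F600));

        String cut = s.substring(0, 3);          // ends with a lone high surrogate
        System.out.println(cut.equals(utf8RoundTrip(cut)));    // false

        String safe = safePrefix(s, 3);          // backs off to "ab"
        System.out.println(safe.equals(utf8RoundTrip(safe)));  // true
    }
}
```

Backing off by one char is only one possible policy. It keeps the result valid but means the prefix can be one char shorter than requested.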
And yes, I can't keep it as is, for the sake of not introducing LANG-100 again in commons-text.

> Doing `final String copy = new String(orig);` the value of the original
> string is completely copied onto the new string. So comparing the
> codepoints should do the trick. We may even want to add another assert
> statement before the for loop to confirm both strings have the same length?
>

What you suggested here works fine with no issues, and the lengths are the same too. The goal of doing this was to return exactly the length of string that the user asked for.

> Hope that helps,
> Bruno
>

Now I want the community's advice: why is RandomStringGenerator's .generate(int count) method designed to return the given number of code points rather than a String of that length? I'm OK with this approach as well, but can we have one more .generate that returns a String of exactly the given length? I found that when I pass 50 it returns a string of length ~70. As a commons dev that's fine, but as an application dev it's weird.

Regards,
Amey

> ________________________________
> From: Amey Jadiye <ameyjad...@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Monday, 11 September 2017 12:15 AM
> Subject: [text] Invalid unicode sequences on .substring of
> RandomStringGenerator
>
> > Hi Folks,
> >
> > While working on RandomStringGenerator I found that when I call .substring on
> > a generated random string, it fails intermittently on a surrogate pair
> > sequence.
> > The same bug was raised in commons-lang:
> > https://issues.apache.org/jira/browse/LANG-100
> >
> > Is this a possible bug in RandomStringGenerator, or is this expected?
> >
> > @Test
> > public void testSubStringWithSurrogatePair() {
> >     final int size = 5000;
> >     final Charset charset = Charset.forName("UTF-8");
> >     RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
> >     String orig = generator.generate(size).substring(0, 2500);
> >
> >     final byte[] bytes = orig.getBytes(charset);
> >     final String copy = new String(bytes, charset);
> >
> >     for (int i = 0; i < orig.length() && i < copy.length(); i++) {
> >         final char o = orig.charAt(i);
> >         final char c = copy.charAt(i);
> >         assertEquals("differs at " + i + "(" + Integer.toHexString(new Character(o).hashCode()) + ","
> >                 + Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
> >     }
> > }
> >
> > Regards,
> > Amey
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
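On the question in this thread about a count of code points versus String length: the sketch below uses only the JDK and shows why a string built from N code points can have length() greater than N. It also shows one way to take a prefix measured in code points so that no pair is split. LengthVsCodePoints and prefixByCodePoints are hypothetical names for illustration, not existing commons-text methods, and the sample string is hand-built rather than produced by RandomStringGenerator.

```java
// Illustration only: hypothetical helper, not an existing commons API.
final class LengthVsCodePoints {

    // Returns the prefix of s that holds the first `codePoints` code points
    // (or all of s if it has fewer). The cut always falls on a code point
    // boundary, so a surrogate pair is never split.
    static String prefixByCodePoints(String s, int codePoints) {
        int n = Math.min(codePoints, s.codePointCount(0, s.length()));
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // Four code points: 'a', U+1F600, 'b', U+10400.
        // The two supplementary code points each take two chars.
        String s = new StringBuilder()
                .appendCodePoint('a')
                .appendCodePoint(0x1F600)
                .appendCodePoint('b')
                .appendCodePoint(0x10400)
                .toString();

        System.out.println(s.length());                      // 6 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points

        String prefix = prefixByCodePoints(s, 2);            // 'a' plus U+1F600
        System.out.println(prefix.length());                 // 3 chars
    }
}
```

Conversely, a String of an exact char length cannot always be filled with whole supplementary code points: if one char is left over, only a BMP character fits there.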