Thanks very much for checking this, Bruno.

On Mon, Sep 11, 2017 at 3:05 PM, Bruno P. Kinoshita <
brunodepau...@yahoo.com.br.invalid> wrote:

> Hi Amey,
>
> You created a byte array from the original string (which may contain
> surrogate chars). But then you created a copy string with `final String
> copy = new String(bytes, charset);`. There will be encoding to UTF-8, which
> may fail to encode some values, leading to the error you reported I suspect.
>
I did this on purpose to check the LANG-100 issue, which was about
converting a string from UTF-16 (the default) to UTF-8 and back.
My expectation was that this test should pass cleanly.
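
To make the round trip concrete, here is a minimal sketch using only the
JDK. The class name `Utf8RoundTripDemo` and the sample strings are made up
for illustration; it is not the commons-text test. It relies on the
documented behaviour of `String.getBytes(Charset)`, which replaces
malformed input such as a lone surrogate with the charset's replacement
bytes instead of failing.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTripDemo {

    /** Encodes the string to UTF-8 bytes and decodes those bytes back. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 is stored as two UTF-16 chars: high surrogate D83D, low surrogate DE00.
        String whole = "ab\uD83D\uDE00";
        // Cutting at index 3 keeps the high surrogate and drops its partner.
        String cut = whole.substring(0, 3);

        // A well-formed pair survives the round trip unchanged.
        System.out.println(whole.equals(roundTrip(whole)));
        // The lone surrogate is replaced during encoding, so the copy differs.
        System.out.println(cut.equals(roundTrip(cut)));
    }
}
```

So the conversion itself is fine for well-formed strings; it only loses
data when the input already contains half of a pair.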


> If you try `final String copy = new String(bytes);` there will still be
> a conversion using the default system charset as well.
>
> So I think the safest is to compare codepoints. Perhaps with something
> like this:
>
>     @Test
>     public void testSubStringWithSurrogatePair() {
>         for (int j = 0; j < 10; j++) {
>             final int size = 5000;
>             RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
>             String orig = generator.generate(size).substring(0, 2500);
>             final String copy = new String(orig);
>             for (int i = 0; i < orig.length() && i < copy.length(); i++) {
>                 final int o = orig.codePointAt(i);
>                 final int c = copy.codePointAt(i);
>                 assertEquals(String.format("Differs where j = %d, i = %d, o = %d, and c = %d", j, i, o, c), o, c);
>             }
>         }
>     }
>
> Running it 10 times, I was able to consistently reproduce the initial
> issue. It would always fail, about 4 out of 10. I think [rng] or somewhere
> in another commons component I kind of remember seeing unit tests for
> random generated values using loops? But may be mistaken (I don't trust my
> own memory). So feel free to leave that part out if you prefer. I tried the
> code above with j going up to 1000. After a few seconds, the test passed
> too.
>

Yes, I did the same kind of testing and found that it happens
intermittently, whenever a surrogate pair falls on the position where I cut
the string (0 to 2500). If a pair straddles index 2500 it is cut in half and
the issue appears, which seems obvious now. And yes, I can't keep the test
as is, for the sake of not reintroducing LANG-100 in commons-text.
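
One way to avoid cutting a pair in half is to check the chars on either
side of the cut before calling substring. The sketch below is only an
illustration: `safePrefix` is a hypothetical helper, not something that
exists in commons-text or commons-lang.

```java
final class SurrogateSafeCut {

    /**
     * Hypothetical helper: returns at most maxChars chars from the start of s,
     * backing off by one char if the cut would separate a surrogate pair.
     */
    static String safePrefix(String s, int maxChars) {
        if (maxChars >= s.length()) {
            return s; // nothing to cut
        }
        int end = Math.max(maxChars, 0);
        // The cut splits a pair when the last kept char is a high surrogate
        // and the first dropped char is its low surrogate.
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE00cd";
        // Index 3 falls between the two halves of the pair, so only "ab" is kept.
        System.out.println(safePrefix(s, 3).length());
        // Index 4 falls after the pair, so the pair is kept whole.
        System.out.println(safePrefix(s, 4).length());
    }
}
```

The result can be one char shorter than requested, which is the trade-off
for never returning half of a pair.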


> Doing `final String copy = new String(orig);` the value of the original
> string is completely copied onto the new string. So comparing the
> codepoints should do the trick. We may even want to add another assert
> statement before the for loop to confirm both strings have the same length?
>
What you suggested here works fine with no issues, and the lengths are the
same too. My goal was to return a string of exactly the length asked for by
the user.


> Hope that helps,
> Bruno
>
>
Now I would like the community's advice: why is RandomStringGenerator's
.generate(int count) method designed so that it returns the given number of
code points rather than a String of that length? I'm OK with this approach
as well, but can we have one more .generate that returns a String of exactly
the given length? I found that when I pass 50 it returns a string of length
~70. As a commons developer that is fine, but as an application developer
it is weird.
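
To show the difference between the two ways of counting, here is a rough
sketch using only the JDK. Both methods are hypothetical stand-ins, not the
RandomStringGenerator implementation: they pick code points uniformly and
skip only the surrogate range, which is a simplification. `randomChars` is
one possible shape for a length-based variant.

```java
import java.util.Random;

final class CodePointsVersusLength {

    /** Picks a random code point outside the surrogate range. */
    private static int nextCodePoint(Random rnd) {
        int cp;
        do {
            cp = rnd.nextInt(Character.MAX_CODE_POINT + 1);
        } while (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
        return cp;
    }

    /** Hypothetical stand-in for a generator that counts code points. */
    static String randomCodePoints(int count, Random rnd) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            // Supplementary code points append two chars each.
            sb.appendCodePoint(nextCodePoint(rnd));
        }
        return sb.toString();
    }

    /** Hypothetical variant that returns exactly `length` chars. */
    static String randomChars(int length, Random rnd) {
        StringBuilder sb = new StringBuilder(length);
        while (sb.length() < length) {
            int cp = nextCodePoint(rnd);
            // Skip a two-char code point when only one slot is left.
            if (Character.charCount(cp) <= length - sb.length()) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        String byCodePoints = randomCodePoints(50, rnd);
        // 50 code points, but usually more than 50 chars.
        System.out.println(byCodePoints.codePointCount(0, byCodePoints.length()));
        System.out.println(byCodePoints.length());
        // Exactly 50 chars, usually fewer than 50 code points.
        System.out.println(randomChars(50, rnd).length());
    }
}
```

With a length-based variant the number of code points is no longer fixed,
so the two methods would answer different questions.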

Regards,
Amey




> ________________________________
>
>
> From: Amey Jadiye <ameyjad...@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Monday, 11 September 2017 12:15 AM
> Subject: [text] Invalid unicode sequences on .substring of
> RandomStringGenerator
>
>
>
> Hi Folks,
>
>
> While working on RandomStringGenerator I found that calling .substring on
> a generated random string fails intermittently when the string contains
> surrogate pairs.
>
> The same bug was raised in commons-lang:
> https://issues.apache.org/jira/browse/LANG-100
>
> Is this a possible bug in RandomStringGenerator, or is it expected?
>
>
> @Test
> public void testSubStringWithSurrogatePair() {
>     final int size = 5000;
>     final Charset charset = Charset.forName("UTF-8");
>     RandomStringGenerator generator = new RandomStringGenerator.Builder().build();
>     String orig = generator.generate(size).substring(0, 2500);
>
>     final byte[] bytes = orig.getBytes(charset);
>     final String copy = new String(bytes, charset);
>
>     for (int i = 0; i < orig.length() && i < copy.length(); i++) {
>         final char o = orig.charAt(i);
>         final char c = copy.charAt(i);
>         assertEquals("differs at " + i + "(" + Integer.toHexString(new Character(o).hashCode()) + ","
>                 + Integer.toHexString(new Character(c).hashCode()) + ")", o, c);
>     }
> }
>
>
> Regards,
>
> Amey
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>
> For additional commands, e-mail: dev-h...@commons.apache.org
>


