Re: Inadvertent copies of test data in L2/17-197 ?

2017-08-07 Thread Henri Sivonen via Unicode
On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst wrote:
> I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.

It contains the test outputs, with identical results shown only once
(this applies to the output exhibiting the spec-following behavior and
to the output exhibiting the one-REPLACEMENT-CHARACTER-per-bogus-byte
behavior). Since the input doesn't make sense in a PDF, the document
only says where to find it
(https://hsivonen.fi/broken-utf-8/test.html).

> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
>
> Can you check/confirm? Any idea what might have caused this?

The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each:
https://hsivonen.fi/broken-utf-8/spec.html
https://hsivonen.fi/broken-utf-8/one-per-byte.html
https://hsivonen.fi/broken-utf-8/win32.html
https://hsivonen.fi/broken-utf-8/java.html
https://hsivonen.fi/broken-utf-8/python2.html with non-conforming
output replaced with italic text saying what the bytes were
https://hsivonen.fi/broken-utf-8/perl5.html
https://hsivonen.fi/broken-utf-8/icu.html

I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no
duplicates).

For reference, I tested the Ruby standard library with the following program:

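# Read the test file as UTF-8; reading alone does not validate the bytes.
data = IO.read("test.html", encoding: "UTF-8")
# Round-trip through UTF-16LE so that invalid sequences get replaced.
encoded = data.encode("UTF-16LE", :invalid=>:replace).encode("UTF-8")
IO.write("ruby.html", encoded)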

...where test.html was the file available at
https://hsivonen.fi/broken-utf-8/test.html
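
The same round trip can be tried on a short in-memory string instead of
the whole test file. The following is only a minimal sketch: the helper
name replace_invalid_utf8 and the sample byte strings are made up for
illustration and are not part of the test suite.

```ruby
# Hypothetical helper (not from the original message): decode possibly
# malformed UTF-8 by round-tripping through UTF-16LE, as in the program above.
def replace_invalid_utf8(bytes)
  # Tag the raw bytes as UTF-8; force_encoding does not validate them.
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  # Transcoding forces an actual decode; invalid input becomes U+FFFD.
  text.encode(Encoding::UTF_16LE, invalid: :replace).encode(Encoding::UTF_8)
end

# 0xFF can never appear in well-formed UTF-8.
lone_byte = replace_invalid_utf8("a\xFFb".b)
puts lone_byte.valid_encoding?

# First two bytes of a three-byte sequence (U+20AC), cut short.
truncated = replace_invalid_utf8("x\xE2\x82y".b)
puts truncated.count("\uFFFD")
```

How many U+FFFD characters come out for a longer bogus sequence depends
on the decoder, which is why the script prints the count for the
truncated sequence rather than assuming it.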

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Inadvertent copies of test data in L2/17-197 ?

2017-08-07 Thread Martin J. Dürst via Unicode

Hello Henri,

I just had a look at 
http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf to use the test 
data in there for Ruby.


I was under the impression from previous looks at it that it contained a 
lot of test data. However, when I looked at the test data more carefully 
(I had read the text before the test data carefully at least two times 
before, but not looked at the test data in that much detail), I 
discovered that there might be up to 7 copies of the same data. The 
first one starts on page 9, and then there's a new one about every 4 or 
5 pages.


Can you check/confirm? Any idea what might have caused this?

Regards,   Martin.