Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()

Keegan Witt Tue, 09 Jun 2015 04:50:57 -0700

No, it's a Groovy bug.

private static void writeUTF16BomIfRequired(final String charset,
        final OutputStream stream) throws IOException {
    if ("UTF-16BE".equals(charset)) {
        writeUtf16Bom(stream, true);
    } else if ("UTF-16LE".equals(charset)) {
        writeUtf16Bom(stream, false);
    }
}


should be

private static void writeUTF16BomIfRequired(final String charset,
        final OutputStream stream) throws IOException {
    if ("UTF-16BE".equals(Charset.forName(charset).name())) {
        writeUtf16Bom(stream, true);
    } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
        writeUtf16Bom(stream, false);
    }
}

in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll probably want
to fix that regardless of what we decide on the *withPrintWriter*
question.  I'll open a Jira and a PR.
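
To illustrate why the comparison needs the canonical name, here is a minimal
Java sketch.  The class BomCharsetCheck and the helper bomFor are hypothetical
names used only for illustration, not the actual Groovy code.  The sketch
relies on charset names being case-insensitive, so Charset.forName("utf-16le")
resolves to the charset whose canonical name is "UTF-16LE".

```java
import java.nio.charset.Charset;

class BomCharsetCheck {
    // Hypothetical helper mirroring the proposed fix: resolve the requested
    // name to its canonical form before deciding which BOM (if any) applies.
    static byte[] bomFor(String charsetName) {
        String canonical = Charset.forName(charsetName).name();
        if ("UTF-16BE".equals(canonical)) {
            return new byte[] {(byte) 0xFE, (byte) 0xFF};
        } else if ("UTF-16LE".equals(canonical)) {
            return new byte[] {(byte) 0xFF, (byte) 0xFE};
        }
        return new byte[0]; // any other charset: no BOM in this sketch
    }

    public static void main(String[] args) {
        // A plain string comparison misses a lowercase spelling.
        System.out.println("UTF-16LE".equals("utf-16le"));      // false
        // The canonical name matches regardless of the spelling passed in.
        System.out.println(Charset.forName("utf-16le").name()); // UTF-16LE
        System.out.println(bomFor("utf-16le").length);          // 2
    }
}
```

Note that Charset.forName throws IllegalCharsetNameException or
UnsupportedCharsetException for names it cannot resolve, so a real fix may
need to decide how to handle those.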

-Keegan


On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <[email protected]>
wrote:

> From Groovy's point of view (i.e. when you're coding in Groovy), the BOM is
> automatically discarded when you use one of our reader methods (withReader,
> etc.), so it's transparent whether the BOM is there or not.
>
> I tend to think that having the BOM always is a good thing (I even thought
> that was mandatory), but Groovy should guess the endianness regardless
> anyway.
>
> Happy to hear what others think too about all this though.
>
> Guillaume
>
>
> 2015-06-08 23:20 GMT+02:00 Keegan Witt <[email protected]>:
>
>> The code as-is today writes the BOM regardless of platform.  I just
>> tested in Linux with the same results.  I think there are 2 parts to the
>> question of "what's the correct behavior?"
>>
>> 1.  Should the BOM be written at all, particularly when the platform is
>> Windows?
>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>
>> *Discussion*
>> 1.  Strictly speaking, yes, because RFC 2781
>> <http://tools.ietf.org/html/rfc2781> says in section 4.3 to assume big
>> endian if there is no BOM.  However, in practice, many applications
>> disregard the RFC and assume little-endian because that's what Windows
>> does
>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>> Because of this, the behavior could be changed so that when writing
>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>> best practice to always write a BOM when working with UTF-16, and Java
>> should have done this in its implementation of PrintWriter.
>>
>> 2.  This is a tough one.  Arguably, *withPrintWriter* has the smarter,
>> more correct behavior, but the typical user would assume it is just a
>> shorthand convenience for newing up a PrintWriter (I certainly
>> did).  So the question is, is it better to just document this difference in
>> the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the
>> latter, what breakages would that cause within Groovy itself?  Making that
>> change could break folks in production, because they could rely on that BOM
>> being there, in cases for example where the file is created on Windows, but
>> then processed on Linux or when working with a third party library that is
>> more picky about the presence of a BOM.
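
The RFC 2781 section 4.3 rule mentioned in point 1 can be sketched in Java as
follows.  The class Utf16Sniffer and the method detect are hypothetical names
for illustration; this is only a sketch of the "look for a BOM, otherwise
assume big-endian" rule, not how Groovy's reader methods are implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16Sniffer {
    // Hypothetical helper: pick a decoder for bytes labelled plain "UTF-16".
    static Charset detect(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        // No BOM: RFC 2781 section 4.3 says to interpret the text as big-endian.
        return StandardCharsets.UTF_16BE;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};
        byte[] withoutBom = {0x20, 0x00};
        System.out.println(detect(withBom));    // UTF-16LE
        System.out.println(detect(withoutBom)); // UTF-16BE
        // Little-endian bytes read as big-endian give U+2000, not a space.
        String decoded = new String(withoutBom, detect(withoutBom));
        System.out.println(Integer.toHexString(decoded.charAt(0))); // 2000
    }
}
```

The last line shows the interoperability risk: a BOM-less little-endian file
read by a strictly RFC-following consumer decodes to the wrong characters.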
>>
>> -Keegan
>>
>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <[email protected]>
>> wrote:
>>
>>> Now... whether that is what should be done is the real question :-)
>>> Does Windows manage to open UTF-16 files without BOMs?
>>>
>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <[email protected]>:
>>>
>>>> I forgot to mention that.  Yes, I ran the test mentioned in Windows.
>>>>
>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <[email protected]>
>>>> wrote:
>>>>
>>>>> That's a good question.
>>>>> I guess this is happening on Windows? (I haven't tried here, since I'm
>>>>> on OS X)
>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>
>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <[email protected]>:
>>>>>
>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>> problems.  I was intrigued by this SO question
>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>
>>>>>> It appears using withPrintWriter(charset) produces a BOM whereas new
>>>>>> PrintWriter(file, charset) does not, as demonstrated here:
>>>>>>
>>>>>> File file = new File("tmp.txt")
>>>>>> try {
>>>>>>     String text = " "
>>>>>>     String charset = "UTF-16LE"
>>>>>>
>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>     println "withPrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>
>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>     w.print(text)
>>>>>>     w.close()
>>>>>>     println "\n\nnew PrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>> } finally {
>>>>>>     file.delete()
>>>>>> }
>>>>>>
>>>>>> Outputs
>>>>>>
>>>>>> withPrintWriter
>>>>>> ff fe 20 00
>>>>>>
>>>>>> new PrintWriter
>>>>>> 20 00
>>>>>>
>>>>>>
>>>>>> Is this difference in behavior intentional?  It seems kinda odd to me.
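
For comparison, the plain JDK side of this can be checked without touching
the file system.  The sketch below (the class name Utf16EncoderDemo and the
helper hex are made up for illustration) encodes the same one-space string
with the three standard UTF-16 charsets.  Per the java.nio.charset.Charset
documentation, the UTF-16LE and UTF-16BE encoders do not write a byte-order
mark, while plain UTF-16 encodes big-endian and does write one.

```java
import java.nio.charset.StandardCharsets;

class Utf16EncoderDemo {
    // Hypothetical helper: format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = " ";
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 20 00
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // fe ff 00 20
    }
}
```

So the "20 00" from new PrintWriter(file, "UTF-16LE") matches what the JDK
encoder does on its own; the extra "ff fe" comes from the Groovy method.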
>>>>>>
>>>>>> -Keegan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Laforge
>>>>> Groovy Project Manager
>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>
>>>>> Blog: http://glaforge.appspot.com/
>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
