No, it's a Groovy bug.

    private static void writeUTF16BomIfRequired(final String charset,
            final OutputStream stream) throws IOException {
        if ("UTF-16BE".equals(charset)) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(charset)) {
            writeUtf16Bom(stream, false);
        }
    }

should be

    private static void writeUTF16BomIfRequired(final String charset,
            final OutputStream stream) throws IOException {
        if ("UTF-16BE".equals(Charset.forName(charset).name())) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
            writeUtf16Bom(stream, false);
        }
    }

in org.codehaus.groovy.runtime.ResourceGroovyMethods. We'll probably want
to fix that regardless of what we decide on the *withPrintWriter*
question. I'll open a Jira and a PR.
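
To illustrate why the canonical-name comparison matters, here is a minimal,
self-contained sketch. It is not the actual Groovy source: the class name
BomAliasSketch, the bomHex helper, and the body of writeUtf16Bom are my own
stand-ins for illustration. It relies only on Charset.forName(...).name()
returning the canonical charset name, so that a lower-case spelling or an
alias such as "UnicodeLittleUnmarked" is treated the same as "UTF-16LE".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class BomAliasSketch {

    // Stand-in for the private helper in ResourceGroovyMethods (assumed body):
    // FE FF marks big-endian, FF FE marks little-endian.
    static void writeUtf16Bom(OutputStream stream, boolean bigEndian) throws IOException {
        if (bigEndian) {
            stream.write(0xFE);
            stream.write(0xFF);
        } else {
            stream.write(0xFF);
            stream.write(0xFE);
        }
    }

    // The proposed fix: compare against the canonical charset name, so that
    // aliases and different letter case still match.
    static void writeUTF16BomIfRequired(String charset, OutputStream stream) throws IOException {
        String canonical = Charset.forName(charset).name();
        if ("UTF-16BE".equals(canonical)) {
            writeUtf16Bom(stream, true);
        } else if ("UTF-16LE".equals(canonical)) {
            writeUtf16Bom(stream, false);
        }
    }

    // Hypothetical helper: returns the bytes written for a charset name as hex.
    static String bomHex(String charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeUTF16BomIfRequired(charset, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : out.toByteArray()) {
            hex.append(String.format("%02x ", b));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        String[] names = {"UTF-16LE", "utf-16le", "UnicodeLittleUnmarked", "UTF-16BE", "UTF-8"};
        for (String name : names) {
            System.out.println(name + " -> [" + bomHex(name) + "]");
        }
    }
}
```

With the plain string comparison in the current code, only the exact
spellings "UTF-16LE" and "UTF-16BE" would get a BOM. With the canonical-name
comparison, "utf-16le" and "UnicodeLittleUnmarked" produce ff fe as well, and
"UTF-8" still produces nothing.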
-Keegan
On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <[email protected]>
wrote:
> From Groovy's point of view (ie. when you're coding in Groovy), the BOM is
> automatically discarded when you use one of our reader methods (withReader,
> etc), so it's transparent whether the BOM is here or not.
>
> I tend to think that having the BOM always is a good thing (I even thought
> that was mandatory), but Groovy should guess the endianness regardless
> anyway.
>
> Happy to hear what others think too about all this though.
>
> Guillaume
>
>
> 2015-06-08 23:20 GMT+02:00 Keegan Witt <[email protected]>:
>
>> The code as-is today writes the BOM regardless of platform. I just
>> tested in Linux with the same results. I think there are 2 parts to the
>> question of "what's the correct behavior?"
>>
>> 1. Should the BOM be written at all, particularly when the platform is
>> Windows?
>> 2. Should the behavior of *withPrintWriter* differ (even if the
>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>
>> *Discussion*
>> 1. Strictly speaking, yes, because RFC 2781
>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 that text
>> without a BOM should be interpreted as big-endian. However, in practice,
>> many applications disregard the RFC and assume little-endian because
>> that's what Windows does
>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>> Because of this, the behavior could be changed so that when writing
>> UTF-16LE on Windows, it doesn't write the BOM. But in my opinion, it's
>> best practice to always write a BOM when working with UTF-16, and Java
>> should have done this in their implementation of their PrintWriter.
>>
>> 2. This is a tough one. Arguably, *withPrintWriter* has the smarter,
>> more correct behavior, but the typical user would assume this is
>> just a shorthand convenience for newing up a PrintWriter (I certainly
>> did). So the question is, is it better to just document this difference in
>> the GroovyDoc? Or to change the behavior to be closer to Java? And if the
>> latter, what breakages would that cause within Groovy itself? Making that
>> change could break folks in production, because they could rely on that BOM
>> being there, in cases for example where the file is created on Windows, but
>> then processed on Linux or when working with a third party library that is
>> more picky about the presence of a BOM.
>>
>> -Keegan
>>
>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <[email protected]>
>> wrote:
>>
>>> Now... whether that's what should be done is the good question to ask :-)
>>> Does Windows manage to open UTF-16 files without BOMs?
>>>
>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <[email protected]>:
>>>
>>>> I forgot to mention that. Yes, I ran the test mentioned in Windows.
>>>>
>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <[email protected]>
>>>> wrote:
>>>>
>>>>> That's a good question.
>>>>> I guess this is happening on Windows? (I haven't tried here, since I'm
>>>>> on OS X)
>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>
>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <[email protected]>:
>>>>>
>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>> problems. I was intrigued by this SO question
>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>
>>>>>> It appears that using withPrintWriter(charset) produces a BOM, whereas
>>>>>> new PrintWriter(file, charset) does not, as demonstrated here:
>>>>>>
>>>>>> File file = new File("tmp.txt")
>>>>>> try {
>>>>>>     String text = " "
>>>>>>     String charset = "UTF-16LE"
>>>>>>
>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>     println "withPrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>
>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>     w.print(text)
>>>>>>     w.close()
>>>>>>     println "\n\nnew PrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>> } finally {
>>>>>>     file.delete()
>>>>>> }
>>>>>>
>>>>>> Outputs
>>>>>>
>>>>>> withPrintWriter
>>>>>> ff fe 20 00
>>>>>>
>>>>>> new PrintWriter
>>>>>> 20 00
>>>>>>
>>>>>>
>>>>>> Is this difference in behavior intentional? It seems kinda odd to me.
>>>>>>
>>>>>> -Keegan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Laforge
>>>>> Groovy Project Manager
>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>
>>>>> Blog: http://glaforge.appspot.com/
>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>