2015-06-09 18:57 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:

> I created PR 37 <https://github.com/apache/incubator-groovy/pull/37> to
> correct the JavaDoc I mentioned (as well as to document the existing
> behavior for the non-NIO methods).
>
> Java doesn't eat the BOM, but this is a problem Java folks are used to
> dealing with, and why things like Apache Commons IO's BOMInputStream
> <https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html>
> exist.
>
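To make the "Java doesn't eat the BOM" point concrete, here is a minimal Java sketch. The class name BomDecodeDemo and its two helper methods are made up for illustration, and it assumes Java 7+ for StandardCharsets. It decodes the same four bytes that the withPrintWriter test quoted further down produced: with the explicit UTF-16LE charset the BOM stays in the text as U+FEFF, while the generic UTF-16 charset uses it to pick the byte order and drops it.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Little-endian BOM (ff fe) followed by a single space (20 00).
    static final byte[] WITH_BOM = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};

    // Explicit-endian charset: the BOM is decoded as an ordinary character.
    static String decodeExplicitLE() {
        return new String(WITH_BOM, StandardCharsets.UTF_16LE);
    }

    // Generic UTF-16 charset: the BOM selects the byte order and is consumed.
    static String decodeGenericUtf16() {
        return new String(WITH_BOM, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String le = decodeExplicitLE();
        String generic = decodeGenericUtf16();
        System.out.println("UTF-16LE length: " + le.length()
                + ", starts with U+FEFF: " + (le.charAt(0) == (char) 0xFEFF));
        System.out.println("UTF-16 length: " + generic.length());
    }
}
```

A reader that wants the explicit charset and no stray U+FEFF has to skip the first character itself, which is the job BOMInputStream does on the byte level.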
That's also why I made Groovy eat the BOM too, so that it's transparent to our users :-)
But it's been a long time since I worked on those parts of the codebase, and they've been refactored quite a bit since (by Jim, for example).

>
> -Keegan
>
> On Tue, Jun 9, 2015 at 11:33 AM, Guillaume Laforge <glafo...@gmail.com> wrote:
>
>> So now, how to decide what's best? :-)
>>
>> Is a Java reader happy with the BOM, and does it eat it transparently?
>> (I think in the past that wasn't the case but I may be wrong)
>>
>> 2015-06-09 17:21 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>
>>> That's an excellent point, Paolo. NioGroovyMethods.newWriter claims
>>> (in the JavaDoc) it will write the BOM if needed, but it doesn't because
>>> it uses Java's implementation rather than Groovy's
>>> writeUTF16BomIfRequired. None of the methods in NioGroovyMethods use
>>> writeUTF16BomIfRequired.
>>>
>>> Whichever we decide, we should be consistent.
>>>
>>> -Keegan
>>>
>>> On Tue, Jun 9, 2015 at 11:08 AM, Paolo Di Tommaso <paolo.ditomm...@gmail.com> wrote:
>>>
>>>> I'm wondering if the NioGroovyMethods that implement the write methods
>>>> for Path should do the same.
>>>>
>>>> Cheers,
>>>> Paolo
>>>>
>>>> On Tue, Jun 9, 2015 at 4:02 PM, Keegan Witt <keeganw...@gmail.com> wrote:
>>>>
>>>>> Cool. I'll wait for PR 36 to be merged first, because I also was
>>>>> thinking the Javadoc would be changed from
>>>>> is "UTF-16BE" or "UTF-16LE"
>>>>> to
>>>>> is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)
>>>>>
>>>>> -Keegan
>>>>>
>>>>> On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <glafo...@gmail.com> wrote:
>>>>>
>>>>>> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>
>>>>>>> Created GROOVY-7461
>>>>>>> <https://issues.apache.org/jira/browse/GROOVY-7461> and PR 36
>>>>>>> <https://github.com/apache/incubator-groovy/pull/36>.
>>>>>>>
>>>>>>
>>>>>> Cool!
>>>>>>
>>>>>>> How would you feel about a PR to copy the Javadoc comment mentioning
>>>>>>> the UTF-16 BOM on File.newWriter to all the other methods that use
>>>>>>> writeUTF16BomIfRequired (at least until we decide we're going to
>>>>>>> change the current behavior)?
>>>>>>>
>>>>>>
>>>>>> Right, worth it!
>>>>>>
>>>>>>>
>>>>>>> -Keegan
>>>>>>>
>>>>>>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <glafo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Good point!
>>>>>>>>
>>>>>>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>>>
>>>>>>>>> That's only available in Java 7. Isn't Groovy still targeting 1.6
>>>>>>>>> for the non-indy version?
>>>>>>>>>
>>>>>>>>> -Keegan
>>>>>>>>>
>>>>>>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glafo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Well spotted!
>>>>>>>>>>
>>>>>>>>>> You could also compare with the StandardCharsets constants, instead
>>>>>>>>>> of going through the name comparison:
>>>>>>>>>>
>>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>>>>>>
>>>>>>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> No, it's a Groovy bug.
>>>>>>>>>>>
>>>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset,
>>>>>>>>>>>         final OutputStream stream) throws IOException {
>>>>>>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> should be
>>>>>>>>>>>
>>>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset,
>>>>>>>>>>>         final OutputStream stream) throws IOException {
>>>>>>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods. We'll
>>>>>>>>>>> probably want to fix that regardless of what we decide on the
>>>>>>>>>>> *withPrintWriter* question. I'll open a Jira and a PR.
>>>>>>>>>>>
>>>>>>>>>>> -Keegan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <glafo...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From Groovy's point of view (ie. when you're coding in Groovy),
>>>>>>>>>>>> the BOM is automatically discarded when you use one of our reader
>>>>>>>>>>>> methods (withReader, etc), so it's transparent whether the BOM is
>>>>>>>>>>>> here or not.
>>>>>>>>>>>>
>>>>>>>>>>>> I tend to think that having the BOM always is a good thing (I
>>>>>>>>>>>> even thought that was mandatory), but Groovy should guess the
>>>>>>>>>>>> endianness regardless anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> Happy to hear what others think too about all this though.
>>>>>>>>>>>>
>>>>>>>>>>>> Guillaume
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> The code as-is today writes the BOM regardless of platform. I
>>>>>>>>>>>>> just tested on Linux with the same results. I think there are 2
>>>>>>>>>>>>> parts to the question of "what's the correct behavior?"
>>>>>>>>>>>>>
>>>>>>>>>>>>>    1. Should the BOM be written at all, particularly when the
>>>>>>>>>>>>>    platform is Windows?
>>>>>>>>>>>>>    2. Should the behavior of *withPrintWriter* differ (even if
>>>>>>>>>>>>>    the difference is to be smarter) from the behavior of *new
>>>>>>>>>>>>>    PrintWriter*?
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Discussion*
>>>>>>>>>>>>> 1. Strictly speaking, yes, because RFC 2781
>>>>>>>>>>>>> <http://tools.ietf.org/html/rfc2781> says in section 4.3 to
>>>>>>>>>>>>> assume big-endian if there is no BOM. However, in practice, many
>>>>>>>>>>>>> applications disregard the RFC and assume little-endian because
>>>>>>>>>>>>> that's what Windows does
>>>>>>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>>>>>>> Because of this, the behavior could be changed so that when
>>>>>>>>>>>>> writing UTF-16LE on Windows, it doesn't write the BOM. But in my
>>>>>>>>>>>>> opinion, it's best practice to always write a BOM when working
>>>>>>>>>>>>> with UTF-16, and Java should have done this in their
>>>>>>>>>>>>> implementation of PrintWriter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. This is a tough one. Arguably, *withPrintWriter* is doing the
>>>>>>>>>>>>> smarter, more correct thing, but the typical user would assume
>>>>>>>>>>>>> this is just a shorthand convenience for newing up a PrintWriter
>>>>>>>>>>>>> (I certainly did). So the question is, is it better to just
>>>>>>>>>>>>> document this difference in the GroovyDoc? Or to change the
>>>>>>>>>>>>> behavior to be closer to Java? And if the latter, what breakages
>>>>>>>>>>>>> would that cause within Groovy itself? Making that change could
>>>>>>>>>>>>> break folks in production, because they could rely on that BOM
>>>>>>>>>>>>> being there, for example in cases where the file is created on
>>>>>>>>>>>>> Windows but then processed on Linux, or when working with a
>>>>>>>>>>>>> third-party library that is pickier about the presence of a BOM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glafo...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now... whether that's what should be done is the real question
>>>>>>>>>>>>>> to ask :-)
>>>>>>>>>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I forgot to mention that. Yes, I ran the test mentioned on
>>>>>>>>>>>>>>> Windows.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glafo...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That's a good question.
>>>>>>>>>>>>>>>> I guess this is happening on Windows? (I haven't tried
>>>>>>>>>>>>>>>> here, since I'm on OS X)
>>>>>>>>>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganw...@gmail.com>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've always taken a perverse pleasure in character
>>>>>>>>>>>>>>>>> encoding problems. I was intrigued by this SO question
>>>>>>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>>>>>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM
>>>>>>>>>>>>>>>>> whereas new PrintWriter(file, charset) does not. As
>>>>>>>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>>>>>>>>>> try {
>>>>>>>>>>>>>>>>>     String text = " "
>>>>>>>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>> } finally {
>>>>>>>>>>>>>>>>>     file.delete()
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Outputs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>>>>>>> 20 00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is this difference in behavior intentional? It seems
>>>>>>>>>>>>>>>>> kinda odd to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Keegan

--
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>
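
P.S. For reference, the alias issue from the quoted thread can be sketched in plain Java. This is only an illustration and not the code in ResourceGroovyMethods: the class Utf16BomSketch and the helper bomFor are hypothetical names. The helper assumes that resolving the name through Charset.forName before comparing is enough to make the check case- and alias-insensitive. It sticks to Java 6 APIs, since StandardCharsets is Java 7 only.

```java
import java.nio.charset.Charset;

public class Utf16BomSketch {
    /**
     * Hypothetical helper: returns the BOM bytes for an explicit-endian
     * UTF-16 charset name (or alias), or an empty array for anything else.
     * Charset.forName throws an unchecked exception for unknown names.
     */
    static byte[] bomFor(String charsetName) {
        // Resolve the name first so "utf-16le" and "UTF-16LE" compare equal.
        String canonical = Charset.forName(charsetName).name();
        if ("UTF-16BE".equals(canonical)) {
            return new byte[] {(byte) 0xFE, (byte) 0xFF};
        }
        if ("UTF-16LE".equals(canonical)) {
            return new byte[] {(byte) 0xFF, (byte) 0xFE};
        }
        return new byte[0];
    }

    public static void main(String[] args) throws Exception {
        // A plain string comparison would miss the lowercase spelling.
        System.out.println("utf-16le".equals("UTF-16LE"));
        System.out.println(bomFor("utf-16le").length);
        // Java's own encoders: no BOM for UTF-16LE, a big-endian BOM for UTF-16.
        System.out.println(" ".getBytes("UTF-16LE").length);
        System.out.println(" ".getBytes("UTF-16").length);
    }
}
```

The last two lines of main show the Java side of the difference: the explicit-endian encoder emits only the two bytes of the space, while the generic UTF-16 encoder prepends a BOM.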