byte-order marks

2013-01-28 Thread Andy Wingo
What do people think about this attached patch? Andy From 831c3418941f2d643f91e3076ef9458f700a2c59 Mon Sep 17 00:00:00 2001 From: Andy Wingo Date: Mon, 28 Jan 2013 22:41:34 +0100 Subject: [PATCH] detect and consume byte-order marks for textual ports * libguile/read.c (scm_i_scan_for_encod

Re: byte-order marks

2013-01-28 Thread Mike Gran
> What do people think about this attached patch? > > Andy If you find the word "coding" by scanning 8-bit char by 8-bit char, it can't be UTF-16, since that would be more like "c o d i n g :" with nulls interspersed. While rather unlikely, it is a theoretical possibility that a doc in encoding
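The point about scanning byte by byte can be illustrated with a small sketch. It is not code from the patch: `contains_ascii_token` and both sample buffers are hypothetical names made up for this illustration. It only shows that a plain 8-bit search finds "coding:" in 8-bit text but not in the same text encoded as UTF-16LE, where every ASCII character is followed by a zero byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Guile): return 1 if the bytes of
   TOKEN occur contiguously anywhere in BUF, 0 otherwise.  */
static int
contains_ascii_token (const unsigned char *buf, size_t len,
                      const char *token)
{
  size_t tlen = strlen (token);
  size_t i;

  if (tlen == 0 || tlen > len)
    return 0;
  for (i = 0; i + tlen <= len; i++)
    if (memcmp (buf + i, token, tlen) == 0)
      return 1;
  return 0;
}

/* A coding declaration as 8-bit text (ASCII, Latin-1, UTF-8, ...).  */
static const unsigned char sample_8bit[] = ";; coding: x";

/* The characters "coding:" as UTF-16LE: each ASCII byte is followed
   by a zero byte, so the 8-bit token never appears contiguously.  */
static const unsigned char sample_utf16le[] = {
  'c', 0, 'o', 0, 'd', 0, 'i', 0, 'n', 0, 'g', 0, ':', 0
};
```

Under these assumptions, a successful 8-bit match for "coding:" already implies that the bytes being scanned are not UTF-16.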

Re: byte-order marks

2013-01-29 Thread Mark H Weaver
built using Guile, and on that basis would advocate removing the existing cleverness from 'open-input-file' in stable-2.0. At the very least it should be removed from master. Regarding byte-order marks, my preference is that users should explicitly consume BOMs if that's w

Re: byte-order marks

2013-01-29 Thread Andy Wingo
-file' in stable-2.0. At the > very least it should be removed from master. I agree as well. Want to make a patch? > Regarding byte-order marks, my preference is that users should explicitly > consume BOMs if that's what they want (ideally using some convenience > procedure

Re: byte-order marks

2013-01-29 Thread Andy Wingo
On Mon 28 Jan 2013 23:20, Mike Gran writes: > So if there is a "coding:" line in the doc, I think it > should nullify giving precedence to a UTF-16 BOM. OK. Cheers, Andy -- http://wingolog.org/

Re: byte-order marks

2013-01-29 Thread Ludovic Courtès
Andy Wingo skribis: [...] >> Regarding byte-order marks, my preference is that users should explicitly >> consume BOMs if that's what they want (ideally using some convenience >> procedure provided by Guile). Sometimes consuming the BOM is the wrong >> thing.

Re: byte-order marks

2013-01-29 Thread Andy Wingo
Hi, [Ludo and Mark and I scribas]: >>> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of >>> the stream, but *only* if the BOM is already in the encoding specified >>> by the user (possibly via an explicit call to 'file-encoding'). >> >> The problem is that we have no wa

Re: byte-order marks

2013-01-29 Thread Mark H Weaver
Hi, l...@gnu.org (Ludovic Courtès) writes: >> For textual files, it doesn’t seem unreasonable for ‘open-input-file’ to >> consume the BOM, IMO. It’s not much different from the ‘eol-style’ >> transcoders. Andy Wingo writes: > I could go either way. I would prefer for open-input-file to consume

Re: byte-order marks

2013-01-29 Thread Mark H Weaver
I wrote: > Having slept on this, I think I agree that 'open-input-file' should > auto-consume BOMs. On the other hand, there's a nasty complication. Of course (open-input-file FILENAME) is just (open-file FILENAME "r"), so the auto-consuming logic should be in 'open-file'. So what should (open-f

Re: byte-order marks

2013-01-29 Thread Neil Jerram
Andy Wingo writes: > What do people think about this attached patch? > > Andy > > >>From 831c3418941f2d643f91e3076ef9458f700a2c59 Mon Sep 17 00:00:00 2001 > From: Andy Wingo > Date: Mon, 28 Jan 2013 22:41:34 +0100 > Subject: [PATCH] detect and consume byte-order ma

Re: byte-order marks

2013-01-29 Thread Ludovic Courtès
Mark H Weaver skribis: >>> However, there’s no way to open a file in binary mode when using >>> ‘open-input-file’, ‘call-with-input-file’, etc. >> >> We can add keyword or optional arguments of course. (Not suggesting >> that we do so at this time though.) > > This has been on my TODO list for a

Re: byte-order marks

2013-01-29 Thread Andy Wingo
On Tue 29 Jan 2013 20:22, Neil Jerram writes: > (define (read-csv file-name) > (let ((s (utf16->string (get-bytevector-all (open-input-file file-name)) > 'little))) > > ;; Discard possible byte order mark. > (if (and (>= (string-length s) 1) >(char=?

Re: byte-order marks

2013-01-29 Thread Neil Jerram
Andy Wingo writes: > On Tue 29 Jan 2013 20:22, Neil Jerram writes: > >> (define (read-csv file-name) >> (let ((s (utf16->string (get-bytevector-all (open-input-file file-name)) >>'little))) >> >> ;; Discard possible byte order mark. >> (if (and (>= (string-lengt

Re: byte-order marks

2013-01-29 Thread Ludovic Courtès
Mark H Weaver skribis: > I wrote: >> Having slept on this, I think I agree that 'open-input-file' should >> auto-consume BOMs. Good. > So what should (open-file FILENAME "r+") do? What about doing the same as for just “r”? I can’t think of any reasonable scenario where this could be a problem

Re: byte-order marks

2013-01-30 Thread Andy Wingo
BOM is already in the previously specified encoding. I will punt on this one. From 5512fe4f93e4e583ab538ae02dd98e5825252dc9 Mon Sep 17 00:00:00 2001 From: Andy Wingo Date: Wed, 30 Jan 2013 10:17:25 +0100 Subject: [PATCH] detect and consume byte-order marks for textual ports * libguile/ports.h: * libguile/ports.

Re: byte-order marks

2013-01-30 Thread Ludovic Courtès
tle brain and mailbox don’t get confused? :-) > From 5512fe4f93e4e583ab538ae02dd98e5825252dc9 Mon Sep 17 00:00:00 2001 > From: Andy Wingo > Date: Wed, 30 Jan 2013 10:17:25 +0100 > Subject: [PATCH] detect and consume byte-order marks for textual ports > > * libguile/ports.h: > *

Re: byte-order marks

2013-01-31 Thread Andy Wingo
; Date: Wed, 30 Jan 2013 10:17:25 +0100 >> Subject: [PATCH] detect and consume byte-order marks for textual ports >> >> * libguile/ports.h: >> * libguile/ports.c (scm_consume_byte_order_mark): New procedure. >> >> * libguile/fports.c (scm_open_file): Call co

[PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
re's the patch. Comments and suggestions solicited. Mark From 008b89c7ba4637e2d6323f02b6b8b6284a533857 Mon Sep 17 00:00:00 2001 From: Mark H Weaver Date: Wed, 3 Apr 2013 04:22:04 -0400 Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs). * libguile/ports-internal.h

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
. Mark From d8d37d5519ca61961b70cb3051ccca2be7d4affa Mon Sep 17 00:00:00 2001 From: Mark H Weaver Date: Wed, 3 Apr 2013 04:22:04 -0400 Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs). * libguile/ports-internal.h (struct scm_port_internal): Add new members 'at_st

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Ludovic Courtès
Hello, Mark! Mark H Weaver skribis: > * All kinds of streams are supported in a uniform way: files, pipes, > sockets, terminals, etc. > > * As specified in Unicode 6.2, BOMs are only handled specially at the > start of a stream, and only if the encoding is set to "UTF-16" or > "UTF-32". B
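As a rough illustration of the rule quoted above (a BOM is only looked for at the start of the stream, and only for the generic "UTF-16" and "UTF-32" encodings), here is a minimal sketch. It is not the code from Mark's patch: `sniff_bom`, `struct bom_info`, and the two enums are hypothetical names. The byte sequences themselves are the standard encodings of U+FEFF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types for this sketch; none of them exist in libguile.  */
enum unit_width { WIDTH_16, WIDTH_32 };
enum byte_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

struct bom_info
{
  enum byte_order order;   /* ORDER_UNKNOWN if no BOM was found */
  size_t length;           /* number of BOM bytes to skip, or 0 */
};

/* Inspect the first bytes of a stream whose encoding was given as the
   generic "UTF-16" (WIDTH_16) or "UTF-32" (WIDTH_32).  */
static struct bom_info
sniff_bom (const unsigned char *buf, size_t len, enum unit_width width)
{
  static const unsigned char utf16be[] = { 0xFE, 0xFF };
  static const unsigned char utf16le[] = { 0xFF, 0xFE };
  static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
  static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
  struct bom_info none = { ORDER_UNKNOWN, 0 };
  struct bom_info found;

  if (width == WIDTH_16 && len >= 2)
    {
      found.length = 2;
      found.order = ORDER_BIG;
      if (memcmp (buf, utf16be, 2) == 0)
        return found;
      found.order = ORDER_LITTLE;
      if (memcmp (buf, utf16le, 2) == 0)
        return found;
    }
  else if (width == WIDTH_32 && len >= 4)
    {
      found.length = 4;
      found.order = ORDER_BIG;
      if (memcmp (buf, utf32be, 4) == 0)
        return found;
      found.order = ORDER_LITTLE;
      if (memcmp (buf, utf32le, 4) == 0)
        return found;
    }
  return none;   /* no BOM: the caller picks a default byte order */
}

/* Sample inputs: "a" in UTF-16LE with a BOM, and plain bytes.  */
static const unsigned char with_bom16le[] = { 0xFF, 0xFE, 'a', 0x00 };
static const unsigned char with_bom32be[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char without_bom[] = { 'a', 'b', 'c', 'd' };
```

The sketch deliberately takes the declared width as a parameter rather than guessing it, because the bytes FF FE 00 00 are a valid start for both a UTF-32LE stream and a UTF-16LE stream whose first character is U+0000.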

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
precise_encoding = decide_utf32_encoding (port, mode); > > Shouldn’t it be strcasecmp? (Actually there are other uses of strcmp > already, but I think it’s a mistake.) Ouch, good catch! Indeed, we already had some bugs because of this. I pushed a fix for the existing bugs to stable

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Ludovic Courtès
Mark H Weaver skribis: > l...@gnu.org (Ludovic Courtès) writes: >> Woow, well thought out. The semantics seem good. (It’s interesting to >> see how BOMs complicate things, but that’s life, I guess.) >> >> The patch looks good to me. The test suite is nice. It doesn’t seem to >> cover all the

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
e tweaks. Thanks, Mark From f849f9a3f6babd87088d39369442a7f429762cec Mon Sep 17 00:00:00 2001 From: Mark H Weaver Date: Wed, 3 Apr 2013 04:22:04 -0400 Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs). * libguile/ports-internal.h (struct

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mike Gran
Hi Mark >>> Here's the new patch.  Any more suggestions? There are a couple of lines in your doc patch that aren't quite right. "@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-16BE}, or @code{UTF-16LE}" I assume that two of these should be UTF-32. Also "This is intended to multiple logical te

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
tream start. Write a BOM if appropriate. * doc/ref/api-io.texi (BOM Handling): New node. * test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"): Adapt test to cope with the fact that 'set-port-encoding!' does not immediately open the iconv descript

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-03 Thread Mark H Weaver
start. Write a BOM if appropriate. * doc/ref/api-io.texi (BOM Handling): New node. * test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"): Adapt test to cope with the fact that 'set-port-encoding!' does not immediately open the iconv descriptors. (bv-read

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-04 Thread Andy Wingo
Hi. The following review applies to the wrong version of this patch. I'll go ahead and post it anyway. On Wed 03 Apr 2013 22:33, Mark H Weaver writes: > + /* If we just read a BOM in an encoding that recognizes them, > + then silently consume it and read another code point.

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-05 Thread Mark H Weaver
Hi Andy, Andy Wingo writes: > On Wed 03 Apr 2013 22:33, Mark H Weaver writes: > >> + /* If we just read a BOM in an encoding that recognizes them, >> + then silently consume it and read another code point. */ >> + if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM >>

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-05 Thread Mike Gran
>>> +      /* If the specified encoding is UTF-16 or UTF-32, then make >>> +        that more precise by deciding what endianness to use.  */ >>> +      if (strcasecmp (pt->encoding, "UTF-16") == 0) >>> +        precise_encoding = decide_utf16_encoding (port, mode); >>> +      else if (strcas

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-05 Thread Ludovic Courtès
Mike Gran skribis: +      /* If the specified encoding is UTF-16 or UTF-32, then make +        that more precise by deciding what endianness to use.  */ +      if (strcasecmp (pt->encoding, "UTF-16") == 0) +        precise_encoding = decide_utf16_encoding (port, mode); >>

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

2013-04-05 Thread Mark H Weaver
l...@gnu.org (Ludovic Courtès) writes: > Mike Gran skribis: > >> It would be a trivial function to write, of course, but there is a >> c-strcasecmp func in gnulib. > > Yes, better use that one. > > (Just add ‘c-strcase’ in m4/gnulib-cache.m4, run ‘gnulib-tool --update’ > with Gnulib v0.0-7865-ga8
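For readers wondering what a locale-independent comparison of encoding names might look like, here is a minimal sketch. It is not gnulib's implementation of c-strcasecmp: `ascii_tolower` and `ascii_strcasecmp` are hypothetical names used only for illustration. The sketch assumes that the motivation is to fold only the ASCII letters A-Z, so that names such as "UTF-16" compare the same way whatever the current locale is.

```c
/* Hypothetical helper: fold A-Z to a-z and leave every other value
   untouched, without consulting the current locale.  */
static int
ascii_tolower (int c)
{
  return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Hypothetical stand-in for a locale-independent strcasecmp: returns
   a negative, zero, or positive value like strcmp, ignoring ASCII
   case differences only.  */
static int
ascii_strcasecmp (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;
  int x, y;

  do
    {
      x = ascii_tolower (*p++);
      y = ascii_tolower (*q++);
    }
  while (x != 0 && x == y);

  return x - y;
}
```

With such a helper, a check like the one quoted in the thread could be written as `ascii_strcasecmp (encoding, "UTF-16") == 0`, where `encoding` stands for whatever string holds the port's encoding name.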