Re: std.stream, BOM, and deprecation

2012-10-16 Thread Charles Hixson

On 10/15/2012 10:29 AM, Steven Schveighoffer wrote:

On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson
charleshi...@earthlink.net wrote:


If std.stream is being deprecated, what is the correct way to deal
with file BOMs? This is particularly concerning utf8 files, which I
understand to be a bit problematic, as there isn't, actually, a utf8
BOM, merely a convention which isn't a part of a standard. But the
std.stdio documentation doesn't so much as mention byte order marks
(BOMs).

If this should wait until std.io is released, then I could use
std.stream until then, but the documentation is already warning to
avoid using it.


When std.io is released, it will be fully BOM-aware by default (as long
as you use the purely D versions). The plan from my point of view is for
std.io to be a replacement backend for std.stdio, with the C version
being the default (as it must be for compatibility purposes).

-Steve

That sounds good.  All of the files I'm interested in should have been
converted to utf8 (if they weren't already), but many of them have the
utf8 BOM (so they won't be confused with other non-unicode files).  It
sounds like std.io will handle this in a transparent fashion.


Re: std.stream, BOM, and deprecation

2012-10-16 Thread Charles Hixson

On 10/14/2012 10:28 PM, Nick Sabalausky wrote:

On Sat, 13 Oct 2012 18:53:48 -0700
Charles Hixson charleshi...@earthlink.net wrote:


If std.stream is being deprecated, what is the correct way to deal
with file BOMs?  This is particularly concerning utf8 files, which I
understand to be a bit problematic, as there isn't, actually, a utf8
BOM, merely a convention which isn't a part of a standard.  But the
std.stdio documentation doesn't so much as mention byte order marks
(BOMs).

If this should wait until std.io is released, then I could use
std.stream until then, but the documentation is already warning to
avoid using it.


Personally, I think it's kind of cumbersome to deal with in Phobos, so
I wrote this wrapper that I use instead, which handles everything:

https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24

And then there's the utfConvert below it if you already have the data
in memory instead of on disk.

(Maybe I should add some range capability and make a Phobos pull
request. I don't know if it'd fly though. It uses a lot of custom
endian- and bom-related code since I found the existing endian/bom
stuff in phobos inadequate. So that stuff would have to be accepted,
and then this too, and it's usually a bit of a pain to get things
approved.)

That wrapper looks very nice, but it's a lot more than what I need.  I
want to deal only with utf8 files, many of which have BOMs.  I *can*
handle that by detecting the BOM and dropping it.  I don't need anything
else.  I was merely wondering what the appropriate way to approach this
was now that std.stream is documented as deprecated but no replacement
is specified.  It sounds like the appropriate response is to use
std.stdio, and handle the BOM myself.
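
For the UTF-8-only case described above, detecting and dropping the BOM by hand takes only a few lines on top of std.stdio. A minimal sketch, assuming the file really is UTF-8; stripUtf8Bom and readLinesWithoutBom are hypothetical helper names, not Phobos functions:

```d
import std.stdio : File;

/// Returns `text` without a leading UTF-8 BOM (bytes EF BB BF), if present.
/// Hypothetical helper for this sketch; not part of Phobos.
inout(char)[] stripUtf8Bom(inout(char)[] text)
{
    if (text.length >= 3 &&
        text[0] == 0xEF && text[1] == 0xBB && text[2] == 0xBF)
    {
        return text[3 .. $];
    }
    return text;
}

/// Reads every line of `path` via std.stdio, dropping a BOM on the first line.
/// Assumes the file is UTF-8; no other encoding is detected or converted.
string[] readLinesWithoutBom(string path)
{
    string[] lines;
    bool first = true;
    foreach (line; File(path).byLine())
    {
        // byLine reuses its buffer between iterations, so copy with idup.
        lines ~= (first ? stripUtf8Bom(line) : line).idup;
        first = false;
    }
    return lines;
}
```

Calling readLinesWithoutBom("notes.txt") (the file name is only an example) returns the lines as strings with any leading BOM removed; a file without a BOM passes through unchanged.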


Re: std.stream, BOM, and deprecation

2012-10-15 Thread Steven Schveighoffer
On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson  
charleshi...@earthlink.net wrote:


If std.stream is being deprecated, what is the correct way to deal with  
file BOMs?  This is particularly concerning utf8 files, which I
understand to be a bit problematic, as there isn't, actually, a utf8
BOM, merely a convention which isn't a part of a standard.  But the
std.stdio documentation doesn't so much as mention byte order marks
(BOMs).


If this should wait until std.io is released, then I could use
std.stream until then, but the documentation is already warning to avoid
using it.


When std.io is released, it will be fully BOM-aware by default (as long as  
you use the purely D versions).  The plan from my point of view is for  
std.io to be a replacement backend for std.stdio, with the C version being
the default (as it must be for compatibility purposes).


-Steve


Re: std.stream, BOM, and deprecation

2012-10-14 Thread Nick Sabalausky
On Sat, 13 Oct 2012 18:53:48 -0700
Charles Hixson charleshi...@earthlink.net wrote:

 If std.stream is being deprecated, what is the correct way to deal
 with file BOMs?  This is particularly concerning utf8 files, which I
 understand to be a bit problematic, as there isn't, actually, a utf8
 BOM, merely a convention which isn't a part of a standard.  But the
 std.stdio documentation doesn't so much as mention byte order marks
 (BOMs).
 
 If this should wait until std.io is released, then I could use
 std.stream until then, but the documentation is already warning to
 avoid using it.

Personally, I think it's kind of cumbersome to deal with in Phobos, so
I wrote this wrapper that I use instead, which handles everything:

https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24

And then there's the utfConvert below it if you already have the data
in memory instead of on disk.

(Maybe I should add some range capability and make a Phobos pull
request. I don't know if it'd fly though. It uses a lot of custom
endian- and bom-related code since I found the existing endian/bom
stuff in phobos inadequate. So that stuff would have to be accepted,
and then this too, and it's usually a bit of a pain to get things
approved.)



Re: std.stream, BOM, and deprecation

2012-10-13 Thread Jonathan M Davis
On Saturday, October 13, 2012 18:53:48 Charles Hixson wrote:
 If std.stream is being deprecated, what is the correct way to deal with
 file BOMs?  This is particularly concerning utf8 files, which I
 understand to be a bit problematic, as there isn't, actually, a utf8
 BOM, merely a convention which isn't a part of a standard.  But the
 std.stdio documentation doesn't so much as mention byte order marks (BOMs).
 
 If this should wait until std.io is released, then I could use
 std.stream until then, but the documentation is already warning to avoid
 using it.

std.stream will be around until after std.io has been introduced, because 
std.io will be its replacement. As for dealing with BOMs, I don't really know 
anything about that, so I don't really have any suggestions. I know that it's 
come up before, and you can probably find some discussion on it in the 
archives, but for the most part, Phobos' I/O assumes UTF-8 or compatible, and 
if you want something else, you have to deal with it yourself. It's an area 
where Phobos needs improvement.
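
To illustrate what "deal with it yourself" can look like for non-UTF-8 input, here is a rough sketch that classifies a buffer by its leading BOM bytes. The BomKind enum and the detectBom and startsWithBytes functions are hypothetical names made up for this example, not Phobos APIs, and the sketch only identifies the encoding; it does not convert anything.

```d
/// Encodings distinguishable by a leading BOM (hypothetical enum for this sketch).
enum BomKind { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// True if `data` begins with the bytes in `prefix`.
bool startsWithBytes(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Classifies `data` by its leading byte order mark, if any.
BomKind detectBom(const(ubyte)[] data)
{
    // UTF-32 LE must be tested before UTF-16 LE: its BOM also begins with FF FE.
    if (startsWithBytes(data, [0xFF, 0xFE, 0x00, 0x00])) return BomKind.utf32le;
    if (startsWithBytes(data, [0x00, 0x00, 0xFE, 0xFF])) return BomKind.utf32be;
    if (startsWithBytes(data, [0xEF, 0xBB, 0xBF]))       return BomKind.utf8;
    if (startsWithBytes(data, [0xFF, 0xFE]))             return BomKind.utf16le;
    if (startsWithBytes(data, [0xFE, 0xFF]))             return BomKind.utf16be;
    return BomKind.none;
}
```

A buffer read with std.file.read can be cast to const(ubyte)[] and passed to detectBom; a result of BomKind.none means only that no BOM was found, not that the data is UTF-8.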

You can use std.stream, but just be aware that in the long term, you'll either 
have to refactor your code so that it uses another solution (presumably 
std.io) or copy std.stream to your own stuff, because it's going to be removed 
from Phobos eventually.

- Jonathan M Davis


Re: std.stream, BOM, and deprecation

2012-10-13 Thread Ali Çehreli

On 10/13/2012 06:53 PM, Charles Hixson wrote:
 If std.stream is being deprecated, what is the correct way to deal with
 file BOMs? This is particularly concerning utf8 files, which I
 understand to be a bit problematic, as there isn't, actually, a utf8
 BOM,

That's correct. There is just one byte order for UTF-8.

 merely a convention which isn't a part of a standard.

I am not sure about that. The Unicode standard describes UTF-8 as code 
units following each other in the file. There can't be any confusion 
about their order. According to Wikipedia, the only use of BOM for UTF-8 
is to identify the file as having been encoded in UTF-8:


  http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

But that can't have any meaning. The file could have been encoded in any 
one of the multitude of code pages as well. Treating the first three 
bytes as BOM would be taking a chance in that case and dropping those 
three characters.
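
One way to reduce (not eliminate) that risk is to treat the first three bytes as a BOM only when the whole buffer also validates as UTF-8. A minimal sketch under that assumption; hasPlausibleUtf8Bom is a hypothetical helper name, while validate and UTFException come from std.utf:

```d
import std.utf : validate, UTFException;

/// True if `data` starts with EF BB BF *and* the whole buffer is
/// well-formed UTF-8. Hypothetical helper; a heuristic, not a guarantee.
bool hasPlausibleUtf8Bom(const(ubyte)[] data)
{
    if (data.length < 3 ||
        data[0] != 0xEF || data[1] != 0xBB || data[2] != 0xBF)
    {
        return false;
    }
    try
    {
        // validate throws UTFException on malformed UTF-8.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A code-page file that happens to start with those three bytes will usually contain other byte sequences that are not valid UTF-8, so it fails the check and its first three bytes are left alone. A file that passes may still not be UTF-8 in intent; the check only makes a wrong guess less likely.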


 But the
 std.stdio documentation doesn't so much as mention byte order marks
 (BOMs).


 If this should wait until std.io is released, then I could use
 std.stream until then, but the documentation is already warning to avoid
 using it.

As I understand it, it is all down to convention anyway. What is the
meaning of the non-ASCII code 166? Only the generator of the file knows. :/


Ali