Re: Reading and writing Unicode files

jicman Tue, 03 Mar 2009 08:45:10 -0800

Daniel Keep Wrote:

> 
> 
> jicman wrote:
> > Ok, the only reason that I say Unicode is that when I open the file in 
> > Notepad and I do a SaveAs, the Encoding says Unicode.  So, when I read this 
> > file and I write it back to another file, the Encoding turns to UTF8.  
> > I want to keep it as Unicode.
> 
> There is no such thing as a Unicode file format.  There just isn't.  I
> know the option you speak of, and I have no idea what it's supposed to
> be; probably UCS-2 or UTF-16.
> 
> > I will give the suggestion a try.  I did not try it yet.  Maybe Phobos 
> > should think about taking care of the BOM bytes and provide support for 
> > these encodings.  I am a big fan of Phobos. :-)  I have not tried Tango 
> > yet, because I would have to uninstall Phobos and I have just spent two 
> > years using Phobos and we already have an application based on Phobos and 
> > changing to Tango will slow us down and set us back.  Maybe version 
> > 2.0.
> 
> There's std.stream.EndianStream, which looks like it can read and write
> BOMs.  As for converting between UTF encodings, std.utf.


Thanks, Daniel...

I could not get the above to work, maybe for lack of understanding and 
examples, but the code below is working.  The 1000+ XML files I have all start 
with the same BOM (UTF16_le), so it will work at least for now.  However, I 
need to fill in the other encodings later.  I have a question about the code 
below, which works for UTF16_le:

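import std.stdio;
import std.file;
import std.string;
import std.utf;

// Identify the encoding from the first four bytes of the file.
// The UTF-32 checks must come first: the UTF32_le BOM starts with the
// same two bytes as the UTF16_le BOM.
char[] getBOM(ubyte[] t)
{
  ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
  ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
  ubyte[] UTF8 = [0xef,0xbb,0xbf];
  ubyte[] UTF16_be = [0xfe,0xff];
  ubyte[] UTF16_le = [0xff,0xfe];
  if(t[0 .. 4] == UTF32_be)
    return "UTF32_be";
  if(t[0 .. 4] == UTF32_le)
    return "UTF32_le";
  if(t[0 .. 3] == UTF8)
    return "UTF8";
  if(t[0 .. 2] == UTF16_be)
    return "UTF16_be";
  if(t[0 .. 2] == UTF16_le)
    return "UTF16_le";
  return "NO_BOM";
}

void main()
{
  char[] f0 = "Unicode.ttx.xml";
  char[] f1 = "UnicodeNew.ttx.xml";
  // Raw file contents; these are not valid UTF-8 yet.
  auto text = cast(string) f0.read();
  ubyte[4] b = cast(ubyte[]) text[0 .. 4];
  char[] bom = getBOM(b);
  char[] wText;
  writefln("%s", bom);
  if (bom[0 .. 5] == "UTF16")
  {
    // Reinterpret the bytes as UTF-16 code units.  The BOM is kept on
    // purpose (no text[2 .. $]) so that it survives the round trip.
    wchar[] temp = cast(wchar[]) text; //text[2 .. $];
    wText = std.utf.toUTF8(temp);
  }
  else if (bom[0 .. 5] == "UTF32")
  {
    // not handled yet
  }
  else if (bom == "UTF8")
  {
    // not handled yet
  }
  
  // find() returns -1 when the substring is not present.
  if (std.string.find(wText,"DisplayText=\"TrixieTag\">") != -1)
  {
    writefln("Found Trixie Tags in %s", f0);
    // Keep everything before the first TrixieTag ...
    char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
    wText = nt[0];
    // ... and append the piece that follows its closing </ut>.
    nt = std.string.split(nt[1],"</ut>");
    wText ~= nt[1];
    wchar[] eText = std.utf.toUTF16(wText);
    f1.write(cast(void[]) eText);
  }
}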


The question: what happens when I get a UTF16_be (big endian) file?  Will the 
call to std.utf.toUTF16(wText) take care of the BOM?
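
For reference, std.utf.toUTF16 only converts between code-unit types in the 
machine's native byte order; it neither writes a BOM nor swaps bytes.  In the 
code above the BOM survives only because it was left in the text as the 
character U+FEFF.  A minimal sketch of one way to handle both byte orders 
follows.  It assumes a current D (D2) compiler rather than the D1 used above, 
and utf16BytesToUtf8 and utf8ToUtf16Bytes are hypothetical helper names, not 
part of Phobos:

```d
import std.utf : toUTF8, toUTF16;

/// Hypothetical helper: decode UTF-16 bytes that start with a 2-byte BOM.
/// The byte order is taken from the BOM, and the BOM itself is dropped.
string utf16BytesToUtf8(const(ubyte)[] data)
{
    assert(data.length >= 2 && data.length % 2 == 0, "truncated UTF-16 data");
    immutable bool bigEndian = data[0] == 0xFE && data[1] == 0xFF;
    assert(bigEndian || (data[0] == 0xFF && data[1] == 0xFE), "no UTF-16 BOM");

    // Build each code unit from its two bytes explicitly, so the result
    // does not depend on the byte order of the machine running the code.
    auto units = new wchar[](data.length / 2 - 1);
    foreach (i, ref u; units)
    {
        immutable hi = data[2 + 2 * i + (bigEndian ? 0 : 1)];
        immutable lo = data[2 + 2 * i + (bigEndian ? 1 : 0)];
        u = cast(wchar)((hi << 8) | lo);
    }
    return toUTF8(units);
}

/// Hypothetical helper: encode UTF-8 text as UTF-16 bytes in the requested
/// byte order, with the matching BOM written first.
ubyte[] utf8ToUtf16Bytes(string text, bool bigEndian)
{
    ubyte[] result;

    void emit(wchar u)
    {
        immutable hi = cast(ubyte)(u >> 8);
        immutable lo = cast(ubyte)(u & 0xFF);
        if (bigEndian) { result ~= hi; result ~= lo; }
        else           { result ~= lo; result ~= hi; }
    }

    emit('\uFEFF');                     // the BOM, in the chosen byte order
    foreach (wchar u; toUTF16(text))
        emit(u);
    return result;
}
```

With helpers like these, main() would read the file as bytes, remember which 
BOM getBOM() reported, edit the decoded UTF-8 text, and write back 
utf8ToUtf16Bytes(wText, bom == "UTF16_be") so the output keeps the original 
byte order.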

thanks so much for all the support.

josé
