Daniel Keep wrote:
>
>
> jicman wrote:
> > Ok, the only reason that I say Unicode is that when I open the file in
> > Notepad and I do a SaveAs, the Encoding says Unicode. So, when I read this
> > file and write it back to another file, the Encoding turns to UTF8.
> > I want to keep it as Unicode.
>
> There is no such thing as a Unicode file format. There just isn't. I
> know the option you speak of, and I have no idea what it's supposed to
> be; probably UCS-2 or UTF-16.
>
> > I will give the suggestion a try. I did not try it yet. Maybe Phobos
> > should think about taking care of the BOM byte and provide support for
> > these encodings. I am a big fan of Phobos. :-) I have not tried Tango
> > yet, because I would have to uninstall Phobos and I have just spent two
> > years using Phobos and we already have an application based on Phobos and
> > switching to Tango will slow us down and put us back. Maybe version
> > 2.0.
>
> There's std.stream.EndianStream, which looks like it can read and write
> BOMs. As for converting between UTF encodings, std.utf.
Thanks, Daniel...
I could not get the above to work, maybe for lack of understanding and examples,
but the code below is working. For now, the 1000+ XML files I have all carry the
same BOM (UTF16_le), so it will do for the time being. I still need to fill in
the other encodings later. I have a question about this code, which works for
UTF16_le:
import std.stdio;
import std.file;
import std.string; // for find() and split()
import std.utf;
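char[] getBOM(ubyte[] t)
{
    ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
    ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
    ubyte[] UTF8 = [0xef,0xbb,0xbf];
    ubyte[] UTF16_be = [0xfe,0xff];
    ubyte[] UTF16_le = [0xff,0xfe];
    // Test the 4-byte BOMs first: the UTF-32 LE BOM begins with the
    // UTF-16 LE BOM. The length checks guard against very short input.
    if (t.length >= 4 && t[0 .. 4] == UTF32_be)
        return "UTF32_be";
    if (t.length >= 4 && t[0 .. 4] == UTF32_le)
        return "UTF32_le";
    if (t.length >= 3 && t[0 .. 3] == UTF8)
        return "UTF8";
    if (t.length >= 2 && t[0 .. 2] == UTF16_be)
        return "UTF16_be";
    if (t.length >= 2 && t[0 .. 2] == UTF16_le)
        return "UTF16_le";
    return "NO_BOM";
}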
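void main()
{
    char[] f0 = "Unicode.ttx.xml";
    char[] f1 = "UnicodeNew.ttx.xml";
    auto text = cast(string) f0.read();
    ubyte[4] b = cast(ubyte[]) text[0 .. 4];
    char[] bom = getBOM(b);
    char[] wText;
    writefln(bom);
    // Compare against "UTF8" first: it is only 4 chars long, so
    // bom[0 .. 5] would go out of bounds for it.
    if (bom == "UTF8")
    {
        // not handled yet
    }
    else if (bom[0 .. 5] == "UTF16")
    {
        // Reinterprets the raw bytes as UTF-16 code units in native byte
        // order; the BOM stays in the text as its first character.
        wchar[] temp = cast(wchar[]) text;
        wText = std.utf.toUTF8(temp);
    }
    else if (bom[0 .. 5] == "UTF32")
    {
        // not handled yet
    }
    // find() returns -1 when the substring is absent
    if (std.string.find(wText,"DisplayText=\"TrixieTag\">") != -1)
    {
        writefln("Found Trixie Tags in %s", f0);
        char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
        wText = nt[0];
        nt = std.string.split(nt[1],"</ut>");
        wText ~= nt[1];
        // toUTF16 re-encodes in native byte order; the BOM character is
        // still part of wText, so it gets written out again.
        wchar[] eText = std.utf.toUTF16(wText);
        f1.write(cast(void[]) eText);
    }
}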
The question: what happens when I get a UTF16_be (big endian) file? Will the
call to std.utf.toUTF16(wText) take care of the BOM and the byte order?
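My own guess at the big-endian case, for what it is worth: the cast to wchar[]
only reinterprets the bytes in the machine's native order, and std.utf.toUTF16
knows nothing about the byte order of the file, so the bytes would have to be
swapped by hand on the way in and again on the way out. Below is a minimal
sketch of that idea, assuming a little-endian machine such as x86. The names
swapBytePairs, utf16BytesToUTF8 and utf8ToUTF16Bytes are made up for this
example; they are not Phobos functions.

```d
import std.utf;

// Hypothetical helper: swap every pair of bytes in place (UTF-16 BE <-> LE).
void swapBytePairs(ubyte[] data)
{
    for (size_t i = 0; i + 1 < data.length; i += 2)
    {
        ubyte tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}

// Hypothetical helper: decode raw UTF-16 file bytes (BOM included) to UTF-8.
// Assumption: the host is little endian, so only BE input needs swapping.
char[] utf16BytesToUTF8(ubyte[] raw)
{
    // Work on a copy and drop a trailing odd byte so the cast is well formed.
    ubyte[] data = raw[0 .. raw.length - (raw.length % 2)].dup;
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        swapBytePairs(data); // big-endian BOM: convert to native order
    wchar[] w = cast(wchar[]) data;
    return std.utf.toUTF8(w).dup; // the BOM stays in as the first character
}

// Hypothetical helper: encode UTF-8 text as UTF-16 bytes in the chosen order.
ubyte[] utf8ToUTF16Bytes(char[] text, bool bigEndian)
{
    auto w = std.utf.toUTF16(text); // native (little-endian) code units
    ubyte[] data = (cast(ubyte[]) w).dup;
    if (bigEndian)
        swapBytePairs(data);
    return data;
}
```

With this, a UTF16_be file could be read with
utf16BytesToUTF8(cast(ubyte[]) f0.read()) and written back with
f1.write(utf8ToUTF16Bytes(wText, true)). The BOM stays in the text as its first
character, so it survives the round trip.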
Thanks so much for all the support.
josé