Re: UTF-8 strings and endianness

2012-10-30 Thread Jesse Phillips
On Tuesday, 30 October 2012 at 17:17:36 UTC, Tobias Pankrath 
wrote:
On Tuesday, 30 October 2012 at 17:12:41 UTC, Jesse Phillips 
wrote:
On Monday, 29 October 2012 at 15:22:39 UTC, Adam D. Ruppe 
wrote:

UTF-8 isn't affected by endianness.


If this is true, why does the BOM have marks for big and little 
endian?


http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding


UTF8 has only one?


oops, I mixed that up and thought he just said "UTF isn't ..."


Re: UTF-8 strings and endianness

2012-10-30 Thread Dmitry Olshansky

10/30/2012 5:17 PM, Tobias Pankrath wrote:

On Tuesday, 30 October 2012 at 17:12:41 UTC, Jesse Phillips wrote:

On Monday, 29 October 2012 at 15:22:39 UTC, Adam D. Ruppe wrote:

UTF-8 isn't affected by endianness.


If this is true, why does the BOM have marks for big and little endian?

http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding



UTF8 has only one?


Even Wiki knows the simple truth:
> Byte order has no meaning in UTF-8, so its only use in UTF-8 is to
> signal at the start that the text stream is encoded in UTF-8


--
Dmitry Olshansky


Re: UTF-8 strings and endianness

2012-10-30 Thread Tobias Pankrath

On Tuesday, 30 October 2012 at 17:12:41 UTC, Jesse Phillips wrote:

On Monday, 29 October 2012 at 15:22:39 UTC, Adam D. Ruppe wrote:

UTF-8 isn't affected by endianness.


If this is true, why does the BOM have marks for big and little 
endian?


http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding


UTF8 has only one?


Re: UTF-8 strings and endianness

2012-10-30 Thread Jesse Phillips

On Monday, 29 October 2012 at 15:22:39 UTC, Adam D. Ruppe wrote:

UTF-8 isn't affected by endianness.


If this is true, why does the BOM have marks for big and little 
endian?


http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding


Re: UTF-8 strings and endianness

2012-10-29 Thread denizzzka

On Monday, 29 October 2012 at 15:46:43 UTC, Jordi Sayol wrote:

On 29/10/12 16:17, denizzzka wrote:

Hi!

How to convert D's string to big endian?
How to convert to D's string from big endian?




UTF-8 is always big endian.


Yes.

(I thought the problem was in this place, but the problem turned out to be 
different.)


Re: UTF-8 strings and endianness

2012-10-29 Thread denizzzka

On Monday, 29 October 2012 at 15:46:43 UTC, Jordi Sayol wrote:

On 29/10/12 16:17, denizzzka wrote:

Hi!

How to convert D's string to big endian?
How to convert to D's string from big endian?




UTF-8 is always big endian.


oops, what?

Q: Is the UTF-8 encoding scheme the same irrespective of whether 
the underlying processor is little endian or big endian?


A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there 
is no endian problem as there is for encoding forms that use 
16-bit or 32-bit code units. Where a BOM is used with UTF-8, it 
is only used as an encoding signature to distinguish UTF-8 from 
other encodings — it has nothing to do with byte order.
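
A minimal D sketch of the same point: a D string is a sequence of UTF-8 
code units, so viewing it as bytes gives exactly the same sequence on a 
little or big endian host, and the optional UTF-8 "BOM" is just a fixed 
three-byte signature (example code of mine, not from the FAQ):

import std.stdio : writefln, writeln;
import std.string : representation;

void main()
{
    string s = "héllo";            // D strings are UTF-8
    auto bytes = s.representation; // immutable(ubyte)[] view of the code units

    // Prints the identical byte sequence regardless of host endianness:
    writefln("%(%02X %)", bytes);

    // The optional UTF-8 "BOM" is a fixed signature, not a byte-order mark:
    immutable ubyte[3] utf8Signature = [0xEF, 0xBB, 0xBF];
    writeln(bytes.length >= 3 && bytes[0 .. 3] == utf8Signature[]
            ? "starts with the UTF-8 signature"
            : "no signature");
}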


Re: UTF-8 strings and endianness

2012-10-29 Thread denizzzka

On Monday, 29 October 2012 at 15:22:39 UTC, Adam D. Ruppe wrote:

UTF-8 isn't affected by endianness.


Ok, thanks!


Re: UTF-8 strings and endianness

2012-10-29 Thread Jordi Sayol
On 29/10/12 16:17, denizzzka wrote:
> Hi!
> 
> How to convert D's string to big endian?
> How to convert to D's string from big endian?
> 
> 

UTF-8 is always big endian.
-- 
Jordi Sayol


Re: UTF-8 strings and endianness

2012-10-29 Thread Adam D. Ruppe

UTF-8 isn't affected by endianness.


UTF-8 strings and endianness

2012-10-29 Thread denizzzka

Hi!

How to convert D's string to big endian?
How to convert to D's string from big endian?



Re: strings and endianness

2012-01-18 Thread Jonathan M Davis
On Wednesday, January 18, 2012 21:42:51 Johannes Pfau wrote:
> Jonathan M Davis wrote:
> Section 4.1.2. indeed says that it uses big endian. However, I should still
> be able to use a ubyte[16] representation and just make sure that those
> bytes are equal to the big endian representation. Thinking about this: If I
> construct a ubyte[16] from a uuid string byte by byte, the resulting
> ubyte[16] should already be the big-endian representation?

Yes.

> > How that conversion is done though, depends on what each of the
> > values represents. If they're 4 uints, then you'd need to swap each
> > set of 4 bytes. If they're 8 ushorts, then you need to swap each set
> > of 2 bytes.
> > 
> > However, I believe that RFC 4122 is laid out like this
> > 
> > uint
> > ushort
> > ushort
> > ubyte
> > ubyte
> > ubyte
> > ubyte
> > ubyte
> > ubyte
> > ubyte
> > ubyte
> 
> Right, I totally forgot that, as Boost just treats a UUID as a ubyte[16].
> But as long as I keep the data as ubyte[16] equal to the above layout in big
> endian, that should work as well.

Yes. I believe that the implementation (in C++) that we use where I work has a 
union between the various layouts. ubyte[16] should just be a mapping of the 
bytes such that you could cast each piece to the appropriate type and have it 
work (once you've dealt with endianness).
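
For illustration, a rough D sketch of that mapping, assuming 
std.bitmanip's bigEndianToNative and using made-up field names (this is 
not the std.uuid code, just the idea):

import std.bitmanip : bigEndianToNative;

// Field names are hypothetical; RFC 4122 calls them time_low, time_mid, etc.
struct UuidFields
{
    uint     timeLow;
    ushort   timeMid;
    ushort   timeHiAndVersion;
    ubyte[8] rest;   // clock_seq_hi_and_reserved, clock_seq_low, node[6]
}

UuidFields fieldsFromBytes(const ubyte[16] data)
{
    UuidFields f;
    // The multi-byte fields are stored big endian in the RFC layout, so
    // convert them to the host's native order when reading them out.
    ubyte[4] b0 = data[0 .. 4];
    ubyte[2] b1 = data[4 .. 6];
    ubyte[2] b2 = data[6 .. 8];
    f.timeLow          = bigEndianToNative!uint(b0);
    f.timeMid          = bigEndianToNative!ushort(b1);
    f.timeHiAndVersion = bigEndianToNative!ushort(b2);
    f.rest             = data[8 .. 16];
    return f;
}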

> If you want to comment on the code, it's here:
> https://github.com/jpf91/phobos/blob/std.uuid/std/uuid.d

I'll try and take a look at it at some point soon. Worst case, I can look at 
it when you try and get it into Phobos, which I assume that you're trying to 
do.

- Jonathan M Davis


Re: strings and endianness

2012-01-18 Thread Johannes Pfau
Jonathan M Davis wrote:

> On Wednesday, January 18, 2012 20:40:33 Johannes Pfau wrote:
>> I'm currently finishing std.uuid (see
>> http://prowiki.org/wiki4d/wiki.cgi?ReviewQueue ). For name based hashes,
>> a name string is passed to a hash function and I need to make sure that
>> the resulting hash is the same on both little endian and big endian
>> systems. So what's needed to convert a string to e.g. little endian?
>> 
>> string --> as string is basically a byte array, is byte swapping even
>> necessary?
>> wstring --> read as shorts and swap using nativeToLittleEndian()?
>> dstring --> read as ints and swap using nativeToLittleEndian()?
>> 
>> Also remotely related questions: AFAIK
>> http://www.ietf.org/rfc/rfc4122.txt doesn't exactly specify what
>> encoding/byte order should be used for the UUID names? Does this mean
>> different implementations are allowed to generate different UUIDs for the
>> same input? (See chapter 'Algorithm for Creating a Name-Based UUID')
>> 
>> RFC4122 also says "put the name space ID in network byte order.", but the
>> namespace is a ubyte[16], so how should this work?
>> 
>> Should name based UUIDs be different if they were created with the same
>> name, but using different encodings (string vs wstring vs dstring)? That's
>> the way boost.uuid implements it.
> 
> If RFC 4122 says that it's using big endian (and I'd be shocked if
> anything like that used little endian), then you need to convert to big
> endian.

Section 4.1.2. indeed says that it uses big endian. However, I should still 
be able to use a ubyte[16] representation and just make sure that those 
bytes are equal to the big endian representation. Thinking about this: If I 
construct a ubyte[16] from a uuid string byte by byte, the resulting 
ubyte[16] should already be the big-endian representation?

> How that conversion is done though, depends on what each of the
> values represents. If they're 4 uints, then you'd need to swap each set of
> 4 bytes. If they're 8 ushorts, then you need to swap each set of 2 bytes.
> 
> However, I believe that RFC 4122 is laid out like this
> 
> uint
> ushort
> ushort
> ubyte
> ubyte
> ubyte
> ubyte
> ubyte
> ubyte
> ubyte
> ubyte

Right, I totally forgot that, as Boost just treats a UUID as a ubyte[16]. 
But as long as I keep the data as ubyte[16] equal to the above layout in big 
endian, that should work as well.

> So, you'd need to have the first 4 bytes in big endian as a uint, and the
> next 2 sets of 2 bytes in big endian as ushorts, leaving the rest alone.
> 
> As for strings. Remember that they're representing the data in the bytes,
> so I don't believe that it makes sense to try and convert wstrings or
> dstrings to a uuid directly. IIRC, the string must be 32 characters long
> (excepting the dashes) and that each of those characters represents the
> hex for a nibble in the UUID. So, if you have
> 
> 58DF357E-8918-408D-8ABB-AFB70864ED9F
> 
> 5 is the hex value for the first 4 bits in str[0], 8 is the hex value for
> the second 4 bits in str[0], D is the hex value for the first 4 bits in
> str[1], etc. So, there's no endian conversion going on at all. You just
> take the characters (regardless of the type of string) and convert each
> hex character to its corresponding integral value ('5' -> 5, '8' -> 8, 'D'
> -> 13, etc.) and set the corresponding nibble in the ubyte[16] for each.

Sure, that's the string representation of a UUID and that's easy to get 
right. But you can also generate UUIDs from names (see section 4.3, 
UUID("dlang.org", dnsNamespace)). In that case the name is passed to a SHA1 
or MD5 hash function, but the RFC doesn't state which encoding or endianness 
is used for the name.
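
For what it's worth, one way to sidestep the byte order half of that 
question is to hash the UTF-8 bytes of the name, since those carry no 
endianness. A rough sketch of that choice (the helper name is mine, and 
the encoding choice itself is exactly what the RFC leaves open):

import std.digest.sha : SHA1;
import std.string : representation;

// Hash the namespace UUID's bytes followed by the name's UTF-8 code
// units, as in the RFC's name-based algorithm. Because UTF-8 is a plain
// byte sequence, the digest is the same on little and big endian hosts.
ubyte[20] nameHash(const ubyte[16] namespace, string name)
{
    SHA1 sha;
    sha.start();
    sha.put(namespace[]);
    sha.put(name.representation);
    return sha.finish();
}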

> 
> You're going to have to study RFC 4122 though, and make sure that you
> understand it properly. I'm going primarily off of how I've seen UUID's
> implemented before. All of this should be in the RFC.

Don't worry, I've already read RFC 4122 completely, and my implementation is 
basically a port from Boost, so the code is likely not that bad.

If you want to comment on the code, it's here:
https://github.com/jpf91/phobos/blob/std.uuid/std/uuid.d

Most things should be finished, except that I still have to fix the 
endianness stuff.



Re: strings and endianness

2012-01-18 Thread Jonathan M Davis
On Wednesday, January 18, 2012 20:40:33 Johannes Pfau wrote:
> I'm currently finishing std.uuid (see
> http://prowiki.org/wiki4d/wiki.cgi?ReviewQueue ). For name based hashes, a
> name string is passed to a hash function and I need to make sure that the
> resulting hash is the same on both little endian and big endian systems. So
> what's needed to convert a string to e.g. little endian?
> 
> string --> as string is basically a byte array, is byte swapping even
> necessary?
> wstring --> read as shorts and swap using nativeToLittleEndian()?
> dstring --> read as ints and swap using nativeToLittleEndian()?
> 
> Also remotely related questions: AFAIK http://www.ietf.org/rfc/rfc4122.txt
> doesn't exactly specify what encoding/byte order should be used for the UUID
> names? Does this mean different implementations are allowed to generate
> different UUIDs for the same input? (See chapter 'Algorithm for Creating a
> Name-Based UUID')
> 
> RFC4122 also says "put the name space ID in network byte order.", but the
> namespace is a ubyte[16], so how should this work?
> 
> Should name based UUIDs be different if they were created with the same
> name, but using different encodings (string vs wstring vs dstring)? That's
> the way boost.uuid implements it.

If RFC 4122 says that it's using big endian (and I'd be shocked if anything 
like that used little endian), then you need to convert to big endian. How 
that conversion is done though, depends on what each of the values represents. 
If they're 4 uints, then you'd need to swap each set of 4 bytes. If they're 8 
ushorts, then you need to swap each set of 2 bytes.

However, I believe that RFC 4122 is laid out like this

uint
ushort
ushort
ubyte
ubyte
ubyte
ubyte
ubyte
ubyte
ubyte
ubyte

So, you'd need to have the first 4 bytes in big endian as a uint, and the next 
2 sets of 2 bytes in big endian as ushorts, leaving the rest alone.
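
In D that could look roughly like the following, assuming std.bitmanip's 
nativeToBigEndian (field names are made up, this isn't std.uuid's code):

import std.bitmanip : nativeToBigEndian;

// Serialize the RFC 4122 fields into the canonical 16-byte layout:
// the uint and the two ushorts in big endian, the single bytes as-is.
ubyte[16] toBytes(uint timeLow, ushort timeMid, ushort timeHiAndVersion,
                  const ubyte[8] rest)
{
    ubyte[16] data;
    const ubyte[4] tl = nativeToBigEndian(timeLow);
    const ubyte[2] tm = nativeToBigEndian(timeMid);
    const ubyte[2] th = nativeToBigEndian(timeHiAndVersion);
    data[0 .. 4]  = tl[];
    data[4 .. 6]  = tm[];
    data[6 .. 8]  = th[];
    data[8 .. 16] = rest[];
    return data;
}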

As for strings. Remember that they're representing the data in the bytes, so I 
don't believe that it makes sense to try and convert wstrings or dstrings to a 
uuid directly. IIRC, the string must be 32 characters long (excepting the 
dashes) and that each of those characters represents the hex for a nibble in 
the UUID. So, if you have

58DF357E-8918-408D-8ABB-AFB70864ED9F

5 is the hex value for the first 4 bits in str[0], 8 is the hex value for the 
second 4 bits in str[0], D is the hex value for the first 4 bits in str[1], 
etc. So, there's no endian conversion going on at all. You just take the 
characters (regardless of the type of string) and convert each hex character 
to its corresponding integral value ('5' -> 5, '8' -> 8, 'D' -> 13, etc.) and 
set the corresponding nibble in the ubyte[16] for each.
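
A rough sketch of that nibble-by-nibble conversion in D (helper names 
are made up, and error handling is just asserts):

import std.ascii : isHexDigit;

// '5' -> 5, '8' -> 8, 'D' -> 13, ... as in the example above.
ubyte hexValue(char c)
{
    if (c >= '0' && c <= '9') return cast(ubyte)(c - '0');
    if (c >= 'a' && c <= 'f') return cast(ubyte)(c - 'a' + 10);
    return cast(ubyte)(c - 'A' + 10);
}

// Turn "58DF357E-8918-408D-8ABB-AFB70864ED9F" into the corresponding
// ubyte[16]: two hex digits per byte, high nibble first, dashes skipped.
// No endian conversion is involved anywhere.
ubyte[16] parseUuid(string s)
{
    ubyte[16] result;
    size_t nibble = 0;                  // hex digits seen so far (0 .. 32)
    foreach (char c; s)
    {
        if (c == '-')
            continue;
        assert(isHexDigit(c) && nibble < 32, "malformed UUID string");
        immutable v = hexValue(c);
        if (nibble % 2 == 0)
            result[nibble / 2] = cast(ubyte)(v << 4);   // high nibble
        else
            result[nibble / 2] |= v;                    // low nibble
        ++nibble;
    }
    assert(nibble == 32, "malformed UUID string");
    return result;
}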

You're going to have to study RFC 4122 though, and make sure that you 
understand it properly. I'm going primarily off of how I've seen UUIDs 
implemented before. All of this should be in the RFC.

- Jonathan M Davis


strings and endianness

2012-01-18 Thread Johannes Pfau
I'm currently finishing std.uuid (see 
http://prowiki.org/wiki4d/wiki.cgi?ReviewQueue ). For name based hashes, a 
name string is passed to a hash function and I need to make sure that the 
resulting hash is the same on both little endian and big endian systems. So 
what's needed to convert a string to e.g. little endian?

string --> as string is basically a byte array, is byte swapping even 
necessary?
wstring --> read as shorts and swap using nativeToLittleEndian()?
dstring --> read as ints and swap using nativeToLittleEndian()?
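
To illustrate the idea, a sketch of one way to normalize the code units 
to a fixed byte order (the helper below is my own, not an existing 
Phobos function):

import std.bitmanip : nativeToLittleEndian;

// Turn any D string type into a byte sequence that is identical on
// little and big endian hosts: char code units are single bytes and need
// no swapping, wchar/dchar code units are forced to little endian.
ubyte[] toLittleEndianBytes(S)(S str)
{
    ubyte[] result;
    foreach (c; str)                // iterates code units, not code points
    {
        static if (c.sizeof == 1)
            result ~= cast(ubyte) c;
        else static if (c.sizeof == 2)
        {
            const ubyte[2] unit = nativeToLittleEndian(cast(ushort) c);
            result ~= unit[];
        }
        else
        {
            const ubyte[4] unit = nativeToLittleEndian(cast(uint) c);
            result ~= unit[];
        }
    }
    return result;
}

Whether hashing toLittleEndianBytes("foo"w) should give the same UUID as 
hashing the UTF-8 bytes of "foo" is exactly the encoding question further 
down.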

Also remotely related questions: AFAIK http://www.ietf.org/rfc/rfc4122.txt 
doesn't exactly specify what encoding/byte order should be used for the UUID 
names? Does this mean different implementations are allowed to generate 
different UUIDs for the same input? (See chapter 'Algorithm for Creating a 
Name-Based UUID')

RFC4122 also says "put the name space ID in network byte order.", but the 
namespace is a ubyte[16], so how should this work?

Should name based UUIDs be different if they were created with the same 
name, but using different encodings (string vs wstring vs dstring)? That's 
the way boost.uuid implements it.