Re: New ABI NSConstantString

2018-04-05 Thread David Chisnall
On 6 Apr 2018, at 00:25, Stefan Bidigaray  wrote:
> 
> I use the gmail web interface, which is not great. I'll just comment without 
> quoting.
> 
> The thing I'm trying to address is the fact that all CF objects must start 
> with:
> struct {
> void *isa;
> uint32_t info;
> };
> That 32-bit info value includes the CFTypeID (a 16-bit value) and 16-bit for 
> general/restricted use.

Which 16 bits are the CFTypeID and which are spare?  Apple (from their open 
source release) appears to use a 12-bit TypeID (which indexes into a 10-bit 
table, so leaves two bits spare) and uses the rest for the ref count.

> If that 32-bit (or it could be 64-bit) field could be the same for constant 
> strings, it would allow CFString functions to work directly with ObjC 
> constant strings, instead of having to call the toll-free bridging mechanism. 
> That would be much more efficient for container objects in corebase.
> 
> Just to be clear, the CFString structure is currently:
> struct {
> void *isa;
> uint32_t info;
> char *data;
> long count;
> long hash;
> void *allocator;
> };
> 
> If the ObjC constant string structure and the CFString structure were 
> similar, they could be used interchangeably in corebase and base.
> 
> So my proposal was to arrange the first top-most portion of the new constant 
> string structure as:
> struct {
> void *isa;
> uint64_t info; /* includes both info and hash */
> char *data;
> long count;
> };
> 
> If I modified the corebase version to match, these structures, with a little
> help from libobjc, could be exactly the same.

I’d prefer not to pack too many unrelated things into a uint64_t (particularly 
because that will break things on big-endian platforms), so how about:

struct
{
Class isa;
uint32_t flags;
uint32_t count;
uint32_t length;
uint32_t hash;
const char *data;
};

That gives us 24 bytes on 32-bit, 32 bytes on 64-bit, and 48 bytes on 128-bit,
with no padding on any architecture.
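The size arithmetic for this layout can be checked mechanically. The following is a minimal sketch, not the actual libobjc definition: the struct name is hypothetical, `void *` stands in for `Class` so that it compiles as plain C, and it assumes pointers are aligned to their own size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; `void *` stands in for `Class`. */
struct const_string_layout {
    void       *isa;
    uint32_t    flags;
    uint32_t    count;
    uint32_t    length;
    uint32_t    hash;
    const char *data;
};

/* Two pointers plus the four 32-bit fields (16 bytes) between them. */
static size_t expected_size(void)
{
    return 2 * sizeof(void *) + 4 * sizeof(uint32_t);
}

/* Returns 1 when the compiler inserted no padding anywhere in the struct. */
static int layout_has_no_padding(void)
{
    return offsetof(struct const_string_layout, flags) == sizeof(void *)
        && offsetof(struct const_string_layout, data) == sizeof(void *) + 16
        && sizeof(struct const_string_layout) == expected_size();
}
```

Because the four 32-bit fields together fill 16 bytes, the trailing pointer stays naturally aligned for 4-, 8-, and 16-byte pointers alike.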

Does CoreBase have any issues using GSTinyStrings?  Presumably it has to put up 
with the fact that they might be generated at run time and handle them already?

David


___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: New ABI NSConstantString

2018-04-05 Thread Stefan Bidigaray
I use the gmail web interface, which is not great. I'll just comment
without quoting.

The thing I'm trying to address is the fact that all CF objects must start
with:
struct {
void *isa;
uint32_t info;
};
That 32-bit info value includes the CFTypeID (a 16-bit value) and 16-bit
for general/restricted use.

If that 32-bit (or it could be 64-bit) field could be the same for constant
strings, it would allow CFString functions to work directly with ObjC
constant strings, instead of having to call the toll-free bridging
mechanism. That would be much more efficient for container objects in
corebase.

Just to be clear, the CFString structure is currently:
struct {
void *isa;
uint32_t info;
char *data;
long count;
long hash;
void *allocator;
};

If the ObjC constant string structure and the CFString structure were
similar, they could be used interchangeably in corebase and base.

So my proposal was to arrange the first top-most portion of the new
constant string structure as:
struct {
void *isa;
uint64_t info; /* includes both info and hash */
char *data;
long count;
};

If I modified the corebase version to match, these structures, with a little
help from libobjc, could be exactly the same.

On Thu, Apr 5, 2018 at 3:33 PM, David Chisnall 
wrote:

> This might be slightly confusing, because your mail client doesn’t seem to
> do anything sane for quoting:
>
> On 5 Apr 2018, at 20:09, Stefan Bidigaray  wrote:
> >
> > On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <
> gnus...@theravensnest.org> wrote:
> > On 5 Apr 2018, at 17:27, Stefan Bidigaray  wrote:
> > >
> > > Hi David,
> > > I forgot to make a comment when you originally posted the idea, and I
> think this would be a great time to add my 2 cents.
> > >
> > > Regarding the structure:
> > > * Would it not be better to add the flags bit field immediately after
> the isa pointer? My thought here is that it can be checked for if different
> versions of the structure exist. This is important for CoreBase since it
> does not have the luxury of real classes.
> >
> > I’m concerned with structure padding here.  Even on a 64-bit platform,
> we either need an 8-byte flags field (which is wasteful) or end up with 4
> bytes of padding.  With 128-bit pointers (which are probably coming sooner
> than you expect) we will end up with 12 bytes of padding if we have a
> 32-bit flags field followed by a pointer.
> >
> > Well, I was hoping there is a way we can define this structure so that
> it can be used directly in CoreBase, without having to call the toll-free
> bridging mechanism. If a 32-bit hash is used, could it be combined with the
> "flags" variable (see the structure I included at the end of this email)?
> I'm hoping to be able to use the same constant strings without having
> to call the bridging mechanism. It's pretty slow and cumbersome.
>
> Can you explain why CoreBase needs to store the hash as anything other
> than a 32-bit value that it can zero extend when returning a 64-bit value?
> If the CoreFoundation and Foundation implementations of hash are
> compatible, then it will currently be returning a 28-bit value in a 64-bit
> register, so I don’t understand the issue here.
>
> >
> > By the way, I noticed there was not uint32_t flags in your original
> structure, making it 24 bytes in 32-bit CPUs.
> >
> > > * Would it be possible to make the hash variable an NSUInteger? The
> output of -hash is an NSUInteger, and that would allow the value to be
> expanded in the future.
> >
> > We can, though that would again increase the size quite noticeably.  I
> think I’m happy with a 32-bit hash, because as rfm points out with a decent
> hash algorithm that basically gives us unique hashes.
> >
> > Sounds reasonable.
> >
> > > * Why have both count and length? Would it not make more sense to keep
> a single variable here called count and define it as, "The count/number of
> code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16
> it would be the # of 16-bit codes. The Apple documentation states "The
> number of UTF-16 code units in the receiver", making at least the ASCII and
> UTF-16 numbers correct. The way I understand the current implementation,
> the value for length would return the UTF-32 # of characters, which is
> inconsistent with the docs.
> >
> > If a UTF-8 string contains multi-byte sequences, then the length of the
> buffer and the number of UTF-16 code units will be different.  If we know
> the number of bytes, then we can use more efficient C standard library
> functions for things like comparisons, though that may not be important.
> >
> > I guess I'm still a bit confused about the meaning of and/or difference
> between the variables count and length.
>
> One tells you the logical number of characters, the other the length of
> the buffer in bytes.  A lot of bytes-scanning functions are far more
> efficient if they know the length up front, because 

Re: New ABI NSConstantString

2018-04-05 Thread David Chisnall
This might be slightly confusing, because your mail client doesn’t seem to do 
anything sane for quoting:

On 5 Apr 2018, at 20:09, Stefan Bidigaray  wrote:
> 
> On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall  
> wrote:
> On 5 Apr 2018, at 17:27, Stefan Bidigaray  wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I think 
> > this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after the 
> > isa pointer? My thought here is that it can be checked for if different 
> > versions of the structure exist. This is important for CoreBase since it 
> > does not have the luxury of real classes.
> 
> I’m concerned with structure padding here.  Even on a 64-bit platform, we 
> either need an 8-byte flags field (which is wasteful) or end up with 4 bytes 
> of padding.  With 128-bit pointers (which are probably coming sooner than you 
> expect) we will end up with 12 bytes of padding if we have a 32-bit flags 
> field followed by a pointer.
> 
> Well, I was hoping there is a way we can define this structure so that it can 
> be used directly in CoreBase, without having to call the toll-free bridging 
> mechanism. If a 32-bit hash is used, could it be combined with the "flags" 
> variable (see the structure I included at the end of this email)? I'm hoping 
> to be able to use the same constant strings without having to call the
> bridging mechanism. It's pretty slow and cumbersome.

Can you explain why CoreBase needs to store the hash as anything other than a 
32-bit value that it can zero extend when returning a 64-bit value?  If the
CoreFoundation and Foundation implementations of hash are compatible, then it 
will currently be returning a 28-bit value in a 64-bit register, so I don’t 
understand the issue here.

> 
> By the way, I noticed there was not uint32_t flags in your original 
> structure, making it 24 bytes in 32-bit CPUs.
> 
> > * Would it be possible to make the hash variable an NSUInteger? The output
> > of -hash is an NSUInteger, and that would allow the value to be expanded
> > in the future.
> 
> We can, though that would again increase the size quite noticeably.  I think 
> I’m happy with a 32-bit hash, because as rfm points out with a decent hash 
> algorithm that basically gives us unique hashes.
> 
> Sounds reasonable.
>  
> > * Why have both count and length? Would it not make more sense to keep a 
> > single variable here called count and define it as, "The count/number of 
> > code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 
> > it would be the # of 16-bit codes. The Apple documentation states "The 
> > number of UTF-16 code units in the receiver", making at least the ASCII and 
> > UTF-16 numbers correct. The way I understand the current implementation, 
> > the value for length would return the UTF-32 # of characters, which is 
> > inconsistent with the docs.
> 
> If a UTF-8 string contains multi-byte sequences, then the length of the 
> buffer and the number of UTF-16 code units will be different.  If we know the
> number of bytes, then we can use more efficient C standard library functions 
> for things like comparisons, though that may not be important.
> 
> I guess I'm still a bit confused about the meaning of and/or difference
> between the variables count and length.

One tells you the logical number of characters, the other the length of the 
buffer in bytes.  A lot of bytes-scanning functions are far more efficient if 
they know the length up front, because they can then process one word at a time 
until the last word.
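To illustrate why a known byte length helps, here is a minimal sketch of a word-at-a-time equality check. It is not taken from GNUstep; the function name is hypothetical, and it uses `memcpy` for the word loads so that unaligned buffers are handled safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the two buffers of `len` bytes are identical, 0 otherwise. */
static int bytes_equal_wordwise(const char *a, const char *b, size_t len)
{
    size_t i = 0;
    /* Compare one machine word at a time while a full word remains. */
    for (; i + sizeof(uintptr_t) <= len; i += sizeof(uintptr_t)) {
        uintptr_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);  /* memcpy avoids unaligned-access UB */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    /* Finish the final partial word byte by byte. */
    for (; i < len; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

Without the length, the loop would have to test every byte for the terminator before it could safely read a whole word.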

> I know this is probably going to be rejected, but how about making constant
> strings either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this
> would increase the byte count for most European languages using Latin
> characters, but I don't see the point of maintaining both UTF-8 and UTF-16
> encodings. Everything that can be done with UTF-16 can be encoded in UTF-8
> (and vice versa), so how would the compiler pick between the two?
> Additionally, wouldn't sticking to just one of the two encodings simplify the
> code significantly?

There’s also the issue that -UTF8String is one of the most commonly used 
methods on NSString, so if we represent something as UTF-16 internally then it 
needs converting and returning in an autoreleased buffer, whereas with a UTF-8 
string it can just return the pointer.  On non-Windows platforms, -UTF8String 
is the way of getting a string that you pass to pretty much any OS function.

> 
> > * I would also think that it makes more sense to have the length/count 
> > variable before the data pointer. I don't have a strong opinion about this 
> > one, but it just makes more sense in my head.
> 
> Again, this gives us more padding in the structure.
> 
> Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit 
> architectures (except WIN64)? I th

Re: New ABI NSConstantString

2018-04-05 Thread Stefan Bidigaray
On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall 
wrote:

> On 5 Apr 2018, at 17:27, Stefan Bidigaray  wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I
> think this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after
> the isa pointer? My thought here is that it can be checked for if different
> versions of the structure exist. This is important for CoreBase since it
> does not have the luxury of real classes.
>
> I’m concerned with structure padding here.  Even on a 64-bit platform, we
> either need an 8-byte flags field (which is wasteful) or end up with 4
> bytes of padding.  With 128-bit pointers (which are probably coming sooner
> than you expect) we will end up with 12 bytes of padding if we have a
> 32-bit flags field followed by a pointer.
>

Well, I was hoping there is a way we can define this structure so that it
can be used directly in CoreBase, without having to call the toll-free
bridging mechanism. If a 32-bit hash is used, could it be combined with the
"flags" variable (see the structure I included at the end of this email)?
I'm hoping to be able to use the same constant strings without having
to call the bridging mechanism. It's pretty slow and cumbersome.

By the way, I noticed there was not uint32_t flags in your original
structure, making it 24 bytes in 32-bit CPUs.

> * Would it be possible to make the hash variable an NSUInteger? The
> output of -hash is an NSUInteger, and that would allow the value to be
> expanded in the future.
>
> We can, though that would again increase the size quite noticeably.  I
> think I’m happy with a 32-bit hash, because as rfm points out with a decent
> hash algorithm that basically gives us unique hashes.
>

Sounds reasonable.


> > * Why have both count and length? Would it not make more sense to keep a
> single variable here called count and define it as, "The count/number of
> code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16
> it would be the # of 16-bit codes. The Apple documentation states "The
> number of UTF-16 code units in the receiver", making at least the ASCII and
> UTF-16 numbers correct. The way I understand the current implementation,
> the value for length would return the UTF-32 # of characters, which is
> inconsistent with the docs.
>
> If a UTF-8 string contains multi-byte sequences, then the length of the
> buffer and the number of UTF-16 code units will be different.  If we know
> the number of bytes, then we can use more efficient C standard library
> functions for things like comparisons, though that may not be important.
>

I guess I'm still a bit confused about the meaning of and/or difference
between the variables count and length.

I know this is probably going to be rejected, but how about making constant
strings either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know
this would increase the byte count for most European languages using Latin
characters, but I don't see the point of maintaining both UTF-8 and UTF-16
encodings. Everything that can be done with UTF-16 can be encoded in UTF-8
(and vice versa), so how would the compiler pick between the two?
Additionally, wouldn't sticking to just one of the two encodings simplify the
code significantly?

> * I would also think that it makes more sense to have the length/count
> variable before the data pointer. I don't have a strong opinion about this
> one, but it just makes more sense in my head.
>
> Again, this gives us more padding in the structure.
>

Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit
architectures (except WIN64)? I thought a long would not be padded any more
than a pointer for most applications.

>
> > Regarding the hash function:
> > Why are we using Murmur3 hash? I know it is significantly more efficient
> than our current one-at-a-time approach, but how much better is it than
> competing hash functions? Is there a benchmark out there comparing some of
> the major ones? For example, how does it compare with lookup3 or
> SpookyHash? If we are storing the hash in the string structure, the speed
> of calculating the hash is not as important as the spread. Additionally,
> Murmur3 seems ill suited if NSUInteger is used to store the hash value
> since, as far as I could tell, it only outputs 32-bit and 128-bit hashes.
> Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words
> in the case of lookup3), as well.
>
> The size of the type doesn’t necessarily give us the range.  We are
> completely free to give only a 32-bit or even 28-bit range within an
> NSUInteger (which is what we do now) as long as we have good coverage.  A good
> hash function has even distribution of entropy across all bits, so taking a
> 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to
> make the hash value 8 bytes on 64-bit platforms if this seems like a good
> use of b

Re: New ABI NSConstantString

2018-04-05 Thread Ivan Vučica
Thank you, this was very informative!

On Thu, Apr 5, 2018 at 6:41 PM, David Chisnall
 wrote:
> On 5 Apr 2018, at 17:01, Ivan Vučica  wrote:
>>
>> Layman question: does it make sense to optimize for space, too, and have a 
>> smaller structure for tiny constant strings?
>
> With the new ABI, we get much better deduplication across compilation units 
> for selectors and protocols, which should extend to constant strings.
>
> At run time, on 64-bit platforms, we generate GSTinyString instances, which 
> are 64 bits and are hidden inside a pointer.  I’m tempted to make the 
> compiler generate those directly.
>
>> For 32bit ptrs and longs, this would be 20 bytes without the string itself. 
>> I don't think that's a lot, but I thought I'd ask.
>
> 20 bytes isn’t too bad, 36 (for 64-bit platforms) is a bit more.  On a 
> CHERI-like platform, it grows to 52 bytes, which starts to feel a bit 
> excessive.
>
> The absolute minimum structure is an isa pointer immediately followed by the 
> character data, with a null terminator.  That’s not a great idea, because the 
> isa pointer needs to be mutable, which would make the constant string also 
> accidentally mutable.
>
> The next smallest would be an isa pointer and a null-terminated string 
> pointer, so 8 / 16 / 32 bytes on the respective architectures.
>
> The cost of recomputing the hash is sufficiently expensive that it’s probably 
> worth using at least the 28 bits that we provide already for string hashes.
>
> I’ve done some measurements in -base.  In the compiled binary, we have a 
> total of 84976 bytes of strings, in 3307 strings, so an average of just under 
> 26 bytes per string, so 36 bytes of overhead seems quite a lot, and even 20 
> is quite noticeable.  If we exclude strings of 8 or fewer characters, this 
> gives us 81637 bytes in 2586 strings, so an average length of just under 32 
> bytes, so 36 bytes is still more than 100% overhead and adds up to about 90KB 
> in the final binary.
>
> With the current encoding, each constant string is 24 bytes, so that adds up 
> to about 60KB (excluding the string data itself) on 64-bit platforms.  That’s 
> about 0.5% of the total binary size, so I’m not too worried about making it 
> bigger.  Even making it 80KB is a lot of overhead per string (roughly 100%), 
> but isn’t that much of the total binary size.
>
>
> David
>



Re: New ABI NSConstantString

2018-04-05 Thread David Chisnall
On 5 Apr 2018, at 17:01, Ivan Vučica  wrote:
> 
> Layman question: does it make sense to optimize for space, too, and have a 
> smaller structure for tiny constant strings?

With the new ABI, we get much better deduplication across compilation units for 
selectors and protocols, which should extend to constant strings.

At run time, on 64-bit platforms, we generate GSTinyString instances, which are 
64 bits and are hidden inside a pointer.  I’m tempted to make the compiler 
generate those directly.
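The idea of hiding a short string inside a pointer-sized word can be sketched as follows. This is an illustration only: the tag value, the bit positions, and all names are assumptions made for the example and are not claimed to match the real GSTinyString encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TINY_TAG        UINT64_C(0x4)   /* hypothetical tag in the low 3 bits */
#define TINY_TAG_MASK   UINT64_C(0x7)
#define TINY_MAX_CHARS  8

/* Packs up to 8 seven-bit ASCII characters into one 64-bit word.
 * Returns 0 if the string does not fit. */
static uint64_t tiny_pack(const char *s)
{
    size_t len = strlen(s);
    if (len > TINY_MAX_CHARS)
        return 0;
    uint64_t word = TINY_TAG | ((uint64_t)len << 3);   /* bits 3..6: length */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c > 0x7f)
            return 0;                                  /* not 7-bit ASCII */
        word |= (uint64_t)c << (7 + 7 * i);            /* bits 7..62: characters */
    }
    return word;
}

/* Reads back the stored length (0..8). */
static size_t tiny_length(uint64_t word)
{
    return (size_t)((word >> 3) & 0xf);
}

/* Reads back the character at index i; i must be less than tiny_length(word). */
static char tiny_char_at(uint64_t word, size_t i)
{
    return (char)((word >> (7 + 7 * i)) & 0x7f);
}
```

A string stored this way needs no separate structure at all, which is why short strings are excluded from the overhead measurements later in this message.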

> For 32bit ptrs and longs, this would be 20 bytes without the string itself. I 
> don't think that's a lot, but I thought I'd ask.

20 bytes isn’t too bad, 36 (for 64-bit platforms) is a bit more.  On a 
CHERI-like platform, it grows to 52 bytes, which starts to feel a bit excessive.

The absolute minimum structure is an isa pointer immediately followed by the 
character data, with a null terminator.  That’s not a great idea, because the 
isa pointer needs to be mutable, which would make the constant string also 
accidentally mutable.

The next smallest would be an isa pointer and a null-terminated string pointer, 
so 8 / 16 / 32 bytes on the respective architectures.

The cost of recomputing the hash is sufficiently expensive that it’s probably 
worth using at least the 28 bits that we provide already for string hashes.  

I’ve done some measurements in -base.  In the compiled binary, we have a total 
of 84976 bytes of strings, in 3307 strings, so an average of just under 26 
bytes per string, so 36 bytes of overhead seems quite a lot, and even 20 is 
quite noticeable.  If we exclude strings of 8 or fewer characters, this gives 
us 81637 bytes in 2586 strings, so an average length of just under 32 bytes, so 
36 bytes is still more than 100% overhead and adds up to about 90KB in the 
final binary.  

With the current encoding, each constant string is 24 bytes, so that adds up to 
about 60KB (excluding the string data itself) on 64-bit platforms.  That’s 
about 0.5% of the total binary size, so I’m not too worried about making it 
bigger.  Even making it 80KB is a lot of overhead per string (roughly 100%), 
but isn’t that much of the total binary size.
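The arithmetic behind these estimates can be reproduced with a short sketch. The constants are the figures quoted above; the helper names are hypothetical, and applying the per-string header size to the 2586 longer strings is an assumption about how the roughly 90KB and 60KB totals were derived.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the message above. */
enum {
    ALL_BYTES    = 84976, ALL_STRINGS  = 3307,
    LONG_BYTES   = 81637, LONG_STRINGS = 2586
};

/* Mean string length in bytes. */
static double average_length(long bytes, long strings)
{
    return (double)bytes / (double)strings;
}

/* Total bytes spent on per-string headers, excluding the character data. */
static long total_overhead(long strings, long bytes_per_header)
{
    return strings * bytes_per_header;
}
```

With these inputs, `total_overhead(LONG_STRINGS, 36)` is 93096 bytes and `total_overhead(LONG_STRINGS, 24)` is 62064 bytes, in line with the approximate totals given in the text.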


David




Re: New ABI NSConstantString

2018-04-05 Thread David Chisnall
On 5 Apr 2018, at 17:27, Stefan Bidigaray  wrote:
> 
> Hi David,
> I forgot to make a comment when you originally posted the idea, and I think 
> this would be a great time to add my 2 cents.
> 
> Regarding the structure:
> * Would it not be better to add the flags bit field immediately after the isa 
> pointer? My thought here is that it can be checked for if different versions 
> of the structure exist. This is important for CoreBase since it does not have 
> the luxury of real classes.

I’m concerned with structure padding here.  Even on a 64-bit platform, we 
either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of 
padding.  With 128-bit pointers (which are probably coming sooner than you 
expect) we will end up with 12 bytes of padding if we have a 32-bit flags field 
followed by a pointer.
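The padding can be measured with `offsetof`. This is a minimal sketch under stated assumptions: the struct is a hypothetical layout with the flags word directly after `isa`, `void *` stands in for the class pointer, and pointers are assumed to be aligned to their own size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with the flags word directly after isa. */
struct flags_after_isa {
    void       *isa;
    uint32_t    flags;
    const char *data;
};

/* Bytes the compiler inserts between the end of flags and the start of data. */
static size_t padding_before_data(void)
{
    size_t end_of_flags = offsetof(struct flags_after_isa, flags) + sizeof(uint32_t);
    return offsetof(struct flags_after_isa, data) - end_of_flags;
}
```

On a typical 64-bit target this returns 4, and with 16-byte pointers the same reasoning gives 12.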

> * Would it be possible to make the hash variable an NSUInteger? The output of
> -hash is an NSUInteger, and that would allow the value to be expanded in the
> future.

We can, though that would again increase the size quite noticeably.  I think 
I’m happy with a 32-bit hash, because as rfm points out with a decent hash 
algorithm that basically gives us unique hashes.

> * Why have both count and length? Would it not make more sense to keep a 
> single variable here called count and define it as, "The count/number of code 
> units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would 
> be the # of 16-bit codes. The Apple documentation states "The number of 
> UTF-16 code units in the receiver", making at least the ASCII and UTF-16 
> numbers correct. The way I understand the current implementation, the value 
> for length would return the UTF-32 # of characters, which is inconsistent 
> with the docs.

If a UTF-8 string contains multi-byte sequences, then the length of the buffer 
and the number of UTF-16 code units will be different.  If we know the number
of bytes, then we can use more efficient C standard library functions for 
things like comparisons, though that may not be important.
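The difference between the two numbers can be shown with a short sketch that counts UTF-16 code units in a UTF-8 buffer. The function name is hypothetical and the input is assumed to be well-formed UTF-8; no validation is attempted.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the UTF-16 code units needed for `len` bytes of well-formed UTF-8. */
static size_t utf16_units_in_utf8(const unsigned char *buf, size_t len)
{
    size_t units = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        if ((b & 0xC0) == 0x80)
            continue;          /* continuation byte: part of a previous character */
        units++;               /* every lead byte starts at least one code unit */
        if ((b & 0xF8) == 0xF0)
            units++;           /* 4-byte sequence: needs a surrogate pair */
    }
    return units;
}
```

For pure ASCII the two values coincide, but a two-byte character such as U+00E9 adds two bytes and only one code unit, and a four-byte character adds four bytes and two code units.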

> * I would also think that it makes more sense to have the length/count 
> variable before the data pointer. I don't have a strong opinion about this 
> one, but it just makes more sense in my head.

Again, this gives us more padding in the structure.

> 
> Regarding the hash function:
> Why are we using Murmur3 hash? I know it is significantly more efficient than 
> our current one-at-a-time approach, but how much better is it than competing
> hash functions? Is there a benchmark out there comparing some of the major
> ones? For example, how does it compare with lookup3 or SpookyHash? If we are
> storing the hash in the string structure, the speed of calculating the hash 
> is not as important as the spread. Additionally, Murmur3 seems ill suited if 
> NSUInteger is used to store the hash value since, as far as I could tell, it 
> only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, 
> output 64-bit values (2 32-bit words in the case of lookup3), as well.

The size of the type doesn’t necessarily give us the range.  We are completely 
free to give only a 32-bit or even 28-bit range within an NSUInteger (which is 
what we do now) as long as we have good coverage.  A good hash function has even
distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and 
truncating it is fine.  That said, I’m happy to make the hash value 8 bytes on 
64-bit platforms if this seems like a good use of bits.
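As a sketch of what truncation means here, the helpers below keep the low 28 bits of a wider hash and return a stored 32-bit hash in a pointer-sized integer. The names are hypothetical, `uintptr_t` is used as a stand-in for `NSUInteger`, and keeping the low bits rather than any other 28 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_28_MASK UINT32_C(0x0FFFFFFF)

/* Keeps the low 28 bits of a wider hash. */
static uint32_t truncate_hash_28(uint64_t wide_hash)
{
    return (uint32_t)(wide_hash & HASH_28_MASK);
}

/* Returns the stored 32-bit hash zero-extended to a pointer-sized integer. */
static uintptr_t hash_as_word(uint32_t stored_hash)
{
    return (uintptr_t)stored_hash;
}
```

If the hash function distributes entropy evenly, which bits are kept should not matter for bucket selection.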

I’m not wedded to the idea of Murmur3.  We do need to use the same hash for 
constant and non-constant strings, so execution speed is important.  I’m 
somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with 
SSE and newer CPUs have full hardware offload for it.  That said, the goal is 
not to mandate the use of the compiler-generated hash for constant strings, 
it’s to provide a space to store one that the compiler initialises to something 
sensible.

Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming 
space to improve performance.

David


Re: New ABI NSConstantString

2018-04-05 Thread Stefan Bidigaray
Hi David,
I forgot to make a comment when you originally posted the idea, and I think
this would be a great time to add my 2 cents.

Regarding the structure:
* Would it not be better to add the flags bit field immediately after the
isa pointer? My thought here is that it can be checked for if different
versions of the structure exist. This is important for CoreBase since it
does not have the luxury of real classes.
* Would it be possible to make the hash variable an NSUInteger? The output
of -hash is an NSUInteger, and that would allow the value to be expanded
in the future.
* Why have both count and length? Would it not make more sense to keep a
single variable here called count and define it as, "The count/number of
code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16
it would be the # of 16-bit codes. The Apple documentation states "The
number of UTF-16 code units in the receiver", making at least the ASCII and
UTF-16 numbers correct. The way I understand the current implementation,
the value for length would return the UTF-32 # of characters, which is
inconsistent with the docs.
* I would also think that it makes more sense to have the length/count
variable before the data pointer. I don't have a strong opinion about this
one, but it just makes more sense in my head.

Regarding the hash function:
Why are we using Murmur3 hash? I know it is significantly more efficient
than our current one-at-a-time approach, but how much better is it than
competing hash functions? Is there a benchmark out there comparing some of
the major ones? For example, how does it compare with lookup3 or
SpookyHash? If we are storing the hash in the string structure, the speed
of calculating the hash is not as important as the spread. Additionally,
Murmur3 seems ill suited if NSUInteger is used to store the hash value
since, as far as I could tell, it only outputs 32-bit and 128-bit hashes.
Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words
in the case of lookup3), as well.

I'm late for work, so I have to wrap up.

Stefan

On Thu, Apr 5, 2018 at 11:24 AM, David Chisnall 
wrote:

> On 1 Apr 2018, at 14:06, Richard Frith-Macdonald wrote:
> >
> >
> > I wasn't aware of that ... it would make sense for your new ABI, when
> individual bits, to have them specified as particular bits rather than as a
> bitfield, avoiding the possibility of problems with different compilers.
> >
> > I don't think you should feel constrained to follow the current layout
> ... IMO the current one is good for years yet but probably not for decades.
> > However, I do think that it's more sensible to have pointer, count,
> hash, and flags similar to the current GNUstep layout than to follow Apple
> (and to bear in mind that it's convenient for mutable strings to share a
> layout with constant ones).
>
> How about this:
>
> struct {
> // Class pointer
> id isa;
> // Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated
> const char *data;
> // Number of characters, not including the null terminator
> long count;
> // Number of bytes in the encoding, not including the null terminator.
> long length;
> // Murmur 3 hash
> uint32_t hash;
> // Flags bitfield:
> // Low 2 bits, enum with values:
> //   0: ASCII string
> //   1: UTF-8 but not ASCII string
> //   2: UTF-16 string
> //   3: Reserved for future encodings
> // (1<<2): has murmur3 hash
> // (1<<3) to (1<<15): Reserved for future compiler-defined flags
> // (1<<16) to (1<<31): Reserved for use by the constant string class
> };
>
> I think that this should give everything that we need, plus room for easy
> future expansion.
>
> David
>
>


Re: New ABI NSConstantString

2018-04-05 Thread Ivan Vučica
Layman question: does it make sense to optimize for space, too, and have a
smaller structure for tiny constant strings?

For 32bit ptrs and longs, this would be 20 bytes without the string itself.
I don't think that's a lot, but I thought I'd ask.

On Thu, Apr 5, 2018, 16:25 David Chisnall  wrote:

> On 1 Apr 2018, at 14:06, Richard Frith-Macdonald <
> richard.frith-macdon...@theengagehub.com> wrote:
> >
> >
> > I wasn't aware of that ... it would make sense for your new ABI, when
> individual bits, to have them specified as particular bits rather than as a
> bitfield, avoiding the possibility of problems with different compilers.
> >
> > I don't think you should feel constrained to follow the current layout
> ... IMO the current one is good for years yet but probably not for decades.
> > However, I do think that it's more sensible to have pointer, count,
> hash, and flags similar to the current GNUstep layout than to follow Apple
> (and to bear in mind that it's convenient for mutable strings to share a
> layout with constant ones).
>
> How about this:
>
> struct {
> // Class pointer
> id isa;
> // Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated
> const char *data;
> // Number of characters, not including the null terminator
> long count;
> // Number of bytes in the encoding, not including the null terminator.
> long length;
> // Murmur 3 hash
> uint32_t hash;
> // Flags bitfield:
> // Low 2 bits, enum with values:
> //   0: ASCII string
> //   1: UTF-8 but not ASCII string
> //   2: UTF-16 string
> //   3: Reserved for future encodings
> // (1<<2): has murmur3 hash
> // (1<<3) to (1<<15): Reserved for future compiler-defined flags
> // (1<<16) to (1<<31): Reserved for use by the constant string class
> };
>
> I think that this should give everything that we need, plus room for easy
> future expansion.
>
> David
>
>


Re: New ABI NSConstantString

2018-04-05 Thread David Chisnall
On 1 Apr 2018, at 14:06, Richard Frith-Macdonald 
 wrote:
> 
> 
> I wasn't aware of that ... it would make sense for your new ABI, when 
> individual bits, to have them specified as particular bits rather than as a 
> bitfield, avoiding the possibility of problems with different compilers.
> 
> I don't think you should feel constrained to follow the current layout ... 
> IMO the current one is good for years yet but probably not for decades.
> However, I do think that it's more sensible to have pointer, count, hash, and 
> flags similar to the current GNUstep layout than to follow Apple (and to bear 
> in mind that it's convenient for mutable strings to share a layout with
> constant ones).

How about this:

struct {
// Class pointer
id isa;
// Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated
const char *data;
// Number of characters, not including the null terminator
long count;
// Number of bytes in the encoding, not including the null terminator.
long length;
// Murmur 3 hash
uint32_t hash;
// Flags bitfield:
// Low 2 bits, enum with values:
//   0: ASCII string
//   1: UTF-8 but not ASCII string
//   2: UTF-16 string
//   3: Reserved for future encodings
// (1<<2): has murmur3 hash
// (1<<3) to (1<<15): Reserved for future compiler-defined flags
// (1<<16) to (1<<31): Reserved for use by the constant string class
};

I think that this should give everything that we need, plus room for easy 
future expansion.
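Following the suggestion to specify particular bits rather than a C bitfield, the flag layout described in the comments could be expressed as explicit masks. This is only a sketch: all identifier names are hypothetical, and it assumes the flags live in a single 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Encoding lives in the low two bits of the flags word. */
enum string_encoding {
    ENC_ASCII    = 0,
    ENC_UTF8     = 1,   /* UTF-8 that is not pure ASCII */
    ENC_UTF16    = 2,
    ENC_RESERVED = 3    /* reserved for future encodings */
};

#define FLAG_ENCODING_MASK  UINT32_C(0x00000003)
#define FLAG_HAS_HASH       (UINT32_C(1) << 2)
#define FLAG_COMPILER_MASK  UINT32_C(0x0000FFF8)  /* bits 3..15 */
#define FLAG_CLASS_MASK     UINT32_C(0xFFFF0000)  /* bits 16..31 */

/* Extracts the encoding from the low two bits. */
static enum string_encoding flags_encoding(uint32_t flags)
{
    return (enum string_encoding)(flags & FLAG_ENCODING_MASK);
}

/* Returns 1 when the compiler has stored a precomputed hash. */
static int flags_has_hash(uint32_t flags)
{
    return (flags & FLAG_HAS_HASH) != 0;
}
```

Explicit masks give the same bit positions on every compiler, whereas the placement of C bitfield members is implementation-defined.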

David

