This might be slightly confusing, because your mail client doesn’t seem to do anything sane for quoting:
On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote: > > On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <gnus...@theravensnest.org> > wrote: > On 5 Apr 2018, at 17:27, Stefan Bidigaray <stefanb...@gmail.com> wrote: > > > > Hi David, > > I forgot to make a comment when you originally posted the idea, and I think > > this would be a great time to add my 2 cents. > > > > Regarding the structure: > > * Would it not be better to add the flags bit field immediately after the > > isa pointer? My thought here is that it can be checked for if different > > versions of the structure exist. This is important for CoreBase since it > > does not have the luxury of real classes. > > I’m concerned with structure padding here. Even on a 64-bit platform, we > either need an 8-byte flags field (which is wasteful) or end up with 4 bytes > of padding. With 128-bit pointers (which are probably coming sooner than you > expect) we will end up with 12 bytes of padding if we have a 32-bit flags > field followed by a pointer. > > Well, I was hoping there is a way we can define this structure so that it can > be used directly in CoreBase, without having to call the toll-free bridging > mechanism. If a 32-bit hash is used, could it be combined with the "flags" > variable (see the structure I included at the end of this email)? I'm hoping > to be able to have use the same constant strings without having to call the > bridging mechanism. It's pretty slow and cumbersome. Can you explain why CoreBase needs to store the hash as anything other than a 32-bit value that it can zero extend when returning a 64-bit value? It the CoreFoundation and Foundation implementations of hash are compatible, then it will currently be returning a 28-bit value in a 64-bit register, so I don’t understand the issue here. > > By the way, I noticed there was not uint32_t flags in your original > structure, making it 24 bytes in 32-bit CPUs. > > > * Would it be possible to make the hash variable a NSUInterger? The output > > of -hash is an NSUInterger, and that would allow the value to be expanded > > in the future. > > We can, though that would again increase the size quite noticeably. I think > I’m happy with a 32-bit hash, because as rfm points out with a decent hash > algorithm that basically gives us unique hashes. > > Sounds reasonable. > > > * Why have both count and length? Would it not make more sense to keep a > > single variable here called count and define it as, "The count/number of > > code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 > > it would be the # of 16-bit codes. The Apple documentation states "The > > number of UTF-16 code units in the receiver", making at least the ASCII and > > UTF-16 numbers correct. The way I understand the current implementation, > > the value for length would return the UTF-32 # of characters, which is > > inconsistent with the docs. > > If a UTF-8 string contains multi-byte sequences, then the length of the > buffer and the number if UTF-16 code units will be different. If we know the > number of bytes, then we can use more efficient C standard library functions > for things like comparisons, though that may not be important. > > I guess I'm still a bit confused about the meaning and/or different of the > variables count and length. One tells you the logical number of characters, the other the length of the buffer in bytes. A lot of bytes-scanning functions are far more efficient if they know the length up front, because they can then process one word at a time until the last word. > I know this is probably going to be rejected, but how about making constant > string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this > would increase the byte count for most European languages using Latin > characters, but I don't see the point of maintaining both UTF-8 and UTF-16 > encoding. Everything that can be done with UTF-16 can be encoded in UTF-8 > (and vise-versa), so how would the compiler pick between the two? > Additionally, wouldn't sticking to just 1 of the 2 encoding simplify the code > significantly? There’s also the issue that -UTF8String is one of the most commonly used methods on NSString, so if we represent something as UTF-16 internally then it needs converting and returning in an autoreleased buffer, whereas with a UTF-8 string it can just return the pointer. On non-Windows platforms, -UTF8String is the way of getting a string that you pass to pretty much any OS function. > > > * I would also think that it makes more sense to have the length/count > > variable before the data pointer. I don't have a strong opinion about this > > one, but it just makes more sense in my head. > > Again, this gives us more padding in the structure. > > Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit > architectures (except WIN64)? I thought a long would not be padded any more > than a pointer for most applications. Not Win64, not on anything with larger than 64-bit pointers. > > > > Regarding the hash function: > > Why are we using Murmur3 hash? I know it is significantly more efficient > > than our current one-at-a-time approach, but how much better is it to > > competing hash functions? Is there a bench mark out there comparing some of > > the major ones? For example, how does it compare with lookup3 or > > SpookyHash. If we are storing the hash in the string structure, the speed > > of calculating the hash is not as important as the spread. Additionally, > > Murmur3 seems ill suited if NSUInteger is used to store the hash value > > since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. > > Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words > > in the case of lookup3), as well. > > The size of the type doesn’t necessarily give us the range. We are > completely free to give only a 32-bit or even 28-bit range within an > NSUInteger (which is what we do now) and if we have good coverage. A good > hash function has even distribution of entropy across all bits, so taking a > 32-bit or 128-bit hash and truncating it is fine. That said, I’m happy to > make the hash value 8 bytes on 64-bit platforms if this seems like a good use > of bits. > > I’m not wedded to the idea of Murmur3. We do need to use the same hash for > constant and non-constant strings, so execution speed is important. I’m > somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate > with SSE and newer CPUs have full hardware offload for it. That said, the > goal is not to mandate the use of the compiler-generated hash for constant > strings, it’s to provide a space to store one that the compiler initialises > to something sensible. > > Given the analysis I’ve done in the reply to Ivan, I think it’s worth > consuming space to improve performance. > > I agree. > > So how about a structure like: > > struct { > id isa; /* Class pointer */ > uint64_t flags; > /* Flags bitfield: > Low 2 bits, enum with values: > 0: ASCII string > 1: UTF-16 string > 2 and 3: Reserved for future encodings > (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3 > reserved for future hashes > (1<<4) to (1<<15): Reserved for future compiler-defined flags > (1<<16) to (1<<31): Reserved for use by the constant string class > (I'm hoping this could hold the CFTypeID of a constant string so it can be > identified by corebase) > (1<<32) to (1<<63): hash > */ > const char *data; /* Pointer to the buffer. ro_data section, so > immutable. NULL-terminated */ > long count; /* Number of UTF-16 code units, not including the null > terminator */ > } I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts and I don’t like the ordering (it will be annoying to have to order the fields differently on 128-bit pointer platforms). I’m not convinced that it’s worth omitting the length to save 8 bytes per string. It’s probably also not actually worth using longs for the length on 64-bit platforms, so both of these should probably be 32 bits. 4GB of string literal seems a bit excessive (for one thing, I doubt the compiler will be entirely happy with it, and I don’t know happy linkers are with 4GB symbols…). David _______________________________________________ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev