I use the Gmail web interface, which is not great for quoting, so I'll just comment without quoting.
The thing I'm trying to address is that every CF object must start with:

struct {
  void *isa;
  uint32_t info;
};

That 32-bit info value holds the CFTypeID (a 16-bit value) plus 16 bits
for general/restricted use. If that 32-bit (or possibly 64-bit) field
were laid out the same way in constant strings, CFString functions could
work directly on ObjC constant strings instead of having to go through
the toll-free bridging mechanism. That would be much more efficient for
container objects in corebase.

Just to be clear, the CFString structure is currently:

struct {
  void *isa;
  uint32_t info;
  char *data;
  long count;
  long hash;
  void *allocator;
};

If the ObjC constant string structure and the CFString structure were
similar, they could be used interchangeably in corebase and base. So my
proposal was to arrange the leading portion of the new constant string
structure as:

struct {
  void *isa;
  uint64_t info; /* includes both info and hash */
  char *data;
  long count;
};

If I modified the corebase version to match, these structures, with a
little help from libobjc, could be exactly the same.

On Thu, Apr 5, 2018 at 3:33 PM, David Chisnall <gnus...@theravensnest.org> wrote:

> This might be slightly confusing, because your mail client doesn’t seem
> to do anything sane for quoting:
>
> On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> >
> > On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <gnus...@theravensnest.org> wrote:
> > On 5 Apr 2018, at 17:27, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> > >
> > > Hi David,
> > > I forgot to make a comment when you originally posted the idea,
> > > and I think this would be a great time to add my 2 cents.
> > >
> > > Regarding the structure:
> > > * Would it not be better to add the flags bit field immediately
> > > after the isa pointer? My thought here is that it can be checked
> > > for if different versions of the structure exist. This is
> > > important for CoreBase since it does not have the luxury of real
> > > classes.
> >
> > I’m concerned with structure padding here. Even on a 64-bit
> > platform, we either need an 8-byte flags field (which is wasteful)
> > or end up with 4 bytes of padding. With 128-bit pointers (which are
> > probably coming sooner than you expect) we will end up with 12 bytes
> > of padding if we have a 32-bit flags field followed by a pointer.
> >
> > Well, I was hoping there is a way we can define this structure so
> > that it can be used directly in CoreBase, without having to call the
> > toll-free bridging mechanism. If a 32-bit hash is used, could it be
> > combined with the "flags" variable (see the structure I included at
> > the end of this email)? I'm hoping to be able to use the same
> > constant strings without having to call the bridging mechanism. It's
> > pretty slow and cumbersome.
>
> Can you explain why CoreBase needs to store the hash as anything other
> than a 32-bit value that it can zero extend when returning a 64-bit
> value? If the CoreFoundation and Foundation implementations of hash are
> compatible, then it will currently be returning a 28-bit value in a
> 64-bit register, so I don’t understand the issue here.
>
> >
> > By the way, I noticed there was no uint32_t flags in your original
> > structure, making it 24 bytes on 32-bit CPUs.
> >
> > > * Would it be possible to make the hash variable an NSUInteger?
> > > The output of -hash is an NSUInteger, and that would allow the
> > > value to be expanded in the future.
> >
> > We can, though that would again increase the size quite noticeably.
> > I think I’m happy with a 32-bit hash, because as rfm points out with
> > a decent hash algorithm that basically gives us unique hashes.
> >
> > Sounds reasonable.
> >
> > > * Why have both count and length? Would it not make more sense to
> > > keep a single variable here called count and define it as, "The
> > > count/number of code units"? For ASCII and UTF-8 this would be #
> > > of bytes, and for UTF-16 it would be the # of 16-bit codes. The
> > > Apple documentation states "The number of UTF-16 code units in the
> > > receiver", making at least the ASCII and UTF-16 numbers correct.
> > > The way I understand the current implementation, the value for
> > > length would return the UTF-32 # of characters, which is
> > > inconsistent with the docs.
> >
> > If a UTF-8 string contains multi-byte sequences, then the length of
> > the buffer and the number of UTF-16 code units will be different. If
> > we know the number of bytes, then we can use more efficient C
> > standard library functions for things like comparisons, though that
> > may not be important.
> >
> > I guess I'm still a bit confused about the meaning and/or difference
> > of the variables count and length.
>
> One tells you the logical number of characters, the other the length of
> the buffer in bytes. A lot of byte-scanning functions are far more
> efficient if they know the length up front, because they can then
> process one word at a time until the last word.
>
> > I know this is probably going to be rejected, but how about making
> > constant strings either ASCII or UTF-16 only? Scratching UTF-8
> > altogether? I know this would increase the byte count for most
> > European languages using Latin characters, but I don't see the point
> > of maintaining both UTF-8 and UTF-16 encoding. Everything that can
> > be done with UTF-16 can be encoded in UTF-8 (and vice versa), so how
> > would the compiler pick between the two? Additionally, wouldn't
> > sticking to just 1 of the 2 encodings simplify the code
> > significantly?
>
> There’s also the issue that -UTF8String is one of the most commonly
> used methods on NSString, so if we represent something as UTF-16
> internally then it needs converting and returning in an autoreleased
> buffer, whereas with a UTF-8 string it can just return the pointer. On
> non-Windows platforms, -UTF8String is the way of getting a string that
> you pass to pretty much any OS function.
>
> >
> > > * I would also think that it makes more sense to have the
> > > length/count variable before the data pointer. I don't have a
> > > strong opinion about this one, but it just makes more sense in my
> > > head.
> >
> > Again, this gives us more padding in the structure.
> >
> > Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and
> > 64-bit architectures (except WIN64)? I thought a long would not be
> > padded any more than a pointer for most applications.
>
> Not on Win64, and not on anything with larger than 64-bit pointers.
>
> > >
> > > Regarding the hash function:
> > > Why are we using Murmur3 hash? I know it is significantly more
> > > efficient than our current one-at-a-time approach, but how much
> > > better is it than competing hash functions? Is there a benchmark
> > > out there comparing some of the major ones? For example, how does
> > > it compare with lookup3 or SpookyHash? If we are storing the hash
> > > in the string structure, the speed of calculating the hash is not
> > > as important as the spread. Additionally, Murmur3 seems ill suited
> > > if NSUInteger is used to store the hash value since, as far as I
> > > could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and
> > > SpookyHash, for example, output 64-bit values (2 32-bit words in
> > > the case of lookup3), as well.
> >
> > The size of the type doesn’t necessarily give us the range. We are
> > completely free to give only a 32-bit or even 28-bit range within an
> > NSUInteger (which is what we do now) as long as we have good
> > coverage. A good hash function has even distribution of entropy
> > across all bits, so taking a 32-bit or 128-bit hash and truncating
> > it is fine. That said, I’m happy to make the hash value 8 bytes on
> > 64-bit platforms if this seems like a good use of bits.
> >
> > I’m not wedded to the idea of Murmur3. We do need to use the same
> > hash for constant and non-constant strings, so execution speed is
> > important. I’m somewhat tempted to suggest SHA256, because it’s
> > fairly easy to accelerate with SSE and newer CPUs have full hardware
> > offload for it. That said, the goal is not to mandate the use of the
> > compiler-generated hash for constant strings, it’s to provide a
> > space to store one that the compiler initialises to something
> > sensible.
> >
> > Given the analysis I’ve done in the reply to Ivan, I think it’s
> > worth consuming space to improve performance.
> >
> > I agree.
> >
> > So how about a structure like:
> >
> > struct {
> >   id isa; /* Class pointer */
> >   uint64_t flags;
> >   /* Flags bitfield:
> >      Low 2 bits, enum with values:
> >        0: ASCII string
> >        1: UTF-16 string
> >        2 and 3: Reserved for future encodings
> >      (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash;
> >        2 and 3 reserved for future hashes
> >      (1<<4) to (1<<15): Reserved for future compiler-defined flags
> >      (1<<16) to (1<<31): Reserved for use by the constant string
> >        class (I'm hoping this could hold the CFTypeID of a constant
> >        string so it can be identified by corebase)
> >      (1<<32) to (1<<63): hash
> >   */
> >   const char *data; /* Pointer to the buffer. ro_data section, so
> >                        immutable. NULL-terminated */
> >   long count; /* Number of UTF-16 code units, not including the
> >                  null terminator */
> > }
>
> I don’t see why we’d use a single uint64_t rather than a pair of
> uint32_ts and I don’t like the ordering (it will be annoying to have to
> order the fields differently on 128-bit pointer platforms). I’m not
> convinced that it’s worth omitting the length to save 8 bytes per
> string. It’s probably also not actually worth using longs for the
> length on 64-bit platforms, so both of these should probably be 32
> bits. 4GB of string literal seems a bit excessive (for one thing, I
> doubt the compiler will be entirely happy with it, and I don’t know how
> happy linkers are with 4GB symbols…).
>
> David
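P.S. To make the combined info/hash idea a bit more concrete, here is a rough sketch in C. It assumes the bit positions from the flags layout quoted above (encoding in bits 0-1, hash algorithm in bits 2-3, CFTypeID in bits 16-31, hash in bits 32-63). The names const_string, info_pack, info_type_id and so on are made up for illustration; none of them are existing libobjc or corebase API, and the real layout is still under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared header for constant strings. The field order follows
   the proposal in this thread; the type name is an assumption. */
struct const_string {
    void       *isa;   /* class pointer */
    uint64_t    info;  /* encoding, hash algorithm, CFTypeID and hash */
    const char *data;  /* pointer to the character buffer */
    long        count; /* number of code units */
};

/* Both libraries would rely on isa being the first member. */
static_assert(offsetof(struct const_string, isa) == 0, "isa must come first");

/* Pack the four fields into one 64-bit word (bit positions assumed from
   the quoted layout). */
static inline uint64_t info_pack(unsigned encoding, unsigned hash_algo,
                                 uint16_t type_id, uint32_t hash)
{
    return (uint64_t)(encoding & 0x3u)            /* bits 0-1   */
         | ((uint64_t)(hash_algo & 0x3u) << 2)    /* bits 2-3   */
         | ((uint64_t)type_id << 16)              /* bits 16-31 */
         | ((uint64_t)hash << 32);                /* bits 32-63 */
}

static inline unsigned info_encoding(uint64_t info)
{
    return (unsigned)(info & 0x3u);
}

static inline unsigned info_hash_algo(uint64_t info)
{
    return (unsigned)((info >> 2) & 0x3u);
}

static inline uint16_t info_type_id(uint64_t info)
{
    return (uint16_t)(info >> 16);
}

static inline uint32_t info_hash(uint64_t info)
{
    return (uint32_t)(info >> 32);
}

/* What a CF-style function could do to recognise a constant string
   without going through the bridging mechanism: read the type ID straight
   out of the shared header. */
static inline int has_type_id(const void *obj, uint16_t type_id)
{
    const struct const_string *s = obj;
    return info_type_id(s->info) == type_id;
}
```

With something like this, a CFString function handed an ObjC constant string would only need has_type_id() plus direct reads of data and count; the hash comes out of the same word via info_hash().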
_______________________________________________
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev