Re: [PATCH v2 01/10] Define a structure for object IDs.
Duy Nguyen pclo...@gmail.com writes: Notice that the first time pack-obj[] is filled using lookup_object(). So yes, the hash table has all the pointers that pack-obj[] has. Are we talking about the same thing? By the hash table, I mean **obj_hash that is a hashtable that uses sha-1 form of identifier as the key. So the hash table has all the pointers sounds like all the objects you instantiate from a v4 packfile needs their object name known in the sha-1 form anyway when it gets instantiated (or more importantly, when you realize there is another copy of it already in the **obj_hash hashtable, you may have to free or refrain from creating it in the first place). As long as we use **obj_hash hashtable as all the objects this process cares about in the world and struct object_id.hash as the key into it, I do not think pack,nth representation belongs to that layer. I do think pack,nth representation has its use as a short-cut mechanism to reach an indirectly referenced object directly without instantiating intermediate objects when dealing with an extended SHA-1 expression (cf. v1.0^0:t example in a few messages upthread). I just think it does not belong to struct object_id that are used to refer to in-core objects. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
Duy Nguyen pclo...@gmail.com writes: Anyway, wouldn't this be all academic? I do not see how you would keep the object name in the pack, nth format in-core, as the obj_hash[] is a hashtable keyed by sha-1, and even when we switch to a different hash, I cannot see how such a table to ensure the singleton-ness of in-core objects can be keyed sometimes by hash and by pack, nth in some other time. I'm implementing something to see how much we gain by avoiding object lookup. The current approach is having struct object ** obj in struct packed_git, indexed by nth. So when you have pack, nth and pack-obj[nth] is valid, you'll get to struct object * without hashing. But do you realize that the hashtable serves two purposes? Grab the object from its name is one thing, and the other one I am not seeing how you will make it work with sometimes sha-1 sometimes pack,nth is to ensure that we will have only one in-core copy for the same object. We even walk the hashtable when we want to drop the flag bits from all in-core objects, so even if you instanciated an in-core object without going through the object name layer, the hashtable needs to have a pointer to such a pointer, no? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Sun, Mar 15, 2015 at 5:19 AM, Junio C Hamano gits...@pobox.com wrote: Duy Nguyen pclo...@gmail.com writes: Anyway, wouldn't this be all academic? I do not see how you would keep the object name in the pack, nth format in-core, as the obj_hash[] is a hashtable keyed by sha-1, and even when we switch to a different hash, I cannot see how such a table to ensure the singleton-ness of in-core objects can be keyed sometimes by hash and by pack, nth in some other time. I'm implementing something to see how much we gain by avoiding object lookup. The current approach is having struct object ** obj in struct packed_git, indexed by nth. So when you have pack, nth and pack-obj[nth] is valid, you'll get to struct object * without hashing. But do you realize that the hashtable serves two purposes? Grab the object from its name is one thing, and the other one I am not seeing how you will make it work with sometimes sha-1 sometimes pack,nth is to ensure that we will have only one in-core copy for the same object. We even walk the hashtable when we want to drop the flag bits from all in-core objects, so even if you instanciated an in-core object without going through the object name layer, the hashtable needs to have a pointer to such a pointer, no? Notice that the first time pack-obj[] is filled using lookup_object(). So yes, the hash table has all the pointers that pack-obj[] has. If somebody wants to remove an object out of hash table, then all pack-obj[] are invalidated, but I don't think we ever delete objects from hash table. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Fri, Mar 13, 2015 at 1:03 PM, Junio C Hamano gits...@pobox.com wrote: To me, pack, offset information smells to belong more to a struct object (or its subclass) as an optional annotation---when a caller is asked to parse_object(), you would bypass the sha1_read_file() that goes and looks the object name up from the list of pack .idx and instead go there straight using that annotation. For pack v4, commits and trees can be encoded this way. Even if your in-pack representation of a commit object allowed to store the tree pointer in pack, nth format, its object name must be computed as if you have the commit object in the traditional format and computed the hash of that (together with the standard type size\0 header), and at that point, you need the contained object's name in sha-1 format (imagine how you would implement the commit-tree command). True. But commit-tree is not often executed as, say, rev-list, where the hash is used as 20-byte key to some pieces somewhere (pack data, struct object *). If we keep pack, nth in object_id (the union approach), v4-aware code has the chance to go on a fast path. Hence, I do not think the v4 encoding changes the discussion very much. I see the primary value of v4 encoding is to shorten the length of various fields take on-disk and in-pack. If it were pack, offset, it could be argued that it would also be faster to get to the packed data in the packfile, and going from pack, nth to the .idx file and then going to the location in the data in the packfile would be faster than going from sha-1 to a particular pack and its in-pack offset, with the difference of cost around log(n)*m where n is the number of objects in a pack and m is the total number of packs in the repository. It is true that nth format (force that the referred-to object lives in the same pack as the referrer) can help speed up interpretation of extended SHA-1 expression, e.g. v1.0^0:t, which can read v1.0 tag in v4 format, find the nth info for the commit pointed by the tag and get to that data in the pack, find the nth info for the top-tree recorded in that commit and directly get to the data of that tree, and then find the entry for t, which will give the object name for that subtree again in nth format, and at that point you can find the sha-1 of that final object, without having to know any object names of the intermediate objects (i.e. you must start from sha-1 of the tag you obtain from the refs API, but you didn't use the object name of the commit and its top-level tree). So for such a codepath, I would say it would be sufficient to use a union in struct people have been envisioning and convert pack, nth to sha-1 when the latter form becomes necessary for the first time for the object. It's the point: faster access (to either pack data, or struct object *). Using a separate union from struct object_id (iow, object_id only contains SHA-1) works. But imagine to independent code islands can perform this type of fast access. In order to pass this union from one island to another, we need to update pretty much every function call between them. The change to using object_id has a similar impact. If we go with union in object_id, we may be able to avoid the second mass change to make use of pack, nth. Anyway, wouldn't this be all academic? I do not see how you would keep the object name in the pack, nth format in-core, as the obj_hash[] is a hashtable keyed by sha-1, and even when we switch to a different hash, I cannot see how such a table to ensure the singleton-ness of in-core objects can be keyed sometimes by hash and by pack, nth in some other time. I'm implementing something to see how much we gain by avoiding object lookup. The current approach is having struct object ** obj in struct packed_git, indexed by nth. So when you have pack, nth and pack-obj[nth] is valid, you'll get to struct object * without hashing. If pack-obj[nth] is NULL, hashing and looking up are required the first time using SHA-1, then the result is cached in pack-obj[]. So no, it's not academic :) -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 01/10] Define a structure for object IDs.
Many places throughout the code use unsigned char [20] to store object IDs (SHA-1 values). This leads to lots of hardcoded numbers throughout the codebase. It also leads to confusion about the purposes of a buffer. Introduce a structure for object IDs. This allows us to obtain the benefits of compile-time checking for misuse. The structure is expected to remain the same size and have the same alignment requirements on all known platforms, compared to the array of unsigned char, although this is not required for correctness. Signed-off-by: brian m. carlson sand...@crustytoothpaste.net --- cache.h | 8 1 file changed, 8 insertions(+) diff --git a/cache.h b/cache.h index 761c570..6582c35 100644 --- a/cache.h +++ b/cache.h @@ -43,6 +43,14 @@ int git_deflate_end_gently(git_zstream *); int git_deflate(git_zstream *, int flush); unsigned long git_deflate_bound(git_zstream *, unsigned long); +/* The length in bytes and in hex digits of an object name (SHA-1 value). */ +#define GIT_SHA1_RAWSZ 20 +#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ) + +struct object_id { + unsigned char hash[GIT_SHA1_RAWSZ]; +}; + #if defined(DT_UNKNOWN) !defined(NO_D_TYPE_IN_DIRENT) #define DTYPE(de) ((de)-d_type) #else -- 2.2.1.209.g41e5f3a -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
Duy Nguyen pclo...@gmail.com writes: You mean if it came in pack, offset format, convert it down to sha1 until the last second that it is needed (e.g. need to put that in a tree object in order to compute the object name of the containing tree object)? I picked my words poorly. It should be pack, the index in pack instead of the _byte_ offset. Thanks for a clarification, but I do not think it affects the main point of the discussion very much. If we use union in struct, where we can store either an SHA-1 hash or some other identifying information for the object, but not both, then at some point you would need to convert pack, nth to sha-1 in a codepath that needs the sha-1 hash value (e.g. if the object A, that is known to you as pack, nth, is placed in a tree object B, you need the object name of A in sha-1 representation in order to compute the object name of the tree object B. You can choose to keep it in pack, nth form in struct object_id { union } and look up the sha-1 from the pack index every time, or you can choose to give up the pack, nth form and upgrade the struct object_id to store sha-1 at that point. If you keep both pack, nth *and* sha-1 in struct object_id at the same time, you can choose whichever is convenient, but it would bloat everything in core. Not just it bloats struct object and its subclasses, the in-core index entries, which is what I meant by ... Unless you fix that union in struct assumption, that is. ... this. To me, pack, offset information smells to belong more to a struct object (or its subclass) as an optional annotation---when a caller is asked to parse_object(), you would bypass the sha1_read_file() that goes and looks the object name up from the list of pack .idx and instead go there straight using that annotation. For pack v4, commits and trees can be encoded this way. Even if your in-pack representation of a commit object allowed to store the tree pointer in pack, nth format, its object name must be computed as if you have the commit object in the traditional format and computed the hash of that (together with the standard type size\0 header), and at that point, you need the contained object's name in sha-1 format (imagine how you would implement the commit-tree command). Hence, I do not think the v4 encoding changes the discussion very much. I see the primary value of v4 encoding is to shorten the length of various fields take on-disk and in-pack. If it were pack, offset, it could be argued that it would also be faster to get to the packed data in the packfile, and going from pack, nth to the .idx file and then going to the location in the data in the packfile would be faster than going from sha-1 to a particular pack and its in-pack offset, with the difference of cost around log(n)*m where n is the number of objects in a pack and m is the total number of packs in the repository. It is true that nth format (force that the referred-to object lives in the same pack as the referrer) can help speed up interpretation of extended SHA-1 expression, e.g. v1.0^0:t, which can read v1.0 tag in v4 format, find the nth info for the commit pointed by the tag and get to that data in the pack, find the nth info for the top-tree recorded in that commit and directly get to the data of that tree, and then find the entry for t, which will give the object name for that subtree again in nth format, and at that point you can find the sha-1 of that final object, without having to know any object names of the intermediate objects (i.e. you must start from sha-1 of the tag you obtain from the refs API, but you didn't use the object name of the commit and its top-level tree). So for such a codepath, I would say it would be sufficient to use a union in struct people have been envisioning and convert pack, nth to sha-1 when the latter form becomes necessary for the first time for the object. Anyway, wouldn't this be all academic? I do not see how you would keep the object name in the pack, nth format in-core, as the obj_hash[] is a hashtable keyed by sha-1, and even when we switch to a different hash, I cannot see how such a table to ensure the singleton-ness of in-core objects can be keyed sometimes by hash and by pack, nth in some other time. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Wed, Mar 11, 2015 at 05:26:56PM -0700, Junio C Hamano wrote: brian m. carlson sand...@crustytoothpaste.net writes: Michael Haggerty recommended that I call the structure element sha1 instead of oid in case we want to turn this into a union if we decide to go the additional hash route. I'd advise against it. Yeah, after re-reading this, I think hash is the best choice for the underlying element name. That way, it's different from oid, which would be the instances of the struct itself, and it's sufficiently different from unconverted places in the code, which don't typically use that term. -- brian m. carlson / brian with sandals: Houston, Texas, US +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187 signature.asc Description: Digital signature
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Thu, Mar 12, 2015 at 11:28:10AM +0100, Michael Haggerty wrote: On 03/12/2015 01:26 AM, Junio C Hamano wrote: And that would break the abstraction effort if you start calling the field with a name that is specific to the underlying hash function. The caller has to change o-sha1 to o-sha256 instead of keeping that as o-oid and letting the callee handle the implementation details when calling if (!hashcmp(o1-oid, o2-oid)) ; /* they are the same */ else ; /* they are different */ [...] Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes SHA-256s (or whatever) in the oid field, which would be dimensioned large enough for either one (with, say, SHA-1s padded with zeros). I was imagining that this would evolve into a union (or maybe struct) of different hash types, like struct object_id { unsigned char hash_type; union { unsigned char sha1[GIT_SHA1_RAWSZ]; unsigned char sha256[GIT_SHA256_RAWSZ]; } hash; }; BTW in either case, any hopes of mapping object_id objects directly on top of buffer memory would disappear. What I think might be more beneficial is to make the hash function specific to a particular repository, and therefore you could maintain something like this: struct object_id { unsigned char hash[GIT_MAX_RAWSZ]; }; and make hash_type (or hash_length) a global[0]. I don't think it's very worthwhile to try to mix two different hash functions in the same repository, so we could still map directly onto buffer memory if we decide that's portable enough. I expect the cases where we need to do that will be relatively limited. Regardless, it seems that this solution has the most support (including Junio's) and it's more self-documenting than my current set of patches, so I'm going to go with it for now. It should be easy to change if the consensus goes back the other way. [0] I personally think globals are a bit gross, but they don't seem to have the problems that they would if git were a shared library. -- brian m. carlson / brian with sandals: Houston, Texas, US +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187 signature.asc Description: Digital signature
Re: [PATCH v2 01/10] Define a structure for object IDs.
On 03/12/2015 01:26 AM, Junio C Hamano wrote: brian m. carlson sand...@crustytoothpaste.net writes: Michael Haggerty recommended that I call the structure element sha1 instead of oid in case we want to turn this into a union if we decide to go the additional hash route. I'd advise against it. As I wrote in $gmane/265337 in response to Michael: 4. We continue to support working with SHA-1s declared to be (unsigned char *) in some performance-critical code, even as we migrate most other code to using SHA-1s embedded within a (struct object_id). This will cost some duplication of code. To accept this approach, we would need an idea of *how much* code duplication would be needed. E.g., how many functions will need both (unsigned char *) versions and (struct object_id *) versions? ... I do not know what kind of code duplication you are worried about, though. If a callee needs unsigned char *, the caller that has a struct object_id *o should pass o-hash to the callee. And that would break the abstraction effort if you start calling the field with a name that is specific to the underlying hash function. The caller has to change o-sha1 to o-sha256 instead of keeping that as o-oid and letting the callee handle the implementation details when calling if (!hashcmp(o1-oid, o2-oid)) ; /* they are the same */ else ; /* they are different */ [...] Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes SHA-256s (or whatever) in the oid field, which would be dimensioned large enough for either one (with, say, SHA-1s padded with zeros). I was imagining that this would evolve into a union (or maybe struct) of different hash types, like struct object_id { unsigned char hash_type; union { unsigned char sha1[GIT_SHA1_RAWSZ]; unsigned char sha256[GIT_SHA256_RAWSZ]; } hash; }; BTW in either case, any hopes of mapping object_id objects directly on top of buffer memory would disappear. Michael -- Michael Haggerty mhag...@alum.mit.edu -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Thu, Mar 12, 2015 at 5:46 PM, brian m. carlson sand...@crustytoothpaste.net wrote: On Thu, Mar 12, 2015 at 11:28:10AM +0100, Michael Haggerty wrote: On 03/12/2015 01:26 AM, Junio C Hamano wrote: And that would break the abstraction effort if you start calling the field with a name that is specific to the underlying hash function. The caller has to change o-sha1 to o-sha256 instead of keeping that as o-oid and letting the callee handle the implementation details when calling if (!hashcmp(o1-oid, o2-oid)) ; /* they are the same */ else ; /* they are different */ [...] Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes SHA-256s (or whatever) in the oid field, which would be dimensioned large enough for either one (with, say, SHA-1s padded with zeros). I was imagining that this would evolve into a union (or maybe struct) of different hash types, like struct object_id { unsigned char hash_type; union { unsigned char sha1[GIT_SHA1_RAWSZ]; unsigned char sha256[GIT_SHA256_RAWSZ]; } hash; }; BTW in either case, any hopes of mapping object_id objects directly on top of buffer memory would disappear. What I think might be more beneficial is to make the hash function specific to a particular repository, and therefore you could maintain something like this: struct object_id { unsigned char hash[GIT_MAX_RAWSZ]; }; and make hash_type (or hash_length) a global[0]. I don't think it's very worthwhile to try to mix two different hash functions in the same repository, This may or may not fall into the mix different hash functions category. In pack files version 4, trees are encoded to point to other trees or blobs by a (pack, offset) tuple. It would be great if the new object_id could support carrying this kind of object id around because it could help reduce object lookup cost a lot. (pack, offset) can be converted back to SHA-1 so no info is lost and hashcmp() can compare (pack, tuple) against an SHA-1 just fine. so we could still map directly onto buffer memory if we decide that's portable enough. I expect the cases where we need to do that will be relatively limited. Regardless, it seems that this solution has the most support (including Junio's) and it's more self-documenting than my current set of patches, so I'm going to go with it for now. It should be easy to change if the consensus goes back the other way. [0] I personally think globals are a bit gross, but they don't seem to have the problems that they would if git were a shared library. -- brian m. carlson / brian with sandals: Houston, Texas, US +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187 -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
Junio C Hamano gits...@pobox.com writes: Duy Nguyen pclo...@gmail.com writes: This may or may not fall into the mix different hash functions category. In pack files version 4, trees are encoded to point to other trees or blobs by a (pack, offset) tuple. It would be great if the new object_id could support carrying this kind of object id around because it could help reduce object lookup cost a lot. (pack, offset) can be converted back to SHA-1 so no info is lost and hashcmp() can compare (pack, tuple) against an SHA-1 just fine. You mean if it came in pack, offset format, convert it down to sha1 until the last second that it is needed (e.g. need to put that in a tree object in order to compute the object name of the containing tree object)? Sorry, obviously I meant do not convert until the last second. To me, pack, offset information smells to belong more to a struct object (or its subclass) as an optional annotation---when a caller is asked to parse_object(), you would bypass the sha1_read_file() that goes and looks the object name up from the list of pack .idx and instead go there straight using that annotation. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
Duy Nguyen pclo...@gmail.com writes: This may or may not fall into the mix different hash functions category. In pack files version 4, trees are encoded to point to other trees or blobs by a (pack, offset) tuple. It would be great if the new object_id could support carrying this kind of object id around because it could help reduce object lookup cost a lot. (pack, offset) can be converted back to SHA-1 so no info is lost and hashcmp() can compare (pack, tuple) against an SHA-1 just fine. You mean if it came in pack, offset format, convert it down to sha1 until the last second that it is needed (e.g. need to put that in a tree object in order to compute the object name of the containing tree object)? After converting an object name originally represented as pack, offset, if we are doing the union in struct thing, to sha1 representation, you would have to look it up from .idx in order to read the contents the usual way. If that happens often enough, then it may not be worth adding complexity to the code to carry the pack, offset pair around. Unless you fix that union in struct assumption, that is. To me, pack, offset information smells to belong more to a struct object (or its subclass) as an optional annotation---when a caller is asked to parse_object(), you would bypass the sha1_read_file() that goes and looks the object name up from the list of pack .idx and instead go there straight using that annotation. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Fri, Mar 13, 2015 at 1:24 AM, Junio C Hamano gits...@pobox.com wrote: Duy Nguyen pclo...@gmail.com writes: This may or may not fall into the mix different hash functions category. In pack files version 4, trees are encoded to point to other trees or blobs by a (pack, offset) tuple. It would be great if the new object_id could support carrying this kind of object id around because it could help reduce object lookup cost a lot. (pack, offset) can be converted back to SHA-1 so no info is lost and hashcmp() can compare (pack, tuple) against an SHA-1 just fine. You mean if it came in pack, offset format, convert it down to sha1 until the last second that it is needed (e.g. need to put that in a tree object in order to compute the object name of the containing tree object)? I picked my words poorly. It should be pack, the index in pack instead of the _byte_ offset. This index ranges from 0 to the number of objects minus one. This is essentially what pack v4 stores in places where pack v2 would store SHA-1, which makes me make the connection: both are different ways of identifying an object. After converting an object name originally represented as pack, offset, if we are doing the union in struct thing, to sha1 representation, you would have to look it up from .idx in order to read the contents the usual way. If that happens often enough, then it may not be worth adding complexity to the code to carry the pack, offset pair around. I'd keep it in pack, index for as long as possible (or until higher layer decides that it's not worth keeping in this representation anymore). I can only see two use cases where actual SHA-1 is involved: sorting by SHA-1 (rarely done, probably only in index-pack and pack-objects) and output to the screen (or producing v2 packs). Unless you fix that union in struct assumption, that is. To me, pack, offset information smells to belong more to a struct object (or its subclass) as an optional annotation---when a caller is asked to parse_object(), you would bypass the sha1_read_file() that goes and looks the object name up from the list of pack .idx and instead go there straight using that annotation. For pack v4, commits and trees can be encoded this way. parse_object() could help the commit case (maybe, maybe not, I haven't looked at commit walkers), not recursive tree walking where we now pass SHA-1 around to user callback and all. Carrying optional annotation around would impact the code in many places. Also, storing this info in struct object seems conflict with v4 goal of reducing struct object lookup cost. Maybe I'm missing something here. Then again, we don't know if pack v4 will get merged in the end (I do hope it will). And we have an option of making specialized commit/tree walkers that are aware of pack v4 and only use them in hot places to reduce impact to the rest of the code base. If hash[GIT_MAX_RAWSZ] looks like a good enough solution, we can go with that and worry about pack v4 later when/if it comes. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/10] Define a structure for object IDs.
On Tue, Mar 10, 2015 at 07:38:42PM -0700, Kyle J. McKay wrote: GIT_SHA1_HEXSZ will always be exactly 2 * GIT_SHA1_RAWSZ, right? In fact, if it's not things will almost certainly break, yes? Does it make more sense then to reflect this requirement by using: #define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ) instead? Yes. I'll make that change in the next version. I don't see anything wrong with this. However, in part 02/10 the utility functions all use oid in their names, so I'm thinking that it may make more sense to just go with: struct object_id { unsigned char oid[GIT_SHA1_RAWSZ]; }; to match? Michael Haggerty recommended that I call the structure element sha1 instead of oid in case we want to turn this into a union if we decide to go the additional hash route. I think it can also improve readability if we use oid only for the instances of the struct itself, especially since it makes it more obvious what code has been converted already. -- brian m. carlson / brian with sandals: Houston, Texas, US +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187 signature.asc Description: Digital signature
Re: [PATCH v2 01/10] Define a structure for object IDs.
brian m. carlson sand...@crustytoothpaste.net writes: Michael Haggerty recommended that I call the structure element sha1 instead of oid in case we want to turn this into a union if we decide to go the additional hash route. I'd advise against it. As I wrote in $gmane/265337 in response to Michael: 4. We continue to support working with SHA-1s declared to be (unsigned char *) in some performance-critical code, even as we migrate most other code to using SHA-1s embedded within a (struct object_id). This will cost some duplication of code. To accept this approach, we would need an idea of *how much* code duplication would be needed. E.g., how many functions will need both (unsigned char *) versions and (struct object_id *) versions? ... I do not know what kind of code duplication you are worried about, though. If a callee needs unsigned char *, the caller that has a struct object_id *o should pass o-hash to the callee. And that would break the abstraction effort if you start calling the field with a name that is specific to the underlying hash function. The caller has to change o-sha1 to o-sha256 instead of keeping that as o-oid and letting the callee handle the implementation details when calling if (!hashcmp(o1-oid, o2-oid)) ; /* they are the same */ else ; /* they are different */ The only folks that need to _know_ what hash function is used or how long the field is are the ones that have raw bytes of the hash obtained from files (e.g. from the index) and they would do something like this to implement a function that checks the file records an object name that is expected by the caller: void check_oid(int fd, struct object_id *expected) { unsigned char object_name[GIT_HASH_RAWSZ]; ... read(fd, object_name, GIT_HASH_RAWSZ); if (hashcmp(object_name, expected-oid)) die(object name mismatch???); } or when they do know they are using a specific function, do: void compute_object_name(struct object_id *id, unsignd char*data, size_t len) { git_SHA_CTX c; unsigned char *sha1 = (id-oid); /* if we are paranoid... */ assert(sizeof(id-oid) = 20); ... git_SHA1_Init(c); git_SHA1_Update(c, data, len); ... git_SHA1_Final(sha1, c); } Even the latter would not be helped if the field to store the hash were named id-sha1 very much, I would think. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 01/10] Define a structure for object IDs.
Many places throughout the code use unsigned char [20] to store object IDs (SHA-1 values). This leads to lots of hardcoded numbers throughout the codebase. It also leads to confusion about the purposes of a buffer. Introduce a structure for object IDs. This allows us to obtain the benefits of compile-time checking for misuse. The structure is expected to remain the same size and have the same alignment requirements on all known platforms, compared to the array of unsigned char, although this is not required for correctness. Signed-off-by: brian m. carlson sand...@crustytoothpaste.net --- object.h | 8 1 file changed, 8 insertions(+) diff --git a/object.h b/object.h index 6416247..f99b502 100644 --- a/object.h +++ b/object.h @@ -1,6 +1,14 @@ #ifndef OBJECT_H #define OBJECT_H +/* The length in bytes and in hex digits of an object name (SHA-1 value). */ +#define GIT_SHA1_RAWSZ 20 +#define GIT_SHA1_HEXSZ 40 + +struct object_id { + unsigned char sha1[GIT_SHA1_RAWSZ]; +}; + struct object_list { struct object *item; struct object_list *next; -- 2.2.1.209.g41e5f3a -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html