Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-14 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 Notice that the first time pack-obj[] is filled using
 lookup_object(). So yes, the hash table has all the pointers that
 pack-obj[] has.

Are we talking about the same thing?

By the hash table, I mean **obj_hash that is a hashtable that uses
sha-1 form of identifier as the key.  So the hash table has all
the pointers sounds like all the objects you instantiate from a v4
packfile needs their object name known in the sha-1 form anyway
when it gets instantiated (or more importantly, when you realize
there is another copy of it already in the **obj_hash hashtable, you
may have to free or refrain from creating it in the first place).

As long as we use **obj_hash hashtable as all the objects this
process cares about in the world and struct object_id.hash as the
key into it, I do not think pack,nth representation belongs to
that layer.

I do think pack,nth representation has its use as a short-cut
mechanism to reach an indirectly referenced object directly without
instantiating intermediate objects when dealing with an extended
SHA-1 expression (cf. v1.0^0:t example in a few messages
upthread).  I just think it does not belong to struct object_id
that are used to refer to in-core objects.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-14 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 Anyway, wouldn't this be all academic?  I do not see how you would
 keep the object name in the pack, nth format in-core, as the
 obj_hash[] is a hashtable keyed by sha-1, and even when we switch
 to a different hash, I cannot see how such a table to ensure the
 singleton-ness of in-core objects can be keyed sometimes by hash
 and by pack, nth in some other time.

 I'm implementing something to see how much we gain by avoiding object
 lookup. The current approach is having struct object ** obj in
 struct packed_git, indexed by nth. So when you have pack, nth
 and pack-obj[nth] is valid, you'll get to struct object * without
 hashing.

But do you realize that the hashtable serves two purposes?  Grab the
object from its name is one thing, and the other one I am not seeing
how you will make it work with sometimes sha-1 sometimes pack,nth
is to ensure that we will have only one in-core copy for the same object.
We even walk the hashtable when we want to drop the flag bits from
all in-core objects, so even if you instanciated an in-core object
without going through the object name layer, the hashtable needs to
have a pointer to such a pointer, no?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-14 Thread Duy Nguyen
On Sun, Mar 15, 2015 at 5:19 AM, Junio C Hamano gits...@pobox.com wrote:
 Duy Nguyen pclo...@gmail.com writes:

 Anyway, wouldn't this be all academic?  I do not see how you would
 keep the object name in the pack, nth format in-core, as the
 obj_hash[] is a hashtable keyed by sha-1, and even when we switch
 to a different hash, I cannot see how such a table to ensure the
 singleton-ness of in-core objects can be keyed sometimes by hash
 and by pack, nth in some other time.

 I'm implementing something to see how much we gain by avoiding object
 lookup. The current approach is having struct object ** obj in
 struct packed_git, indexed by nth. So when you have pack, nth
 and pack-obj[nth] is valid, you'll get to struct object * without
 hashing.

 But do you realize that the hashtable serves two purposes?  Grab the
 object from its name is one thing, and the other one I am not seeing
 how you will make it work with sometimes sha-1 sometimes pack,nth
 is to ensure that we will have only one in-core copy for the same object.
 We even walk the hashtable when we want to drop the flag bits from
 all in-core objects, so even if you instanciated an in-core object
 without going through the object name layer, the hashtable needs to
 have a pointer to such a pointer, no?

Notice that the first time pack-obj[] is filled using
lookup_object(). So yes, the hash table has all the pointers that
pack-obj[] has. If somebody wants to remove an object out of hash
table, then all pack-obj[] are invalidated, but I don't think we ever
delete objects from hash table.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-14 Thread Duy Nguyen
On Fri, Mar 13, 2015 at 1:03 PM, Junio C Hamano gits...@pobox.com wrote:
 To me, pack, offset information smells to belong more to a struct
 object (or its subclass) as an optional annotation---when a caller
 is asked to parse_object(), you would bypass the sha1_read_file()
 that goes and looks the object name up from the list of pack .idx
 and instead go there straight using that annotation.

 For pack v4, commits and trees can be encoded this way.

 Even if your in-pack representation of a commit object allowed to
 store the tree pointer in pack, nth format, its object name must
 be computed as if you have the commit object in the traditional
 format and computed the hash of that (together with the standard
 type size\0 header), and at that point, you need the contained
 object's name in sha-1 format (imagine how you would implement the
 commit-tree command).

True. But commit-tree is not often executed as, say, rev-list, where
the hash is used as 20-byte key to some pieces somewhere (pack data,
struct object *). If we keep pack, nth in object_id (the union
approach), v4-aware code has the chance to go on a fast path.

 Hence, I do not think the v4 encoding changes
 the discussion very much.  I see the primary value of v4 encoding is
 to shorten the length of various fields take on-disk and in-pack.
 If it were pack, offset, it could be argued that it would also be
 faster to get to the packed data in the packfile, and going from
 pack, nth to the .idx file and then going to the location in the
 data in the packfile would be faster than going from sha-1 to a
 particular pack and its in-pack offset, with the difference of cost
 around log(n)*m where n is the number of objects in a pack and m is
 the total number of packs in the repository.

 It is true that nth format (force that the referred-to object
 lives in the same pack as the referrer) can help speed up
 interpretation of extended SHA-1 expression, e.g. v1.0^0:t, which
 can read v1.0 tag in v4 format, find the nth info for the commit
 pointed by the tag and get to that data in the pack, find the nth
 info for the top-tree recorded in that commit and directly get to
 the data of that tree, and then find the entry for t, which will
 give the object name for that subtree again in nth format, and at
 that point you can find the sha-1 of that final object, without
 having to know any object names of the intermediate objects
 (i.e. you must start from sha-1 of the tag you obtain from the
 refs API, but you didn't use the object name of the commit and its
 top-level tree).  So for such a codepath, I would say it would be
 sufficient to use a union in struct people have been envisioning
 and convert pack, nth to sha-1 when the latter form becomes
 necessary for the first time for the object.

It's the point: faster access (to either pack data, or struct object
*). Using a separate union from  struct object_id (iow, object_id
only contains SHA-1) works. But imagine to independent code islands
can perform this type of fast access. In order to pass this union from
one island to another, we need to update pretty much every function
call between them. The change to using object_id has a similar impact.
If we go with union in object_id, we may be able to avoid the second
mass change to make use of pack, nth.

 Anyway, wouldn't this be all academic?  I do not see how you would
 keep the object name in the pack, nth format in-core, as the
 obj_hash[] is a hashtable keyed by sha-1, and even when we switch
 to a different hash, I cannot see how such a table to ensure the
 singleton-ness of in-core objects can be keyed sometimes by hash
 and by pack, nth in some other time.

I'm implementing something to see how much we gain by avoiding object
lookup. The current approach is having struct object ** obj in
struct packed_git, indexed by nth. So when you have pack, nth
and pack-obj[nth] is valid, you'll get to struct object * without
hashing. If pack-obj[nth] is NULL, hashing and looking up are
required the first time using SHA-1, then the result is cached in
pack-obj[]. So no, it's not academic :)
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/10] Define a structure for object IDs.

2015-03-13 Thread brian m. carlson
Many places throughout the code use unsigned char [20] to store object IDs
(SHA-1 values).  This leads to lots of hardcoded numbers throughout the
codebase.  It also leads to confusion about the purposes of a buffer.

Introduce a structure for object IDs.  This allows us to obtain the benefits
of compile-time checking for misuse.  The structure is expected to remain
the same size and have the same alignment requirements on all known
platforms, compared to the array of unsigned char, although this is not
required for correctness.

Signed-off-by: brian m. carlson sand...@crustytoothpaste.net
---
 cache.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/cache.h b/cache.h
index 761c570..6582c35 100644
--- a/cache.h
+++ b/cache.h
@@ -43,6 +43,14 @@ int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
 unsigned long git_deflate_bound(git_zstream *, unsigned long);
 
+/* The length in bytes and in hex digits of an object name (SHA-1 value). */
+#define GIT_SHA1_RAWSZ 20
+#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)
+
+struct object_id {
+   unsigned char hash[GIT_SHA1_RAWSZ];
+};
+
 #if defined(DT_UNKNOWN)  !defined(NO_D_TYPE_IN_DIRENT)
 #define DTYPE(de)  ((de)-d_type)
 #else
-- 
2.2.1.209.g41e5f3a

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-13 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 You mean if it came in pack, offset format, convert it down to
 sha1 until the last second that it is needed (e.g. need to put
 that in a tree object in order to compute the object name of the
 containing tree object)?

 I picked my words poorly. It should be pack, the index in pack
 instead of the _byte_ offset.

Thanks for a clarification, but I do not think it affects the main
point of the discussion very much.  If we use union in struct,
where we can store either an SHA-1 hash or some other identifying
information for the object, but not both, then at some point you
would need to convert pack, nth to sha-1 in a codepath that
needs the sha-1 hash value (e.g. if the object A, that is known to
you as pack, nth, is placed in a tree object B, you need the
object name of A in sha-1 representation in order to compute the
object name of the tree object B.  You can choose to keep it in
pack, nth form in struct object_id { union } and look up the
sha-1 from the pack index every time, or you can choose to give
up the pack, nth form and upgrade the struct object_id to store
sha-1 at that point.

If you keep both pack, nth *and* sha-1 in struct object_id at
the same time, you can choose whichever is convenient, but it would
bloat everything in core.  Not just it bloats struct object and
its subclasses, the in-core index entries, which is what I meant
by ...

 Unless you fix that union in struct assumption, that is.

... this.

 To me, pack, offset information smells to belong more to a struct
 object (or its subclass) as an optional annotation---when a caller
 is asked to parse_object(), you would bypass the sha1_read_file()
 that goes and looks the object name up from the list of pack .idx
 and instead go there straight using that annotation.

 For pack v4, commits and trees can be encoded this way.

Even if your in-pack representation of a commit object allowed to
store the tree pointer in pack, nth format, its object name must
be computed as if you have the commit object in the traditional
format and computed the hash of that (together with the standard
type size\0 header), and at that point, you need the contained
object's name in sha-1 format (imagine how you would implement the
commit-tree command).  Hence, I do not think the v4 encoding changes
the discussion very much.  I see the primary value of v4 encoding is
to shorten the length of various fields take on-disk and in-pack.
If it were pack, offset, it could be argued that it would also be
faster to get to the packed data in the packfile, and going from
pack, nth to the .idx file and then going to the location in the
data in the packfile would be faster than going from sha-1 to a
particular pack and its in-pack offset, with the difference of cost
around log(n)*m where n is the number of objects in a pack and m is
the total number of packs in the repository.

It is true that nth format (force that the referred-to object
lives in the same pack as the referrer) can help speed up
interpretation of extended SHA-1 expression, e.g. v1.0^0:t, which
can read v1.0 tag in v4 format, find the nth info for the commit
pointed by the tag and get to that data in the pack, find the nth
info for the top-tree recorded in that commit and directly get to
the data of that tree, and then find the entry for t, which will
give the object name for that subtree again in nth format, and at
that point you can find the sha-1 of that final object, without
having to know any object names of the intermediate objects
(i.e. you must start from sha-1 of the tag you obtain from the
refs API, but you didn't use the object name of the commit and its
top-level tree).  So for such a codepath, I would say it would be
sufficient to use a union in struct people have been envisioning
and convert pack, nth to sha-1 when the latter form becomes
necessary for the first time for the object.

Anyway, wouldn't this be all academic?  I do not see how you would
keep the object name in the pack, nth format in-core, as the
obj_hash[] is a hashtable keyed by sha-1, and even when we switch
to a different hash, I cannot see how such a table to ensure the
singleton-ness of in-core objects can be keyed sometimes by hash
and by pack, nth in some other time.




--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread brian m. carlson

On Wed, Mar 11, 2015 at 05:26:56PM -0700, Junio C Hamano wrote:

brian m. carlson sand...@crustytoothpaste.net writes:


Michael Haggerty recommended that I call the structure element sha1
instead of oid in case we want to turn this into a union if we decide to
go the additional hash route.


I'd advise against it.


Yeah, after re-reading this, I think hash is the best choice for the
underlying element name.  That way, it's different from oid, which
would be the instances of the struct itself, and it's sufficiently
different from unconverted places in the code, which don't typically use
that term.
--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


signature.asc
Description: Digital signature


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread brian m. carlson

On Thu, Mar 12, 2015 at 11:28:10AM +0100, Michael Haggerty wrote:

On 03/12/2015 01:26 AM, Junio C Hamano wrote:

And that would break the abstraction effort if you start calling the
field with a name that is specific to the underlying hash function.
The caller has to change o-sha1 to o-sha256 instead of keeping
that as o-oid and letting the callee handle the implementation
details when calling

if (!hashcmp(o1-oid, o2-oid))
; /* they are the same */
else
; /* they are different */
[...]


Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes
SHA-256s (or whatever) in the oid field, which would be dimensioned
large enough for either one (with, say, SHA-1s padded with zeros).

I was imagining that this would evolve into a union (or maybe struct) of
different hash types, like

   struct object_id {
   unsigned char hash_type;
   union {
   unsigned char sha1[GIT_SHA1_RAWSZ];
   unsigned char sha256[GIT_SHA256_RAWSZ];
   } hash;
   };

BTW in either case, any hopes of mapping object_id objects directly on
top of buffer memory would disappear.


What I think might be more beneficial is to make the hash function
specific to a particular repository, and therefore you could maintain
something like this:

 struct object_id {
 unsigned char hash[GIT_MAX_RAWSZ];
 };

and make hash_type (or hash_length) a global[0].  I don't think it's
very worthwhile to try to mix two different hash functions in the same
repository, so we could still map directly onto buffer memory if we
decide that's portable enough.  I expect the cases where we need to do
that will be relatively limited.

Regardless, it seems that this solution has the most support (including
Junio's) and it's more self-documenting than my current set of patches,
so I'm going to go with it for now.  It should be easy to change if the
consensus goes back the other way.

[0] I personally think globals are a bit gross, but they don't seem to
have the problems that they would if git were a shared library.
--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


signature.asc
Description: Digital signature


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread Michael Haggerty
On 03/12/2015 01:26 AM, Junio C Hamano wrote:
 brian m. carlson sand...@crustytoothpaste.net writes:
 
 Michael Haggerty recommended that I call the structure element sha1
 instead of oid in case we want to turn this into a union if we decide to
 go the additional hash route.
 
 I'd advise against it.
 
 As I wrote in $gmane/265337 in response to Michael:
 
  4. We continue to support working with SHA-1s declared to be (unsigned
  char *) in some performance-critical code, even as we migrate most other
  code to using SHA-1s embedded within a (struct object_id). This will
  cost some duplication of code. To accept this approach, we would need an
  idea of *how much* code duplication would be needed. E.g., how many
  functions will need both (unsigned char *) versions and (struct
  object_id *) versions?
 
 ...
 
 I do not know what kind of code duplication you are worried about,
 though.  If a callee needs unsigned char *, the caller that has a
 struct object_id *o should pass o-hash to the callee.
 
 And that would break the abstraction effort if you start calling the
 field with a name that is specific to the underlying hash function.
 The caller has to change o-sha1 to o-sha256 instead of keeping
 that as o-oid and letting the callee handle the implementation
 details when calling
 
 if (!hashcmp(o1-oid, o2-oid))
 ; /* they are the same */
 else
 ; /* they are different */
 [...]

Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes
SHA-256s (or whatever) in the oid field, which would be dimensioned
large enough for either one (with, say, SHA-1s padded with zeros).

I was imagining that this would evolve into a union (or maybe struct) of
different hash types, like

struct object_id {
unsigned char hash_type;
union {
unsigned char sha1[GIT_SHA1_RAWSZ];
unsigned char sha256[GIT_SHA256_RAWSZ];
} hash;
};

BTW in either case, any hopes of mapping object_id objects directly on
top of buffer memory would disappear.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread Duy Nguyen
On Thu, Mar 12, 2015 at 5:46 PM, brian m. carlson
sand...@crustytoothpaste.net wrote:
 On Thu, Mar 12, 2015 at 11:28:10AM +0100, Michael Haggerty wrote:

 On 03/12/2015 01:26 AM, Junio C Hamano wrote:

 And that would break the abstraction effort if you start calling the
 field with a name that is specific to the underlying hash function.
 The caller has to change o-sha1 to o-sha256 instead of keeping
 that as o-oid and letting the callee handle the implementation
 details when calling

 if (!hashcmp(o1-oid, o2-oid))
 ; /* they are the same */
 else
 ; /* they are different */
 [...]


 Hmm, I guess you imagine that we might sometimes pack SHA-1s, sometimes
 SHA-256s (or whatever) in the oid field, which would be dimensioned
 large enough for either one (with, say, SHA-1s padded with zeros).

 I was imagining that this would evolve into a union (or maybe struct) of
 different hash types, like

struct object_id {
unsigned char hash_type;
union {
unsigned char sha1[GIT_SHA1_RAWSZ];
unsigned char sha256[GIT_SHA256_RAWSZ];
} hash;
};

 BTW in either case, any hopes of mapping object_id objects directly on
 top of buffer memory would disappear.


 What I think might be more beneficial is to make the hash function
 specific to a particular repository, and therefore you could maintain
 something like this:

  struct object_id {
  unsigned char hash[GIT_MAX_RAWSZ];
  };

 and make hash_type (or hash_length) a global[0].  I don't think it's
 very worthwhile to try to mix two different hash functions in the same
 repository,

This may or may not fall into the mix different hash functions
category. In pack files version 4, trees are encoded to point to other
trees or blobs by a (pack, offset) tuple. It would be great if the new
object_id could support carrying this kind of object id around because
it could help reduce object lookup cost a lot. (pack, offset) can be
converted back to SHA-1 so no info is lost and hashcmp() can compare
(pack, tuple) against an SHA-1 just fine.

 so we could still map directly onto buffer memory if we
 decide that's portable enough.  I expect the cases where we need to do
 that will be relatively limited.

 Regardless, it seems that this solution has the most support (including
 Junio's) and it's more self-documenting than my current set of patches,
 so I'm going to go with it for now.  It should be easy to change if the
 consensus goes back the other way.

 [0] I personally think globals are a bit gross, but they don't seem to
 have the problems that they would if git were a shared library.

 --
 brian m. carlson / brian with sandals: Houston, Texas, US
 +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
 OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187



-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Duy Nguyen pclo...@gmail.com writes:

 This may or may not fall into the mix different hash functions
 category. In pack files version 4, trees are encoded to point to other
 trees or blobs by a (pack, offset) tuple. It would be great if the new
 object_id could support carrying this kind of object id around because
 it could help reduce object lookup cost a lot. (pack, offset) can be
 converted back to SHA-1 so no info is lost and hashcmp() can compare
 (pack, tuple) against an SHA-1 just fine.

 You mean if it came in pack, offset format, convert it down to
 sha1 until the last second that it is needed (e.g. need to put
 that in a tree object in order to compute the object name of the
 containing tree object)?

Sorry, obviously I meant do not convert until the last second.

 To me, pack, offset information smells to belong more to a struct
 object (or its subclass) as an optional annotation---when a caller
 is asked to parse_object(), you would bypass the sha1_read_file()
 that goes and looks the object name up from the list of pack .idx
 and instead go there straight using that annotation.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 This may or may not fall into the mix different hash functions
 category. In pack files version 4, trees are encoded to point to other
 trees or blobs by a (pack, offset) tuple. It would be great if the new
 object_id could support carrying this kind of object id around because
 it could help reduce object lookup cost a lot. (pack, offset) can be
 converted back to SHA-1 so no info is lost and hashcmp() can compare
 (pack, tuple) against an SHA-1 just fine.

You mean if it came in pack, offset format, convert it down to
sha1 until the last second that it is needed (e.g. need to put
that in a tree object in order to compute the object name of the
containing tree object)?

After converting an object name originally represented as pack,
offset, if we are doing the union in struct thing, to sha1
representation, you would have to look it up from .idx in order to
read the contents the usual way.  If that happens often enough, then
it may not be worth adding complexity to the code to carry the
pack, offset pair around.

Unless you fix that union in struct assumption, that is.

To me, pack, offset information smells to belong more to a struct
object (or its subclass) as an optional annotation---when a caller
is asked to parse_object(), you would bypass the sha1_read_file()
that goes and looks the object name up from the list of pack .idx
and instead go there straight using that annotation.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-12 Thread Duy Nguyen
On Fri, Mar 13, 2015 at 1:24 AM, Junio C Hamano gits...@pobox.com wrote:
 Duy Nguyen pclo...@gmail.com writes:

 This may or may not fall into the mix different hash functions
 category. In pack files version 4, trees are encoded to point to other
 trees or blobs by a (pack, offset) tuple. It would be great if the new
 object_id could support carrying this kind of object id around because
 it could help reduce object lookup cost a lot. (pack, offset) can be
 converted back to SHA-1 so no info is lost and hashcmp() can compare
 (pack, tuple) against an SHA-1 just fine.

 You mean if it came in pack, offset format, convert it down to
 sha1 until the last second that it is needed (e.g. need to put
 that in a tree object in order to compute the object name of the
 containing tree object)?

I picked my words poorly. It should be pack, the index in pack
instead of the _byte_ offset. This index ranges from 0 to the number
of objects minus one. This is essentially what pack v4 stores in
places where pack v2 would store SHA-1, which makes me make the
connection: both are different ways of identifying an object.

 After converting an object name originally represented as pack,
 offset, if we are doing the union in struct thing, to sha1
 representation, you would have to look it up from .idx in order to
 read the contents the usual way.  If that happens often enough, then
 it may not be worth adding complexity to the code to carry the
 pack, offset pair around.

I'd keep it in pack, index for as long as possible (or until higher
layer decides that it's not worth keeping in this representation
anymore). I can only see two use cases where actual SHA-1 is involved:
sorting by SHA-1 (rarely done, probably only in index-pack and
pack-objects) and output to the screen (or producing v2 packs).

 Unless you fix that union in struct assumption, that is.

 To me, pack, offset information smells to belong more to a struct
 object (or its subclass) as an optional annotation---when a caller
 is asked to parse_object(), you would bypass the sha1_read_file()
 that goes and looks the object name up from the list of pack .idx
 and instead go there straight using that annotation.

For pack v4, commits and trees can be encoded this way. parse_object()
could help the commit case (maybe, maybe not, I haven't looked at
commit walkers), not recursive tree walking where we now pass SHA-1
around to user callback and all. Carrying optional annotation around
would impact the code in many places.  Also, storing this info in
struct object seems conflict with v4 goal of reducing struct object
lookup cost.  Maybe I'm missing something here.

Then again, we don't know if pack v4 will get merged in the end (I do
hope it will). And we have an option of making specialized commit/tree
walkers that are aware of pack v4 and only use them in hot places to
reduce impact to the rest of the code base. If hash[GIT_MAX_RAWSZ]
looks like a good enough solution, we can go with that and worry about
pack v4 later when/if it comes.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-11 Thread brian m. carlson

On Tue, Mar 10, 2015 at 07:38:42PM -0700, Kyle J. McKay wrote:

GIT_SHA1_HEXSZ will always be exactly 2 * GIT_SHA1_RAWSZ, right?  In
fact, if it's not things will almost certainly break, yes?

Does it make more sense then to reflect this requirement by using:

 #define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)

instead?


Yes.  I'll make that change in the next version.


I don't see anything wrong with this.  However, in part 02/10 the
utility functions all use oid in their names, so I'm thinking that
it may make more sense to just go with:

struct object_id {
unsigned char oid[GIT_SHA1_RAWSZ];
};

to match?


Michael Haggerty recommended that I call the structure element sha1
instead of oid in case we want to turn this into a union if we decide to
go the additional hash route.

I think it can also improve readability if we use oid only for the
instances of the struct itself, especially since it makes it more
obvious what code has been converted already.
--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


signature.asc
Description: Digital signature


Re: [PATCH v2 01/10] Define a structure for object IDs.

2015-03-11 Thread Junio C Hamano
brian m. carlson sand...@crustytoothpaste.net writes:

 Michael Haggerty recommended that I call the structure element sha1
 instead of oid in case we want to turn this into a union if we decide to
 go the additional hash route.

I'd advise against it.

As I wrote in $gmane/265337 in response to Michael:

 4. We continue to support working with SHA-1s declared to be (unsigned
 char *) in some performance-critical code, even as we migrate most other
 code to using SHA-1s embedded within a (struct object_id). This will
 cost some duplication of code. To accept this approach, we would need an
 idea of *how much* code duplication would be needed. E.g., how many
 functions will need both (unsigned char *) versions and (struct
 object_id *) versions?

...

I do not know what kind of code duplication you are worried about,
though.  If a callee needs unsigned char *, the caller that has a
struct object_id *o should pass o-hash to the callee.

And that would break the abstraction effort if you start calling the
field with a name that is specific to the underlying hash function.
The caller has to change o-sha1 to o-sha256 instead of keeping
that as o-oid and letting the callee handle the implementation
details when calling

if (!hashcmp(o1-oid, o2-oid))
; /* they are the same */
else
; /* they are different */

The only folks that need to _know_ what hash function is used or
how long the field is are the ones that have raw bytes of the hash
obtained from files (e.g. from the index) and they would do
something like this to implement a function that checks the file
records an object name that is expected by the caller:

void check_oid(int fd, struct object_id *expected)
{
unsigned char object_name[GIT_HASH_RAWSZ];

...
read(fd, object_name, GIT_HASH_RAWSZ);
if (hashcmp(object_name, expected-oid))
die(object name mismatch???);
}

or when they do know they are using a specific function, do:

void compute_object_name(struct object_id *id,
unsignd char*data,
size_t len)
{
git_SHA_CTX c;
unsigned char *sha1 = (id-oid);

/* if we are paranoid... */
assert(sizeof(id-oid) = 20);

...
git_SHA1_Init(c);
git_SHA1_Update(c, data, len);
...
git_SHA1_Final(sha1, c);
}

Even the latter would not be helped if the field to store the hash
were named id-sha1 very much, I would think.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/10] Define a structure for object IDs.

2015-03-07 Thread brian m. carlson
Many places throughout the code use unsigned char [20] to store object IDs
(SHA-1 values).  This leads to lots of hardcoded numbers throughout the
codebase.  It also leads to confusion about the purposes of a buffer.

Introduce a structure for object IDs.  This allows us to obtain the benefits
of compile-time checking for misuse.  The structure is expected to remain
the same size and have the same alignment requirements on all known
platforms, compared to the array of unsigned char, although this is not
required for correctness.

Signed-off-by: brian m. carlson sand...@crustytoothpaste.net
---
 object.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/object.h b/object.h
index 6416247..f99b502 100644
--- a/object.h
+++ b/object.h
@@ -1,6 +1,14 @@
 #ifndef OBJECT_H
 #define OBJECT_H
 
+/* The length in bytes and in hex digits of an object name (SHA-1 value). */
+#define GIT_SHA1_RAWSZ 20
+#define GIT_SHA1_HEXSZ 40
+
+struct object_id {
+   unsigned char sha1[GIT_SHA1_RAWSZ];
+};
+
 struct object_list {
struct object *item;
struct object_list *next;
-- 
2.2.1.209.g41e5f3a

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html