Re: Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-12 Thread Heiko Voigt
On Tue, Mar 11, 2014 at 05:58:08PM -0400, Jeff King wrote:
 On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:
 
  I have also moved all functions into the new submodule-config-cache
  module. I am not completely satisfied with the naming since it is quite
  long. If someone comes up with some different naming I am open for it.
  Maybe simply submodule-config (submodule_config prefix for functions)?
 
 Since the cache is totally internal to the submodule-config code, I do
 not know that you even need to have a nice submodule-config-cache API.
 Can it just be a singleton?
 
 That is bad design in a sense (it becomes harder later if you ever want
 to pull submodule config from two sources simultaneously), but it
 matches many other subsystems in git which cache behind the scenes.
 
 I also suspect you could call submodule_config simply submodule, and
 let it be a stand-in for the submodule (for now, only data from the
 config, but potentially you could keep other data on it, too).
 
 So with all that, the entry point into your code is just:
 
   const struct submodule *submodule_from_path(const char *path);
 
 and the caching magically happens behind-the-scenes.

Actually since we also need to define the revision from which we request
these submodule values that would become:

   const struct submodule *submodule_from_path(const unsigned char *commit_sha1,
const char *path);

Since the configuration for submodules for a submodule are identified by
a unique commit sha1 and its unique path (or unique name) I think it is
feasible to make it a singleton. That would also simplify using it from
the existing config parsing code.

To be future proof I can internally keep the object oriented approach
always passing on the submodule_config_cache pointer. That way it would
be easy to expose to the outside in case we later need multiple caches.

So I currently I do not see any downside of making it a singleton to the
outside and would go with that.

  +/* one submodule_config_cache entry */
  +struct submodule_config {
  +   struct strbuf path;
  +   struct strbuf name;
  +   unsigned char gitmodule_sha1[20];
  +};
 
 Do these strings need changed after they are written once? If not, you
 may want to just make these bare pointers (you can still use strbufs to
 create them, and then strbuf_detach at the end). That may just be a matter of
 taste, though.

No they do not need to be changed after parsing, since every path,
name mapping should be unique in one .gitmodule file. And I think it
actually would make the code more clear in one instance where I directly
set the buf pointer which Jonathan mentioned. There it is needed only
for the hashmap lookup.

Cheers Heiko
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-12 Thread Heiko Voigt
Hi,

On Tue, Mar 11, 2014 at 07:28:52PM -0700, Jonathan Nieder wrote:
 Heiko Voigt wrote:
 
  This submodule configuration cache allows us to lazily read .gitmodules
  configurations by commit into a runtime cache which can then be used to
  easily lookup values from it. Currently only the values for path or name
  are stored but it can be extended for any value needed.
 
 Makes sense.
 
 [...]
  Next I am planning to extend this so configuration cache so it will
  return the local configuration (with the usual .gitmodules,
  /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
  null_sha1 for gitmodule or commit sha1. That way we can give this
  configuration cache some use and implement its usage for the current
  configuration variables in submodule.c.
 
 Yay!
 
 I wonder whether local configuration should also override information
 from commits at times.

Yeah but for that I plan a different code path that solves conflicts
and/or extend missing values. I think its best to keep the submodule
configurations by commit as simple as possible. But we will see once I
get to the point where we need this.

 [...]
  --- /dev/null
  +++ b/Documentation/technical/api-submodule-config-cache.txt
  @@ -0,0 +1,39 @@
  +submodule config cache API
  +==
 
 Most API documents in Documentation/technical have a section explaining
 general usage --- e.g. (from api-revision-walking),
 
   Calling sequence
   
 
   The walking API has a given calling sequence: first you need to
   initialize a rev_info structure, then add revisions to control what kind
   of revision list do you want to get, finally you can iterate over the
   revision list.
 
 Or (from api-string-list):
 
   The caller:
 
   . Allocates and clears a `struct string_list` variable.
 
   . Initializes the members. You might want to set the flag 
 `strdup_strings`
 if the strings should be strdup()ed. For example, this is necessary
 [...]
   . Can remove items not matching a criterion from a sorted or unsorted
 list using `filter_string_list`, or remove empty strings using
 `string_list_remove_empty_items`.
 
   . Finally it should free the list using `string_list_clear`.
 
 This makes it easier to get started by looking at the documentation alone
 without having to mimic an example.

True, will extend that in the next iteration.

 Another thought: it's tempting to call this the 'submodule config API' and
 treat the (optional?) memoization as an implementation detail instead of
 reminding the caller of it too much.  But I haven't thought that through
 completely.

Thats the same point Jeff mentioned and I think that will simplify many
things. As I mentioned in the answer to Jeff I will keep passing around
the cache pointer internally but expose only functions without it to the
outside to be future proof but also easy to use for the current use
case.

 [...]
  +`struct submodule_config *get_submodule_config_from_path(
  +   struct submodule_config_cache *cache,
  +   const unsigned char *commit_sha1,
  +   const char *path)::
  +
  +   Lookup a submodules configuration by its commit_sha1 and path.
 
 The cache just always grows until it's freed, right?  Would it make
 sense to allow cached configurations to be explicitly evicted when the
 caller is done with them some day?  (Just curious, not a complaint.
 Might be worth mentioning in the overview how the cache is expected to
 be used, though.)

Yes it only grows at the moment. Not sure whether we need to remove
configurations. Currently the only use case that comes to my mind would
be if it grows to big to be kept in memory, but at the moment that seems
far away for me, so I would leave that for a future extension. It will
be hard to know whether someone is done with a certain .gitmodule file
since we cache it by its sha1 expecting it to be the same over many
revisions.

 [...]
  --- /dev/null
  +++ b/submodule-config-cache.h
  @@ -0,0 +1,26 @@
  +#ifndef SUBMODULE_CONFIG_CACHE_H
  +#define SUBMODULE_CONFIG_CACHE_H
  +
  +#include hashmap.h
  +#include strbuf.h
  +
  +struct submodule_config_cache {
  +   struct hashmap for_path;
  +   struct hashmap for_name;
  +};
 
 Could use comments (e.g., saying what keys each maps to what values).
 Would callers ever look at the hashmaps directly or are they only
 supposed to be accessed using helper functions that take a whole
 struct?

Users should only use the helper functions, but I will hide this in the
submodule-config module. Will add some comment as well.

 [...]
  +
  +/* one submodule_config_cache entry */
  +struct submodule_config {
  +   struct strbuf path;
  +   struct strbuf name;
  +   unsigned char gitmodule_sha1[20];
 
 Could also use comments.  What does the gitmodule_sha1 field represent?

Thats the sha1 of the parsed .gitmodule file blob.

 [...]
  --- /dev/null
  +++ 

Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-11 Thread Jeff King
On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:

 I have also moved all functions into the new submodule-config-cache
 module. I am not completely satisfied with the naming since it is quite
 long. If someone comes up with some different naming I am open for it.
 Maybe simply submodule-config (submodule_config prefix for functions)?

Since the cache is totally internal to the submodule-config code, I do
not know that you even need to have a nice submodule-config-cache API.
Can it just be a singleton?

That is bad design in a sense (it becomes harder later if you ever want
to pull submodule config from two sources simultaneously), but it
matches many other subsystems in git which cache behind the scenes.

I also suspect you could call submodule_config simply submodule, and
let it be a stand-in for the submodule (for now, only data from the
config, but potentially you could keep other data on it, too).

So with all that, the entry point into your code is just:

  const struct submodule *submodule_from_path(const char *path);

and the caching magically happens behind-the-scenes.

But take all of this with a giant grain of salt, as I am not too
familiar with the needs of the callers.

 +/* one submodule_config_cache entry */
 +struct submodule_config {
 + struct strbuf path;
 + struct strbuf name;
 + unsigned char gitmodule_sha1[20];
 +};

Do these strings need changed after they are written once? If not, you
may want to just make these bare pointers (you can still use strbufs to
create them, and then strbuf_detach at the end). That may just be a matter of
taste, though.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-11 Thread Jonathan Nieder
Hi,

Some quick thoughts.

Heiko Voigt wrote:

 This submodule configuration cache allows us to lazily read .gitmodules
 configurations by commit into a runtime cache which can then be used to
 easily lookup values from it. Currently only the values for path or name
 are stored but it can be extended for any value needed.

Makes sense.

[...]
 Next I am planning to extend this so configuration cache so it will
 return the local configuration (with the usual .gitmodules,
 /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
 null_sha1 for gitmodule or commit sha1. That way we can give this
 configuration cache some use and implement its usage for the current
 configuration variables in submodule.c.

Yay!

I wonder whether local configuration should also override information
from commits at times.

[...]
 --- /dev/null
 +++ b/Documentation/technical/api-submodule-config-cache.txt
 @@ -0,0 +1,39 @@
 +submodule config cache API
 +==

Most API documents in Documentation/technical have a section explaining
general usage --- e.g. (from api-revision-walking),

Calling sequence


The walking API has a given calling sequence: first you need to
initialize a rev_info structure, then add revisions to control what kind
of revision list do you want to get, finally you can iterate over the
revision list.

Or (from api-string-list):

The caller:

. Allocates and clears a `struct string_list` variable.

. Initializes the members. You might want to set the flag 
`strdup_strings`
  if the strings should be strdup()ed. For example, this is necessary
[...]
. Can remove items not matching a criterion from a sorted or unsorted
  list using `filter_string_list`, or remove empty strings using
  `string_list_remove_empty_items`.

. Finally it should free the list using `string_list_clear`.

This makes it easier to get started by looking at the documentation alone
without having to mimic an example.

Another thought: it's tempting to call this the 'submodule config API' and
treat the (optional?) memoization as an implementation detail instead of
reminding the caller of it too much.  But I haven't thought that through
completely.

[...]
 +`struct submodule_config *get_submodule_config_from_path(
 + struct submodule_config_cache *cache,
 + const unsigned char *commit_sha1,
 + const char *path)::
 +
 + Lookup a submodules configuration by its commit_sha1 and path.

The cache just always grows until it's freed, right?  Would it make
sense to allow cached configurations to be explicitly evicted when the
caller is done with them some day?  (Just curious, not a complaint.
Might be worth mentioning in the overview how the cache is expected to
be used, though.)

[...]
 --- /dev/null
 +++ b/submodule-config-cache.h
 @@ -0,0 +1,26 @@
 +#ifndef SUBMODULE_CONFIG_CACHE_H
 +#define SUBMODULE_CONFIG_CACHE_H
 +
 +#include hashmap.h
 +#include strbuf.h
 +
 +struct submodule_config_cache {
 + struct hashmap for_path;
 + struct hashmap for_name;
 +};

Could use comments (e.g., saying what keys each maps to what values).
Would callers ever look at the hashmaps directly or are they only
supposed to be accessed using helper functions that take a whole
struct?

[...]
 +
 +/* one submodule_config_cache entry */
 +struct submodule_config {
 + struct strbuf path;
 + struct strbuf name;
 + unsigned char gitmodule_sha1[20];

Could also use comments.  What does the gitmodule_sha1 field represent?

[...]
 --- /dev/null
 +++ b/submodule-config-cache.c
 @@ -0,0 +1,256 @@
 +#include cache.h
 +#include submodule-config-cache.h
 +#include strbuf.h
 +
 +struct submodule_config_entry {
 + struct hashmap_entry ent;
 + struct submodule_config *config;
 +};
 +
 +static int submodule_config_path_cmp(const struct submodule_config_entry *a,
 +  const struct submodule_config_entry *b,
 +  const void *unused)
 +{
 + int ret;
 + if ((ret = strcmp(a-config-path.buf, b-config-path.buf)))
 + return ret;

Style: avoid assignments in conditionals:

ret = strcmp(...);
if (ret)
return ret;

return hashcmp(...);

The actual return value from strcmp isn't important since
hashmap_cmp_fn is only used to check for equality.  Perhaps as a
separate change it would make sense to make it a hashmap_equality_fn
some day:

return !strcmp(...)  !hashcmp(...);

This checks both the path and gitmodule_sha1, so I guess it's for
looking up submodule names.

[...]
 +static int submodule_config_name_cmp(const struct submodule_config_entry *a,
 +  const struct submodule_config_entry *b,
 +  const void *unused)

Likewise.

[...]
 +void