Re: [RFC][GSoC] Calling for comments regarding rough draft of proposal

2014-03-19 Thread Junio C Hamano
tanay abhra  writes:

> 2.Other things I should add to the proposal that I have left off?I am
> getting confused what extra details I should add to the proposal. I
> will add
> the informal parts(my background, schedule for summer etc) of the
> proposal later.

I would not label the schedule and success criteria "informal";
without them how would one judge if the proposal has merits?

Other things like your background and previous achievements would
become relevant, after it is decided that the proposed project has
merits, to see if you are a good fit to work on that project, so I
agree with your message that it is sensible to defer them before the
other parts of the proposal is ironed out.

> #Proposed Improvements
>
> * Fix "git config --unset" to clean up detritus from sections that are
> left empty.
>
> * Read the configuration from files once and cache the results in an
> appropriate data structure in memory.
>
> * Change `git_config()` to iterate through the pre-read values in
> memory rather than re-reading the configuration
>   files.
>
> * Add new API calls that allow the cache to be inquired easily and
> efficiently.  Rewrite other functions like
>  `git_config_int()` to be cache-aware.

I think we already had a discussion to point out git_config_int() is
not a good example for this bullet point (check the list archive).
The approach seciton seems to use a more sensible example (point 2).

> * Rewrite callers to use the new API wherever possible.
>
> * How to invalidate the cache correctly in the case that the
> configuration is changed while `git` is executing.

I wouldn't list this as an item of "list of improvements".

It is merely a point you have to be careful about because you are
doing other "improvements" based on "read all into memory first and
do not re-read files" approach, no?  In the current code, when
somebody does git_config_set() and then later uses git_config() to
grab the value of the variable set with the first call, we will read
the value written to the file with the first call.  With the
proposed change, if you parse from the file upfront, callers to
git_config_set() will need to somehow invalidate that stale copy in
memory, either updating only the changed part (harder) or just
discarding the cache (easy).

> ##Changing the git_config api to retrieve values from memory
>
> Approach:-
>
> We parse the config file once, storing the raw values to records in
> memory. After the whole config has been read, iterate through the records,
> feeding the surviving values into the callback in the order they were
> originally read
> (minus deletions).
>
> Path to follow for the api conversion,
>
> 1. Convert the parser to read into an in-memory representation, but
>leave git_config() as a wrapper which iterates over it.
>
> 2. Add query functions like config_string_get() which will inquire
> cache for values efficiently.
>
> 3. Convert callbacks to query functions one by one.
>
> I propose two approaches for the format of the internal cache,
>
> 1.Using a hashmap to map keys to their values.This would bring as an
>  advantage, constant time lookups for the values.The implementation
>  will be similar to "dict" data structure in python,
>
>  for example, section.subsection --mapped-to--> multi_value_string

I have no idea what you wanted to illustrate with that "example" at
all.

>  This approach loses the relative order of different config keys.

As long as it keeps the order of multi-value elements, it should
not be a problem.

> 2.Another approach would be to actually represent the syntax tree of the
>   config file in memory. That would make lookups of individual keys more
>   expensive, but would enable other manipulation. E.g., if the syntax
>   tree included nodes for comments and other non-semantic constructs, then
>   we can use it for a complete rewrite.

"for a complete rewrite" of what?

>  And "git config" becomes:
>
>   1. Read the tree.
>
>   2. Perform operations on the tree (add nodes, delete nodes, etc).
>
>   3. Write out the tree.
>
> and things like "remove the section header when the last item in the
> section is removed" become trivial during step 2.

Are you saying you will try both approaches during the summer?

You should be able to look-up quickly *and* to preserve order at the
same time within one approach, by either annotating the tree with a
hash, or the other way around to annotate the hash with each node
remembering where in the original file it came from (which you will
need to keep in order to report errors anyway).

> --
> ##Tidy configuration files
>
> When a configuration file is repeatedly modified, often garbage is
> left behind.  For example, after
>
> git config pull.rebase true
> git config --unset pull.rebase
> git config pull.rebase true
> git config --unset pull.rebase
>
> the bottom of the configuration file is left with the useless lines
>
> [pull]
>  

[RFC][GSoC] Calling for comments regarding rough draft of proposal

2014-03-19 Thread tanay abhra
Hi,
  I have already done the microproject, which has been merged into
main last week. I have prepared a rough draft of my proposal for
review, read all the previous mailing list threads about it.I am
reading the codebase little by little.

Please suggest improvements on the following topics,

1.I have read one-third of config.c and will complete reading it by
tomorrow.Is there any other  piece of code relevant to this proposal?

2.Other things I should add to the proposal that I have left off?I am
getting confused what extra details I should add to the proposal. I
will add
the informal parts(my background, schedule for summer etc) of the
proposal later.

3.Did I understand anything wrong or if my approach to solving
problems is incorrect,if yes, I will redraft my proposal according to
your suggestions.

--
#GSoC Proposal : Git configuration API improvements
---

#Proposed Improvements

* Fix "git config --unset" to clean up detritus from sections that are
left empty.

* Read the configuration from files once and cache the results in an
appropriate data structure in memory.

* Change `git_config()` to iterate through the pre-read values in
memory rather than re-reading the configuration
  files.

* Add new API calls that allow the cache to be inquired easily and
efficiently.  Rewrite other functions like
 `git_config_int()` to be cache-aware.

* Rewrite callers to use the new API wherever possible.

* How to invalidate the cache correctly in the case that the
configuration is changed while `git` is executing.

#Future Improvements

*Allow configuration values to be unset via a config file

--
##Changing the git_config api to retrieve values from memory

Approach:-

We parse the config file once, storing the raw values to records in
memory. After the whole config has been read, iterate through the records,
feeding the surviving values into the callback in the order they were
originally read
(minus deletions).

Path to follow for the api conversion,

1. Convert the parser to read into an in-memory representation, but
   leave git_config() as a wrapper which iterates over it.

2. Add query functions like config_string_get() which will inquire
cache for values efficiently.

3. Convert callbacks to query functions one by one.

I propose two approaches for the format of the internal cache,

1.Using a hashmap to map keys to their values.This would bring as an
 advantage, constant time lookups for the values.The implementation
 will be similar to "dict" data structure in python,

 for example, section.subsection --mapped-to--> multi_value_string

 This approach loses the relative order of different config keys.

2.Another approach would be to actually represent the syntax tree of the
  config file in memory. That would make lookups of individual keys more
  expensive, but would enable other manipulation. E.g., if the syntax
  tree included nodes for comments and other non-semantic constructs, then
  we can use it for a complete rewrite.

 And "git config" becomes:

  1. Read the tree.

  2. Perform operations on the tree (add nodes, delete nodes, etc).

  3. Write out the tree.

and things like "remove the section header when the last item in the
section is removed" become trivial during step 2.


I still prefer the hashmap way of implementing the cache,as empty
section headers  are not so problematic(no processing pitfalls) and
are sometimes annotated with comments  which become redundant and
confusing if the section header is removed.As for the aesthetic
problem
I propose a different solution for it below.

--
##Tidy configuration files

When a configuration file is repeatedly modified, often garbage is
left behind.  For example, after

git config pull.rebase true
git config --unset pull.rebase
git config pull.rebase true
git config --unset pull.rebase

the bottom of the configuration file is left with the useless lines

[pull]
[pull]

Also,setting a config value, appends the key-value pair at the end of
file without checking for empty main keys
even if the main key(like [my]) is already present and empty.It works
fine if the main key with an already present
sub-key.

for example:-
git config pull.rebase true
git config --unset pull.rebase
git config pull.rebase true
git config pull.option true
gives
[pull]
[pull]
rebase = true
option = true

Also, a possible detriment is presence of comments,
For Example:-
[my]
# This section is for my own private settings

Expected output:

  1. When we delete the last key in a section, we should be
 able to delete the section header.

  2. When we add a key into a section, we should be able to
 reus