Re: [RFC] [GSoC] Project proposal: convert scripts to builtins

2018-03-24 Thread Christian Couder
Hi,

On Wed, Mar 21, 2018 at 7:16 AM, Pratik Karki  wrote:
>
> Thanks for the feedback. Thanks to you, I realized my proposal was
> a bit ambitious. Both git-stash and git-rebase are big
> commitment. After much analyzing, I found out I cannot complete
> both in the given time frame. So, I decided to stick to one and
> complete it.

Great.

[...]

> There has been some development in `git-stash` as seen on
> []
> (https://public-inbox.org/git/20171110231314.30711-1-j...@teichroeb.net/).
> To maximize the productivity, the findings from the patch submitted can
> be used. Since, there are already much discussions regarding the
> rewrite.

In general it would be nice if you summarized what has already been
done, how you can reuse it and what is needed to complete it.

I see that you talk about some of that below, but a more general
overview might be nice too.

It could be interesting also to put the author(s) of the work that you
will reuse in Cc.

[...]

> Timeline and Development Cycle
> --
>
> -   Apr 23: Accepted student proposals announced.
>
> -   Apr 23 onwards: Researching of all the test suites. Discussion of
> possible test improvements in for `git-stash`.
>
> Firstly, the test suite coverage of every command will be reviewed
> using gcov and kcov.

I don't think it is necessary to spend a lot of time on the test suite coverage.

> The test suite might not be perfect or
> comprehensive but must cover all the major code paths and
> command-line switches of the script. For the tests which seem
> inadequate, minimum required tests are written and developed
> incrementally. The minimum tests must provide safety net for
> migration of scripts to built-ins. The tests would be sent as a
> separate patch for parallel development and review process so that
> development of built-ins can happen at the same time productively.

Nice.

> The tests will be written for every code changes and will be worked
> throughout the summer.
>
> -   May 1: Rewriting skeleton begins.
>
> The shell scripts are translated on a line-by-line basis into C
> code. The C code will be written in a way to maximize the use of git
> internal API. In git-stash `parse-options` API can be used for
> implementing parsing argument of command-line. This would be way
> better than parsing via the scripts. Firstly, I will start
> implementing `stash --helper`from respective scripts to C code. Then
> increment it further more. Then I'll start converting git-stash.sh
> on a line-by-line basis.

Not sure what you mean by line-by-line basis.

>  Again for git-stash some work seem to be done
> 
> []
> (https://public-inbox.org/git/20171110231314.30711-1-j...@teichroeb.net/).
> Now, to maximize the output I'll be taking findings from the
> previous patch and use it for my patch. As seen from the comments in
> the patch some tests for checking branch when `git stash branch`
> fails needs to be written.

Nice. Maybe writing those tests can come earlier in you schedule.

> New tests will be written and code
> coverage tools will be used for the written code.

Not sure that code coverage tools need to be used.

> -   May 13: Making minimal `builtin/stash.c` with `stash--helper` ready
> for review process. (This goes on for some time.)
>
> The initial review of minimal builtin would be ready for git-stash.
> The result C code at this stage may not be necessarily be efficient
> but would be free from obvious bugs and can serve as a baseline for
> the final patch. This is sent for review process which can take some
> time. The code will ofcourse be tested using the test suite with
> some additional tests.

How does that relates with the existing work? Will this be one or
several patch series? What will each patch do?

[...]

> -   June 10 - Jul 20: Start optimizing `builtin/stash.c`. Benchmarking
> and profiling is done. They are exclusively compared to their
> original shell scripts, to check whether they are more performant or
> not and the results are published in the mailing list for further
> discussion.

Will the performance tests be added to the t/perf tests?

> The C code will be optimized for speed and efficiency in this stage. The
> built-ins will now be profiled using the new efficient test suites to
> find hot spots. Bench-marking is also done in comparison to original
> scripts.The performance for stash can be measured by making it stash
> large number of changes in another working directory and measuring the
> time for completion of the task. After finding out, a graphical
> representation of performance findings will be published to git mailing
> list and discussions will commence on more 

Re: [RFC] [GSoC] Project proposal: convert scripts to builtins

2018-03-21 Thread Pratik Karki
Hi Johannes,

Thanks for the feedback. Thanks to you, I realized my proposal was
a bit ambitious. Both git-stash and git-rebase are big
commitment. After much analyzing, I found out I cannot complete
both in the given time frame. So, I decided to stick to one and
complete it. I decided to stick with git-stash. Thank you for directing
me to the un-merged matches. Now, I can find the points where the
patch couldn't be effective and work towards completing those
effective things.

Please provide feedback for this updated proposal.

Cheers,
Pratik Karki


Convert Scripts to builtins
===

Abstract


Many components of Git are still in the form of shell and Perl scripts.
This has certain advantages of being extensible but causes problems in
production code on multiple platforms like Windows.\
I propose to rewrite a couple of shell and perl scripts into portable
and performant C code, making them built-ins. The major advantage of
doing this is improvement in efficiency and performance.

Much more scripts like `git-am` , `git-pull`, `git-branch` have already
been rewritten in C. Much more scripts like `git-rebase`, `git-stash`,
`git-add --interactive` are still present in shell and perl scripts. I
propose to work in `git-stash`.

### Shell Scripts:

Although shell scripts are more faster can be extensible in
functionality and can be more easier to write, they introduce certain
disadvantages.

1.  Dependencies:\
 The scripting languages and shell scripts make more productive code
but there is an overhead of dependencies. The shell scripts are
lighter and simpler and call other executables to perform
non-trivial tasks. Taking `git-stash` shell script for example.
`sed`, `rm`, `echo`, `test` are constantly present in `git-stash`.
These look common to POSIX platforms but for non-POSIX platforms
there needs some extra work for porting these commands. For example,
in Git for Windows, the workaround for these commands in non-POSIX
platform adds some extra utilities and adds MSYS2 shell commands and
needs to pack language runtime for Perl. This increases the
installation size and requires more disk space. Again, adding more
batteries again needs implementation in all of the dependency
libraries and executables.

2.  Inefficiency:\
 Git has internal caches for configuration values, the repository
index and repository objects. The porcelain commands do not have
access to git's internal API and so they spawn git processes to
perform git operations. For every git invocation, git would re-read
the user's configuration files, repository index, repopulate the
filesystem cache, etc. This leads to overhead and unnecessary I/O.
Windows is known to have worse I/O performance compared to Linux.
There is also slower I/O performance of HDD compared to SSD. This
unnecessary I/O operations causes runtime overhead and becomes
slower in poor I/O performance setups. Now, writing the porcelain
into C built-ins leverages the git API and there is no need of
spawning separate git processes, caching can be used to reduce
unnecessary I/O processes.

3.  Spawing processes is less performant:\
 Shell scripts usually spawn a lot of processes. Shell scripts are
very lighter and hence have limited functionalites. For
`git-stash.sh` to work it needs to perform lots of git operations
like `git rev-parse` `git config` and thus spawns git executable
processes for performing these operations. Again for invoking
`git config` and providing configuration values, it spawn new
processes to handle that. Spawning is implemented by `fork()` and
`exec()` by shells. Now, on systems that do not support
copy-on-write semantics for `fork()`, there is duplication of the
memory of the parent process for every `fork()` call which turns out
to be an expensive process. Now, in Windows, Git uses MSYS2
exclusively to emulate `fork()` but since, Windows doesnot support
forking semantics natively, the workaround provided by MSYS2
emulates `fork()` without [copy-on-write
semantics](https://www.cygwin.com/faq.html#faq.api.fork). Doing this
creates another layer over Windows processes and thus slows git.

Rewriting C built-ins
-

These above mentioned problems need to be fixed. The only fix for these
problems would be to write built-ins in C for all these shell scripts
leveraging the git API. Writing in built-in reduces the dependency
required by shell scripts. Since, Git is native executable in Windows,
doing this can make MSYS2 POSIX emulation obsolete. Then, using git's
internal API and C data types, built-in `git_config_get_value()` can be
used to get configuration value rather than spawning another git-config
process. This removes the necessary to re-read git configuration cache
everytime and reduces I/O. Furthermore, git-stash will be more faster
and show 

Re: [RFC] [GSoC] Project proposal: convert scripts to builtins

2018-03-20 Thread Johannes Schindelin
Hi Pratik,

thank you so much for posting this inline, to make it easier to review.

I will quote only on specific parts below; Please just assume that I like
the other parts and have nothing to add.

On Tue, 20 Mar 2018, Pratik Karki wrote:

> 
> Timeline and Development Cycle
> --
> 
> -   Apr 23: Accepted student proposals announced.
> -   Apr 23 onwards: Researching of all the test suites. Discussion of
> possible test improvements in for `git-stash` and `git-rebase`.
> -   May 1: Rewriting skeleton begins.

I would have liked more detail here. Like, maybe even a rudimentary
initial version identifying, say, a part of `git stash` and/or `git
rebase` that could be put into a builtin (stash--helper and
rebase--helper, respectively).

It is my experience from several GSoCs working on this huge overarching
project to convert the scripts (which are good prototypes, but lack in
stringency in addition to performance) to C that even the individual
scripts are too much to stem for a single GSoC.

> -   May 13: Making `builtin/stash.c` ready for review process. (This
> goes on for some time.)

There have been two past efforts to turn stash into a builtin:

https://github.com/git-for-windows/git/pull/508

and

https://public-inbox.org/git/20171110231314.30711-1-j...@teichroeb.net/

It would be good to read up on those and incorporate the learnings into
the proposal.

> -   May 26: Making `builtin/rebase.c` ready for review process. (This
> goes on for some time.)

The `git-rebase.sh` script is itself not terribly interesting, as it hands
off to `git-rebase--am.sh`, `git-rebase--interactive.sh` and
`git-rebase--merge.sh`, respectively.

Converting `git-rebase` into a builtin without first converting all of
those scripts would make little sense.

It would probably be better to choose one of those latter scripts and move
their functionality into a builtin, in an incremental fashion.

By doing it incrementally, you can also avoid...

> -   June 10: Make second versions with more improvements and more
> batteries ready for next review cycle.

... leaving two weeks between checkpoints. Also, doing it incrementally
lets you avoid sitting on your hands while waiting for the first patches
to be reviewed.

> -   June 20: Writing new tests and using more code-coverage tools to
> squash bugs present.

Typically it helps a lot to have those tests *during* the conversion.
That's how I found most of the bugs when converting difftool, for example.

> -   June 25 - Jul 20: Start optimizing `builtin/stash.c` and
> `builtin/rebase.c`. Benchmarking and profiling is done. They are
> exclusively compared to their original shell scripts, to check
> whether they are more performant or not and the results are
> published in the mailing list for further discussion.

Could you add details how you would perform benchmarking and profiling?

> -   Jul 20 - Aug 5: More optimizing and polishing of `builtin/stash.c`
> and `builtin/rebase.c` and running of new tests series written and
> send them for code review.
> -   Aug 14: Submit final patches.
> -   Aug 22: Results announced.
> -   Apr 24 - Aug 14: Documentation is written. "What I'm working on" is
> written and posted in my blog regarding GSoC with Git.

The timeline is a bit ambitious. I would like to caution you that these
are all big tasks, and maybe you want to cut down on the deliverables, and
add more detail what exactly you want to deliver (such as: what part of
stash/rebase do you find under-tested in our test suite and would
therefore want to augment, what parts of stash/rebase do you think you
would handle first, and how?).

Ciao,
Johannes


Re: [RFC] [GSoC] Project proposal: convert scripts to builtins

2018-03-20 Thread Pratik Karki
Hi,
This is my draft for my proposal on "Convert Scripts to builtins" for GSoC.
Please review and provide feedback.


Cheers,
Pratik Karki


Convert Scripts to builtins
===

Abstract


Many components of Git are still in the form of shell and Perl scripts.
This has certain advantages of being extensible but causes problems in
production code on multiple platforms like Windows.
I propose to rewrite a couple of shell and perl scripts into portable
and performant C code, making them built-ins. The major advantage of
doing this is improvement in efficiency and performance.

Much more scripts like `git-am` , `git-pull`, `git-branch` have already
been rewritten in C. Much more scripts like `git-rebase`, `git-stash`,
`git-add --interactive` are still present in shell and perl scripts.
I propose to work in `git-rebase` and `git-stash`.

Shell Scripts:
--

Although shell scripts are more faster can be extensible in
functionality and can be more easier to write, they introduce certain
disadvantages.

1.  Dependencies:
 The scripting languages and shell scripts make more productive code
but there is an overhead of dependencies. The shell scripts are
lighter and simpler and call other executables to perform
non-trivial tasks. Taking `git-stash` shell script for example.
`sed`, `rm`, `echo`, `test` are constantly present in `git-stash`.
These look common to POSIX platforms but for non-POSIX platforms
there needs some extra work for porting these commands. For example,
in Git for Windows, the workaround for these commands in non-POSIX
platform adds some extra utilities and adds MSYS2 shell commands and
needs to pack language runtime for Perl. This increases the
installation size and requires more disk space. Again, adding more
batteries again needs implementation in all of the dependency
libraries and executables.

2.  Inefficiency:
 Git has internal caches for configuration values, the repository
index and repository objects. The porcelain commands do not have
access to git's internal API and so they spawn git processes to
perform git operations. For every git invocation, git would re-read
the user's configuration files, repository index, repopulate the
filesystem cache, etc. This leads to overhead and unnecessary I/O.
Windows is known to have worse I/O performance compared to Linux.
There is also slower I/O performance of HDD compared to SSD. This
unnecessary I/O operations causes runtime overhead and becomes
slower in poor I/O performance setups. Now, writing the porcelain
into C built-ins leverages the git API and there is no need of
spawning separate git processes, caching can be used to reduce
unnecessary I/O processes.

3.  Spawing processes is less performant:
 Shell scripts usually spawn a lot of processes. Shell scripts are
very lighter and hence have limited functionalites. For
`git-stash.sh` to work it needs to perform lots of git operations
like `git rev-parse` `git config` and thus spawns git executable
processes for performing these operations. Again for invoking
`git config` and providing configuration values, it spawn new
processes to handle that. Spawning is implemented by `fork()` and
`exec()` by shells. Now, on systems that do not support
copy-on-write semantics for `fork()`, there is duplication of the
memory of the parent process for every `fork()` call which turns out
to be an expensive process. Now, in Windows, Git uses MSYS2
exclusively to emulate `fork()` but since, Windows doesnot support
forking semantics natively, the workaround provided by MSYS2
emulates `fork()` without [copy-on-write
semantics](https://www.cygwin.com/faq.html#faq.api.fork). Doing this
creates another layer over Windows processes and thus slows git.

Rewriting C built-ins
-

These above mentioned problems need to be fixed. The only fix for these
problems would be to write built-ins in C for all these shell scripts
leveraging the git API. Writing in built-in reduces the dependency
required by shell scripts. Since, Git is native executable in Windows,
doing this can make MSYS2 POSIX emulation obsolete. Then, using git's
internal API and C data types, built-in `git_config_get_value()` can be
used to get configuration value rather than spawning another git-config
process. This removes the necessary to re-read git configuration cache
everytime and reduces I/O. Furthermore, git-stash and git-rebase will be
more faster and show consistent behaviour as instead of spawing another
process and parsing command-line arguments manually, they can be
hardcoded to be built-in and leverage all the required git's internal
API's like `parse-options`.

To implement git-stash and git-rebase in C, I propose to avoid spawning
lots of external git processes and reduce redundant I/O by taking
advantage of the internal 

Re: [RFC] [GSoC] Project proposal: convert scripts to builtins

2018-03-20 Thread Christian Couder
Hi,

On Tue, Mar 20, 2018 at 10:00 AM, Pratik Karki  wrote:
> Hi,
> This is my draft for my proposal on "Convert Scripts to builtin" for GSoC.
> Please review and provide feedbacks.
>
> https://gist.github.com/prertik/daaa73a39d3ce30811d9a208043dc235

It would be easier for us to comment if the markdown was sent inline.

Thanks,
Christian.