Here is my "inefficient" attempt at exposing the Git protocol over a
Fossil repository. It is written in Ruby.
http://fossil.webstream.io/fossil_ruby/artifact/e10666fa65ebce2e

How it works:

* It uses the Dumb Protocol
(https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols#The-Dumb-Protocol)
which basically exposes the .git directory over HTTP.
* In the Dumb Protocol, the client starts by requesting the "info/refs"
file, which calls the "fetch_info_refs" method.
* The "fetch_info_refs" method then iterates over all open Fossil
branches and generates a response containing the Git SHA1 hash
associated with each branch's Fossil commit.
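
A minimal sketch of what such an "info/refs" response could look like.
This is not the code from the script: the helper name build_info_refs,
the input hash and the SHA1 values are made up for illustration. It
assumes each open Fossil branch is exposed as refs/heads/<branch> and
uses the one-line-per-ref format (SHA1, tab, ref name) that the Dumb
Protocol expects.

```ruby
# Hypothetical helper: branch_heads maps a Fossil branch name to the
# Git SHA1 already computed for that branch's tip commit.
def build_info_refs(branch_heads)
  branch_heads
    .sort_by { |name, _sha| name }                        # stable, sorted output
    .map { |name, sha| "#{sha}\trefs/heads/#{name}\n" }   # "<sha1>\t<ref>\n"
    .join
end

# Example input with placeholder SHA1 values.
body = build_info_refs({
  "trunk"        => "ca82a6dff817ec66f44342007202690a93763949",
  "experimental" => "085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7"
})
puts body
```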

How the Git SHA1 hashes are calculated:

* Git has a very simple database layout, and this script takes
advantage of that.
* To generate the Git SHA1 hash for a Fossil manifest (which happens
in the get_git_uuid_for_commit(fossil_manifest) method), this script
generates a Git commit object
(https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#Commit-Objects)
and then calculates its SHA1 hash.
* The Git commit object is created in the method
fetch_git_commit(fossil_manifest), which in turn uses the Git SHA1
hashes of the manifest's parents and tree. To fetch the Git SHA1 of
each parent, it simply recurses.
* Creating the Git tree object is rather easy (it just calls
fetch_git_tree(manifest, path) recursively, where "path" is "/" for the
first call). You can find the "files_level_hash_for" method here -
http://fossil.webstream.io/fossil_ruby/artifact/03189394837f1db6 and
more information on Git tree objects here -
https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#Tree-Objects
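
To make the hashing concrete, here is a rough sketch of how Git object
IDs can be computed by hand. The method names (git_object_id,
tree_content, commit_content) and their parameters are hypothetical and
are not the ones used in the script. It assumes the author and
committer are the same Fossil user and that all timestamps are in UTC.

```ruby
require "digest/sha1"

# Git hashes "<type> <byte length>\0<content>" with SHA1.
def git_object_id(type, content)
  body = content.b
  Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0".b + body)
end

# entries: array of [mode, name, hex_sha], e.g. ["100644", "README", "..."].
# Each entry is stored as "<mode> <name>\0" followed by the 20-byte binary SHA1.
# Git sorts directories ("40000") as if their name ended with "/".
def tree_content(entries)
  sorted = entries.sort_by { |mode, name, _sha| mode == "40000" ? "#{name}/" : name }
  sorted.map { |mode, name, sha| "#{mode} #{name}\0".b + [sha].pack("H*") }.join
end

# Builds the text of a commit object from already-known tree and parent IDs.
def commit_content(tree:, parents:, author:, time:, message:)
  lines = ["tree #{tree}"]
  parents.each { |parent| lines << "parent #{parent}" }
  lines << "author #{author} #{time} +0000"
  lines << "committer #{author} #{time} +0000"
  lines.join("\n") + "\n\n" + message
end

# Example: one empty file in the root directory, no parents.
blob_id = git_object_id("blob", "")
tree_id = git_object_id("tree", tree_content([["100644", "empty.txt", blob_id]]))
commit  = commit_content(tree: tree_id, parents: [],
                         author: "Example User <user@example.org>",
                         time: 1450557420, message: "initial empty check-in\n")
puts git_object_id("commit", commit)
```

The commit ID depends on the tree ID and on every parent ID, which is
why the parents have to be resolved first.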

Calculating the Git SHA1 hash for a Fossil manifest requires the
script to already know the Git SHA1 hashes of its parents, and
recursively calculating the parents' IDs on every request would be
very expensive. So, I use a table "git_objects" to store the Git SHA1
IDs of objects once they are calculated.
When the path "info/refs" is fetched for the first time, the script
generates the Git SHA1 IDs for all manifests using recursion, and any
subsequent requests then read directly from the cache (the
"git_objects" table).
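
The caching idea can be sketched roughly as follows. This is a
simplified stand-in, not the script itself: an in-memory Hash plays the
role of the "git_objects" table, the class name CommitIdCache and the
manifest layout are hypothetical, and the commit content is reduced to
parent lines plus a comment (no tree or author).

```ruby
require "digest/sha1"

class CommitIdCache
  attr_reader :computed

  # manifests: hypothetical Hash of uuid => { parents: [uuid, ...], comment: String }
  def initialize(manifests)
    @manifests   = manifests
    @git_objects = {}   # stand-in for the "git_objects" table
    @computed    = 0    # counts how many IDs were actually calculated
  end

  # Returns the cached Git ID if present, otherwise recurses into the
  # parents, calculates the ID once, and stores it.
  def git_uuid_for_commit(uuid)
    return @git_objects[uuid] if @git_objects.key?(uuid)

    manifest     = @manifests.fetch(uuid)
    parent_lines = manifest[:parents].map { |p| "parent #{git_uuid_for_commit(p)}\n" }.join
    content      = parent_lines + "\n" + manifest[:comment]
    @computed   += 1
    @git_objects[uuid] = Digest::SHA1.hexdigest("commit #{content.bytesize}\0" + content)
  end
end

manifests = {
  "a" => { parents: [],         comment: "initial" },
  "b" => { parents: ["a"],      comment: "second" },
  "c" => { parents: ["a", "b"], comment: "merge" }
}
cache  = CommitIdCache.new(manifests)
first  = cache.git_uuid_for_commit("c")   # walks a, b and c
second = cache.git_uuid_for_commit("c")   # served from the cache
puts first == second
```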

However, as noted earlier, the code is inefficient. Since it uses
recursion, it fails with "stack level too deep" for repositories with
too many commits. Also, the "git_objects" table gets too large,
containing hundreds of thousands of rows if not millions (especially
for Git tree objects, as you need to create a row for every parent
directory of a changed file in a commit).

So, I started to redesign it.

* To avoid the SystemStackError, I used topological sorting and
processed Fossil manifests iteratively rather than recursively. And
instead of building the Git objects on demand (when the request
comes in), the script builds them in advance (e.g., when Fossil writes
a new manifest).
* Write to the file system instead of a table. The script initialises a
"git init --bare" repository and writes Git objects to the
<git>/objects directory directly. An SQL table is still needed, though,
to find the Git commit object ID associated with a Fossil manifest, as
well as the Git blob object IDs for files. But you get to avoid
storing the Git tree objects, so that saves tons of rows.
* Periodically run "git gc" to pack the objects into a Packfile
(https://git-scm.com/book/en/v2/Git-Internals-Packfiles), so that the
repository size doesn't grow like crazy. (The script writes the full
content of each file to disk, and then leaves it to git-gc to
generate and store the deltas.)
* I also added some concurrency using threads and a mutex, so that
multiple threads write to the file system while the main thread creates
Git objects. (Yes, I'm aware that threads are evil -
https://www.sqlite.org/faq.html#q6 - but I was just using this
opportunity to teach myself more about Ruby mutexes & fibers.) :-)
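
Since the redesigned code is not committed yet, here is only a rough
sketch of the first two points, under my own assumptions: the method
names parents_first_order and write_loose_object are hypothetical, the
ordering uses Kahn's algorithm (one possible iterative topological
sort), and loose objects are written as zlib-deflated files under
<git>/objects/<first 2 hex chars>/<remaining 38>.

```ruby
require "digest/sha1"
require "zlib"
require "fileutils"
require "tmpdir"

# parents_of: hypothetical Hash of manifest uuid => array of parent uuids.
# Returns the uuids ordered so that every parent comes before its
# children, using a queue instead of recursion.
def parents_first_order(parents_of)
  pending  = parents_of.transform_values(&:size)   # unprocessed parents per uuid
  children = Hash.new { |hash, key| hash[key] = [] }
  parents_of.each { |uuid, parents| parents.each { |p| children[p] << uuid } }

  ready = pending.select { |_uuid, count| count.zero? }.keys
  order = []
  until ready.empty?
    uuid = ready.shift
    order << uuid
    children[uuid].each do |child|
      pending[child] -= 1
      ready << child if pending[child].zero?
    end
  end
  raise ArgumentError, "cycle or unknown parent" unless order.size == parents_of.size
  order
end

# Writes one loose object into a bare repository and returns its SHA1.
def write_loose_object(git_dir, type, content)
  body  = content.b
  store = "#{type} #{body.bytesize}\0".b + body
  sha   = Digest::SHA1.hexdigest(store)
  path  = File.join(git_dir, "objects", sha[0, 2], sha[2..-1])
  unless File.exist?(path)
    FileUtils.mkdir_p(File.dirname(path))
    File.binwrite(path, Zlib::Deflate.deflate(store))
  end
  sha
end

order = parents_first_order({ "c" => %w[a b], "b" => ["a"], "a" => [] })
Dir.mktmpdir do |git_dir|
  order.each do |uuid|
    sha = write_loose_object(git_dir, "blob", "placeholder content for #{uuid}")
    puts "#{uuid} -> #{sha}"
  end
end
```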

I have not yet committed the code for the above redesign. But if you are
interested, I can commit it in the next 1-2 days.

A better plan would be to understand the Git Packfile format and write
directly to it.
https://www.kernel.org/pub/software/scm/git/docs/v1.4.3/technical/pack-format.txt
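
As a starting point, here is a rough sketch of a version 2 packfile
writer as I understand the format: a "PACK" header with version and
object count, one entry per object (a variable-length type/size header
followed by zlib-deflated content), and a trailing SHA1 of everything
before it. The names build_pack and pack_entry_header are hypothetical,
and this stores objects whole, without any deltas or index file.

```ruby
require "digest/sha1"
require "zlib"

PACK_TYPE_CODES = { "commit" => 1, "tree" => 2, "blob" => 3 }.freeze

# Encodes the entry header: 3 bits of type and the low 4 bits of the
# size in the first byte, then 7 bits of size per following byte.
# The high bit of a byte is set when another byte follows.
def pack_entry_header(type_code, size)
  byte  = (type_code << 4) | (size & 0x0f)
  size >>= 4
  bytes = []
  while size > 0
    bytes << (byte | 0x80)
    byte   = size & 0x7f
    size >>= 7
  end
  bytes << byte
  bytes.pack("C*")
end

# objects: array of [type, content] pairs, stored whole (no deltas).
def build_pack(objects)
  pack = "PACK".b + [2, objects.size].pack("NN")   # signature, version, count
  objects.each do |type, content|
    body = content.b
    pack << pack_entry_header(PACK_TYPE_CODES.fetch(type), body.bytesize)
    pack << Zlib::Deflate.deflate(body)
  end
  pack << Digest::SHA1.digest(pack)                # 20-byte trailer checksum
end

pack = build_pack([["blob", "hello"]])
puts pack.bytesize
```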

__

Vikrant Chaudhary
http://webstream.io




On 19 December 2015 at 20:37, Richard Hipp <[email protected]> wrote:
> Would it be good to support the "git:" URL scheme for
> clone/push/pull/sync?  In other words, teach Fossil to understand the
> GIT wire protocol, translating content to and from the GIT format as
> it crosses the wire?
>
> This would allow you to "clone" repos off of GitHub.  Or to
> automatically sync your Fossil repositories on GitHub.
>
> I'd be willing to work on this as my Christmas project (assuming
> nothing more pressing comes up over the Holiday).  You can help by
> looking up documentation on the Git wire protocol for
> clone/push/pull/sync and sending me links.
>
> --
> D. Richard Hipp
> [email protected]
> _______________________________________________
> fossil-dev mailing list
> [email protected]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/fossil-dev
