Re: GDPR compliance best practices?
On Tuesday, June 12, 2018 09:12:19 PM Peter Backes wrote: > So? If a thousand lawyers claim 1+1=3, it becomes a > mathematical truth? No, but probably a legal "truth". :) -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 01:06:59 PM Jeff King wrote: > On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote: > > > In theory the fetch means that it's safe to actually > > > prune in the mother repo, but in practice there are > > > still races. They don't come up often, but if you > > > have enough repositories, they do eventually. :) > > > > Peff, > > > > I would be very curious to hear what you think of this > > approach to mitigating the effect of those races? > > > > https://git.eclipse.org/r/c/122288/2 > > The crux of the problem is that we have no way to > atomically mark an object as "I am using this -- do not > delete" with respect to the actual deletion. > > So if I'm reading your approach correctly, you put objects > into a purgatory rather than delete them, and let some > operations rescue them from purgatory if we had a race. Yes. This has the cost of extra disk space for a while, but we are incurring that cost already: for our repos, we already put things into purgatory to avoid getting stale NFS file handle errors during unrecoverable paths (while streaming an object). So effectively this has no extra space cost beyond what is needed to run safely on NFS. > 1. When do you rescue from purgatory? Any time the > object is referenced? Do you then pull in all of its > reachable objects too? For my approach, I decided a) Yes b) No Because: a) Rescue on reference is cheap and allows any other policy to be built upon it; just ensure that policy references the object at some point before it is pruned from the purgatory. b) The other referenced objects will likely get pulled in on reference anyway, or by virtue of being in the same pack. > 2. How do you decide when to drop an object from > purgatory? And specifically, how do you avoid racing with > somebody using the object as you're pruning purgatory? If you clean the purgatory during repacking, after creating all the new packs and before deleting the old ones, you will have a significant grace window to handle most longer-running operations. In this way, repacking will have re-referenced any missing objects from the purgatory before it gets pruned, causing them to be recovered if necessary. Those missing objects, believed to be in the exact packs in the purgatory at that time, should only ever have been referenced by write operations that started before those packs were moved to the purgatory, which was before the previous repacking round ended. This leaves write operations a full repacking cycle to complete in, to avoid losing objects. > 3. How do you know that an operation has been run that > will actually rescue the object, as opposed to silently > having a corrupted state on disk? > > E.g., imagine this sequence: > > a. git-prune computes reachability and finds that > commit X is ready to be pruned > > b. another process sees that commit X exists and > builds a commit that references it as a parent > > c. git-prune drops the object into purgatory > > Now we have a corrupt state created by the process in > (b), since we have a reachable object in purgatory. But > what if nobody goes back and tries to read those commits > in the meantime? See the answer to #2: repacking itself should rescue any objects that need to be rescued before pruning the purgatory. > I think this might be solvable by using the purgatory as a > kind of "lock", where prune does something like: > > 1. compute reachability > > 2. move candidate objects into purgatory; nobody can > look into purgatory except us I don't think this is needed. 
It should be OK to let others see the objects in the purgatory after 1 and before 3, as long as "seeing" them causes them to be recovered! > 3. compute reachability _again_, making sure that no > purgatory objects are used (if so, rollback the deletion > and try again) Yes, you laid out the formula, but nothing says this recompute can't wait until the next repack (again, see my answer to #2)! i.e. there is no rush to cause a recovery as long as the object gets recovered before it gets pruned from the purgatory. > But even that's not quite there, because you need to have > some consistent atomic view of what's "used". Just > checking refs isn't enough, because some other process > may be planning to reference a purgatory object but not > yet have updated the ref. So you need some atomic way of > saying "I am interested in using this object". As long as all write paths also read the object first (I assume they do, or we would be in big trouble already), then this should not be an issue. The idea is to force all reads (and thus all writes also) to recover the object, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
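To make the sequencing I describe above concrete, here is a rough sketch of one repack/purgatory cycle as I picture it. The purgatory path, and the idea that repack moves (rather than deletes) its redundant packs, are hypothetical illustrations, not what stock git-repack or the jgit change literally does today:

    # repack cycle N (run inside $GIT_DIR)
    git repack -ad                                  # write the new packs first
    mkdir -p objects/pack/purgatory
    mv objects/pack/pack-OLD.pack objects/pack/pack-OLD.idx \
       objects/pack/purgatory/                      # imagine the redundant packs moved here instead of deleted
    # ...normal traffic continues; any reader that only finds an object in
    # purgatory copies ("rescues") it back into the live object store...

    # repack cycle N+1, a full cycle later
    git repack -ad                                  # anything rescued since cycle N gets packed normally
    rm -rf objects/pack/purgatory                   # only now is purgatory pruned

The point of the ordering is simply that nothing is deleted until at least one full repacking cycle has had a chance to rescue it.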
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 12:37:45 PM Jeff King wrote: > On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote: > Yes, that's pretty close to what we do at GitHub. Before > doing any repacking in the mother repo, we actually do > the equivalent of: > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > git repack -Adl > > from each child to pick up any new objects to de-duplicate > (our "mother" repos are not real repos at all, but just > big shared-object stores). ... > In theory the fetch means that it's safe to actually prune > in the mother repo, but in practice there are still > races. They don't come up often, but if you have enough > repositories, they do eventually. :) Peff, I would be very curious to hear what you think of this approach to mitigating the effect of those races? https://git.eclipse.org/r/c/122288/2 -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
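For reference, the per-child de-duplication sequence Peff describes could be scripted roughly like this. The ../$id.git layout and the two git commands are his; the loop, the inventory file, and whether the repack runs once or per fetch are my assumptions:

    # run inside the mother object store, once per child repository
    for id in $(cat children.list); do     # hypothetical inventory of child repo ids
        git fetch --prune "../$id.git" "+refs/*:refs/remotes/$id/*"   # absorb objects only the child has
    done
    git repack -Adl     # the mother now holds everything its children still reference
    # in theory it is now safe to prune the mother; in practice, see the races above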
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 03:11:47 PM Konstantin Ryabitsev wrote: > On 05/16/18 15:03, Martin Fick wrote: > >> I'm undecided about that. On the one hand this does > >> create lots of small files and inevitably causes > >> (some) performance degradation. On the other hand, I > >> don't want to keep useless objects in the pack, > >> because that would also cause performance degradation > >> for people cloning the "mother repo." If my > >> assumptions on any of that are incorrect, I'm happy to > >> learn more. > > > > My suggestion is to use science, not logic or hearsay. > > :) > > i.e. test it! > > I think the answer will be "it depends." In many of our > cases the repos that need those loose objects are rarely > accessed -- usually because they are forks with older > data (hence why they need objects that are no longer used > by the mother repo). Therefore, performance impacts of > occasionally touching a handful of loose objects will be > fairly negligible. This is especially true on > non-spinning media where seek times are low anyway. > Having slimmer packs for the mother repo would be more > beneficial in this case. > > On the other hand, if the "child repo" is frequently used, > then the impact of needing a bunch of loose objects would > be greater. For the sake of simplicity, I think I'll > leave things as they are -- it's cheaper to fix this via > reducing seek times than by applying complicated logic > trying to optimize on a per-repo basis. I think a major performance issue with loose objects is not just the seek time, but also the fact that they are not delta compressed. This means that serving them over the wire will likely incur a significant deltification/recompression cost before they can be sent. Unlike the seek time, this cost is not mitigated across concurrent fetches by the FS (or jgit, if you were to use it) caching, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
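For anyone who wants to put numbers on the packed-vs-loose difference before deciding, git reports both sizes directly; the figures below are made up, only the field names are real:

    git count-objects -v
    # count: 152340        <- number of loose objects
    # size: 1843200        <- disk used by loose objects, in KiB (zlib-compressed, not deltified)
    # in-pack: 5210034     <- number of packed objects
    # size-pack: 912345    <- disk used by packs, in KiB (deltified)

Comparing "size" per loose object against "size-pack" per packed object gives a rough sense of how much delta compression is being given up.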
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 03:01:13 PM Konstantin Ryabitsev wrote: > On 05/16/18 14:26, Martin Fick wrote: > > If you are going to keep the unreferenced objects around > > forever, it might be better to keep them around in > > packed > > form? > > I'm undecided about that. On the one hand this does create > lots of small files and inevitably causes (some) > performance degradation. On the other hand, I don't want > to keep useless objects in the pack, because that would > also cause performance degradation for people cloning the > "mother repo." If my assumptions on any of that are > incorrect, I'm happy to learn more. My suggestion is to use science, not logic or hearsay. :) i.e. test it! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 02:12:24 PM Konstantin Ryabitsev wrote: > The loose objects I'm thinking of are those that are > generated when we do "git repack -Ad" -- this takes all > unreachable objects and loosens them (see man git-repack > for more info). Normally, these would be pruned after a > certain period, but we're deliberately keeping them > around forever just in case another repo relies on them > via alternates. I want those repos to "claim" these loose > objects via hardlinks, such that we can run git-prune on > the mother repo instead of dragging all the unreachable > objects on forever just in case. If you are going to keep the unreferenced objects around forever, it might be better to keep them around in packed form? We currently do that because we don't think there is a safe way to prune objects yet on a running server (which is why I am teaching jgit to be able to recover from a racy pruning error), -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
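For stock git, keeping the unreferenced objects in packed form instead of loosening them is already expressible; something like the following should do it (double-check the repack documentation for your git version, since the behavior of these flags has been adjusted over releases):

    # -k / --keep-unreachable: append unreachable objects to the new pack
    # instead of exploding them into loose objects
    git repack -a -d -k

The trade-off is the one discussed in this thread: the pack stays self-contained and delta-compressed, but it also carries objects that nothing in this repo references anymore.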
Re: worktrees vs. alternates
On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev wrote: > > 1. Find every repo mentioning the parent repository in > their alternates 2. Repack them without the -l switch > (which copies all the borrowed objects into those repos) > 3. Once all child repos have been repacked this way, prune > the parent repo (it's safe now) This is probably only true if the repos are in read-only mode? I suspect this is still racy on a busy server with no downtime. > 4. Repack child repos again, this time with the -l flag, > to get your savings back. > I would heartily love a way to teach git-repack to > recognize when an object it's borrowing from the parent > repo is in danger of being pruned. The cheapest way of > doing this would probably be to hardlink loose objects > into its own objects directory and only consider "safe" > objects those that are part of the parent repository's > pack. This should make alternates a lot safer, just in > case git-prune happens to run by accident. I think that hard linking is generally a good approach to solving many of the "pruning" races left in git. I have uploaded a "hard linking" proposal to jgit that could potentially solve a similar situation that is not alternate specific, and only for packfiles, with the intent of eventually also doing something similar for loose objects. You can see this here: https://git.eclipse.org/r/c/122288/2 I think it would be good to fill in more of these pruning gaps! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
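A very rough sketch of what "claiming" the borrowed loose objects from inside a child repo could look like. Everything here is hypothetical: the mother path, and the assumption that both repositories sit on the same filesystem so hardlinks are even possible:

    mother=/repos/mother.git    # hypothetical path to the repo named in objects/info/alternates
    # hardlink every loose object from the mother into this repo's own object
    # store, so a later git-prune in the mother cannot pull them out from under us
    (cd "$mother/objects" && find ?? -type f 2>/dev/null) | while read obj; do
        mkdir -p ".git/objects/${obj%/*}"
        ln -f "$mother/objects/$obj" ".git/objects/$obj"
    done

After this, a repack with -l in the child would still not copy the mother's packed objects, which is why the proposal above treats loose objects and objects in the parent's packs differently.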
Re: Git push error due to hooks error
On Monday, May 14, 2018 05:32:35 PM Barodia, Anjali wrote: > I was trying to push a local git repo to another git repo on Gerrit, > but got stuck with a hook error. This is a very large repo > and while running the command "git push origin --all" I > am getting these errors: > > remote: (W) 92e19d4: too many commit message lines longer > than 70 characters; manually wrap lines remote: (W) > de2245b: too many commit message lines longer than 70 > characters; manually wrap lines remote: (W) dc6e982: too > many commit message lines longer than 70 characters; > manually wrap lines remote: (W) d2e2efd: too many commit > message lines longer than 70 characters; manually wrap > lines remote: error: internal error while processing > changes To ssh_url_path:8282/SI_VF.git > ! [remote rejected] master -> master (Error running > hook /opt/gerrit/hooks/ref-update) error: failed to > push some refs to 'ssh_user@url_path:8282/SI_VF.git' This is standard Gerrit behavior. For Gerrit questions, please post your question to the "Repo and Gerrit Discussion" group. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee wrote: > On 1/10/2018 1:25 PM, Martin Fick wrote: > > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee > > > > wrote: > >> This RFC includes a new way to index the objects in > >> multiple packs using one file, called the multi-pack > >> index (MIDX). > > > > ... > > > >> The main goals of this RFC are: > >> > >> * Determine interest in this feature. > >> > >> * Find other use cases for the MIDX feature. > > > > My interest in this feature would be to speed up fetches > > when there is more than one large pack-file with many of > > the same objects that are in other pack-files. What > > does your MIDX design do when it encounters multiple > > copies of the same object in different pack files? > > Does it index them all, or does it keep a single copy? > > The MIDX currently keeps only one reference to each > object. Duplicates are dropped during writing. (See the > care taken in commit 04/18 to avoid duplicates.) Since > midx_sha1_compare() does not use anything other than the > OID to order the objects, there is no decision being made > about which pack is "better". The MIDX writes the first > copy it finds and discards the others. This would likely speed things up then, even if the chosen objects are suboptimal. > It would not be difficult to include a check in > midx_sha1_compare() to favor one packfile over another > based on some measurement (size? mtime?). Since this > would be a heuristic at best, I left it out of the > current patch. Yeah, I didn't know what heuristic to use either. I tended to think that the bigger pack-file would be valuable because it is more likely to share deltas with other objects in that pack, making them easier to send. However, that is likely only true during clones or other large fetches when we want most objects. During small "update" fetches, the newer packs might be better? I also thought that objects in alternates should be considered less valuable for my use case; however, in the GitHub fork use case, the alternates might be more valuable? So yes, heuristics, and I don't know what is best. Perhaps some config options could be used to set heuristics like this. Whatever the heuristics are, since they would be a part of the MIDX packing process it would be easy to change. This assumes that keeping only one copy in the index is the right thing. The question would be, what if we need different heuristics for different operations? Would it make sense to have multiple MIDX files covering the same packs then, one for fetch, one for merge...? > > In our Gerrit instance (Gerrit uses jgit), we have > > multiple copies of the linux kernel repos linked > > together via the alternatives file mechanism. > > GVFS also uses alternates for sharing packfiles across > multiple copies of the repo. The MIDX is designed to > cover all packfiles in the same directory, but is not > designed to cover packfiles in multiple alternates; > currently, each alternate would need its own MIDX file. > Does that cause issues with your setup? No, since the other large packfiles are all in other repos (alternates). Is there a reason the MIDX would not want to cover the alternates? If you don't then you would seemingly lose any benefits of the MIDX when you have alternates in use. ... > > It would be nice if this use case could be improved with > > MIDX. To do so, it seems that it would either require > > that MIDX either only put "the best" version of an > > object (i.e. 
pre-select which one to use), or include > > the extra information to help make the selection > > process of which copy to use (perhaps based on the > > operation being performed) fast. > > I'm not sure if there is sufficient value in storing > multiple references to the same object stored in multiple > packfiles. There could be value in carefully deciding > which copy is "best" during the MIDX write, but during > read is not a good time to make such a decision. It also > increases the size of the file to store multiple copies. Yes, I am not sure either, it would be good to have input from experts here. > > This also leads me to ask, what other additional > > information (bitmaps?) for other operations, besides > > object location, might suddenly be valuable in an index > > that potentially points to multiple copies of objects? > > Would such information be appropriate in MIDX, or would > > it be better in another index? > > For applications to bitmaps, it is probably best that we > only include one copy of each object. Otherwise, we need >
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote: > This RFC includes a new way to index the objects in > multiple packs using one file, called the multi-pack > index (MIDX). ... > The main goals of this RFC are: > > * Determine interest in this feature. > > * Find other use cases for the MIDX feature. My interest in this feature would be to speed up fetches when there is more than one large pack-file with many of the same objects that are in other pack-files. What does your MIDX design do when it encounters multiple copies of the same object in different pack files? Does it index them all, or does it keep a single copy? In our Gerrit instance (Gerrit uses jgit), we have multiple copies of the linux kernel repos linked together via the alternates file mechanism. These repos have many different references (mostly Gerrit change references), but they share most of the common objects from the mainline. I have found that during a large fetch such as a clone, jgit spends a significant amount of extra time by having the extra large pack-files from the other repos visible to it, usually around an extra minute per instance of these (without them, the clone takes around 7mins). This adds up easily; with a few extra repos, it can almost double the time. My investigations have shown that this is due to jgit searching each of these pack files to decide which version of each object to send. I don't fully understand its selection criteria; however, if I shortcut it to just pick the first copy of an object that it finds, I regain my lost time. I don't know if git suffers from a similar problem? If git doesn't suffer from this then it likely just uses the first copy of an object it finds (which may not be the best object to send?) It would be nice if this use case could be improved with MIDX. To do so, it seems that it would require that MIDX either only put "the best" version of an object (i.e. pre-select which one to use), or include the extra information to help make the selection process of which copy to use (perhaps based on the operation being performed) fast. This also leads me to ask, what other additional information (bitmaps?) for other operations, besides object location, might suddenly be valuable in an index that potentially points to multiple copies of objects? Would such information be appropriate in MIDX, or would it be better in another index? Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
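As a quick way to see how much of this cross-pack duplication a repository actually has, one can count object ids that show up in more than one pack index; a sketch for the local pack directory (run the same loop over each alternate's pack directory to include those as well):

    # count object ids that appear in more than one local pack index
    for idx in .git/objects/pack/*.idx; do
        git show-index < "$idx" | awk '{print $2}'
    done | sort | uniq -d | wc -l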
Re: Bring together merge and rebase
> On Jan 4, 2018 11:19 AM, "Martin Fick" <mf...@codeaurora.org> wrote: > > On Tuesday, December 26, 2017 12:40:26 AM Jacob Keller > > > > wrote: > > > On Mon, Dec 25, 2017 at 10:02 PM, Carl Baldwin > > > > <c...@ecbaldwin.net> wrote: > > > >> On Mon, Dec 25, 2017 at 5:16 PM, Carl Baldwin > > > > <c...@ecbaldwin.net> wrote: > > > >> A bit of a tangent here, but a thought I didn't > > > >> wanna > > > >> lose: In the general case where a patch was rebased > > > >> and the original parent pointer was changed, it is > > > >> actually quite hard to show a diff of what changed > > > >> between versions. > > > > > > My biggest gripes are that the gerrit web interface > > > doesn't itself do something like this (and jgit does > > > not > > > appear to be able to generate combined diffs at all!) > > > > I believe it now does, a presentation was given at the > > Gerrit User summit in London describing this work. It > > would indeed be great if git could do this also! On Thursday, January 04, 2018 04:02:40 PM Jacob Keller wrote: > Any chance slides or a recording was posted anywhere? I'm > quite interested in this topic. Slides and video + transcript here: https://gerrit.googlesource.com/summit/2017/+/master/sessions/new-in-2.15.md Watch the part after the backend improvements, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Bring together merge and rebase
On Tuesday, December 26, 2017 01:31:55 PM Carl Baldwin wrote: ... > What I propose is that gerrit and github could end up more > robust, featureful, and interoperable if they had this > feature to build from. I agree (assuming we come up with a well-defined feature) > With gerrit specifically, adopting this feature would make > the "change" concept richer than it is now because it > could supersede the change-id in the commit message and > allow a change to evolve in a distributed non-linear way > with protection against clobbering work. We (the Gerrit maintainers) would like changes to be able to evolve non-linearly so that we can eventually support distributed Gerrit reviews, and the amended-commit pointer is one way I have thought to resolve this. > I have no intention to disparage either tool. I love them > both. They've both made my career better in different > ways. I know there is no guarantee that github, gerrit, > or any other tool will do anything to adopt this. But, > I'm hoping they are reading this thread and that they > recognize how this feature can make them a little bit > better and jump in and help. I know it is a lot to hope > for but I think it could be great if it happened. We (the Gerrit maintainers) do recognize it, and I am glad that someone is pushing for solutions in this space. I am not sure what the right solution is, and how to modify workflows to deal better with this. I do think that making your local repo track pointers to amended-commits, likely with various git hooks and notes (as also proposed by Johannes Schindelin), would be a good start. With that in place, you can then attack various specific workflows. If you want to then attack the Gerrit workflow, it would be good if you could prevent pushing new patchsets that are amended versions of patchsets that are out of date. While it would be great if Gerrit could reject such pushes, I wonder if, to start, git could detect this and prevent the push in this situation? Could a git push hook analyze the ref advertisements and figure this out (all the patchsets are in the advertisement)? Can a git hook look at the ref advertisement? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
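To partly answer my own question: a pre-push hook is not handed the server's full advertisement, but it can simply request it again with ls-remote. A sketch (the refs/changes/* layout is Gerrit's; the rest is hypothetical and only shows that the information is reachable from a hook):

    #!/bin/sh
    # .git/hooks/pre-push  --  $1 = remote name, $2 = remote URL
    git ls-remote "$2" 'refs/changes/*' > "$(git rev-parse --git-dir)/advertised-patchsets"
    # ...a real hook would now check that the patchset being amended is still the
    # newest advertised patchset of its change, and exit non-zero if it is not...
    exit 0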
Re: Bring together merge and rebase
On Monday, December 25, 2017 06:16:40 PM Carl Baldwin wrote: > On Sun, Dec 24, 2017 at 10:52:15PM -0500, Theodore Ts'o wrote: > Look at what happens in a rebase type workflow in any of > the following scenarios. All of these came up regularly > in my time with Gerrit. > > 1. Make a quick edit through the web UI then later > work on the change again in your local clone. It is easy > to forget to pull down the change made through the UI > before starting to work on it again. If that happens, the > change made through the UI will almost certainly be > clobbered. > > 2. You or someone else creates a second change that is > dependent on yours and works on it while yours is still > evolving. If the second change gets rebased with an older > copy of the base change and then posted back up for > review, newer work in the base change has just been > clobbered. > > 3. As a reviewer, you decide the best way to explain > how you'd like to see something done differently is to > make the quick change yourself and push it up. If the > author fails to fetch what you pushed before continuing > onto something else, it gets clobbered. > > 4. You want to collaborate on a single change with > someone else in any way and for whatever reason. As soon > as that change starts hitting multiple work spaces, there > are synchronization issues that currently take careful > manual intervention. These scenarios seem to come up most for me at Gerrit hack-a-thons where we collaborate a lot in short time spans on changes. We (the Gerrit maintainers) too have wanted and sometimes discussed ways to track the relation of "amended" commits (which is generally what Gerrit patchsets are). We also concluded that some sort of parent commit pointer was needed, although parent is somewhat the wrong term since that already means something in git. Rather, maybe some "predecessor" type of term would be better, maybe "antecedent", but "amended-commit" pointer might be best? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Bring together merge and rebase
On Sunday, December 24, 2017 12:01:38 AM Johannes Schindelin wrote: > Hi Carl, > > On Sat, 23 Dec 2017, Carl Baldwin wrote: > > I imagine that a "git commit --amend" would also insert > > a "replaces" reference to the original commit but I > > failed to mention that in my original post. > > And cherry-pick, too, of course. > > Both of these examples hint at a rather huge urge of some > users to turn this feature off because the referenced > commits may very well be throw-away commits in their > case, making the newly-recorded information completely > undesired. > > Example: I am working on a topic branch. In the middle, I > see a typo. I commit a fix, continue to work on the topic > branch. Later, I cherry-pick that commit to a separate > topic branch because I really don't think that those two > topics are related. Now I definitely do not want a > reference of the cherry-picked commit to the original > one: the latter will never be pushed to a public > repository, and gc'ed in a few weeks. > > Of course, that is only my wish, other users in similar > situations may want that information. Demonstrating that > you would be better served with an opt-in feature that > uses notes rather than a baked-in commit header. I think what you are highlighting is not when to track this, but rather when to share this tracking. In my local repo, I would definitely want to know that I cherry-picked this from elsewhere, it helps me understand what I have done later when I look back at old commits and branches that need to potentially be thrown away. But I agree you may not want to share these publicly. I am not sure what the right formula is, for when to share these pointers publicly, but it seems like it might be that whenever you push something, it should push along any references to amended commits that were publicly available already. I am not sure how to track that, but I suspect it is a subset of the union of commits you have fetched, and commits you have pushed (i.e. you got it from elsewhere, or you created it and already shared it with others)? Maybe it should also include any commits reachable by advertisements to places you are pushing to (in case it got shared some other way)? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
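For the purely local tracking discussed here, git already tells hooks about amends and rebases, so a notes-based, opt-in version can be prototyped today; a sketch (the notes ref name is made up, and note that cherry-pick does not trigger this hook):

    #!/bin/sh
    # .git/hooks/post-rewrite  --  $1 is "amend" or "rebase";
    # stdin carries lines of the form: <old-sha1> <new-sha1> [extra-info]
    while read old new extra; do
        # record the predecessor ("amended-from") pointer on the new commit
        git notes --ref=refs/notes/amended-from add -f -m "$old" "$new"
    done

Whether and when such notes should be pushed is exactly the sharing question raised above.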
Re: Bring together merge and rebase
On Tuesday, December 26, 2017 12:40:26 AM Jacob Keller wrote: > On Mon, Dec 25, 2017 at 10:02 PM, Carl Baldwin wrote: > >> On Mon, Dec 25, 2017 at 5:16 PM, Carl Baldwin wrote: > >> A bit of a tangent here, but a thought I didn't wanna > >> lose: In the general case where a patch was rebased > >> and the original parent pointer was changed, it is > >> actually quite hard to show a diff of what changed > >> between versions. > > My biggest gripes are that the gerrit web interface > doesn't itself do something like this (and jgit does not > appear to be able to generate combined diffs at all!) I believe it now does; a presentation was given at the Gerrit User Summit in London describing this work. It would indeed be great if git could do this also! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] fetch-pack: always allow fetching of literal SHA1s
On Wednesday, May 10, 2017 11:20:49 AM Jonathan Nieder wrote: > Hi, > > Ævar Arnfjörð Bjarmason wrote: > > Just a side question, what are the people who use this > > feature using it for? The only thing I can think of > > myself is some out of band ref advertisement because > > you've got squillions of refs as a hack around git's > > limitations in that area. > > That's one use case. > > Another is when you really care about the exact sha1 (for > example because you are an automated build system and > this is the specific sha1 you have already decided you > want to build). > > Are there other use-cases for this? All the commits[1] > > that touched this feature just explain what, not why. > > Similar to the build system case I described above is when > a human has a sha1 (from a mailing list, or source > browser, or whatever) and wants to fetch just that > revision, with --depth=1. You could use "git archive > --remote", but (1) github doesn't support that and (2) > that doesn't give you all the usual git-ish goodness. Perhaps another use case is submodules and repo (the Android tool) subprojects, since they can be "pinned" to sha1s, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
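For the pinned-sha1 case, the fetch in question looks roughly like this; note the server still has to permit it via the matching uploadpack.allow*SHA1InWant configuration (the sha1 below is just a placeholder):

    # fetch exactly one pinned commit, shallowly, without learning any refs
    git fetch --depth=1 origin 7412ee739b8a20941aa1c2fd03abcc7336b330ba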
Re: Simultaneous gc and repack
On Thursday, April 13, 2017 02:28:07 PM David Turner wrote: > On Thu, 2017-04-13 at 12:08 -0600, Martin Fick wrote: > > On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote: > > > On Thu, Apr 13, 2017 at 10:31 AM, David Turner > > > > <nova...@novalis.org> wrote: > > > > Git gc locks the repository (using a gc.pid file) so > > > > that other gcs don't run concurrently. But git > > > > repack > > > > doesn't respect this lock, so it's possible to have > > > > a > > > > repack running at the same time as a gc. This makes > > > > the gc sad when its packs are deleted out from under > > > > it > > > > with: "fatal: ./objects/pack/pack-$sha.pack cannot > > > > be > > > > accessed". Then it dies, leaving a large temp file > > > > hanging around. > > > > > > > > Does the following seem reasonable? > > > > > > > > 1. Make git repack, by default, check for a gc.pid > > > > file > > > > (using the same logic as git gc itself does). > > > > 2. Provide a --force option to git repack to ignore > > > > said > > > > check. 3. Make git gc provide that --force option > > > > when > > > > it calls repack under its own lock. > > > > > > What about just making the code that calls repack > > > today > > > just call gc instead? I guess it's more work if you > > > don't > > > strictly need it but..? > > > > There are many scenarios where this does not achieve > > the > > same thing. On the obvious side, gc does more than > > repacking, but on the other side, repacking has many > > switches that are not available via gc. > > > > Would it make more sense to move the lock to repack > > instead of to gc? > > Other gc operations might step on each other too (e.g. > packing refs). That would be less bad (and less common), > but it still seems worth avoiding. Yes, but all of those operations need to be self-protected already, or they risk the same issue. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Simultaneous gc and repack
On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote: > On Thu, Apr 13, 2017 at 10:31 AM, David Turner wrote: > > Git gc locks the repository (using a gc.pid file) so > > that other gcs don't run concurrently. But git repack > > doesn't respect this lock, so it's possible to have a > > repack running at the same time as a gc. This makes > > the gc sad when its packs are deleted out from under it > > with: "fatal: ./objects/pack/pack-$sha.pack cannot be > > accessed". Then it dies, leaving a large temp file > > hanging around. > > > > Does the following seem reasonable? > > > > 1. Make git repack, by default, check for a gc.pid file > > (using the same logic as git gc itself does). > > 2. Provide a --force option to git repack to ignore said > > check. 3. Make git gc provide that --force option when > > it calls repack under its own lock. > > What about just making the code that calls repack today > just call gc instead? I guess it's more work if you don't > strictly need it but..? There are many scenarios where this does not achieve the same thing. On the obvious side, gc does more than repacking, but on the other side, repacking has many switches that are not available via gc. Would it make more sense to move the lock to repack instead of to gc? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
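Until something along these lines is built into repack itself, the check can be approximated by a wrapper; a simplistic sketch (the liveness test ignores the hostname recorded in gc.pid, so treat it as illustrative only):

    #!/bin/sh
    # repack-unless-gc.sh: refuse to repack while a gc appears to be running
    gc_pid_file="$(git rev-parse --git-dir)/gc.pid"
    if [ -f "$gc_pid_file" ]; then
        pid=$(cut -d' ' -f1 "$gc_pid_file")
        if kill -0 "$pid" 2>/dev/null; then
            echo "git gc (pid $pid) appears to be running; not repacking" >&2
            exit 1
        fi
    fi
    exec git repack "$@"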
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
On Sunday, March 12, 2017 11:03:44 AM Junio C Hamano wrote: > Jeff King writes: > > I can think of one downside of a time-based solution, > > though: if you run multiple gc's during the time > > period, you may end up using a lot of disk space (one > > repo's worth per gc). But that's a fundamental tension > > in the problem space; the whole point is to waste disk > > to keep helping old processes. > > Yes. If you want to help a process that mmap's a packfile > and wants to keep using it for N seconds, no matter how > many times somebody else ran "git repack" while you are > doing your work within that timeframe, you somehow need > to make sure the NFS server knows the file is still in > use for that N seconds. > > > But you may want a knob that lets you slide between > > those two things. For instance, if you kept a sliding > > window of N sets of preserved packs, and ejected from > > one end of the window (regardless of time), while > > inserting into the other end. James' existing patch is > > that same strategy with a hardcoded window of "1". > > Again, yes. But then the user does not get any guarantee > of how long-living a process the user can have without > getting broken by the NFS server losing track of a > packfile that is still in use. My suggestion for the > "expiry" based approach is essentially that I do not see > a useful tradeoff afforded by having such a knob. > > The other variable you can manipulate is whether to gc > > in the first place. E.g., don't gc if there are N > > preserved sets (or sets consuming more than N bytes, or > > whatever). You could do that check outside of git > > entirely (or in an auto-gc hook, if you're using it). > Yes, "don't gc/repack more than once within N seconds" may > also be an alternative and may generally be more useful > by covering general source of wastage coming from doing > gc too frequently, not necessarily limited to preserved > pack accumulation. As someone who helps manage a Gerrit server for several thousand repos, all on the same NFS disks, I find a time-based expiry impractical, and not something that I am very interested in having. I favor the simpler (single for now) repacking cycle approach, and it is what we have been using for almost 6 months now successfully, without suffering any more stale file handle exceptions. While time is indeed the factor that is going to determine whether someone is going to see the stale file handles or not, on a server (which is what this feature is aimed at), this is secondary in my mind to predictability about space utilization. I have no specific minimum time that I can reason about, i.e. I cannot reasonably say "I want all operations that last less than 1 hour, 1 min, or 1 second... to succeed". I don't really want ANY failures, and I am willing to sacrifice some disk space to prevent as many as possible. So the question to me is "How much disk space am I willing to sacrifice?", not "How long do I want operations to be able to last?". The only way that time enters my equation is to compare it to how long repacking takes, i.e. I want the preserved files cleaned up on the next repack. So effectively I am choosing a repacking-cycle-based approach, so that I can reasonably predict the extra disk space that I need to reserve for my collection of repos. With a single cycle, I am effectively doubling the "static" size of repos. Achieving this predictability with a time-based approach requires coordination between the expiry setting and the repacking time cycle. 
This coordination is extra effort for me, with no apparent gain. It is also an additional risk that I don't want to have. If I decide to bump up how often I run repacking, and I forget to reduce the expiry time, my disk utilization will grow and potentially cause serious issues for all my repositories (since they share the same volume). This problem is even more difficult if I decide to use a usage (instead of time) based algorithm to determine when I repack. Admittedly, a repacking-cycle-based approach happens to be very easy and practical when it is a "single" cycle. If I eventually determine empirically that a single cycle is not long enough for my server, I don't know what I will do. Perhaps I would then want a switch that preserves the packs for another cycle? Maybe it could work the way that log rotation works, adding a number to the end of each file name for each preserved cycle? This option seems preferable to me to a time-based approach because it makes it more obvious what the impact on disk utilization will be. However, so far in practice, this does not seem necessary. I don't really see a good use case for a time-based expiry (other than "this is how it was done for other things in git"). Of course, that doesn't mean such a use case doesn't exist, but I don't support adding a feature unless I really understand why and how someone would want to use it.
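The log-rotation variant I mention could look something like this for preserved packs; the numbering scheme and file names are purely illustrative, not what the current patch implements:

    # keep two repacking cycles worth of preserved packs, logrotate-style
    cd "$(git rev-parse --git-dir)/objects/pack" || exit 1
    rm -f -- *.old-pack.2 *.old-idx.2              # the oldest cycle is finally dropped
    for f in *.old-pack.1 *.old-idx.1; do          # the previous cycle ages by one
        [ -e "$f" ] && mv -- "$f" "${f%.1}.2"
    done
    # a preserving repack would then rename this cycle's redundant packs to
    # *.old-pack.1 / *.old-idx.1 instead of deleting them

This keeps the disk-space bound easy to reason about: at most N extra copies of the repository's packed size, where N is the number of preserved cycles.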
Re: [PATCH] repack: Add options to preserve and prune old pack files
On Thursday, March 09, 2017 10:50:21 AM jmel...@codeaurora.org wrote: > On 2017-03-07 13:33, Junio C Hamano wrote: > > James Melvin writes: > >> These options are designed to prevent stale file handle > >> exceptions during git operations which can happen on > >> users of NFS repos when repacking is done on them. The > >> strategy is to preserve old pack files around until > >> the next repack with the hopes that they will become > >> unreferenced by then and not cause any exceptions to > >> running processes when they are finally deleted > >> (pruned). > > > > I find it a very sensible strategy to work around NFS, > > but it does not explain why the directory the old ones > > are moved to need to be configurable. It feels to me > > that a boolean that causes the old ones renamed > > s/^pack-/^old-&/ in the same directory (instead of > > pruning them right away) would risk less chances of > > mistakes (e.g. making "preserved" subdirectory on a > > separate device mounted there in a hope to reduce disk > > usage of the primary repository, which may defeat the > > whole point of moving the still-active file around > > instead of removing them). > > Moving the preserved pack files to a separate directory > only helped make the pack directory cleaner, but I agree > that having the old* pack files in the same directory is > a better approach as it would ensure that it's still on > the same mounted device. I'll update the logic to reflect > that. > > As for the naming convention of the preserved pack files, > there is already some logic to remove "old-" files in > repack. Currently this is the naming convention I have > for them:
>
> pack-<SHA-1>.old-<extension>
> pack-7412ee739b8a20941aa1c2fd03abcc7336b330ba.old-pack
>
> One advantage of that is the extension is no longer an > expected one, differentiating it from current pack files. > > That said, if that is not a concern, I could prefix them > with "preserved" instead of "old" to differentiate them > from the other logic that cleans up "old-*". What are > your thoughts on that?
>
> preserved-<SHA-1>.<extension>
> preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.pack

Some other proposals so that the preserved files do not get returned by naive finds based on their extensions:

preserved-<SHA-1>.<extension>-preserved
preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.pack-preserved

or:

preserved-<SHA-1>.preserved-<extension>
preserved-7412ee739b8a20941aa1c2fd03abcc7336b330ba.preserved-pack

or maybe even just:

preserved-<original pack name, without extension>
preserved-pack-7412ee739b8a20941aa1c2fd03abcc7336b330ba

-Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC] Add support for downloading blobs on demand
On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote: > While large files can be a real problem, our biggest issue > today is having a lot (millions!) of source files when > any individual developer only needs a small percentage of > them. Git with 3+ million local files just doesn't > perform well. Honestly, this sounds like a problem better dealt with by using git subtree or git submodules; have you considered that? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
On Monday, January 09, 2017 01:21:37 AM Jeff King wrote: > On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote: > > I am replying to this email across lists because I > > wanted to highlight to the git community this jgit > > change to repacking that we have up for review > > > > https://git.eclipse.org/r/#/c/87969/ > > > > This change introduces a new convention for how to > > preserve old pack files in a staging area > > (.git/objects/packs/preserved) before deleting them. I > > wanted to ensure that the new proposed convention would > > be done in a way that would be satisfactory to the git > > community as a whole so that it would be more easy to > > provide the same behavior in git eventually. The > > preserved pack files (and accompanying index and bitmap > > files), are not only moved, but they are also renamed > > so that they no longer will match recursive finds > > looking for pack files. > It looks like objects/pack/pack-123.pack becomes > objects/pack/preserved/pack-123.old-pack, Yes, that's the idea. > and so forth. Which seems reasonable, and I'm happy that: > > find objects/pack -name '*.pack' > > would not find it. :) Cool. > I suspect the name-change will break a few tools that you > might want to use to look at a preserved pack (like > verify-pack). I know that's not your primary use case, > but it seems plausible that somebody may one day want to > use a preserved pack to try to recover from corruption. I > think "git index-pack --stdin" (fed the preserved pack on > stdin) would be a last-resort for re-admitting the objects to > the repository. or even a simple manual rename/move back to its original place? > I notice this doesn't do anything for loose objects. I > think they technically suffer the same issue, though the > race window is much shorter (we mmap them and zlib > inflate immediately, whereas packfiles may stay mapped > across many object requests). Hmm, yeah that's the next change, didn't you see it? :) No, actually I forgot about those. Our server tends to not have too many of those (loose objects), and I don't think we have seen any exceptions yet for them. But, of course, you are right, they should get fixed too. I will work on a followup change to do that. Where would you suggest we store those? Maybe under ".git/objects/preserved/"? Do they need to be renamed also somehow to avoid a find? ... > I've wondered if we could make object pruning more atomic > by speculatively moving items to be deleted into some > kind of "outgoing" object area. ... > I don't have a solution here. I don't think we want to > solve it by locking the repository for updates during a > repack. I have a vague sense that a solution could be > crafted around moving the old pack into a holding area > instead of deleting (during which time nobody else would > see the objects, and thus not reference them), while the > repacking process checks to see if the actual deletion > would break any references (and rolls back the deletion > if it would). > > That's _way_ more complicated than your problem, and as I > said, I do not have a finished solution. But it seems > like they touch on a similar concept (a post-delete > holding area for objects). So I thought I'd mention it in > case it spurs any brilliance. I agree, this is a problem I have wanted to solve also. 
I think having a "preserved" directory does open the door to such "recovery" solutions, although I think you would actually want to modify the many read code paths to fall back to looking at the preserved area and performing immediate "recovery" of the pack file if it ends up being needed. That's a lot of work, but having the packs (and eventually the loose objects) preserved into a location where no new references will be built to depend on them is likely the first step. Does the name "preserved" do well for that use case also, or would there be some better name, what would a transactional system call them? Thanks for the review Peff! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
On Monday, January 09, 2017 05:55:45 AM Jeff King wrote: > On Mon, Jan 09, 2017 at 04:01:19PM +0900, Mike Hommey wrote: > > > That's _way_ more complicated than your problem, and > > > as I said, I do not have a finished solution. But it > > > seems like they touch on a similar concept (a > > > post-delete holding area for objects). So I thought > > > I'd mention it in case it spurs any brilliance. > > > > Something that is kind-of in the same family of problems > > is the "loosening" of objects on repacks, before they > > can be pruned. ... > Yes, this can be a problem. The repack is smart enough not > to write out objects which would just get pruned > immediately, but since the grace period is 2 weeks, that > can include a lot of objects (especially with history > rewriting as you note). It would be possible to write > those loose objects to a "cruft" pack, but there are some > management issues around the cruft pack. You do not want > to keep repacking them into a new cruft pack at each > repack, since then they would never expire. So you need > some way of marking the pack as cruft, letting it age > out, and then deleting it after the grace period expires. > > I don't think it would be _that_ hard, but AFAIK nobody > has ever made patches. FYI, jgit does this, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Preserve/Prune Old Pack Files
I am replying to this email across lists because I wanted to highlight to the git community this jgit change to repacking that we have up for review https://git.eclipse.org/r/#/c/87969/ This change introduces a new convention for how to preserve old pack files in a staging area (.git/objects/packs/preserved) before deleting them. I wanted to ensure that the new proposed convention would be done in a way that would be satisfactory to the git community as a whole so that it would be more easy to provide the same behavior in git eventually. The preserved pack files (and accompanying index and bitmap files), are not only moved, but they are also renamed so that they no longer will match recursive finds looking for pack files. I look forward to any review (it need not happen on the change, replies to this email would be fine also), in particular with respect to the approach and naming conventions. Thanks, -Martin On Tuesday, January 03, 2017 02:46:12 PM jmel...@codeaurora.org wrote: > We’ve noticed cases where Stale File Handle Exceptions > occur during git operations, which can happen on users of > NFS repos when repacking is done on them. > > To address this issue, we’ve added two new options to the > JGit GC command: > > --preserve-oldpacks: moves old pack files into the > preserved subdirectory instead of deleting them after > repacking > > --prune-preserved: prunes old pack files from the > preserved subdirectory after repacking, but before > potentially moving the latest old pack files to this > subdirectory > > The strategy is to preserve old pack files around until > the next repack with the hopes that they will become > unreferenced by then and not cause any exceptions to > running processes when they are finally deleted (pruned). > > Change is uploaded for review here: > https://git.eclipse.org/r/#/c/87969/ > > Thanks, > James -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: storing cover letter of a patch series?
On Friday, August 05, 2016 08:39:58 AM you wrote: > * A new topic, when you merge it to the "lit" branch, you > describe the cover as the merge commit message. > > * When you updated an existing topic, you tell a tool > like "rebase -i -p" to recreate "lit" branch on top of > the mainline. This would give you an opportunity to > update the cover. This is a neat idea. How would this work if there is no merge commit (mainline hasn't moved)? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: GIT admin access
Bringing this back on list so that someone else can help... On Thursday, June 23, 2016 05:01:18 PM John Ajah wrote: > I'm on a private git, installed on a work server. Now the > guy who set it up is not available and I want to give > access to someone working for me, but I don't know how to > do that. I don't know what type of setup a "private git" means. Is this a machine with ssh access, is it git-daemon, GitHub, gitolite, Gerrit, ...? > This is the error the developer got when he tried cloning: > > FATAL ERROR: Network error: Connection timed out > fatal: Could not read from remote repository. > > Please make sure you have the correct access rights > and the repository exists. > > My partner wants to set up another Git server and transfer > our content to the new server from the one we're > currently using. I think this is very risky and I also > think there has to be a way to provide access without > doing this. We need to know what product you are running to help. What risks are you concerned about with setting up another server? And what kind of server would you be setting up? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: RefTree: Alternate ref backend
On Tuesday, December 22, 2015 06:17:28 PM you wrote: > On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty wrote: > > At a deeper level, the "refs/" part of reference names is > actually pretty useless in general. I suppose it > originated in the practice of storing loose references > under "refs/" to keep them separate from other metadata > in $GIT_DIR. But really, aside from slightly helping > disambiguate references from paths in the command line, > what is it good for? Would we really be worse off if > references' full names were > > HEAD > heads/master > tags/v1.0.0 > remotes/origin/master (or remotes/origin/heads/master) I think this is a bit off, because HEAD != refs/HEAD, so it is not quite useless. But I agree that the whole refs notation has always bugged me; it is quirky. It makes it hard to disambiguate when something is meant to be absolute or not. What if we added a leading slash for absolute references? Then I could do something like:

/HEAD
/refs/heads/master
/refs/tags/v1.0.0
/refs/remotes/origin/master

I don't like that plumbing has to do a dance to guess at expansions, and how many tools get it wrong (do it in different orders, miss some expansions...). With an absolute notation, plumbing could be built to require absolute notations, giving more predictable interpretations when called from tools. This is a long-term idea, but it might make sense to consider it now just for the sake of storing refs; it would eliminate the need for the ".." notation for "refs/..HEAD". Now if we could only figure out a way to tell plumbing that something is a SHA, not a ref? :) -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: storing cover letter of a patch series?
+repo-disc...@googlegroups.com (to hit Gerrit developers also) On Thursday, September 10, 2015 09:28:52 AM Jacob Keller wrote: > does anyone know of any tricks for storing a cover letter > for a patch series inside of git somehow? I'd guess the > only obvious way currently is to store it at the top of > the series as an empty commit.. but this doesn't get > emailed as the cover letter... ... > I really think it should be possible to store something > somehow as a blob that could be looked up later. On Thursday, September 10, 2015 10:41:54 AM Junio C Hamano wrote: > > I think "should" is too strong here. Yes, you could > implement that way. It is debatable if it is better, or > a flat file kept in a directory (my-topic/ in the example > above) across rerolls is more flexible, lightweight and > with less mental burden to the users. -- As a Gerrit developer and user, I would like a way to see/review cover letters in Gerrit. We have had many internal proposals, most based on git notes, but we have also used the empty commit trick. It would be nice if there were some standard git way to do this so that Gerrit and other tools could benefit from this standard. I am not suggesting that git needs to be modified to do this, but rather that at least some convention be established. -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
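One notes-based convention that works with stock git today, for whatever it is worth, is to hang the cover letter off the series tip under an agreed-upon notes ref (the ref name and file name here are made up):

    # attach (or refresh) the cover letter on the tip of the series
    git notes --ref=refs/notes/cover-letter add -f -F cover-letter.txt my-topic

    # a reviewer, or a tool like Gerrit, reads it back with
    git notes --ref=refs/notes/cover-letter show my-topic

The obvious wart is that the note is keyed to the tip commit's sha1, so every reroll of the series needs the note copied forward to the new tip.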
Re: [PATCH] protocol upload-pack-v2
The current protocol has the following problems that limit us:

- It is not easy to make it resumable, because we recompute every time. This is especially problematic for the initial fetch aka clone as we will be talking about a large transfer. Redirection to a bundle hosted on CDN might be something we could do transparently.

- The protocol extension has a fairly low length limit.

- Because the protocol exchange starts by the server side advertising all its refs, even when the fetcher is interested in a single ref, the initial overhead is nontrivial, especially when you are doing a small incremental update. The worst case is an auto-builder that polls every five minutes, even when there are no new commits to be fetched.

A lot of the focus on the problems with ref advertisement is on the obvious problem mentioned above (a bad problem indeed). I would like to add that there is another related problem that all potential solutions to the above problem do not necessarily improve. When polling regularly, there is also no current efficient way to check on the current state of all refs. It would be nice to also be able to get an incremental update on large ref spaces. Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
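The cost of that per-poll advertisement is easy to eyeball today, since ls-remote shows essentially the same ref listing the server sends on every connection (the byte count is approximate, as it ignores pkt-line framing and capabilities):

    url=https://example.com/some/repo.git    # whatever the auto-builder polls
    git ls-remote "$url" | wc -c             # rough bytes of ref advertisement per poll
    git ls-remote "$url" | wc -l             # number of advertised refs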
Re: Git Scaling: What factors most affect Git performance for a large repo?
On Friday, February 20, 2015 01:29:12 PM David Turner wrote: ... For a more general solution, perhaps a log of ref updates could be used. Every time a ref is updated on the server, that ref would be written into an append-only log. Every time a client pulls, their pull data includes an index into that log. Then on push, the client could say, I have refs as-of $index, and the server could read the log (or do something more-optimized) and send only refs updated since that index. Interesting idea, I like it. How would you make this reliable? It relies on updates being reliably recorded, which would mean that you would have to ensure that any tool which touches the repo follows this convention. That is unfortunately a tough thing to enforce for most people. But perhaps, instead of logging updates, the server could log snapshots of all refs using an atomically increasing sequence number. Then missed updates do not matter: a sequence number is simply an opaque handle to some full ref state that can be diffed against. The snapshots need not even be taken inline with the client connection, or with every update, for this to work. It might mean that some extra updates are sent when they don't need to be, but at least they will be accurate. I know in the past similar ideas have been passed around, but they typically relied on the server keeping track of the state of each client. Instead, here we are talking about clients keeping track of state for a particular server. Clients already store info about remotes. A very neat idea indeed, thanks! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
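A hypothetical server-side sketch of the snapshot idea; the directory layout and file names are invented purely for illustration:

    mkdir -p refs-snapshots
    seq=$(( $(cat refs-snapshots/LATEST 2>/dev/null || echo 0) + 1 ))
    git for-each-ref --format='%(objectname) %(refname)' > refs-snapshots/$seq
    echo $seq > refs-snapshots/LATEST
    # a client that last saw snapshot $old only needs the difference:
    diff refs-snapshots/$old refs-snapshots/$seq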
Re: Git Scaling: What factors most affect Git performance for a large repo?
On Feb 19, 2015 5:42 PM, David Turner dtur...@twopensource.com wrote: On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: * 'git push'? This one is not affected by how deep your repo's history is, or how wide your tree is, so should be quick.. Ah the number of refs may affect both git-push and git-pull. I think Stefan knows better than I in this area. I can tell you that this is a bit of a problem for us at Twitter. We have over 100k refs, which adds ~20MiB of downstream traffic to every push. I added a hack to improve this locally inside Twitter: The client sends a bloom filter of shas that it believes that the server knows about; the server sends only the sha of master and any refs that are not in the bloom filter. The client uses its local version of the servers' refs as if they had just been sent. This means that some packs will be suboptimal, due to false positives in the bloom filter leading some new refs to not be sent. Also, if there were a repack between the pull and the push, some refs might have been deleted on the server; we repack rarely enough and pull frequently enough that this is hopefully not an issue. We're still testing to see if this works. But due to the number of assumptions it makes, it's probably not that great an idea for general use. Good to hear that others are starting to experiment with solutions to this problem! I hope to hear more updates on this. I have a prototype of a simpler, and I believe more robust solution, but aimed at a smaller use case I think. On connecting, the client sends a sha of all its refs/shas as defined by a refspec, which it also sends to the server, which it believes the server might have the same refs/shas values for. The server can then calculate the value of its refs/shas which meet the same refspec, and then omit sending those refs if the verification sha matches, and instead send only a confirmation that they matched (along with any refs outside of the refspec). On a match, the client can inject the local values of the refs which met the refspec and be guaranteed that they match the server's values. This optimization is aimed at the worst case scenario (and is thus the potentially best case compression), when the client and server match for all refs (a refs/* refspec) This is something that happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date. One reason I chose this as a starting optimization, is because I think it is one use case which will actually not benefit from fixing the git protocol to only send relevant refs since all the refs are in fact relevant here! So something like this will likely be needed in any future git protocol in order for it to be efficient for this use case. And I believe this use case is likely to stick around. With a minor tweak, this optimization should work when replicating actual expected updates also by excluding the expected updating refs from the verification so that the server always sends their values since they will likely not match and would wreck the optimization. However, for this use case it is not clear whether it is actually even worth caring about the non updating refs? In theory the knowledge of the non updating refs can potentially reduce the amount of data transmitted, but I suspect that as the ref count increases, this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic. Please do keep us up-to-date of your results, -Martin Qualcomm Innovation Center, Inc. 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
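A sketch of the verification-sha proposal above; refs/heads/* is just an example refspec, and sha1sum stands in for whatever digest the protocol would actually use:

    # both ends hash their view of the refs matched by the agreed refspec
    git for-each-ref --format='%(objectname) %(refname)' 'refs/heads/*' | sha1sum
    # if the digests match, the server sends a short "matched" confirmation
    # instead of re-advertising every one of those refs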
Re: Multi-threaded 'git clone'
There currently is a thread on the Gerrit list about how much faster cloning can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs with bitmaps. Some differences outlined are that jgit seems to have more bitmaps, it creates one for every refs/heads, is C git doing that? Another difference seems to be that jgit creates two packs, splitting stuff not reachable from refs/heads into its own pack. This makes a clone have zero CPU server side in the pristine case. In the Gerrit use case, this second unreachable packfile can be sizeable, I wonder if there are other use cases where this might also be the case (and this slowing down clones for C git GCed repos)? If there is not a lot of parallelism left to squeak out, perhaps a focus with better returns is trying to do whatever is possible to make all clones (and potentially any fetch use case deemed important on a particular server) have zero CPU? Depending on what a server's primary mission is, I could envision certain admins willing to sacrifice significant amounts of disk space to speed up their fetches. Perhaps some more extreme thinking (such as what must have led to bitmaps) is worth brainstorming about to improve server use cases? What if an admin were willing to sacrifice a packfile for every use case he deemed important, could git be made to support that easily? For example, maybe the admin considers a clone or a fetch from master to be important, could zero percent CPU be achieved regularly for those two use cases? Cloning is possible if the repository were repacked in the jgit style after any push to a head. Is it worth exploring ways of making GC efficient enough to make this feasible? Can bitmaps be leveraged to make repacking faster? I believe that at least reachability checking could potentially be improved with bitmaps? Are there potentially any ways to make better deltification reuse during repacking (not bitmap related), by somehow reversing or translating deltas to new objects that were just received, without actually recalculating them, but yet still getting most objects deltified against the newest objects (achieving the same packs as git GC would achieve today, but faster)? What other pieces need to be improved to make repacking faster? As for the single branch fetch case, could this somehow be improved by allocating one or more packfiles to this use case? The simplest single branch fetch use case is likely someone doing a git init followed by a single branch fetch. I think the android repo tool can be used in this way, so this may actually be a common use case? With a packfile dedicated to this branch, git should be able to just stream it out without any CPU. But I think git would need to know this packfile exists to be able to use it. It would be nice if bitmaps could help here, but I believe bitmaps can so far only be used for one packfile. I understand that making bitmaps span multiple packfiles would be very complicated, but maybe it would not be so hard to support bitmaps on multiple packfiles if each of these were self contained? By self contained I mean that all objects referenced by objects in the packfile were contained in that packfile. What other still unimplemented caching techniques could be used to improve clone/fetch use cases? - Shallow clones (dedicate a special packfile to this, what about another bitmap format that only maps objects in a single tree to help this)? 
- Small fetches (simple branch FF updates), I suspect these are fast enough, but if not, maybe caching some thin packs (that could result in zero CPU requests for many clients) would be useful? Maybe spread these out exponentially over time so that many will be available for recent updates and fewer for older updates? I know git normally throws away thin packs after receiving them and resolving them, but if it kept them around (maybe in a special directory), it seems that they could be useful for updating other clients with zero CPU? A thin pack cache might be something really easy to manage based on file timestamps, an admin may simply need to set a max cache size. But how can git know what thin packs it has, and what they would be useful for, name them with their start and ending shas? Sorry for the long winded rant. I suspect that some variation of all my suggestions have already been suggested, but maybe they will rekindle some older, now useful thoughts, or inspire some new ones. And maybe some of these are better to pursue then more parallelism? -Martin Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative ProjectOn Feb 16, 2015 8:47 AM, Jeff King p...@peff.net wrote: On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: Then the server streams the data to the client. It might do some light work transforming the data as it comes off the disk,
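Regarding the JGit-style split of reachable and unreachable objects into separate packs discussed above: later C git releases grew a close analogue in cruft packs, so on a new enough git (roughly 2.37 onward) a similar server-side layout can be approximated with:

    git repack --cruft --cruft-expiration=2.weeks.ago -d --write-bitmap-index
    # reachable objects go into the main (bitmapped) pack; unreachable ones are
    # parked in a separate cruft pack instead of being exploded into loose objects

Note that C git picks its own set of commits to cover with bitmaps rather than one per refs/heads, so the bitmap coverage still differs from what JGit produces.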
Re: Diagnosing stray/stale .keep files -- explore what is in a pack?
Perhaps the receiving process is dying hard and leaving stuff behind? Out of memory, out of disk space? -Martin On Tuesday, January 14, 2014 10:10:31 am Martin Langhoff wrote: On Tue, Jan 14, 2014 at 9:54 AM, Martin Langhoff martin.langh...@gmail.com wrote: Is there a handy way to list the blobs in a pack, so I can feed them to git-cat-file and see what's in there? I'm sure that'll help me narrow down on the issue. git show-index < /var/lib/ppg/reports.git/objects/pack/pack-22748bcca7f50a3a49aa4aed61444bf9c4ced685.idx | cut -d' ' -f2 | xargs -iHASH git --git-dir /var/lib/ppg/reports.git/ unpack-file HASH After a bit of looking at the output, clearly I have two clients, out of the many that connect here, that have the problem. I will be looking into those clients to see what's the problem. In my use case, clients push to their own head. Looking at refs/heads shows that there are stale .lock files there. Hmmm. This is on git 1.7.1 (RHEL and CentOS clients). cheers, m -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
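For what it's worth, git verify-pack can answer the "what is in this pack" question directly; its verbose output lists sha1, type, size, packed size and offset per object, so blobs can be filtered out in one step:

    git verify-pack -v /var/lib/ppg/reports.git/objects/pack/pack-22748bcca7f50a3a49aa4aed61444bf9c4ced685.idx | grep ' blob '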
Re: Ideas to speed up repacking
Martin Fick mf...@codeaurora.org writes: * Setup 1: Do a full repack. All loose and packed objects are added ... * Scenario 1: Start with Setup 1. Nothing has changed on the repo contents (no new object/packs, refs all the same), but repacking config options have changed (for example compression level has changed). On Tuesday, December 03, 2013 10:50:07 am Junio C Hamano wrote: Duy Nguyen pclo...@gmail.com writes: Reading Martin's mail again I wonder how we just grab all objects and skip history traversal. Who will decide object order in the new pack if we don't traverse history and collect path information. I vaguely recall raising a related topic for quick repack, assuming everything in existing packfiles are reachable, that only removes loose cruft several weeks ago. Once you decide that your quick repack do not care about ejecting objects from existing packs, like how I suspect Martin's outline will lead us to, we can repack the reachable loose ones on the recent surface of the history and then concatenate the contents of existing packs, excluding duplicates and possibly adjusting the delta base offsets for some entries, without traversing the bulk of the history. From this, it sounds like scenario 1 (a single pack being repacked) might then be doable (just trying to establish a really simple baseline)? Except that it would potentially not result in the same ordering without traversing history? Or, would the current pack ordering be preserved and thus be correct? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ideas to speed up repacking
I wanted to explore the idea of exploiting knowledge about previous repacks to help speed up future repacks. I had various ideas that seemed like they might be good places to start, but things quickly got away from me. Mainly I wanted to focus on reducing and even sometimes eliminating reachability calculations since that seems to be be the one major unsolved slow piece during repacking. My first line of thinking goes like this: After a full repack, reachability of the current refs is known. Exploit that knowledge for future repacks. There are some very simple scenarios where if we could figure out how to identify them reliably, I think we could simply avoid reachability calculations entirely, and yet end up with the same repacked files as if we had done the reachability calculations. Let me outline some to see if they make sense as starting place for further discussion. - * Setup 1: Do a full repack. All loose and packed objects are added to a single pack file (assumes git config repack options do not create multiple packs). * Scenario 1: Start with Setup 1. Nothing has changed on the repo contents (no new object/packs, refs all the same), but repacking config options have changed (for example compression level has changed). * Scenario 2: Starts with Setup 1. Add one new pack file that was pushed to the repo by adding a new ref to the repo (existing refs did not change). * Scenario 3: Starts with Setup 1. Add one new pack file that was pushed to the repo by updating an existing ref with a fast forward. * Scenario 4: Starts with Setup 1. Add some loose objects to the repo via a local fast forward ref update (I am assuming this is possible without adding any new unreferenced objects?) In all 4 scenarios, I believe we should be able to skip history traversal and simply grab all objects and repack them into a new file? - Of the 4 scenarios above, it seems like #3 and #4 are very common operations (#2 is perhaps even more common for Gerrit)? If these scenarios can be reliably identified somehow, then perhaps they could be used to reduce repacking time for these scenarios, and later used as building blocks to reduce repacking time for other related but slightly more complicated scenarios (with reduced history walking instead of none)? For example to identify scenario 1, what if we kept a copy of all refs and their shas used during a full repack along with the newly repacked file? A simplistic approach would store them in the same format as the packed-refs file as pack-sha.refs. During repacking, if none of the refs have changed and there are no new objects... Then, if none of the refs have changed and there are new objects, we can just throw the new objects away? ... I am going to stop here because this email is long enough and I wanted to get some feedback on the ideas first before offering more solutions. Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
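A hypothetical sketch of the ref snapshot proposed above; the pack-1234abcd.refs name is a placeholder for whatever pack the repack actually produced, and the paths assume a bare repository:

    # written alongside the newly created pack at full-repack time:
    git for-each-ref --format='%(objectname) %(refname)' > objects/pack/pack-1234abcd.refs

    # at the next repack, scenario 1 is detected by comparing snapshots;
    # if nothing differs, the reachability walk could be skipped:
    git for-each-ref --format='%(objectname) %(refname)' \
        | diff -q - objects/pack/pack-1234abcd.refs \
        && echo "refs unchanged since last full repack"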
Re: RFE: support change-id generation natively
On Monday, October 21, 2013 12:40:58 pm james.mo...@gitblit.com wrote: On Mon, Oct 21, 2013, at 02:29 PM, Thomas Koch wrote: As I understand, a UUID could also be used for the same purpose as the change-id. How is the change-id generated by the way? Would it be a good English name to call it enduring commit identifier? Here is the algorithm: https://git.eclipse.org/c/jgit/jgit.git/tree/org.eclipse.jgit/src/org/eclipse/jgit/util/ChangeIdUtil.java#n78 I think enduring commit id is a fair interpretation of its purpose. I don't speak for the Gerrit developers so I cannot say if they are interested in alternative id generation. I come to the list as a change-id user/consumer. As a Gerrit maintainer, I would suspect that we would welcome a way to track changes natively in git. Despite any compatibility issues with the current Gerrit implementation, I suspect we would be open to new forms if the git community has a better proposal than the current Change-Id. Especially if it does reduce the significant user pain point of installing a hook! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pack corruption post-mortem
On Wednesday, October 16, 2013 02:34:01 am Jeff King wrote: I was recently presented with a repository with a corrupted packfile, and was asked if the data was recoverable. This post-mortem describes the steps I took to investigate and fix the problem. I thought others might find the process interesting, and it might help somebody in the same situation. This is awesome Peff, thanks for the great writeup! I have nightmares about this sort of thing every now and then, and we even experience some corruption here and there that needs to be fixed (mainly missing objects when we toy with different git repack arguments). I cannot help but wonder how we can improve git further to either help diagnose or even fix some of these problems? More inline below... The first thing I did was pull the broken data out of the packfile. I needed to know how big the object was, which I found out with: $ git show-index <$idx | cut -d' ' -f1 | sort -n | grep -A1 51653873 51653873 51664736 Show-index gives us the list of objects and their offsets. We throw away everything but the offsets, and then sort them so that our interesting offset (which we got from the fsck output above) is followed immediately by the offset of the next object. Now we know that the object data is 10863 bytes long, and we can grab it with: dd if=$pack of=object bs=1 skip=51653873 count=10863 Is there a current plumbing command that should be enhanced to be able to do the 2 steps above directly for people debugging (maybe with some new switch)? If not, should we create one, git show --zlib, or git cat-file --zlib? Note that the object file isn't fit for feeding straight to zlib; it has the git packed object header, which is variable-length. We want to strip that off so we can start playing with the zlib data directly. You can either work your way through it manually (the format is described in Documentation/technical/pack-format.txt), or you can walk through it in a debugger. I did the latter, creating a valid pack like: # pack magic and version printf 'PACK\0\0\0\2' >tmp.pack # pack has one object printf '\0\0\0\1' >>tmp.pack # now add our object data cat object >>tmp.pack # and then append the pack trailer /path/to/git.git/test-sha1 -b <tmp.pack >trailer cat trailer >>tmp.pack and then running git index-pack tmp.pack in the debugger (stop at unpack_raw_entry). Doing this, I found that there were 3 bytes of header (and the header itself had a sane type and size). So I stripped those off with: dd if=object of=zlib bs=1 skip=3 This too feels like something we should be able to do with a plumbing command eventually? git zlib-extract So I took a different approach. Working under the guess that the corruption was limited to a single byte, I wrote a program to munge each byte individually, and try inflating the result. Since the object was only 10K compressed, that worked out to about 2.5M attempts, which took a few minutes. Awesome! Would this make a good new plumbing command, git zlib-fix? I fixed the packfile itself with: chmod +w $pack printf '\xc7' | dd of=$pack bs=1 seek=51659518 conv=notrunc chmod -w $pack The '\xc7' comes from the replacement byte our munge program found. The offset 51659518 is derived by taking the original object offset (51653873), adding the replacement offset found by munge (5642), and then adding back in the 3 bytes of git header we stripped. Another plumbing command needed? git pack-put --zlib? I am not saying my command suggestions are good, but maybe they will inspire the right answer?
-Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A naive proposal for preventing loose object explosions
On Friday, September 06, 2013 11:19:02 am Junio C Hamano wrote: mf...@codeaurora.org writes: Object lookups should likely not get any slower than if repack were not run, and the extra new pack might actually help find some objects quicker. In general, having an extra pack, only to keep objects that you know are available in other packs, will make _all_ object accesses, not just the ones that are contained in that extra pack, slower. My assumption was that if the new pack, with all the consolidated reachable objects in it, happens to be searched first, it would actually speed things up. And if it is searched last, then the objects weren't in the other packs so how could it have made it slower? It seems this would only slow down the missing object path? But it sounds like all the index files are mmaped up front? Then yes, I can see how it would slow things down. However, it is one only extra (hopefully now well optimized) pack. My base assumption was that even if it does slow things down, it would likely be unmeasurable and a price worth paying to avoid an extreme penalty. Instead of mmapping all the .idx files for all the available packfiles, we could build a table that records, for each packed object, from which packfile at what offset the data is available to optimize the access, but obviously building that in-core table will take time, so it may not be a good trade-off to do so at runtime (a precomputed super-.idx that we can mmap at runtime might be a good way forward if that turns out to be the case). Does this sound like it would work? Sorry, but it is unclear what problem you are trying to solve. I think you guessed it below, I am trying to prevent loose object explosions by keeping unreachable objects around in packs (instead of loose) until expiry. With the current way that pack-objects works, this is the best I could come up with (I said naive). :( Today the git-repack calls git pack-objects like this: git pack-objects --keep-true-parents --honor-pack-keep -- non-empty --all --reflog $args /dev/null $PACKTMP This has no mechanism to place unreachable objects in a pack. If git pack-objects supported an option which streamed them to a separate file (as you suggest below), that would likely be the main piece needed to avoid the heavy-handed approach I was suggesting. The problem is how to define the interface for this? How do we get the filename of the new unreachable packfile? Today the name of the new packfile is sent to stdout, would we just tack on another name? That seems like it would break some assumptions? Maybe it would be OK if it only did that when an --unreachable flag was added? Then git-repack could be enhanced to understand that flag and the extra filenames it outputs? Is it that you do not like that repack -A ejects unreferenced objects and makes it loose, which you may have many? Yes, several times a week we have people pushing the kernel to wrong projects, this leads to 4M loose objects. :( Without a solution for this regular problem, we are very scared to move our repos off of SSDs. This leads to hour plus long fetches. The loosen_unused_packed_objects() function used by repack -A calls the force_object_loose() function (actually, it is the sole caller of the function). If you tweak the latter to stream to a single new graveyard packfile and mark it as kept until expiry, would it solve the issue the same way but with much smaller impact? Yes. 
There already is an infrastructure available to open a single output packfile and send multiple objects to it in bulk-checkin.c, and I am wondering if you can take advantage of the framework. The existing interface to it assumes that the object data is coming from a file descriptor (the interface was built to support bulk-checkin of many objects in an empty repository), and it needs refactoring to allow stream_to_pack() to take different kind of data sources in the form of stateful callback function, though. That feels beyond what I could currently dedicate the time to do. Like I said, my solution is heavy handed but it felt simple enough for me to try. I can spare the extra disk space and I am not convinced the performance hit would be bad. I would, of course, be delighted if someone else were to do what you suggest, but I get that it's my itch... -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
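A rough sketch of the heavier-handed variant using only existing plumbing, i.e. packing the currently unreachable objects into their own kept pack rather than loosening them; the graveyard base name is made up for illustration:

    pack_dir=$(git rev-parse --git-dir)/objects/pack
    name=$(git fsck --full --unreachable --no-reflogs 2>/dev/null \
           | awk '/^unreachable/ {print $3}' \
           | git pack-objects "$pack_dir/graveyard")
    touch "$pack_dir/graveyard-$name.keep"   # protect the graveyard pack until expiry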
Re: [PATCH/RFC 0/7] Multiple simultaneously locked ref updates
On Thursday, August 29, 2013 08:11:48 am Brad King wrote: fatal: Unable to create 'lock': File exists. If no other git process is currently running, this probably means a git process crashed in this repository earlier. Make sure no other git process is running and remove the file manually to continue. I don't believe git currently tries to do any form of stale lock recovery since it is racy and unreliable (both single server or on a multi-server shared repo), -Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Thursday, August 15, 2013 01:46:02 am Stefan Beller wrote: On 08/15/2013 01:25 AM, Martin Fick wrote: On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? These small programms (at least mv and cp) are just convenient interfaces for system calls from within the shell. You can use these system calls to achieve a similar results compared to the commandline option. http://linux.die.net/man/2/rename http://linux.die.net/man/2/unlink Sure, but have you ever looked at the code to mv? It isn't pretty. ;( But in all that ugliness is decades worth of portability and corner cases. Also, mv is smart enough to copy when rename doesn't work (on some systems it doesn't). So C may sound more portable, but I am not sure it actually is. Now hopefully you won't need all of that, but I think that some of the design decision that went into git-repack did consider some of the more eccentric filesystems out there, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 10:49:58 am Antoine Pelisse wrote: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). I'm not sure anyone would run that command a thousand times a second, so I'm not sure it would make a real-life performance difference. I have been holding off a bit on expressing this opinion too because I don't want to squash someone's energy to improve things, and because I am not yet a git dev, but since it was brought up anyway... I can say that as a user, having git-repack as a shell script has been very valuable. For example: we have modified it for our internal use to no longer ever overwrite new packfiles with the same name as the current packfile. This modification was relatively easy to do and see in shell script. If this were C code I can't imagine having personally: 1) identified that there was an issue with the original git-repack (it temporarily makes objects unavailable) 2) made a simple custom fix to that policy. The script really is mostly a policy script, and with the discussions happening in other threads about how to improve git gc, I think it is helpful to potentially be able to quickly modify the policies in this script, it makes it easier to prototype things. Shell portability issues aside, this script is not a low level type of tool that I feel will benefit from being in C, I feel it will in fact be worse off in C, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:16:35 pm Stefan Beller wrote: On 08/14/2013 07:25 PM, Martin Fick wrote: I have been holding off a bit on expressing this opinion too because I don't want to squash someone's energy to improve things, and because I am not yet a git dev, but since it was brought up anyway... It's ok, if you knew a better topic to work on, I'd gladly switch over. (Given it would be a good beginners topic.) See below... I can say that as a user, having git-repack as a shell script has been very valuable. For example: we have modified it for our internal use to no longer ever overwrite new packfiles with the same name as the current packfile. This modification was relatively easy to do and see in shell script. If this were C code I can't imagine having personally: 1) identified that there was an issue with the original git-repack (it temporarily makes objects unavailable) 2) made a simple custom fix to that policy. Looking at the `git log -- git-repack.sh` the last commit is from April 2012 and the commit before is 2011, so I assumed it stable enough for porting over to C, as there is not much modification going on. I'd be glad to include these changes you're using into the rewrite or the shell script as of now. One suggestion would be to change the repack code to create pack filenames based on the sha1 of the contents of the pack file instead of on the sha1 of the objects in the packfile. Since the same objects can be stored in a packfile in many ways (different deltification/compression options), it is currently possible to have 2 different pack files with the same names. The contents are different, but the contained objects are the same. This causes the object availability bug that I describe above in git repack when a new packfile is generated with the same name as a current one. I am not 100% sure if the change in naming convention I propose wouldn't cause any problems? But if others agree it is a good idea, perhaps it is something a beginner could do? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
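A sketch of what content-based naming could look like: the pack trailer (the last 20 bytes of a v2 packfile) is already a checksum over the file's contents, so it is a natural candidate for the name. For what it's worth, later git releases did move in this direction, with pack-objects naming packs after the trailer hash rather than the object list.

    pack=.git/objects/pack/pack-<current-name>.pack   # placeholder path
    tail -c 20 "$pack" | xxd -p                       # content-derived name candidate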
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? Were there other binaries that were problematic that you were thinking of? From what I can tell it also uses test, mkdir, sed, chmod and naturally sh, that is 8 dependencies. If those can't be depended upon for existing, perhaps git should just consider bundling busy-box or some other limited shell utils, or yikes!, even its own reimplementation of these instead of implementing these independently inside other git programs? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 04:53:36 pm Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: One suggestion would be to change the repack code to create pack filenames based on the sha1 of the contents of the pack file instead of on the sha1 of the objects in the packfile. ... I am not 100% sure if the change in naming convention I propose wouldn't cause any problems? But if others agree it is a good idea, perhaps it is something a beginner could do? I would not be surprised if that change breaks some other people's reimplementation. I know we do not validate the pack name with the hash of the contents in the current code, but at the same time I do remember that was one of the planned things to be done while I and Linus were working on the original pack design, which was the last task we did together before he retired from the maintainership of this project. Perhaps a config option? One that becomes standard for git 2.0? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] repack: rewrite the shell script in C.
On Wednesday, August 14, 2013 05:25:42 pm Martin Fick wrote: On Wednesday, August 14, 2013 04:51:14 pm Matthieu Moy wrote: Antoine Pelisse apeli...@gmail.com writes: On Wed, Aug 14, 2013 at 6:27 PM, Stefan Beller stefanbel...@googlemail.com wrote: builtin/repack.c | 410 + contrib/examples/git-repack.sh | 194 +++ git-repack.sh | 194 --- I'm still not sure I understand the trade-off here. Most of what git-repack does is compute some file paths, (re)move those files and call git-pack-objects, and potentially git-prune-packed and git-update-server-info. Maybe I'm wrong, but I have the feeling that the correct tool for that is Shell, rather than C (and I think the code looks less intuitive in C for that matter). There's a real problem with git-repack being shell (I already mentionned it in the previous thread about the rewrite): it creates dependencies on a few external binaries, and a restricted server may not have them. I have this issue on a fusionforge server where Git repos are accessed in a chroot with very few commands available: everything went OK until the first project grew enough to require a git gc --auto, and then it stopped accepting pushes for that project. I tracked down the origin of the problem and the sysadmins disabled auto-gc, but that's not a very satisfactory solution. C is rather painfull to write, but as a sysadmin, drop the binary on your server and it just works. That's really important. AFAIK, git-repack is the only remaining shell part on the server, and it's rather small. I'd really love to see it disapear. I didn't review the proposed C version, but how was it planning on removing the dependencies on these binaries? Was it planning to reimplement mv, cp, find? Were there other binaries that were problematic that you were thinking of? From what I can tell it also uses test, mkdir, sed, chmod and naturally sh, that is 8 dependencies. If those can't be depended upon for existing, perhaps git should just consider bundling busy-box or some other limited shell utils, or yikes!, even its own reimplementation of these instead of implementing these independently inside other git programs? Sorry I didn't comprehend your email fully when I first read it. I guess that wouldn't really solve your problem unless someone had a way of bundling an sh program and whatever it calls inside a single executable? :( I can see why you would want what you want, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Thursday, August 08, 2013 10:56:38 am Junio C Hamano wrote: I thought the discussion was about making the local gc cheaper, and the Imagine we have a cheap way was to address it by assuming that the daily pack young objects into a single pack can be sped up if we did not have to traverse history. More permanent packs (the older ones in set of packs staggered by age Martin proposes) in the repository should go through the normal history traversal route. Assuming I understand what you are suggesting, would these young object likely still get deduped in an efficient way without doing history traversal (it sounds like they would)? In other words, if I understand correctly, it would save time by not pruning unreferenced objects, but it would still be deduping things and delta compressing also, so you would still likely get a great benefit from creating these young object packs? In other words, is there still a good chance that my 317 new pack files which included a 33M pack file will still get consolidated down to something near 8M? If so, then yeah this might be nice, especially if the history traversal is what would speed this up. Because today, my solution mostly saves IO and not time. I think it still saves time, I believe I have seen up to a 50% savings, but that is nothing compared to massive, several orders of magnitude IO savings. But if what you suggest could also give massive time (orders of magnitude) savings along with the IO improvements I am seeing, then suddenly repacking regularly would become very cheap even on large repos. The only time consuming piece would be pruning then? Could bitmaps eventually help out there? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
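A sketch of one flavor of the "no history traversal" idea discussed above: enumerate the loose objects straight from the object database and hand them to pack-objects, so nothing gets pruned, only consolidated and deltified (the loose-roll base name is made up):

    obj_dir=$(git rev-parse --git-dir)/objects
    ( cd "$obj_dir" && find ?? -type f 2>/dev/null | sed 's,/,,' ) \
        | git pack-objects "$obj_dir/pack/loose-roll"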
Re: [PATCH] git exproll: steps to tackle gc aggression
On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote: On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra artag...@gmail.com wrote: + Garbage collect using a pseudo logarithmic packfile maintenance + approach. This approach attempts to minimize packfile churn + by keeping several generations of varying sized packfiles around + and only consolidating packfiles (or loose objects) which are + either new packfiles, or packfiles close to the same size as + another packfile. I wonder if a simpler approach may be nearly efficient as this one: keep the largest pack out, repack the rest at fetch/push time so there are at most 2 packs at a time. Or we we could do the repack at 'gc --auto' time, but with lower pack threshold (about 10 or so). When the second pack is as big as, say half the size of the first, merge them into one at gc --auto time. This can be easily implemented in git-repack.sh. It would definitely be better than the current gc approach. However, I suspect it is still at least one to two orders of magnitude off from where it should be. To give you a real world example, on our server today when gitexproll ran on our kernel/msm repo, it consolidated 317 pack files into one almost 8M packfile (it compresses/dedupes shockingly well, one of those new packs was 33M). Our largest packfile in that repo is 1.5G! So let's now imagine that the second closest packfile is only 100M, it would keep getting consolidated with 8M worth of data every day (assuming the same conditions and no extra compression). That would take (750M-100M)/8M ~ 81 days to finally build up large enough to no longer consolidate the new packs with the second largest pack file daily. During those 80+ days, it will be on average writing 325M too much per day (when it should be writing just 8M). So I can see the appeal of a simple solution, unfortunately I think one layer would still suck though. And if you are going to add even just one extra layer, I suspect that you might as well go the full distance since you probably already need to implement the logic to do so? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
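For what it's worth, much later versions of C git grew a built-in form of this layered idea: geometric repacking keeps the on-disk packs in a roughly geometric progression by object count and only rolls up the small ones, leaving the big pack alone most of the time. Assuming a reasonably recent git:

    git repack -d --geometric=2 --write-midx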
Re: [PATCH] git exproll: steps to tackle gc aggression
On Monday, August 05, 2013 08:38:47 pm Ramkumar Ramachandra wrote: This is the rough explanation I wrote down after reading it: So, the problem is that my .git/objects/pack is polluted with little packs everytime I fetch (or push, if you're the server), and this is problematic from the perspective of a overtly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say that don't gc these big packs we just generated. In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic, to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes. From: Martin Fickmf...@codeaurora.org See: https://gerrit-review.googlesource.com/#/c/35215/ Thread: http://thread.gmane.org/gmane.comp.version-control.git/2 31555 (Martin's emails are missing from the archive) --- After analyzing today's data, I recognize that in some circumstances the size estimation after consolidation can be off by huge amounts. The script naively just adds the current sizes together. This gives a very rough estimate, of the new packfile size, but sometimes it can be off by over 2 orders of magnitude. :( While many new packfiles are tiny (several K only), it seems like the larger new packfiles have a terrible tendency to throw the estimate way off (I suspect they simply have many duplicate objects). But despite this poor estimate, the script still offers drastic improvements over plain git gc. So, it has me wondering if there isn't a more accurate way to estimate the new packfile without wasting a ton of time? If not, one approach which might be worth experimenting with is to just assume that new packfiles have size 0! Then just consolidate them with any other packfile which is ready for consolidation, or if none are ready, with the smallest packfile. I would not be surprised to see this work on average better than the current summation, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] gc and impatience
On Monday, August 05, 2013 11:34:24 am Ramkumar Ramachandra wrote: Martin Fick wrote: https://gerrit-review.googlesource.com/#/c/35215/ Very cool. Of what I understood: So, the problem is that my .git/objects/pack is polluted with little packs everytime I fetch (or push, if you're the server), and this is problematic from the perspective of a overtly (naively) aggressive gc that hammers out all fragmentation. So, on the first run, the little packfiles I have are all consolidated into big packfiles; you also write .keep files to say that don't gc these big packs we just generated. In subsequent runs, the little packfiles from the fetch are absorbed into a pack that is immune to gc. You're also using a size heuristic, to consolidate similarly sized packfiles. You also have a --ratio to tweak the ratio of sizes. Yes, pretty much. I suspect that a smarter implementation would do a less good job of packing to save time also. I think this can be done by further limiting much of the lookups to the packs being packed (or some limited set of the greater packfiles). I admit I don't really understand how much the packing does today, but I believe it still looks at the larger packs with keeps to potentially deltafy against them, or to determine which objects are duplicated and thus should not be put into the new smaller packfiles? I say this because the time savings of this script is not as significant as I would have expected it to be (but the IO is). I think that it is possible to design a git gc using this rolling approach that would actually greatly reduce the time spent packing also. However, I don't think that can easily be done in a script like mine which just wraps itself around git gc. I hope that someone more familiar with git gc than me might take this on some day. :) I've checked it in and started using it; so yeah: I'll chew on it for a few weeks. The script also does some nasty timestamp manipulations that I am not proud of. They had significant time impacts for us, and likely could have been achieved some other way. They shouldn't be relevant to the packing algo though. I hope it doesn't interfere with the evaluation of the approach. Thanks for taking an interest in it, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to still kill git fetch with too many refs
On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty wrote: git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done .git/packed-refs I believe this generates a packed-refs file that is not sorted lexicographically by refname, whereas all Git-generated packed-refs files are sorted. Yes, you are indeed correct. I was attempting to be too clever with my sharding I guess. Thanks. There are some optimizations in refs.c for adding references in order that might therefore be circumvented by your unsorted file. Please try sorting the file by refname and see if that helps. (You can do so by deleting one of the packed references; then git will sort the remainder while rewriting the file.) A simple git pack-refs seems to clean it up. The original test did complete in ~77mins last night. A rerun with a sorted file takes ~61mins, -Martin PS: This test was performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
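For anyone reproducing this, a sketch of generating the synthetic packed-refs already sorted by refname; the sharding below is illustrative, not the exact original test:

    git rev-list HEAD \
        | awk '{ printf "%s refs/c/%d/%d\n", $1, NR % 100, NR }' \
        | sort -k2 > .git/packed-refs
    # or simply let git itself rewrite the file in sorted order afterwards:
    git pack-refs --all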
Re: [PATCH 0/3] avoid quadratic behavior in fetch-pack
On Tuesday, July 02, 2013 12:11:49 am Jeff King wrote: Here are my patches to deal with Martin's pathological case, split out for easy reading. I took a few timings to show that the results of the 3rd patch are noticeable even with 50,000 unique refs (which is still a lot, but something that I could conceive of a busy repo accumulating over time). [1/3]: fetch-pack: avoid quadratic list insertion in mark_complete [2/3]: commit.c: make compare_commits_by_commit_date global [3/3]: fetch-pack: avoid quadratic behavior in rev_list_push And here's the diffstat to prove it is really not scary. :) commit.c | 2 +- commit.h | 2 ++ fetch-pack.c | 16 3 files changed, 11 insertions(+), 9 deletions(-) -Peff I applied these 3 patches and it indeed improves things dramatically. Thanks Peff, you are awesome!!! The synthetic test case (but sorted), now comes in at around 15s. The more important real world case (for us), fetching from my production server, which took around 12mins previously, now takes around 30s (I think the extra time is now spent on the Gerrit server, but I will investigate that a bit more)! That is very significant and should make many workflows much more efficient. +1 for merging this. :) Again, thanks, -Martin Note, I tested git-next 1.8.3.2.883.g27cfd27 to be sure that it is still problematic without this patch, it is (running for 10mins now without completing). -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How to still kill git fetch with too many refs
I have often reported problems with git fetch when there are many refs in a repo, and I have been pleasantly surprised how many problems I reported were so quickly fixed. :) With time, others have created various synthetic test cases to ensure that git can handle many many refs. A simple synthetic test case with 1M refs all pointing to the same sha1 seems to be easily handled by git these days. However, in our experience with our internal git repo, we still have performance issues related to having too many refs, in our kernel/msm instance we have around 400K. When I tried the simple synthetic test case and could not reproduce bad results, so I tried something just a little more complex and was able to get atrocious results!!! Basically, I generate a packed-refs files with many refs which each point to a different sha1. To get a list of valid but unique sha1s for the repo, I simply used rev-list. The result, a copy of linus' repo with a million unique valid refs and a git fetch of a single updated ref taking a very long time (55mins and it did not complete yet). Note, with 100K refs it completes in about 2m40s. It is likely not linear since 2m40s * 10 would be ~26m (but the difference could also just be how the data in the sha1s are ordered). Here is my small reproducible test case for this issue: git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cp -rp linux linux.1Mrefs-revlist cd linux echo Hello hello ; git add hello ; git ci -a -m 'hello' cd .. cd linux.1Mrefs-revlist git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done .git/packed-refs time git fetch file:///$(dirname $PWD)/linux refs/heads/master Any insights as to why it is so slow, and how we could possibly speed it up? Thanks, -Martin PS: My tests were performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fixing the git-repack replacement gap?
I have been trying to think of ways to fix git-repack so that it no longer momentarily makes the objects in a repo inaccessible to all processes when it replaces packfiles with the same objects in them as an already existing pack file. To be more explicit, I am talking about the way it moves the existing pack file (and index) to old-sha1.pack before moving the new packfile in place. During this moment in time the objects in that packfile are simply not available to anyone using the repo. This can be particularly problematic for busy servers. There likely are at lest 2 ways that the fundamental design of packfiles, their indexes, and their names have led to this issue. If the packfile and index were stored in a single file, they could have been replaced atomically and thus it would potentially avoid the issue of them being temporarily inaccessible (although admittedly that might not work anyway on some filesystems). Alternatively, if the pack file were named after the sha1 of the packed contents of the file instead of the sha1 of the objects in the sha1, then the replacement would never need to happen since it makes no sense to replace a file with another file with the exact same contents (unless, of course the first one is corrupt, but then you aren't likely making the repo temporarily worse, you are fixing a broken repo). I suspect these 2 ideas have been discussed before, but since they are fundamental changes to the way pack files work (and thus would not be backwards compatible), they are not likely to get implemented soon. This got me wondering if there wasn't an easier backwards compatible solution to avoid making the objects inaccessible? It seems like the problem could be avoided if we could simply change the name of the pack file when a replacement would be needed? Of course, if we just changed the name, then the name would not match the sha1 of the contained objects and would likely be considered bad by git? So, what if we could simply add a dummy object to the file to cause it to deserve a name change? So the idea would be, have git-repack detect the conflict in filenames and have it repack the new file with an additional dummy (unused) object in it, and then deliver the new file which no longer conflicts. Would this be possible? If so, what sort of other problems would this cause? It would likely cause an unreferenced object and likely cause it to want to get pruned by the next git-repack? Is that OK, maybe you want it to get pruned because then the pack file will get repacked once again without the dummy object later and avoid the temporarily inaccessible period for objects in the file? Hmm, but then maybe that could even be done in a single git- repack run (at the expense of extra disk space)? 1) Detect the conflict, 2) Save the replacement file 3) Create a new packfile with the dummy object 4) Put the new file with the dummy object into service 5) Remove the old conflicting file (no gap) 6) Place the new conflicting file in service (no dummy) 7) Remove the new file with dummy object (no gap again) done? Would it work? If so, is there an easy way to create the dummy file? Can any object simply be added at the end of a pack file after the fact (and then added to the index too)? Also, what should the dummy object be? Is there some sort of null object that would be tiny and that would never already be in the pack? 
Thanks for any thoughts, -Martin
Re: git hangs on pthread_join
On Thursday, May 23, 2013 07:01:43 am you wrote: I'm running a rather special configuration, basically i have a gerrit server pushing ... I have found git receive-packs that has been running for days/weeks without terminating ... Anyone that has any clues about what could be going wrong? -- Have you narrowed down whether this is a git client problem or a server problem (gerrit in your case)? Is this a repeatable issue? Try the same operation against a clone of the repo using just git. Check on the server side for .noz files in your repo (a jgit thing), -Martin
Re: inotify to minimize stat() calls
On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote: On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote: Ramkumar Ramachandra artag...@gmail.com writes: This is much better than Junio's suggestion to study possible implementations on all platforms and designing a generic daemon/communication channel. That's no weekend project. It appears that you misunderstood what I wrote. That was not "here is a design; I want it in my system. Go implement it." It was "If somebody wants to discuss it but does not know where to begin, doing a small experiment like this and reporting how well it worked here may be one way to do so.", nothing more. What if instead of communicating over a socket, the daemon dumped a file containing all of the lstat information after git wrote a file? By definition the daemon should know about file writes. But git doesn't; how will it know when the file is written? Will it use inotify, or poll (which kind of defeats the point)? -Martin
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King p...@peff.net wrote: On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote: Both makes sense to me. I also wonder if we would be helped by another repack mode that coalesces small packs into a single one with minimum overhead, and run that often from gc --auto, so that we do not end up having to have 50 packfiles. When we have 2 or more small and young packs, we could:

  - iterate over idx files for these packs to enumerate the objects to be packed, replacing the read_object_list_from_stdin() step;
  - always choose to copy the data we have in these existing packs, instead of doing a full prepare_pack(); and
  - use the order the objects appear in the original packs, bypassing compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be concatenating packs without trying to do delta compression between the objects which are ending up in the same pack. So it would save us from having to do (up to) 50 binary searches to find an object in a pack, but would not actually save us much space. I would be interested to see the timing on how quick it is compared to a real repack, as the I/O that happens during a repack is non-trivial (although if you are leaving aside the big main pack, then it is probably not bad). But how do these somewhat mediocre concatenated packs get turned into real packs? Pack-objects does not consider deltas between objects in the same pack. And when would you decide to make a real pack? How do you know you have 50 young and small packs, and not 50 mediocre coalesced packs? If we are reconsidering repacking strategies, I would like to propose an approach that might be a more general improvement to repacking which would help in more situations. You could roll together any packs which are close in size, say within 50% of each other. With this strategy you will end up with files whose sizes are spread out exponentially. I implemented this strategy on top of the current gc script using keep files, and it works fairly well: https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh This saves some time, but mostly it saves I/O when repacking regularly. I suspect that if this strategy were used in core git, further optimizations could be made to also reduce the repack time, but I don't know enough about repacking to know. We run it nightly on our servers, both write and read-only mirrors. We use a ratio of 5 currently to drastically reduce large repack file rollovers, -Martin
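[Editorial illustration: a rough sketch of the "roll up small packs, leave big ones alone" idea using .keep files, which "git repack -A -d" honors. This is not the real git-exproll.sh; the ratio value, the exact size criterion, and the paths are illustrative assumptions.]

  RATIO=5
  PACKDIR=.git/objects/pack
  total=0
  for p in $(ls -Sr "$PACKDIR"/pack-*.pack)     # smallest pack first
  do
      size=$(stat -c %s "$p")
      if [ "$total" -gt 0 ] && [ "$size" -gt $((total * RATIO)) ]
      then
          touch "${p%.pack}.keep"   # much bigger than everything smaller:
                                    # not worth rolling it over, keep it
      else
          rm -f "${p%.pack}.keep"   # similar in size to its neighbors:
                                    # let repack consolidate it
      fi
      total=$((total + size))
  done
  git repack -A -d                  # only repacks packs without a .keep file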
Re: [PATCH] refs: do not use cached refs in repack_without_ref
...[Sorry about the previous HTML reposts] Jeff King p...@peff.net wrote: On Mon, Dec 31, 2012 at 03:30:53AM -0700, Martin Fick wrote: The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. Deleting a directory is not atomic, as you first have to remove the contents, putting it into a potentially inconsistent state. I'll assume you deal with that later... Right, these simple single file transactions have at best 1 important file/directory in them, once deleted the transaction is aborted (can no longer complete). However to support multi file transactions, a better approach is likely to rename the uuid directory to have a .delete extension before deleting stuff in it. One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. Has this been a problem for you in practice? No, but as you say, we don't currently hold locks for very long. I anticipate it being a problem in a clustered environment when transactions start spanning repos from java processes, with insane amounts of RAM, which can sometimes have unpredictable indeterminately long java GC cycles at inopportune times.. It would seem short sighted if Gerrit at least did not assume this will be a problem. But, deletes today in git are not so short and Michael's fixes may make things worse? But, as you point out, that should perhaps be solved a different way. Avoiding this is one of the reasons that git does not take out long locks; instead, it takes the lock only at the moment it is ready to write, and aborts if it has been updated since the longer-term operation began. This has its own problems (you might do a lot of work only to have your operation aborted), but I am not sure that your proposal improves on that. It does not, it might increase this. Git typically holds ref locks for a few syscalls. If you are conservative about leaving potentially stale locks in place (e.g., give them a few minutes to complete before assuming they are now bogus), you will not run into that problem. In a distributed environment even a few minutes might not be enough, processes could be on a remote server with a temporarily split network, that could cause delays longer than your typical local expectations. But there is also the other piece of this problem, how do you detect stale locks? How long will it be stale until a user figures it out and reports it? How many other users will simply have failed pushes and wonder why without reporting them? In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. I think we've had problems with cross-directory renames on some filesystems, but I don't recall the details. I know that Coda does not like cross-directory links, but cross-directory renames are OK (and in fact we fall back to the latter when the former does not work). Ah, here we go: 5723fe7 (Avoid cross-directory renames and linking on object creation, 2008-06-14). Looks like NFS is the culprit. 
If the renames fail we can fall back to regular file locking, the hard part to detect and deal with would be if the renames don't fail but become copies/mkdirs. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. Hmm. So what happens to the delete file when the ref.lock directory is being deleted? Presumably deleting the ref.lock directory means doing it recursively (which is non-atomic). But then why are we keeping the delete file at all, if we're just about to remove it? We are not trying to keep it, but we need to ensure that our transaction has not yet been aborted: the rename does this. If we just deleted the file, we may sleep and another transaction may abort our transaction and complete before we wake up and actually delete the file. But by using
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Friday, January 04, 2013 10:52:43 am Pyeron, Jason J CTR (US) wrote: From: Martin Fick Sent: Thursday, January 03, 2013 6:53 PM Any thoughts on this idea? Is it flawed? I am trying to write it up in a more formal generalized manner and was hoping to get at least one "it seems sane" before I do. If you are assuming that atomic renames, etc. are available, then you should identify a test case and a degrade operation path when it is not available. Thanks, sounds reasonable. Were you thinking of a runtime test case that would be run before every transaction? I was anticipating a per-repo config option called something like core.locks = recoverable that would be needed to turn them on? I was thinking that this was something that server sites could test in advance on their repos and then enable it for them. Maybe a git-lock tool with a --test-recoverable option? -Martin On Monday, December 31, 2012 03:30:53 am Martin Fick wrote: On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to set up a transaction and either commit or abort it. A transaction can be set up by renaming an appropriately prepared directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long-sleeping processes from inadvertently performing actions which could be out of date when they finally wake up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file, resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory, resulting in a delete of the ref. This scheme does not affect the way refs are read today. To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is placed either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is set up, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it).
In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old
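[Editorial illustration: a minimal shell sketch of the update path in this scheme, assuming only atomic rename. The variable names ($GIT_DIR, $ref, $old, $new), the use of uuidgen, and GNU mv's -T flag are illustrative assumptions, not part of the proposal; error handling is omitted.]

  uuid=$(uuidgen)
  refdir=$(dirname "$GIT_DIR/$ref")

  # Prepare: stage the new value in a uuid-named tmp directory.
  mkdir -p "$refdir/.lock_$uuid"
  echo "$new" > "$refdir/.lock_$uuid/$uuid"

  # Begin: one atomic rename starts the transaction, or fails if another
  # transaction (or an old-style lock) already holds ref.lock.
  mv -T "$refdir/.lock_$uuid" "$GIT_DIR/$ref.lock" || exit 1

  # Commit: verify the old value, then atomically rename the uuid-named
  # file onto the ref.  Because the file is named after the uuid, a
  # long-sleeping process cannot commit a transaction that was aborted.
  test "$(cat "$GIT_DIR/$ref")" = "$old" &&
  mv "$GIT_DIR/$ref.lock/$uuid" "$GIT_DIR/$ref"

  # Cleanup (any actor may do this; before the commit rename it is an abort).
  rm -rf "$GIT_DIR/$ref.lock"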
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Any thoughts on this idea? Is it flawed? I am trying to write it up in a more formal generalized manner and was hoping to get at least one it seems sane before I do. Thanks, -Martin On Monday, December 31, 2012 03:30:53 am Martin Fick wrote: On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory resulting in a delete of the ref. This scheme does not affect the way refs are read today, To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is places either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is setup, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it). In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. 
Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old lock file should continue to lock out these newer transactions. Due to this backwards compatibility, I believe that this could be incrementally employed today without affecting very much. It could be deployed in place of any updates which only hold ref.locks to update the loose ref. So for example I think it could replace step 4a below from Michael Haggerty's description of today's loose ref pruning during ref packing: * Pack references: ... 4. prune_refs(): for each ref in the ref_to_prune list, call prune_ref(): a. Lock the reference using lock_ref_sha1(), verifying that the recorded SHA1 is still valid. If it is, unlink the loose reference file then free the lock; otherwise leave the loose reference file untouched. I think it would also therefore be able to replace the loose ref locking in Michael's new ref-packing scheme as well as the locking in Michael's new ref
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Thursday, December 27, 2012 04:11:51 pm Martin Fick wrote: It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. ... [a previous not so great attempt to fix this] ... I may have finally figured out a working loose ref update mechanism which I think can avoid stale locks. Unfortunately it requires atomic directory renames and universally unique identifiers (uuids). These may be no-go criteria? But I figure it is worth at least exploring this idea because of the potential benefits? The general approach is to setup a transaction and either commit or abort it. A transaction can be setup by renaming an appropriately setup directory to the ref.lock name. If the rename succeeds, the transaction is begun. Any actor can abort the transaction (up until it is committed) by simply deleting the ref.lock directory, so it is not at risk of going stale. However, once the actor who sets up the transaction commits it, deleting the ref.lock directory simply aids in cleaning it up for the next transaction (instead of aborting it). One important piece of the transaction is the use of uuids. The uuids provide a mechanism to tie the atomic commit pieces to the transactions and thus to prevent long sleeping process from inadvertently performing actions which could be out of date when they wake finally up. In each case, the atomic commit piece is the renaming of a file. For the create and update pieces, a file is renamed from the ref.lock dir to the ref file resulting in an update to the sha for the ref. However, in the delete case, the ref file is instead renamed to end up in the ref.lock directory resulting in a delete of the ref. This scheme does not affect the way refs are read today, To prepare for a transaction, an actor first generates a uuid (an exercise I will delay for now). Next, a tmp directory named after the uuid is generated in the parent directory for the ref to be updated, perhaps something like: .lock_uuid. In this directory is places either a file or a directory named after the uuid, something like: .lock_uuid/,uuid. In the case of a create or an update, the new sha is written to this file. In the case of a delete, it is a directory. Once the tmp directory is setup, the initiating actor attempts to start the transaction by renaming the tmp directory to ref.lock. If the rename fails, the update fails. If the rename succeeds, the actor can then attempt to commit the transaction (before another actor aborts it). In the case of a create, the actor verifies that ref does not currently exist, and then renames the now named ref.lock/uuid file to ref. On success, the ref was created. In the case of an update, the actor verifies that ref currently contains the old sha, and then also renames the now named ref.lock/uuid file to ref. On success, the ref was updated. In the case of a delete, the actor may verify that ref currently contains the sha to prune if it needs to, and then renames the ref file to ref.lock/uuid/delete. On success, the ref was deleted. Whether successful or not, the actor may now simply delete the ref.lock directory, clearing the way for a new transaction. Any other actor may delete this directory at any time also, likely either on conflict (if they are attempting to initiate a transaction), or after a grace period just to cleanup the FS. Any actor may also safely cleanup the tmp directories, preferably also after a grace period. 
One neat part about this scheme is that I believe it would be backwards compatible with the current locking mechanism since the transaction directory will simply appear to be a lock to older clients. And the old lock file should continue to lock out these newer transactions. Due to this backwards compatibility, I believe that this could be incrementally employed today without affecting very much. It could be deployed in place of any updates which only hold ref.locks to update the loose ref. So for example I think it could replace step 4a below from Michael Haggerty's description of today's loose ref pruning during ref packing: * Pack references: ... 4. prune_refs(): for each ref in the ref_to_prune list, call prune_ref(): a. Lock the reference using lock_ref_sha1(), verifying that the recorded SHA1 is still valid. If it is, unlink the loose reference file then free the lock; otherwise leave the loose reference file untouched. I think it would also therefore be able to replace the loose ref locking in Michael's new ref-packing scheme as well as the locking in Michael's new ref deletion scheme (again steps 4): * Delete reference foo: ... 4. Delete loose ref for foo: a. Acquire the lock $GIT_DIR/refs/heads/foo.lock b. Unlink $GIT_DIR/refs/heads/foo if it is unchanged. If it is changed, leave it untouched. If it is deleted, that is OK too. c
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Saturday, December 29, 2012 03:18:49 pm Martin Fick wrote: Jeff King p...@peff.net wrote: On Thu, Dec 27, 2012 at 04:11:51PM -0700, Martin Fick wrote: My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a ref's value and existence: Hmm. So basically you are relying on atomic rename() to move the value around within a directory, rather than using write to move it around within a file. Atomic rename is usually something we have on local filesystems (and I think we rely on it elsewhere). Though I would not be surprised if it is not atomic on all networked filesystems (though it is on NFS, at least). Yes. I assume this is OK because doesn't git already rely on atomic renames? For example to rename the new packed-refs file to unlock it? ... 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. Hmm. mkdir is atomic. So wouldn't it be sufficient to just mkdir and create the correct sha1 file? But then a process could mkdir and die, leaving a stale empty dir with no reliable recovery mechanism. Unfortunately, I think I see another flaw though! :( I should have known that I cannot separate an important check from its state transitioning action. The following could happen:

  A does mkdir
  A creates null file
  A checks dir - no other files
  B checks dir - no other files
  A renames null file to abcd
  C creates second null file
  B renames second null file to defg

One way to fix this is to rely on directory renames, but I believe this is something git does not want to require of every FS? If we did, we could change #3 to be: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value. (No more scan) Then, with reliable directory renames, a process could do what you suggested to a temporary directory, mkdir + create null file, then rename the temporary dir to refname. This would prevent duplicate null files. With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename. This is your approach with reliable recovery. The whole null file can go away if we use directory renames. Make #3: 3) To create a ref, create a temporary directory containing a file named after the sha1 of the ref to be created and rename the directory to the name of the ref to create. If the rename fails, the create fails. If the rename succeeds, the create succeeds.
With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename, -Martin
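[Editorial illustration: a minimal sketch of that revised create (#3), assuming atomic directory renames. $GIT_DIR, $ref, and $new are placeholders, and GNU mv's -T flag is an assumption used to make the rename fail if the ref already exists.]

  # The ref is a directory; the single file's *name* is the ref's value.
  tmp=$(mktemp -d "$GIT_DIR/.tmp_ref_XXXXXX")
  touch "$tmp/$new"                  # the value is carried by the file name
  mv -T "$tmp" "$GIT_DIR/$ref" || {  # one atomic rename creates the ref;
      rm -rf "$tmp"                  # it fails if the ref already exists,
      exit 1                         # so just clean up our own tmp directory
  }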
Re: Lockless Refs?
Jeff King p...@peff.net wrote: On Fri, Dec 28, 2012 at 09:15:52AM -0800, Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: Hmm, actually I believe that with a small modification to the semantics described here it would be possible to make multi repo/branch commits work. Shawn talked about adding multi repo/branch transaction semantics to jgit; this might be something that git wants to support also at some point? Shawn may have talked about it and you may have listened to it, but others wouldn't have any idea what kind of multi repo/branch transaction you are talking about. Is it about "I want to push this ref to that repo and push this other ref to that other repo"; in what situation will it be used/useful, what are the failure modes, what are failure tolerances by the expected use cases, ...? Care to explain? I cannot speak for Martin, but I am assuming the point is to atomically update 2 (or more) refs on the same repo. That is, if I have a branch refs/heads/foo and a ref pointing to meta-information (say, notes about commits in foo, in refs/notes/meta/foo), I would want to git push them, and only update them if _both_ will succeed, and otherwise fail and update nothing. My use case was cross repo/branch dependencies in Gerrit (which do not yet exist). Users want to be able to define several changes (destined for different project/branches) which can only be merged together. If one change cannot be merged, the others should fail too. The solutions we can think of generally need to hold ref locks while acquiring more ref locks; this drastically increases the opportunities for stale locks over the simple lock, check, update, unlock mode for which git locks are currently used. I was perhaps making too big of a leap to assume that there would be other non-Gerrit use cases for this? I assumed that other git projects which are spread across several git repos would need this? But maybe this simply wouldn't be practical with other git server solutions? -Martin Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Jeff King p...@peff.net wrote: On Fri, Dec 28, 2012 at 07:50:14AM -0700, Martin Fick wrote: Hmm, actually I believe that with a small modification to the semantics described here it would be possible to make multi repo/branch commits work. Simply allow the ref filename to be locked by a transaction by appending the transaction ID to the filename. So if transaction 123 wants to lock master which points currently to abcde, then it will move master/abcde to master/abcde_123. If transaction 123 is designed so that any process can commit/complete/abort it without requiring any locks which can go stale, then this ref lock will never go stale either (easy as long as it writes all its proposed updates somewhere upfront and has atomic semantics for starting, committing and aborting). On commit, the ref lock gets updated to its new value: master/newsha and on abort it gets unlocked: master/abcde. Hmm. I thought our goal was to avoid locks? Isn't this just locking by another name? It is a lock, but it is a lock with an owner: the transaction. If the transaction has reliable recovery semantics, then the lock will be recoverable also. This is possible if we have lock ownership (the transaction) which does not exist today for the ref locks. With good lock ownership we gain the ability to reliably delete locks for a specific owner without the risk of deleting the lock when held by another owner (putting the owner in the filename is good, while putting the owner in the filecontents is not). Lastly, for reliable recovery of stale locks we need the ability to determine when an owner has abandoned a lock. I believe that the transaction semantics laid out below give this. I guess your point is to have no locks in the normal case, and have locked transactions as an optional add-on? Basically. If we design the transaction into the git semantics we could ensure that it is recoverable and we should not need to expose these reflocks outside of the transaction APIs. To illustrate a simple transaction approach (borrowing some of Shawn's ideas), we could designate a directory to hold transaction files *1. To prepare a transaction: write a list of repo:ref:oldvalue:newvalue to a file named id.new (in a stable sorted order based on repo:ref to prevent deadlocks). This is not a state change and thus this file could be deleted by any process at anytime (preferably after a long grace period). If file renames are atomic on the filesystem holding the transaction files then 1, 2, 3 below will be atomic state changes. It does not matter who performs state transitions 2 or 3. It does not matter who implements the work following any of the 3 transitions, many processes could attempt the work in parallel (so could a human). 1) To start the transaction, rename the id.new file to id. If the rename fails, start over if desired/still possible. On success, ref locks for each entry should be acquired in listed order (to prevent deadlocks), using transaction id and oldvalue. It is never legal to unlock a ref in this state (because a block could cause the unlock to be delayed until the commit phase). However, it is legal for any process to transition to abort at any time from this state, perhaps because of a failure to acquire a lock (held by another transaction), and definitely if a ref has changed (is no longer oldvalue). 2) To abort the transaction, rename the id file to id.abort. This should only ever fail if commit was achieved first. Once in this state, any process may/should unlock any ref locks belonging to this transaction id. 
Once all refs are unlocked, id.abort may be deleted (it could be deleted earlier, but then cleanup will take longer). 3) To commit the transaction, rename the file to id.commit. This should only ever fail if abort was achieved first. This transition should never be done until every listed ref is locked by the current transaction id. Once in this phase, all refs may/should be moved to their new values and unlocked by any process. Once all refs are unlocked, id.commit may be deleted. Since any process attempting any of the work in these transactions could block at any time for an indefinite amount of time, these processes may wake after the transaction is aborted or committed and the transaction files are cleaned up. I believe that in these cases the only action which could succeed for these waking processes is the ref locking action. All such abandoned ref locks may/should be unlocked by any process. This last rule means that no transaction ids should ever be reused, -Martin *1 We may want to adapt the simple model illustrated above to use git mechanisms such as refs to hold transaction info instead of files in a directory, and git submodule files to hold the list of refs to update. Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
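[Editorial illustration: a hedged shell sketch of the three state-changing renames described above. The directory layout, file names, and the ref-lock naming (appending the transaction id to the ref's value file, as in the earlier master/abcde -> master/abcde_123 example) are illustrative assumptions, not an existing git feature.]

  txdir=$GIT_DIR/transactions     # assumed location for transaction files
  id=$unique_id                   # never-reused transaction id

  # Prepare: list updates as repo:ref:oldvalue:newvalue, sorted to avoid
  # deadlocks.  This is not yet a state change; anyone may delete the file.
  sort updates.txt > "$txdir/$id.new"

  # 1) Start: one atomic rename makes the transaction live.
  mv "$txdir/$id.new" "$txdir/$id" || exit 1

  # Any process may now take ref locks on behalf of transaction $id, in
  # list order, by tagging each ref's value file with the id, e.g.
  #   mv "$refdir/$old" "$refdir/${old}_$id"

  # 2) Abort, or 3) commit: whichever rename of "$txdir/$id" happens first
  # wins, because the source file disappears; the loser's mv simply fails.
  case "$action" in
  abort)  mv "$txdir/$id" "$txdir/$id.abort"  ;;  # fails if already committed
  commit) mv "$txdir/$id" "$txdir/$id.commit" ;;  # only after all locks held;
  esac                                            # fails if already aborted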
Re: Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
Jeff King p...@peff.net wrote: On Thu, Dec 27, 2012 at 04:11:51PM -0700, Martin Fick wrote: My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 one of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a refs value and existence: Hmm. So basically you are relying on atomic rename() to move the value around within a directory, rather than using write to move it around within a file. Atomic rename is usually something we have on local filesystems (and I think we rely on it elsewhere). Though I would not be surprised if it is not atomic on all networked filesystems (though it is on NFS, at least). Yes. I assume this is OK because doesn't git already rely on atomic renames? For example to rename the new packed-refs file to unlock it? ... 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. Hmm. mkdir is atomic. So wouldn't it be sufficient to just mkdir and create the correct sha1 file? But then a process could mkdir and die leaving a stale empty dir with no reliable recovery mechanism. Unfortunately, I think I see another flaw though! :( I should have known that I cannot separate an important check from its state transitioning action. The following could happen: A does mkdir A creates null file A checks dir - no other files B checks dir - no other files A renames null file to abcd C creates second null file B renames second null file to defg One way to fix this is to rely on directory renames, but I believe this is something git does not want to require of every FS? If we did, we could Change #3 to be: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value. (No more scan) Then, with reliable directory renames, a process could do what you suggested to a temporary directory, mkdir + create null file, then rename the temporary dir to refname. This would prevent duplicate null files. With a grace period, the temporary dirs could be cleaned up in case a process dies before the rename. This is your approach with reliable recovery. I don't know how this new scheme could be made to work with the current scheme, it seems like perhaps new git releases could be made to understand both the old and the new, and a config option could be used to tell it which method to write new refs with. Since in this new scheme ref directory names would conflict with old ref filenames, this would likely prevent both schemes from erroneously being used simultaneously (so they shouldn't corrupt each other), except for the fact that refs can be nested in directories which confuses things a bit. 
I am not sure what a good solution to this is? I think you would need to bump core.repositoryformatversion, and just never let old versions of git access the repository directly. Not the end of the world, but it certainly increases deployment effort. If we were going to do that, it would probably make sense to think about solving the D/F conflict issues at the same time (i.e., start calling refs/heads/foo in the filesystem refs.d/heads.d/foo.ref so that it cannot conflict with refs.d/heads.d/foo.d/bar.ref). Wouldn't you want to use a non legal ref character instead of dot? And without locks, we free up more of the ref namespace too I think? (Refs could end in .lock) -Martin Employee of Qualcomm Innovation Center,Inc. which is a member of Code Aurora Forum -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Lockless Refs?
On Friday, December 28, 2012 09:58:36 am Junio C Hamano wrote: Martin Fick mf...@codeaurora.org writes: 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory... While you are scanning this directory to make sure it is empty, The objective is not to scan for an empty dir, but to scan for the existence of only the null file. I am contemplating to create the same ref with a different value. You finished checking but haven't created the null. The scan needs to happen after creating the null, not before, so I don't believe the rest of the scenario below is possible then? I have also scanned, created the null and renamed it to my value. Now you try to create the null, succeed, and then rename. We won't know which of the two non-null values are valid, but worse yet, I think one of them should have failed in the first place. Sounds like we would need some form of locking around here. Is your goal no locks, or less locks? (answered below) I don't know how this new scheme could be made to work with the current scheme,... It is much more important to know if/why yours is better than the current scheme in the first place. The goal is: no locks which do not have a clearly defined reliable recovery procedure. Stale locks without a reliable recovery procedure will lead to stolen locks. At this point it is only a matter of luck whether this leads to data loss or not. So I do hope to convince people first that the current scheme is bad, not that my scheme is better! My scheme was proposed to get people thinking that we may not have to use locks to get reliable updates. Without an analysis on how the new scheme interacts with the packed refs and gives better behaviour, that is kinda difficult. Fair enough. I will attempt this if the basic idea seems at least sane? I do hope that eventually the packed-refs piece and its locking will be reconsidered also; as Michael pointed out it has issues already. So, I am hoping to get people thinking more about lockless approaches to all the pieces. I think I have some solutions to some of the other pieces also, but I don't want to overwhelm the discussion all at once (especially if my first piece is shown to be flawed, or if no one has any interest in eliminating the current locks?) I think transition plans can wait until that is done. If it is not even marginally better, we do not have to worry about transitioning at all. If it is only marginally better, the transition has to be designed to be no impact to the existing repositories. If it is vastly better, we might be able to afford a flag day. OK, makes sense, I jumped the gun a bit, -Martin -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Lockless Refs? (Was [PATCH] refs: do not use cached refs in repack_without_ref)
On Wednesday, December 26, 2012 01:24:39 am Michael Haggerty wrote: ... lots of discussion about ref locking... It concerns me that git uses any locking at all, even for refs since it has the potential to leave around stale locks. For a single user repo this is not a big deal, the lock can always be cleaned up manually (and it is a rare occurrence). However, in a multi user server environment, possibly even from multiple hosts over a shared filesystem such as NFS, stale locks could lead to serious downtime and risky recovery (since it is currently hard to figure out if a lock really is stale). Even though stale locks are probably rare even today in the larger shared repo case, as git scales to even larger shared repositories, this will eventually become more of a problem *1. Naturally, this has me thinking that git should possibly consider moving towards a lockless design for refs in the long term. I realize this is hard and that git needs to support many different filesystems with different semantics. I had an idea I think may be close to a functional lockless design for loose refs (one piece at a time) that I thought I should propose, just to get the ball rolling, even if it is just going to be found to be flawed (I realize that history suggests that such schemes usually are). I hope that it does not make use of any semantics which are not currently expected from git of filesystems. I think it relies only on the ability to rename a file atomically, and the ability to scan the contents of a directory reliably to detect the ordered existence of files. My idea is based on using filenames to store sha1s instead of file contents. To do this, the sha1 one of a ref would be stored in a file in a directory named after the loose ref. I believe this would then make it possible to have lockless atomic ref updates by renaming the file. To more fully illustrate the idea, imagine that any file (except for the null file) in the directory will represent the value of the ref with its name, then the following transitions can represent atomic state changes to a refs value and existence: 1) To update the value from a known value to a new value atomically, simply rename the file to the new value. This operation should only succeed if the file exists and is still named old value before the rename. This should even be faster than today's approach, especially on remote filesystems since it would require only 1 round trip in the success case instead of 3! 2) To delete the ref, simply delete the filename representing the current value of the ref. This ensures that you are deleting the ref from a specific value. I am not sure if git needs to be able to delete refs without knowing their values? If so, this would require reading the value and looping until the delete succeeds, this may be a bit slow for a constantly updated ref, but likely a rare situation (and not likely worse than trying to acquire the ref-lock today). Overall, this again would likely be faster than today's approach. 3) To create a ref, it must be renamed from the null file (sha ...) to the new value just as if it were being updated from any other value, but there is one extra condition: before renaming the null file, a full directory scan must be done to ensure that the null file is the only file in the directory (this condition exists because creating the directory and null file cannot be atomic unless the filesystem supports atomic directory renames, an expectation git does not currently make). 
I am not sure how this compares to today's approach, but including the setup costs (described below), I suspect it is slower. While this outlines the state changes, some additional operations may be needed to setup some starting conditions and to clean things up. But these operations could/should be performed by any process/thread and would not cause any state changes to the ref existence or value. For example, when creating a ref, the ref directory would need to be created and the null file needs to be created. Whenever a null file is detected in the directory at the same time as another file, it should be deleted. Whenever the directory is empty, it may be deleted (perhaps after a grace period to reduce retries during ref creation unless the process just deleted the ref). I don't know how this new scheme could be made to work with the current scheme, it seems like perhaps new git releases could be made to understand both the old and the new, and a config option could be used to tell it which method to write new refs with. Since in this new scheme ref directory names would conflict with old ref filenames, this would likely prevent both schemes from erroneously being used simultaneously (so they shouldn't corrupt each other), except for the fact that refs can be nested in directories which confuses things a bit. I am not sure what a good solution to this is?
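[Editorial illustration: a minimal, hedged sketch of transitions 1 and 2 above, assuming only atomic rename; $GIT_DIR, $ref, $old, and $new are placeholder names.]

  refdir=$GIT_DIR/$ref       # the ref is a directory; its current value is
                             # the name of the single file inside it

  # 1) Update from a known value: one atomic rename, which fails if the
  #    ref no longer has the value $old.
  mv "$refdir/$old" "$refdir/$new" || echo "ref was updated by someone else" >&2

  # 2) Delete from a known value: removing that filename only succeeds if
  #    the ref still has the value $old.
  rm "$refdir/$old" || echo "ref changed or was already deleted" >&2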
git-repack.sh not server/multiuse safe?
I have been reading the git-repack.sh script and I have found a piece that I am concerned with. It looks like after repacking there is a window when packfiles could be temporarily inaccessible, making the objects within temporarily inaccessible. If my evaluation is true, it would seem like git repacking is not server safe? In particular, I am talking about this loop:

  # Ok we have prepared all new packfiles.

  # First see if there are packs of the same name and if so
  # if we can move them out of the way (this can happen if we
  # repacked immediately after packing fully).
  rollback=
  failed=
  for name in $names
  do
          for sfx in pack idx
          do
                  file=pack-$name.$sfx
                  test -f "$PACKDIR/$file" || continue
                  rm -f "$PACKDIR/old-$file" &&
                  mv "$PACKDIR/$file" "$PACKDIR/old-$file" || {
                          failed=t
                          break
                  }
                  rollback="$rollback $file"
          done
          test -z "$failed" || break
  done

It would seem that one way to avoid this (at least on systems supporting hardlinks) would be to instead link the original packfile to old-file first, then move the new packfile into place without ever deleting the original one (from its original name), and only then delete the old-file link. Does that make sense at all? Thanks, -Martin -- Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
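[Editorial illustration: a hedged sketch of that hardlink variant, not a patch to git-repack.sh. The $PACKTMP source path for the freshly written pack and the exact sequencing are assumptions.]

  for sfx in pack idx
  do
          file=pack-$name.$sfx
          test -f "$PACKDIR/$file" || continue
          ln "$PACKDIR/$file" "$PACKDIR/old-$file" &&  # old name stays valid
          mv -f "$PACKTMP/$file" "$PACKDIR/$file" &&   # rename over it: no gap
          rm -f "$PACKDIR/old-$file"                   # drop the extra link
  done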