[Monotone-devel] A couple questions...

2010-06-03 Thread J Decker
Why does monotone read all revisions and certs in the database on a sync?

At least 90% of the time, a comparison between the heads of branches
to sync(A) vs the heads of branches to sync against(B) will result in
either A is before B or B is before A and occasionally neither A or B
is known to the respective opposite side, but then perhaps a revision
C could be stored per branch in the repository that A is in that is
the last point(s) [it may be multiple heads] that was known to sync...
then C will be before B in this third case.

probably less than 1% of synchronizations are new pulls, which would
require reading all revisions in the branches of interest.  (in which
case revision C will be blank or null).

I would think it would be a SIGNIFICANT reduction in sync time (which
isn't really that long for 30,000 revisions in assorted branches which
takes about 10 seconds to read on a netbook) to do a quick check and
just pull from the last known revision to the tip on both sides to
sync, instead of reading the entire revision structure.

Jim

___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] A couple questions...

2010-06-03 Thread Timothy Brownawell

On 06/03/2010 04:21 PM, J Decker wrote:

Why does monotone read all revisions and certs in the database on a sync?


One fairly basic feature monotone has is that certs can be added to 
arbitrarily old revisions. What this means for netsync is that just 
synchronizing the revision graph isn't enough, you also need to check 
information from every node.



At least 90% of the time, a comparison between the heads of branches
to sync(A) vs the heads of branches to sync against(B) will result in
either A is before B or B is before A and occasionally neither A or B
is known to the respective opposite side, but then perhaps a revision
C could be stored per branch in the repository that A is in that is
the last point(s) [it may be multiple heads] that was known to sync...
then C will be before B in this third case.


I think we don't want to store that kind of per-peer information just on 
general principle.


What would probably work is to store a skip-list over the revision 
graph, and synchronize on first the leaves (well, probably near-leaves, 
say anything not covered by a higher skip-node) at each level and then 
either prune/iterate or just synchronize on all unknown-state revisions. 
The skip nodes could have a hash over all certs that they skip over.



probably less than 1% of synchronizations are new pulls, which would
require reading all revisions in the branches of interest.  (in which
case revision C will be blank or null).


Further, most pulls are going to have a few (or possibly no) new 
revisions on one side and no new revisions on the other. While it's 
fairly common that two people are commiting on the same branch at the 
same time, I think this is merely a significant minority of commits (and 
syncs).



I would think it would be a SIGNIFICANT reduction in sync time (which
isn't really that long for 30,000 revisions in assorted branches which
takes about 10 seconds to read on a netbook) to do a quick check and
just pull from the last known revision to the tip on both sides to
sync, instead of reading the entire revision structure.


It's not quite that simple, but a graph-aware synchronization (with 
appropriate cert summarization) would definitely speed up figuring out 
what to send. It's on my to-do list somewhere, I promise. :)


--
Timothy

Free public monotone hosting: http://mtn-host.prjek.net

___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] A couple questions...

2010-06-03 Thread J Decker
On Thu, Jun 3, 2010 at 2:56 PM, Timothy Brownawell tbrow...@prjek.net wrote:
 On 06/03/2010 04:21 PM, J Decker wrote:

 Why does monotone read all revisions and certs in the database on a sync?

 One fairly basic feature monotone has is that certs can be added to
 arbitrarily old revisions. What this means for netsync is that just
 synchronizing the revision graph isn't enough, you also need to check
 information from every node.


Okay well that does explain that :)


 I think we don't want to store that kind of per-peer information just on
 general principle.

Yes, I can see that becoming complex, which is why I mentioned saving
it only on the client sync side, which is likely to have fewer remotes
than the central server side (which could have many identifiers for
the same client if the client is behind many firewalls and travels a
lot to various hotels, etc)... but on the client side probably only a
few servers are generally used to sync... a central or 3, and maybe a
couple peers...

even though it is a peer-to-peer operation, there is a side that plays
like a client, and a side that hosts service.


 What would probably work is to store a skip-list over the revision graph,
 and synchronize on first the leaves (well, probably near-leaves, say
 anything not covered by a higher skip-node) at each level and then either
 prune/iterate or just synchronize on all unknown-state revisions. The skip
 nodes could have a hash over all certs that they skip over.


But then you'd need to mark 'I definatly changed this revision way
back in history' ... but then when is that status cleared, since I may
have to push that to 3 servers

 probably less than 1% of synchronizations are new pulls, which would
 require reading all revisions in the branches of interest.  (in which
 case revision C will be blank or null).

 Further, most pulls are going to have a few (or possibly no) new revisions
 on one side and no new revisions on the other. While it's fairly common that
 two people are commiting on the same branch at the same time, I think this
 is merely a significant minority of commits (and syncs).

true, since many branches in the overall sync are probably untouched.


 I would think it would be a SIGNIFICANT reduction in sync time (which
 isn't really that long for 30,000 revisions in assorted branches which
 takes about 10 seconds to read on a netbook) to do a quick check and
 just pull from the last known revision to the tip on both sides to
 sync, instead of reading the entire revision structure.

 It's not quite that simple, but a graph-aware synchronization (with
 appropriate cert summarization) would definitely speed up figuring out what
 to send. It's on my to-do list somewhere, I promise. :)


Glad to hear that, sorry to hear that there are features that prevent
quick optimizations.

 --
 Timothy

 Free public monotone hosting: http://mtn-host.prjek.net


___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] A couple questions...

2010-06-03 Thread Timothy Brownawell

On 06/03/2010 06:09 PM, J Decker wrote:


What would probably work is to store a skip-list over the revision graph,
and synchronize on first the leaves (well, probably near-leaves, say
anything not covered by a higher skip-node) at each level and then either
prune/iterate or just synchronize on all unknown-state revisions. The skip
nodes could have a hash over all certs that they skip over.



But then you'd need to mark 'I definatly changed this revision way
back in history' ... but then when is that status cleared, since I may
have to push that to 3 servers


I'm thinking something like this (but probably with a factor more like 
16 or 64 instead of 2)

A2--B2  - skip nodes
|   |
z1  A1--B1  C1--D1  - skip nodes
|   |   |   |   |
  z-y-x-A-B-C-D-E-F-G-H - revisions
where each of the skip nodes has two hashes, one for the lower nodes it 
covers and one for both hashes of each of its parents. So:

 * D1 has a hash over the certs on G and H, and an empty hash.
 * C1 has a hash over the certs on E and F, and one over the two
   hashes on D1
 * B1 has a hash over the certs on C and D, and an empty hash. (It can
   still note C1 as an ancestor, but just doesn't hash over it.) The
   empty hash is to limit the number of skip-node hashes that cover the
   certs on any given revision.
 ...
 * B2 has a hash over the hashes on C1 and D1, and an empty hash
 * A2 has a hash over the hashes on A1 and B1, and one over the
   hashes on B2.
 ...
So you take the certs on z, and the pairs of hashes on z1 and A2, and 
this represents the list of all certs on all revisions. (z covers 
itself, z1 covers y and x, and A2 covers A-H). Say there's a new cert on 
G: this would show up as a difference in A2's ancestor hash so you check 
it's ancestors (in this linear case just B2), then as a difference in 
B2's covered-nodes hash so you check C1 and D1, and it's a difference in 
D1's covered-nodes hash so you check G and H and see that G is missing a 
cert.


In order to keep this consistent, it would require updating appropriate 
skip nodes when you add a cert on an older revision, for example adding 
a new cert on H would require updating hashes on D1, C1, B2, and A2. So 
really it's trading O(log(graph-depth)) on getting certs on old 
revisions which is rare, to get rid of O(graph-size) on sync refinement 
which is common.


Everything's still nice and stateless, just some older data is cached in 
summary form.


--
Timothy

Free public monotone hosting: http://mtn-host.prjek.net

___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel