[Monotone-devel] A couple questions...
Why does monotone read all revisions and certs in the database on a sync? At least 90% of the time, a comparison between the heads of branches to sync(A) vs the heads of branches to sync against(B) will result in either A is before B or B is before A and occasionally neither A or B is known to the respective opposite side, but then perhaps a revision C could be stored per branch in the repository that A is in that is the last point(s) [it may be multiple heads] that was known to sync... then C will be before B in this third case. probably less than 1% of synchronizations are new pulls, which would require reading all revisions in the branches of interest. (in which case revision C will be blank or null). I would think it would be a SIGNIFICANT reduction in sync time (which isn't really that long for 30,000 revisions in assorted branches which takes about 10 seconds to read on a netbook) to do a quick check and just pull from the last known revision to the tip on both sides to sync, instead of reading the entire revision structure. Jim ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] A couple questions...
On 06/03/2010 04:21 PM, J Decker wrote: Why does monotone read all revisions and certs in the database on a sync? One fairly basic feature monotone has is that certs can be added to arbitrarily old revisions. What this means for netsync is that just synchronizing the revision graph isn't enough, you also need to check information from every node. At least 90% of the time, a comparison between the heads of branches to sync(A) vs the heads of branches to sync against(B) will result in either A is before B or B is before A and occasionally neither A or B is known to the respective opposite side, but then perhaps a revision C could be stored per branch in the repository that A is in that is the last point(s) [it may be multiple heads] that was known to sync... then C will be before B in this third case. I think we don't want to store that kind of per-peer information just on general principle. What would probably work is to store a skip-list over the revision graph, and synchronize on first the leaves (well, probably near-leaves, say anything not covered by a higher skip-node) at each level and then either prune/iterate or just synchronize on all unknown-state revisions. The skip nodes could have a hash over all certs that they skip over. probably less than 1% of synchronizations are new pulls, which would require reading all revisions in the branches of interest. (in which case revision C will be blank or null). Further, most pulls are going to have a few (or possibly no) new revisions on one side and no new revisions on the other. While it's fairly common that two people are commiting on the same branch at the same time, I think this is merely a significant minority of commits (and syncs). I would think it would be a SIGNIFICANT reduction in sync time (which isn't really that long for 30,000 revisions in assorted branches which takes about 10 seconds to read on a netbook) to do a quick check and just pull from the last known revision to the tip on both sides to sync, instead of reading the entire revision structure. It's not quite that simple, but a graph-aware synchronization (with appropriate cert summarization) would definitely speed up figuring out what to send. It's on my to-do list somewhere, I promise. :) -- Timothy Free public monotone hosting: http://mtn-host.prjek.net ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] A couple questions...
On Thu, Jun 3, 2010 at 2:56 PM, Timothy Brownawell tbrow...@prjek.net wrote: On 06/03/2010 04:21 PM, J Decker wrote: Why does monotone read all revisions and certs in the database on a sync? One fairly basic feature monotone has is that certs can be added to arbitrarily old revisions. What this means for netsync is that just synchronizing the revision graph isn't enough, you also need to check information from every node. Okay well that does explain that :) I think we don't want to store that kind of per-peer information just on general principle. Yes, I can see that becoming complex, which is why I mentioned saving it only on the client sync side, which is likely to have fewer remotes than the central server side (which could have many identifiers for the same client if the client is behind many firewalls and travels a lot to various hotels, etc)... but on the client side probably only a few servers are generally used to sync... a central or 3, and maybe a couple peers... even though it is a peer-to-peer operation, there is a side that plays like a client, and a side that hosts service. What would probably work is to store a skip-list over the revision graph, and synchronize on first the leaves (well, probably near-leaves, say anything not covered by a higher skip-node) at each level and then either prune/iterate or just synchronize on all unknown-state revisions. The skip nodes could have a hash over all certs that they skip over. But then you'd need to mark 'I definatly changed this revision way back in history' ... but then when is that status cleared, since I may have to push that to 3 servers probably less than 1% of synchronizations are new pulls, which would require reading all revisions in the branches of interest. (in which case revision C will be blank or null). Further, most pulls are going to have a few (or possibly no) new revisions on one side and no new revisions on the other. While it's fairly common that two people are commiting on the same branch at the same time, I think this is merely a significant minority of commits (and syncs). true, since many branches in the overall sync are probably untouched. I would think it would be a SIGNIFICANT reduction in sync time (which isn't really that long for 30,000 revisions in assorted branches which takes about 10 seconds to read on a netbook) to do a quick check and just pull from the last known revision to the tip on both sides to sync, instead of reading the entire revision structure. It's not quite that simple, but a graph-aware synchronization (with appropriate cert summarization) would definitely speed up figuring out what to send. It's on my to-do list somewhere, I promise. :) Glad to hear that, sorry to hear that there are features that prevent quick optimizations. -- Timothy Free public monotone hosting: http://mtn-host.prjek.net ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] A couple questions...
On 06/03/2010 06:09 PM, J Decker wrote: What would probably work is to store a skip-list over the revision graph, and synchronize on first the leaves (well, probably near-leaves, say anything not covered by a higher skip-node) at each level and then either prune/iterate or just synchronize on all unknown-state revisions. The skip nodes could have a hash over all certs that they skip over. But then you'd need to mark 'I definatly changed this revision way back in history' ... but then when is that status cleared, since I may have to push that to 3 servers I'm thinking something like this (but probably with a factor more like 16 or 64 instead of 2) A2--B2 - skip nodes | | z1 A1--B1 C1--D1 - skip nodes | | | | | z-y-x-A-B-C-D-E-F-G-H - revisions where each of the skip nodes has two hashes, one for the lower nodes it covers and one for both hashes of each of its parents. So: * D1 has a hash over the certs on G and H, and an empty hash. * C1 has a hash over the certs on E and F, and one over the two hashes on D1 * B1 has a hash over the certs on C and D, and an empty hash. (It can still note C1 as an ancestor, but just doesn't hash over it.) The empty hash is to limit the number of skip-node hashes that cover the certs on any given revision. ... * B2 has a hash over the hashes on C1 and D1, and an empty hash * A2 has a hash over the hashes on A1 and B1, and one over the hashes on B2. ... So you take the certs on z, and the pairs of hashes on z1 and A2, and this represents the list of all certs on all revisions. (z covers itself, z1 covers y and x, and A2 covers A-H). Say there's a new cert on G: this would show up as a difference in A2's ancestor hash so you check it's ancestors (in this linear case just B2), then as a difference in B2's covered-nodes hash so you check C1 and D1, and it's a difference in D1's covered-nodes hash so you check G and H and see that G is missing a cert. In order to keep this consistent, it would require updating appropriate skip nodes when you add a cert on an older revision, for example adding a new cert on H would require updating hashes on D1, C1, B2, and A2. So really it's trading O(log(graph-depth)) on getting certs on old revisions which is rare, to get rid of O(graph-size) on sync refinement which is common. Everything's still nice and stateless, just some older data is cached in summary form. -- Timothy Free public monotone hosting: http://mtn-host.prjek.net ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel