Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
Is there a description somewhere just what rosters are? The last word I remember on the subject many months ago was that they were an internal data structure that was hard to explain. Has there been any movement on the explanation front since then? -- hendrik On Wed, Sep 06, 2006 at 02:46:30PM -0700, Nathaniel Smith wrote: -- If you can explain how you do something, then you're very very bad at it. -- John Hopfield Or is it just that you're very good at it? :-) ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
[EMAIL PROTECTED] wrote: Is there a description somewhere just what rosters are? The last word I remember on the subject many months ago was that they were an internal data structure that was hard to explain. Has there been any movement on the explanation front since then? Yeah, I'd like to get a better understanding, too. The wiki is severely out of date... Regards Markus ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
On Fri, 2006-09-08 at 16:18 +0200, Markus Schiltknecht wrote: [EMAIL PROTECTED] wrote: Is there a description somewhere just what rosters are? The last word I remember on the subject many months ago was that they were an internal data structure that was hard to explain. Has there been any movement on the explanation front since then? Yeah, I'd like to get a better understanding, too. The wiki is severely out of date... It's like a manifest, except it also has various pieces of internal metadata used to make merging faster. In our data structures, this is split into roster_t, which maps nodes (files or dirs) to an id and holds the attributes (content hash (for files), attrs, (parent id, name) pair) for the nodes, and a marking_map which holds multi-*-merge metadata. When serialized, these two structures get mixed to make the text version of a roster. Tim -- Free (experimental) public monotone hosting: http://mtn-host.prjek.net ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
I went ahaed an added your answer to the Glossary on the wiki, Tim. This question had been nagging at me for some time as well. ;-) RS On 9/8/06, Timothy Brownawell [EMAIL PROTECTED] wrote: On Fri, 2006-09-08 at 16:18 +0200, Markus Schiltknecht wrote: [EMAIL PROTECTED] wrote: Is there a description somewhere just what rosters are?The last word I remember on the subject many months ago was that they were an internal data structure that was hard to explain.Has there been any movement on the explanation front since then? Yeah, I'd like to get a better understanding, too. The wiki is severely out of date...It's like a manifest, except it also has various pieces of internal metadata used to make merging faster.In our data structures, this is split into roster_t, which maps nodes(files or dirs) to an id and holds the attributes (content hash (forfiles), attrs, (parent id, name) pair) for the nodes, and a marking_map which holds multi-*-merge metadata.When serialized, these two structures get mixed to make the text versionof a roster.Tim--Free (experimental) public monotone hosting: http://mtn-host.prjek.net___Monotone-devel mailing listMonotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel ___ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
On Wed, Sep 06, 2006 at 01:47:58AM -0700, Graydon Hoare wrote: Nathaniel Smith wrote: Thanks for looking it over -- I know you don't have much time these days :-). s/have/make/. It'd be dishonest to claim innocence; I'm just not allocating as much time for this stuff anymore. A philosophical difference, perhaps... if you don't want to do it, you shouldn't, and people can hardly be blamed for what they find they prefer to spend time on :-). Well, it's a tricky bit of state tracking. For instance, at one point, I had converted things so that rosters were indexed by roster_id, but hadn't actually changed their storage format or anything yet. I was still caching them in the vcache. I only realized much later that this meant there was a major bug -- because rollback() does not flush the vcache, I could have left stale rosters that never existed in there... this doesn't apply to files in the vcache, because they're indexed by content, but it did to rosters once I switched them over to being indexed by...index. Well, I don't immediately see any places you're not handling. It's late and there's an increasing amount of state in this class, so I'm not very confident in that assessment, but I don't know what else to do. I read through the methods and their sub-methods using etags. It seems coherent. I mean, you put them in the write-through cache during put_roster, you flush them to disk on commit, you cancel them all on rollback, you consult the cache on get_roster_version and populate it on cache-misses. Sounds sufficient to me. 'kay; sounds sufficient to me too, but it's always good to get a sanity check. Three points stand out: 1. I'm nervous about handing out shared pointers to cache entries. I see they're const and all, so realistically I can't imagine anything that'd go wrong.. it's not like we had anything in the old code to prevent fetching soon-to-be-cancelled writes either, it just seems a bit ominous. Is it possible that the db invalidates a cache entry and some caller will keep the cache entry alive, thinking it's a real roster? If it does, what are the possible consequences? This is a general problem with transactional stores, right, you might do BEGIN/write/read/ROLLBACK and the read becomes invalid? I don't have anything super useful to say about the problem in this particular case -- I just tried to make the shared_ptr stuff have exactly the semantics as one would get with a copying approach like we use elsewhere (e.g., for files, for revisions proper, etc.). 2. The manifest-reconstruction stuff is still present. Soon this should all be deleted. Maybe now? I don't know. I'm surprised you kept it alive. If you're migrating the schema, why not just toss the manifest tables entirely? They're empty, no? I got about half way through tearing it out as being completely ancient, but then I realized that no, 'rosterify' actually needs the manifest reconstruction stuff too. 0.30 seems a little early to break migration from pre-0.26. 3. Why is the delayed_files map separate from the vcache? Shouldn't the two be unified (and dedicated to files alone, ditching manifests) now that you have a proper LRUWritebackCache? Assuming we do want to keep manifests around for now, using a writeback cache for files is a little bit tricky, because we'd have to separate out the file and manifest logic somehow, or something... Anyway, I call defer! on this, the current file/manifest logic is just like it is on mainline, and if this branch creates new opportunities, we don't have to take all of them before it lands. That's the kind of bug I'm worried about -- somehow dropping a step in the dance between roster_cache insertion, marking things clean/dirty, and letting them eventually get written or not -- depending on how the transaction resolves, and whether a delta gets written straight to the db first. I _think_ I got all this stuff right, but I would feel better if someone independently convinced themself of the same thing. Again, I think this is even less likely if you make the file cache use LRUWritebackCache too: identical code paths and relationship to transactions as roster cache, less opportunity for logic mismatch. The file logic is already known good, so I think this is okay for now. If the delta gets written straight to the db it's still inside a transaction; it'll still get backed out if the transaction rolls back. The key is to ensure that everything in memory that might relate to the current transaction is written out before issuing COMMIT, or discarded from cache before issuing ROLLBACK. The fewer paths we have that cause write-out or cache-clearing, the easier this is to ensure. The other place I'm worried for similar reasons is in the roster delta code -- this is much simpler, but again, I'd appreciate it if someone else independently read
Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
On Tue, Aug 22, 2006 at 02:24:00PM -0700, Graydon Hoare wrote: First off: fantastic work! I only have minor nit-picky comments at this point; as far as the plan we've been discussing, this looks basically perfect, I'm surprised you got it done so fast. Thanks for looking it over -- I know you don't have much time these days :-). I haven't carefully gone over the transaction commit / writeback / flush ordering in database.cc, which I think you probably want. I'll try taking a whole-file read soon, but I'd be very surprised if any bugs in there would avoid tripping a fatal assertion due to a missing db row. Well, it's a tricky bit of state tracking. For instance, at one point, I had converted things so that rosters were indexed by roster_id, but hadn't actually changed their storage format or anything yet. I was still caching them in the vcache. I only realized much later that this meant there was a major bug -- because rollback() does not flush the vcache, I could have left stale rosters that never existed in there... this doesn't apply to files in the vcache, because they're indexed by content, but it did to rosters once I switched them over to being indexed by...index. That's the kind of bug I'm worried about -- somehow dropping a step in the dance between roster_cache insertion, marking things clean/dirty, and letting them eventually get written or not -- depending on how the transaction resolves, and whether a delta gets written straight to the db first. I _think_ I got all this stuff right, but I would feel better if someone independently convinced themself of the same thing. The other place I'm worried for similar reasons is in the roster delta code -- this is much simpler, but again, I'd appreciate it if someone else independently read through and convinced themselves that we aren't going to accidentally drop some field or whatever. + * This template creats a simple collection of key-value pairs that grows + * until the size specified at construction is reached and then begins + * discard the Least Recently Used element on each insertion. Modify this comment to describe the dirty set. Good point. Done. + struct roster_delta_t + { +typedef std::setnode_id nodes_deleted_t; +nodes_deleted_t nodes_deleted; +typedef std::mapstd::pairnode_id, path_component, node_id dirs_added_t; +dirs_added_t dirs_added; +typedef std::mapstd::pairnode_id, path_component, std::pairnode_id, file_id files_added_t; +files_added_t files_added; +typedef std::mapnode_id, std::pairnode_id, path_component nodes_renamed_t; +nodes_renamed_t nodes_renamed; +typedef std::mapnode_id, file_id deltas_applied_t; +deltas_applied_t deltas_applied; +typedef std::setstd::pairnode_id, attr_key attrs_cleared_t; +attrs_cleared_t attrs_cleared; +typedef std::setstd::pairnode_id, std::pairattr_key, std::pairbool, attr_valueattrs_changed_t; +attrs_changed_t attrs_changed; Stylistic: I think I'd prefer all the typedefs first, then the slot defs. Also wrap at 80 cols please. Okay, done. Semantic: I'm slightly concerned that this is the non-invertable representation. The version I was working on (which, granted, is horribly incomplete) was invertable. You can only reconstruct backwards. Does this concern you at all? The only inversion cases I could imagine were just that: imaginary. I guess it's private so we can always revisit. The only one I can think of is restricted log with the --next switch --- and even that one might go away, if we do further optimization for log and friends. So yeah, I'd prefer to defer this part for now. (It should be really easy to add later on top of the machinery this patch adds.) +// attach everything +for (dirs_added_t::const_iterator i = dirs_added.begin(); i != dirs_added.end(); ++i) + roster.attach_node(i-second, i-first.first, i-first.second); +for (files_added_t::const_iterator i = files_added.begin(); i != files_added.end(); ++i) + roster.attach_node(i-second.first, i-first.first, i-first.second); +for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != nodes_renamed.end(); ++i) + roster.attach_node(i-first, i-second.first, i-second.second); Stylistic: a bit more whitespace and wrapping, please. Okay, done. Also I'm shifting commenting style towards full sentences starting with a capital letter and ending with a period. I know, ghastly. What's happening?! Who are you, and what you have you done with graydon?!? (Done.) + +// roight, all that tricky tree-rearranging done, just have to do some +// individual node edits now roight? Err... roight. Done. + void + delta_only_in_to(node_t new_n, roster_delta_t d) + delta_in_both(node_t old_n, node_t new_n, roster_delta_t d) Smells like duplicated code from roster.cc. Is it refactorable? Probably because it was copy-and-pasted :-).
[Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
Nathaniel Smith wrote: The change is very large -- 32 files changed, 2387 insertions(+), 1073 deletions(-) -- and messes about with some very important pieces. In particular, if there are bugs in roster reconstruction, that could jeopardize our sanity checking generally. So, I'd really appreciate review of this code before it lands. First off: fantastic work! I only have minor nit-picky comments at this point; as far as the plan we've been discussing, this looks basically perfect, I'm surprised you got it done so fast. I haven't carefully gone over the transaction commit / writeback / flush ordering in database.cc, which I think you probably want. I'll try taking a whole-file read soon, but I'd be very surprised if any bugs in there would avoid tripping a fatal assertion due to a missing db row. Good testcase: download a cvs archive from sourceforge, import it, and sync it back and forth a few times. Nit picking follows: --- lru_writeback_cache.hh 7e0017442df3ec9e2a386acb3a424d60a6a7 +++ lru_writeback_cache.hh 7e0017442df3ec9e2a386acb3a424d60a6a7 +/** + * @brief Template cache with an LRU removal policy. + * @class LRUCache + * + * @par + * This template creats a simple collection of key-value pairs that grows + * until the size specified at construction is reached and then begins + * discard the Least Recently Used element on each insertion. Modify this comment to describe the dirty set. --- roster_delta.cc 1d44ccf2b5bf2911ce0fe1264bed6a33c285173f +++ roster_delta.cc 1d44ccf2b5bf2911ce0fe1264bed6a33c285173f @@ -0,0 +1,483 @@ +#include set +#include map + +#include boost/lexical_cast.hpp + +#include safe_map.hh +#include parallel_iter.hh +#include roster_delta.hh +#include basic_io.hh +#include paths.hh + +using boost::lexical_cast; + +namespace +{ + + struct roster_delta_t + { +typedef std::setnode_id nodes_deleted_t; +nodes_deleted_t nodes_deleted; +typedef std::mapstd::pairnode_id, path_component, node_id dirs_added_t; +dirs_added_t dirs_added; +typedef std::mapstd::pairnode_id, path_component, std::pairnode_id, file_id files_added_t; +files_added_t files_added; +typedef std::mapnode_id, std::pairnode_id, path_component nodes_renamed_t; +nodes_renamed_t nodes_renamed; +typedef std::mapnode_id, file_id deltas_applied_t; +deltas_applied_t deltas_applied; +typedef std::setstd::pairnode_id, attr_key attrs_cleared_t; +attrs_cleared_t attrs_cleared; +typedef std::setstd::pairnode_id, std::pairattr_key, std::pairbool, attr_value attrs_changed_t; +attrs_changed_t attrs_changed; Stylistic: I think I'd prefer all the typedefs first, then the slot defs. Also wrap at 80 cols please. Semantic: I'm slightly concerned that this is the non-invertable representation. The version I was working on (which, granted, is horribly incomplete) was invertable. You can only reconstruct backwards. Does this concern you at all? The only inversion cases I could imagine were just that: imaginary. I guess it's private so we can always revisit. + +// nodes_deleted are automatically removed from the marking_map; these are +// all markings that are new or changed +typedef std::mapnode_id, marking_t markings_changed_t; +markings_changed_t markings_changed; + +void +apply(roster_t roster, marking_map markings) const; + }; + + void + roster_delta_t::apply(roster_t roster, marking_map markings) const + { +// detach everything that should be detached +for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != nodes_deleted.end(); ++i) + roster.detach_node(*i); +for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != nodes_renamed.end(); ++i) + roster.detach_node(i-first); +// delete the delete-able things +for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != nodes_deleted.end(); ++i) + roster.drop_detached_node(*i); +// add the new things +for (dirs_added_t::const_iterator i = dirs_added.begin(); i != dirs_added.end(); ++i) + roster.create_dir_node(i-second); +for (files_added_t::const_iterator i = files_added.begin(); i != files_added.end(); ++i) + roster.create_file_node(i-second.second, i-second.first); +// attach everything +for (dirs_added_t::const_iterator i = dirs_added.begin(); i != dirs_added.end(); ++i) + roster.attach_node(i-second, i-first.first, i-first.second); +for (files_added_t::const_iterator i = files_added.begin(); i != files_added.end(); ++i) + roster.attach_node(i-second.first, i-first.first, i-first.second); +for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != nodes_renamed.end(); ++i) + roster.attach_node(i-first, i-second.first, i-second.second); Stylistic: a bit more whitespace and wrapping, please. Also I'm shifting commenting style towards full sentences starting with a capital letter and ending