Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-09-08 Thread hendrik
Is there a description somewhere just what rosters are?  The last word I
remember on the subject many months ago was that they were an internal
data structure that was hard to explain.  Has there been any movement
on the explanation front since then?

-- hendrik


On Wed, Sep 06, 2006 at 02:46:30PM -0700, Nathaniel Smith wrote:
 
 -- 
 If you can explain how you do something, then you're very very bad at it.
   -- John Hopfield

Or is it just that you're very good at it?   :-)



___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-09-08 Thread Markus Schiltknecht

[EMAIL PROTECTED] wrote:

Is there a description somewhere just what rosters are?  The last word I
remember on the subject many months ago was that they were an internal
data structure that was hard to explain.  Has there been any movement
on the explanation front since then?


Yeah, I'd like to get a better understanding, too. The wiki is severely 
out of date...


Regards

Markus


___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-09-08 Thread Timothy Brownawell
On Fri, 2006-09-08 at 16:18 +0200, Markus Schiltknecht wrote:
 [EMAIL PROTECTED] wrote:
  Is there a description somewhere just what rosters are?  The last word I
  remember on the subject many months ago was that they were an internal
  data structure that was hard to explain.  Has there been any movement
  on the explanation front since then?
 
 Yeah, I'd like to get a better understanding, too. The wiki is severely 
 out of date...

It's like a manifest, except it also has various pieces of internal
metadata used to make merging faster.

In our data structures, this is split into roster_t, which maps nodes
(files or dirs) to an id and holds the attributes (content hash (for
files), attrs, (parent id, name) pair) for the nodes, and a marking_map
which holds multi-*-merge metadata.

When serialized, these two structures get mixed to make the text version
of a roster.

Tim
-- 
Free (experimental) public monotone hosting: http://mtn-host.prjek.net



___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-09-08 Thread Rob Schoening
I went ahaed an added your answer to the Glossary on the wiki, Tim.

This question had been nagging at me for some time as well. ;-)

RS
On 9/8/06, Timothy Brownawell [EMAIL PROTECTED] wrote:
On Fri, 2006-09-08 at 16:18 +0200, Markus Schiltknecht wrote: 
[EMAIL PROTECTED] wrote:  Is there a description somewhere just what rosters are?The last word I  remember on the subject many months ago was that they were an internal  data structure that was hard to explain.Has there been any movement
  on the explanation front since then? Yeah, I'd like to get a better understanding, too. The wiki is severely out of date...It's like a manifest, except it also has various pieces of internal
metadata used to make merging faster.In our data structures, this is split into roster_t, which maps nodes(files or dirs) to an id and holds the attributes (content hash (forfiles), attrs, (parent id, name) pair) for the nodes, and a marking_map
which holds multi-*-merge metadata.When serialized, these two structures get mixed to make the text versionof a roster.Tim--Free (experimental) public monotone hosting: 
http://mtn-host.prjek.net___Monotone-devel mailing listMonotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel
___
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel


Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-09-06 Thread Nathaniel Smith
On Wed, Sep 06, 2006 at 01:47:58AM -0700, Graydon Hoare wrote:
 Nathaniel Smith wrote:
 
 Thanks for looking it over -- I know you don't have much time these
 days :-).
 
 s/have/make/. It'd be dishonest to claim innocence; I'm just not 
 allocating as much time for this stuff anymore.

A philosophical difference, perhaps... if you don't want to do it, you
shouldn't, and people can hardly be blamed for what they find they
prefer to spend time on :-).

 Well, it's a tricky bit of state tracking.  For instance, at one
 point, I had converted things so that rosters were indexed by
 roster_id, but hadn't actually changed their storage format or
 anything yet.  I was still caching them in the vcache.  I only
 realized much later that this meant there was a major bug -- because
 rollback() does not flush the vcache, I could have left stale rosters
 that never existed in there... this doesn't apply to files in the
 vcache, because they're indexed by content, but it did to rosters once
 I switched them over to being indexed by...index.
 
 Well, I don't immediately see any places you're not handling. It's late 
 and there's an increasing amount of state in this class, so I'm not very 
 confident in that assessment, but I don't know what else to do. I read 
 through the methods and their sub-methods using etags. It seems coherent.
 
 I mean, you put them in the write-through cache during put_roster, you 
 flush them to disk on commit, you cancel them all on rollback, you 
 consult the cache on get_roster_version and populate it on cache-misses. 
  Sounds sufficient to me.

'kay; sounds sufficient to me too, but it's always good to get a
sanity check.

 Three points stand out:
 
   1. I'm nervous about handing out shared pointers to cache entries.
  I see they're const and all, so realistically I can't imagine
  anything that'd go wrong.. it's not like we had anything in
  the old code to prevent fetching soon-to-be-cancelled writes
  either, it just seems a bit ominous.
 
  Is it possible that the db invalidates a cache entry and some
  caller will keep the cache entry alive, thinking it's a real
  roster? If it does, what are the possible consequences?

This is a general problem with transactional stores, right, you might
do BEGIN/write/read/ROLLBACK and the read becomes invalid?  I don't
have anything super useful to say about the problem in this particular
case -- I just tried to make the shared_ptr stuff have exactly the
semantics as one would get with a copying approach like we use
elsewhere (e.g., for files, for revisions proper, etc.).

   2. The manifest-reconstruction stuff is still present. Soon this
  should all be deleted. Maybe now? I don't know. I'm surprised
  you kept it alive. If you're migrating the schema, why not just
  toss the manifest tables entirely? They're empty, no?

I got about half way through tearing it out as being completely
ancient, but then I realized that no, 'rosterify' actually needs the
manifest reconstruction stuff too.  0.30 seems a little early to break
migration from pre-0.26.

   3. Why is the delayed_files map separate from the vcache? Shouldn't
  the two be unified (and dedicated to files alone, ditching
  manifests) now that you have a proper LRUWritebackCache?

Assuming we do want to keep manifests around for now, using a
writeback cache for files is a little bit tricky, because we'd have to
separate out the file and manifest logic somehow, or something...

Anyway, I call defer! on this, the current file/manifest logic is
just like it is on mainline, and if this branch creates new
opportunities, we don't have to take all of them before it lands.

 That's the kind of bug I'm worried about -- somehow dropping a step in
 the dance between roster_cache insertion, marking things clean/dirty,
 and letting them eventually get written or not -- depending on how the
 transaction resolves, and whether a delta gets written straight to the
 db first.  I _think_ I got all this stuff right, but I would feel
 better if someone independently convinced themself of the same thing.
 
 Again, I think this is even less likely if you make the file cache use 
 LRUWritebackCache too: identical code paths and relationship to 
 transactions as roster cache, less opportunity for logic mismatch.

The file logic is already known good, so I think this is okay for
now.

 If the delta gets written straight to the db it's still inside a 
 transaction; it'll still get backed out if the transaction rolls back. 
 The key is to ensure that everything in memory that might relate to the 
 current transaction is written out before issuing COMMIT, or discarded 
 from cache before issuing ROLLBACK. The fewer paths we have that cause 
 write-out or cache-clearing, the easier this is to ensure.
 
 The other place I'm worried for similar reasons is in the roster delta
 code -- this is much simpler, but again, I'd appreciate it if someone
 else independently read 

Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-08-23 Thread Nathaniel Smith
On Tue, Aug 22, 2006 at 02:24:00PM -0700, Graydon Hoare wrote:
 First off: fantastic work!
 
 I only have minor nit-picky comments at this point; as far as the plan 
 we've been discussing, this looks basically perfect, I'm surprised you 
 got it done so fast.

Thanks for looking it over -- I know you don't have much time these
days :-).

 I haven't carefully gone over the transaction commit / writeback / flush 
 ordering in database.cc, which I think you probably want. I'll try 
 taking a whole-file read soon, but I'd be very surprised if any bugs in 
 there would avoid tripping a fatal assertion due to a missing db row.

Well, it's a tricky bit of state tracking.  For instance, at one
point, I had converted things so that rosters were indexed by
roster_id, but hadn't actually changed their storage format or
anything yet.  I was still caching them in the vcache.  I only
realized much later that this meant there was a major bug -- because
rollback() does not flush the vcache, I could have left stale rosters
that never existed in there... this doesn't apply to files in the
vcache, because they're indexed by content, but it did to rosters once
I switched them over to being indexed by...index.

That's the kind of bug I'm worried about -- somehow dropping a step in
the dance between roster_cache insertion, marking things clean/dirty,
and letting them eventually get written or not -- depending on how the
transaction resolves, and whether a delta gets written straight to the
db first.  I _think_ I got all this stuff right, but I would feel
better if someone independently convinced themself of the same thing.

The other place I'm worried for similar reasons is in the roster delta
code -- this is much simpler, but again, I'd appreciate it if someone
else independently read through and convinced themselves that we
aren't going to accidentally drop some field or whatever.

 + * This template creats a simple collection of key-value pairs that grows
 + * until the size specified at construction is reached and then begins
 + * discard the Least Recently Used element on each insertion.
 
 Modify this comment to describe the dirty set.

Good point.  Done.

 +  struct roster_delta_t
 +  {
 +typedef std::setnode_id nodes_deleted_t;
 +nodes_deleted_t nodes_deleted;
 +typedef std::mapstd::pairnode_id, path_component, node_id 
 dirs_added_t;
 +dirs_added_t dirs_added;
 +typedef std::mapstd::pairnode_id, path_component, 
 std::pairnode_id, file_id  files_added_t;
 +files_added_t files_added;
 +typedef std::mapnode_id, std::pairnode_id, path_component  
 nodes_renamed_t;
 +nodes_renamed_t nodes_renamed;
 +typedef std::mapnode_id, file_id deltas_applied_t;
 +deltas_applied_t deltas_applied;
 +typedef std::setstd::pairnode_id, attr_key  attrs_cleared_t;
 +attrs_cleared_t attrs_cleared;
 +typedef std::setstd::pairnode_id, std::pairattr_key, 
 std::pairbool, attr_valueattrs_changed_t;
 +attrs_changed_t attrs_changed;
 
 Stylistic: I think I'd prefer all the typedefs first, then the slot 
 defs. Also wrap at 80 cols please.

Okay, done.

 Semantic: I'm slightly concerned that this is the non-invertable 
 representation. The version I was working on (which, granted, is 
 horribly incomplete) was invertable. You can only reconstruct backwards. 
 Does this concern you at all? The only inversion cases I could imagine 
 were just that: imaginary. I guess it's private so we can always revisit.

The only one I can think of is restricted log with the --next switch
--- and even that one might go away, if we do further optimization for
log and friends.  So yeah, I'd prefer to defer this part for now.  (It
should be really easy to add later on top of the machinery this patch
adds.)

 +// attach everything
 +for (dirs_added_t::const_iterator i = dirs_added.begin(); i != 
 dirs_added.end(); ++i)
 +  roster.attach_node(i-second, i-first.first, i-first.second);
 +for (files_added_t::const_iterator i = files_added.begin(); i != 
 files_added.end(); ++i)
 +  roster.attach_node(i-second.first, i-first.first, 
 i-first.second);
 +for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != 
 nodes_renamed.end(); ++i)
 +  roster.attach_node(i-first, i-second.first, i-second.second);
 
 Stylistic: a bit more whitespace and wrapping, please.

Okay, done.

 Also I'm shifting 
 commenting style towards full sentences starting with a capital letter 
 and ending with a period. I know, ghastly. What's happening?!

Who are you, and what you have you done with graydon?!?

(Done.)

 +
 +// roight, all that tricky tree-rearranging done, just have to do some
 +// individual node edits now
 
 roight?

Err... roight.  Done.

 +  void
 +  delta_only_in_to(node_t new_n, roster_delta_t  d)
 +  delta_in_both(node_t old_n, node_t new_n, roster_delta_t  d)
 
 Smells like duplicated code from roster.cc. Is it refactorable?

Probably because it was copy-and-pasted :-).  

[Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

2006-08-22 Thread Graydon Hoare

Nathaniel Smith wrote:


The change is very large -- 32 files changed, 2387 insertions(+), 1073
deletions(-) -- and messes about with some very important pieces.  In
particular, if there are bugs in roster reconstruction, that could
jeopardize our sanity checking generally.  So, I'd really appreciate
review of this code before it lands.


First off: fantastic work!

I only have minor nit-picky comments at this point; as far as the plan 
we've been discussing, this looks basically perfect, I'm surprised you 
got it done so fast.


I haven't carefully gone over the transaction commit / writeback / flush 
ordering in database.cc, which I think you probably want. I'll try 
taking a whole-file read soon, but I'd be very surprised if any bugs in 
there would avoid tripping a fatal assertion due to a missing db row.


Good testcase: download a cvs archive from sourceforge, import it, and 
sync it back and forth a few times.


Nit picking follows:


--- lru_writeback_cache.hh  7e0017442df3ec9e2a386acb3a424d60a6a7
+++ lru_writeback_cache.hh  7e0017442df3ec9e2a386acb3a424d60a6a7
+/**
+ * @brief Template cache with an LRU removal policy.
+ * @class LRUCache
+ *
+ * @par
+ * This template creats a simple collection of key-value pairs that grows
+ * until the size specified at construction is reached and then begins
+ * discard the Least Recently Used element on each insertion.


Modify this comment to describe the dirty set.


--- roster_delta.cc 1d44ccf2b5bf2911ce0fe1264bed6a33c285173f
+++ roster_delta.cc 1d44ccf2b5bf2911ce0fe1264bed6a33c285173f
@@ -0,0 +1,483 @@
+#include set
+#include map
+
+#include boost/lexical_cast.hpp
+
+#include safe_map.hh
+#include parallel_iter.hh
+#include roster_delta.hh
+#include basic_io.hh
+#include paths.hh
+
+using boost::lexical_cast;
+
+namespace 
+{

+
+  struct roster_delta_t
+  {
+typedef std::setnode_id nodes_deleted_t;
+nodes_deleted_t nodes_deleted;
+typedef std::mapstd::pairnode_id, path_component, node_id dirs_added_t;
+dirs_added_t dirs_added;
+typedef std::mapstd::pairnode_id, path_component, std::pairnode_id, 
file_id  files_added_t;
+files_added_t files_added;
+typedef std::mapnode_id, std::pairnode_id, path_component  
nodes_renamed_t;
+nodes_renamed_t nodes_renamed;
+typedef std::mapnode_id, file_id deltas_applied_t;
+deltas_applied_t deltas_applied;
+typedef std::setstd::pairnode_id, attr_key  attrs_cleared_t;
+attrs_cleared_t attrs_cleared;
+typedef std::setstd::pairnode_id, std::pairattr_key, std::pairbool, attr_value 
   attrs_changed_t;
+attrs_changed_t attrs_changed;


Stylistic: I think I'd prefer all the typedefs first, then the slot 
defs. Also wrap at 80 cols please.


Semantic: I'm slightly concerned that this is the non-invertable 
representation. The version I was working on (which, granted, is 
horribly incomplete) was invertable. You can only reconstruct backwards. 
Does this concern you at all? The only inversion cases I could imagine 
were just that: imaginary. I guess it's private so we can always revisit.



+
+// nodes_deleted are automatically removed from the marking_map; these are
+// all markings that are new or changed
+typedef std::mapnode_id, marking_t markings_changed_t;
+markings_changed_t markings_changed;
+
+void
+apply(roster_t  roster, marking_map  markings) const;
+  };
+
+  void
+  roster_delta_t::apply(roster_t  roster, marking_map  markings) const
+  {
+// detach everything that should be detached
+for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != 
nodes_deleted.end(); ++i)
+  roster.detach_node(*i);
+for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != 
nodes_renamed.end(); ++i)
+  roster.detach_node(i-first);
+// delete the delete-able things
+for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != 
nodes_deleted.end(); ++i)
+  roster.drop_detached_node(*i);
+// add the new things
+for (dirs_added_t::const_iterator i = dirs_added.begin(); i != 
dirs_added.end(); ++i)
+  roster.create_dir_node(i-second);
+for (files_added_t::const_iterator i = files_added.begin(); i != 
files_added.end(); ++i)
+  roster.create_file_node(i-second.second, i-second.first);
+// attach everything
+for (dirs_added_t::const_iterator i = dirs_added.begin(); i != 
dirs_added.end(); ++i)
+  roster.attach_node(i-second, i-first.first, i-first.second);
+for (files_added_t::const_iterator i = files_added.begin(); i != 
files_added.end(); ++i)
+  roster.attach_node(i-second.first, i-first.first, i-first.second);
+for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != 
nodes_renamed.end(); ++i)
+  roster.attach_node(i-first, i-second.first, i-second.second);


Stylistic: a bit more whitespace and wrapping, please. Also I'm shifting 
commenting style towards full sentences starting with a capital letter 
and ending