Re: svnrdump: The BIG update

2010-08-19 Thread 'Daniel Shahaf'
(sorry for the delay; didn't want to reply while sleepy)

Bert Huijben wrote on Tue, Aug 17, 2010 at 09:30:08 -0700:
 
 
  -Original Message-
  From: Ramkumar Ramachandra [mailto:artag...@gmail.com]
  Sent: Tuesday, 17 August 2010 9:09
  To: Daniel Shahaf
  Cc: Subversion-dev Mailing List
  Subject: Re: svnrdump: The BIG update
  
  Hi Daniel,
  
  Daniel Shahaf writes:
   Ramkumar Ramachandra wrote on Thu, Aug 12, 2010 at 12:17:34 +0530:
  The dump functionality is also complete- thanks to Stefan's review and
  MANY others for cleaning it up. It's however hit a brick wall now
  because of missing headers in the RA layer. Until I (or someone else)
  figures out how to fix the RA layer, we can't do better than the XFail
  copy-and-modify test I've committed.

 Part of the diff there is lack of SHA-1 headers --- which is unavoidable
 until editor is revved --- but part of it is a missing Text-copy-source-md5.
 Why don't you output that information --- doesn't the editor give it to you?
   
Afaik, no. I don't see Text-copy-source-* anywhere in the RA
layer. Maybe I'm not looking hard enough?
   
  
   Hmm.  It seems you're right.  So you might have to use two RA sessions in
   parallel...
  
   (and then, you might have to have the user authenticate twice?)
  
  Hm, I also have to find out if it's allowed. The commit_editor doesn't
  allow it for instance. Besides, it's a very inelegant solution- I'd
  rather fix the RA layer than do this.
 
 @Daniel, what would adding these headers add?
 
 The extra headers are for making it easier to detect corruptions by checking
 them along the transfer. 
 
 If we are just doing additional work to add headers via a different process
 it slows the dumping down more than a bit and it doesn't make the dump file
 any safer because it uses a different process to obtain the header.
 I think you would have to obtain the source of the copyfrom and get some
 checksum from that; maybe you can do that without transferring the file
 again, but I'm not sure about that.
 

I'm a bit surprised, but indeed I don't see a way to obtain the checksum
via svn_ra.h.  (The word 'checksum' doesn't appear there, and it isn't
included in svn_dirent_t either.)  I wonder how we got away without
having it...

 (And without the added headers the process is already as safe as svnsync.).
 
 Yes, we can add more and more processing to also get those new Sha1 headers
 by recalculating them while dumping, but the idea for svnrdump was to create
 a fast and secure way to dump and load repositories... not an incredibly
 slow one that has to transfer files multiple times just to make all the
 optional headers match the output of svnadmin.
 
 Those headers were made optional for a reason: you don't always have them. 
 And different conversion processes have different headers available.
 Svnadmin looks at the FS layer for dumping, so it sees different things than
 an RA layer api. E.g. the dump in svnadmin has to create diffs from
 fulltexts itself, while svnrdump has diffs and must apply these itself to
 get full texts. The checksums have a similar mangling. The FS has access to
 some of the checksums and recalculates others for you. (See the performance
 drop in 1.6 of svnadmin dump)
 

Okay, agreed.  I assumed the editor would provide the copyfrom's
checksum for free (or, at least, that svn_ra_stat() would provide it),
but of course I won't suggest to add those copyfrom-checksum headers if
calculating them is as expensive as it now appears to be.

 There is a similar case at the import side. Applying commits can't check all
 the checksums, but the really important ones are already handled. Svnrdump
 dump and svnrdump load are a nice match.
 
   Bert
 

Thanks for doubting,

Daniel



Re: svnrdump: The BIG update

2010-08-17 Thread Ramkumar Ramachandra
Hi Daniel,

Daniel Shahaf writes:
 Ramkumar Ramachandra wrote on Thu, Aug 12, 2010 at 12:17:34 +0530:
The dump functionality is also complete- thanks to Stefan's review and
MANY others for cleaning it up. It's however hit a brick wall now
because of missing headers in the RA layer. Until I (or someone else)
figures out how to fix the RA layer, we can't do better than the XFail
copy-and-modify test I've committed.
   
   Part of the diff there is lack of SHA-1 headers --- which is unavoidable
   until editor is revved --- but part of it is a missing 
   Text-copy-source-md5.
   Why don't you output that information --- doesn't the editor give it to 
   you?
  
  Afaik, no. I don't see Text-copy-source-* anywhere in the RA
  layer. Maybe I'm not looking hard enough?
  
 
 Hmm.  It seems you're right.  So you might have to use two RA sessions in
 parallel...
 
 (and then, you might have to have the user authenticate twice?)

Hm, I also have to find out if it's allowed. The commit_editor doesn't
allow it for instance. Besides, it's a very inelegant solution- I'd
rather fix the RA layer than do this.

- Make dumpfile v3 the de-facto standard and improve it for optimized
  loading/ generation. The former part was suggested by Stefan.
- Integrate it into svnadmin etc as appropriate. I think there's
  enough work here for a mini-GSoC project?
   
   How would it be integrated into svnadmin?  Do you want to push the logic
   into the standard 'svnadmin dump' command?
  
  This is something I haven't given thought to either. I brought it up
  because of an earlier discussion in which everyone seemed to be in
  favor of NOT having a new command. It feels like we're stuffing a lot
  of functionality into one tool though.
  
 
 Personally I also like having svnadmin operate only locally (so it doesn't
 even link against libsvn_ra), but that was hashed out already on that
 moderately-long thread a few weeks ago.

Yeah. It looks like I'll have to resurrect this thread soon and reach
a concrete conclusion.

-- Ram


RE: svnrdump: The BIG update

2010-08-17 Thread Bert Huijben


 -Original Message-
 From: Ramkumar Ramachandra [mailto:artag...@gmail.com]
 Sent: Tuesday, 17 August 2010 9:09
 To: Daniel Shahaf
 Cc: Subversion-dev Mailing List
 Subject: Re: svnrdump: The BIG update
 
 Hi Daniel,
 
 Daniel Shahaf writes:
  Ramkumar Ramachandra wrote on Thu, Aug 12, 2010 at 12:17:34 +0530:
 The dump functionality is also complete- thanks to Stefan's review and
 MANY others for cleaning it up. It's however hit a brick wall now
 because of missing headers in the RA layer. Until I (or someone else)
 figures out how to fix the RA layer, we can't do better than the XFail
 copy-and-modify test I've committed.
   
Part of the diff there is lack of SHA-1 headers --- which is unavoidable
until editor is revved --- but part of it is a missing Text-copy-source-md5.
Why don't you output that information --- doesn't the editor give it to you?
  
   Afaik, no. I don't see Text-copy-source-* anywhere in the RA
   layer. Maybe I'm not looking hard enough?
  
 
  Hmm.  It seems you're right.  So you might have to use two RA sessions in
  parallel...
 
  (and then, you might have to have the user authenticate twice?)
 
 Hm, I also have to find out if it's allowed. The commit_editor doesn't
 allow it for instance. Besides, it's a very inelegant solution- I'd
 rather fix the RA layer than do this.

@Daniel, what would adding these headers add?

The extra headers are for making it easier to detect corruptions by checking
them along the transfer. 

If we are just doing additional work to add headers via a different process
it slows the dumping down more than a bit and it doesn't make the dump file
any safer because it uses a different process to obtain the header.
I think you would have to obtain the source of the copyfrom and get some
checksum from that; maybe you can do that without transferring the file
again, but I'm not sure about that.
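
A minimal sketch of that trade-off, in Python rather than Subversion's C
API. It assumes a node's text arrives as a sequence of byte chunks, so its
MD5 can be computed in the same pass that writes it out, while the copy
source's checksum is only emitted if the caller already has it. The function
names and the chunked input are hypothetical; only the header names
Text-content-md5 and Text-copy-source-md5 come from the dumpfile format.

```python
import hashlib


def md5_of_chunks(chunks):
    """Hash a text that arrives in pieces, without buffering the whole text."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def checksum_headers(chunks, copy_source_md5=None):
    """Build the checksum headers for one node (hypothetical helper).

    The content checksum comes for free from the data being dumped.
    The copy-source checksum is optional: it is emitted only when the
    caller already has it, never by fetching the source a second time.
    """
    headers = []
    if copy_source_md5 is not None:
        headers.append("Text-copy-source-md5: " + copy_source_md5)
    headers.append("Text-content-md5: " + md5_of_chunks(chunks))
    return headers
```

Leaving the copy-source header out when it is unknown keeps the dump a
single pass over the data, which is the point of the argument here.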

(And without the added headers the process is already as safe as svnsync.).

Yes, we can add more and more processing to also get those new Sha1 headers
by recalculating them while dumping, but the idea for svnrdump was to create
a fast and secure way to dump and load repositories... not an incredibly
slow one that has to transfer files multiple times just to make all the
optional headers match the output of svnadmin.

Those headers were made optional for a reason: you don't always have them. 
And different conversion processes have different headers available.
Svnadmin looks at the FS layer for dumping, so it sees different things than
an RA layer api. E.g. the dump in svnadmin has to create diffs from
fulltexts itself, while svnrdump has diffs and must apply these itself to
get full texts. The checksums have a similar mangling. The FS has access to
some of the checksums and recalculates others for you. (See the performance
drop in 1.6 of svnadmin dump)

There is a similar case at the import side. Applying commits can't check all
the checksums, but the really important ones are already handled. Svnrdump
dump and svnrdump load are a nice match.

Bert



Re: svnrdump: The BIG update

2010-08-12 Thread Ramkumar Ramachandra
Hi Daniel,

Daniel Shahaf writes:
  It's been a few weeks since I got partial committer access, and ~80
  commits later, this is what we have:
  
  Firstly, thanks to Daniel for motivating me and driving me to submit
  the series to the list, and guiding me through everything. Without
  him, I'd probably not have finished svnrdump to begin with.
  
  The command line interface and argument parsing library is ready-
  thanks to Bert and lots of others for getting me started with
  this. The interface is solid and looks like the one used in the other
  SVN tools.
  
  The dump functionality is also complete- thanks to Stefan's review and
  MANY others for cleaning it up. It's however hit a brick wall now
  because of missing headers in the RA layer. Until I (or someone else)
  figures out how to fix the RA layer, we can't do better than the XFail
  copy-and-modify test I've committed.
 
 Part of the diff there is lack of SHA-1 headers --- which is unavoidable
 until editor is revved --- but part of it is a missing Text-copy-source-md5.
 Why don't you output that information --- doesn't the editor give it to you?

Afaik, no. I don't see Text-copy-source-* anywhere in the RA
layer. Maybe I'm not looking hard enough?

 Nitpick: svnrdump_tests 5 6 have the same textual description / docstring as
 each other, could you please change that?  See other test files (e.g.,
 ./commit_tests.py --list) for plenty of examples.

Fixed. Thanks for noticing this.

  It's quite mature and dumps
  surprisingly fast though. I'm tempted to run benchmarks, but I haven't
  done it yet because I fear I might be biased towards the tool :p
  
 
 Just write all the benchmarks before running them?

Hehe, yeah. Will do- I just have to make sure that no external factors
affect the tests (example: variations of network speed, disk speed,
cache with time).
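
A rough sketch of what "write the benchmarks first" could look like, assuming
the tools are compared by wall-clock time of whole command runs. The helper
names are made up for illustration, and the command in the example is a
placeholder; the real runs would use something like `svnrdump dump URL`
and `svnadmin dump PATH` against the same repository.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the dump stream so terminal or disk output speed is not measured.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def compare(commands, runs=3):
    """Return {label: best time}; taking the minimum damps cache and network noise."""
    return {label: min(time_command(argv, runs)) for label, argv in commands.items()}


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; swap in the real tools.
    print(compare({"noop": [sys.executable, "-c", "pass"]}))
```

Fixing the command list and the number of runs before looking at any result
is what guards against the bias mentioned above; repeated runs only reduce
the external factors, they do not remove them.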

  The load functionality is also quite complete, thanks to Bert et al
  for helping me debug all the cryptic errors. The code is mostly
  unreviewed though- there might be plenty of bugs and code cleanup
  opportunities. Not to say that I've stopped working on it- just that
  the work has become less challenging, now that all the tests pass :)
  
 
 Okay, good.  Some field testing probably needed here?

Yeah, lots. I've tested against 1000 revisions of the ASF
successfully, but I'll need more time and patience to run more tests.

  TODO:
  - Write more tests and start using svnrdump for real! Advertise it,
especially to developers of other versioning systems looking to
communicate with SVN. Remember how this project started out?
 
 Don't forget to inform us...@subversion.apache.org :-)

Oh, okay. I'll write another email for them.

  - More optimizations. Since svnrdump is already so fast compared to
the other tools, I think we can squeeze some more speed out of it.
  - Huge documentation effort. svnrdump is a hack- I just did what I
felt like and got it to work somehow. It's very unlike svnmucc,
which does things by the book.
  - Build more infrastructure around svnrdump- I've mostly used existing
SVN API. Although a lot of new functions were suggested, I never
really got down to writing them.
 
 Yep.  There was also talk of moving some of the logic into the libraries ---
 where does that stand?

Yeah, I haven't started working on this yet. I'll need some guidance
for this- I have to sketch out a roadmap and ask for access to the
specified regions or branch; planning is something I'm not used to at
all :p

  - Make dumpfile v3 the de-facto standard and improve it for optimized
loading/ generation. The former part was suggested by Stefan.
  - Integrate it into svnadmin etc as appropriate. I think there's
enough work here for a mini-GSoC project?
 
 How would it be integrated into svnadmin?  Do you want to push the logic
 into the standard 'svnadmin dump' command?

This is something I haven't given thought to either. I brought it up
because of an earlier discussion in which everyone seemed to be in
favor of NOT having a new command. It feels like we're stuffing a lot
of functionality into one tool though.

  - GitHub support (?) -- I saw this discussed on IRC somewhere, but I
didn't understand this myself. Can someone clarify?
  
 
 Joke.  GitHub implemented a mod_dav_svn interface to their repositories [1],
 so it's now possible (if their implementation is sound) to generate an svn
 dump of a GitHub git repository.

Ah, yes. I'm aware. With the infrastructure I've written on the Git
end (incomplete), the SVN - Git bidirectional bridge should be
seamless and awesome :)

Note: I'll be visiting home this weekend (that means: mostly
travelling). I'll be back to hack next week.

-- Ram


Re: svnrdump: The BIG update

2010-08-12 Thread Daniel Shahaf
Ramkumar Ramachandra wrote on Thu, Aug 12, 2010 at 12:17:34 +0530:
   The dump functionality is also complete- thanks to Stefan's review and
   MANY others for cleaning it up. It's however hit a brick wall now
   because of missing headers in the RA layer. Until I (or someone else)
   figures out how to fix the RA layer, we can't do better than the XFail
   copy-and-modify test I've committed.
  
  Part of the diff there is lack of SHA-1 headers --- which is unavoidable
  until editor is revved --- but part of it is a missing Text-copy-source-md5.
  Why don't you output that information --- doesn't the editor give it to you?
 
 Afaik, no. I don't see Text-copy-source-* anywhere in the RA
 layer. Maybe I'm not looking hard enough?
 

Hmm.  It seems you're right.  So you might have to use two RA sessions in
parallel...

(and then, you might have to have the user authenticate twice?)

   - More optimizations. Since svnrdump is already so fast compared to
 the other tools, I think we can squeeze some more speed out of it.
   - Huge documentation effort. svnrdump is a hack- I just did what I
 felt like and got it to work somehow. It's very unlike svnmucc,
 which does things by the book.
   - Build more infrastructure around svnrdump- I've mostly used existing
 SVN API. Although a lot of new functions were suggested, I never
 really got down to writing them.
  
  Yep.  There was also talk of moving some of the logic into the libraries ---
  where does that stand?
 
 Yeah, I haven't started working on this yet. I'll need some guidance
 for this- I have to sketch out a roadmap and ask for access to the
 specified regions or branch; planning is something I'm not used to at
 all :p
 

:-)

   - Make dumpfile v3 the de-facto standard and improve it for optimized
 loading/ generation. The former part was suggested by Stefan.
   - Integrate it into svnadmin etc as appropriate. I think there's
 enough work here for a mini-GSoC project?
  
  How would it be integrated into svnadmin?  Do you want to push the logic
  into the standard 'svnadmin dump' command?
 
 This is something I haven't given thought to either. I brought it up
 because of an earlier discussion in which everyone seemed to be in
 favor of NOT having a new command. It feels like we're stuffing a lot
 of functionality into one tool though.
 

Personally I also like having svnadmin operate only locally (so it doesn't
even link against libsvn_ra), but that was hashed out already on that
moderately-long thread a few weeks ago.

   - GitHub support (?) -- I saw this discussed on IRC somewhere, but I
 didn't understand this myself. Can someone clarify?
   
  
  Joke.  GitHub implemented a mod_dav_svn interface to their repositories [1],
  so it's now possible (if their implementation is sound) to generate an svn
  dump of a GitHub git repository.
 
 Ah, yes. I'm aware. With the infrastructure I've written on the Git
 end (incomplete), the SVN - Git bidirectional bridge should be
 seamless and awesome :)
 
 Note: I'll be visiting home this weekend (that means: mostly
 travelling). I'll be back to hack next week.
 

 -- Ram


Re: svnrdump: The BIG update

2010-08-10 Thread Daniel Shahaf
Ramkumar Ramachandra wrote on Tue, Aug 10, 2010 at 19:32:34 +0530:
 Hi,
 
 I've been putting this off for some time now- it's so much easier to
 write code than to write English :p Anyway, here it is- a massive
 status update.
 

Thanks for the update.

 It's been a few weeks since I got partial committer access, and ~80
 commits later, this is what we have:
 
 Firstly, thanks to Daniel for motivating me and driving me to submit
 the series to the list, and guiding me through everything. Without
 him, I'd probably not have finished svnrdump to begin with.
 
 The command line interface and argument parsing library is ready-
 thanks to Bert and lots of others for getting me started with
 this. The interface is solid and looks like the one used in the other
 SVN tools.
 
 The dump functionality is also complete- thanks to Stefan's review and
 MANY others for cleaning it up. It's however hit a brick wall now
 because of missing headers in the RA layer. Until I (or someone else)
 figures out how to fix the RA layer, we can't do better than the XFail
 copy-and-modify test I've committed.

Part of the diff there is lack of SHA-1 headers --- which is unavoidable
until editor is revved --- but part of it is a missing Text-copy-source-md5.
Why don't you output that information --- doesn't the editor give it to you?

Nitpick: svnrdump_tests 5 6 have the same textual description / docstring as
each other, could you please change that?  See other test files (e.g.,
./commit_tests.py --list) for plenty of examples.
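
A small sketch of how such clashes could be caught mechanically, assuming
each test is a Python function whose docstring is the description printed by
--list, and that the tests sit in a list whose first slot is None so that
numbering starts at 1. The three test functions below are hypothetical
stand-ins, not the real svnrdump tests.

```python
from collections import defaultdict


def duplicate_descriptions(tests):
    """Map each docstring shared by several tests to the numbers of those tests."""
    by_doc = defaultdict(list)
    for number, test in enumerate(tests):
        if test is None:  # placeholder slot so numbering starts at 1
            continue
        by_doc[(test.__doc__ or "").strip()].append(number)
    return {doc: nums for doc, nums in by_doc.items() if len(nums) > 1}


# Hypothetical stand-ins for real test functions.
def basic_dump():
    "dump: standard sanity check"


def copy_and_modify_dump():
    "dump: copy-and-modify"


def copy_and_modify_load():
    "dump: copy-and-modify"


test_list = [None, basic_dump, copy_and_modify_dump, copy_and_modify_load]
print(duplicate_descriptions(test_list))
```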

 It's quite mature and dumps
 surprisingly fast though. I'm tempted to run benchmarks, but I haven't
 done it yet because I fear I might be biased towards the tool :p
 

Just write all the benchmarks before running them?

 The load functionality is also quite complete, thanks to Bert et al
 for helping me debug all the cryptic errors. The code is mostly
 unreviewed though- there might be plenty of bugs and code cleanup
 opportunities. Not to say that I've stopped working on it- just that
 the work has become less challenging, now that all the tests pass :)
 

Okay, good.  Some field testing probably needed here?

 TODO:
 - Write more tests and start using svnrdump for real! Advertise it,
   especially to developers of other versioning systems looking to
   communicate with SVN. Remember how this project started out?

Don't forget to inform us...@subversion.apache.org :-)

 - More optimizations. Since svnrdump is already so fast compared to
   the other tools, I think we can squeeze some more speed out of it.
 - Huge documentation effort. svnrdump is a hack- I just did what I
   felt like and got it to work somehow. It's very unlike svnmucc,
   which does things by the book.
 - Build more infrastructure around svnrdump- I've mostly used existing
   SVN API. Although a lot of new functions were suggested, I never
   really got down to writing them.

Yep.  There was also talk of moving some of the logic into the libraries ---
where does that stand?

 - Make dumpfile v3 the de-facto standard and improve it for optimized
   loading/ generation. The former part was suggested by Stefan.
 - Integrate it into svnadmin etc as appropriate. I think there's
   enough work here for a mini-GSoC project?

How would it be integrated into svnadmin?  Do you want to push the logic
into the standard 'svnadmin dump' command?

 - GitHub support (?) -- I saw this discussed on IRC somewhere, but I
   didn't understand this myself. Can someone clarify?
 

Joke.  GitHub implemented a mod_dav_svn interface to their repositories [1],
so it's now possible (if their implementation is sound) to generate an svn
dump of a GitHub git repository.


[1] http://github.com/blog/626-announcing-svn-support
[1] `svn info http://svn.github.com/artagnon/svnrdump.git`

 -- Ram