Splitting op_depth into base_op_depth and work_op_depth
phi...@apache.org writes:

Author: philip
Date: Fri Oct 8 09:53:19 2010
New Revision: 1005751

+ PM: Yes, we have overwrite semantics. The FS layer on the server has
+ magic that converts the copy of the r12 descendant into a replace if
+ the descendant exists in r10. The client does not send a delete.
+
+ This magic applies to copies, not deletes, so there is a problem
+ when the descendant is deleted in the mixed-revision copy in the
+ working copy. When faced with a copy of the subtree at r10 and a
+ delete of a descendant at r12 the commit doesn't work at present.
+ Deleting the descendant is wrong if it does not exist in r10, but
+ not deleting it is wrong if it does exist. I suppose the client
+ could ask the server, or perhaps use multiple layers of BASE to
+ track mixed-revisions (argh!).

Suppose we were to split NODES.op_depth into base_op_depth and work_op_depth, one of which is always set and one of which is always NULL. Then we could represent a mixed-revision working copy as a layering of base_op_depth. (work_op_depth=0 might be allowed but otherwise work_op_depth would be much like op_depth>0.)

Having layers of base_op_depth would allow us to represent a mixed-rev copy as layers of work_op_depth and so should solve the delete problem above.

We could also use layers of base_op_depth to represent switched subtrees, and that would probably allow us to handle deletes of the root of the switch (as a tree conflict perhaps?).

Layers of base_op_depth would also allow us to represent externals as a single working copy: a base_op_depth=1 without base_op_depth=0 would be an external.

This is probably not something for 1.7. I'd really like to get 1.7 released rather than spend forever redesigning it, but perhaps this is something for 1.8?

-- Philip
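To make the proposal a little more concrete, here is a rough sketch of the invariant ("one is always set, one is always NULL") and of one possible reading of the externals rule. Everything in it is an assumption for illustration: the struct, the DEPTH_NULL sentinel and both function names are hypothetical and do not exist in libsvn_wc, and the real NODES table lives in SQLite rather than in C structs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for SQL NULL in an integer column. */
#define DEPTH_NULL (-1)

/* Illustrative row of NODES after the proposed split (not real code). */
struct node_row
{
  const char *local_relpath;
  int base_op_depth;   /* DEPTH_NULL when the row is a working layer */
  int work_op_depth;   /* DEPTH_NULL when the row is a base layer */
};

/* The proposed invariant: exactly one of the two depths is set. */
bool
node_row_is_valid(const struct node_row *row)
{
  bool has_base = row->base_op_depth != DEPTH_NULL;
  bool has_work = row->work_op_depth != DEPTH_NULL;

  return has_base != has_work;
}

/* One reading of "base_op_depth=1 without base_op_depth=0": the path
   has a base layer at depth 1 but no base layer at depth 0. */
bool
path_looks_like_external(const struct node_row *rows, size_t n_rows,
                         const char *local_relpath)
{
  bool has_depth0 = false;
  bool has_depth1 = false;
  size_t i;

  for (i = 0; i < n_rows; i++)
    {
      if (strcmp(rows[i].local_relpath, local_relpath) != 0)
        continue;
      if (rows[i].base_op_depth == 0)
        has_depth0 = true;
      if (rows[i].base_op_depth == 1)
        has_depth1 = true;
    }

  return has_depth1 && !has_depth0;
}
```

In the database itself the invariant would more naturally be a CHECK constraint on the table; the C form is only meant to show the shape of the rule.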
Re: [PATCH] Use neon's system proxy detection if not explicitly specified
Ping. This patch has received no further comments.

Gavin Beau Baumanis

On 29/09/2010, at 8:48 PM, Dominique Leuenberger wrote:

On Wed, 2010-09-29 at 12:42 +0200, Daniel Shahaf wrote:

For me either way is fine: I can update the patch to also detect newer versions as suggested by you, which in turn will still break all the other detections of SVN_NEON_0_28 and older. Or we keep them 'in sync' and fix them all together at a later stage.

The latter please; when Neon 0.39 comes around we'll fix all checks at the same time.

In this case I would consider my patch complete, as this is already what has been submitted. Thanks for your confirmation. Is there any further action to be taken by myself for this to happen?

Dominique
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:

On Sat, 2010-10-09, Johan Corveleyn wrote:

Ok, third iteration of the patch in attachment. It passes make check. As discussed in [1], this version keeps 50 lines of the identical suffix around, to give the algorithm a good chance to generate a diff output of good quality (in all but the most extreme cases, this will be the same as with the original svn_diff algorithm). That's about the only difference with the previous iteration. So for now, I'm submitting this for review. Any feedback is very welcome :-).

Hi Johan.

Hi Julian,

Thanks for taking a look.

I haven't reviewed it, but after seeing today's discussion I had just scrolled quickly through the previous version of this patch. I noticed that the two main functions - find_identical_prefix and find_identical_suffix - are both quite similar (but not quite similar enough to make them easily share implementation) and both quite long, and I noticed you wrote in an earlier email that you were finding it hard to make the code readable. I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is unhelpful:

  {
    const svn_diff_file_options_t *options;
    const char *path[4];
    apr_file_t *file[4];
    apr_off_t size[4];
    int chunk[4];
    char *buffer[4];
    char *curp[4];
    char *endp[4];
    /* List of free tokens that may be reused. */
    svn_diff__file_token_t *tokens;
    svn_diff__normalize_state_t normalize_state[4];
    apr_pool_t *pool;
  } svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces the programmer to address them individually. So I wrote a patch - attached - that refactors this into an array of 4 sub-structures, and simplifies all the code that uses them.
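Since the attached patch is not reproduced in this archive, the following is only a guess at the shape of such a refactoring, not the patch itself. The names file_info, file_baton, file_info_bytes_left and file_info_advance are hypothetical, and plain C types stand in for the APR ones so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdio.h>

#define NUM_FILES 4

/* Hypothetical per-file record: the parallel array[4] fields of the
   baton, gathered into one structure. */
struct file_info
{
  const char *path;
  FILE *file;        /* stands in for apr_file_t * */
  long long size;    /* stands in for apr_off_t */
  int chunk;
  char *buffer;
  char *curp;
  char *endp;
};

/* The baton then holds one array of records instead of seven arrays.
   options, tokens and pool would stay here as before (omitted). */
struct file_baton
{
  struct file_info files[NUM_FILES];
};

/* Helpers can now take a single pointer instead of one pointer per
   member.  Unread bytes left in the current buffer: */
size_t
file_info_bytes_left(const struct file_info *info)
{
  return (size_t)(info->endp - info->curp);
}

/* Advance the read pointer by one byte.  Returns 0 when the buffer is
   exhausted; real code would load the next chunk at that point. */
int
file_info_advance(struct file_info *info)
{
  if (info->curp >= info->endp)
    return 0;
  info->curp++;
  return 1;
}
```

A caller would write file_info_advance(&baton->files[idx]) where it previously had to pass curp[idx], endp[idx] and so on separately.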
I think this will help you to get better code clarity because then your increment_pointer_or_chunk() for example will be able to take a single pointer to a file_info structure instead of a lot of pointers to individual members of the same. Would you take a look and let me know if you agree. If so, I can commit the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code. So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info struct? Now it's calculated in several places: last_chunk0 = offset_to_chunk(file_baton->size[idx0]), or I have to pass it on every time as an extra argument. Seems like the sort of info that could be part of file_info.

One more thing: you might have noticed that for find_identical_suffix I use other buffers, chunks, curp's, endp's, ... than for the prefix. For prefix scanning I can just use the stuff from the diff_baton, because after prefix scanning has finished, everything is buffered and pointing correctly for the normal algorithm to continue (i.e. everything points at the first byte of the first non-identical line). For suffix scanning I need to use other structures (newly alloc'd buffer etc), so as to not mess with those pointers/buffers from the diff_baton. So: I think I'll need the file_info struct to be available out of the diff_baton_t struct as well, so I can use this in suffix scanning also.

(side-note: I considered first doing suffix scanning, then prefix scanning, so I could reuse the buffers/pointers from diff_baton all the time, and still have everything pointing correctly after eliminating prefix/suffix. But that could give vastly different results in some cases, for instance when original file is entirely identical to both the prefix and the suffix of the modified file. So I decided it's best to stick with first prefix, then suffix).
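The ordering question in the side-note can be illustrated with a small in-memory sketch. This is not the code from the patch: the function names and the skip parameters are made up for the example, it works on whole buffers rather than 128K chunks, and it does not model the keep-50-suffix-lines compromise. It only shows how "prefix first" and "suffix first" can split the same pair of files differently.

```c
#include <stddef.h>

/* Bytes of identical prefix of A and B, cut back to a line boundary.
   SKIP_TAIL bytes at the end of both buffers are left alone (they were
   already claimed by an earlier suffix scan). */
size_t
identical_prefix_len(const char *a, size_t alen,
                     const char *b, size_t blen, size_t skip_tail)
{
  size_t shortest = alen < blen ? alen : blen;
  size_t limit, i, last_line_end = 0;

  if (skip_tail > shortest)
    return 0;
  limit = shortest - skip_tail;

  for (i = 0; i < limit && a[i] == b[i]; i++)
    if (a[i] == '\n')
      last_line_end = i + 1;

  return last_line_end;
}

/* Bytes of identical suffix of A and B, starting at a line boundary in
   both.  SKIP_HEAD bytes at the start of both buffers are left alone
   (already claimed by an earlier prefix scan ending on a line end). */
size_t
identical_suffix_len(const char *a, size_t alen,
                     const char *b, size_t blen, size_t skip_head)
{
  size_t shortest = alen < blen ? alen : blen;
  size_t limit, n = 0;

  if (skip_head > shortest)
    return 0;
  limit = shortest - skip_head;

  while (n < limit && a[alen - 1 - n] == b[blen - 1 - n])
    n++;

  /* Shrink until the suffix starts at the beginning of a line in both. */
  while (n > 0)
    {
      size_t pa = alen - n;
      size_t pb = blen - n;

      if ((pa == 0 || a[pa - 1] == '\n') && (pb == 0 || b[pb - 1] == '\n'))
        break;
      n--;
    }

  return n;
}
```

With original "a\n" and modified "a\na\n", scanning the prefix first claims the whole original as prefix and leaves the second line of the modified file as an addition at the end; scanning the suffix first claims it as suffix and leaves the first line as an addition at the start.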
Responding to some of the other points you mentioned in a much earlier mail:

3) It's only implemented for 2 files. I'd like to generalize this for an array of datasources, so it can also be used for diff3 and diff4.

4) As a small hack, I had to add a flag datasource_opened to token.c#svn_diff__get_tokens, so I could have different behavior for regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass into datasources_open. An array combined with a length parameter? Something like:

  static svn_error_t *
  datasources_open(void *baton, apr_off_t *prefix_lines,
                   svn_diff_datasource_e datasources[],
                   int datasources_len)

? And then use for loops everywhere I now do things twice for the two datasources?

5) I've struggled with making the code readable/understandable. It's likely that there is still a lot of room for improvement. I also probably need to document it some more.

You need to write a full doc string for datasources_open(), at least. It needs especially to say how it relates to datasource_open() - why should the caller call this
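For what it's worth, the "array plus length, then loop" idea could look roughly like the sketch below. It is only an illustration under assumptions: datasource_e, demo_baton_t and demo_datasources_open are stand-ins invented here (for svn_diff_datasource_e, the file baton and datasources_open), an int return value stands in for svn_error_t *, and the body just marks each datasource as opened instead of reading any file.

```c
#include <stddef.h>

/* Stand-in for svn_diff_datasource_e. */
typedef enum
{
  DS_ORIGINAL,
  DS_MODIFIED,
  DS_LATEST,
  DS_ANCESTOR
} datasource_e;

/* Stand-in baton that only records which datasources were opened. */
typedef struct
{
  int opened[4];
} demo_baton_t;

/* Open DATASOURCES_LEN datasources in one call.  Returns 0 on success
   and 1 on a bad datasource (standing in for an svn_error_t *). */
int
demo_datasources_open(void *baton, long long *prefix_lines,
                      const datasource_e datasources[],
                      int datasources_len)
{
  demo_baton_t *b = baton;
  int i;

  /* One loop replaces the code that was written out twice. */
  for (i = 0; i < datasources_len; i++)
    {
      int idx = (int)datasources[i];

      if (idx < 0 || idx > (int)DS_ANCESTOR)
        return 1;
      b->opened[idx] = 1;
    }

  /* A real implementation would scan the identical prefix here and
     report how many lines it skipped. */
  *prefix_lines = 0;
  return 0;
}
```

A two-way diff would pass an array of two datasources, diff3 three, and diff4 four, so the same function could serve all callers.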
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:

(side-note: I considered first doing suffix scanning, then prefix scanning, so I could reuse the buffers/pointers from diff_baton all the time, and still have everything pointing correctly after eliminating prefix/suffix. But that could give vastly different results in some cases, for instance when original file is entirely identical to both the prefix and the suffix of the modified file. So I decided it's best to stick with first prefix, then suffix).

What Hyrum said. How common /is/ this case? And, anyway, in that case both "everything was appended" and "everything was prepended" are equally legitimate diffs.
BitTorrent RA layer
Hi all,

I've recently been contemplating implementing an RA layer using the BitTorrent protocol in order to speed up large repository checkouts. The primary impetus for this feature is to get a large, geographically colocated development group up and running quickly. The code base is large (~1 GB), and initial checkouts are a major pain. If we could harness peer-to-peer downloads, then most of this pain goes away.

Has anyone thought about this before? How difficult would it be? Is anyone perhaps interested in coordinating an effort to do this?
Re: BitTorrent RA layer
[Ozzie Chan] I've recently been contemplating implementing an RA layer using the bittorrent protocol in order to speed up large repository checkouts. I don't think it would fit the RA layer very well, honestly. I think what you'd do instead is seed a torrent of a full checkout, or perhaps of a svn dump file. Or you could come up with a protocol that is somewhat, but not entirely, like bittorrent: each client seeks dumpfiles of all the revisions in the repository. They exchange these much like normal bittorrent payloads, except that there's probably no way to come up with the checksums in advance, so the clients would all have to trust each other. The repository would serve as the initial seed, and each client would use 'svnrdump' (a tool to generate a dumpfile over the RA layer) to retrieve new revisions from the repository that are not already in the BT network. I note that this gives you a copy of the repository, which is a superset of a checkout and may be many times larger. Can be useful, too, to set up a local write-through proxy via mod_dav_svn and keep it up to date with svnsync. Refer also to Luke Leighton's recent git-BT gateway proof of concept: http://lists.debian.org/debian-devel/2010/09/msg9.html and following http://gitorious.org/python-libbittorrent/pybtlib I note that git is probably better suited to the bittorrent gateway concept than svn is, since it is changeset-oriented, and each changeset contains and is uniquely identified by a cryptographic hash. -- Peter Samuelson | org-tld!p12n!peter | http://p12n.org/
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:

Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:

(side-note: I considered first doing suffix scanning, then prefix scanning, so I could reuse the buffers/pointers from diff_baton all the time, and still have everything pointing correctly after eliminating prefix/suffix. But that could give vastly different results in some cases, for instance when original file is entirely identical to both the prefix and the suffix of the modified file. So I decided it's best to stick with first prefix, then suffix).

What Hyrum said. How common /is/ this case? And, anyway, in that case both "everything was appended" and "everything was prepended" are equally legitimate diffs.

Hm, I'm not sure about this one. I just wanted to do as much as reasonably possible to keep the results identical to what they were. Using another buffer for suffix scanning didn't seem that big of a deal (only a slight increase in memory use: 2 chunks of 128K in the current implementation). I made that decision pretty early, before I knew of the other problem of suffix scanning, and the keep-50-suffix-lines compromise we decided upon.

There may be more subtle cases than the one I described, I don't know. OTOH, now that we have the keep-50-suffix-lines, that may help also in this case. I'll have to think about that. Maybe I can give it a go, first suffix then prefix, and see if I can find real-life problems ...

-- Johan