Re: Status of the branch diff-optimizations-bytes branch
On Fri, Jan 28, 2011 at 5:51 AM, Hyrum K Wright hy...@hyrumwright.org wrote: On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. Thanks for your interest. It might be good to get a little bit more review on the whole, I think. A lot of people have read parts of the code, but if I remember correctly most (if not all) of them have said things like I haven't reviewed it in detail, but here are some suggestions, feedback, I'm wondering what the potential for merging back to trunk is. This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro [DONE] (offers another 10% speedup) * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Yes, I think it's quite ready to be merged, and could provide a very significant performance increase (for diff, merge and blame). The API is stable now, I think, except maybe for the name of the datasources_open function (see above). If we decide to go (for optimizations reasons) for specialized prefix/suffix scanning functions for 2, 3 or 4 datasources, I think it's best to keep the generic datasources_open API (with an array of datasources), and only split up between specialized versions in the implementation. The API is now rev'd, and I caught the branch up with trunk in r1064459. So it looks like we're ready to merge! Johan, would you like to do the honors? Thanks. I'm not sure: shouldn't we wait for a little bit more review, or can that happen after integration on trunk (or reviewing the reintegration commit itself or something)? E.g. stefan2 said he was going to take a look during his travel time next week. And one more thing came to mind with regards to the new API (datasources_open function): currently it only supports up to 4 datasources, so not an arbitrary number of datasources (the implementation in diff_file.c assumes that anyway, because it has to use some local array variables, so needs a fixed array length). I guess that's ok, because existing code in diff_file.c also already assumes that (e.g. the svn_diff__file_baton_t struct, containing an array of 4 file_info structs). And of course, there are currently no know usages of that API with more than 4 datasources (diff, diff3 and diff4). Should this restriction be documented in the docstring of datasources_open withing svn_diff.h#svn_diff_fns2_t or something? Cheers, -- Johan
Re: Status of the branch diff-optimizations-bytes branch
On Fri, Jan 28, 2011 at 6:49 AM, Johan Corveleyn jcor...@gmail.com wrote: On Fri, Jan 28, 2011 at 5:51 AM, Hyrum K Wright hy...@hyrumwright.org wrote: On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. Thanks for your interest. It might be good to get a little bit more review on the whole, I think. A lot of people have read parts of the code, but if I remember correctly most (if not all) of them have said things like I haven't reviewed it in detail, but here are some suggestions, feedback, I'm wondering what the potential for merging back to trunk is. This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro [DONE] (offers another 10% speedup) * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Yes, I think it's quite ready to be merged, and could provide a very significant performance increase (for diff, merge and blame). The API is stable now, I think, except maybe for the name of the datasources_open function (see above). If we decide to go (for optimizations reasons) for specialized prefix/suffix scanning functions for 2, 3 or 4 datasources, I think it's best to keep the generic datasources_open API (with an array of datasources), and only split up between specialized versions in the implementation. The API is now rev'd, and I caught the branch up with trunk in r1064459. So it looks like we're ready to merge! Johan, would you like to do the honors? Thanks. I'm not sure: shouldn't we wait for a little bit more review, or can that happen after integration on trunk (or reviewing the reintegration commit itself or something)? E.g. stefan2 said he was going to take a look during his travel time next week. Sounds good. And one more thing came to mind with regards to the new API (datasources_open function): currently it only supports up to 4 datasources, so not an arbitrary number of datasources (the implementation in diff_file.c assumes that anyway, because it has to use some local array variables, so needs a fixed array length). I noticed these variables. It seems that we should get rid of the magic number ('4'), and replace it either with a sizeof() or a macro from somewhere. Not sure what the best approach is, but arbitrary numbers tickle my spidey sense. I guess that's ok, because existing code in diff_file.c also already assumes that (e.g. the svn_diff__file_baton_t struct, containing an array of 4 file_info structs). And of course, there are currently no know usages of that API with more than 4 datasources (diff, diff3 and diff4). Should this restriction be documented in the docstring of
Re: Status of the branch diff-optimizations-bytes branch
Hyrum K Wright wrote: On Fri, Jan 28, 2011 at 6:49 AM, Johan Corveleyn jcor...@gmail.com wrote: And one more thing came to mind with regards to the new API (datasources_open function): currently it only supports up to 4 datasources, so not an arbitrary number of datasources (the implementation in diff_file.c assumes that anyway, because it has to use some local array variables, so needs a fixed array length). I noticed these variables. It seems that we should get rid of the magic number ('4'), and replace it either with a sizeof() or a macro from somewhere. Not sure what the best approach is, but arbitrary numbers tickle my spidey sense. I'm not looking at the code right now, but if it has multiple variables each declared as an array[4], then a better way is usually to put all the vars in a struct and have a single array of struct[4]. That's if they make sense together and are all [4] for the same reason, of course. - Julian
Re: Status of the branch diff-optimizations-bytes branch
On Wed, Jan 26, 2011 at 7:15 PM, Johan Corveleyn jcor...@gmail.com wrote: On Wed, Jan 26, 2011 at 10:50 PM, Hyrum K Wright hy...@hyrumwright.org wrote: On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: ... * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? I was just wondering about how much could be done by other folks while you were working on the primary goal of the branch. (While any full committer can commit to the branch, it's generally good form not to step on people's toes.) Oh yes, no problem. Any help with revv'ing that struct (or with other parts of the code) is very much appreciated. Feel free to make any changes necessary on the branch. Johan, I'd appreciate review on the attached patch. It is an attempt to rev the svn_diff_fns_t struct and related functions. You'll notice that I commented out the use of datasources_open in both diff_file.c and diff_memory.c, in an attempt to exercise the deprecated function wrappers. It currently segfaults horribly, but I'm not completely sure why. I suspect something is amiss in the implementation of the compat wrapper for datasources_open in deprecated.c. If you'd review this patch, I'd be much obliged. Thanks, -Hyrum Index: subversion/libsvn_diff/deprecated.c === --- subversion/libsvn_diff/deprecated.c (revision 1064135) +++ subversion/libsvn_diff/deprecated.c (working copy) @@ -41,7 +41,61 @@ /*** Code. ***/ +struct fns_wrapper_baton +{ + /* We put the old baton in front of this one, so that we can still use + this baton in place of the old. This prevents us from having to + implement simple wrappers around each member of diff_fns_t. */ + void *old_baton; + const svn_diff_fns_t *vtable; +}; +static svn_error_t * +datasources_open(void *baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasources[], + apr_size_t datasource_len) +{ + struct fns_wrapper_baton *fwb = baton; + int i; + + /* Just iterate over the datasources, using the old singular version. */ + for (i = 0; i datasource_len; i++) +{ + SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i])); +} + + /* Don't claim any prefix matches. */ + *prefix_lines = 0; + + return SVN_NO_ERROR; +} + +static void +wrap_diff_fns(svn_diff_fns2_t **diff_fns2, + struct fns_wrapper_baton **baton2, + const svn_diff_fns_t *diff_fns, + void *baton, + apr_pool_t *result_pool) +{ + /* Initialize the return vtable. */ + *diff_fns2 = apr_palloc(result_pool, sizeof(**diff_fns2)); + + (*diff_fns2)-datasource_open = diff_fns-datasource_open; + (*diff_fns2)-datasource_close = diff_fns-datasource_close; + (*diff_fns2)-datasource_get_next_token = diff_fns-datasource_get_next_token; + (*diff_fns2)-token_compare = diff_fns-token_compare; + (*diff_fns2)-token_discard = diff_fns-token_discard; + (*diff_fns2)-token_discard_all = diff_fns-token_discard_all; + + (*diff_fns2)-datasources_open = datasources_open; + + /* Initialize the wrapper baton. */ + *baton2 = apr_palloc(result_pool, sizeof (**baton2)); + (*baton2)-old_baton = baton; + (*baton2)-vtable = diff_fns; +} + + /*** From diff_file.c ***/ svn_error_t * svn_diff_file_output_unified2(svn_stream_t *output_stream, @@ -142,3 +196,48 @@ svn_diff_file_output_merge(svn_stream_t *output_st style, pool); } + + +/*** From diff.c ***/ +svn_error_t * +svn_diff_diff(svn_diff_t **diff, + void *diff_baton, + const svn_diff_fns_t *vtable, + apr_pool_t *pool) +{ + svn_diff_fns2_t *diff_fns2; + struct fns_wrapper_baton *fwb; + + wrap_diff_fns(diff_fns2, fwb, vtable, diff_baton, pool); + return svn_diff_diff_2(diff, fwb, diff_fns2, pool); +} + + +/*** From diff3.c ***/ +svn_error_t * +svn_diff_diff3(svn_diff_t **diff, + void *diff_baton, + const svn_diff_fns_t *vtable, + apr_pool_t *pool) +{ + svn_diff_fns2_t *diff_fns2; +
Re: Status of the branch diff-optimizations-bytes branch
On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan, I'd appreciate review on the attached patch. It is an attempt to rev the svn_diff_fns_t struct and related functions. You'll notice that I commented out the use of datasources_open in both diff_file.c and diff_memory.c, in an attempt to exercise the deprecated function wrappers. It currently segfaults horribly, but I'm not completely sure why. I suspect something is amiss in the implementation of the compat wrapper for datasources_open in deprecated.c. If you'd review this patch, I'd be much obliged. I don't completely grok the rev'ing and wrapping stuff just yet, but giving it a try ... One thing looked suspicious to me: Index: subversion/libsvn_diff/deprecated.c === --- subversion/libsvn_diff/deprecated.c (revision 1064135) +++ subversion/libsvn_diff/deprecated.c (working copy) @@ -41,7 +41,61 @@ /*** Code. ***/ +struct fns_wrapper_baton +{ + /* We put the old baton in front of this one, so that we can still use + this baton in place of the old. This prevents us from having to + implement simple wrappers around each member of diff_fns_t. */ + void *old_baton; + const svn_diff_fns_t *vtable; +}; +static svn_error_t * +datasources_open(void *baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasources[], + apr_size_t datasource_len) +{ + struct fns_wrapper_baton *fwb = baton; + int i; + + /* Just iterate over the datasources, using the old singular version. */ + for (i = 0; i datasource_len; i++) +{ + SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i])); Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'? +} + + /* Don't claim any prefix matches. */ + *prefix_lines = 0; + + return SVN_NO_ERROR; +} The rest looks ok, at least as far as I understand the rev'ing stuff ... Cheers, -- Johan
Re: Status of the branch diff-optimizations-bytes branch
On Thu, Jan 27, 2011 at 3:08 PM, Johan Corveleyn jcor...@gmail.com wrote: On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan, I'd appreciate review on the attached patch. It is an attempt to rev the svn_diff_fns_t struct and related functions. You'll notice that I commented out the use of datasources_open in both diff_file.c and diff_memory.c, in an attempt to exercise the deprecated function wrappers. It currently segfaults horribly, but I'm not completely sure why. I suspect something is amiss in the implementation of the compat wrapper for datasources_open in deprecated.c. If you'd review this patch, I'd be much obliged. I don't completely grok the rev'ing and wrapping stuff just yet, but giving it a try ... One thing looked suspicious to me: Index: subversion/libsvn_diff/deprecated.c === --- subversion/libsvn_diff/deprecated.c (revision 1064135) +++ subversion/libsvn_diff/deprecated.c (working copy) @@ -41,7 +41,61 @@ /*** Code. ***/ +struct fns_wrapper_baton +{ + /* We put the old baton in front of this one, so that we can still use + this baton in place of the old. This prevents us from having to + implement simple wrappers around each member of diff_fns_t. */ + void *old_baton; + const svn_diff_fns_t *vtable; +}; +static svn_error_t * +datasources_open(void *baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasources[], + apr_size_t datasource_len) +{ + struct fns_wrapper_baton *fwb = baton; + int i; + + /* Just iterate over the datasources, using the old singular version. */ + for (i = 0; i datasource_len; i++) + { + SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i])); Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'? Technically, it shouldn't matter, since *baton and *old_baton are the same. In fact, I'm exploiting that fact to avoid having to wrap *all* the callbacks. (But I could be wrong here, and Interesting Things could be happening.) + } + + /* Don't claim any prefix matches. */ + *prefix_lines = 0; + + return SVN_NO_ERROR; +} The rest looks ok, at least as far as I understand the rev'ing stuff ... The bit I'm shaky about is how to emulate the behavior of datasources_open in terms of data_source open. *Something* is not quite right since it continues to segfault. -Hyrum
Re: Status of the branch diff-optimizations-bytes branch
On Thu, Jan 27, 2011 at 3:13 PM, Hyrum K Wright hy...@hyrumwright.org wrote: On Thu, Jan 27, 2011 at 3:08 PM, Johan Corveleyn jcor...@gmail.com wrote: On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan, I'd appreciate review on the attached patch. It is an attempt to rev the svn_diff_fns_t struct and related functions. You'll notice that I commented out the use of datasources_open in both diff_file.c and diff_memory.c, in an attempt to exercise the deprecated function wrappers. It currently segfaults horribly, but I'm not completely sure why. I suspect something is amiss in the implementation of the compat wrapper for datasources_open in deprecated.c. If you'd review this patch, I'd be much obliged. I don't completely grok the rev'ing and wrapping stuff just yet, but giving it a try ... One thing looked suspicious to me: Index: subversion/libsvn_diff/deprecated.c === --- subversion/libsvn_diff/deprecated.c (revision 1064135) +++ subversion/libsvn_diff/deprecated.c (working copy) @@ -41,7 +41,61 @@ /*** Code. ***/ +struct fns_wrapper_baton +{ + /* We put the old baton in front of this one, so that we can still use + this baton in place of the old. This prevents us from having to + implement simple wrappers around each member of diff_fns_t. */ + void *old_baton; + const svn_diff_fns_t *vtable; +}; +static svn_error_t * +datasources_open(void *baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasources[], + apr_size_t datasource_len) +{ + struct fns_wrapper_baton *fwb = baton; + int i; + + /* Just iterate over the datasources, using the old singular version. */ + for (i = 0; i datasource_len; i++) + { + SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i])); Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'? Technically, it shouldn't matter, since *baton and *old_baton are the same. In fact, I'm exploiting that fact to avoid having to wrap *all* the callbacks. (But I could be wrong here, and Interesting Things could be happening.) + } + + /* Don't claim any prefix matches. */ + *prefix_lines = 0; + + return SVN_NO_ERROR; +} The rest looks ok, at least as far as I understand the rev'ing stuff ... The bit I'm shaky about is how to emulate the behavior of datasources_open in terms of data_source open. *Something* is not quite right since it continues to segfault. For completeness: I committed an improved version of the patch in r1064347. -Hyrum
Re: Status of the branch diff-optimizations-bytes branch
On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. Thanks for your interest. It might be good to get a little bit more review on the whole, I think. A lot of people have read parts of the code, but if I remember correctly most (if not all) of them have said things like I haven't reviewed it in detail, but here are some suggestions, feedback, I'm wondering what the potential for merging back to trunk is. This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro [DONE] (offers another 10% speedup) * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Yes, I think it's quite ready to be merged, and could provide a very significant performance increase (for diff, merge and blame). The API is stable now, I think, except maybe for the name of the datasources_open function (see above). If we decide to go (for optimizations reasons) for specialized prefix/suffix scanning functions for 2, 3 or 4 datasources, I think it's best to keep the generic datasources_open API (with an array of datasources), and only split up between specialized versions in the implementation. The API is now rev'd, and I caught the branch up with trunk in r1064459. So it looks like we're ready to merge! Johan, would you like to do the honors? -Hyrum
Re: Status of the branch diff-optimizations-bytes branch
On Wed, Jan 26, 2011 at 2:46 PM, Hyrum K Wright hy...@hyrumwright.org wrote: On Tue, Jan 25, 2011 at 11:41 PM, Daniel Shahaf d...@daniel.shahaf.name wrote: Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100: Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? Briefly, revving a function includes re-declaring it, updating the old declaration to be a diff against the new one, marking it as deprecated (using doxygen and C preprocessor designators), and re-implementing the old function as a deprecated.c wrapper around the new one. For a struct, you need to re-declare the struct and revv functions that use it. Also, don't forget to add a constructor function (svn_foo_t_create(), but s/_t_/_/) (and possibly a duplicator function), and forbid people from defining variables of the struct type directly. I'm not sure HACKING contains this. On the other hand, the public header files contain plenty of examples of everything I just said... For testing purposes, I usually find it easiest to work from the bottom up. In other words, rev the lowest API in the call stack, write the appropriate wrapper for the existing API, and test and commit. This ensures that the existing API (now implemented as a wrapper) still behaves the same way, but it doesn't yet exercise whatever the new functionality is. I then just work my way up the stack for each related function. Structs are a little more involved, but can use similar principles. Johan, if you don't mind, I can put some of my (admittedly limited) tuits into looking at what needs to be done to rev the above struct. I would definitely, absolutely not mind at all :-). Quite the contrary. As a summary of what I did with svn_diff_fns_t (all in the context of subversion/libsvn_diff): - Added new function 'datasources_open' to svn_diff_fns_t (maybe better to be named 'datasources_open_all' or something to have more than 1 character difference with the other existing function 'datasource_open' :-), but I said that already). - This new function is implemented in diff_file.c (the 'real' work of prefix/suffix scanning) and in diff_memory.c (dummy implementation - doesn't do anything for prefix/suffix scanning). Those were the only internal implementors/providers of that vtable. - I sent a mail once to the dev-list, asking if there were other known implementors [1], but got no response. Of course, that's not a definitive survey :-). - The new vtable-datasources_open function is called from diff.c, diff3.c and diff4.c. - The only internal caller of the old function 'datasource_open' (for a single datasource) doesn't call it anymore (token.c#svn_diff__get_tokens) (there is no need anymore, since the callers in diff.c, diff3.c and diff4.c already open the datasources with 'datasources_open' before calling svn_diff__get_tokens). Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2010-12/0083.shtml
Re: Status of the branch diff-optimizations-bytes branch
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 15:18:24 +0100: - The only internal caller of the old function 'datasource_open' (for a single datasource) doesn't call it anymore (token.c#svn_diff__get_tokens) (there is no need anymore, since the callers in diff.c, diff3.c and diff4.c already open the datasources with 'datasources_open' before calling svn_diff__get_tokens). Does datasource_open() need to stay in the revved struct?
Re: Status of the branch diff-optimizations-bytes branch
On 25.01.2011 16:58, Hyrum K Wright wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. I'm wondering what the potential for merging back to trunk is. Since I'm traveling for most of next week, I should have enough in-flight time for an in-depth review. Slow diffing has always bugged me as a user ;) This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro[DONE] (offers another 10% speedup) Even if some of the changes should no stand up to scrutiny as is, most of them should be independent from each other. Thus, pick and choose will certainly be viable. * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] That is not an essential part and can certainly be done directly on /trunk. * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Anyway, I just really like the concept / implementation of the branch, and I'm excited to get it into the hands of users. Thoughts? I'll review the code and if Johan is still reasonably sure that the API is final, I'm all for merging it. -- Stefan^2.
Re: Status of the branch diff-optimizations-bytes branch
On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: ... * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? I was just wondering about how much could be done by other folks while you were working on the primary goal of the branch. (While any full committer can commit to the branch, it's generally good form not to step on people's toes.) I've updated the branch to trunk, and will start looking at rev'ing the struct and APIs. (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). Whatever you feel is best; I'm somewhat ambivalent about the name. The docstring does seem to be a bit lacking, however. -Hyrum
Re: Status of the branch diff-optimizations-bytes branch
On Wed, Jan 26, 2011 at 7:18 PM, Daniel Shahaf d...@daniel.shahaf.name wrote: Johan Corveleyn wrote on Wed, Jan 26, 2011 at 15:18:24 +0100: - The only internal caller of the old function 'datasource_open' (for a single datasource) doesn't call it anymore (token.c#svn_diff__get_tokens) (there is no need anymore, since the callers in diff.c, diff3.c and diff4.c already open the datasources with 'datasources_open' before calling svn_diff__get_tokens). Does datasource_open() need to stay in the revved struct? I don't think so. There are no callers left in svn's own code, so I guess it can be deprecated. Cheers, -- Johan
Re: Status of the branch diff-optimizations-bytes branch
On Wed, Jan 26, 2011 at 10:50 PM, Hyrum K Wright hy...@hyrumwright.org wrote: On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote: ... * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? I was just wondering about how much could be done by other folks while you were working on the primary goal of the branch. (While any full committer can commit to the branch, it's generally good form not to step on people's toes.) Oh yes, no problem. Any help with revv'ing that struct (or with other parts of the code) is very much appreciated. Feel free to make any changes necessary on the branch. I've updated the branch to trunk, and will start looking at rev'ing the struct and APIs. (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). Whatever you feel is best; I'm somewhat ambivalent about the name. The docstring does seem to be a bit lacking, however. Yes. I guess I (or whoever takes a look at it) should go through the docstrings (and inline comments) to see if they still describe things accurately. Most of those date from the initial implementation on the branch, so may be slightly out-of-date. If I have some time, I'll try to take a look at it. Cheers, -- Johan
Status of the branch diff-optimizations-bytes branch
Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. I'm wondering what the potential for merging back to trunk is. This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro[DONE] (offers another 10% speedup) * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Anyway, I just really like the concept / implementation of the branch, and I'm excited to get it into the hands of users. Thoughts? -Hyrum
Re: Status of the branch diff-optimizations-bytes branch
On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. Thanks for your interest. It might be good to get a little bit more review on the whole, I think. A lot of people have read parts of the code, but if I remember correctly most (if not all) of them have said things like I haven't reviewed it in detail, but here are some suggestions, feedback, I'm wondering what the potential for merging back to trunk is. This is the TODO section from the BRANCH-README: TODO: * eliminate identical prefix [DONE] * eliminate identical suffix [DONE] * make diff3 and diff4 use datasources_open [DONE] (this may eliminate the need for datasource_open, and the flag datasource_opened in token.c#svn_diff__get_tokens) * implement (part of) increment_pointers and decrement_pointers with a macro [DONE] (offers another 10% speedup) * integrate (some of the) low-level optimizations for prefix/suffix scanning, proposed by stefan2 [2] [] * revv svn_diff.h#svn_diff_fns_t [] It looks like, for the most part, any destabilizing functionality is completed, and what remains are simply optimizations. This optimization work can probably be performed on trunk, and if so, we should merge the branch back and do the cleanup work there. The only bit on the current list of stuff is revving svn_diff_fns_t. Can that work be parallelized? Yes, you are correct. Most of the essential work is done. Further optimizations can just as well be done on trunk. Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? (one slightly related note to this: I introduced the function datasources_open, to open the multiple datasources at once (as opposed to the original datasource_open, which only opens one datasource). But maybe the new function should be renamed to datasource_open_all or datasources_open_all or something, to make it stand out a little bit more). I'm not trying to rush the work, nor do I think it is essential for 1.7, but it feels like a pretty good performance increase, and one that we shouldn't have any problem shipping. (If there are currently API uncertainties, than it would be good to wait until 1.7.x branches, though.) Yes, I think it's quite ready to be merged, and could provide a very significant performance increase (for diff, merge and blame). The API is stable now, I think, except maybe for the name of the datasources_open function (see above). If we decide to go (for optimizations reasons) for specialized prefix/suffix scanning functions for 2, 3 or 4 datasources, I think it's best to keep the generic datasources_open API (with an array of datasources), and only split up between specialized versions in the implementation. Anyway, I just really like the concept / implementation of the branch, and I'm excited to get it into the hands of users. Thoughts? -Hyrum Cheers, -- Johan
Re: Status of the branch diff-optimizations-bytes branch
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100: On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote: Johan (and other interested parties), I've been following some of the commits to the diff-optimizations-branch with interest. While I've not reviewed them for technical merit, it appears that others have, and that there is good work going on in the branch. Thanks for your interest. It might be good to get a little bit more review on the whole, I think. A lot of people have read parts of the code, but if I remember correctly most (if not all) of them have said things like I haven't reviewed it in detail, but here are some suggestions, feedback, Indeed. I, for example, reviewed parts of the code; but for other parts I've simply let Johan tell me what he had done, and haven't reviewed them myself.
Re: Status of the branch diff-optimizations-bytes branch
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100: Revving svn_diff_fns_t: what do you mean with parallelizing it? I must admit that I don't really know (yet) how to go about that. Very early during the branch work, danielsh pointed out that I modified this public struct (vtable for reading data from datasources), so it should be revved. Is it listed/documented somewhere what should be done for that (community guide perhaps)? Briefly, revving a function includes re-declaring it, updating the old declaration to be a diff against the new one, marking it as deprecated (using doxygen and C preprocessor designators), and re-implementing the old function as a deprecated.c wrapper around the new one. For a struct, you need to re-declare the struct and revv functions that use it. Also, don't forget to add a constructor function (svn_foo_t_create(), but s/_t_/_/) (and possibly a duplicator function), and forbid people from defining variables of the struct type directly. I'm not sure HACKING contains this. On the other hand, the public header files contain plenty of examples of everything I just said...