Re: Status of the branch diff-optimizations-bytes branch

2011-01-28 Thread Johan Corveleyn
On Fri, Jan 28, 2011 at 5:51 AM, Hyrum K Wright hy...@hyrumwright.org wrote:
 On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org 
 wrote:
 Johan (and other interested parties),
 I've been following some of the commits to the
 diff-optimizations-branch with interest.  While I've not reviewed them
 for technical merit, it appears that others have, and that there is
 good work going on in the branch.

 Thanks for your interest.

 It might be good to get a little bit more review on the whole, I
 think. A lot of people have read parts of the code, but if I remember
 correctly most (if not all) of them have said things like I haven't
 reviewed it in detail, but here are some suggestions, feedback, 

 I'm wondering what the potential
 for merging back to trunk is.  This is the TODO section from the
 BRANCH-README:

 TODO:

  * eliminate identical prefix                 [DONE]
  * eliminate identical suffix                 [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
     (this may eliminate the need for datasource_open, and the flag
      datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
    decrement_pointers with a macro            [DONE]
     (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
    scanning, proposed by stefan2 [2]          []
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 (one slightly related note to this: I introduced the function
 datasources_open, to open the multiple datasources at once (as opposed
 to the original datasource_open, which only opens one datasource). But
 maybe the new function should be renamed to datasource_open_all or
 datasources_open_all or something, to make it stand out a little bit
 more).

 I'm not trying to rush the work, nor do I think it is essential for
 1.7, but it feels like a pretty good performance increase, and one
 that we shouldn't have any problem shipping.  (If there are currently
 API uncertainties, than it would be good to wait until 1.7.x branches,
 though.)

 Yes, I think it's quite ready to be merged, and could provide a very
 significant performance increase (for diff, merge and blame).

 The API is stable now, I think, except maybe for the name of the
 datasources_open function (see above). If we decide to go (for
 optimizations reasons) for specialized prefix/suffix scanning
 functions for 2, 3 or 4 datasources, I think it's best to keep the
 generic datasources_open API (with an array of datasources), and only
 split up between specialized versions in the implementation.

 The API is now rev'd, and I caught the branch up with trunk in
 r1064459.  So it looks like we're ready to merge!

 Johan, would you like to do the honors?

Thanks.

I'm not sure: shouldn't we wait for a little bit more review, or can
that happen after integration on trunk (or reviewing the reintegration
commit itself or something)? E.g. stefan2 said he was going to take a
look during his travel time next week.

And one more thing came to mind with regards to the new API
(datasources_open function): currently it only supports up to 4
datasources, so not an arbitrary number of datasources (the
implementation in diff_file.c assumes that anyway, because it has to
use some local array variables, so needs a fixed array length).

I guess that's ok, because existing code in diff_file.c also already
assumes that (e.g. the svn_diff__file_baton_t struct, containing an
array of 4 file_info structs). And of course, there are currently no
know usages of that API with more than 4 datasources (diff, diff3 and
diff4).

Should this restriction be documented in the docstring of
datasources_open withing svn_diff.h#svn_diff_fns2_t or something?

Cheers,
-- 
Johan


Re: Status of the branch diff-optimizations-bytes branch

2011-01-28 Thread Hyrum K Wright
On Fri, Jan 28, 2011 at 6:49 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Fri, Jan 28, 2011 at 5:51 AM, Hyrum K Wright hy...@hyrumwright.org wrote:
 On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org 
 wrote:
 Johan (and other interested parties),
 I've been following some of the commits to the
 diff-optimizations-branch with interest.  While I've not reviewed them
 for technical merit, it appears that others have, and that there is
 good work going on in the branch.

 Thanks for your interest.

 It might be good to get a little bit more review on the whole, I
 think. A lot of people have read parts of the code, but if I remember
 correctly most (if not all) of them have said things like I haven't
 reviewed it in detail, but here are some suggestions, feedback, 

 I'm wondering what the potential
 for merging back to trunk is.  This is the TODO section from the
 BRANCH-README:

 TODO:

  * eliminate identical prefix                 [DONE]
  * eliminate identical suffix                 [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
     (this may eliminate the need for datasource_open, and the flag
      datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
    decrement_pointers with a macro            [DONE]
     (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
    scanning, proposed by stefan2 [2]          []
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 (one slightly related note to this: I introduced the function
 datasources_open, to open the multiple datasources at once (as opposed
 to the original datasource_open, which only opens one datasource). But
 maybe the new function should be renamed to datasource_open_all or
 datasources_open_all or something, to make it stand out a little bit
 more).

 I'm not trying to rush the work, nor do I think it is essential for
 1.7, but it feels like a pretty good performance increase, and one
 that we shouldn't have any problem shipping.  (If there are currently
 API uncertainties, than it would be good to wait until 1.7.x branches,
 though.)

 Yes, I think it's quite ready to be merged, and could provide a very
 significant performance increase (for diff, merge and blame).

 The API is stable now, I think, except maybe for the name of the
 datasources_open function (see above). If we decide to go (for
 optimizations reasons) for specialized prefix/suffix scanning
 functions for 2, 3 or 4 datasources, I think it's best to keep the
 generic datasources_open API (with an array of datasources), and only
 split up between specialized versions in the implementation.

 The API is now rev'd, and I caught the branch up with trunk in
 r1064459.  So it looks like we're ready to merge!

 Johan, would you like to do the honors?

 Thanks.

 I'm not sure: shouldn't we wait for a little bit more review, or can
 that happen after integration on trunk (or reviewing the reintegration
 commit itself or something)? E.g. stefan2 said he was going to take a
 look during his travel time next week.

Sounds good.

 And one more thing came to mind with regards to the new API
 (datasources_open function): currently it only supports up to 4
 datasources, so not an arbitrary number of datasources (the
 implementation in diff_file.c assumes that anyway, because it has to
 use some local array variables, so needs a fixed array length).

I noticed these variables.  It seems that we should get rid of the
magic number ('4'), and replace it either with a sizeof() or a macro
from somewhere.  Not sure what the best approach is, but arbitrary
numbers tickle my spidey sense.

 I guess that's ok, because existing code in diff_file.c also already
 assumes that (e.g. the svn_diff__file_baton_t struct, containing an
 array of 4 file_info structs). And of course, there are currently no
 know usages of that API with more than 4 datasources (diff, diff3 and
 diff4).

 Should this restriction be documented in the docstring of
 

Re: Status of the branch diff-optimizations-bytes branch

2011-01-28 Thread Julian Foad
Hyrum K Wright wrote:
 On Fri, Jan 28, 2011 at 6:49 AM, Johan Corveleyn jcor...@gmail.com wrote:
  And one more thing came to mind with regards to the new API
  (datasources_open function): currently it only supports up to 4
  datasources, so not an arbitrary number of datasources (the
  implementation in diff_file.c assumes that anyway, because it has to
  use some local array variables, so needs a fixed array length).
 
 I noticed these variables.  It seems that we should get rid of the
 magic number ('4'), and replace it either with a sizeof() or a macro
 from somewhere.  Not sure what the best approach is, but arbitrary
 numbers tickle my spidey sense.

I'm not looking at the code right now, but if it has multiple variables
each declared as an array[4], then a better way is usually to put all
the vars in a struct and have a single array of struct[4].  That's if
they make sense together and are all [4] for the same reason, of course.

- Julian




Re: Status of the branch diff-optimizations-bytes branch

2011-01-27 Thread Hyrum K Wright
On Wed, Jan 26, 2011 at 7:15 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Wed, Jan 26, 2011 at 10:50 PM, Hyrum K Wright hy...@hyrumwright.org 
 wrote:
 On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 ...
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 I was just wondering about how much could be done by other folks while
 you were working on the primary goal of the branch.  (While any full
 committer can commit to the branch, it's generally good form not to
 step on people's toes.)

 Oh yes, no problem. Any help with revv'ing that struct (or with other
 parts of the code) is very much appreciated. Feel free to make any
 changes necessary on the branch.

Johan,
I'd appreciate review on the attached patch.  It is an attempt to rev
the svn_diff_fns_t struct and related functions.  You'll notice that I
commented out the use of datasources_open in both diff_file.c and
diff_memory.c, in an attempt to exercise the deprecated function
wrappers.

It currently segfaults horribly, but I'm not completely sure why.  I
suspect something is amiss in the implementation of the compat wrapper
for datasources_open in deprecated.c.  If you'd review this patch, I'd
be much obliged.

Thanks,
-Hyrum
Index: subversion/libsvn_diff/deprecated.c
===
--- subversion/libsvn_diff/deprecated.c (revision 1064135)
+++ subversion/libsvn_diff/deprecated.c (working copy)
@@ -41,7 +41,61 @@
 
 
 /*** Code. ***/
+struct fns_wrapper_baton
+{
+  /* We put the old baton in front of this one, so that we can still use
+ this baton in place of the old.  This prevents us from having to 
+ implement simple wrappers around each member of diff_fns_t. */
+  void *old_baton;
+  const svn_diff_fns_t *vtable;
+};
 
+static svn_error_t *
+datasources_open(void *baton, apr_off_t *prefix_lines,
+ svn_diff_datasource_e datasources[],
+ apr_size_t datasource_len)
+{
+  struct fns_wrapper_baton *fwb = baton;
+  int i;
+
+  /* Just iterate over the datasources, using the old singular version. */
+  for (i = 0; i  datasource_len; i++)
+{
+  SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i]));
+}
+
+  /* Don't claim any prefix matches. */
+  *prefix_lines = 0;
+
+  return SVN_NO_ERROR;
+}
+
+static void
+wrap_diff_fns(svn_diff_fns2_t **diff_fns2,
+  struct fns_wrapper_baton **baton2,
+  const svn_diff_fns_t *diff_fns,
+  void *baton,
+  apr_pool_t *result_pool)
+{
+  /* Initialize the return vtable. */
+  *diff_fns2 = apr_palloc(result_pool, sizeof(**diff_fns2));
+
+  (*diff_fns2)-datasource_open = diff_fns-datasource_open;
+  (*diff_fns2)-datasource_close = diff_fns-datasource_close;
+  (*diff_fns2)-datasource_get_next_token = 
diff_fns-datasource_get_next_token;
+  (*diff_fns2)-token_compare = diff_fns-token_compare;
+  (*diff_fns2)-token_discard = diff_fns-token_discard;
+  (*diff_fns2)-token_discard_all = diff_fns-token_discard_all;
+
+  (*diff_fns2)-datasources_open = datasources_open;
+
+  /* Initialize the wrapper baton. */
+  *baton2 = apr_palloc(result_pool, sizeof (**baton2));
+  (*baton2)-old_baton = baton;
+  (*baton2)-vtable = diff_fns;
+}
+
+
 /*** From diff_file.c ***/
 svn_error_t *
 svn_diff_file_output_unified2(svn_stream_t *output_stream,
@@ -142,3 +196,48 @@ svn_diff_file_output_merge(svn_stream_t *output_st
  style,
  pool);
 }
+
+
+/*** From diff.c ***/
+svn_error_t *
+svn_diff_diff(svn_diff_t **diff,
+  void *diff_baton,
+  const svn_diff_fns_t *vtable,
+  apr_pool_t *pool)
+{
+  svn_diff_fns2_t *diff_fns2;
+  struct fns_wrapper_baton *fwb;
+
+  wrap_diff_fns(diff_fns2, fwb, vtable, diff_baton, pool);
+  return svn_diff_diff_2(diff, fwb, diff_fns2, pool);
+}
+
+
+/*** From diff3.c ***/
+svn_error_t *
+svn_diff_diff3(svn_diff_t **diff,
+   void *diff_baton,
+   const svn_diff_fns_t *vtable,
+   apr_pool_t *pool)
+{
+  svn_diff_fns2_t *diff_fns2;
+ 

Re: Status of the branch diff-optimizations-bytes branch

2011-01-27 Thread Johan Corveleyn
On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 Johan,
 I'd appreciate review on the attached patch.  It is an attempt to rev
 the svn_diff_fns_t struct and related functions.  You'll notice that I
 commented out the use of datasources_open in both diff_file.c and
 diff_memory.c, in an attempt to exercise the deprecated function
 wrappers.

 It currently segfaults horribly, but I'm not completely sure why.  I
 suspect something is amiss in the implementation of the compat wrapper
 for datasources_open in deprecated.c.  If you'd review this patch, I'd
 be much obliged.

I don't completely grok the rev'ing and wrapping stuff just yet, but
giving it a try ...

One thing looked suspicious to me:

 Index: subversion/libsvn_diff/deprecated.c
 ===
 --- subversion/libsvn_diff/deprecated.c   (revision 1064135)
 +++ subversion/libsvn_diff/deprecated.c   (working copy)
 @@ -41,7 +41,61 @@

  
  /*** Code. ***/
 +struct fns_wrapper_baton
 +{
 +  /* We put the old baton in front of this one, so that we can still use
 + this baton in place of the old.  This prevents us from having to
 + implement simple wrappers around each member of diff_fns_t. */
 +  void *old_baton;
 +  const svn_diff_fns_t *vtable;
 +};

 +static svn_error_t *
 +datasources_open(void *baton, apr_off_t *prefix_lines,
 + svn_diff_datasource_e datasources[],
 + apr_size_t datasource_len)
 +{
 +  struct fns_wrapper_baton *fwb = baton;
 +  int i;
 +
 +  /* Just iterate over the datasources, using the old singular version. */
 +  for (i = 0; i  datasource_len; i++)
 +{
 +  SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i]));

Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'?

 +}
 +
 +  /* Don't claim any prefix matches. */
 +  *prefix_lines = 0;
 +
 +  return SVN_NO_ERROR;
 +}

The rest looks ok, at least as far as I understand the rev'ing stuff ...

Cheers,
-- 
Johan


Re: Status of the branch diff-optimizations-bytes branch

2011-01-27 Thread Hyrum K Wright
On Thu, Jan 27, 2011 at 3:08 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 Johan,
 I'd appreciate review on the attached patch.  It is an attempt to rev
 the svn_diff_fns_t struct and related functions.  You'll notice that I
 commented out the use of datasources_open in both diff_file.c and
 diff_memory.c, in an attempt to exercise the deprecated function
 wrappers.

 It currently segfaults horribly, but I'm not completely sure why.  I
 suspect something is amiss in the implementation of the compat wrapper
 for datasources_open in deprecated.c.  If you'd review this patch, I'd
 be much obliged.

 I don't completely grok the rev'ing and wrapping stuff just yet, but
 giving it a try ...

 One thing looked suspicious to me:

 Index: subversion/libsvn_diff/deprecated.c
 ===
 --- subversion/libsvn_diff/deprecated.c       (revision 1064135)
 +++ subversion/libsvn_diff/deprecated.c       (working copy)
 @@ -41,7 +41,61 @@


  /*** Code. ***/
 +struct fns_wrapper_baton
 +{
 +  /* We put the old baton in front of this one, so that we can still use
 +     this baton in place of the old.  This prevents us from having to
 +     implement simple wrappers around each member of diff_fns_t. */
 +  void *old_baton;
 +  const svn_diff_fns_t *vtable;
 +};

 +static svn_error_t *
 +datasources_open(void *baton, apr_off_t *prefix_lines,
 +                 svn_diff_datasource_e datasources[],
 +                 apr_size_t datasource_len)
 +{
 +  struct fns_wrapper_baton *fwb = baton;
 +  int i;
 +
 +  /* Just iterate over the datasources, using the old singular version. */
 +  for (i = 0; i  datasource_len; i++)
 +    {
 +      SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i]));

 Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'?

Technically, it shouldn't matter, since *baton and *old_baton are the
same.  In fact, I'm exploiting that fact to avoid having to wrap *all*
the callbacks.

(But I could be wrong here, and Interesting Things could be happening.)

 +    }
 +
 +  /* Don't claim any prefix matches. */
 +  *prefix_lines = 0;
 +
 +  return SVN_NO_ERROR;
 +}

 The rest looks ok, at least as far as I understand the rev'ing stuff ...

The bit I'm shaky about is how to emulate the behavior of
datasources_open in terms of data_source open.  *Something* is not
quite right since it continues to segfault.

-Hyrum


Re: Status of the branch diff-optimizations-bytes branch

2011-01-27 Thread Hyrum K Wright
On Thu, Jan 27, 2011 at 3:13 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 On Thu, Jan 27, 2011 at 3:08 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Thu, Jan 27, 2011 at 4:52 PM, Hyrum K Wright hy...@hyrumwright.org 
 wrote:
 Johan,
 I'd appreciate review on the attached patch.  It is an attempt to rev
 the svn_diff_fns_t struct and related functions.  You'll notice that I
 commented out the use of datasources_open in both diff_file.c and
 diff_memory.c, in an attempt to exercise the deprecated function
 wrappers.

 It currently segfaults horribly, but I'm not completely sure why.  I
 suspect something is amiss in the implementation of the compat wrapper
 for datasources_open in deprecated.c.  If you'd review this patch, I'd
 be much obliged.

 I don't completely grok the rev'ing and wrapping stuff just yet, but
 giving it a try ...

 One thing looked suspicious to me:

 Index: subversion/libsvn_diff/deprecated.c
 ===
 --- subversion/libsvn_diff/deprecated.c       (revision 1064135)
 +++ subversion/libsvn_diff/deprecated.c       (working copy)
 @@ -41,7 +41,61 @@


  /*** Code. ***/
 +struct fns_wrapper_baton
 +{
 +  /* We put the old baton in front of this one, so that we can still use
 +     this baton in place of the old.  This prevents us from having to
 +     implement simple wrappers around each member of diff_fns_t. */
 +  void *old_baton;
 +  const svn_diff_fns_t *vtable;
 +};

 +static svn_error_t *
 +datasources_open(void *baton, apr_off_t *prefix_lines,
 +                 svn_diff_datasource_e datasources[],
 +                 apr_size_t datasource_len)
 +{
 +  struct fns_wrapper_baton *fwb = baton;
 +  int i;
 +
 +  /* Just iterate over the datasources, using the old singular version. */
 +  for (i = 0; i  datasource_len; i++)
 +    {
 +      SVN_ERR(fwb-vtable-datasource_open(baton, datasources[i]));

 Not sure, but shouldn't that be 'baton-old_baton' instead of 'baton'?

 Technically, it shouldn't matter, since *baton and *old_baton are the
 same.  In fact, I'm exploiting that fact to avoid having to wrap *all*
 the callbacks.

 (But I could be wrong here, and Interesting Things could be happening.)

 +    }
 +
 +  /* Don't claim any prefix matches. */
 +  *prefix_lines = 0;
 +
 +  return SVN_NO_ERROR;
 +}

 The rest looks ok, at least as far as I understand the rev'ing stuff ...

 The bit I'm shaky about is how to emulate the behavior of
 datasources_open in terms of data_source open.  *Something* is not
 quite right since it continues to segfault.

For completeness: I committed an improved version of the patch in r1064347.

-Hyrum


Re: Status of the branch diff-optimizations-bytes branch

2011-01-27 Thread Hyrum K Wright
On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 Johan (and other interested parties),
 I've been following some of the commits to the
 diff-optimizations-branch with interest.  While I've not reviewed them
 for technical merit, it appears that others have, and that there is
 good work going on in the branch.

 Thanks for your interest.

 It might be good to get a little bit more review on the whole, I
 think. A lot of people have read parts of the code, but if I remember
 correctly most (if not all) of them have said things like I haven't
 reviewed it in detail, but here are some suggestions, feedback, 

 I'm wondering what the potential
 for merging back to trunk is.  This is the TODO section from the
 BRANCH-README:

 TODO:

  * eliminate identical prefix                 [DONE]
  * eliminate identical suffix                 [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
     (this may eliminate the need for datasource_open, and the flag
      datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
    decrement_pointers with a macro            [DONE]
     (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
    scanning, proposed by stefan2 [2]          []
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 (one slightly related note to this: I introduced the function
 datasources_open, to open the multiple datasources at once (as opposed
 to the original datasource_open, which only opens one datasource). But
 maybe the new function should be renamed to datasource_open_all or
 datasources_open_all or something, to make it stand out a little bit
 more).

 I'm not trying to rush the work, nor do I think it is essential for
 1.7, but it feels like a pretty good performance increase, and one
 that we shouldn't have any problem shipping.  (If there are currently
 API uncertainties, than it would be good to wait until 1.7.x branches,
 though.)

 Yes, I think it's quite ready to be merged, and could provide a very
 significant performance increase (for diff, merge and blame).

 The API is stable now, I think, except maybe for the name of the
 datasources_open function (see above). If we decide to go (for
 optimizations reasons) for specialized prefix/suffix scanning
 functions for 2, 3 or 4 datasources, I think it's best to keep the
 generic datasources_open API (with an array of datasources), and only
 split up between specialized versions in the implementation.

The API is now rev'd, and I caught the branch up with trunk in
r1064459.  So it looks like we're ready to merge!

Johan, would you like to do the honors?

-Hyrum


Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Johan Corveleyn
On Wed, Jan 26, 2011 at 2:46 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 On Tue, Jan 25, 2011 at 11:41 PM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
 Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100:
 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 Briefly, revving a function includes re-declaring it, updating the old
 declaration to be a diff against the new one, marking it as deprecated
 (using doxygen and C preprocessor designators), and re-implementing the
 old function as a deprecated.c wrapper around the new one.

 For a struct, you need to re-declare the struct and revv functions that
 use it.  Also, don't forget to add a constructor function
 (svn_foo_t_create(), but s/_t_/_/) (and possibly a duplicator function),
 and forbid people from defining variables of the struct type directly.

 I'm not sure HACKING contains this.  On the other hand, the public
 header files contain plenty of examples of everything I just said...

 For testing purposes, I usually find it easiest to work from the
 bottom up.  In other words, rev the lowest API in the call stack,
 write the appropriate wrapper for the existing API, and test and
 commit.  This ensures that the existing API (now implemented as a
 wrapper) still behaves the same way, but it doesn't yet exercise
 whatever the new functionality is.  I then just work my way up the
 stack for each related function.  Structs are a little more involved,
 but can use similar principles.

 Johan, if you don't mind, I can put some of my (admittedly limited)
 tuits into looking at what needs to be done to rev the above struct.

I would definitely, absolutely not mind at all :-). Quite the contrary.

As a summary of what I did with svn_diff_fns_t (all in the context of
subversion/libsvn_diff):

- Added new function 'datasources_open' to svn_diff_fns_t (maybe
better to be named 'datasources_open_all' or something to have more
than 1 character difference with the other existing function
'datasource_open' :-), but I said that already).

- This new function is implemented in diff_file.c (the 'real' work of
prefix/suffix scanning) and in diff_memory.c (dummy implementation -
doesn't do anything for prefix/suffix scanning). Those were the only
internal implementors/providers of that vtable.

- I sent a mail once to the dev-list, asking if there were other known
implementors [1], but got no response. Of course, that's not a
definitive survey :-).

- The new vtable-datasources_open function is called from diff.c,
diff3.c and diff4.c.

- The only internal caller of the old function 'datasource_open'
(for a single datasource) doesn't call it anymore
(token.c#svn_diff__get_tokens) (there is no need anymore, since the
callers in diff.c, diff3.c and diff4.c already open the datasources
with 'datasources_open' before calling svn_diff__get_tokens).


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-12/0083.shtml


Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Daniel Shahaf
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 15:18:24 +0100:
 - The only internal caller of the old function 'datasource_open'
 (for a single datasource) doesn't call it anymore
 (token.c#svn_diff__get_tokens) (there is no need anymore, since the
 callers in diff.c, diff3.c and diff4.c already open the datasources
 with 'datasources_open' before calling svn_diff__get_tokens).

Does datasource_open() need to stay in the revved struct?


Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Stefan Fuhrmann

On 25.01.2011 16:58, Hyrum K Wright wrote:

Johan (and other interested parties),
I've been following some of the commits to the
diff-optimizations-branch with interest.  While I've not reviewed them
for technical merit, it appears that others have, and that there is
good work going on in the branch.  I'm wondering what the potential
for merging back to trunk is.

Since I'm traveling for most of next week, I should
have enough in-flight time for an in-depth review.
Slow diffing has always bugged me as a user ;)

This is the TODO section from the BRANCH-README:

TODO:

   * eliminate identical prefix [DONE]
   * eliminate identical suffix [DONE]
   * make diff3 and diff4 use datasources_open  [DONE]
  (this may eliminate the need for datasource_open, and the flag
   datasource_opened in token.c#svn_diff__get_tokens)
   * implement (part of) increment_pointers and
 decrement_pointers with a macro[DONE]
  (offers another 10% speedup)

Even if some of the changes should no stand up to
scrutiny as is, most of them should be independent
from each other. Thus, pick and choose will certainly
be viable.

   * integrate (some of the) low-level optimizations for prefix/suffix
 scanning, proposed by stefan2 [2]  []

That is not an essential part and can certainly be
done directly on /trunk.

   * revv svn_diff.h#svn_diff_fns_t []

It looks like, for the most part, any destabilizing functionality is
completed, and what remains are simply optimizations.  This
optimization work can probably be performed on trunk, and if so, we
should merge the branch back and do the cleanup work there.  The only
bit on the current list of stuff is revving svn_diff_fns_t.  Can that
work be parallelized?

I'm not trying to rush the work, nor do I think it is essential for
1.7, but it feels like a pretty good performance increase, and one
that we shouldn't have any problem shipping.  (If there are currently
API uncertainties, than it would be good to wait until 1.7.x branches,
though.)

Anyway, I just really like the concept / implementation of the branch,
and I'm excited to get it into the hands of users.  Thoughts?

I'll review the code and if Johan is still reasonably
sure that the API is final, I'm all for merging it.

-- Stefan^2.



Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Hyrum K Wright
On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
...
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

I was just wondering about how much could be done by other folks while
you were working on the primary goal of the branch.  (While any full
committer can commit to the branch, it's generally good form not to
step on people's toes.)

I've updated the branch to trunk, and will start looking at rev'ing
the struct and APIs.

 (one slightly related note to this: I introduced the function
 datasources_open, to open the multiple datasources at once (as opposed
 to the original datasource_open, which only opens one datasource). But
 maybe the new function should be renamed to datasource_open_all or
 datasources_open_all or something, to make it stand out a little bit
 more).

Whatever you feel is best; I'm somewhat ambivalent about the name.
The docstring does seem to be a bit lacking, however.

-Hyrum


Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Johan Corveleyn
On Wed, Jan 26, 2011 at 7:18 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Wed, Jan 26, 2011 at 15:18:24 +0100:
 - The only internal caller of the old function 'datasource_open'
 (for a single datasource) doesn't call it anymore
 (token.c#svn_diff__get_tokens) (there is no need anymore, since the
 callers in diff.c, diff3.c and diff4.c already open the datasources
 with 'datasources_open' before calling svn_diff__get_tokens).

 Does datasource_open() need to stay in the revved struct?

I don't think so. There are no callers left in svn's own code, so I
guess it can be deprecated.

Cheers,
-- 
Johan


Re: Status of the branch diff-optimizations-bytes branch

2011-01-26 Thread Johan Corveleyn
On Wed, Jan 26, 2011 at 10:50 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 On Tue, Jan 25, 2011 at 8:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 ...
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

 Yes, you are correct. Most of the essential work is done. Further
 optimizations can just as well be done on trunk.

 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

 I was just wondering about how much could be done by other folks while
 you were working on the primary goal of the branch.  (While any full
 committer can commit to the branch, it's generally good form not to
 step on people's toes.)

Oh yes, no problem. Any help with revv'ing that struct (or with other
parts of the code) is very much appreciated. Feel free to make any
changes necessary on the branch.

 I've updated the branch to trunk, and will start looking at rev'ing
 the struct and APIs.

 (one slightly related note to this: I introduced the function
 datasources_open, to open the multiple datasources at once (as opposed
 to the original datasource_open, which only opens one datasource). But
 maybe the new function should be renamed to datasource_open_all or
 datasources_open_all or something, to make it stand out a little bit
 more).

 Whatever you feel is best; I'm somewhat ambivalent about the name.
 The docstring does seem to be a bit lacking, however.

Yes. I guess I (or whoever takes a look at it) should go through the
docstrings (and inline comments) to see if they still describe things
accurately. Most of those date from the initial implementation on the
branch, so may be slightly out-of-date.

If I have some time, I'll try to take a look at it.

Cheers,
-- 
Johan


Status of the branch diff-optimizations-bytes branch

2011-01-25 Thread Hyrum K Wright
Johan (and other interested parties),
I've been following some of the commits to the
diff-optimizations-branch with interest.  While I've not reviewed them
for technical merit, it appears that others have, and that there is
good work going on in the branch.  I'm wondering what the potential
for merging back to trunk is.  This is the TODO section from the
BRANCH-README:

TODO:

  * eliminate identical prefix [DONE]
  * eliminate identical suffix [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
 (this may eliminate the need for datasource_open, and the flag
  datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
decrement_pointers with a macro[DONE]
 (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
scanning, proposed by stefan2 [2]  []
  * revv svn_diff.h#svn_diff_fns_t []

It looks like, for the most part, any destabilizing functionality is
completed, and what remains are simply optimizations.  This
optimization work can probably be performed on trunk, and if so, we
should merge the branch back and do the cleanup work there.  The only
bit on the current list of stuff is revving svn_diff_fns_t.  Can that
work be parallelized?

I'm not trying to rush the work, nor do I think it is essential for
1.7, but it feels like a pretty good performance increase, and one
that we shouldn't have any problem shipping.  (If there are currently
API uncertainties, than it would be good to wait until 1.7.x branches,
though.)

Anyway, I just really like the concept / implementation of the branch,
and I'm excited to get it into the hands of users.  Thoughts?

-Hyrum


Re: Status of the branch diff-optimizations-bytes branch

2011-01-25 Thread Johan Corveleyn
On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 Johan (and other interested parties),
 I've been following some of the commits to the
 diff-optimizations-branch with interest.  While I've not reviewed them
 for technical merit, it appears that others have, and that there is
 good work going on in the branch.

Thanks for your interest.

It might be good to get a little bit more review on the whole, I
think. A lot of people have read parts of the code, but if I remember
correctly most (if not all) of them have said things like I haven't
reviewed it in detail, but here are some suggestions, feedback, 

 I'm wondering what the potential
 for merging back to trunk is.  This is the TODO section from the
 BRANCH-README:

 TODO:

  * eliminate identical prefix                 [DONE]
  * eliminate identical suffix                 [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
     (this may eliminate the need for datasource_open, and the flag
      datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
    decrement_pointers with a macro            [DONE]
     (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
    scanning, proposed by stefan2 [2]          []
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

Yes, you are correct. Most of the essential work is done. Further
optimizations can just as well be done on trunk.

Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
admit that I don't really know (yet) how to go about that. Very early
during the branch work, danielsh pointed out that I modified this
public struct (vtable for reading data from datasources), so it should
be revved. Is it listed/documented somewhere what should be done for
that (community guide perhaps)?

(one slightly related note to this: I introduced the function
datasources_open, to open the multiple datasources at once (as opposed
to the original datasource_open, which only opens one datasource). But
maybe the new function should be renamed to datasource_open_all or
datasources_open_all or something, to make it stand out a little bit
more).

 I'm not trying to rush the work, nor do I think it is essential for
 1.7, but it feels like a pretty good performance increase, and one
 that we shouldn't have any problem shipping.  (If there are currently
 API uncertainties, than it would be good to wait until 1.7.x branches,
 though.)

Yes, I think it's quite ready to be merged, and could provide a very
significant performance increase (for diff, merge and blame).

The API is stable now, I think, except maybe for the name of the
datasources_open function (see above). If we decide to go (for
optimizations reasons) for specialized prefix/suffix scanning
functions for 2, 3 or 4 datasources, I think it's best to keep the
generic datasources_open API (with an array of datasources), and only
split up between specialized versions in the implementation.

 Anyway, I just really like the concept / implementation of the branch,
 and I'm excited to get it into the hands of users.  Thoughts?

 -Hyrum


Cheers,
-- 
Johan


Re: Status of the branch diff-optimizations-bytes branch

2011-01-25 Thread Daniel Shahaf
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100:
 On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
  Johan (and other interested parties),
  I've been following some of the commits to the
  diff-optimizations-branch with interest.  While I've not reviewed them
  for technical merit, it appears that others have, and that there is
  good work going on in the branch.
 
 Thanks for your interest.
 
 It might be good to get a little bit more review on the whole, I
 think. A lot of people have read parts of the code, but if I remember
 correctly most (if not all) of them have said things like I haven't
 reviewed it in detail, but here are some suggestions, feedback, 
 

Indeed.  I, for example, reviewed parts of the code; but for other parts
I've simply let Johan tell me what he had done, and haven't reviewed
them myself.


Re: Status of the branch diff-optimizations-bytes branch

2011-01-25 Thread Daniel Shahaf
Johan Corveleyn wrote on Wed, Jan 26, 2011 at 03:31:11 +0100:
 Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
 admit that I don't really know (yet) how to go about that. Very early
 during the branch work, danielsh pointed out that I modified this
 public struct (vtable for reading data from datasources), so it should
 be revved. Is it listed/documented somewhere what should be done for
 that (community guide perhaps)?

Briefly, revving a function includes re-declaring it, updating the old
declaration to be a diff against the new one, marking it as deprecated
(using doxygen and C preprocessor designators), and re-implementing the
old function as a deprecated.c wrapper around the new one.

For a struct, you need to re-declare the struct and revv functions that
use it.  Also, don't forget to add a constructor function
(svn_foo_t_create(), but s/_t_/_/) (and possibly a duplicator function),
and forbid people from defining variables of the struct type directly.

I'm not sure HACKING contains this.  On the other hand, the public
header files contain plenty of examples of everything I just said...