On Sat, 2010-10-09, Johan Corveleyn wrote: > Ok, third iteration of the patch in attachment. It passes make check. > > As discussed in [1], this version keeps 50 lines of the identical > suffix around, to give the algorithm a good chance to generate a diff > output of good quality (in all but the most extreme cases, this will > be the same as with the original svn_diff algorithm). > > That's about the only difference with the previous iteration. So for > now, I'm submitting this for review. Any feedback is very welcome :-).
Hi Johan. I haven't reviewed it, but after seeing today's discussion I had just scrolled quickly through the previous version of this patch. I noticed that the two main functions - find_identical_suffix and find_identical_suffix - are both quite similar (but not quite similar enough to make them easily share implementation) and both quite long, and I noticed you wrote in an earlier email that you were finding it hard to make the code readable. I have a suggestion that may help. I think the existing structure of the svn_diff__file_baton_t is unhelpful: { const svn_diff_file_options_t *options; const char *path[4]; apr_file_t *file[4]; apr_off_t size[4]; int chunk[4]; char *buffer[4]; char *curp[4]; char *endp[4]; /* List of free tokens that may be reused. */ svn_diff__file_token_t *tokens; svn_diff__normalize_state_t normalize_state[4]; apr_pool_t *pool; } svn_diff__file_baton_t; All those array[4] fields are logically related, but this layout forces the programmer to address them individually. So I wrote a patch - attached - that refactors this into an array of 4 sub-structures, and simplifies all the code that uses them. I think this will help you to get better code clarity because then your increment_pointer_or_chunk() for example will be able to take a single pointer to a file_info structure instead of a lot of pointers to individual members of the same. Would you take a look and let me know if you agree. If so, I can commit the refactoring straight away. Responding to some of the other points you mentioned in a much earlier mail: > 3) It's only implemented for 2 files. I'd like to generalize this for > an array of datasources, so it can also be used for diff3 and diff4. > > 4) As a small hack, I had to add a flag "datasource_opened" to > token.c#svn_diff__get_tokens, so I could have different behavior for > regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is > no longer needed. Yes, I'd like to see 3), and so hack 4) will go away. > 5) I've struggled with making the code readable/understandable. It's > likely that there is still a lot of room for improvement. I also > probably need to document it some more. You need to write a full doc string for datasources_open(), at least. It needs especially to say how it relates to datasource_open() - why should the caller call this function rather than that function, and are they both necessary - and how the caller is supposed to use the 'prefix_lines' parameter. And doc strings for increment/decrement_...(). But this makes me think, it looks to me like this whole prefix-suffix-skipping functionality would fit better inside the lower-level diff algorithm rather than inside the "datasources_open" function. Do you agree? It works as it is, but you were talking about wanting it to obey the standard token-parsing rules such as "ignore white space", and so on. It seems that things like this would be much better one level down. > 6) I've not paid too much attention to low level optimizations, so > here also there's probably room for improvement, which may be > significant for the critical loops. Maybe. Not important at this stage. > 7) There are two warnings left "conversion from 'apr_off_t' to 'int'", > in diff_file.c#find_identical_suffix. To completely eliminate this, I > should probably make all "chunks" of type apr_off_t instead of int > (but I have not done that yet, because the original implementation > also used int for the chunk in the svn_diff__file_baton_t struct). > Should I do this (also in the baton struct)? Or should I use an > explicit cast somewhere? I recommend using an explicit cast where needed, in this patch. Changing the 'chunk' data type everywhere could be done in a separate patch, either before or after this patch. - Julian > I still consider this a WIP, because of the following remaining todo's > (which may have a lot of impact on the current implementation): > - Generalize for more than 2 datasources (for diff3 and diff4). > - revv svn_diff_fns_t and maybe other stuff I've changed in public API. > - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe > switch the implementation to read out entire lines before comparing > (like datasources_get_next_token does). > > Log message: > [[[ > Make svn_diff_diff skip identical prefix and suffix to make diff and blame > faster. > > * subversion/include/svn_diff.h > (svn_diff_fns_t): Added new function types datasources_open and > get_prefix_lines to the vtable. > > * subversion/libsvn_diff/diff_memory.c > (datasources_open): New function (does nothing). > (get_prefix_lines): New function (does nothing). > (svn_diff__mem_vtable): Added new functions datasources_open and > get_prefix_lines. > > * subversion/libsvn_diff/diff_file.c > (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4] > and suffix_offset_in_chunk. > (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions. > (find_identical_prefix, find_identical_suffix): New functions. > (datasources_open): New function, to open both datasources and find their > identical prefix and suffix. From the identical suffix, 50 lines are kept > to > help the diff algorithm find the nicest possible diff representation > in case of ambiguity. > (get_prefix_lines): New function. > (datasource_get_next_token): Stop at start of identical suffix. > (svn_diff__file_vtable): Added new functions datasources_open and > get_prefix_lines. > > * subversion/libsvn_diff/diff.h > (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that > the datasource was already opened. > > * subversion/libsvn_diff/token.c > (svn_diff__get_tokens): Added argument "datasource_opened". Only open the > datasource if datasource_opened is FALSE. Set the starting offset of the > position list to the number of prefix lines. > > * subversion/libsvn_diff/lcs.c > (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set > the offset of the sentinel position for EOF, even if one of the files > became empty after eliminating the identical prefix. > > * subversion/libsvn_diff/diff.c > (svn_diff__diff): Add a chunk of "common" diff for identical prefix. > (svn_diff_diff): Use new function datasources_open, to open original and > modified at once, and find their identical prefix and suffix. Pass > prefix_lines to svn_diff__lcs and to svn_diff__diff. > > * subversion/libsvn_diff/diff3.c > (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens. > Pass prefix_lines = 0 to svn_diff__lcs. > > * subversion/libsvn_diff/diff4.c > (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens. > Pass prefix_lines = 0 to svn_diff__lcs. > ]]] > > > Cheers,
Re-factor a group of array variables into an array of structures, and thereby simplify their usage. * subversion/libsvn_diff/diff_file.c (svn_diff__file_baton_t): Move a set of related array variables into a new sub-structure array. (datasource_open, datasource_get_next_token, token_compare): By taking a pointer to the appropriate element of the sub-structure, simplify all references to these variables. (svn_diff_file_diff_2, svn_diff_file_diff3_2, svn_diff_file_diff4_2): Adjust references. --This line, and those below, will be ignored-- Index: subversion/libsvn_diff/diff_file.c =================================================================== --- subversion/libsvn_diff/diff_file.c (revision 1006000) +++ subversion/libsvn_diff/diff_file.c (working copy) @@ -67,21 +67,24 @@ typedef struct svn_diff__file_token_t typedef struct svn_diff__file_baton_t { const svn_diff_file_options_t *options; - const char *path[4]; - apr_file_t *file[4]; - apr_off_t size[4]; + struct file_info { + const char *path; - int chunk[4]; - char *buffer[4]; - char *curp[4]; - char *endp[4]; + apr_file_t *file; + apr_off_t size; + + int chunk; + char *buffer; + char *curp; + char *endp; + + svn_diff__normalize_state_t normalize_state; + } files[4]; /* List of free tokens that may be reused. */ svn_diff__file_token_t *tokens; - svn_diff__normalize_state_t normalize_state[4]; - apr_pool_t *pool; } svn_diff__file_baton_t; @@ -203,21 +206,19 @@ static svn_error_t * datasource_open(void *baton, svn_diff_datasource_e datasource) { svn_diff__file_baton_t *file_baton = baton; - int idx; + struct file_info *file = &file_baton->files[datasource_to_index(datasource)]; apr_finfo_t finfo; apr_off_t length; char *curp; char *endp; - idx = datasource_to_index(datasource); - - SVN_ERR(svn_io_file_open(&file_baton->file[idx], file_baton->path[idx], + SVN_ERR(svn_io_file_open(&file->file, file->path, APR_READ, APR_OS_DEFAULT, file_baton->pool)); SVN_ERR(svn_io_file_info_get(&finfo, APR_FINFO_SIZE, - file_baton->file[idx], file_baton->pool)); + file->file, file_baton->pool)); - file_baton->size[idx] = finfo.size; + file->size = finfo.size; length = finfo.size > CHUNK_SIZE ? CHUNK_SIZE : finfo.size; if (length == 0) @@ -226,10 +227,10 @@ datasource_open(void *baton, svn_diff_da endp = curp = apr_palloc(file_baton->pool, (apr_size_t) length); endp += length; - file_baton->buffer[idx] = file_baton->curp[idx] = curp; - file_baton->endp[idx] = endp; + file->buffer = file->curp = curp; + file->endp = endp; - return read_chunk(file_baton->file[idx], file_baton->path[idx], + return read_chunk(file->file, file->path, curp, length, 0, file_baton->pool); } @@ -252,7 +253,7 @@ datasource_get_next_token(apr_uint32_t * { svn_diff__file_baton_t *file_baton = baton; svn_diff__file_token_t *file_token; - int idx; + struct file_info *file = &file_baton->files[datasource_to_index(datasource)]; char *endp; char *curp; char *eol; @@ -264,15 +265,13 @@ datasource_get_next_token(apr_uint32_t * *token = NULL; - idx = datasource_to_index(datasource); + curp = file->curp; + endp = file->endp; - curp = file_baton->curp[idx]; - endp = file_baton->endp[idx]; - - last_chunk = offset_to_chunk(file_baton->size[idx]); + last_chunk = offset_to_chunk(file->size); if (curp == endp - && last_chunk == file_baton->chunk[idx]) + && last_chunk == file->chunk) { return SVN_NO_ERROR; } @@ -289,8 +288,8 @@ datasource_get_next_token(apr_uint32_t * } file_token->datasource = datasource; - file_token->offset = chunk_to_offset(file_baton->chunk[idx]) - + (curp - file_baton->buffer[idx]); + file_token->offset = chunk_to_offset(file->chunk) + + (curp - file->buffer); file_token->raw_length = 0; file_token->length = 0; @@ -311,7 +310,7 @@ datasource_get_next_token(apr_uint32_t * } } - if (file_baton->chunk[idx] == last_chunk) + if (file->chunk == last_chunk) { eol = endp; break; @@ -320,21 +319,21 @@ datasource_get_next_token(apr_uint32_t * length = endp - curp; file_token->raw_length += length; svn_diff__normalize_buffer(&curp, &length, - &file_baton->normalize_state[idx], + &file->normalize_state, curp, file_baton->options); file_token->length += length; h = svn_diff__adler32(h, curp, length); - curp = endp = file_baton->buffer[idx]; - file_baton->chunk[idx]++; - length = file_baton->chunk[idx] == last_chunk ? - offset_in_chunk(file_baton->size[idx]) : CHUNK_SIZE; + curp = endp = file->buffer; + file->chunk++; + length = file->chunk == last_chunk ? + offset_in_chunk(file->size) : CHUNK_SIZE; endp += length; - file_baton->endp[idx] = endp; + file->endp = endp; - SVN_ERR(read_chunk(file_baton->file[idx], file_baton->path[idx], + SVN_ERR(read_chunk(file->file, file->path, curp, length, - chunk_to_offset(file_baton->chunk[idx]), + chunk_to_offset(file->chunk), file_baton->pool)); /* If the last chunk ended in a CR, we're done. */ @@ -349,7 +348,7 @@ datasource_get_next_token(apr_uint32_t * length = eol - curp; file_token->raw_length += length; - file_baton->curp[idx] = eol; + file->curp = eol; /* If the file length is exactly a multiple of CHUNK_SIZE, we will end up * with a spurious empty token. Avoid returning it. @@ -360,7 +359,7 @@ datasource_get_next_token(apr_uint32_t * { char *c = curp; svn_diff__normalize_buffer(&c, &length, - &file_baton->normalize_state[idx], + &file->normalize_state, curp, file_baton->options); file_token->norm_offset = file_token->offset; @@ -388,13 +387,12 @@ token_compare(void *baton, void *token1, char buffer[2][COMPARE_CHUNK_SIZE]; char *bufp[2]; apr_off_t offset[2]; - int idx[2]; + struct file_info *file[2]; apr_off_t length[2]; apr_off_t total_length; /* How much is left to read of each token from the file. */ apr_off_t raw_length[2]; int i; - int chunk[2]; svn_diff__normalize_state_t state[2]; file_token[0] = token1; @@ -420,17 +418,18 @@ token_compare(void *baton, void *token1, for (i = 0; i < 2; ++i) { - idx[i] = datasource_to_index(file_token[i]->datasource); + int idx = datasource_to_index(file_token[i]->datasource); + + file[i] = &file_baton->files[idx]; offset[i] = file_token[i]->norm_offset; - chunk[i] = file_baton->chunk[idx[i]]; state[i] = svn_diff__normalize_state_normal; - if (offset_to_chunk(offset[i]) == chunk[i]) + if (offset_to_chunk(offset[i]) == file[i]->chunk) { /* If the start of the token is in memory, the entire token is * in memory. */ - bufp[i] = file_baton->buffer[idx[i]]; + bufp[i] = file[i]->buffer; bufp[i] += offset_in_chunk(offset[i]); length[i] = total_length; @@ -458,15 +457,15 @@ token_compare(void *baton, void *token1, NULL, _("The file '%s' changed unexpectedly" " during diff"), - file_baton->path[idx[i]]); + file[i]->path); /* Read a chunk from disk into a buffer */ bufp[i] = buffer[i]; length[i] = raw_length[i] > COMPARE_CHUNK_SIZE ? COMPARE_CHUNK_SIZE : raw_length[i]; - SVN_ERR(read_chunk(file_baton->file[idx[i]], - file_baton->path[idx[i]], + SVN_ERR(read_chunk(file[i]->file, + file[i]->path, bufp[i], length[i], offset[i], file_baton->pool)); offset[i] += length[i]; @@ -623,8 +622,8 @@ svn_diff_file_diff_2(svn_diff_t **diff, memset(&baton, 0, sizeof(baton)); baton.options = options; - baton.path[0] = original; - baton.path[1] = modified; + baton.files[0].path = original; + baton.files[1].path = modified; baton.pool = svn_pool_create(pool); SVN_ERR(svn_diff_diff(diff, &baton, &svn_diff__file_vtable, pool)); @@ -645,9 +644,9 @@ svn_diff_file_diff3_2(svn_diff_t **diff, memset(&baton, 0, sizeof(baton)); baton.options = options; - baton.path[0] = original; - baton.path[1] = modified; - baton.path[2] = latest; + baton.files[0].path = original; + baton.files[1].path = modified; + baton.files[2].path = latest; baton.pool = svn_pool_create(pool); SVN_ERR(svn_diff_diff3(diff, &baton, &svn_diff__file_vtable, pool)); @@ -669,10 +668,10 @@ svn_diff_file_diff4_2(svn_diff_t **diff, memset(&baton, 0, sizeof(baton)); baton.options = options; - baton.path[0] = original; - baton.path[1] = modified; - baton.path[2] = latest; - baton.path[3] = ancestor; + baton.files[0].path = original; + baton.files[1].path = modified; + baton.files[2].path = latest; + baton.files[3].path = ancestor; baton.pool = svn_pool_create(pool); SVN_ERR(svn_diff_diff4(diff, &baton, &svn_diff__file_vtable, pool));