- **status**: review --> in-progress
- **Comment**:

The results here are great, including the repo refresh backend logic.  But 
this is several changes, some of them quite big, so naturally a good handful 
of tweaks are needed to polish it up: 

#### general
* Now the commit view doesn't show binary diffs, good.  But the table listing 
all the files still links the binary files, and those links don't go anywhere.
* Can you add a test for the `has_html_view` method's new functionality for 
fast binary detection?
* "refresh" logic is fast now too, yay!
* I guess this should be a separate ticket, but it'd be nice to sort by 
filename across all change types, instead of showing adds, then removes, etc.  
Maybe same ticket as displaying copies vs renames better.
* Down in the diff list, it says "File was copied or renamed."  We should be 
able to say exactly which now.
* A rename shows up as `{'new': u'README.txt', 'old': u'README', 'diff': '', 
'ratio': 1}` in the diff section and also says `Can't load diff`
    * Is it ok that we set diff to `''` in many places?
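
For the `has_html_view` test requested above, here is a minimal sketch of what a fast-binary-detection test could check. `looks_binary` and `CHUNK_SIZE` are hypothetical stand-ins, not Allura's actual code. The sketch assumes the fast path is a NUL-byte check on the first chunk of the file (similar to the heuristic git uses), so the real test would call `has_html_view` on a blob instead.

```python
import io

CHUNK_SIZE = 8000  # assumed chunk size; hypothetical, not taken from Allura


def looks_binary(stream, chunk_size=CHUNK_SIZE):
    """Return True if the first chunk of the stream contains a NUL byte."""
    return b'\0' in stream.read(chunk_size)


def test_text_is_not_binary():
    assert not looks_binary(io.BytesIO(b'plain text\nmore text\n'))


def test_nul_byte_means_binary():
    assert looks_binary(io.BytesIO(b'\x89PNG\r\n\x00\x00'))


def test_only_first_chunk_is_read():
    stream = io.BytesIO(b'a' * 100000)
    assert not looks_binary(stream)
    # The "fast" part: detection must not read the whole file.
    assert stream.tell() == CHUNK_SIZE
```

The last test is the one that pins down the new behaviour: it fails if detection goes back to reading the full contents.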

#### hg & svn
* The `[:]` slice would be better on the `for` loop than the `if` line, right?
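
To illustrate why the placement matters, here is a small sketch. It assumes the loop removes items from the list it is iterating over; the actual hg/svn loop isn't quoted here, so `drop_in_place_buggy` and `drop_in_place` are hypothetical names used only for the example.

```python
def drop_in_place_buggy(items, should_drop):
    # Iterates over the live list: removing an element shifts the rest left,
    # so the element right after a removed one is skipped.
    for item in items:
        if should_drop(item):
            items.remove(item)
    return items


def drop_in_place(items, should_drop):
    # `items[:]` on the `for` line iterates over a snapshot, so removals
    # from the original list cannot disturb the iteration.
    for item in items[:]:
        if should_drop(item):
            items.remove(item)
    return items


is_two = lambda x: x == 2
print(drop_in_place_buggy([1, 2, 2, 3], is_two))  # one 2 survives
print(drop_in_place([1, 2, 2, 3], is_two))        # both 2s removed
```

Copying on the `if` line makes a copy on every iteration and still iterates the list being mutated, so it neither fixes the skipping nor comes cheap.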

#### hg
* cleanup: move imports to top of file

#### git
* Testing with the walrustech repo, in the 2nd commit, only the `Flan` dir 
shows up as having changes.  Nothing is shown for `options.txt` or `bin/` or 
`mods/`, but they did have changes.  You can see this with `?limit=1000`.  And 
if you use the default limit, the pages at the end are all blank.
* I think we don't want to use `--find-copies-harder`
    * Performance-wise, on a big repo my timing measurement is 0m0.035s 
without it and 0m0.135s with it.  Noticeable but not huge.
    * A bigger impact is the semantics of it.  It can incorrectly report files 
as "copied" when their contents are merely common.  A very good example of 
common contents is no content at all, an empty file.  I've found a diff that 
says one `__init__.py` file was copied to another, but really it's just a new 
file.  And another file that is new but has a lot of test boilerplate, so git 
thinks it's a 56% similar copy.  Thus I think we should drop 
`--find-copies-harder`
* After doing a straight copy or rename in git and committing it, I get:

~~~~
File '/home/dbrondsema/dbrondsema-1019/forge/ForgeGit/forgegit/model/git_repo.py', line 682 in paged_diffs
  for i in xrange(0, result['total'] + 1, 2)]
IndexError: list index out of range
~~~~
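
One possible cause, offered as a guess since the body of the comprehension isn't shown: with `-z`, git's name-status output has three NUL-separated fields for a rename or copy (status, old path, new path) but only two for other changes, so stepping through the tokens two at a time can run off the end. Below is a minimal sketch of a parser that handles both shapes. `parse_name_status` and the dict keys are hypothetical names, not the existing `paged_diffs` code. It also keeps the status letter, which would let the diff list say "copied" or "renamed" exactly.

```python
def parse_name_status(raw):
    """Parse NUL-separated name-status tokens into a list of change dicts.

    Assumes `raw` holds only status/path tokens, e.g.
    'M\\0a.txt\\0R100\\0old.txt\\0new.txt\\0'.
    """
    tokens = [t for t in raw.split('\0') if t]
    changes = []
    i = 0
    while i < len(tokens):
        status = tokens[i]
        kind = status[0]
        # Renames and copies carry a similarity score and two paths.
        width = 3 if kind in ('R', 'C') else 2
        if i + width > len(tokens):
            raise ValueError('truncated name-status output at token %d' % i)
        if width == 3:
            changes.append({
                'type': 'renamed' if kind == 'R' else 'copied',
                'old': tokens[i + 1],
                'new': tokens[i + 2],
                'score': int(status[1:] or 0),
            })
        else:
            changes.append({'type': kind, 'path': tokens[i + 1]})
        i += width
    return changes


sample = 'M\0setup.py\0R100\0README\0README.txt\0C056\0a/tests.py\0b/tests.py\0'
for change in parse_name_status(sample):
    print(change)
```

Walking the tokens with a variable width, rather than a fixed step of 2, also turns truncated output into an explicit error instead of an `IndexError` deep inside a comprehension.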





---

**[tickets:#7925] Speed up diff processing with binary files**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** sf-2 sf-current performance 
**Created:** Mon Jul 13, 2015 03:04 PM UTC by Heith Seewald
**Last Updated:** Mon Jul 27, 2015 08:28 PM UTC
**Owner:** Heith Seewald


In a git repo with a large number of binary files, our diff processing can be 
very inefficient. We should test whether a file is binary and, if so, exclude 
it from the diff processing section.


---

Sent from forge-allura.apache.org because [email protected] is subscribed 
to https://forge-allura.apache.org/p/allura/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://forge-allura.apache.org/p/allura/admin/tickets/options.  Or, if this is 
a mailing list, you can unsubscribe from the mailing list.