Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
On Thu, Dec 6, 2018 at 6:58 AM Phillip Wood wrote:

> > So is there some "must be at least two consecutive lines" condition for
> > not-plain, or is something else going on here?
>
> To be considered a block, it has to have 20 alphanumeric characters - see
> commit f0b8fb6e59 ("diff: define block by number of alphanumeric chars",
> 2017-08-15). This stops things like random '}' lines being marked as
> moved on their own.

This is spot on. All but the "plain" mode use the concept of "blocks"
of code (there is even one mode called "blocks", which adds to the
confusion).

> It might be better to use some kind of frequency
> information (a bit like python's difflib junk parameter) instead so that
> (fairly) unique short lines also get marked properly.

Yes, that is what I was initially thinking about. However, to have good
information you'd need to index a whole lot (the whole repository, i.e.
all text blobs in existence?) to get an accurate picture of frequency
information. I'd prefer to call it entropy, as I come from a background
familiar with https://en.wikipedia.org/wiki/Information_theory. I am
not sure where 'frequency information' comes from -- it sounds like the
same concept.

Of course it is too expensive to run an O(repository size) operation
just for this diff, so maybe we could get away with some smaller corpus
to build up this information on what is sufficient for coloring.

When only looking at the given diff, I would imagine that each line
would not carry a whole lot of information, as its characters occur
rather frequently compared to the rest of the diff.

Best,
Stefan
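
To make the entropy idea above a bit more concrete, here is a minimal
Python sketch. It is not anything git does today. It assumes the
"smaller corpus" is just a list of lines (here, a few lines such as one
might see in a diff), estimates character probabilities from it, and
scores a line by its self-information in bits. The names char_model and
line_information are made up for illustration.

```python
import math
from collections import Counter


def char_model(corpus_lines):
    """Estimate character probabilities from a corpus of lines."""
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def line_information(line, model):
    """Self-information of a line in bits: the sum of -log2 p(ch).

    Assumption: every character of `line` occurs in the corpus the
    model was built from; an unseen character raises KeyError.
    """
    return sum(-math.log2(model[ch]) for ch in line)


# Hypothetical corpus: the lines of the diff under consideration.
corpus = [
    '#include "cache.h"',
    '#include "config.h"',
    '#include "builtin.h"',
    '}',
    '}',
]
model = char_model(corpus)

for line in ('}', '#include "config.h"'):
    print(f"{line_information(line, model):7.2f} bits  {line}")
```

A lone '}' scores far lower than an #include line under this model, so
a threshold in bits could play the role the alphanumeric count plays
now. Whether a per-diff corpus gives a useful signal is exactly the
open question raised above.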
Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
Hi Ævar

On 06/12/2018 13:54, Ævar Arnfjörð Bjarmason wrote:
> Let's ignore how bad this patch is for git.git, and just focus on how
> diff.colorMoved treats it:
>
>     diff --git a/builtin/add.c b/builtin/add.c
>     index f65c172299..d1155322ef 100644
>     --- a/builtin/add.c
>     +++ b/builtin/add.c
>     @@ -6,5 +6,3 @@
>      #include "cache.h"
>     -#include "config.h"
>      #include "builtin.h"
>     -#include "lockfile.h"
>      #include "dir.h"
>     diff --git a/builtin/am.c b/builtin/am.c
>     index 8f27f3375b..eded15aa8a 100644
>     --- a/builtin/am.c
>     +++ b/builtin/am.c
>     @@ -6,3 +6,2 @@
>      #include "cache.h"
>     -#include "config.h"
>      #include "builtin.h"
>     diff --git a/builtin/blame.c b/builtin/blame.c
>     index 06a7163ffe..44a754f190 100644
>     --- a/builtin/blame.c
>     +++ b/builtin/blame.c
>     @@ -8,3 +8,2 @@
>      #include "cache.h"
>     -#include "config.h"
>      #include "color.h"
>     diff --git a/cache.h b/cache.h
>     index ca36b44ee0..ea8d60b94a 100644
>     --- a/cache.h
>     +++ b/cache.h
>     @@ -4,2 +4,4 @@
>      #include "git-compat-util.h"
>     +#include "config.h"
>     +#include "new.h"
>      #include "strbuf.h"
>
> This is a common thing that's useful to have highlighted, e.g. we move
> includes of config.h to some common file, so I want to see all the
> deleted config.h lines as moved into the cache.h line, the "lockfile.h"
> I removed while I was at it as a plain removal, and the new "new.h" as
> a plain addition.
>
> Exactly that is what you get with diff.colorMoved=plain, but the
> default of diff.colorMoved=zebra gets confused by this and highlights
> no moves at all, same for "blocks" and "dimmed-zebra".
>
> So at first I thought this had something to do with the many->one
> detection, but it seems to be simpler: we just don't detect a one-line
> move with anything but plain, e.g. this works as expected in all modes
> and detects the many->one:
>
>     diff --git a/builtin/add.c b/builtin/add.c
>     index f65c172299..f4fda75890 100644
>     --- a/builtin/add.c
>     +++ b/builtin/add.c
>     @@ -5,4 +5,2 @@
>       */
>     -#include "cache.h"
>     -#include "config.h"
>      #include "builtin.h"
>     diff --git a/builtin/branch.c b/builtin/branch.c
>     index 0c55f7f065..52e39924d3 100644
>     --- a/builtin/branch.c
>     +++ b/builtin/branch.c
>     @@ -7,4 +7,2 @@
>     -#include "cache.h"
>     -#include "config.h"
>      #include "color.h"
>     diff --git a/cache.h b/cache.h
>     index ca36b44ee0..d4146dbf8a 100644
>     --- a/cache.h
>     +++ b/cache.h
>     @@ -3,2 +3,4 @@
>     +#include "cache.h"
>     +#include "config.h"
>      #include "git-compat-util.h"
>
> So is there some "must be at least two consecutive lines" condition for
> not-plain, or is something else going on here?

To be considered a block, it has to have 20 alphanumeric characters -
see commit f0b8fb6e59 ("diff: define block by number of alphanumeric
chars", 2017-08-15). This stops things like random '}' lines being
marked as moved on their own.

It might be better to use some kind of frequency information (a bit
like python's difflib junk parameter) instead so that (fairly) unique
short lines also get marked properly.

Best Wishes

Phillip
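
The 20-character rule described above can be checked against the two
example diffs with a small Python sketch. This is an illustration only,
not git's code: the threshold of 20 is taken from the message above,
while the names MIN_ALNUM_COUNT, alnum_count and is_block are made up,
and the sketch assumes the count is taken over the moved lines without
their leading '-' or '+'.

```python
# Illustration only, not git's implementation. The value 20 comes from
# the message above; all names here are hypothetical.
MIN_ALNUM_COUNT = 20


def alnum_count(lines):
    """Count ASCII letters and digits across all lines of a candidate block."""
    return sum(
        1 for line in lines for ch in line if ch.isascii() and ch.isalnum()
    )


def is_block(lines, minimum=MIN_ALNUM_COUNT):
    """True if the candidate block has at least `minimum` alphanumerics."""
    return alnum_count(lines) >= minimum


one_line = ['#include "config.h"']
two_lines = ['#include "cache.h"', '#include "config.h"']

print(alnum_count(one_line), is_block(one_line))    # 14 False
print(alnum_count(two_lines), is_block(two_lines))  # 27 True
```

Under these assumptions the single config.h line has 14 alphanumeric
characters and falls short, while the cache.h plus config.h pair has 27
and qualifies, which matches the behaviour reported for the two diffs.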
A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
Let's ignore how bad this patch is for git.git, and just focus on how
diff.colorMoved treats it:

    diff --git a/builtin/add.c b/builtin/add.c
    index f65c172299..d1155322ef 100644
    --- a/builtin/add.c
    +++ b/builtin/add.c
    @@ -6,5 +6,3 @@
     #include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    -#include "lockfile.h"
     #include "dir.h"
    diff --git a/builtin/am.c b/builtin/am.c
    index 8f27f3375b..eded15aa8a 100644
    --- a/builtin/am.c
    +++ b/builtin/am.c
    @@ -6,3 +6,2 @@
     #include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    diff --git a/builtin/blame.c b/builtin/blame.c
    index 06a7163ffe..44a754f190 100644
    --- a/builtin/blame.c
    +++ b/builtin/blame.c
    @@ -8,3 +8,2 @@
     #include "cache.h"
    -#include "config.h"
     #include "color.h"
    diff --git a/cache.h b/cache.h
    index ca36b44ee0..ea8d60b94a 100644
    --- a/cache.h
    +++ b/cache.h
    @@ -4,2 +4,4 @@
     #include "git-compat-util.h"
    +#include "config.h"
    +#include "new.h"
     #include "strbuf.h"

This is a common thing that's useful to have highlighted, e.g. we move
includes of config.h to some common file, so I want to see all the
deleted config.h lines as moved into the cache.h line, the "lockfile.h"
I removed while I was at it as a plain removal, and the new "new.h" as
a plain addition.

Exactly that is what you get with diff.colorMoved=plain, but the
default of diff.colorMoved=zebra gets confused by this and highlights
no moves at all, same for "blocks" and "dimmed-zebra".

So at first I thought this had something to do with the many->one
detection, but it seems to be simpler: we just don't detect a one-line
move with anything but plain, e.g. this works as expected in all modes
and detects the many->one:

    diff --git a/builtin/add.c b/builtin/add.c
    index f65c172299..f4fda75890 100644
    --- a/builtin/add.c
    +++ b/builtin/add.c
    @@ -5,4 +5,2 @@
      */
    -#include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    diff --git a/builtin/branch.c b/builtin/branch.c
    index 0c55f7f065..52e39924d3 100644
    --- a/builtin/branch.c
    +++ b/builtin/branch.c
    @@ -7,4 +7,2 @@
    -#include "cache.h"
    -#include "config.h"
     #include "color.h"
    diff --git a/cache.h b/cache.h
    index ca36b44ee0..d4146dbf8a 100644
    --- a/cache.h
    +++ b/cache.h
    @@ -3,2 +3,4 @@
    +#include "cache.h"
    +#include "config.h"
     #include "git-compat-util.h"

So is there some "must be at least two consecutive lines" condition for
not-plain, or is something else going on here?