Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> @@ -272,6 +272,84 @@ few exceptions.  Even though...
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
> +  versions (as of March 2018) do not support the `working-tree-encoding`
> +  attribute. If you decide to use the `working-tree-encoding` attribute
> +  in your repository, then it is strongly recommended to ensure that all
> +  clients working with the repository support it.
> +
> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
> +  stored as UTF-8 internally. A client without `working-tree-encoding`
> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
> +  typically cause trouble for the users of this file.

The above paragraph is giving an example of the scenario described in
the paragraph above it. It might, therefore, be clearer to start this
paragraph with "For example,...". Also, as an aid to non-Windows
developers, it might be helpful to introduce ".proj" files as
"Microsoft Visual Studio `*.proj` files...".

> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
> +static struct encoding {
> +   const char *name;
> +   struct encoding *next;
> +} *encoding, **encoding_tail;

Why does this struct need to be a linked-list since it contains only a
name? In fact, why do the struct and linked list exist at all? Back in
v1 when the struct held more members, it made sense to place them in a
collection so they could be re-used, but today it just seems an
unnecessary artifact which buys you nothing.

Is the plan that some future patch series will add members to the
struct? If not, then it seems that it should be retired altogether.

> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +struct strbuf *buf, struct encoding *enc, int 
> conv_flags)
> +{
> +   [...]
> +   if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +   const char *error_msg = _(
> +   "BOM is prohibited in '%s' if encoded as %s");
> +   /*
> +* This advise is shown for UTF-??BE and UTF-??LE encodings.

s/advise/advice/

> +* We truncate the encoding name to 6 chars with %.6s to cut
> +* off the last two "byte order" characters.
> +*/
> +   const char *advise_msg = _(
> +   "The file '%s' contains a byte order mark (BOM). "
> +   "Please use %.6s as working-tree-encoding.");
> +   advise(advise_msg, path, enc->name);
> +   if (conv_flags & CONV_WRITE_OBJECT)
> +   die(error_msg, path, enc->name);
> +   else
> +   error(error_msg, path, enc->name);

What is the intended behavior in the non-die() case? As implemented,
this is still going to attempt the conversion even though it threw an
error. Should it be returning 0 here instead? Same question for the
couple additional cases below.

> +
> +   } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
> +   const char *error_msg = _(
> +   "BOM is required in '%s' if encoded as %s");
> +   const char *advise_msg = _(
> +   "The file '%s' is missing a byte order mark (BOM). "
> +   "Please use %sBE or %sLE (depending on the byte 
> order) "
> +   "as working-tree-encoding.");
> +   advise(advise_msg, path, enc->name, enc->name);
> +   if (conv_flags & CONV_WRITE_OBJECT)
> +   die(error_msg, path, enc->name);
> +   else
> +   error(error_msg, path, enc->name);
> +   }

For a function which presumably is agnostic about the working-tree
encoding (supporting whatever iconv does), it nevertheless has
unusually intimate knowledge about UTF BOMs, in general, and the

Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> is_missing_required_utf_bom() function returns true if a required BOM
> is missing.
>
> The Unicode standard instructs to assume big-endian if there in no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/utf8.h b/utf8.h
> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
> position, unsigned int wid
> +/*
> + * If the endianness is not defined in the encoding name, then we
> + * require a BOM. The function returns true if a required BOM is missing.
> + *
> + * The Unicode standard instructs to assume big-endian if there
> + * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
> + * encoding standard used in HTML5 recommends to assume
> + * little-endian to "deal with deployed content" [3].

Perhaps you could tack on to the comment here the final bit of
explanation from the commit message which ties these conflicting
recommendations together. In particular:

Therefore, strictly requiring a BOM seems to be the
safest option for content in Git.

> + */
> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t 
> len);


Re: [PATCH] t/known-leaky: add list of known-leaky test scripts

2018-02-24 Thread Kaartic Sivaraam
On Wednesday 21 February 2018 02:14 AM, Martin Ågren wrote:
> To sum up: I probably won't be looking into Travis-ing such a blacklist
> in the near future.
> 

Just thinking out loud, how about having a white-list instead of a
black-list and using it to run only those tests in the white list.
Something like,

t/white_list

t
t0001

To run
--

cd t/
for test in $(cat white_list)
do
white_list_tests="$white_list_tests $test*"
done
make SANITIZE=leak $white_list_tests



-- 
Kaartic

QUOTE
---

“The most valuable person on any team is the person who makes everyone
else on the team more valuable, not the person who knows the most.”

  - Joel Spolsky



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-24 Thread Eric Sunshine
On Sat, Feb 24, 2018 at 11:27 AM,   wrote:
> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
> or UTF-32LE a BOM must not be used [1]. The function returns true if
> this is the case.
>
> [1] http://unicode.org/faq/utf_bom.html#bom10
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/utf8.c b/utf8.c
> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +   return (
> + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
> + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +   ) || (
> + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
> + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +   );
> +}

Is this interpretation correct? When I read [1], I interpret it as
saying that no BOM _of any sort_ should be present when the encoding
is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This
code, on the other hand, only checks for BOMs corresponding to the
declared size (16 or 32 bits).

I suppose the intention of [1] is to detect a mismatch between the
declared encoding and how the stream is actually encoded. The check
implemented here will fail to detect a mismatch between, say, declared
encoding UTF-16BE and actual encoding UTF-32BE.


Re: [PATCH v2 5/5] Revert "repository: pre-initialize hash algo pointer"

2018-02-24 Thread Duy Nguyen
On Sun, Feb 25, 2018 at 5:58 AM, brian m. carlson
 wrote:
>> diff --git a/repository.c b/repository.c
>> index 4ffbe9bc94..0d715f4fdb 100644
>> --- a/repository.c
>> +++ b/repository.c
>> @@ -5,7 +5,7 @@
>>
>>  /* The main repository */
>>  static struct repository the_repo = {
>> - NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, _index, 
>> _algos[GIT_HASH_SHA1], 0, 0
>> + NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, _index, 
>> NULL, 0, 0
>
> I'm wondering, now that you have the name field for the unknown value,
> if that might be a better choice here than NULL.  I don't have a strong
> preference either way, so whatever you decide here is fine.

I did try that first, but for the purpose of catching uninitialized
algo use, NULL is better.

When I set unknown hash algo here, I think some test failed
mysteriously because it used rawsz field (which has value zero), it
didn't match some expectation, the code went to the error handling
path, which eventually failed with some error message, but it's not
obvious that the problem was rawsz being zero and back tracking that
took me some time.

With NULL hash_algo, any dereferencing fails immediately with a nice
stack trace. Another reason to push me towards NULL hash algo is, even
if we prefer nice messages over segmentation faults, we can't avoid it
completely anyway (empty_tree and empty_blob are still NULL in unknown
hash algo and will cause segfaults). Might as well make things
consistent and always segfault.
-- 
Duy


Re: [PATCH] revision.c: reduce object database queries

2018-02-24 Thread Derrick Stolee

On 2/24/2018 8:34 PM, Derrick Stolee wrote:

In mark_parents_uninteresting(), we check for the existence of an
object file to see if we should treat a commit as parsed. The result
is to set the "parsed" bit on the commit.

Modify the condition to only check has_object_file() if the result
would change the parsed bit.

When a local branch is different from its upstream ref, "git status"
will compute ahead/behind counts. This uses paint_down_to_common()
and hits mark_parents_uninteresting(). On a copy of the Linux repo
with a local instance of "master" behind the remote branch
"origin/master" by ~60,000 commits, we find the performance of
"git status" went from 1.42 seconds to 1.32 seconds, for a relative
difference of -7.0%.


I do want to add that the exact same "git status" performance test with 
the commit-graph feature goes from 0.29 seconds to 0.21 seconds for a 
relative change of -27.5%.


Thanks,

-Stolee




[PATCH] revision.c: reduce object database queries

2018-02-24 Thread Derrick Stolee
In mark_parents_uninteresting(), we check for the existence of an
object file to see if we should treat a commit as parsed. The result
is to set the "parsed" bit on the commit.

Modify the condition to only check has_object_file() if the result
would change the parsed bit.

When a local branch is different from its upstream ref, "git status"
will compute ahead/behind counts. This uses paint_down_to_common()
and hits mark_parents_uninteresting(). On a copy of the Linux repo
with a local instance of "master" behind the remote branch
"origin/master" by ~60,000 commits, we find the performance of
"git status" went from 1.42 seconds to 1.32 seconds, for a relative
difference of -7.0%.

Signed-off-by: Derrick Stolee 
---
 revision.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 5ce9b93..bc7def5 100644
--- a/revision.c
+++ b/revision.c
@@ -113,7 +113,8 @@ void mark_parents_uninteresting(struct commit *commit)
 * it is popped next time around, we won't be trying
 * to parse it and get an error.
 */
-   if (!has_object_file(>object.oid))
+   if (!commit->object.parsed &&
+   !has_object_file(>object.oid))
commit->object.parsed = 1;
 
if (commit->object.flags & UNINTERESTING)
-- 
2.7.4



Re: [PATCH v2 5/5] Revert "repository: pre-initialize hash algo pointer"

2018-02-24 Thread brian m. carlson
On Sat, Feb 24, 2018 at 10:34:29AM +0700, Nguyễn Thái Ngọc Duy wrote:
> This reverts commit e26f7f19b6c7485f04234946a59ab8f4fd21d6d1. The root
> problem, git clone not setting up the_hash_algo, has been fixed in the
> previous patch.
> 
> Since this is a dangerous move and could potentially break stuff after
> release (and leads to workaround like the reverted commit), the
> workaround technically remains, but is hidden behind a new environment
> variable GIT_HASH_FIXUP. This should let the users continue to use git
> while we fix the problem. This variable can be deleted after one or two
> releases.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  common-main.c| 10 ++
>  repository.c |  2 +-
>  t/helper/test-dump-split-index.c |  2 ++
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/common-main.c b/common-main.c
> index 6a689007e7..fbfa98c3b8 100644
> --- a/common-main.c
> +++ b/common-main.c
> @@ -1,6 +1,7 @@
>  #include "cache.h"
>  #include "exec_cmd.h"
>  #include "attr.h"
> +#include "repository.h"
>  
>  /*
>   * Many parts of Git have subprograms communicate via pipe, expect the
> @@ -40,5 +41,14 @@ int main(int argc, const char **argv)
>  
>   restore_sigpipe_to_default();
>  
> + /*
> +  * Temporary recovery measure while hashing code is being
> +  * refactored. This variable should be gone after everybody
> +  * has used the_hash_algo in one or two releases and nobody
> +  * complains anything.
> +  */
> + if (getenv("GIT_HASH_FIXUP"))
> + repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
> +
>   return cmd_main(argc, argv);
>  }
> diff --git a/repository.c b/repository.c
> index 4ffbe9bc94..0d715f4fdb 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -5,7 +5,7 @@
>  
>  /* The main repository */
>  static struct repository the_repo = {
> - NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, _index, 
> _algos[GIT_HASH_SHA1], 0, 0
> + NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, _index, NULL, 
> 0, 0

I'm wondering, now that you have the name field for the unknown value,
if that might be a better choice here than NULL.  I don't have a strong
preference either way, so whatever you decide here is fine.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH v2 3/5] index-pack: check (and optionally set) hash algo based on input file

2018-02-24 Thread brian m. carlson
On Sat, Feb 24, 2018 at 10:34:27AM +0700, Nguyễn Thái Ngọc Duy wrote:
> After 454253f059 (builtin/index-pack: improve hash function abstraction
> - 2018-02-01), index-pack uses the_hash_algo for hashing. If "git
> index-pack" is executed without a repository, we do not know what hash
> algorithm to be used and the_hash_algo in theory could be undefined.
> 
> Since there should be some information about the hash algorithm in the
> input pack file, we can initialize the correct hash algorithm with that
> if the_hash_algo is not yet initialized. This assumes that pack files
> with new hash algorithm MUST step up pack version.
> 
> While at there, make sure the hash algorithm requested by the pack file
> and configured by the repository (if we're running with a repo) are
> consistent.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  builtin/index-pack.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 7e3e1a461c..1144458140 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -326,10 +326,31 @@ static const char *open_pack_file(const char *pack_name)
>   output_fd = -1;
>   nothread_data.pack_fd = input_fd;
>   }
> - the_hash_algo->init_fn(_ctx);
>   return pack_name;
>  }
>  
> +static void prepare_hash_algo(uint32_t pack_version)
> +{
> + const struct git_hash_algo *pack_algo;
> +
> + switch (pack_version) {
> + case 2:
> + case 3:
> + pack_algo = _algos[GIT_HASH_SHA1];
> + break;
> + default:
> + die("BUG: how to determine hash algo for new version?");
> + }
> +
> + if (!the_hash_algo) /* running without repo */
> + the_hash_algo = pack_algo;
> +
> + if (the_hash_algo != pack_algo)
> + die(_("incompatible hash algorithm, "
> +   "configured for %s but the pack file needs %s"),
> + the_hash_algo->name, pack_algo->name);

I like this.  It's a nice improvement and it should be easy for us to
pass additional information into the function when our pack format
understands multiple algorithms.

I might have done the comparison using the format_id members instead of
the pointers themselves, but that's more a personal preference than
anything.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH v2 2/5] sha1_file.c: keep a good name for "unknown" hash_algos[UNKNOWN]

2018-02-24 Thread brian m. carlson
On Sat, Feb 24, 2018 at 10:34:26AM +0700, Nguyễn Thái Ngọc Duy wrote:
> This is mostly for displaying the hash algorithm name when we report
> errors. Printing "unknown" with '%s' is much better than '(null)' in
> glibc printf version (and probably could crash if other implementations
> do not check for NULL pointer)

This is a good change.  Using a NULL pointer with %s on the BSDs
definitely did cause a crash, last I checked.

I looked at the comment I wrote in hash.h, and based on that, it is
indeed very likely that someone would accidentally write code this way.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH v2 4/5] diff.c: initialize hash algo when running in --no-index mode

2018-02-24 Thread brian m. carlson
On Sat, Feb 24, 2018 at 09:36:03PM +0700, Duy Nguyen wrote:
> On Sat, Feb 24, 2018 at 10:34 AM, Nguyễn Thái Ngọc Duy
>  wrote:
> > @@ -3995,6 +3995,18 @@ static void run_diff(struct diff_filepair *p, struct 
> > diff_options *o)
> > return;
> > }
> >
> > +   /*
> > +* NEEDSWORK: When running in no-index mode (and no repo is
> > +* found, thus no hash algo conifugred), fall back to SHA-1
> > +* hashing (which is used by diff_fill_oid_info below) to
> > +* avoid regression in diff output.
> > +*
> > +* In future, perhaps we can allow the user to specify their
> > +* hash algorithm from command line in this mode.
> > +*/
> > +   if (o->flags.no_index && !the_hash_algo)
> > +   the_hash_algo = _algos[GIT_HASH_SHA1];
> 
> Brian, are we supposed to use the_hash_algo this way (i.e. as a
> writable var)? Or should I stick to something like
> 
> repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
> 
> which allows us to notify other parts inside struct repository about
> the hash algorithm change, if we ever need to?

I would definitely recommend using the function.  As you pointed out, it
makes our code future-proof against needing to more work when setting
the value.

> If the_hash_algo is supposed to be read-only, maybe I should convert
> that macro to an inline function to prevent people from accidentally
> reassigning it?

You could if you want to, although I don't really see a need to, since
we can just catch it in review.  If you wanted to, I'd make it an inline
function for performance reasons.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [PATCH] Call timegm and timelocal with 4-digit year

2018-02-24 Thread Ævar Arnfjörð Bjarmason

On Fri, Feb 23 2018, Bernhard M. Wiedemann jotted:

> amazingly timegm(gmtime(0)) is only 0 before 2020
> because perl's timegm deviates from GNU timegm(3) in how it handles years.
>
> man Time::Local says
>
>  Whenever possible, use an absolute four digit year instead.
>
> with a detailed explanation about ambiguity of 2-digit years above that.
>
> Even though this ambiguity is error-prone with >50% of users getting it
> wrong, it has been like this for 20+ years, so we just use 4-digit years
> everywhere to be on the safe side.
>
> We add some extra logic to cvsimport because it allows 2-digit year
> input and interpreting an 18 as 1918 can be avoided easily and safely.
>
> Signed-off-by: Bernhard M. Wiedemann 
> ---
>  contrib/examples/git-svnimport.perl | 2 +-
>  git-cvsimport.perl  | 4 +++-
>  perl/Git.pm | 4 +++-
>  perl/Git/SVN.pm | 2 +-
>  4 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/contrib/examples/git-svnimport.perl 
> b/contrib/examples/git-svnimport.perl
> index c414f0d9c..75a43e23b 100755
> --- a/contrib/examples/git-svnimport.perl
> +++ b/contrib/examples/git-svnimport.perl
> @@ -238,7 +238,7 @@ sub pdate($) {
>   my($d) = @_;
>   $d =~ m#(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)#
>   or die "Unparseable date: $d\n";
> - my $y=$1; $y-=1900 if $y>1900;
> + my $y=$1; $y+=1900 if $y<1000;
>   return timegm($6||0,$5,$4,$3,$2-1,$y);

I wonder if this whole thing was just cargo-culted to begin with. We
need to match (\d\d\d\d) here, so did SVN's format ever have years like
"0098" (just "98" wouldn't match), so I suspect the whole munging could
be dropped, but this change seems harmless. Just something that jumped
out at me reviewing this.

> diff --git a/git-cvsimport.perl b/git-cvsimport.perl
> index 2d8df8317..b31613cb8 100755
> --- a/git-cvsimport.perl
> +++ b/git-cvsimport.perl
> @@ -601,7 +601,9 @@ sub pdate($) {
>   my ($d) = @_;
>   m#(\d{2,4})/(\d\d)/(\d\d)\s(\d\d):(\d\d)(?::(\d\d))?#
>   or die "Unparseable date: $d\n";
> - my $y=$1; $y-=1900 if $y>1900;
> + my $y=$1;
> + $y+=100 if $y<70;
> + $y+=1900 if $y<1000;
>   return timegm($6||0,$5,$4,$3,$2-1,$y);
>  }

My Time::Local 1.2300 on perl 5.024001 currently interprets "69" here as
1969, but after this it'll be 2069.

Now I doubt anyone's going to be importing CVS history of projects a
little over 20 years before CVS was created in 1990 (although I suppose
old imports...), but just wanted to note it since it seems odd for code
that's auto-interpreting double digit years for the purposes of
importing existing data to end up in an edge case where it returns dates
more than 50 years in the future.

> diff --git a/perl/Git.pm b/perl/Git.pm
> index ffa09ace9..df62518c7 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -534,7 +534,9 @@ If TIME is not supplied, the current local time is used.
>  sub get_tz_offset {
>   # some systems don't handle or mishandle %z, so be creative.
>   my $t = shift || time;
> - my $gm = timegm(localtime($t));
> + my @t = localtime($t);
> + $t[5] += 1900;
> + my $gm = timegm(@t);
>

Nice. Just using the 4-digit date is always more correct and won't ever
be buggy.

>   my $sign = qw( + + - )[ $gm <=> $t ];
>   return sprintf("%s%02d%02d", $sign, (gmtime(abs($t - $gm)))[2,1]);
>  }
> diff --git a/perl/Git/SVN.pm b/perl/Git/SVN.pm
> index bc4eed3d7..991a5885e 100644
> --- a/perl/Git/SVN.pm
> +++ b/perl/Git/SVN.pm
> @@ -1405,7 +1405,7 @@ sub parse_svn_date {
>   $ENV{TZ} = 'UTC';
>
>   my $epoch_in_UTC =
> - Time::Local::timelocal($S, $M, $H, $d, $m - 1, $Y - 1900);
> + Time::Local::timelocal($S, $M, $H, $d, $m - 1, $Y);

Ditto. Nicely caught.

>
>   # Determine our local timezone (including DST) at the
>   # time of $epoch_in_UTC.  $Git::SVN::Log::TZ stored the

Anyway, this all looks good to me as-is. That CVS edge case is obscure
and not worth focusing on, and the SVN one could be fixed up in another
commit if anyone cared.

I just spent a bit more time than I should have wondering what this
timegm() edge case was about and whether it might impact other
(unrelated to git) code I had.


Re: [PATCH v5 00/17] document & test fetch pruning & add fetch.pruneTags

2018-02-24 Thread Ævar Arnfjörð Bjarmason

On Fri, Feb 23 2018, Junio C. Hamano jotted:

> Ævar Arnfjörð Bjarmason  writes:
>
>>> Let's see how others find it useful and/or if the changed code gets
>>> in the way of others (I am not absolutely sure if the changes are
>>> free of regression to existing users who do not use the new
>>> feature).
>>
>> I think if you're on the fence about merging it down (and others don't
>> chime in saying the want it / like it) it makes sense to merge down
>> 1-14/17 and we could discard 15-17/17 for now for a later re-submission
>> and discussion once the earlier part of the series lands in master.
>
> Or we merge everything and let people discover glitches and
> complain.  Reverting the last pieces can wait until then ;-).

Thanks, that obviously works for me.


[PATCH] git-gui: search for all current SSH key types

2018-02-24 Thread Beat Bolli
OpenSSH has supported Ed25519 keys since version 6.4 (2014-01-30), and
ECDSA keys since version 5.7 (2011-01-24). git-gui fails to find these
key types in its Help/Show SSH Key dialog.

Teach git-gui to show Ed25519 and ECDSA keys as well.

This was originally reported in
https://github.com/git-for-windows/git/issues/1487 and subseqently in
https://public-inbox.org/git/f65780f29e48994380e2bce87c6f071101146...@deerlm99ex2msx.ww931.my-it-solutions.net/

Signed-off-by: Beat Bolli 
---
Cc: Alexander Gavrilov 
Cc: Pat Thoyts 

 git-gui/lib/sshkey.tcl | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/git-gui/lib/sshkey.tcl b/git-gui/lib/sshkey.tcl
index aa6457bbb..589ff8f78 100644
--- a/git-gui/lib/sshkey.tcl
+++ b/git-gui/lib/sshkey.tcl
@@ -2,7 +2,10 @@
 # Copyright (C) 2006, 2007 Shawn Pearce
 
 proc find_ssh_key {} {
-   foreach name {~/.ssh/id_dsa.pub ~/.ssh/id_rsa.pub ~/.ssh/identity.pub} {
+   foreach name {
+   ~/.ssh/id_dsa.pub ~/.ssh/id_ecdsa.pub ~/.ssh/id_ed25519.pub
+   ~/.ssh/id_rsa.pub ~/.ssh/identity.pub
+   } {
if {[file exists $name]} {
set fh[open $name r]
set cont  [read $fh]
-- 
2.15.0.rc1.299.gda03b47c3



[PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-02-24 Thread lars . schneider
From: Lars Schneider 

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider 
---
 Documentation/config.txt |  6 
 Documentation/gitattributes.txt  |  8 +
 config.c |  5 +++
 convert.c| 74 
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 41 ++
 7 files changed, 136 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma separated list of encodings that Git performs UTF-8 round
+   trip checks on if they are used in an `working-tree-encoding`
+   attribute (see linkgit:gitattributes[5]). The default value is
+   `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index eddeaee1f7..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -310,6 +310,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index c4e2fd5fa5..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release();
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of space and/or
+* comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next = found + strlen(enc_name);
+   int len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   found > check_roundtrip_encoding &&
+   (*(found-1) == ' ' || *(found-1) == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (
+   next < check_roundtrip_encoding + len &&
+   (*next == ' ' || *next == ',')
+   

[PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-02-24 Thread lars . schneider
From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
---
 utf8.c | 13 +
 utf8.h | 16 
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  !strcmp(enc, "UTF-16") &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  !strcmp(enc, "UTF-32") &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1



[PATCH v8 6/7] convert: add tracing for 'working-tree-encoding' attribute

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider 
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index d20c341b6d..c4e2fd5fa5 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static struct encoding {
const char *name;
struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
error(error_msg, path, enc->name);
}
 
+   trace_encoding("source", path, enc->name, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc->name,
  _len);
if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
else
error(msg, path, enc->name, default_encoding);
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index a55bbcb6d3..869b2d 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test repo' '
git config core.eol lf &&
 
-- 
2.16.1



[PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider 
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.1



[PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  78 ++
 convert.c| 157 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 226 +++
 5 files changed, 462 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..eddeaee1f7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,84 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.proj text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.proj text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.proj
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..d20c341b6d 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when 

[PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider 
---
 utf8.c | 24 
 utf8.h |  9 +
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1



[PATCH v8 2/7] strbuf: add xstrdup_toupper()

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider 
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.1



[PATCH v8 0/7] convert: add support for different encodings

2018-02-24 Thread lars . schneider
From: Lars Schneider 

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v7:

* make it clearer in the documentation that Git stores content "as-is"
  by default. Content is only stored in UTF-8 if w-t-e is used (Junio)
* add test case for $GIT_DIR/info/attributes support (Junio)

Thanks,
Lars

  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2758a2da29
Checkout: git fetch https://github.com/larsxschneider/git encoding-v8 && git 
checkout 2758a2da29


### Interdiff (v7..v8):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 10cb37795d..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -275,11 +275,11 @@ few exceptions.  Even though...
 `working-tree-encoding`
 ^^^

-Git recognizes files encoded with ASCII or one of its supersets (e.g.
-UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
-interpreted as binary and consequently built-in Git text processing
-tools (e.g. 'git diff') as well as most Git web front ends do not
-visualize the content.
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.

 In these cases you can tell Git the encoding of a file in the working
 directory with the `working-tree-encoding` attribute. If a file with this
@@ -291,12 +291,24 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Third party Git implementations that do not support the
-  `working-tree-encoding` attribute will checkout the respective files
-  UTF-8 encoded and not in the expected encoding. Consequently, these
-  files will appear different which typically causes trouble. This is
-  in particular the case for older Git versions and alternative Git
-  implementations such as JGit or libgit2 (as of February 2018).
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e4717402a5..e34c21eb29 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,8 +13,11 @@ test_expect_success 'setup test repo' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
printf "$text" >test.utf8.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+   printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
cp test.utf16.raw test.utf16 &&
+   cp test.utf32.raw test.utf32 &&

+   # Add only UTF-16 file, we will add the UTF-32 file later
git add 

Re: [PATCH v7 0/7] convert: add support for different encodings

2018-02-24 Thread Lars Schneider

> On 23 Feb 2018, at 21:11, Junio C Hamano  wrote:
> 
> Junio C Hamano  writes:
> 
>> Lars Schneider  writes:
>> 
>>> I still think it would be nice to see diffs for arbitrary encodings.
>>> Would it be an option to read the `encoding` attribute and use it in
>>> `git diff`?
>> 
>> Reusing that gitk-only thing and suddenly start doing so would break
>> gitk users, no?  The tool expects the diff to come out encoded in
>> the encoding that is specified by that attribute (which is learned
>> from get_path_encoding helper) and does its thing.
>> 
>> I guess that gitk uses diff-tree plumbing and you won't be applying
>> this change to the plumbing, perhaps?  If so, it might not be too
>> bad, but those who decided to postprocess "git diff" output (instead
>> of "git diff-tree" output) mimicking how gitk does it by thinking
>> that is the safe and sane thing to do will be broken by such a
>> change.  You could do "use the encoding only when a command line
>> option says so", but then people will add a configuration variable
>> to turn it always on and these existing scripts will be broken.
>> 
>> I do not personally have much sympathy for the last case (i.e. those
>> who scripted around 'git diff' instead of 'git diff-tree' to get
>> broken), so making the new feature only work with the Porcelain "git
>> diff" might be an option.  I'll need a bit more time to formulate
>> the rest of my thought ;-)
> 
> So we are introducing in this series a way to say in what encoding
> the things should be placed in the working tree files (i.e. the
> w-t-e attribute attached to the paths).  Currently there is no
> mechanism to say what encoding the in-repo contents are and UTF-8 is
> assumed when conversion from/to w-t-e is required, but there is no
> fundamental reason why it shouldn't be customizable (if anything, as
> a piece of fact on the in-repo data, in-repo-encoding is *more*
> appropriate to be an attribute than w-t-e that can merely be project
> preference at best, as I mentioned earlier in this thread).

Correct.


> We always use the in-repo contents when generating 'diff'.  I think
> by "attribute to be used in diff", what you are reallying after is
> to convert the in-repo contents to that encoding _BEFORE_ running
> 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> the place will not diff well with the xdiff machinery, but if you
> first convert it to UTF-8 and have xdiff work on it, you can get
> reasonable result out of it.  It is unclear what encoding you want
> your final diff output in (it is equally valid in such a set-up to
> desire your patch output in UTF-16 or UTF-8), but assuming that you
> want UTF-8 in your patch output, perhaps we do not have to break
> gitk users by hijacking the 'encoding' attribute.  Instead what you
> want is a single bit that says between in-repo or working tree which
> representation should be given to the xdiff machinery.

I fear that we could confuse users with an additional knob/bit that
defines what we diff against. Git always diff'ed against in-repo 
content and I feel it should stay that way.

However, I agree with your earlier emails that "working-tree-encoding"
is just one half of the feature. I also think it would be nice to be
able to define the "in-repo-encoding" as well. Then we could define 
something like that:

*.foo   text in-repo-encoding=UTF-16LE

This tells Git that the file is stored as UTF-16LE. This would help Git
generating a diff via UTF-8 conversion. I feel that the final patch 
should be in UTF-16LE again. Maybe over time we could then deprecate the
"encoding" attribute as the "in-repo-encoding" attribute serves a similar 
purpose (maybe gitk can switch to it).

In that case we could also do things like that:

*.bar   text working-tree-encoding=SHIFT-JIS 
in-repo-encoding=UTF-16LE

SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin.
On checkout the opposite would happen. This way we would lift the
"UTF-8 is the only in-repo encoding" limitation of the current w-t-e
implementation.

Does this sound sensible to you? That being said, I think "in-repo-encoding"
would deserve an own series.

- Lars


Re: [PATCHv4 00/27] Moving global state into the repository object (part 1)

2018-02-24 Thread Duy Nguyen
I notice that there are a few global state (or configuration rather)
left after this: packed_git_window_size, packed_git_limit and
delta_base_cache_limit. These can be changed by $GIT_DIR/config, but
since it's still global, a submodule repository will use the same
settings of its super repository. If $SUBMODULE/config changes any of
these, they are ignored.

The natural thing to do is move these to raw_object_store too (and
repo_submodule_init needs to check submodule's config file). But one
may argue these should be per-process instead of per-repo though. I
don't know. But I thought I should mention this.
-- 
Duy


Re: [PATCH v2 4/5] diff.c: initialize hash algo when running in --no-index mode

2018-02-24 Thread Duy Nguyen
On Sat, Feb 24, 2018 at 10:34 AM, Nguyễn Thái Ngọc Duy
 wrote:
> @@ -3995,6 +3995,18 @@ static void run_diff(struct diff_filepair *p, struct 
> diff_options *o)
> return;
> }
>
> +   /*
> +* NEEDSWORK: When running in no-index mode (and no repo is
> +* found, thus no hash algo conifugred), fall back to SHA-1
> +* hashing (which is used by diff_fill_oid_info below) to
> +* avoid regression in diff output.
> +*
> +* In future, perhaps we can allow the user to specify their
> +* hash algorithm from command line in this mode.
> +*/
> +   if (o->flags.no_index && !the_hash_algo)
> +   the_hash_algo = _algos[GIT_HASH_SHA1];

Brian, are we supposed to use the_hash_algo this way (i.e. as a
writable var)? Or should I stick to something like

repo_set_hash_algo(the_repository, GIT_HASH_SHA1);

which allows us to notify other parts inside struct repository about
the hash algorithm change, if we ever need to?

If the_hash_algo is supposed to be read-only, maybe I should convert
that macro to an inline function to prevent people from accidentally
reassigning it?

> +
> diff_fill_oid_info(one);
> diff_fill_oid_info(two);
>
> --
> 2.16.1.435.g8f24da2e1a
>
-- 
Duy


Re: [GSoC][PATCH v2] ref-filter: Make "--contains " less chatty if is invalid

2018-02-24 Thread Paul-Sebastian Ungureanu
Hello,

Your proposed solution makes a lot more sense. I have actually
considered a solution similar to this (the third solution from [1]),
but found it more complicated. I did not account for the fact that
once a callback is called, the user is already aware of the available
options and the user only supplied an invalid argument value.

I also have to make sure that all parsers (all callbacks and standard
ones, for integer, filename, etc.) are already printing errors
appropriately. Otherwise, some commands may fail and the user will not
be aware of it because nothing will be shown (no usage is shown and no
errors either).

I will be implementing this solution and come back with another patch.

Thank you for your review. I really appreciate it!

[1] https://public-inbox.org/git/20160118215433.gb24...@sigill.intra.pe
ff.net/

Best regards,
Paul Ungureanu


[PATCH v5 2/2] diff: add --compact-summary

2018-02-24 Thread Nguyễn Thái Ngọc Duy
Certain information is currently shown with --summary, but when used
in combination with --stat it's a bit hard to read since info of the
same file is in two places (--stat and --summary).

On top of that, commits that add or remove files double the number of
display lines, which could be a lot if you add or remove a lot of
files.

--compact-summary embeds most of --summary back in --stat in the
little space between the file name part and the graph line, e.g. with
commit 0433d533f1:

   Documentation/merge-config.txt |  4 +
   builtin/merge.c|  2 +
   ...-pull-verify-signatures.sh (new +x) | 81 ++
   t/t7612-merge-verify-signatures.sh | 45 
   4 files changed, 132 insertions(+)

It helps both condensing information and saving some text
space. What's new in diffstat is:

- A new 0644 file is shown as (new)
- A new 0755 file is shown as (new +x)
- A new symlink is shown as (new +l)
- A deleted file is shown as (gone)
- A mode change adding executable bit is shown as (mode +x)
- A mode change removing it is shown as (mode -x)

Note that --compact-summary does not contain all the information
--summary provides. Rewrite percentage is not shown but it could be
added later, like R50% or C20%.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/diff-options.txt|  8 
 diff.c| 37 +++
 diff.h|  1 +
 t/t4013-diff-various.sh   |  5 +++
 ...ty_--root_--stat_--compact-summary_initial | 12 ++
 ...-R_--root_--stat_--compact-summary_initial | 12 ++
 ...tree_--stat_--compact-summary_initial_mode |  4 ++
 ...e_-R_--stat_--compact-summary_initial_mode |  4 ++
 8 files changed, 83 insertions(+)
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
 create mode 100644 t/t4013/diff.diff-tree_--stat_--compact-summary_initial_mode
 create mode 100644 
t/t4013/diff.diff-tree_-R_--stat_--compact-summary_initial_mode

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index c330c01ff0..e3a44f03cd 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -128,6 +128,14 @@ have to use `--diff-algorithm=default` option.
 These parameters can also be set individually with `--stat-width=`,
 `--stat-name-width=` and `--stat-count=`.
 
+--compact-summary::
+   Output a condensed summary of extended header information such
+   as file creations or deletions ("new" or "gone", optionally "+l"
+   if it's a symlink) and mode changes ("+x" or "-x" for adding
+   or removing executable bit respectively) in diffstat. The
+   information is put betwen the filename part and the graph
+   part. Implies `--stat`.
+
 --numstat::
Similar to `--stat`, but shows number of added and
deleted lines in decimal notation and pathname without
diff --git a/diff.c b/diff.c
index e3f72de27d..62e413a80f 100644
--- a/diff.c
+++ b/diff.c
@@ -2129,6 +2129,7 @@ struct diffstat_t {
char *from_name;
char *name;
char *print_name;
+   const char *comments;
unsigned is_unmerged:1;
unsigned is_binary:1;
unsigned is_renamed:1;
@@ -2205,6 +2206,9 @@ static void fill_print_name(struct diffstat_file *file)
else
quote_c_style(file->name, , NULL, 0);
 
+   if (file->comments)
+   strbuf_addf(, " (%s)", file->comments);
+
file->print_name = strbuf_detach(, NULL);
 }
 
@@ -3239,6 +3243,32 @@ static void builtin_diff(const char *name_a,
return;
 }
 
+static char *get_compact_summary(const struct diff_filepair *p, int is_renamed)
+{
+   if (!is_renamed) {
+   if (p->status == DIFF_STATUS_ADDED) {
+   if (S_ISLNK(p->two->mode))
+   return "new +l";
+   else if ((p->two->mode & 0777) == 0755)
+   return "new +x";
+   else
+   return "new";
+   } else if (p->status == DIFF_STATUS_DELETED)
+   return "gone";
+   }
+   if (S_ISLNK(p->one->mode) && !S_ISLNK(p->two->mode))
+   return "mode -l";
+   else if (!S_ISLNK(p->one->mode) && S_ISLNK(p->two->mode))
+   return "mode +l";
+   else if ((p->one->mode & 0777) == 0644 &&
+(p->two->mode & 0777) == 0755)
+   return "mode +x";
+   else if ((p->one->mode & 0777) == 0755 &&
+(p->two->mode & 0777) == 0644)
+   return "mode -x";
+   return NULL;
+}
+
 static void builtin_diffstat(const char *name_a, const char *name_b,
 struct 

[PATCH v5 0/2] diff: add --compact-summary (aka nd/diff-stat-with-summary)

2018-02-24 Thread Nguyễn Thái Ngọc Duy
... and v5 fixes the commit message of 2/2 where in v4 it still
mentions --stat-with-summary instead of --compact-summary. Sorry.

Nguyễn Thái Ngọc Duy (2):
  diff.c: refactor pprint_rename() to use strbuf
  diff: add --compact-summary

 Documentation/diff-options.txt|  8 ++
 diff.c| 96 ---
 diff.h|  1 +
 t/t4013-diff-various.sh   |  5 +
 ...ty_--root_--stat_--compact-summary_initial | 12 +++
 ...-R_--root_--stat_--compact-summary_initial | 12 +++
 ...tree_--stat_--compact-summary_initial_mode |  4 +
 ...e_-R_--stat_--compact-summary_initial_mode |  4 +
 8 files changed, 109 insertions(+), 33 deletions(-)
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
 create mode 100644 t/t4013/diff.diff-tree_--stat_--compact-summary_initial_mode
 create mode 100644 
t/t4013/diff.diff-tree_-R_--stat_--compact-summary_initial_mode

-- 
2.16.1.435.g8f24da2e1a



[PATCH v5 1/2] diff.c: refactor pprint_rename() to use strbuf

2018-02-24 Thread Nguyễn Thái Ngọc Duy
Instead of passing char* around, let function handle strbuf
directly. All callers already use strbuf internally.

This helps kill the "not free" exception in free_diffstat_info(). I
don't think this code is so critical that we need to avoid some free()
calls.

The other benefit comes in the next patch, where we append something
in pname before returning from fill_print_name(). With strbuf, it's
very simple. With "char *" we may have to resort to explicit
reallocation and stuff.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 diff.c | 59 ++
 1 file changed, 26 insertions(+), 33 deletions(-)

diff --git a/diff.c b/diff.c
index 21c3838b25..e3f72de27d 100644
--- a/diff.c
+++ b/diff.c
@@ -2045,11 +2045,10 @@ static void fn_out_consume(void *priv, char *line, 
unsigned long len)
}
 }
 
-static char *pprint_rename(const char *a, const char *b)
+static void pprint_rename(struct strbuf *name, const char *a, const char *b)
 {
const char *old = a;
const char *new = b;
-   struct strbuf name = STRBUF_INIT;
int pfx_length, sfx_length;
int pfx_adjust_for_slash;
int len_a = strlen(a);
@@ -2059,10 +2058,10 @@ static char *pprint_rename(const char *a, const char *b)
int qlen_b = quote_c_style(b, NULL, NULL, 0);
 
if (qlen_a || qlen_b) {
-   quote_c_style(a, , NULL, 0);
-   strbuf_addstr(, " => ");
-   quote_c_style(b, , NULL, 0);
-   return strbuf_detach(, NULL);
+   quote_c_style(a, name, NULL, 0);
+   strbuf_addstr(name, " => ");
+   quote_c_style(b, name, NULL, 0);
+   return;
}
 
/* Find common prefix */
@@ -2109,19 +2108,18 @@ static char *pprint_rename(const char *a, const char *b)
if (b_midlen < 0)
b_midlen = 0;
 
-   strbuf_grow(, pfx_length + a_midlen + b_midlen + sfx_length + 7);
+   strbuf_grow(name, pfx_length + a_midlen + b_midlen + sfx_length + 7);
if (pfx_length + sfx_length) {
-   strbuf_add(, a, pfx_length);
-   strbuf_addch(, '{');
+   strbuf_add(name, a, pfx_length);
+   strbuf_addch(name, '{');
}
-   strbuf_add(, a + pfx_length, a_midlen);
-   strbuf_addstr(, " => ");
-   strbuf_add(, b + pfx_length, b_midlen);
+   strbuf_add(name, a + pfx_length, a_midlen);
+   strbuf_addstr(name, " => ");
+   strbuf_add(name, b + pfx_length, b_midlen);
if (pfx_length + sfx_length) {
-   strbuf_addch(, '}');
-   strbuf_add(, a + len_a - sfx_length, sfx_length);
+   strbuf_addch(name, '}');
+   strbuf_add(name, a + len_a - sfx_length, sfx_length);
}
-   return strbuf_detach(, NULL);
 }
 
 struct diffstat_t {
@@ -2197,23 +2195,17 @@ static void show_graph(struct strbuf *out, char ch, int 
cnt,
 
 static void fill_print_name(struct diffstat_file *file)
 {
-   char *pname;
+   struct strbuf pname = STRBUF_INIT;
 
if (file->print_name)
return;
 
-   if (!file->is_renamed) {
-   struct strbuf buf = STRBUF_INIT;
-   if (quote_c_style(file->name, , NULL, 0)) {
-   pname = strbuf_detach(, NULL);
-   } else {
-   pname = file->name;
-   strbuf_release();
-   }
-   } else {
-   pname = pprint_rename(file->from_name, file->name);
-   }
-   file->print_name = pname;
+   if (file->is_renamed)
+   pprint_rename(, file->from_name, file->name);
+   else
+   quote_c_style(file->name, , NULL, 0);
+
+   file->print_name = strbuf_detach(, NULL);
 }
 
 static void print_stat_summary_inserts_deletes(struct diff_options *options,
@@ -2797,8 +2789,7 @@ static void free_diffstat_info(struct diffstat_t 
*diffstat)
int i;
for (i = 0; i < diffstat->nr; i++) {
struct diffstat_file *f = diffstat->files[i];
-   if (f->name != f->print_name)
-   free(f->print_name);
+   free(f->print_name);
free(f->name);
free(f->from_name);
free(f);
@@ -5241,10 +5232,12 @@ static void show_rename_copy(struct diff_options *opt, 
const char *renamecopy,
struct diff_filepair *p)
 {
struct strbuf sb = STRBUF_INIT;
-   char *names = pprint_rename(p->one->path, p->two->path);
+   struct strbuf names = STRBUF_INIT;
+
+   pprint_rename(, p->one->path, p->two->path);
strbuf_addf(, " %s %s (%d%%)\n",
-   renamecopy, names, similarity_index(p));
-   free(names);
+   renamecopy, names.buf, similarity_index(p));
+   strbuf_release();
emit_diff_symbol(opt, DIFF_SYMBOL_SUMMARY,
 sb.buf, sb.len, 0);
  

[PATCH v4 1/2] diff.c: refactor pprint_rename() to use strbuf

2018-02-24 Thread Nguyễn Thái Ngọc Duy
Instead of passing char* around, let function handle strbuf
directly. All callers already use strbuf internally.

This helps kill the "not free" exception in free_diffstat_info(). I
don't think this code is so critical that we need to avoid some free()
calls.

The other benefit comes in the next patch, where we append something
in pname before returning from fill_print_name(). With strbuf, it's
very simple. With "char *" we may have to resort to explicit
reallocation and stuff.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 diff.c | 59 ++
 1 file changed, 26 insertions(+), 33 deletions(-)

diff --git a/diff.c b/diff.c
index 21c3838b25..e3f72de27d 100644
--- a/diff.c
+++ b/diff.c
@@ -2045,11 +2045,10 @@ static void fn_out_consume(void *priv, char *line, 
unsigned long len)
}
 }
 
-static char *pprint_rename(const char *a, const char *b)
+static void pprint_rename(struct strbuf *name, const char *a, const char *b)
 {
const char *old = a;
const char *new = b;
-   struct strbuf name = STRBUF_INIT;
int pfx_length, sfx_length;
int pfx_adjust_for_slash;
int len_a = strlen(a);
@@ -2059,10 +2058,10 @@ static char *pprint_rename(const char *a, const char *b)
int qlen_b = quote_c_style(b, NULL, NULL, 0);
 
if (qlen_a || qlen_b) {
-   quote_c_style(a, , NULL, 0);
-   strbuf_addstr(, " => ");
-   quote_c_style(b, , NULL, 0);
-   return strbuf_detach(, NULL);
+   quote_c_style(a, name, NULL, 0);
+   strbuf_addstr(name, " => ");
+   quote_c_style(b, name, NULL, 0);
+   return;
}
 
/* Find common prefix */
@@ -2109,19 +2108,18 @@ static char *pprint_rename(const char *a, const char *b)
if (b_midlen < 0)
b_midlen = 0;
 
-   strbuf_grow(, pfx_length + a_midlen + b_midlen + sfx_length + 7);
+   strbuf_grow(name, pfx_length + a_midlen + b_midlen + sfx_length + 7);
if (pfx_length + sfx_length) {
-   strbuf_add(, a, pfx_length);
-   strbuf_addch(, '{');
+   strbuf_add(name, a, pfx_length);
+   strbuf_addch(name, '{');
}
-   strbuf_add(, a + pfx_length, a_midlen);
-   strbuf_addstr(, " => ");
-   strbuf_add(, b + pfx_length, b_midlen);
+   strbuf_add(name, a + pfx_length, a_midlen);
+   strbuf_addstr(name, " => ");
+   strbuf_add(name, b + pfx_length, b_midlen);
if (pfx_length + sfx_length) {
-   strbuf_addch(, '}');
-   strbuf_add(, a + len_a - sfx_length, sfx_length);
+   strbuf_addch(name, '}');
+   strbuf_add(name, a + len_a - sfx_length, sfx_length);
}
-   return strbuf_detach(, NULL);
 }
 
 struct diffstat_t {
@@ -2197,23 +2195,17 @@ static void show_graph(struct strbuf *out, char ch, int 
cnt,
 
 static void fill_print_name(struct diffstat_file *file)
 {
-   char *pname;
+   struct strbuf pname = STRBUF_INIT;
 
if (file->print_name)
return;
 
-   if (!file->is_renamed) {
-   struct strbuf buf = STRBUF_INIT;
-   if (quote_c_style(file->name, , NULL, 0)) {
-   pname = strbuf_detach(, NULL);
-   } else {
-   pname = file->name;
-   strbuf_release();
-   }
-   } else {
-   pname = pprint_rename(file->from_name, file->name);
-   }
-   file->print_name = pname;
+   if (file->is_renamed)
+   pprint_rename(, file->from_name, file->name);
+   else
+   quote_c_style(file->name, , NULL, 0);
+
+   file->print_name = strbuf_detach(, NULL);
 }
 
 static void print_stat_summary_inserts_deletes(struct diff_options *options,
@@ -2797,8 +2789,7 @@ static void free_diffstat_info(struct diffstat_t 
*diffstat)
int i;
for (i = 0; i < diffstat->nr; i++) {
struct diffstat_file *f = diffstat->files[i];
-   if (f->name != f->print_name)
-   free(f->print_name);
+   free(f->print_name);
free(f->name);
free(f->from_name);
free(f);
@@ -5241,10 +5232,12 @@ static void show_rename_copy(struct diff_options *opt, 
const char *renamecopy,
struct diff_filepair *p)
 {
struct strbuf sb = STRBUF_INIT;
-   char *names = pprint_rename(p->one->path, p->two->path);
+   struct strbuf names = STRBUF_INIT;
+
+   pprint_rename(, p->one->path, p->two->path);
strbuf_addf(, " %s %s (%d%%)\n",
-   renamecopy, names, similarity_index(p));
-   free(names);
+   renamecopy, names.buf, similarity_index(p));
+   strbuf_release();
emit_diff_symbol(opt, DIFF_SYMBOL_SUMMARY,
 sb.buf, sb.len, 0);
  

[PATCH v4 2/2] diff: add --stat-with-summary

2018-02-24 Thread Nguyễn Thái Ngọc Duy
Certain information is currently shown with --summary, but when used
in combination with --stat it's a bit hard to read since info of the
same file is in two places (--stat and --summary).

On top of that, commits that add or remove files double the number of
display lines, which could be a lot if you add or remove a lot of
files.

--stat-with-summary embeds most of --summary back in --stat in the
little space between the file name part and the graph line, e.g. with
commit 0433d533f1:

   Documentation/merge-config.txt |  4 +
   builtin/merge.c|  2 +
   ...-pull-verify-signatures.sh (new +x) | 81 ++
   t/t7612-merge-verify-signatures.sh | 45 
   4 files changed, 132 insertions(+)

It helps both condensing information and saving some text
space. What's new in diffstat is:

- A new 0644 file is shown as (new)
- A new 0755 file is shown as (new +x)
- A new symlink is shown as (new +l)
- A deleted file is shown as (gone)
- A mode change adding executable bit is shown as (mode +x)
- A mode change removing it is shown as (mode -x)

Note that --stat-with-summary does not contain all the information
--summary provides. Rewrite percentage is not shown but it could be
added later, like R50% or C20%.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/diff-options.txt|  8 
 diff.c| 37 +++
 diff.h|  1 +
 t/t4013-diff-various.sh   |  5 +++
 ...ty_--root_--stat_--compact-summary_initial | 12 ++
 ...-R_--root_--stat_--compact-summary_initial | 12 ++
 ...tree_--stat_--compact-summary_initial_mode |  4 ++
 ...e_-R_--stat_--compact-summary_initial_mode |  4 ++
 8 files changed, 83 insertions(+)
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
 create mode 100644 
t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
 create mode 100644 t/t4013/diff.diff-tree_--stat_--compact-summary_initial_mode
 create mode 100644 
t/t4013/diff.diff-tree_-R_--stat_--compact-summary_initial_mode

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index c330c01ff0..e3a44f03cd 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -128,6 +128,14 @@ have to use `--diff-algorithm=default` option.
 These parameters can also be set individually with `--stat-width=`,
 `--stat-name-width=` and `--stat-count=`.
 
+--compact-summary::
+   Output a condensed summary of extended header information such
+   as file creations or deletions ("new" or "gone", optionally "+l"
+   if it's a symlink) and mode changes ("+x" or "-x" for adding
+   or removing executable bit respectively) in diffstat. The
+   information is put betwen the filename part and the graph
+   part. Implies `--stat`.
+
 --numstat::
Similar to `--stat`, but shows number of added and
deleted lines in decimal notation and pathname without
diff --git a/diff.c b/diff.c
index e3f72de27d..62e413a80f 100644
--- a/diff.c
+++ b/diff.c
@@ -2129,6 +2129,7 @@ struct diffstat_t {
char *from_name;
char *name;
char *print_name;
+   const char *comments;
unsigned is_unmerged:1;
unsigned is_binary:1;
unsigned is_renamed:1;
@@ -2205,6 +2206,9 @@ static void fill_print_name(struct diffstat_file *file)
else
quote_c_style(file->name, , NULL, 0);
 
+   if (file->comments)
+   strbuf_addf(, " (%s)", file->comments);
+
file->print_name = strbuf_detach(, NULL);
 }
 
@@ -3239,6 +3243,32 @@ static void builtin_diff(const char *name_a,
return;
 }
 
+static char *get_compact_summary(const struct diff_filepair *p, int is_renamed)
+{
+   if (!is_renamed) {
+   if (p->status == DIFF_STATUS_ADDED) {
+   if (S_ISLNK(p->two->mode))
+   return "new +l";
+   else if ((p->two->mode & 0777) == 0755)
+   return "new +x";
+   else
+   return "new";
+   } else if (p->status == DIFF_STATUS_DELETED)
+   return "gone";
+   }
+   if (S_ISLNK(p->one->mode) && !S_ISLNK(p->two->mode))
+   return "mode -l";
+   else if (!S_ISLNK(p->one->mode) && S_ISLNK(p->two->mode))
+   return "mode +l";
+   else if ((p->one->mode & 0777) == 0644 &&
+(p->two->mode & 0777) == 0755)
+   return "mode +x";
+   else if ((p->one->mode & 0777) == 0755 &&
+(p->two->mode & 0777) == 0644)
+   return "mode -x";
+   return NULL;
+}
+
 static void builtin_diffstat(const char *name_a, const char *name_b,
 

[PATCH v4 0/2] diff: add --compact-summary (aka nd/diff-stat-with-summary)

2018-02-24 Thread Nguyễn Thái Ngọc Duy
v4 renames the option back to --compact-summary. I can't think of any
better name.

Interdiff

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 595e4cd548..e3a44f03cd 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -128,7 +128,7 @@ have to use `--diff-algorithm=default` option.
 These parameters can also be set individually with `--stat-width=`,
 `--stat-name-width=` and `--stat-count=`.
 
---stat-with-summary::
+--compact-summary::
Output a condensed summary of extended header information such
as file creations or deletions ("new" or "gone", optionally "+l"
if it's a symlink) and mode changes ("+x" or "-x" for adding
diff --git a/diff.c b/diff.c
index e7ff7dceb7..62e413a80f 100644
--- a/diff.c
+++ b/diff.c
@@ -4332,7 +4332,6 @@ static int stat_opt(struct diff_options *options, const 
char **av)
int graph_width = options->stat_graph_width;
int count = options->stat_count;
int argcount = 1;
-   unsigned with_summary = options->flags.stat_with_summary;
 
if (!skip_prefix(arg, "--stat", ))
die("BUG: stat option does not begin with --stat: %s", arg);
@@ -4376,9 +4375,6 @@ static int stat_opt(struct diff_options *options, const 
char **av)
count = strtoul(av[1], , 10);
argcount = 2;
}
-   } else if (skip_prefix(arg, "-with-summary", )) {
-   with_summary = 1;
-   end = (char*)arg;
}
break;
case '=':
@@ -4397,7 +4393,6 @@ static int stat_opt(struct diff_options *options, const 
char **av)
options->stat_graph_width = graph_width;
options->stat_width = width;
options->stat_count = count;
-   options->flags.stat_with_summary = with_summary;
return argcount;
 }
 
@@ -4579,11 +4574,13 @@ int diff_opt_parse(struct diff_options *options,
else if (!strcmp(arg, "-s") || !strcmp(arg, "--no-patch"))
options->output_format |= DIFF_FORMAT_NO_OUTPUT;
else if (starts_with(arg, "--stat"))
-   /*
-* --stat, --stat-width, --stat-name-width,
-* --stat-count or --stat-with-summary.
-*/
+   /* --stat, --stat-width, --stat-name-width, or --stat-count */
return stat_opt(options, av);
+   else if (!strcmp(arg, "--compact-summary")) {
+options->flags.stat_with_summary = 1;
+options->output_format |= DIFF_FORMAT_DIFFSTAT;
+   } else if (!strcmp(arg, "--no-compact-summary"))
+options->flags.stat_with_summary = 0;
 
/* renames options */
else if (starts_with(arg, "-B") ||
diff --git a/t/t4013-diff-various.sh b/t/t4013-diff-various.sh
index aa6f5da21c..3f9a24fd56 100755
--- a/t/t4013-diff-various.sh
+++ b/t/t4013-diff-various.sh
@@ -362,10 +362,10 @@ diff --no-index --raw --abbrev=4 dir2 dir
 :noellipses diff --no-index --raw --abbrev=4 dir2 dir
 diff --no-index --raw --no-abbrev dir2 dir
 
-diff-tree --pretty --root --stat-with-summary initial
-diff-tree --pretty -R --root --stat-with-summary initial
-diff-tree --stat-with-summary initial mode
-diff-tree -R --stat-with-summary initial mode
+diff-tree --pretty --root --stat --compact-summary initial
+diff-tree --pretty -R --root --stat --compact-summary initial
+diff-tree --stat --compact-summary initial mode
+diff-tree -R --stat --compact-summary initial mode
 EOF
 
 test_expect_success 'log -S requires an argument' '
diff --git a/t/t4013/diff.diff-tree_--pretty_--root_--stat-with-summary_initial 
b/t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
similarity index 78%
rename from t/t4013/diff.diff-tree_--pretty_--root_--stat-with-summary_initial
rename to 
t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
index 105f29a92d..d6451ff7cc 100644
--- a/t/t4013/diff.diff-tree_--pretty_--root_--stat-with-summary_initial
+++ b/t/t4013/diff.diff-tree_--pretty_--root_--stat_--compact-summary_initial
@@ -1,4 +1,4 @@
-$ git diff-tree --pretty --root --stat-with-summary initial
+$ git diff-tree --pretty --root --stat --compact-summary initial
 commit 444ac553ac7612cc88969031b02b3767fb8a353a
 Author: A U Thor 
 Date:   Mon Jun 26 00:00:00 2006 +
diff --git 
a/t/t4013/diff.diff-tree_--pretty_-R_--root_--stat-with-summary_initial 
b/t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
similarity index 78%
rename from 
t/t4013/diff.diff-tree_--pretty_-R_--root_--stat-with-summary_initial
rename to 
t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
index 45008d09fc..1989e55cd0 100644
--- a/t/t4013/diff.diff-tree_--pretty_-R_--root_--stat-with-summary_initial
+++ b/t/t4013/diff.diff-tree_--pretty_-R_--root_--stat_--compact-summary_initial
@@ -1,4 +1,4 

Re: [PATCH v3 06/42] completion: use __gitcomp_builtin in _git_am

2018-02-24 Thread Duy Nguyen
On Sat, Feb 24, 2018 at 1:08 AM, Junio C Hamano  wrote:
> Duy Nguyen  writes:
>
>> On Fri, Feb 23, 2018 at 1:19 AM, Junio C Hamano  wrote:
>>> Duy Nguyen  writes:
>>>
 Now that you mention it, the only command that completes
 --rerere-autoupdate is git-merge. Since this is "auto" I don't think
 people want to type manually.
>>>
>>> Sorry, but I do not quite get the connection between "since this is
>>> 'auto'" and the rest of the sentence.  Is it just it is so lengthy
>>> that people do not want to type and are likely to use completion?
>>
>> Well, if it is to be done automatically, I should not need to tell it
>> manually (by typing the option on command line). Granted it's a weak
>> argument.
>
> Perhaps I am not reading what the option does correctly, but my
> understanding is that merge operations:
>
>  - always allow rerere to intervene and auto-resolve in the working
>tree, as long as rerere is enabled;
>
>  - by default they however do not add the rerere-resolved result to
>the index, so that the combined diff output from "git diff" can
>be used as a way to sanity check the result; but
>
>  - if the user says --rerere-autoupdate, they add the
>rerere-resolved result to the index, letting the user blindly
>trust it, so that such a trusting user can even commit the result
>without looking at it by merely making sure there is no path
>remaining in the "git ls-files -u" output.
>
> The "autoupdate" could be somewhat dangerous depending on your
> workflow, and that is why the user can selectively enable it via the
> command line option (when known to be safe) even if the user does
> not set rerere.autoupdate to true.
>
> So I am not sure I understand "an option to allow something to be
> done automatically should not be given manually" as a valid argument
> at all, whether it is weak or strong.
>
> Or am I grossly confused?

Nope. It's just me not understanding rerere functionality.
-- 
Duy


Re: [PATCH v2 4/5] diff.c: initialize hash algo when running in --no-index mode

2018-02-24 Thread Duy Nguyen
On Sat, Feb 24, 2018 at 3:15 PM, Eric Sunshine  wrote:
> On Fri, Feb 23, 2018 at 10:34 PM, Nguyễn Thái Ngọc Duy
>  wrote:
>> before, compared to the alternative that we simply do not hash).
>>
>> dòng được
>
> Accidental paste?

Oops. I blame Emacs.
-- 
Duy


Re: [PATCH 1/1] git-p4: add unshelve command

2018-02-24 Thread Miguel Torroja
On Fri, Feb 23, 2018 at 6:22 PM, Luke Diamand  wrote:
> On 22 February 2018 at 22:28, Luke Diamand  wrote:
>> On 22 February 2018 at 21:39, Miguel Torroja  
>> wrote:
>>> Hi Luke,
>>>
>>> I really like the idea of creating a branch based on a shelved CL (We
>>> particularly use shelves all the time), I tested your change and I
>>> have some comments.
>>>
>>>  - I have some concerns about having the same "[git-p4...change =
>>> .]" as if it were a real submitted CL.
>>> One use case I foresee of the new implementation could be to
>>> cherry-pick that change on another branch (or current branch) prior to
>>> a git p4 submit.
>>
>> OK, I think we could just not add that in the case of an unshelved commit.
>>
>>>
>>>  - I see that the new p4/unshelve... branch is based on the tip of
>>> p4/master by default. what if we set the default to the current HEAD?
>>
>> There's a "--origin" option you can use to set it to whatever you want.
>>
>> I started out with HEAD as the default, but then found that to get a
>> sensible diff you have to both sync and rebase, which can be quite
>> annoying.
>>
>> In my case, in my early testing, I ended up with a git commit which
>> included both the creation of a file, and a subsequent change, even
>> though I had only unshelved the subsequent change. That was because
>> HEAD didn't include the file creation change (but p4/master _did_).
>
> Discussing this with some of my colleagues, and playing around with
> it, it seems that what it really needs to do is to figure out the
> parent commit of the shelved changelist, and use that as the basis for
> the diff.
>
> Unfortunately, Perforce doesn't have any concept of a "parent commit".
> One option that would be possible to implement though is to look at
> the shelved changelist, and foreach file, find the original revision
> number ("//depot/foo.c#97"). Then "p4 changes //depot/foo.c" would
> give you the changelist number for that file. Find the most recent P4
> changelist, find the git commit corresponding to that, and do the diff
> against that.
>
> It's pretty clunky, and I'm quite glad I didn't try to do that in the
> initial revision, as I would surely have given up!
>
> To do it properly of course you need to handle the case where the
> shelved changelist author had some files at one changelist, and others
> at another. But I think that's just far too complicated to deal with.
>
> Luke

The behavior of "p4 unshelve" and your "git p4 unshelve" approach I
think are equivalent
as p4 just copies the whole contents of the shelved file locally,
regardless of the previous revision synced.
In other words, you get the same result with "git p4 unshelve" than
with "p4 unshelve" now.

What about creating a new command in a future update to apply just the
change to your local tree?
One approach we took in the past was to create a diff patch and then
apply it to the working tree.
The command "p4 describe -S -du 12345" will output a patch but it has
two problems:
 * the header for each file is not standard, it has to be parsed and
converted, (it starts with " //depot/..." and it needs to be
converted to "--- a/...")
 * The new files are not output as a patch
let's say we call the command "git p4 cherry-pick"

Miguel


Re: [PATCH 01/11] t: prevent '-x' tracing from interfering with test helpers' stderr

2018-02-24 Thread SZEDER Gábor
On Sat, Feb 24, 2018 at 12:39 AM, SZEDER Gábor  wrote:

>   - Duplicate stderr of the tested command executed in the test helper
> function from the function's fd 7 (see next point), to ensure that
> the tested command's error messages go to a different fd than the
> '-x' trace of the commands executed in the function.
>
>   - Duplicate the test helper function's fd 7 from the function's
> original stderr, meaning that, after taking a detour through fd 7,
> the error messages of the tested command do end up on the
> function's original stderr.

> diff --git a/t/lib-terminal.sh b/t/lib-terminal.sh
> index cd220e378e..b3acb4c6f8 100644
> --- a/t/lib-terminal.sh
> +++ b/t/lib-terminal.sh
> @@ -9,8 +9,8 @@ test_terminal () {
> echo >&4 "test_terminal: need to declare TTY prerequisite"
> return 127
> fi
> -   perl "$TEST_DIRECTORY"/test-terminal.perl "$@"
> -}
> +   perl "$TEST_DIRECTORY"/test-terminal.perl "$@" 2>&9
> +} 9>&2 2>&4

Oops, these should duplicate from/to fd 7, not fd 9.


Re: [PATCH v2 4/5] diff.c: initialize hash algo when running in --no-index mode

2018-02-24 Thread Eric Sunshine
On Fri, Feb 23, 2018 at 10:34 PM, Nguyễn Thái Ngọc Duy
 wrote:
> Our "git diff" command supports running as a standalone tool. In this
> code path, we try to hash the file content but after
> 18e2588e11 (sha1_file: switch uses of SHA-1 to the_hash_algo -
> 2018-02-01), there is a chance that the_hash_algo (required by
> index_path) may still be uninitialized if no repository is found.
>
> Executing index_path() when the_hash_algo is NULL (or points to unknown
> algo) either crashes or dies. Let's make it a bit safer by explicitly
> falling back to SHA-1 (so that the diff output remains the same as
> before, compared to the alternative that we simply do not hash).
>
> dòng được

Accidental paste?

> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
> diff --git a/diff.c b/diff.c
> @@ -3995,6 +3995,18 @@ static void run_diff(struct diff_filepair *p, struct 
> diff_options *o)
> +   /*
> +* NEEDSWORK: When running in no-index mode (and no repo is
> +* found, thus no hash algo conifugred), fall back to SHA-1

s/conifugred/configured/

> +* hashing (which is used by diff_fill_oid_info below) to
> +* avoid regression in diff output.
> +*
> +* In future, perhaps we can allow the user to specify their
> +* hash algorithm from command line in this mode.
> +*/
> +   if (o->flags.no_index && !the_hash_algo)
> +   the_hash_algo = _algos[GIT_HASH_SHA1];
> +
> diff_fill_oid_info(one);
> diff_fill_oid_info(two);


Re: [PATCH 04/11] t3030-merge-recursive: don't check the stderr of a subshell

2018-02-24 Thread Eric Sunshine
On Fri, Feb 23, 2018 at 6:39 PM, SZEDER Gábor  wrote:
> The two test checking 'git mmerge-recursive' in an empty worktree in

s/mmerge/merge/, I guess.

> 't3030-merge-recursive.sh' fail when the test script is run with '-x'
> tracing (and using a shell other than a Bash version supporting
> BASH_XTRACEFD).  The reason for those failures is that the tests check
> the emptiness of a subshell's stderr, which includes the trace of
> commands executed in that subshell as well, throwing off the emptiness
> check.
>
> Note that both subshells execute four git commands each, meaning that
> checking the emptiness of the whole subshell implicitly ensures that
> not only 'git merge-recursive' but none of the other three commands
> outputs anything to their stderr.  Note also that if one of those
> commands were to output anything on its stderr, then the current
> combined check would not tell us which one of those four commands the
> unexpected output came from.
>
> Save the stderr of those four commands only instead of the whole
> subshell, so it remains free from tracing output, and save and check
> them individually, so they will show us from which command the
> unexpected output came from.
>
> After this change t3030 passes with '-x', even when running with
> /bin/sh.
>
> Signed-off-by: SZEDER Gábor