
[PATCH v3 1/1] Use size_t instead of 'unsigned long' for data in memory

2019-04-13 Thread tboegi

From: Torsten Bögershausen 

Currently the length of data which is stored in memory is stored
in "unsigned long" at many places in the code base.
This is OK when both "unsigned long" and size_t are 32 bits,
(and is OK when both are 64 bits).
On a 64-bit Windows system an "unsigned long" is only 32 bits wide,
which may be too short to hold the size of objects in memory;
size_t is the natural choice.
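
The truncation can be illustrated with a small sketch. This is not code from
the patch: `ulong32`, `store_size_in_ulong32()` and `size_survives_ulong32()`
are hypothetical names, and `uint32_t` merely stands in for the 32-bit
"unsigned long" of 64-bit Windows so that the effect shows on any platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit "unsigned long" (64-bit Windows). */
typedef uint32_t ulong32;

/* Store an in-memory size in the narrow type; the upper 32 bits are lost. */
ulong32 store_size_in_ulong32(uint64_t size)
{
	return (ulong32)size;
}

/* Return 1 if the size round-trips unchanged through the narrow type. */
int size_survives_ulong32(uint64_t size)
{
	return (uint64_t)store_size_in_ulong32(size) == size;
}
```

A size of 4 GiB + 5 bytes comes back as 5 from `store_size_in_ulong32()`,
which is the kind of silent wrap-around that size_t avoids on such a system.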

Improve the code base in "small steps", as small as possible.
The smallest step seems to be much bigger than expected.
See this code-snippet from convert.c:
const char *ret;
unsigned long sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

The corrected version looks like this:
const char *ret;
size_t sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

However, when the Git code base is compiled with a compiler that
complains that "unsigned long" is different from size_t, we end
up with this huge patch before the code base compiles cleanly.

And: there is more to be done in the zlib interface.

Reviewed-by: Johannes Schindelin 
Signed-off-by: Torsten Bögershausen 
---

This is the 3rd version of the patch.
 - Dscho contributed with a code-review (converted 2 more unsigned long)
 - Thomas Braun and Philip Oakley have done more work:
   https://github.com/tboegi/git/pull/1
   Those changes are not part of my patch series
 - Applying this patch on 'pu' gives 3 conflicts (blame.c, packfile.[ch])

  apply.c  | 78 
 archive-tar.c| 18 +-
 archive-zip.c|  2 +-
 archive.c|  2 +-
 archive.h|  2 +-
 bisect.c |  2 +-
 blame.c  |  6 ++--
 blame.h  |  2 +-
 builtin/cat-file.c   | 10 +++---
 builtin/difftool.c   |  2 +-
 builtin/fast-export.c|  6 ++--
 builtin/fmt-merge-msg.c  |  4 +--
 builtin/fsck.c   |  6 ++--
 builtin/grep.c   |  8 ++---
 builtin/index-pack.c | 26 +++---
 builtin/log.c|  4 +--
 builtin/ls-tree.c|  2 +-
 builtin/merge-tree.c |  6 ++--
 builtin/mktag.c  |  4 +--
 builtin/notes.c  |  6 ++--
 builtin/pack-objects.c   | 70 ++--
 builtin/reflog.c |  2 +-
 builtin/replace.c|  2 +-
 builtin/tag.c|  4 +--
 builtin/unpack-file.c|  2 +-
 builtin/unpack-objects.c | 34 +-
 builtin/verify-commit.c  |  4 +--
 bundle.c |  2 +-
 cache.h  | 11 +++---
 combine-diff.c   | 11 +++---
 commit.c | 22 ++--
 commit.h | 10 +++---
 config.c |  2 +-
 convert.c| 16 -
 delta.h  | 20 +--
 diff-delta.c |  4 +--
 diff.c   | 30 
 diff.h   |  2 +-
 diffcore-pickaxe.c   |  4 +--
 diffcore.h   |  2 +-
 dir.c|  6 ++--
 dir.h|  2 +-
 entry.c  |  4 +--
 fast-import.c| 26 +++---
 fsck.c   | 12 +++
 fsck.h   |  2 +-
 fuzz-pack-headers.c  |  4 +--
 grep.h   |  2 +-
 http-push.c  |  2 +-
 list-objects-filter.c|  2 +-
 mailmap.c|  2 +-
 match-trees.c|  4 +--
 merge-blobs.c|  6 ++--
 merge-blobs.h|  2 +-
 merge-recursive.c|  4 +--
 notes-cache.c|  2 +-
 notes-merge.c|  4 +--
 notes.c  |  6 ++--
 object-store.h   | 22 ++--
 object.c |  4 +--
 object.h |  2 +-
 pack-check.c |  2 +-
 pack-objects.h   | 14 
 pack.h   |  2 +-
 packfile.c   | 44 +++
 packfile.h   | 10 +++---
 patch-delta.c|  8 ++---
 range-diff.c |  2 +-
 read-cache.c | 48 -
 ref-filter.c | 16 -
 remote-testsvn.c |  4 +--
 rerere.c |  2 +-
 sha1-file.c  | 66 +-
 sha1dc_git.c |  2 +-
 sha1dc_git.h |  2 +-
 streaming.c  | 12 +++
 streaming.h  |  2 +-
 submodule-config.c   |  2 +-
 t/helper/test-delta.c|  2 +-
 tag.c|  6 ++--
 tag.h|  2 +-
 tree-walk.c  | 14 
 tree.c   |  2 +-
 xdiff-interface.c|  4 +--
 xdiff-interface.h|  4 +--
 85 files changed, 422 insertions(+), 420 deletions(-)

diff --git a/apply.c b/apply.c
index f15afa9f6a..7594859ce

[PATCH v2 1/1] trace2: NULL is not allowed for va_list

2019-03-19 Thread tboegi

From: Torsten Bögershausen 

Some compilers don't allow NULL to be passed for a va_list,
and e.g. "gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516"
errors out like this:
 trace2/tr2_tgt_event.c:193:18:
   error: invalid operands to binary &&
   (have ‘int’ and ‘va_list {aka __va_list}’)
if (fmt && *fmt && ap) {
   ^^
I couldn't find any hints that va_list and pointers can be mixed,
and no hints that they can't either. Morten Welinder comments:

"C99, Section 7.15, simply says that va_list "is an object type suitable for
holding information needed by the macros va_start, va_end, and
va_copy". So clearly not guaranteed to be mixable with pointers..."

The portable solution is to use "va_list" everywhere in the callchain.
As a consequence, both trace2_region_enter_fl() and trace2_region_leave_fl()
now take a variable argument list.
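
A minimal sketch of that pattern, outside of the trace2 code: the names
`maybe_format_va()` and `format_region()` are hypothetical and only mirror
the shape of the functions touched by the patch. The va_list is created with
va_start() in the variadic wrapper and handed down; the callee decides purely
on `fmt` and never compares the va_list with NULL.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical callee: formats into "out" only when a non-empty format
 * string is given. The va_list itself is never tested against NULL.
 */
int maybe_format_va(char *out, size_t outlen, const char *fmt, va_list ap)
{
	int n = 0;

	if (outlen)
		out[0] = '\0';
	if (fmt && *fmt) {
		va_list copy_ap;

		va_copy(copy_ap, ap);   /* leave the caller's ap untouched */
		n = vsnprintf(out, outlen, fmt, copy_ap);
		va_end(copy_ap);
	}
	return n;
}

/* Hypothetical variadic wrapper: always passes a real va_list down. */
int format_region(char *out, size_t outlen, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = maybe_format_va(out, outlen, fmt, ap);
	va_end(ap);
	return n;
}
```

Calling `format_region(buf, sizeof(buf), NULL)` is valid and leaves `buf`
empty, because the "no message" case is carried by `fmt`, not by the va_list.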

Signed-off-by: Torsten Bögershausen 
---
 trace2.c| 15 +++
 trace2.h|  4 ++--
 trace2/tr2_tgt_event.c  |  2 +-
 trace2/tr2_tgt_normal.c |  2 +-
 trace2/tr2_tgt_perf.c   |  2 +-
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/trace2.c b/trace2.c
index d4ef09..8bbad56887 100644
--- a/trace2.c
+++ b/trace2.c
@@ -548,10 +548,14 @@ void trace2_region_enter_printf_va_fl(const char *file, 
int line,
 }

 void trace2_region_enter_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo)
+   const char *label, const struct repository *repo, 
...)
 {
+   va_list ap;
+   va_start(ap, repo);
trace2_region_enter_printf_va_fl(file, line, category, label, repo,
-NULL, NULL);
+NULL, ap);
+   va_end(ap);
+
 }

 void trace2_region_enter_printf_fl(const char *file, int line,
@@ -621,10 +625,13 @@ void trace2_region_leave_printf_va_fl(const char *file, 
int line,
 }

 void trace2_region_leave_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo)
+   const char *label, const struct repository *repo, 
...)
 {
+   va_list ap;
+   va_start(ap, repo);
trace2_region_leave_printf_va_fl(file, line, category, label, repo,
-NULL, NULL);
+NULL, ap);
+   va_end(ap);
 }

 void trace2_region_leave_printf_fl(const char *file, int line,
diff --git a/trace2.h b/trace2.h
index ae5020d0e6..b330a54a89 100644
--- a/trace2.h
+++ b/trace2.h
@@ -238,7 +238,7 @@ void trace2_def_repo_fl(const char *file, int line, struct 
repository *repo);
  * on this thread.
  */
 void trace2_region_enter_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo);
+   const char *label, const struct repository *repo, 
...);

 #define trace2_region_enter(category, label, repo) \
trace2_region_enter_fl(__FILE__, __LINE__, (category), (label), (repo))
@@ -278,7 +278,7 @@ void trace2_region_enter_printf(const char *category, const 
char *label,
  * in this nesting level.
  */
 void trace2_region_leave_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo);
+   const char *label, const struct repository *repo, 
...);

 #define trace2_region_leave(category, label, repo) \
trace2_region_leave_fl(__FILE__, __LINE__, (category), (label), (repo))
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 107cb5317d..1cf4f62441 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -190,7 +190,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_add_string_va(struct json_writer *jw, const char *field_name,
const char *fmt, va_list ap)
 {
-   if (fmt && *fmt && ap) {
+   if (fmt && *fmt) {
va_list copy_ap;
struct strbuf buf = STRBUF_INIT;

diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 547183d5b6..1a07d70abd 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -126,7 +126,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_append_string_va(struct strbuf *buf, const char *fmt,
   va_list ap)
 {
-   if (fmt && *fmt && ap) {
+   if (fmt && *fmt) {
va_list copy_ap;

va_copy(copy_ap, ap);
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index f0746fcf86..2a866d701b 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -211,7 +211,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_append_string_va(struct strbuf *buf, const char *fmt,

[PATCH v1 1/1] trace2: NULL is not allowed for va_list

2019-03-16 Thread tboegi

From: Torsten Bögershausen 

Some compilers don't allow NULL to be passed for a va_list.
Use va_list instead.

Signed-off-by: Torsten Bögershausen 
---
 trace2.c| 15 +++
 trace2.h|  4 ++--
 trace2/tr2_tgt_event.c  |  2 +-
 trace2/tr2_tgt_normal.c |  2 +-
 trace2/tr2_tgt_perf.c   |  2 +-
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/trace2.c b/trace2.c
index d4ef09..8bbad56887 100644
--- a/trace2.c
+++ b/trace2.c
@@ -548,10 +548,14 @@ void trace2_region_enter_printf_va_fl(const char *file, 
int line,
 }

 void trace2_region_enter_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo)
+   const char *label, const struct repository *repo, 
...)
 {
+   va_list ap;
+   va_start(ap, repo);
trace2_region_enter_printf_va_fl(file, line, category, label, repo,
-NULL, NULL);
+NULL, ap);
+   va_end(ap);
+
 }

 void trace2_region_enter_printf_fl(const char *file, int line,
@@ -621,10 +625,13 @@ void trace2_region_leave_printf_va_fl(const char *file, 
int line,
 }

 void trace2_region_leave_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo)
+   const char *label, const struct repository *repo, 
...)
 {
+   va_list ap;
+   va_start(ap, repo);
trace2_region_leave_printf_va_fl(file, line, category, label, repo,
-NULL, NULL);
+NULL, ap);
+   va_end(ap);
 }

 void trace2_region_leave_printf_fl(const char *file, int line,
diff --git a/trace2.h b/trace2.h
index ae5020d0e6..b330a54a89 100644
--- a/trace2.h
+++ b/trace2.h
@@ -238,7 +238,7 @@ void trace2_def_repo_fl(const char *file, int line, struct 
repository *repo);
  * on this thread.
  */
 void trace2_region_enter_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo);
+   const char *label, const struct repository *repo, 
...);

 #define trace2_region_enter(category, label, repo) \
trace2_region_enter_fl(__FILE__, __LINE__, (category), (label), (repo))
@@ -278,7 +278,7 @@ void trace2_region_enter_printf(const char *category, const 
char *label,
  * in this nesting level.
  */
 void trace2_region_leave_fl(const char *file, int line, const char *category,
-   const char *label, const struct repository *repo);
+   const char *label, const struct repository *repo, 
...);

 #define trace2_region_leave(category, label, repo) \
trace2_region_leave_fl(__FILE__, __LINE__, (category), (label), (repo))
diff --git a/trace2/tr2_tgt_event.c b/trace2/tr2_tgt_event.c
index 107cb5317d..1cf4f62441 100644
--- a/trace2/tr2_tgt_event.c
+++ b/trace2/tr2_tgt_event.c
@@ -190,7 +190,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_add_string_va(struct json_writer *jw, const char *field_name,
const char *fmt, va_list ap)
 {
-   if (fmt && *fmt && ap) {
+   if (fmt && *fmt) {
va_list copy_ap;
struct strbuf buf = STRBUF_INIT;

diff --git a/trace2/tr2_tgt_normal.c b/trace2/tr2_tgt_normal.c
index 547183d5b6..1a07d70abd 100644
--- a/trace2/tr2_tgt_normal.c
+++ b/trace2/tr2_tgt_normal.c
@@ -126,7 +126,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_append_string_va(struct strbuf *buf, const char *fmt,
   va_list ap)
 {
-   if (fmt && *fmt && ap) {
+   if (fmt && *fmt) {
va_list copy_ap;

va_copy(copy_ap, ap);
diff --git a/trace2/tr2_tgt_perf.c b/trace2/tr2_tgt_perf.c
index f0746fcf86..2a866d701b 100644
--- a/trace2/tr2_tgt_perf.c
+++ b/trace2/tr2_tgt_perf.c
@@ -211,7 +211,7 @@ static void fn_atexit(uint64_t us_elapsed_absolute, int 
code)
 static void maybe_append_string_va(struct strbuf *buf, const char *fmt,
   va_list ap)
 {
-   if (fmt && *fmt && ap) {
+   if (fmt && *fmt) {
va_list copy_ap;

va_copy(copy_ap, ap);
--
2.21.0.135.g6e0cc67761

[PATCH/RFC v1 1/1] convert.c: Escape sequences only for a tty in trace_encoding()

2019-03-09 Thread tboegi

From: Torsten Bögershausen 

The content of a buffer can be dumped using trace_encoding()
before and after the encoding is converted.
The current function trace_encoding() in convert.c tries to
make the output easier to read:
The byte position and the character itself are dimmed, allowing
the eye to focus on the hex values in the byte stream.

ANSI escape sequences are used to "dim" the display temporarily,
and to restore the normal brightness.

When stdout is redirected into a file, those sequences do not
work as expected (they show up as raw bytes in an editor), which is
disturbing rather than helpful.

Disable them, if stdout is not a tty.
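
The idea can be sketched in isolation as follows. The helper names
(`stdout_is_tty()`, `dim_on()`, `dim_off()`, `format_trace_byte()`) are
hypothetical and not part of convert.c; only isatty() and the two escape
sequences are taken from the patch below.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* 1 if stdout is a terminal, 0 if it is redirected (file, pipe, ...). */
int stdout_is_tty(void)
{
	return isatty(STDOUT_FILENO) ? 1 : 0;
}

/* Escape sequences only for a tty; empty strings otherwise. */
const char *dim_on(int is_tty)
{
	return is_tty ? "\033[2m" : "";
}

const char *dim_off(int is_tty)
{
	return is_tty ? "\033[0m" : "";
}

/* Format one byte the way the trace output does: position, hex, character. */
int format_trace_byte(char *out, size_t outlen, int is_tty,
		      int pos, unsigned char byte)
{
	char c = (byte > 32 && byte < 127) ? (char)byte : ' ';

	return snprintf(out, outlen, "| %s%2i:%s %2x %s%c%s",
			dim_on(is_tty), pos, dim_off(is_tty),
			(unsigned)byte,
			dim_on(is_tty), c, dim_off(is_tty));
}
```

A caller would pass `stdout_is_tty()` as `is_tty`; keeping the decision in a
parameter makes the formatting testable without a real terminal.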

Signed-off-by: Torsten Bögershausen 
---
 I am tempted to remove the "dim" functionality altogether,
 or to remove the printout of the values which are now dimmed,
 what do others think?

convert.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/convert.c b/convert.c
index 5d0307fc10..70e58f1413 100644
--- a/convert.c
+++ b/convert.c
@@ -42,6 +42,9 @@ struct text_stat {
unsigned printable, nonprintable;
 };

+static const char *terminal_half_bright;
+static const char *terminal_reset_normal;
+
 static void gather_stats(const char *buf, unsigned long size, struct text_stat 
*stats)
 {
unsigned long i;
@@ -330,14 +333,23 @@ static void trace_encoding(const char *context, const 
char *path,
static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
struct strbuf trace = STRBUF_INIT;
int i;
-
+   if (!terminal_half_bright || !terminal_reset_normal) {
+   if (isatty(1)) {
+   terminal_half_bright  = "\033[2m";
+   terminal_reset_normal = "\033[0m";
+   } else {
+   terminal_half_bright = "";
+   terminal_reset_normal = "";
+   }
+   }
strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, 
encoding);
for (i = 0; i < len && buf; ++i) {
+   char c = buf[i] > 32 && buf[i] < 127 ? buf[i] : ' ';
strbuf_addf(
-   &trace, "| \033[2m%2i:\033[0m %2x \033[2m%c\033[0m%c",
-   i,
+   &trace, "| %s%2i:%s %2x %s%c%s%c",
+   terminal_half_bright, i, terminal_reset_normal,
(unsigned char) buf[i],
-   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   terminal_half_bright, c, terminal_reset_normal,
((i+1) % 8 && (i+1) < len ? ' ' : '\n')
);
}
--
2.21.0.135.g6e0cc67761

[PATCH v1 1/1] gitattributes.txt: fix typo

2019-03-05 Thread tboegi

From: Yash Bhatambare 

`UTF-16-LE-BOM` to `UTF-16LE-BOM`.

this closes https://github.com/git-for-windows/git/issues/2095

Signed-off-by: Yash Bhatambare 
Signed-off-by: Torsten Bögershausen 
---

This patch already made it into Git for Windows,
so I send it upstream "as is".

Documentation/gitattributes.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 9b41f81c06..bdd11a2ddd 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -346,7 +346,7 @@ automatic line ending conversion based on your platform.

 Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+in the working directory (use `UTF-16LE-BOM` instead of `UTF-16LE` if
 you want UTF-16 little endian with BOM).
 Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
--
2.19.1.593.gc670b1f876

[PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"

2019-01-30 Thread tboegi

From: Torsten Bögershausen 

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

The unicode standard itself defines 3 allowed ways how to encode UTF-16.
The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in the version (c) above.
This is what iconv generates, more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
000g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants,
but in practise we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version
with a BOM. (big endian is probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
as encoding, and that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will
be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle this encoding, so Git picks it up, handles the BOM
and uses libiconv to convert the rest of the stream.
(UTF-16BE-BOM is added for consistency)
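
The "handle the BOM, convert the rest" split can be sketched like this. It is
an illustration under assumptions, not the utf8.c implementation:
`encode_utf16le_bom()` is a hypothetical name, and the ASCII-only loop is a
stand-in for the libiconv call that converts UTF-8 to UTF-16LE.

```c
#include <stddef.h>
#include <string.h>

static const char utf16_le_bom[] = {'\xFF', '\xFE'};

/*
 * Hypothetical sketch: write the little endian BOM first, then append the
 * converted payload. The conversion here handles ASCII input only and
 * stands in for iconv(UTF-8 -> UTF-16LE).
 * Returns the number of bytes written, or 0 if "out" is too small.
 */
size_t encode_utf16le_bom(const char *in, size_t inlen,
			  char *out, size_t outlen)
{
	size_t need = sizeof(utf16_le_bom) + 2 * inlen;
	size_t i;

	if (outlen < need)
		return 0;

	memcpy(out, utf16_le_bom, sizeof(utf16_le_bom));
	for (i = 0; i < inlen; i++) {
		out[sizeof(utf16_le_bom) + 2 * i] = in[i];      /* low byte */
		out[sizeof(utf16_le_bom) + 2 * i + 1] = '\0';   /* high byte */
	}
	return need;
}
```

For the input "git" this yields the eight bytes of version (b) above:
the BOM 377 376 followed by g \0 i \0 t \0.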

Reported-by: Adrián Gimeno Balaguer 
Signed-off-by: Torsten Bögershausen 
---

Changes since v2:
  Update the commit message (s/possible/allowed/)
  Update the documentation, as suggested by Junio:
  ...wonder if the following,
 instead of the above hunk, would work better..
  Yes, it does.

Documentation/gitattributes.txt  |  4 ++-
 compat/precompose_utf8.c |  2 +-
 t/t0028-working-tree-encoding.sh | 12 -
 utf8.c   | 42 
 utf8.h   |  2 +-
 5 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..a2310fb920 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -344,7 +344,9 @@ automatic line ending conversion based on your platform.

 Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory. Please note, it is highly recommended to
+in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+you want UTF-16 little endian with BOM).
+Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
size_t namelen;
oldarg = argv[i];
if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, NULL);
+   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, 0, NULL);
if (newarg)
argv[i] = newarg;
}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '

text="hallo there!\ncan you read me?" &&
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+   echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" 
>>.gitattributes &&
printf "$text" >test.utf8.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw

[PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM"

2019-01-20 Thread tboegi

From: Torsten Bögershausen 

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

The unicode standard itself defines 3 possible ways how to encode UTF-16.
The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
000g   i   t

Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in the version (c) above.
This is what iconv generates, more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
000g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants,
but in practise we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version
with a BOM. (big endian is probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
as encoding, and that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will
be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle this encoding, so Git picks it up, handles the BOM
and uses libiconv to convert the rest of the stream.

Reported-by: Adrián Gimeno Balaguer 
Signed-off-by: Torsten Bögershausen 
---

I still think it makes sense to support  UTF-16, little endian and
with BOM in Git.
This V2 should make more clear, what standards we follow, and why
the naming scheme of Unicode does not cover all use cases in real world.

 Documentation/gitattributes.txt  |  4 +--
 compat/precompose_utf8.c |  2 +-
 t/t0028-working-tree-encoding.sh | 12 -
 utf8.c   | 42 
 utf8.h   |  2 +-
 5 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..4a88ab8be7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
 

 Use the following attributes if your '*.ps1' files are UTF-16 little
-endian encoded without BOM and you want Git to use Windows line endings
+endian encoded with BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 
-*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1  text working-tree-encoding=UTF-16LE-BOM eol=CRLF
 

 You can get a list of all available encodings on your platform with the
diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
size_t namelen;
oldarg = argv[i];
if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, NULL);
+   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, 0, NULL);
if (newarg)
argv[i] = newarg;
}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '

text="hallo there!\ncan you read me?" &&
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+   echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" 
>>.gitattributes &&
printf

[PATCH/RFC v2 1/1] test-lint: Only use sed [-n] [-e command] [-f command_file]

2019-01-19 Thread tboegi

From: Torsten Bögershausen 

From `man sed` (on a Mac OS X box):
The -E, -a and -i options are non-standard FreeBSD extensions and may not be
available on other operating systems.

From `man sed` on a Linux box:
REGULAR EXPRESSIONS
   POSIX.2 BREs should be supported, but they aren't completely because of
   performance problems.  The \n sequence in a regular expression matches
   the newline character,  and  similarly  for \a, \t, and other sequences.
   The -E option switches to using extended regular expressions instead;
   the -E option has been supported for years by GNU sed, and is now
   included in POSIX.

Well, there are still a lot of systems out there which don't support it.
Besides that, IEEE Std 1003.1-2017, see
http://pubs.opengroup.org/onlinepubs/9699919799/
does not mention -E either.

To be on the safe side, don't allow -E (or -r, which is GNU).
Change check-non-portable-shell.pl to only accept the portable options:
sed [-n] [-e command] [-f command_file]
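
The same check can be sketched outside of Perl. The lint script itself stays
in Perl; the C function below, `uses_nonportable_sed_option()`, is a
hypothetical illustration that translates the pattern into a POSIX extended
regular expression (`\b` and `\s` are Perl-only, so they are spelled out).

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if "line" calls sed with a single-letter
 * option other than -e, -f or -n, 0 if not, and -1 if the pattern fails
 * to compile. The trailing whitespace requirement keeps words such as
 * "--suppress-cc=self" after "cccmd-sed" from being flagged.
 */
int uses_nonportable_sed_option(const char *line)
{
	static const char pattern[] =
		"(^|[^[:alnum:]_])sed[[:space:]]+-[^efn][[:space:]]+";
	regex_t re;
	int hit;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB))
		return -1;
	hit = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

With this sketch, "sed -i s/a/b/ file" and "sed -E ..." are reported, while
"sed -n -e p file" and the t9001 line
"--cc-cmd=./cccmd-sed --suppress-cc=self" are not.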

Reported-by: SZEDER Gábor 
Helped-by: Eric Sunshine 
Helped-by: Ævar Arnfjörð Bjarmason 
Signed-off-by: Torsten Bögershausen 
---
 t/check-non-portable-shell.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl
index b45bdac688..6c798608a9 100755
--- a/t/check-non-portable-shell.pl
+++ b/t/check-non-portable-shell.pl
@@ -35,7 +35,7 @@ sub err {
chomp;
}

-   /\bsed\s+-i/ and err 'sed -i is not portable';
+   /\bsed\s+-[^efn]\s+/ and err 'Not portable option with sed (use only 
[-n] [-e command] [-f command_file])';
/\becho\s+-[neE]/ and err 'echo with option is not portable (use 
printf)';
/^\s*declare\s+/ and err 'arrays/declare not portable';
/^\s*[^#]\s*which\s/ and err 'which is not portable (use type)';
--
2.20.1.2.gb21ebb671

[PATCH/RFC v1 1/1] test-lint: sed -E (or -a, -l) are not portable

2019-01-15 Thread tboegi

From: Torsten Bögershausen 

From `man sed` (on a Mac OS X box):
The -E, -a and -i options are non-standard FreeBSD extensions and may not be 
available
on other operating systems.

From `man sed` on a Linux box:
REGULAR EXPRESSIONS
   POSIX.2 BREs should be supported, but they aren't completely because of
   performance problems.  The \n sequence in a regular expression matches
   the newline character,  and  similarly  for \a, \t, and other sequences.
   The -E option switches to using extended regular expressions instead;
   the -E option has been supported for years by GNU sed, and is now
   included in POSIX.

Well, there are still a lot of systems out there which don't support it.

Besides that, IEEE Std 1003.1-2017, see
http://pubs.opengroup.org/onlinepubs/9699919799/
does not mention -E either.

To be on the safe side, don't allow it.

Reported-by: SZEDER Gábor 
Signed-off-by: Torsten Bögershausen 
---

I am somewhat unsure if we should disable all options except -e -f -n
instead ?
/\bsed\s+-[^efn]/ and err 'Not portable option with sed. Only -n -e -f are portable';

That would cause a false positive in t9001 here:
"--cc-cmd=./cccmd-sed --suppress-cc=self"

which could either be fixed by an anchor:
/^\s*sed\s+-[^efn]/

Or by allowing '--' like this:
/\bsed\s+-[^-efn]/

Any thoughts, please ?

t/check-non-portable-shell.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl
index b45bdac688..96b6afdeb8 100755
--- a/t/check-non-portable-shell.pl
+++ b/t/check-non-portable-shell.pl
@@ -35,7 +35,7 @@ sub err {
chomp;
}

-   /\bsed\s+-i/ and err 'sed -i is not portable';
+   /\bsed\s+-[Eail]/ and err 'Not portable option with sed. Only -e -f -n 
are portable';
/\becho\s+-[neE]/ and err 'echo with option is not portable (use 
printf)';
/^\s*declare\s+/ and err 'arrays/declare not portable';
/^\s*[^#]\s*which\s/ and err 'which is not portable (use type)';
--
2.20.1.2.gb21ebb671

[PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"

2018-12-29 Thread tboegi

From: Torsten Bögershausen 

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

After a checkout, the resulting file has a BOM and is encoded in "UTF-16".
The unicode standard allows both little- and big-endianness (LE/BE) for
those files, the BOM will tell which one is used inside the file.
iconv seems to prefer the BE version.
Not all users under Windows are happy with this when tools are not fully
unicode aware and don't digest the BE version at all.

Today there is no name for "UTF-16 with BOM, little endian please".
Introduce "UTF-16LE-BOM".

Reported-by: Adrián Gimeno Balaguer 
Signed-off-by: Torsten Bögershausen 
---

This feels like an RFC at the moment - please comment.
Using UTF-16 in the way "UTF-16LE-BOM" is used in this patch
could be an alternative - simply produce UTF-16 in LE version
under Git - this could make people using Git happy as well.

Documentation/gitattributes.txt  |  4 +--
 compat/precompose_utf8.c |  2 +-
 t/t0028-working-tree-encoding.sh | 12 -
 utf8.c   | 42 
 utf8.h   |  2 +-
 5 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..4a88ab8be7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
 
 
 Use the following attributes if your '*.ps1' files are UTF-16 little
-endian encoded without BOM and you want Git to use Windows line endings
+endian encoded with BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.
 
 
-*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1  text working-tree-encoding=UTF-16LE-BOM eol=CRLF
 
 
 You can get a list of all available encodings on your platform with the
diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
size_t namelen;
oldarg = argv[i];
if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, NULL);
+   newarg = reencode_string_iconv(oldarg, namelen, 
ic_precompose, 0, NULL);
if (newarg)
argv[i] = newarg;
}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '
 
text="hallo there!\ncan you read me?" &&
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+   echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
printf "$text" >test.utf8.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+   printf "\377\376" >test.utf16lebom.raw &&
+   printf "$text" | iconv -f UTF-8 -t UTF-16LE >>test.utf16lebom.raw &&
 
# Line ending tests
printf "one\ntwo\nthree\n" >lf.utf8.raw &&
@@ -32,7 +35,8 @@ test_expect_success 'setup test files' '
# Add only UTF-16 file, we will add the UTF-32 file later
cp test.utf16.raw test.utf16 &&
cp test.utf32.raw test.utf32 &&
-   git add .gitattributes test.utf16 &&
+   cp test.utf16lebom.raw test.utf16lebom &&
+   git add .gitattributes test.utf16 test.utf16lebom &&
git commit -m initial
 '
 
@@ -51,6 +55,12 @@ test_expect_success 're-encode to UTF-16 on checkout' '
test_cmp_bin test.utf16.raw test.utf16
 '
 
+test_expect_success 're-encode to UTF-16-LE-BOM on checkout' '
+   rm test.utf16lebom &&
+   git checkout test.utf16lebom &&
+   test_cmp_bin test.utf16lebom.raw test.utf16lebom
+'
+
 test_expect_success 'check $GIT_DIR/info/attributes support' '
test_when_finished "rm -f test.utf32.git" &&
test_when_finished "git reset --hard HEAD" &&
diff --git a/utf8.c b/utf8.c
index eb78587504..83824dc2f4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4,6 +4,11 @@
 
 /* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */
 
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const cha

[PATCH v4 1/1] 'git clone C:\cygwin\home\USER\repo' is working (again)

2018-12-14 Thread tboegi

From: Torsten Bögershausen 

A regression for cygwin users was introduced with commit 05b458c,
 "real_path: resolve symlinks by hand".

In the commit message we read:
  The current implementation of real_path uses chdir() in order to resolve
  symlinks.  Unfortunately this isn't thread-safe as chdir() affects a
  process as a whole...

The old (and non-thread-safe) OS calls chdir()/pwd() have been
replaced by a string operation.
The cygwin layer "knows" that "C:\cygwin" is an absolute path,
but the new string operation does not.

"git clone  C:\cygwin\home\USER\repo" fails like this:
fatal: Invalid path '/home/USER/repo/C:\cygwin\home\USER\repo'

The solution is to implement has_dos_drive_prefix(), skip_dos_drive_prefix(),
is_dir_sep(), offset_1st_component() and convert_slashes() for cygwin
in the same way as it is done in 'Git for Windows' in compat/mingw.[ch].

Extract the needed code into compat/win32/path-utils.[ch] and use it
for cygwin as well.
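
The failure described above can be sketched in a few lines of C. This is
only an illustration, not code from the patch: posix_is_absolute() and
win32_is_absolute() are hypothetical names, and drive_prefix_len() merely
mirrors the has_dos_drive_prefix() macro quoted in the diff. A check that
only looks for a leading '/' treats "C:\cygwin\..." as a relative path, so
it gets appended to the current directory.

```c
#include <assert.h>
#include <ctype.h>

/* Mirrors the has_dos_drive_prefix() macro: 2 for "X:", otherwise 0. */
static int drive_prefix_len(const char *path)
{
	assert(path);
	return (isalpha((unsigned char)path[0]) && path[1] == ':') ? 2 : 0;
}

/* Both '/' and '\\' count as directory separators on Windows. */
static int is_dir_sep_win(int c)
{
	return c == '/' || c == '\\';
}

/* Hypothetical: what a string-only check sees without the DOS helpers. */
int posix_is_absolute(const char *path)
{
	assert(path);
	return path[0] == '/';
}

/* Hypothetical: the same check once drive prefix and '\\' are understood. */
int win32_is_absolute(const char *path)
{
	assert(path);
	return is_dir_sep_win(path[0]) || drive_prefix_len(path) > 0;
}
```

With these helpers, posix_is_absolute("C:\\cygwin\\home\\USER\\repo")
yields 0 while win32_is_absolute() yields 1 for the same string.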

Reported-by: Steven Penny 
Helped-by: Johannes Schindelin 
Signed-off-by: Torsten Bögershausen 

Changes since v3:
  Rename e.g. mingw_skip_dos_drive_prefix() into
  win32_skip_dos_drive_prefix()
  as suggested by Dscho, thanks for that.
  Add a tweak in t5601 for cygwin.

The test suite passes now on cygwin.
The "Git for Windows" build was tested on
gfw/master, with this commit cherry-picked on top.


---
 compat/cygwin.c   | 19 ---
 compat/cygwin.h   |  2 --
 compat/mingw.c| 29 +
 compat/mingw.h| 20 
 compat/win32/path-utils.c | 28 
 compat/win32/path-utils.h | 20 
 config.mak.uname  |  3 ++-
 git-compat-util.h |  3 ++-
 t/t5601-clone.sh  |  2 +-
 9 files changed, 54 insertions(+), 72 deletions(-)
 delete mode 100644 compat/cygwin.c
 delete mode 100644 compat/cygwin.h
 create mode 100644 compat/win32/path-utils.c
 create mode 100644 compat/win32/path-utils.h

diff --git a/compat/cygwin.c b/compat/cygwin.c
deleted file mode 100644
index b9862d606d..00
--- a/compat/cygwin.c
+++ /dev/null
@@ -1,19 +0,0 @@
-#include "../git-compat-util.h"
-#include "../cache.h"
-
-int cygwin_offset_1st_component(const char *path)
-{
-   const char *pos = path;
-   /* unc paths */
-   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
-   /* skip server name */
-   pos = strchr(pos + 2, '/');
-   if (!pos)
-   return 0; /* Error: malformed unc path */
-
-   do {
-   pos++;
-   } while (*pos && pos[0] != '/');
-   }
-   return pos + is_dir_sep(*pos) - path;
-}
diff --git a/compat/cygwin.h b/compat/cygwin.h
deleted file mode 100644
index 8e52de4644..00
--- a/compat/cygwin.h
+++ /dev/null
@@ -1,2 +0,0 @@
-int cygwin_offset_1st_component(const char *path);
-#define offset_1st_component cygwin_offset_1st_component
diff --git a/compat/mingw.c b/compat/mingw.c
index 34b3880b29..b459e1a291 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -350,7 +350,7 @@ static inline int needs_hiding(const char *path)
return 0;
 
/* We cannot use basename(), as it would remove trailing slashes */
-   mingw_skip_dos_drive_prefix((char **)&path);
+   win32_skip_dos_drive_prefix((char **)&path);
if (!*path)
return 0;
 
@@ -2275,33 +2275,6 @@ pid_t waitpid(pid_t pid, int *status, int options)
return -1;
 }
 
-int mingw_skip_dos_drive_prefix(char **path)
-{
-   int ret = has_dos_drive_prefix(*path);
-   *path += ret;
-   return ret;
-}
-
-int mingw_offset_1st_component(const char *path)
-{
-   char *pos = (char *)path;
-
-   /* unc paths */
-   if (!skip_dos_drive_prefix(&pos) &&
-   is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
-   /* skip server name */
-   pos = strpbrk(pos + 2, "\\/");
-   if (!pos)
-   return 0; /* Error: malformed unc path */
-
-   do {
-   pos++;
-   } while (*pos && !is_dir_sep(*pos));
-   }
-
-   return pos + is_dir_sep(*pos) - path;
-}
-
 int xutftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
 {
int upos = 0, wpos = 0;
diff --git a/compat/mingw.h b/compat/mingw.h
index 8c24ddaa3e..30d9fb3e36 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -443,32 +443,12 @@ HANDLE winansi_get_osfhandle(int fd);
  * git specific compatibility
  */
 
-#define has_dos_drive_prefix(path) \
-   (isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
-int mingw_skip_dos_drive_prefix(char **path);
-#define skip_dos_drive_prefix mingw_skip_dos_drive_prefix
-static inline int mingw_is_dir_sep(int c)
-{
-   return c == '/' || c == '\\';
-}
-#define is_dir_sep mingw_is_dir_sep
-static inline char *mingw_find_last_dir_sep(co

[PATCH v3 1/1] 'git clone C:\cygwin\home\USER\repo' is working (again)

2018-12-08 Thread tboegi

From: Torsten Bögershausen 

A regression for cygwin users was introduced with commit 05b458c,
 "real_path: resolve symlinks by hand".

In the commit message we read:
  The current implementation of real_path uses chdir() in order to resolve
  symlinks.  Unfortunately this isn't thread-safe as chdir() affects a
  process as a whole...

The old (and non-thread-safe) OS calls chdir()/pwd() have been
replaced by a string operation.
The cygwin layer "knows" that "C:\cygwin" is an absolute path,
but the new string operation does not.

"git clone  C:\cygwin\home\USER\repo" fails like this:
fatal: Invalid path '/home/USER/repo/C:\cygwin\home\USER\repo'

The solution is to implement has_dos_drive_prefix(), skip_dos_drive_prefix(),
is_dir_sep(), offset_1st_component() and convert_slashes() for cygwin
in the same way as it is done in 'Git for Windows' in compat/mingw.[ch].

Extract the needed code into compat/win32/path-utils.[ch] and use it
for cygwin as well.

Reported-by: Steven Penny 
Helped-by: Johannes Schindelin 
Signed-off-by: Torsten Bögershausen 
---
Changes since V2:
- Settled on a better name:
  The common code is in compat/win32/path-utils.c/h
- Skip the 2 patches which "only" do a cleanup (for the moment) and
  put those cleanups onto the "todo stack".
- The "DOS" moniker is still used for 2 reasons:
  Windows inherited the "drive letter" concept from DOS,
  and everybody (tm) familiar with the code and the path handling
  in Git is used to that wording.
  Even if there was a better name, it would need to be addressed
  in a patch series different from this one.
  Here I want to fix a reported regression.

And, before any cleanup is done, I would like to ask if anybody
can build the code with VS and confirm that it works, please?

Thanks for the reviews, testing and comments.

compat/cygwin.c   | 19 ---
 compat/cygwin.h   |  2 --
 compat/mingw.c| 29 +
 compat/mingw.h| 20 
 compat/win32/path-utils.c | 28 
 compat/win32/path-utils.h | 20 
 config.mak.uname  |  3 ++-
 git-compat-util.h |  3 ++-
 8 files changed, 53 insertions(+), 71 deletions(-)
 delete mode 100644 compat/cygwin.c
 delete mode 100644 compat/cygwin.h
 create mode 100644 compat/win32/path-utils.c
 create mode 100644 compat/win32/path-utils.h

diff --git a/compat/cygwin.c b/compat/cygwin.c
deleted file mode 100644
index b9862d606d..00
--- a/compat/cygwin.c
+++ /dev/null
@@ -1,19 +0,0 @@
-#include "../git-compat-util.h"
-#include "../cache.h"
-
-int cygwin_offset_1st_component(const char *path)
-{
-   const char *pos = path;
-   /* unc paths */
-   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
-   /* skip server name */
-   pos = strchr(pos + 2, '/');
-   if (!pos)
-   return 0; /* Error: malformed unc path */
-
-   do {
-   pos++;
-   } while (*pos && pos[0] != '/');
-   }
-   return pos + is_dir_sep(*pos) - path;
-}
diff --git a/compat/cygwin.h b/compat/cygwin.h
deleted file mode 100644
index 8e52de4644..00
--- a/compat/cygwin.h
+++ /dev/null
@@ -1,2 +0,0 @@
-int cygwin_offset_1st_component(const char *path);
-#define offset_1st_component cygwin_offset_1st_component
diff --git a/compat/mingw.c b/compat/mingw.c
index 34b3880b29..27e397f268 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -350,7 +350,7 @@ static inline int needs_hiding(const char *path)
return 0;
 
/* We cannot use basename(), as it would remove trailing slashes */
-   mingw_skip_dos_drive_prefix((char **)&path);
+   win_path_utils_skip_dos_drive_prefix((char **)&path);
if (!*path)
return 0;
 
@@ -2275,33 +2275,6 @@ pid_t waitpid(pid_t pid, int *status, int options)
return -1;
 }
 
-int mingw_skip_dos_drive_prefix(char **path)
-{
-   int ret = has_dos_drive_prefix(*path);
-   *path += ret;
-   return ret;
-}
-
-int mingw_offset_1st_component(const char *path)
-{
-   char *pos = (char *)path;
-
-   /* unc paths */
-   if (!skip_dos_drive_prefix(&pos) &&
-   is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
-   /* skip server name */
-   pos = strpbrk(pos + 2, "\\/");
-   if (!pos)
-   return 0; /* Error: malformed unc path */
-
-   do {
-   pos++;
-   } while (*pos && !is_dir_sep(*pos));
-   }
-
-   return pos + is_dir_sep(*pos) - path;
-}
-
 int xutftowcsn(wchar_t *wcs, const char *utfs, size_t wcslen, int utflen)
 {
int upos = 0, wpos = 0;
diff --git a/compat/mingw.h b/compat/mingw.h
index 8c24ddaa3e..30d9fb3e36 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -443,32 +443,12 @@ HANDLE winansi_get_osfhandle(int fd);
  * git specific compatibility
  */
 
-#

[PATCH v2 1/3] 'git clone C:\cygwin\home\USER\repo' is working (again)

2018-12-07 Thread tboegi

From: Torsten Bögershausen 

A regression for cygwin users was introduced with commit 05b458c,
 "real_path: resolve symlinks by hand".

In the commit message we read:
  The current implementation of real_path uses chdir() in order to resolve
  symlinks.  Unfortunately this isn't thread-safe as chdir() affects a
  process as a whole...

The old (and non-thread-safe) OS calls chdir()/pwd() have been
replaced by a string operation.
The cygwin layer "knows" that "C:\cygwin" is an absolute path,
but the new string operation does not.

"git clone  C:\cygwin\home\USER\repo" fails like this:
fatal: Invalid path '/home/USER/repo/C:\cygwin\home\USER\repo'

The solution is to implement has_dos_drive_prefix(), skip_dos_drive_prefix(),
is_dir_sep(), offset_1st_component() and convert_slashes() for cygwin
in the same way as it is done in 'Git for Windows' in compat/mingw.[ch].

Instead of duplicating the code, it is extracted into compat/mingw-cygwin.[ch].
Some need for refactoring and cleanup came up in the review; these are addressed
in a separate commit.

Reported-By: Steven Penny 
Signed-off-by: Torsten Bögershausen 
---
 compat/cygwin.c   | 19 ---
 compat/cygwin.h   |  2 --
 compat/mingw-cygwin.c | 28 
 compat/mingw-cygwin.h | 20 
 compat/mingw.c| 29 +
 compat/mingw.h| 20 
 config.mak.uname  |  4 ++--
 git-compat-util.h |  3 ++-
 8 files changed, 53 insertions(+), 72 deletions(-)
 delete mode 100644 compat/cygwin.c
 delete mode 100644 compat/cygwin.h
 create mode 100644 compat/mingw-cygwin.c
 create mode 100644 compat/mingw-cygwin.h

diff --git a/compat/cygwin.c b/compat/cygwin.c
deleted file mode 100644
index b9862d606d..00
--- a/compat/cygwin.c
+++ /dev/null
@@ -1,19 +0,0 @@
-#include "../git-compat-util.h"
-#include "../cache.h"
-
-int cygwin_offset_1st_component(const char *path)
-{
-   const char *pos = path;
-   /* unc paths */
-   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
-   /* skip server name */
-   pos = strchr(pos + 2, '/');
-   if (!pos)
-   return 0; /* Error: malformed unc path */
-
-   do {
-   pos++;
-   } while (*pos && pos[0] != '/');
-   }
-   return pos + is_dir_sep(*pos) - path;
-}
diff --git a/compat/cygwin.h b/compat/cygwin.h
deleted file mode 100644
index 8e52de4644..00
--- a/compat/cygwin.h
+++ /dev/null
@@ -1,2 +0,0 @@
-int cygwin_offset_1st_component(const char *path);
-#define offset_1st_component cygwin_offset_1st_component
diff --git a/compat/mingw-cygwin.c b/compat/mingw-cygwin.c
new file mode 100644
index 00..c63d7acb9c
--- /dev/null
+++ b/compat/mingw-cygwin.c
@@ -0,0 +1,28 @@
+#include "../git-compat-util.h"
+
+int mingw_cygwin_skip_dos_drive_prefix(char **path)
+{
+   int ret = has_dos_drive_prefix(*path);
+   *path += ret;
+   return ret;
+}
+
+int mingw_cygwin_offset_1st_component(const char *path)
+{
+   char *pos = (char *)path;
+
+   /* unc paths */
+   if (!skip_dos_drive_prefix(&pos) &&
+   is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   /* skip server name */
+   pos = strpbrk(pos + 2, "\\/");
+   if (!pos)
+   return 0; /* Error: malformed unc path */
+
+   do {
+   pos++;
+   } while (*pos && !is_dir_sep(*pos));
+   }
+
+   return pos + is_dir_sep(*pos) - path;
+}
diff --git a/compat/mingw-cygwin.h b/compat/mingw-cygwin.h
new file mode 100644
index 00..66ccc909ae
--- /dev/null
+++ b/compat/mingw-cygwin.h
@@ -0,0 +1,20 @@
+#define has_dos_drive_prefix(path) \
+   (isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
+int mingw_cygwin_skip_dos_drive_prefix(char **path);
+#define skip_dos_drive_prefix mingw_cygwin_skip_dos_drive_prefix
+static inline int mingw_cygwin_is_dir_sep(int c)
+{
+   return c == '/' || c == '\\';
+}
+#define is_dir_sep mingw_cygwin_is_dir_sep
+static inline char *mingw_cygwin_find_last_dir_sep(const char *path)
+{
+   char *ret = NULL;
+   for (; *path; ++path)
+   if (is_dir_sep(*path))
+   ret = (char *)path;
+   return ret;
+}
+#define find_last_dir_sep mingw_cygwin_find_last_dir_sep
+int mingw_cygwin_offset_1st_component(const char *path);
+#define offset_1st_component mingw_cygwin_offset_1st_component
diff --git a/compat/mingw.c b/compat/mingw.c
index 34b3880b29..038e96af9d 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -350,7 +350,7 @@ static inline int needs_hiding(const char *path)
return 0;
 
/* We cannot use basename(), as it would remove trailing slashes */
-   mingw_skip_dos_drive_prefix((char **)&path);
+   mingw_cygwin_skip_dos_drive_prefix((char **)&path);
if (!*path)
return 0;

[PATCH v2 3/3] Refactor mingw_cygwin_offset_1st_component()

2018-12-07 Thread tboegi

From: Torsten Bögershausen 

The Windows version of offset_1st_component() needs to handle 3 cases:
- The path is a UNC path, starting with "//" or two backslashes.
  Skip the server name and the name of the share.
- The path is a DOS drive, starting with e.g. "X:".
  The drive letter and the ':' must be skipped.
- The path is pointing to a subdirectory somewhere in the path and the
  directory separator needs to be skipped ('/' or '\\').

Refactor the code to make it easier to read.
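
The three cases can be sketched as a standalone function. This is an
illustrative version only, assuming the behaviour described above;
sketch_offset_1st_component() and is_sep() are hypothetical names, and the
real code uses skip_dos_drive_prefix() and is_dir_sep() from the compat
headers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_sep(int c)
{
	return c == '/' || c == '\\';
}

/* Returns the length of the "root" part that precedes the first component. */
size_t sketch_offset_1st_component(const char *path)
{
	const char *pos = path;

	assert(path);
	if (is_sep(pos[0]) && is_sep(pos[1])) {
		/* UNC path: skip the server name ... */
		pos = strpbrk(pos + 2, "\\/");
		if (!pos)
			return 0; /* malformed UNC path */
		/* ... and the name of the share */
		do {
			pos++;
		} while (*pos && !is_sep(*pos));
	} else if (isalpha((unsigned char)pos[0]) && pos[1] == ':') {
		pos += 2; /* DOS drive: skip the letter and the ':' */
	}
	/* finally skip one directory separator, if there is one */
	return (size_t)(pos + is_sep(*pos) - path);
}
```

For "//server/share/dir" the result is 15 (the length of
"//server/share/"), for "C:\dir" it is 3, for "/dir" it is 1 and for a
plain "dir" it is 0.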

Suggested-by: Johannes Schindelin 
Signed-off-by: Torsten Bögershausen 
---
 compat/mingw-cygwin.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/compat/mingw-cygwin.c b/compat/mingw-cygwin.c
index 5552c3ac20..c379a72775 100644
--- a/compat/mingw-cygwin.c
+++ b/compat/mingw-cygwin.c
@@ -10,10 +10,8 @@ size_t mingw_cygwin_skip_dos_drive_prefix(char **path)
 size_t mingw_cygwin_offset_1st_component(const char *path)
 {
char *pos = (char *)path;
-
-   /* unc paths */
-   if (!skip_dos_drive_prefix(&pos) &&
-   is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   /* unc path */
/* skip server name */
pos = strpbrk(pos + 2, "\\/");
if (!pos)
@@ -22,7 +20,8 @@ size_t mingw_cygwin_offset_1st_component(const char *path)
do {
pos++;
} while (*pos && !is_dir_sep(*pos));
+   } else {
+   skip_dos_drive_prefix(&pos);
}
-
return pos + is_dir_sep(*pos) - path;
 }
-- 
2.19.0.271.gfe8321ec05

[PATCH v2 2/3] offset_1st_component(), dos_drive_prefix() return size_t

2018-12-07 Thread tboegi

From: Torsten Bögershausen 

Change the return value of offset_1st_component(),
has_dos_drive_prefix() and skip_dos_drive_prefix() from int to size_t,
which is the natural type for the length of data in memory.

While at it, remove possible "parameter not used" warnings for the
non-Windows builds in git-compat-util.h.
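
A minimal sketch of both points, under the assumption of a non-Windows
build: the fallback helper silences the unused-parameter warning with a
(void) cast, and the caller keeps the offset in a size_t so that it can be
passed to memcpy() without a conversion. All sketch_* names are
hypothetical and only loosely modelled on get_root_part().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fallback for platforms without drive letters: never a prefix. */
size_t sketch_has_dos_drive_prefix(const char *path)
{
	(void)path; /* avoids a "parameter not used" warning */
	return 0;
}

/* POSIX-style fallback: the root part is at most one leading '/'. */
size_t sketch_offset_1st_component(const char *path)
{
	assert(path);
	return path[0] == '/';
}

/* Copies the root part of path into dst and returns its length. */
size_t sketch_copy_root_part(char *dst, size_t dstlen, const char *path)
{
	size_t offset = sketch_offset_1st_component(path);

	if (offset >= dstlen)
		return 0; /* no room for the root part plus the NUL */
	memcpy(dst, path, offset);
	dst[offset] = '\0';
	return offset;
}
```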

Signed-off-by: Torsten Bögershausen 
---
 abspath.c | 2 +-
 compat/mingw-cygwin.c | 6 +++---
 compat/mingw-cygwin.h | 4 ++--
 git-compat-util.h | 8 +---
 setup.c   | 4 ++--
 5 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/abspath.c b/abspath.c
index 9857985329..12055a1d8f 100644
--- a/abspath.c
+++ b/abspath.c
@@ -51,7 +51,7 @@ static void get_next_component(struct strbuf *next, struct strbuf *remaining)
 /* copies root part from remaining to resolved, canonicalizing it on the way */
 static void get_root_part(struct strbuf *resolved, struct strbuf *remaining)
 {
-   int offset = offset_1st_component(remaining->buf);
+   size_t offset = offset_1st_component(remaining->buf);
 
strbuf_reset(resolved);
strbuf_add(resolved, remaining->buf, offset);
diff --git a/compat/mingw-cygwin.c b/compat/mingw-cygwin.c
index c63d7acb9c..5552c3ac20 100644
--- a/compat/mingw-cygwin.c
+++ b/compat/mingw-cygwin.c
@@ -1,13 +1,13 @@
 #include "../git-compat-util.h"
 
-int mingw_cygwin_skip_dos_drive_prefix(char **path)
+size_t mingw_cygwin_skip_dos_drive_prefix(char **path)
 {
-   int ret = has_dos_drive_prefix(*path);
+   size_t ret = has_dos_drive_prefix(*path);
*path += ret;
return ret;
 }
 
-int mingw_cygwin_offset_1st_component(const char *path)
+size_t mingw_cygwin_offset_1st_component(const char *path)
 {
char *pos = (char *)path;
 
diff --git a/compat/mingw-cygwin.h b/compat/mingw-cygwin.h
index 66ccc909ae..0e8a0c9074 100644
--- a/compat/mingw-cygwin.h
+++ b/compat/mingw-cygwin.h
@@ -1,6 +1,6 @@
 #define has_dos_drive_prefix(path) \
(isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
-int mingw_cygwin_skip_dos_drive_prefix(char **path);
+size_t mingw_cygwin_skip_dos_drive_prefix(char **path);
 #define skip_dos_drive_prefix mingw_cygwin_skip_dos_drive_prefix
 static inline int mingw_cygwin_is_dir_sep(int c)
 {
@@ -16,5 +16,5 @@ static inline char *mingw_cygwin_find_last_dir_sep(const char *path)
return ret;
 }
 #define find_last_dir_sep mingw_cygwin_find_last_dir_sep
-int mingw_cygwin_offset_1st_component(const char *path);
+size_t mingw_cygwin_offset_1st_component(const char *path);
 #define offset_1st_component mingw_cygwin_offset_1st_component
diff --git a/git-compat-util.h b/git-compat-util.h
index 7ece969b22..65eaaf0d50 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -355,16 +355,18 @@ static inline int noop_core_config(const char *var, const char *value, void *cb)
 #endif
 
 #ifndef has_dos_drive_prefix
-static inline int git_has_dos_drive_prefix(const char *path)
+static inline size_t git_has_dos_drive_prefix(const char *path)
 {
+   (void)path;
return 0;
 }
 #define has_dos_drive_prefix git_has_dos_drive_prefix
 #endif
 
 #ifndef skip_dos_drive_prefix
-static inline int git_skip_dos_drive_prefix(char **path)
+static inline size_t git_skip_dos_drive_prefix(char **path)
 {
+   (void)path;
return 0;
 }
 #define skip_dos_drive_prefix git_skip_dos_drive_prefix
@@ -379,7 +381,7 @@ static inline int git_is_dir_sep(int c)
 #endif
 
 #ifndef offset_1st_component
-static inline int git_offset_1st_component(const char *path)
+static inline size_t git_offset_1st_component(const char *path)
 {
return is_dir_sep(path[0]);
 }
diff --git a/setup.c b/setup.c
index 1be5037f12..538bc1ff99 100644
--- a/setup.c
+++ b/setup.c
@@ -29,7 +29,7 @@ static int abspath_part_inside_repo(char *path)
size_t len;
size_t wtlen;
char *path0;
-   int off;
+   size_t off;
const char *work_tree = get_git_work_tree();
 
if (!work_tree)
@@ -800,7 +800,7 @@ static const char *setup_bare_git_dir(struct strbuf *cwd, int offset,
  struct repository_format *repo_fmt,
  int *nongit_ok)
 {
-   int root_len;
+   size_t root_len;
 
if (check_repository_format_gently(".", repo_fmt, nongit_ok))
return NULL;
-- 
2.19.0.271.gfe8321ec05

[PATCH v1/RFC 1/1] 'git clone C:\cygwin\home\USER\repo' is working (again)

2018-11-26 Thread tboegi

From: Torsten Bögershausen 

A regression for cygwin users was introduced with commit 05b458c,
 "real_path: resolve symlinks by hand".

In the commit message we read:
  The current implementation of real_path uses chdir() in order to resolve
  symlinks.  Unfortunately this isn't thread-safe as chdir() affects a
  process as a whole...

The old (and non-thread-safe) OS calls chdir()/pwd() have been
replaced by a string operation.
The cygwin layer "knows" that "C:\cygwin" is an absolute path,
but the new string operation does not.

"git clone  C:\cygwin\home\USER\repo" fails like this:
fatal: Invalid path '/home/USER/repo/C:\cygwin\home\USER\repo'

The solution is to implement has_dos_drive_prefix(), skip_dos_drive_prefix(),
is_dir_sep(), offset_1st_component() and convert_slashes() for cygwin
in the same way as it is done in 'Git for Windows' in compat/mingw.[ch].

Reported-By: Steven Penny 
Signed-off-by: Torsten Bögershausen 
---

This is the first version of a patch.
Is there a chance that you could test it?

abspath.c   |  2 +-
 compat/cygwin.c | 18 ++
 compat/cygwin.h | 32 
 3 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/abspath.c b/abspath.c
index 9857985329..77a281f789 100644
--- a/abspath.c
+++ b/abspath.c
@@ -55,7 +55,7 @@ static void get_root_part(struct strbuf *resolved, struct strbuf *remaining)
 
strbuf_reset(resolved);
strbuf_add(resolved, remaining->buf, offset);
-#ifdef GIT_WINDOWS_NATIVE
+#if defined(GIT_WINDOWS_NATIVE) || defined(__CYGWIN__)
convert_slashes(resolved->buf);
 #endif
strbuf_remove(remaining, 0, offset);
diff --git a/compat/cygwin.c b/compat/cygwin.c
index b9862d606d..c4a10cb5a1 100644
--- a/compat/cygwin.c
+++ b/compat/cygwin.c
@@ -1,19 +1,29 @@
 #include "../git-compat-util.h"
 #include "../cache.h"
 
+int cygwin_skip_dos_drive_prefix(char **path)
+{
+   int ret = has_dos_drive_prefix(*path);
+   *path += ret;
+   return ret;
+}
+
 int cygwin_offset_1st_component(const char *path)
 {
-   const char *pos = path;
+   char *pos = (char *)path;
+
/* unc paths */
-   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   if (!skip_dos_drive_prefix(&pos) &&
+   is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
/* skip server name */
-   pos = strchr(pos + 2, '/');
+   pos = strpbrk(pos + 2, "\\/");
if (!pos)
return 0; /* Error: malformed unc path */
 
do {
pos++;
-   } while (*pos && pos[0] != '/');
+   } while (*pos && !is_dir_sep(*pos));
}
+
return pos + is_dir_sep(*pos) - path;
 }
diff --git a/compat/cygwin.h b/compat/cygwin.h
index 8e52de4644..46f29c0a90 100644
--- a/compat/cygwin.h
+++ b/compat/cygwin.h
@@ -1,2 +1,34 @@
+#define has_dos_drive_prefix(path) \
+   (isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
+
+
+int cygwin_offset_1st_component(const char *path);
+#define offset_1st_component cygwin_offset_1st_component
+
+
+#define has_dos_drive_prefix(path) \
+   (isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
+int cygwin_skip_dos_drive_prefix(char **path);
+#define skip_dos_drive_prefix cygwin_skip_dos_drive_prefix
+static inline int cygwin_is_dir_sep(int c)
+{
+   return c == '/' || c == '\\';
+}
+#define is_dir_sep cygwin_is_dir_sep
+static inline char *cygwin_find_last_dir_sep(const char *path)
+{
+   char *ret = NULL;
+   for (; *path; ++path)
+   if (is_dir_sep(*path))
+   ret = (char *)path;
+   return ret;
+}
+static inline void convert_slashes(char *path)
+{
+   for (; *path; path++)
+   if (*path == '\\')
+   *path = '/';
+}
+#define find_last_dir_sep cygwin_find_last_dir_sep
 int cygwin_offset_1st_component(const char *path);
 #define offset_1st_component cygwin_offset_1st_component
-- 
2.19.0.271.gfe8321ec05

[PATCH v1 1/1] t5601-99: Enable colliding file detection for MINGW

2018-11-22 Thread tboegi

From: Torsten Bögershausen 

Commit b878579ae7 (clone: report duplicate entries on case-insensitive
filesystems - 2018-08-17) adds a warning to the user when cloning a repo
with case-sensitive file names on a case-insensitive file system.

This test has never been enabled for MINGW.
It had been working since day 1, but I forgot to report that to the
author.
Enable it after a re-test.

Signed-off-by: Torsten Bögershausen 
---

The other day, I wanted to test Duy's patch
under MINGW to see if the problem is caught,
but git am failed to apply it. Not a big disaster,
because it is already in master.
Here is a follow-up, and we can end the match.


 t/t5601-clone.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index c28d51bd59..8bbc7068ac 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -628,7 +628,7 @@ test_expect_success 'clone on case-insensitive fs' '
)
 '
 
-test_expect_success !MINGW,CASE_INSENSITIVE_FS 'colliding file detection' '
+test_expect_success CASE_INSENSITIVE_FS 'colliding file detection' '
grep X icasefs/warning &&
grep x icasefs/warning &&
test_i18ngrep "the following paths have collided" icasefs/warning
-- 
2.19.0.271.gfe8321ec05

[PATCH v2 1/1] Use size_t instead of 'unsigned long' for data in memory

2018-11-19 Thread tboegi

From: Torsten Bögershausen 

Currently the length of data which is stored in memory is stored
in "unsigned long" at many places in the code base.
This is OK when both "unsigned long" and size_t are 32 bits
(and is OK when both are 64 bits).
On a 64 bit Windows system an "unsigned long" is 32 bits, and
that may be too short to measure the size of objects in memory;
size_t is the natural choice.

Improve the code base in "small steps", as small as possible.
The smallest step seems to be much bigger than expected.
See this code-snippet from convert.c:

const char *ret;
unsigned long sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

The corrected version looks like this:
const char *ret;
size_t sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

However, when the Git code base is compiled with a compiler that
complains that "unsigned long" is different from size_t, we end
up in this huge patch, before the code base cleanly compiles.
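
Why one changed declaration pulls in so many others can be sketched with a
small, self-contained example. The sketch_* functions are hypothetical
stand-ins, not Git's real API: once a reader takes a "size_t *" out
parameter, every caller has to declare its length variable as size_t, and
every function the length is handed to should accept size_t as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a reader such as read_blob_data_from_index(). */
const void *sketch_read_blob(const char *blob, size_t *size)
{
	assert(blob && size);
	*size = strlen(blob);
	return blob;
}

/* Hypothetical consumer that receives the length by value. */
size_t sketch_count_lf(const void *data, size_t size)
{
	const char *p = data;
	size_t i, n = 0;

	for (i = 0; i < size; i++)
		if (p[i] == '\n')
			n++;
	return n;
}

size_t sketch_stats(const char *blob)
{
	size_t sz; /* must be size_t, because &sz is passed as size_t * */
	const void *data = sketch_read_blob(blob, &sz);

	return sketch_count_lf(data, sz);
}
```

Declaring sz as "unsigned long" here would make &sz an incompatible
pointer on a platform where the two types differ, which is the kind of
complaint that makes the change spread from file to file.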

Signed-off-by: Torsten Bögershausen 
---

Thanks for all the comments on V1.
Changes since V1:
- Make the motivation somewhat clearer in the commit message
- Rebase to the November 19 pu

What we really need for this patch to fly are these branches:
mk/use-size-t-in-zlib
tb/print-size-t-with-uintmax-format

And then it is rebased on top of all cooking stuff, too many branches
to be mentioned here.

It may be useful to examine all "unsigned long" which are left after
this patch and turn them into (what? unsigned int? size_t? uint32_t?).
And once they are settled, re-do this patch with the help of a coccinelle script.
I don't know.
I probably will rebase it until Junio says stop or someone else comes up with
a better solution.

apply.c  | 14 -
 archive-tar.c| 18 +--
 archive-zip.c|  2 +-
 archive.c|  2 +-
 archive.h|  2 +-
 bisect.c |  2 +-
 blame.c  |  6 ++--
 blame.h  |  2 +-
 builtin/cat-file.c   | 10 +++---
 builtin/difftool.c   |  2 +-
 builtin/fast-export.c|  6 ++--
 builtin/fmt-merge-msg.c  |  3 +-
 builtin/fsck.c   |  6 ++--
 builtin/grep.c   |  8 ++---
 builtin/index-pack.c | 27 
 builtin/log.c|  4 +--
 builtin/ls-tree.c|  2 +-
 builtin/merge-tree.c |  6 ++--
 builtin/mktag.c  |  4 +--
 builtin/notes.c  |  6 ++--
 builtin/pack-objects.c   | 56 +-
 builtin/reflog.c |  2 +-
 builtin/replace.c|  2 +-
 builtin/tag.c|  4 +--
 builtin/unpack-file.c|  2 +-
 builtin/unpack-objects.c | 35 ++---
 builtin/verify-commit.c  |  4 +--
 bundle.c |  2 +-
 cache.h  | 10 +++---
 combine-diff.c   | 11 ---
 commit.c | 22 +++---
 commit.h | 10 +++---
 config.c |  2 +-
 convert.c| 18 +--
 delta.h  | 20 ++--
 diff-delta.c |  4 +--
 diff.c   | 30 +-
 diff.h   |  2 +-
 diffcore-pickaxe.c   |  4 +--
 diffcore.h   |  2 +-
 dir.c|  6 ++--
 dir.h|  2 +-
 entry.c  |  4 +--
 fast-import.c| 26 
 fsck.c   | 12 
 fsck.h   |  2 +-
 fuzz-pack-headers.c  |  4 +--
 grep.h   |  2 +-
 http-push.c  |  2 +-
 list-objects-filter.c|  2 +-
 mailmap.c|  2 +-
 match-trees.c|  4 +--
 merge-blobs.c|  6 ++--
 merge-blobs.h|  2 +-
 merge-recursive.c|  4 +--
 notes-cache.c|  2 +-
 notes-merge.c|  4 +--
 notes.c  |  6 ++--
 object-store.h   | 20 ++--
 object.c |  4 +--
 object.h |  2 +-
 pack-check.c |  2 +-
 pack-objects.h   | 14 -
 pack.h   |  2 +-
 packfile.c   | 40 
 packfile.h   |  8 ++---
 patch-delta.c|  8 ++---
 range-diff.c |  2 +-
 read-cache.c | 48 ++---
 ref-filter.c | 30 +-
 remote-testsvn.c |  4 +--
 rerere.c |  2 +-
 sha1-file.c  | 66 
 sha1dc_git.c |  2 +-
 sha1dc_git.h |  2 +-
 streaming.c  | 12 
 streaming.h  |  2 +-
 submodule-config.c   |  2 +-
 t/helper/test-delta.c|  2 +-
 tag.c|  6 ++--
 tag.h|  2 +-
 tree-walk.c  | 14 -
 tree.c

[PATCH/RFC v2 1/1] Use size_t instead of 'unsigned long' for data in memory

2018-11-19 Thread tboegi

From: Torsten Bögershausen 

Currently the length of data which is stored in memory is stored
in "unsigned long" at many places in the code base.
This is OK when both "unsigned long" and size_t are 32 bits
(and is OK when both are 64 bits).
On a 64 bit Windows system an "unsigned long" is 32 bits, and
that may be too short to measure the size of objects in memory;
size_t is the natural choice.

Improve the code base in "small steps", as small as possible.
The smallest step seems to be much bigger than expected.
See this code-snippet from convert.c:

const char *ret;
unsigned long sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

The corrected version looks like this:
const char *ret;
size_t sz;
void *data = read_blob_data_from_index(istate, path, &sz);
ret = gather_convert_stats_ascii(data, sz);

However, when the Git code base is compiled with a compiler that
complains that "unsigned long" is different from size_t, we end
up in this huge patch, before the code base cleanly compiles.

Signed-off-by: Torsten Bögershausen 
---

Thanks for all the comments on V1.
Changes since V1:
- Make the motivation somewhat clearer in the commit message
- Rebase to the November 19 pu

What we really need for this patch to fly are these branches:
mk/use-size-t-in-zlib
tb/print-size-t-with-uintmax-format

And then it is rebased on top of all cooking stuff, too many branches
to be mentioned here.

It may be useful to examine all "unsigned long" which are left after
this patch and turn them into (what? unsigned int? size_t? uint32_t?).
And once they are settled, re-do this patch with the help of a coccinelle script.
I don't know.
I probably will rebase it until Junio says stop or someone else comes up with
a better solution.

apply.c  | 14 -
 archive-tar.c| 18 +--
 archive-zip.c|  2 +-
 archive.c|  2 +-
 archive.h|  2 +-
 bisect.c |  2 +-
 blame.c  |  6 ++--
 blame.h  |  2 +-
 builtin/cat-file.c   | 10 +++---
 builtin/difftool.c   |  2 +-
 builtin/fast-export.c|  6 ++--
 builtin/fmt-merge-msg.c  |  3 +-
 builtin/fsck.c   |  6 ++--
 builtin/grep.c   |  8 ++---
 builtin/index-pack.c | 27 
 builtin/log.c|  4 +--
 builtin/ls-tree.c|  2 +-
 builtin/merge-tree.c |  6 ++--
 builtin/mktag.c  |  4 +--
 builtin/notes.c  |  6 ++--
 builtin/pack-objects.c   | 56 +-
 builtin/reflog.c |  2 +-
 builtin/replace.c|  2 +-
 builtin/tag.c|  4 +--
 builtin/unpack-file.c|  2 +-
 builtin/unpack-objects.c | 35 ++---
 builtin/verify-commit.c  |  4 +--
 bundle.c |  2 +-
 cache.h  | 10 +++---
 combine-diff.c   | 11 ---
 commit.c | 22 +++---
 commit.h | 10 +++---
 config.c |  2 +-
 convert.c| 18 +--
 delta.h  | 20 ++--
 diff-delta.c |  4 +--
 diff.c   | 30 +-
 diff.h   |  2 +-
 diffcore-pickaxe.c   |  4 +--
 diffcore.h   |  2 +-
 dir.c|  6 ++--
 dir.h|  2 +-
 entry.c  |  4 +--
 fast-import.c| 26 
 fsck.c   | 12 
 fsck.h   |  2 +-
 fuzz-pack-headers.c  |  4 +--
 grep.h   |  2 +-
 http-push.c  |  2 +-
 list-objects-filter.c|  2 +-
 mailmap.c|  2 +-
 match-trees.c|  4 +--
 merge-blobs.c|  6 ++--
 merge-blobs.h|  2 +-
 merge-recursive.c|  4 +--
 notes-cache.c|  2 +-
 notes-merge.c|  4 +--
 notes.c  |  6 ++--
 object-store.h   | 20 ++--
 object.c |  4 +--
 object.h |  2 +-
 pack-check.c |  2 +-
 pack-objects.h   | 14 -
 pack.h   |  2 +-
 packfile.c   | 40 
 packfile.h   |  8 ++---
 patch-delta.c|  8 ++---
 range-diff.c |  2 +-
 read-cache.c | 48 ++---
 ref-filter.c | 30 +-
 remote-testsvn.c |  4 +--
 rerere.c |  2 +-
 sha1-file.c  | 66 
 sha1dc_git.c |  2 +-
 sha1dc_git.h |  2 +-
 streaming.c  | 12 
 streaming.h  |  2 +-
 submodule-config.c   |  2 +-
 t/helper/test-delta.c|  2 +-
 tag.c|  6 ++--
 tag.h|  2 +-
 tree-walk.c  | 14 -
 tree.c

[PATCH/RFC v1 1/1] Use size_t instead of unsigned long

2018-11-17 Thread tboegi

From: Torsten Bögershausen 

Currently Git users cannot commit files > 4 GiB under 64 bit Windows,
where "long" is 32 bit but size_t is 64 bit.
Improve the code base in small steps, as small as possible.
What started with a small patch to replace "unsigned long" with size_t
in one file (convert.c) ended up with a change in many files.

Signed-off-by: Torsten Bögershausen 
---

This needs to go on top of pu, to cover all the good stuff
  cooking here.
I started this series on November 1st; since then 2 or 3 rebases
  have been done to catch up, and now it is on pu from November 15.

I couldn't find a reason why changing "unsigned long"
  into "size_t" could break anything. Any thoughts, please?
Side question: One thing I wondered about is why Git creates a conflict
like this, using git cherry-pick:
<<< HEAD
unsigned long size;
void *data = read_object_file(oid, &type, &size);
===
size_t size;
void *data = repo_read_object_file(the_repository, oid, &type,
   &size);
>>> 3ee0abef4c... Use size_t instead of unsigned long

One commit changed "unsigned long size" into "size_t size",
the other commit swapped repo_read_object_file() with read_object_file().
Both changes are on different lines, but Git sees a conflict here.

 apply.c  | 14 -
 archive-tar.c| 18 +--
 archive-zip.c|  2 +-
 archive.c|  2 +-
 archive.h|  2 +-
 bisect.c |  2 +-
 blame.c  |  6 ++--
 blame.h  |  2 +-
 builtin/cat-file.c   | 10 +++---
 builtin/difftool.c   |  3 +-
 builtin/fast-export.c|  6 ++--
 builtin/fmt-merge-msg.c  |  4 ++-
 builtin/fsck.c   |  6 ++--
 builtin/grep.c   |  8 ++---
 builtin/index-pack.c | 27 
 builtin/log.c|  4 +--
 builtin/ls-tree.c|  2 +-
 builtin/merge-tree.c |  6 ++--
 builtin/mktag.c  |  5 +--
 builtin/notes.c  |  6 ++--
 builtin/pack-objects.c   | 56 +-
 builtin/reflog.c |  2 +-
 builtin/replace.c|  2 +-
 builtin/tag.c|  4 +--
 builtin/unpack-file.c|  2 +-
 builtin/unpack-objects.c | 35 ++---
 builtin/verify-commit.c  |  4 +--
 bundle.c |  2 +-
 cache.h  | 10 +++---
 combine-diff.c   | 11 ---
 commit.c | 22 +++---
 commit.h | 10 +++---
 config.c |  2 +-
 convert.c| 18 +--
 delta.h  | 20 ++--
 diff-delta.c |  4 +--
 diff.c   | 30 +-
 diff.h   |  2 +-
 diffcore-pickaxe.c   |  4 +--
 diffcore.h   |  2 +-
 dir.c|  6 ++--
 dir.h|  2 +-
 entry.c  |  4 +--
 fast-import.c| 26 
 fsck.c   | 12 
 fsck.h   |  2 +-
 fuzz-pack-headers.c  |  4 +--
 grep.h   |  2 +-
 http-push.c  |  2 +-
 list-objects-filter.c|  2 +-
 mailmap.c|  2 +-
 match-trees.c|  4 +--
 merge-blobs.c|  6 ++--
 merge-blobs.h|  2 +-
 merge-recursive.c|  4 +--
 notes-cache.c|  2 +-
 notes-merge.c|  4 +--
 notes.c  |  6 ++--
 object-store.h   | 20 ++--
 object.c |  4 +--
 object.h |  2 +-
 pack-check.c |  2 +-
 pack-objects.h   | 14 -
 pack.h   |  2 +-
 packfile.c   | 40 
 packfile.h   |  8 ++---
 patch-delta.c|  8 ++---
 range-diff.c |  2 +-
 read-cache.c | 48 ++---
 ref-filter.c | 30 +-
 remote-testsvn.c |  4 +--
 rerere.c |  2 +-
 sha1-file.c  | 66 
 sha1dc_git.c |  2 +-
 sha1dc_git.h |  2 +-
 streaming.c  | 12 
 streaming.h  |  2 +-
 submodule-config.c   |  2 +-
 t/helper/test-delta.c|  2 +-
 tag.c|  6 ++--
 tag.h|  2 +-
 tree-walk.c  | 14 -
 tree.c   |  2 +-
 xdiff-interface.c|  4 +--
 xdiff-interface.h|  4 +--
 85 files changed, 391 insertions(+), 384 deletions(-)

diff --git a/apply.c b/apply.c
index 3703bfc8d0..5e11b85d17 100644
--- a/apply.c
+++ b/apply.c
@@ -3096,7 +3096,7 @@ static int apply_binary_fragment(struct apply_state 
*state,
 struct patch *patch)
 {
struct fragment *fragment = patch->fragments;
-   unsigned long len;
+   size_t len;

[PATCH v2 1/1] Upcast size_t variables to uintmax_t when printing

2018-11-10 Thread tboegi

From: Torsten Bögershausen 

When printing variables which contain a size, today "unsigned long"
is used at many places.
In order to be able to change the type from "unsigned long" into size_t
some day in the future, we need to have a way to print 64 bit variables
on a system that has "unsigned long" defined to be 32 bit, like Win64.

Upcast all those variables into uintmax_t before they are printed.
This is to prepare for a bigger change, when "unsigned long"
will be converted into size_t for variables which may be > 4Gib.

Signed-off-by: Torsten Bögershausen 
---

Changes since V1:
- fixed typos in the commit message, thanks to Eric Sunshine for careful
reading

Applying it on pu gives 1 conflict from the index/repo changes,
which should be easy to fix.

archive-tar.c  |  2 +-
 builtin/cat-file.c |  4 ++--
 builtin/fast-export.c  |  2 +-
 builtin/index-pack.c   |  9 +
 builtin/ls-tree.c  |  2 +-
 builtin/pack-objects.c | 12 ++--
 diff.c |  2 +-
 fast-import.c  |  4 ++--
 http-push.c|  2 +-
 ref-filter.c   |  2 +-
 sha1-file.c|  6 +++---
 11 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 7a535cba24..a58e1a8ebf 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -202,7 +202,7 @@ static void prepare_header(struct archiver_args *args,
   unsigned int mode, unsigned long size)
 {
xsnprintf(header->mode, sizeof(header->mode), "%07o", mode & 07777);
-   xsnprintf(header->size, sizeof(header->size), "%011lo", S_ISREG(mode) ? 
size : 0);
+   xsnprintf(header->size, sizeof(header->size), "%011"PRIoMAX , 
S_ISREG(mode) ? (uintmax_t)size : (uintmax_t)0);
xsnprintf(header->mtime, sizeof(header->mtime), "%011lo", (unsigned 
long) args->time);
 
xsnprintf(header->uid, sizeof(header->uid), "%07o", 0);
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8d97c84725..05decee33f 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -92,7 +92,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
oi.sizep = &size;
if (oid_object_info_extended(the_repository, &oid, &oi, flags) 
< 0)
die("git cat-file: could not get object info");
-   printf("%lu\n", size);
+   printf("%"PRIuMAX"\n", (uintmax_t)size);
return 0;
 
case 'e':
@@ -238,7 +238,7 @@ static void expand_atom(struct strbuf *sb, const char 
*atom, int len,
if (data->mark_query)
data->info.sizep = &data->size;
else
-   strbuf_addf(sb, "%lu", data->size);
+   strbuf_addf(sb, "%"PRIuMAX , (uintmax_t)data->size);
} else if (is_atom("objectsize:disk", atom, len)) {
if (data->mark_query)
data->info.disk_sizep = &data->disk_size;
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 456797c12a..5790f0d554 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -253,7 +253,7 @@ static void export_blob(const struct object_id *oid)
 
mark_next_object(object);
 
-   printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+   printf("blob\nmark :%"PRIu32"\ndata %"PRIuMAX"\n", last_idnum, 
(uintmax_t)size);
if (size && fwrite(buf, size, 1, stdout) != 1)
die_errno("could not write blob '%s'", oid_to_hex(oid));
printf("\n");
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da2..2a8ada432b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -450,7 +450,8 @@ static void *unpack_entry_data(off_t offset, unsigned long 
size,
int hdrlen;
 
if (!is_delta_type(type)) {
-   hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %lu", type_name(type), 
size) + 1;
+   hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX,
+  type_name(type),(uintmax_t)size) + 1;
the_hash_algo->init_fn(&c);
the_hash_algo->update_fn(&c, hdr, hdrlen);
} else
@@ -1628,10 +1629,10 @@ static void show_pack_info(int stat_only)
chain_histogram[obj_stat[i].delta_depth - 1]++;
if (stat_only)
continue;
-   printf("%s %-6s %lu %lu %"PRIuMAX,
+   printf("%s %-6s %"PRIuMAX" %"PRIuMAX" %"PRIuMAX,
   oid_to_hex(&obj->idx.oid),
-  type_name(obj->real_type), obj->size,
-  (unsigned long)(obj[1].idx.offset - obj->idx.offset),
+  type_name(obj->real_type), (uintmax_t)obj->size,
+  (uintmax_t)(obj[1].idx.offset - obj->idx.offset),
   (uintmax_t)obj->idx.offset);
if (is_delta_type(obj->type)) {
struct object_entry *bobj

[PATCH v2 1/1] remote-curl.c: xcurl_off_t is not portable (on 32 bit platforms)

2018-11-09 Thread tboegi

From: Torsten Bögershausen 

When  setting
DEVELOPER = 1
DEVOPTS = extra-all

"gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516" errors out with
"comparison is always false due to limited range of data type"
"[-Werror=type-limits]"

It turns out that the function xcurl_off_t() has 2 problems:
- It gives a warning on 32 bit systems, like Linux
- It takes the signed ssize_t as a parameter, but the only caller is using
  a size_t (which is typically unsigned these days)

The original motivation of this function is to make sure that sizes > 2GiB
are handled correctly. The curl documentation says:
"For any given platform/compiler curl_off_t must be typedef'ed to a 64-bit
 wide signed integral data type"
On a 32 bit system "size_t" can be promoted into a 64 bit signed value
without loss of data, and therefore we may see the
"comparison is always false" warning.
On a 64 bit system it may happen, at least in theory, that size_t is > 2^63,
and then the promotion from an unsigned "size_t" into a signed "curl_off_t"
may be a problem.

One solution to suppress a possible compiler warning could be to remove
the function xcurl_off_t().

However, to be on the very safe side, we keep it and improve it:
- The len parameter is changed from ssize_t to size_t
- A temporary variable "size" is used, promoted into uintmax_t and then compared
  with "maximum_signed_value_of_type(curl_off_t)".
  Thanks to Junio C Hamano for this hint.

Signed-off-by: Torsten Bögershausen 
---

This is a re-send; the original patch was part of a 2
patch-series.
This patch needed some rework, and here should be
the polished version.

remote-curl.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index 762a55a75f..1220dffcdc 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -617,10 +617,11 @@ static int probe_rpc(struct rpc_state *rpc, struct 
slot_results *results)
return err;
 }
 
-static curl_off_t xcurl_off_t(ssize_t len) {
-   if (len > maximum_signed_value_of_type(curl_off_t))
+static curl_off_t xcurl_off_t(size_t len) {
+   uintmax_t size = len;
+   if (size > maximum_signed_value_of_type(curl_off_t))
die("cannot handle pushes this big");
-   return (curl_off_t) len;
+   return (curl_off_t)size;
 }
 
 static int post_rpc(struct rpc_state *rpc)
-- 
2.19.0.271.gfe8321ec05

[PATCH v1 1/1] Upcast size_t variables to uintmax_t when printing

2018-11-09 Thread tboegi

From: Torsten Bögershausen 

When printing variables which contain a size, today "unsigned long"
is used at many places.
In order to be able to change the type from "unsigned long" into size_t
some day in the future, we need to have a way to print 64 bit variables
on a system that has "unsigned long" defined to be 32 bit, like Win64.

Upcast all those variables into uintmax_t before they are printed.
This is to prepare for a bigger change, when "unsigned long"
will be converted into size_t for variables which may be > 4Gib.

Signed-off-by: Torsten Bögershausen 
---
 archive-tar.c  |  2 +-
 builtin/cat-file.c |  4 ++--
 builtin/fast-export.c  |  2 +-
 builtin/index-pack.c   |  9 +
 builtin/ls-tree.c  |  2 +-
 builtin/pack-objects.c | 12 ++--
 diff.c |  2 +-
 fast-import.c  |  4 ++--
 http-push.c|  2 +-
 ref-filter.c   |  2 +-
 sha1-file.c|  6 +++---
 11 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 7a535cba24..a58e1a8ebf 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -202,7 +202,7 @@ static void prepare_header(struct archiver_args *args,
   unsigned int mode, unsigned long size)
 {
xsnprintf(header->mode, sizeof(header->mode), "%07o", mode & 07777);
-   xsnprintf(header->size, sizeof(header->size), "%011lo", S_ISREG(mode) ? 
size : 0);
+   xsnprintf(header->size, sizeof(header->size), "%011"PRIoMAX , 
S_ISREG(mode) ? (uintmax_t)size : (uintmax_t)0);
xsnprintf(header->mtime, sizeof(header->mtime), "%011lo", (unsigned 
long) args->time);
 
xsnprintf(header->uid, sizeof(header->uid), "%07o", 0);
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 8d97c84725..05decee33f 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -92,7 +92,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name,
oi.sizep = &size;
if (oid_object_info_extended(the_repository, &oid, &oi, flags) 
< 0)
die("git cat-file: could not get object info");
-   printf("%lu\n", size);
+   printf("%"PRIuMAX"\n", (uintmax_t)size);
return 0;
 
case 'e':
@@ -238,7 +238,7 @@ static void expand_atom(struct strbuf *sb, const char 
*atom, int len,
if (data->mark_query)
data->info.sizep = &data->size;
else
-   strbuf_addf(sb, "%lu", data->size);
+   strbuf_addf(sb, "%"PRIuMAX , (uintmax_t)data->size);
} else if (is_atom("objectsize:disk", atom, len)) {
if (data->mark_query)
data->info.disk_sizep = &data->disk_size;
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 456797c12a..5790f0d554 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -253,7 +253,7 @@ static void export_blob(const struct object_id *oid)
 
mark_next_object(object);
 
-   printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+   printf("blob\nmark :%"PRIu32"\ndata %"PRIuMAX"\n", last_idnum, 
(uintmax_t)size);
if (size && fwrite(buf, size, 1, stdout) != 1)
die_errno("could not write blob '%s'", oid_to_hex(oid));
printf("\n");
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da2..2a8ada432b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -450,7 +450,8 @@ static void *unpack_entry_data(off_t offset, unsigned long 
size,
int hdrlen;
 
if (!is_delta_type(type)) {
-   hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %lu", type_name(type), 
size) + 1;
+   hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX,
+  type_name(type),(uintmax_t)size) + 1;
the_hash_algo->init_fn(&c);
the_hash_algo->update_fn(&c, hdr, hdrlen);
} else
@@ -1628,10 +1629,10 @@ static void show_pack_info(int stat_only)
chain_histogram[obj_stat[i].delta_depth - 1]++;
if (stat_only)
continue;
-   printf("%s %-6s %lu %lu %"PRIuMAX,
+   printf("%s %-6s %"PRIuMAX" %"PRIuMAX" %"PRIuMAX,
   oid_to_hex(&obj->idx.oid),
-  type_name(obj->real_type), obj->size,
-  (unsigned long)(obj[1].idx.offset - obj->idx.offset),
+  type_name(obj->real_type), (uintmax_t)obj->size,
+  (uintmax_t)(obj[1].idx.offset - obj->idx.offset),
   (uintmax_t)obj->idx.offset);
if (is_delta_type(obj->type)) {
struct object_entry *bobj = 
&objects[obj_stat[i].base_object_no];
diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
index fe3b952cb3..7d581d6463 100644
--- a/builtin/ls-tree.c
+++ b/builtin/ls-tree.c
@@ -100,7 +100,

[PATCH v2 1/1] remote-curl.c: xcurl_off_t is not portable (on 32 bit platforms)

2018-10-29 Thread tboegi

From: Torsten Bögershausen 

When  setting
DEVELOPER = 1
DEVOPTS = extra-all

"gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516" errors out with
"comparison is always false due to limited range of data type"
"[-Werror=type-limits]"

It turns out that the function xcurl_off_t() has 2 problems:
- It gives a warning on 32 bit systems, like Linux
- It takes the signed ssize_t as a parameter, but the only caller is using
  a size_t (which is typically unsigned these days)

The original motivation of this function is to make sure that sizes > 2GiB
are handled correctly. The curl documentation says:
"For any given platform/compiler curl_off_t must be typedef'ed to a 64-bit
 wide signed integral data type"
On a 32 bit system "size_t" can be promoted into a 64 bit signed value
without loss of data, and therefore we may see the
"comparison is always false" warning.
On a 64 bit system it may happen, at least in theory, that size_t is > 2^63,
and then the promotion from an unsigned "size_t" into a signed "curl_off_t"
may be a problem.

One solution to suppress a possible compiler warning could be to remove
the function xcurl_off_t().

However, to be on the very safe side, we keep it and improve it:
- The len parameter is changed from ssize_t to size_t
- A temporary variable "size" is used, promoted into uintmax_t and then compared
  with "maximum_signed_value_of_type(curl_off_t)".
  Thanks to Junio C Hamano for this hint.

Signed-off-by: Torsten Bögershausen 
---
 remote-curl.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index 762a55a75f..1220dffcdc 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -617,10 +617,11 @@ static int probe_rpc(struct rpc_state *rpc, struct 
slot_results *results)
return err;
 }
 
-static curl_off_t xcurl_off_t(ssize_t len) {
-   if (len > maximum_signed_value_of_type(curl_off_t))
+static curl_off_t xcurl_off_t(size_t len) {
+   uintmax_t size = len;
+   if (size > maximum_signed_value_of_type(curl_off_t))
die("cannot handle pushes this big");
-   return (curl_off_t) len;
+   return (curl_off_t)size;
 }
 
 static int post_rpc(struct rpc_state *rpc)
-- 
2.19.0.271.gfe8321ec05

[PATCH v1 1/2] path.c: char is not (always) signed

2018-10-25 Thread tboegi

From: Torsten Bögershausen 

Whether a "char" in C is signed or unsigned is not specified; by
tradition it is "implementation defined".
Therefore constructs like "if (name[i] < 0)" are not portable,
use "if (name[i] & 0x80)" instead.

Detected by "gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516" when
setting
DEVELOPER = 1
DEVOPTS = extra-all

Signed-off-by: Torsten Bögershausen 
---
 path.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/path.c b/path.c
index 34f0f98349..ba06ec5b2d 100644
--- a/path.c
+++ b/path.c
@@ -1369,7 +1369,7 @@ static int is_ntfs_dot_generic(const char *name,
saw_tilde = 1;
} else if (i >= 6)
return 0;
-   else if (name[i] < 0) {
+   else if (name[i] & 0x80) {
/*
 * We know our needles contain only ASCII, so we clamp
 * here to make the results of tolower() sane.
-- 
2.11.0

[PATCH v1 2/2] curl_off_t xcurl_off_t is not portable

2018-10-25 Thread tboegi

From: Torsten Bögershausen 

Comparing signed and unsigned values is not always portable.
When  setting
DEVELOPER = 1
DEVOPTS = extra-all

"gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516" errors out with
"comparison is always false due to limited range of data type"
"[-Werror=type-limits]"

Solution:
Use a valid cast & compare, similar to xsize_t()

Signed-off-by: Torsten Bögershausen 
---
 remote-curl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index 762a55a75f..c89fd6d1c3 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -618,9 +618,10 @@ static int probe_rpc(struct rpc_state *rpc, struct 
slot_results *results)
 }
 
 static curl_off_t xcurl_off_t(ssize_t len) {
-   if (len > maximum_signed_value_of_type(curl_off_t))
+   curl_off_t size = (curl_off_t) len;
+   if (len != (ssize_t) size)
die("cannot handle pushes this big");
-   return (curl_off_t) len;
+   return size;
 }
 
 static int post_rpc(struct rpc_state *rpc)
-- 
2.11.0

[PATCH/RFC v2 1/1] Use off_t instead of size_t for functions dealing with streamed checkin

2018-10-23 Thread tboegi

From: Torsten Bögershausen 

When streaming data from disk into a blob, it should be possible to commit
a file with a file size > 4 GiB using the streaming functionality in Git.
Because of the streaming there is no need to load the whole data into
memory at once.
Today this is not possible on e.g. a 32 bit Linux system.
There is no good reason to limit the length of the file by using a size_t
in the code, which is a 32 bit value on such a system.
Loosen this restriction and use off_t instead of size_t in the call chain.

Signed-off-by: Torsten Bögershausen 
---

This is a suggestion for V2, changing even sha1-file.c,
so that the whole patch makes more sense.
The initial commit of a >4Gib file was tested on a 32 bit system

I didn't remove the wrapper functions, as I don't know
what their purpose is.

And: The commit message may need some tweaking, though

bulk-checkin.c | 6 +++---
 bulk-checkin.h | 2 +-
 sha1-file.c| 5 ++---
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 409ecb566b..34dbf5c4ea 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -96,7 +96,7 @@ static int already_written(struct bulk_checkin_state *state, 
struct object_id *o
  */
 static int stream_to_pack(struct bulk_checkin_state *state,
  git_hash_ctx *ctx, off_t *already_hashed_to,
- int fd, size_t size, enum object_type type,
+ int fd, off_t size, enum object_type type,
  const char *path, unsigned flags)
 {
git_zstream s;
@@ -189,7 +189,7 @@ static void prepare_to_stream(struct bulk_checkin_state 
*state,
 
 static int deflate_to_pack(struct bulk_checkin_state *state,
   struct object_id *result_oid,
-  int fd, size_t size,
+  int fd, off_t size,
   enum object_type type, const char *path,
   unsigned flags)
 {
@@ -258,7 +258,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 }
 
 int index_bulk_checkin(struct object_id *oid,
-  int fd, size_t size, enum object_type type,
+  int fd, off_t size, enum object_type type,
   const char *path, unsigned flags)
 {
int status = deflate_to_pack(&state, oid, fd, size, type,
diff --git a/bulk-checkin.h b/bulk-checkin.h
index f438f93811..09b2affdf3 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -7,7 +7,7 @@
 #include "cache.h"
 
 extern int index_bulk_checkin(struct object_id *oid,
- int fd, size_t size, enum object_type type,
+ int fd, off_t size, enum object_type type,
  const char *path, unsigned flags);
 
 extern void plug_bulk_checkin(void);
diff --git a/sha1-file.c b/sha1-file.c
index a4367b8f04..98d0f50ffa 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1934,7 +1934,7 @@ static int index_core(struct object_id *oid, int fd, 
size_t size,
  * binary blobs, they generally do not want to get any conversion, and
  * callers should avoid this code path when filters are requested.
  */
-static int index_stream(struct object_id *oid, int fd, size_t size,
+static int index_stream(struct object_id *oid, int fd, off_t size,
enum object_type type, const char *path,
unsigned flags)
 {
@@ -1959,8 +1959,7 @@ int index_fd(struct object_id *oid, int fd, struct stat 
*st,
ret = index_core(oid, fd, xsize_t(st->st_size), type, path,
 flags);
else
-   ret = index_stream(oid, fd, xsize_t(st->st_size), type, path,
-  flags);
+   ret = index_stream(oid, fd, st->st_size, type, path, flags);
close(fd);
return ret;
 }
-- 
2.11.0

[PATCH v1 1/1] index_bulk_checkin(): Take off_t, not size_t

2018-10-18 Thread tboegi

From: Torsten Bögershausen 

When streaming data from disk into a blob, use off_t instead of
size_t, which is a better choice for file length.

Signed-off-by: Torsten Bögershausen 
---

This is based on an old patch from 2017, which never made it to the list.
I think it makes sense to have off_t/size_t more consistent,
reviews/comments are welcome.

bulk-checkin.c | 4 ++--
 bulk-checkin.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 409ecb566b..2631e82d6c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -189,7 +189,7 @@ static void prepare_to_stream(struct bulk_checkin_state 
*state,
 
 static int deflate_to_pack(struct bulk_checkin_state *state,
   struct object_id *result_oid,
-  int fd, size_t size,
+  int fd, off_t size,
   enum object_type type, const char *path,
   unsigned flags)
 {
@@ -258,7 +258,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 }
 
 int index_bulk_checkin(struct object_id *oid,
-  int fd, size_t size, enum object_type type,
+  int fd, off_t size, enum object_type type,
   const char *path, unsigned flags)
 {
int status = deflate_to_pack(&state, oid, fd, size, type,
diff --git a/bulk-checkin.h b/bulk-checkin.h
index f438f93811..09b2affdf3 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -7,7 +7,7 @@
 #include "cache.h"
 
 extern int index_bulk_checkin(struct object_id *oid,
- int fd, size_t size, enum object_type type,
+ int fd, off_t size, enum object_type type,
  const char *path, unsigned flags);
 
 extern void plug_bulk_checkin(void);
-- 
2.19.0.271.gfe8321ec05

[PATCH v2 1/1] zlib.c: use size_t for size

2018-10-12 Thread tboegi

From: Martin Koegler 

Signed-off-by: Martin Koegler 
Signed-off-by: Junio C Hamano 
Signed-off-by: Torsten Bögershausen 
---

After doing a review, I decided to send the result as a patch.
In general, the changes from off_t to size_t do not seem to be really
motivated.
But if they are, they could and should go into a patch of their own.
For the moment, change only "unsigned long" into size_t, that's all.

 builtin/pack-objects.c |  8 
 cache.h| 10 +-
 pack-check.c   |  4 ++--
 packfile.h |  2 +-
 wrapper.c  |  8 
 zlib.c |  8 
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e6316d294d..23c4cd8c77 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -269,12 +269,12 @@ static void copy_pack_data(struct hashfile *f,
off_t len)
 {
unsigned char *in;
-   unsigned long avail;
+   size_t avail;
 
while (len) {
in = use_pack(p, w_curs, offset, &avail);
if (avail > len)
-   avail = (unsigned long)len;
+   avail = xsize_t(len);
hashwrite(f, in, avail);
offset += avail;
len -= avail;
@@ -1478,8 +1478,8 @@ static void check_object(struct object_entry *entry)
struct pack_window *w_curs = NULL;
const unsigned char *base_ref = NULL;
struct object_entry *base_entry;
-   unsigned long used, used_0;
-   unsigned long avail;
+   size_t used, used_0;
+   size_t avail;
off_t ofs;
unsigned char *buf, c;
enum object_type type;
diff --git a/cache.h b/cache.h
index d508f3d4f8..fce53fe620 100644
--- a/cache.h
+++ b/cache.h
@@ -20,10 +20,10 @@
 #include <zlib.h>
 typedef struct git_zstream {
z_stream z;
-   unsigned long avail_in;
-   unsigned long avail_out;
-   unsigned long total_in;
-   unsigned long total_out;
+   size_t avail_in;
+   size_t avail_out;
+   size_t total_in;
+   size_t total_out;
unsigned char *next_in;
unsigned char *next_out;
 } git_zstream;
@@ -40,7 +40,7 @@ void git_deflate_end(git_zstream *);
 int git_deflate_abort(git_zstream *);
 int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
-unsigned long git_deflate_bound(git_zstream *, unsigned long);
+size_t git_deflate_bound(git_zstream *, size_t);
 
 /* The length in bytes and in hex digits of an object name (SHA-1 value). */
 #define GIT_SHA1_RAWSZ 20
diff --git a/pack-check.c b/pack-check.c
index fa5f0ff8fa..d1e7f554ae 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -33,7 +33,7 @@ int check_pack_crc(struct packed_git *p, struct pack_window 
**w_curs,
uint32_t data_crc = crc32(0, NULL, 0);
 
do {
-   unsigned long avail;
+   size_t avail;
void *data = use_pack(p, w_curs, offset, &avail);
if (avail > len)
avail = len;
@@ -68,7 +68,7 @@ static int verify_packfile(struct packed_git *p,
 
the_hash_algo->init_fn(&ctx);
do {
-   unsigned long remaining;
+   size_t remaining;
unsigned char *in = use_pack(p, w_curs, offset, &remaining);
offset += remaining;
if (!pack_sig_ofs)
diff --git a/packfile.h b/packfile.h
index 442625723d..e2daf63426 100644
--- a/packfile.h
+++ b/packfile.h
@@ -78,7 +78,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
-extern unsigned char *use_pack(struct packed_git *, struct pack_window **, 
off_t, unsigned long *);
+extern unsigned char *use_pack(struct packed_git *, struct pack_window **, 
off_t, size_t *);
 extern void close_pack_windows(struct packed_git *);
 extern void close_pack(struct packed_git *);
 extern void close_all_packs(struct raw_object_store *o);
diff --git a/wrapper.c b/wrapper.c
index e4fa9d84cd..1a510bd6fc 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -67,11 +67,11 @@ static void *do_xmalloc(size_t size, int gentle)
ret = malloc(1);
if (!ret) {
if (!gentle)
-   die("Out of memory, malloc failed (tried to 
allocate %lu bytes)",
-   (unsigned long)size);
+   die("Out of memory, malloc failed (tried to 
allocate %" PRIuMAX " bytes)",
+   (uintmax_t)size);
else {
-   error("Out of memory, malloc failed (tried to 
allocate %lu bytes)",
- (unsigned long)size);
+   error("Out of memory, malloc failed (tried to 
allocate %" PRIuMA

[PATCH v1 1/1] Make git_check_attr() a void function

2018-09-12 Thread tboegi

From: Torsten Bögershausen 

git_check_attr() always returns 0.
Remove all the error handling code of the callers, which is never executed.
Change git_check_attr() to be a void function.

Signed-off-by: Torsten Bögershausen 
---
 archive.c  |  3 ++-
 attr.c |  8 +++-
 attr.h |  4 ++--
 builtin/check-attr.c   |  3 +--
 builtin/pack-objects.c |  3 +--
 convert.c  | 42 ++--
 ll-merge.c | 16 +++
 userdiff.c |  3 +--
 ws.c   | 44 +++---
 9 files changed, 57 insertions(+), 69 deletions(-)

diff --git a/archive.c b/archive.c
index 0a07b140fe..c1870105eb 100644
--- a/archive.c
+++ b/archive.c
@@ -110,7 +110,8 @@ static const struct attr_check *get_archive_attrs(struct 
index_state *istate,
static struct attr_check *check;
if (!check)
check = attr_check_initl("export-ignore", "export-subst", NULL);
-   return git_check_attr(istate, path, check) ? NULL : check;
+   git_check_attr(istate, path, check);
+   return check;
 }
 
 static int check_attr_export_ignore(const struct attr_check *check)
diff --git a/attr.c b/attr.c
index 98e4953f6e..60d284796d 100644
--- a/attr.c
+++ b/attr.c
@@ -1143,9 +1143,9 @@ static void collect_some_attrs(const struct index_state 
*istate,
fill(path, pathlen, basename_offset, check->stack, check->all_attrs, 
rem);
 }
 
-int git_check_attr(const struct index_state *istate,
-  const char *path,
-  struct attr_check *check)
+void git_check_attr(const struct index_state *istate,
+   const char *path,
+   struct attr_check *check)
 {
int i;
 
@@ -1158,8 +1158,6 @@ int git_check_attr(const struct index_state *istate,
value = ATTR__UNSET;
check->items[i].value = value;
}
-
-   return 0;
 }
 
 void git_all_attrs(const struct index_state *istate,
diff --git a/attr.h b/attr.h
index 2be86db36e..b0378bfe5f 100644
--- a/attr.h
+++ b/attr.h
@@ -63,8 +63,8 @@ void attr_check_free(struct attr_check *check);
  */
 const char *git_attr_name(const struct git_attr *);
 
-int git_check_attr(const struct index_state *istate,
-  const char *path, struct attr_check *check);
+void git_check_attr(const struct index_state *istate,
+   const char *path, struct attr_check *check);
 
 /*
  * Retrieve all attributes that apply to the specified path.
diff --git a/builtin/check-attr.c b/builtin/check-attr.c
index c05573ff9c..30a2f84274 100644
--- a/builtin/check-attr.c
+++ b/builtin/check-attr.c
@@ -65,8 +65,7 @@ static void check_attr(const char *prefix,
if (collect_all) {
git_all_attrs(&the_index, full_path, check);
} else {
-   if (git_check_attr(&the_index, full_path, check))
-   die("git_check_attr died");
+   git_check_attr(&the_index, full_path, check);
}
output_attr(check, file);
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d1144a8f7e..eb71dab5be 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -951,8 +951,7 @@ static int no_try_delta(const char *path)
 
if (!check)
check = attr_check_initl("delta", NULL);
-   if (git_check_attr(&the_index, path, check))
-   return 0;
+   git_check_attr(&the_index, path, check);
if (ATTR_FALSE(check->items[0].value))
return 1;
return 0;
diff --git a/convert.c b/convert.c
index 6057f1f580..e0848226d2 100644
--- a/convert.c
+++ b/convert.c
@@ -1297,6 +1297,7 @@ static void convert_attrs(const struct index_state *istate,
  struct conv_attrs *ca, const char *path)
 {
static struct attr_check *check;
+   struct attr_check_item *ccheck = NULL;
 
if (!check) {
check = attr_check_initl("crlf", "ident", "filter",
@@ -1306,30 +1307,25 @@ static void convert_attrs(const struct index_state *istate,
git_config(read_convert_config, NULL);
}
 
-   if (!git_check_attr(istate, path, check)) {
-   struct attr_check_item *ccheck = check->items;
-   ca->crlf_action = git_path_check_crlf(ccheck + 4);
-   if (ca->crlf_action == CRLF_UNDEFINED)
-   ca->crlf_action = git_path_check_crlf(ccheck + 0);
-   ca->ident = git_path_check_ident(ccheck + 1);
-   ca->drv = git_path_check_convert(ccheck + 2);
-   if (ca->crlf_action != CRLF_BINARY) {
-   enum eol eol_attr = git_path_check_eol(ccheck + 3);
-   if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_LF)
-   ca->crlf_action = CRLF_AUTO_INPUT;
-   else if (ca->crlf_action == CRLF_AUTO && eol_attr ==

[PATCH v1 1/1] test: Correct detection of UTF8_NFD_TO_NFC for APFS

2018-04-29 Thread tboegi

From: Torsten Bögershausen 

On HFS (which is the default Mac filesystem prior to High Sierra),
unicode names are "decomposed" before recording.
On APFS, which appears to be the new default filesystem in Mac OS High
Sierra, filenames are recorded as specified by the user.

APFS still lets the user access a file via any name
that normalizes to the same form.

This difference causes t0050-filesystem.sh to fail two tests.

Improve the NFD/NFC test in test-lib.sh:
test whether the same file can be reached under both its precomposed
and its decomposed Unicode name.
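
The probe described above can be sketched in C as follows. This is only an
illustration under assumptions, not the code used by the test suite: the
function name nfd_reaches_nfc_file() is hypothetical, and the byte
sequences are the same ones the shell test builds with printf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* "ä" as one precomposed code point U+00E4 (NFC), UTF-8 encoded. */
static const char auml_nfc[] = "\303\244";
/* "a" followed by combining diaeresis U+0308 (NFD), UTF-8 encoded. */
static const char auml_nfd[] = "\141\314\210";

/*
 * Create a file under the precomposed name in the current directory,
 * then ask whether it is readable under the decomposed name.
 * Returns 1 if both names reach the same file, 0 if not, -1 on error.
 */
static int nfd_reaches_nfc_file(void)
{
	FILE *fp;
	int reachable;

	/* The probe is meaningless unless the two spellings differ bytewise. */
	assert(strcmp(auml_nfc, auml_nfd) != 0);

	fp = fopen(auml_nfc, "w");
	if (!fp)
		return -1;
	fclose(fp);
	reachable = access(auml_nfd, R_OK) == 0;
	remove(auml_nfc);
	return reachable;
}
```

A filesystem that treats both spellings as the same name is expected to
return 1 here, whether it stores the decomposed form (HFS) or the name as
typed (APFS). A filesystem that compares names byte by byte is expected to
return 0.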

Reported-By: Elijah Newren 
Signed-off-by: Torsten Bögershausen 
---
 t/test-lib.sh | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/t/test-lib.sh b/t/test-lib.sh
index ea2bbaaa7a..e206250d1b 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1106,12 +1106,7 @@ test_lazy_prereq UTF8_NFD_TO_NFC '
auml=$(printf "\303\244")
aumlcdiar=$(printf "\141\314\210")
>"$auml" &&
-   case "$(echo *)" in
-   "$aumlcdiar")
-   true ;;
-   *)
-   false ;;
-   esac
+   test -r "$aumlcdiar"
 '
 
 test_lazy_prereq AUTOIDENT '
-- 
2.16.0.rc0.8.g5497051b43

[no subject]

2018-02-26 Thread tboegi

From 9f7d43f29eaf6017b7b16261ce91d8ef182cf415 Mon Sep 17 00:00:00 2001
In-Reply-To: <20171218131249.gb4...@sigill.intra.peff.net>
References: <20171218131249.gb4...@sigill.intra.peff.net>
From: =?UTF-8?q?Torsten=20B=C3=B6gershausen?= 
Date: Fri, 23 Feb 2018 20:53:34 +0100
Subject: [PATCH 0/1] Auto diff of UTF-16 files in UTF-8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Make it possible to show a user-readable diff for UTF-16 encoded files.
This would replace the "binary files differ" message with something useful,
without breaking anything for existing users (?).
For future repos the w-t-e encoding can be used, which allows e.g. easier
merging.
People who stick to native UTF-16 because they need compatibility
with e.g. libgit2 can still get a readable diff.
Opinions?

Torsten Bögershausen (1):
  Auto diff of UTF-16 files in UTF-8

 diff.c   | 43 -
 diffcore.h   |  3 ++
 t/t4066-diff-encoding.sh | 98 
 utf8.h   | 11 ++
 4 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

-- 
2.16.1.194.gb2e45c695d

[PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8

2018-02-26 Thread tboegi

From: Torsten Bögershausen 

When a UTF-16 file is committed and later changed, `git diff` shows
"Binary files XX and YY differ".

When the user wants a diff in UTF-8, a textconv needs to be specified
in .gitattributes and the textconv must be configured.

A more user-friendly diff can be produced for UTF-16 if
- the user did not use `git diff --binary`
- the blob is identified as binary
- the blob has a UTF-16 BOM
- the blob can be converted into UTF-8

Enhance the diff machinery to auto-detect UTF-16 blobs and show them
as UTF-8, unless the user specifies `git diff --binary` which creates
a binary diff.
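
The first three conditions in the list above can be sketched as a small
stand-alone check. All names below are hypothetical and the binary test is
a simplified stand-in (any NUL byte counts as binary); the fourth
condition, a successful conversion to UTF-8, is left out of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with a UTF-16 BOM in either byte order. */
static int starts_with_utf16_bom(const char *data, size_t len)
{
	static const char be_bom[] = { '\xFE', '\xFF' };
	static const char le_bom[] = { '\xFF', '\xFE' };

	assert(data != NULL || len == 0);
	if (len < 2)
		return 0;
	return !memcmp(data, be_bom, 2) || !memcmp(data, le_bom, 2);
}

/* Simplified stand-in for a binary check: any NUL byte counts as binary. */
static int looks_binary(const char *data, size_t len)
{
	return len > 0 && memchr(data, '\0', len) != NULL;
}

/*
 * Returns 1 if the blob is a candidate for automatic UTF-16 to UTF-8
 * conversion: no binary diff requested, content looks binary, and it
 * starts with a UTF-16 BOM.
 */
static int wants_utf16_autoconv(int binary_requested,
				const char *data, size_t len)
{
	return !binary_requested &&
	       looks_binary(data, len) &&
	       starts_with_utf16_bom(data, len);
}
```

Note that a UTF-32LE BOM (FF FE 00 00) also starts with FF FE, so a check
like this one cannot tell the two apart; only the conversion step would
notice.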

Signed-off-by: Torsten Bögershausen 
---
 diff.c   | 43 -
 diffcore.h   |  3 ++
 t/t4066-diff-encoding.sh | 98 
 utf8.h   | 11 ++
 4 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

diff --git a/diff.c b/diff.c
index fb22b19f09..51831ee94d 100644
--- a/diff.c
+++ b/diff.c
@@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
strbuf_reset(&header);
}
 
+   if (one && one->reencoded_from_utf16)
+   strbuf_addf(&header, "a is converted to UTF-8 from UTF-16\n");
+   if (two && two->reencoded_from_utf16)
+   strbuf_addf(&header, "b is converted to UTF-8 from UTF-16\n");
mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
 
@@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
s->size = size;
s->should_free = 1;
}
-   }
-   else {
+   if (!s->binary && buffer_is_binary(s->data, s->size) &&
+   buffer_has_utf16_bom(s->data, s->size)) {
+   int outsz = 0;
+   char *outbuf;
+   outbuf = reencode_string_len(s->data, (int)s->size,
+"UTF-8", "UTF-16", &outsz);
+   if (outbuf) {
+   if (s->should_free)
+   free(s->data);
+   if (s->should_munmap)
+   munmap(s->data, s->size);
+   s->should_munmap = 0;
+   s->data = outbuf;
+   s->size = outsz;
+   s->reencoded_from_utf16 = 1;
+   s->should_free = 1;
+   }
+   }
+   } else {
enum object_type type;
if (size_only || (flags & CHECK_BINARY)) {
type = sha1_object_info(s->oid.hash, &s->size);
@@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
s->data = read_sha1_file(s->oid.hash, &type, &s->size);
if (!s->data)
die("unable to read %s", oid_to_hex(&s->oid));
+   if (!s->binary && buffer_is_binary(s->data, s->size) &&
+   buffer_has_utf16_bom(s->data, s->size)) {
+   int outsz = 0;
+   char *buf;
+   buf = reencode_string_len(s->data, (int)s->size,
+ "UTF-8", "UTF-16", &outsz);
+   if (buf) {
+   free(s->data);
+   s->data = buf;
+   s->size = outsz;
+   s->reencoded_from_utf16 = 1;
+   }
+   }
s->should_free = 1;
}
return 0;
@@ -5695,6 +5729,10 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
 
 static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
 {
+   if (p->binary) {
+   p->one->binary = 1;
+   p->two->binary = 1;
+   }
if (p->done_skip_stat_unmatch)
return p->skip_stat_unmatch_result;
 
@@ -5735,6 +5773,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
 
+   p->binary = diffopt->flags.binary;
if (diff_filespec_check_stat_unmatch(p))
diff_q(&outq, p);
else {
diff --git a/diffcore.h b/diffcore.h
index a30da161da..3cd97bb93b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -47,6 +47,8 @@ struct diff_filespec {
unsigned has_more_entries : 1; /* only appear in combined diff */
/* data should be considered "binary"; -1 means "don't know yet" */
signed int is_binary : 2;
+

[PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-01-29 Thread tboegi

From: Lars Schneider 

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.
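
A minimal sketch of why the assignment is redundant. The helper
mallocz_sketch() is a hypothetical stand-in for xmallocz(); unlike the
real helper it returns NULL on failure instead of dying.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for xmallocz(): allocate size + 1 bytes and make
 * sure the last one is NUL.
 */
static void *mallocz_sketch(size_t size)
{
	char *p;

	if (size == (size_t)-1)
		return NULL;	/* size + 1 would overflow */
	p = malloc(size + 1);
	if (p)
		p[size] = '\0';
	return p;
}

/* Lower-case copy of a string; the caller frees the result. */
static char *strdup_tolower_sketch(const char *string)
{
	size_t len, i;
	char *result;

	assert(string != NULL);
	len = strlen(string);
	result = mallocz_sketch(len);
	if (!result)
		return NULL;
	for (i = 0; i < len; i++)
		result[i] = (char)tolower((unsigned char)string[i]);
	/* No result[len] = '\0' needed: the allocator already wrote it. */
	return result;
}
```

The loop writes only indices 0 to len - 1, so the terminator placed by the
allocator at index len is never overwritten.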

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 8007be8fb..490f7850e 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 5/7] convert: add 'working-tree-encoding' attribute

2018-01-29 Thread tboegi

From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt  |  60 
 convert.c| 190 -
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 196 +++
 5 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81..a8dbf4be3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g. 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+
+*.txt  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+
+*.txt  working-tree-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968..0c372069b 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+   const char *name;
+   struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+   char *dst;
+   int dst_len;
+
+   /*
+* No encoding is specified or there is nothing to encode.
+* Tell the caller that the content was not modified.
+*/
+   if (!enc || (src && !src_len))
+   return 0;
+
+   /*
+* Looks like we got called from "would_convert_to_git()".
+* This means Git wants to know if it would encode (= modify!)
+* the content. Let's answer with "yes", since an encoding was
+* specified.
+*/
+   if (!buf && !src)
+

[PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-29 Thread tboegi

From: Torsten Bögershausen 

UTF-16 encoded files are treated as "binary" by Git, and no CRLF
conversion is done.
When UTF-16 encoded files are converted into UTF-8 using the new
"working-tree-encoding", the CRLF are converted if core.autocrlf is true.

This may lead to confusion:
A tool writes a UTF-16 encoded file with CRLF.
The file is committed with core.autocrlf=true, and the CRLF are converted
into LF.
The repo is pushed somewhere and cloned by a different user, who has
decided to use core.autocrlf=false.
That user runs the same tool, which now finds LF instead of the expected
CRLF, making the file useless for the tool.

Avoid this (possible) confusion by ignoring core.autocrlf for all files
which have "working-tree-encoding" defined.

The user can still use a .gitattributes file and specify the line endings
like "text=auto", "text", or "text eol=crlf" and let that .gitattributes
file travel together with push and clone.

Change convert.c to be more careful, simplify the initialization when
attributes are retrieved (and none are specified) and update the documentation.
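
The resulting decision can be modelled roughly as follows. This is a
reduced sketch, not the code in convert.c: the enum and the function
resolve_crlf() are hypothetical, core.autocrlf is reduced to a boolean,
and the eol attribute is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical model of the possible line-ending actions. */
enum crlf_sketch {
	CRLF_SK_UNDEFINED,	/* no text/eol attribute given */
	CRLF_SK_BINARY,		/* never touch line endings */
	CRLF_SK_TEXT,		/* "text" attribute set */
	CRLF_SK_AUTO_CRLF	/* follow core.autocrlf=true */
};

/*
 * from_attr: action derived from .gitattributes
 * encoding:  value of working-tree-encoding, or NULL if unset
 * autocrlf:  1 if core.autocrlf is true, 0 if false
 */
static enum crlf_sketch resolve_crlf(enum crlf_sketch from_attr,
				     const char *encoding, int autocrlf)
{
	assert(autocrlf == 0 || autocrlf == 1);
	if (from_attr != CRLF_SK_UNDEFINED)
		return from_attr;	/* explicit attributes always win */
	if (encoding)
		return CRLF_SK_BINARY;	/* encoded file: ignore core.autocrlf */
	return autocrlf ? CRLF_SK_AUTO_CRLF : CRLF_SK_BINARY;
}
```

With this ordering a file that has working-tree-encoding set keeps its
line endings on every machine unless .gitattributes says otherwise, which
is the behaviour the commit message argues for.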

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt |  9 ++---
 convert.c   | 15 ---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be3..3665c4677 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.
 
+Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
+Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
+
 Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+with byte order mark (BOM) and you want Git to perform line
+ending conversion based on core.eol.
 
 
-*.txt  text working-tree-encoding=UTF-16
+*.txt  working-tree-encoding=UTF-16 text
 
 
 Use the following attributes if your '*.txt' files are UTF-16 little
diff --git a/convert.c b/convert.c
index 13fad490c..e7f11d1db 100644
--- a/convert.c
+++ b/convert.c
@@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
}
ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
} else {
-   ca->drv = NULL;
-   ca->crlf_action = CRLF_UNDEFINED;
-   ca->ident = 0;
+   memset(ca, 0, sizeof(*ca));
}
 
/* Save attr and make a decision for action */
ca->attr_action = ca->crlf_action;
if (ca->crlf_action == CRLF_TEXT)
ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
+   /*
+* Often UTF-16 encoded files are read and written by programs which
+* really need CRLF, and it is important to keep the CRLF "as is" when
+* files are committed with core.autocrlf=true and the repo is pushed.
+* The CRLF would be converted into LF when the repo is cloned to
+* a machine with core.autocrlf=false.
+* Obey the "text" and "eol" attributes and be independent on the
+* local core.autocrlf for all "encoded" files.
+*/
+   if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
+   ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 2/7] strbuf: add xstrdup_toupper()

2018-01-29 Thread tboegi

From: Lars Schneider 

Create a copy of an existing string and make all characters upper case.
Similar to xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 490f7850e..a20af696b 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d6..df7ced53e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 6/7] convert: add tracing for 'working-tree-encoding' attribute

2018-01-29 Thread tboegi

From: Lars Schneider 

Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.
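
The kind of output such a trace produces can be sketched with a
stand-alone formatter. dump_bytes() is a hypothetical name; the sketch
writes into a caller-supplied buffer instead of a trace key and omits the
terminal escape sequences used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format each byte as "index: hex char", eight entries per line.
 * Non-printable bytes are shown as a blank. Returns the number of
 * characters written; stops early, keeping complete entries only, if
 * the output buffer is too small.
 */
static size_t dump_bytes(const char *buf, size_t len, char *out, size_t outsz)
{
	size_t i, used = 0;

	assert(out != NULL && outsz > 0);
	out[0] = '\0';
	for (i = 0; buf && i < len; i++) {
		unsigned char c = (unsigned char)buf[i];
		int n = snprintf(out + used, outsz - used, "%2zu: %2x %c%c",
				 i, (unsigned)c,
				 (c > 32 && c < 127) ? c : ' ',
				 ((i + 1) % 8 && (i + 1) < len) ? ' ' : '\n');

		if (n < 0 || (size_t)n >= outsz - used) {
			out[used] = '\0';	/* drop the partial entry */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

Casting to unsigned char before printing keeps bytes above 0x7f from being
sign-extended by "%x", which matters for UTF-16 content such as the BOM
bytes 0xff and 0xfe.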

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 convert.c| 28 
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/convert.c b/convert.c
index 0c372069b..13fad490c 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   &trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(&trace, '\n', 1);
+
+   trace_strbuf(&coe, &trace);
+   strbuf_release(&trace);
+}
+
 static struct encoding {
const char *name;
struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
error(error_msg, path, enc->name);
}
 
+   trace_encoding("source", path, enc->name, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc->name,
  &dst_len);
if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
else
error(msg, path, enc->name, default_encoding);
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
/*
 * UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -365,6 +390,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 enc->name, default_encoding,
 &re_src_len);
 
+   trace_encoding("reencoded source", path, enc->name,
+  re_src, re_src_len);
+
if (!re_src || src_len != re_src_len ||
memcmp(src, re_src, src_len)) {
const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 4d85b4277..0f36d4990 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
 test_expect_success 'setup test repo' '
git config core.eol lf &&
 
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

2018-01-29 Thread tboegi

From: Lars Schneider 

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
such a stream starts with a BOM anyway.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 utf8.c | 24 
 utf8.h |  9 +
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce013..914881cd1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a8..4711429af 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
   const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM

2018-01-29 Thread tboegi

From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.
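
Both BOM rules of this series (prohibited and missing) can be combined
into one stand-alone sketch. The enum and the function check_bom() are
hypothetical; the BOM byte values are the ones used in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const unsigned char utf16_be[] = { 0xFE, 0xFF };
static const unsigned char utf16_le[] = { 0xFF, 0xFE };
static const unsigned char utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };

enum bom_verdict { BOM_OK, BOM_PROHIBITED, BOM_MISSING };

static int has_prefix(const char *data, size_t len,
		      const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Encoding names that fix the byte order must not carry a BOM;
 * names that leave the byte order open must carry one.
 */
static enum bom_verdict check_bom(const char *enc, const char *data, size_t len)
{
	int bom16, bom32;

	assert(enc != NULL && data != NULL);
	bom16 = has_prefix(data, len, utf16_be, sizeof(utf16_be)) ||
		has_prefix(data, len, utf16_le, sizeof(utf16_le));
	bom32 = has_prefix(data, len, utf32_be, sizeof(utf32_be)) ||
		has_prefix(data, len, utf32_le, sizeof(utf32_le));

	if (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE"))
		return bom16 ? BOM_PROHIBITED : BOM_OK;
	if (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE"))
		return bom32 ? BOM_PROHIBITED : BOM_OK;
	if (!strcmp(enc, "UTF-16"))
		return bom16 ? BOM_OK : BOM_MISSING;
	if (!strcmp(enc, "UTF-32"))
		return bom32 ? BOM_OK : BOM_MISSING;
	return BOM_OK;	/* other encodings: no BOM rule applied here */
}
```

The comparison is case-sensitive, so the sketch assumes the caller has
already upper-cased the encoding name.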

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
 utf8.c | 13 +
 utf8.h | 16 
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1..f033fec1c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  !strcmp(enc, "UTF-16") &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  !strcmp(enc, "UTF-32") &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af..26b5e9185 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v5 0/7] convert: add support for different encodings

2018-01-29 Thread tboegi

From: Torsten Bögershausen 

This takes V4 from Lars with the V2 squash patch integrated manually,
so a review would be good.

My "comments" are added as a patch, see 7/7 (which is more like an RFC).

This needs to go on top of tb/crlf-conv-flags


Lars Schneider (6):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute

Torsten Bögershausen (1):
  Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

 Documentation/gitattributes.txt  |  63 +++
 convert.c| 233 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 strbuf.c |  13 ++-
 strbuf.h |   1 +
 t/t0028-working-tree-encoding.sh | 198 +
 utf8.c   |  37 +++
 utf8.h   |  25 +
 9 files changed, 567 insertions(+), 6 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty

[PATCH v1 1/1] convert_to_git(): safe_crlf/checksafe becomes int conv_flags

2018-01-13 Thread tboegi

From: Torsten Bögershausen 

When calling convert_to_git(), the checksafe parameter defined what
should happen if the EOL conversion (CRLF --> LF --> CRLF) does not
roundtrip cleanly. In addition, it also defined if line endings should
be renormalized (CRLF --> LF) or kept as they are.

checksafe was a safe_crlf enum with these values:
SAFE_CRLF_FALSE:   do nothing in case of EOL roundtrip errors
SAFE_CRLF_FAIL:die in case of EOL roundtrip errors
SAFE_CRLF_WARN:print a warning in case of EOL roundtrip errors
SAFE_CRLF_RENORMALIZE: change CRLF to LF
SAFE_CRLF_KEEP_CRLF:   keep all line endings as they are

In some cases the integer value 0 was passed as checksafe parameter
instead of the correct enum value SAFE_CRLF_FALSE. That was no problem
because SAFE_CRLF_FALSE is defined as 0.

FALSE/FAIL/WARN are different from RENORMALIZE and KEEP_CRLF. Therefore,
an enum is not ideal. Let's use an integer bit pattern instead and rename
the parameter to conv_flags to make it more generically usable. This
allows us to extend the bit pattern in a subsequent commit.
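
A minimal sketch of the idea: independent concerns get independent bits,
so they can be combined and tested separately. The names and bit values
below are assumptions made for illustration and are not taken from
convert.h.

```c
#include <assert.h>

/* Hypothetical flag bits; names and values are assumptions of this sketch. */
#define FLAG_EOL_RNDTRP_DIE	(1 << 0)	/* die on round-trip errors */
#define FLAG_EOL_RNDTRP_WARN	(1 << 1)	/* warn on round-trip errors */
#define FLAG_EOL_RENORMALIZE	(1 << 2)	/* change CRLF to LF */
#define FLAG_EOL_KEEP_CRLF	(1 << 3)	/* keep line endings as they are */

enum rndtrp_action { RNDTRP_IGNORE, RNDTRP_WARN, RNDTRP_DIE };

/* Decide what to do about a round-trip error; "die" wins over "warn". */
static enum rndtrp_action roundtrip_action(int conv_flags)
{
	assert(conv_flags >= 0);
	if (conv_flags & FLAG_EOL_RNDTRP_DIE)
		return RNDTRP_DIE;
	if (conv_flags & FLAG_EOL_RNDTRP_WARN)
		return RNDTRP_WARN;
	return RNDTRP_IGNORE;
}

/* The renormalize/keep decision is independent of the error handling. */
static int keeps_crlf(int conv_flags)
{
	return !!(conv_flags & FLAG_EOL_KEEP_CRLF);
}
```

Passing 0 now means "none of the flag bits", which makes a literal 0 at a
call site correct by construction.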

Reported-By: Randall S. Becker 
Helped-By: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
Signed-off-by: Lars Schneider 
---

 >I think this is being solved a bit differently with a1fbf854
 >("convert_to_git(): safe_crlf/checksafe becomes int conv_flags",
 >2018-01-06), and 0 becomes the right value to pass at this caller to
 >say "I am passing none of the flag bit".

 >I am hoping that the series that ends at f3b11d54 ("convert: add
 >support for 'checkout-encoding' attribute", 2018-01-06) will be
 >rerolled and hit 'master' early in the next cycle.

  Thanks for the report & suggested patch. After reading it, I suggest
  to break out the enum/int fix into an own "series".


apply.c|  6 +++---
 combine-diff.c |  2 +-
 config.c   |  7 +--
 convert.c  | 38 +++---
 convert.h  | 17 +++--
 diff.c |  8 
 environment.c  |  2 +-
 sha1_file.c| 12 ++--
 8 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68..f8b67bfee 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 const char *path, struct strbuf *buf)
 {
-   enum safe_crlf safe_crlf = patch->crlf_in_old ?
-   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+   int conv_flags = patch->crlf_in_old ?
+   CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
return 0;
default:
return -1;
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119..19f30c335 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
if (is_file) {
struct strbuf buf = STRBUF_INIT;
 
-   if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
+   if (convert_to_git(&the_index, elem->path, result, len, &buf, global_conv_flags_eol)) {
free(result);
result = strbuf_detach(&buf, &len);
result_size = len;
diff --git a/config.c b/config.c
index e617c2018..1f003fbb9 100644
--- a/config.c
+++ b/config.c
@@ -1149,11 +1149,14 @@ static int git_default_core_config(const char *var, const char *value)
}
 
if (!strcmp(var, "core.safecrlf")) {
+   int eol_rndtrp_die;
if (value && !strcasecmp(value, "warn")) {
-   safe_crlf = SAFE_CRLF_WARN;
+   global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
return 0;
}
-   safe_crlf = git_config_bool(var, value);
+   eol_rndtrp_die = git_config_bool(var, value);
+   global_conv_flags_eol = eol_rndtrp_die ?
+   CONV_EOL_RNDTRP_DIE : CONV_EOL_RNDTRP_WARN;
return 0;
}
 
diff --git a/convert.c b/convert.c
index 1a41a48e1..b976eb968 100644
--- a/convert.c
+++ b/convert.c
@@ -193,30 +193,30 @@ static enum eol output_eol(enum crlf_action crlf_action)
return core_eol;
 }
 
-static void check_safe_crlf(cons

[PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags

2018-01-01 Thread tboegi

From: Torsten Bögershausen 

When calling convert_to_git(), the checksafe parameter has been used to
check if commit would give a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Already today the integer value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

Turn the whole call chain to use an integer with single bits, which
can be extended in the next commits:
- The global configuration variable safe_crlf is now conv_flags_eol.
- The parameter checksafe is renamed into conv_flags.

Helped-By: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
This is my suggestion.
(1) The flag bits have been renamed.
(2) The (theoretical ?) mix of WARN/FAIL is still there,
I am not sure if this is a real problem.

(3) There are 2 reasons that CONV_EOL_RENORMALIZE is set.
Either in a renormalizing merge, or by running
git add --renormalize .
Therefore HASH_RENORMALIZE is not the same as CONV_EOL_RENORMALIZE.
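
To illustrate note (2), here is a minimal stand-alone sketch of how a
bit-flag check could resolve a mix of WARN and DIE. The numeric bit
values, the enum eol_action and the helper eol_roundtrip_action() are
assumptions made for this illustration; only the CONV_EOL_* names come
from the patch. It mirrors the order visible in check_conv_flags_eol(),
where the DIE bit is tested first.

```c
/* Bit values are assumed for this sketch; the patch excerpt only shows the names. */
#define CONV_EOL_RNDTRP_DIE   (1<<0)
#define CONV_EOL_RNDTRP_WARN  (1<<1)
#define CONV_EOL_RENORMALIZE  (1<<2)
#define CONV_EOL_KEEP_CRLF    (1<<3)

/* Hypothetical result type, not part of the patch. */
enum eol_action { EOL_ACTION_NONE, EOL_ACTION_WARN, EOL_ACTION_DIE };

/*
 * Decide what to do when line endings would not round-trip.
 * DIE is tested before WARN, so it wins if both bits are set.
 */
enum eol_action eol_roundtrip_action(int conv_flags)
{
	if (conv_flags & CONV_EOL_RNDTRP_DIE)
		return EOL_ACTION_DIE;
	if (conv_flags & CONV_EOL_RNDTRP_WARN)
		return EOL_ACTION_WARN;
	return EOL_ACTION_NONE;
}
```

With this ordering a caller that sets both bits gets the stricter
behaviour, so the mix is harmless as long as every check tests DIE first.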

apply.c|  6 +++---
 combine-diff.c |  2 +-
 config.c   |  7 +--
 convert.c  | 38 +++---
 convert.h  | 17 +++--
 diff.c |  8 
 environment.c  |  2 +-
 sha1_file.c| 12 ++--
 8 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..f8b67bfee2 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct 
patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 const char *path, struct strbuf *buf)
 {
-   enum safe_crlf safe_crlf = patch->crlf_in_old ?
-   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+   int conv_flags = patch->crlf_in_old ?
+   CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch 
*patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
return 0;
default:
return -1;
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..dbc877d0fe 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path 
*elem, int num_parent,
if (is_file) {
struct strbuf buf = STRBUF_INIT;
 
-   if (convert_to_git(&the_index, elem->path, 
result, len, &buf, safe_crlf)) {
+   if (convert_to_git(&the_index, elem->path, 
result, len, &buf, conv_flags_eol)) {
free(result);
result = strbuf_detach(&buf, &len);
result_size = len;
diff --git a/config.c b/config.c
index e617c2018d..bdc7ce2a7e 100644
--- a/config.c
+++ b/config.c
@@ -1149,11 +1149,14 @@ static int git_default_core_config(const char *var, 
const char *value)
}
 
if (!strcmp(var, "core.safecrlf")) {
+   int eol_rndtrp_die;
if (value && !strcasecmp(value, "warn")) {
-   safe_crlf = SAFE_CRLF_WARN;
+   conv_flags_eol = CONV_EOL_RNDTRP_WARN;
return 0;
}
-   safe_crlf = git_config_bool(var, value);
+   eol_rndtrp_die = git_config_bool(var, value);
+   conv_flags_eol = eol_rndtrp_die ?
+   CONV_EOL_RNDTRP_DIE : CONV_EOL_RNDTRP_WARN;
return 0;
}
 
diff --git a/convert.c b/convert.c
index 1a41a48e15..0207ddab24 100644
--- a/convert.c
+++ b/convert.c
@@ -193,30 +193,30 @@ static enum eol output_eol(enum crlf_action crlf_action)
return core_eol;
 }
 
-static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
+static void check_conv_flags_eol(const char *path, enum crlf_action 
crlf_action,
struct text_stat *old_stats, struct text_stat 
*new_stats,
-   enum safe_crlf checksafe)
+   int conv_flags)
 {
if (old_stats->crlf && !new_stats->crlf ) {
/*
 * CRLFs would not be restored by checkout
 */
-   if (checksafe == SAFE_CRLF_WARN)
+   if (conv_flags & CONV_EOL_RNDTRP_DIE)
+   die(_("CRLF would be replaced by LF in %s."), pa

[PATCH 2/5] strbuf: add xstrdup_toupper()

2017-12-31 Thread tboegi

From: Lars Schneider 

Create a copy of an existing string and make all characters upper case.
Similar to xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
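For readers without a Git tree at hand, the following is a stand-alone
sketch of the same idea using plain malloc(). The name dup_toupper() is
hypothetical, and unlike the patch (which relies on Git's xmallocz()) it
returns NULL when the allocation fails. The cast to unsigned char is an
addition of this sketch; it keeps toupper() well defined for bytes
above 0x7F.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated upper-case copy of `string`,
 * or NULL if the allocation fails. The caller frees the result.
 */
char *dup_toupper(const char *string)
{
	size_t len = strlen(string);
	char *result = malloc(len + 1);
	size_t i;

	if (!result)
		return NULL;
	for (i = 0; i < len; i++)
		result[i] = (char)toupper((unsigned char)string[i]);
	result[len] = '\0';
	return result;
}
```

A caller would use it as `char *enc = dup_toupper("utf-16le");` and
free(enc) when done.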
 strbuf.c | 13 +
 strbuf.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 8007be8fba..ee05626dc1 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -785,6 +785,19 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   result[i] = '\0';
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.0.rc0.4.ga4e00d4fa4

[PATCH 4/5] utf8: add function to detect a missing UTF-16/32 BOM

2017-12-31 Thread tboegi

From: Lars Schneider 

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
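As a quick way to try the rule outside of Git, here is a self-contained
sketch of the check. It follows the logic of the patch, but the BOM
tables are declared as unsigned char (an assumption of this sketch, to
avoid implementation-defined conversions of 0xFE/0xFF to char) and the
single return expression is split into two branches for readability.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};

/* True if `data` is at least as long as the BOM and starts with it. */
static int has_bom_prefix(const char *data, size_t len,
			  const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * True if the encoding name leaves the endianness open
 * ("UTF-16" or "UTF-32") and the data carries no BOM.
 */
int has_missing_utf_bom(const char *enc, const char *data, size_t len)
{
	if (!strcmp(enc, "UTF-16"))
		return !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
			 has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)));
	if (!strcmp(enc, "UTF-32"))
		return !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
			 has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)));
	return 0;
}
```

Note that the comparison of the encoding name is case sensitive, so a
caller has to pass the name in upper case.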
 utf8.c | 13 +
 utf8.h | 16 
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 776660ee12..1978d6c42a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  !strcmp(enc, "UTF-16") &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  !strcmp(enc, "UTF-32") &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.4.ga4e00d4fa4

[PATCH 0/5] V2B: simplify convert.c/h

2017-12-31 Thread tboegi

From: Torsten Bögershausen 

Simplify the convert.h/convert.c logic and don't touch convert_to_git().
The rest is v2 from Lars.

Lars Schneider (4):
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add support for 'checkout-encoding' attribute

Torsten Bögershausen (1):
  convert_to_git(): checksafe becomes an integer

 Documentation/gitattributes.txt |  59 +++
 apply.c |   4 +-
 convert.c   | 210 +---
 convert.h   |  19 ++--
 diff.c  |   4 +-
 environment.c   |   2 +-
 sha1_file.c |   8 +-
 strbuf.c|  13 +++
 strbuf.h|   1 +
 t/t0028-checkout-encoding.sh| 197 +
 utf8.c  |  37 +++
 utf8.h  |  25 +
 12 files changed, 549 insertions(+), 30 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

-- 
2.16.0.rc0.4.ga4e00d4fa4

[PATCH 3/5] utf8: add function to detect prohibited UTF-16/32 BOM

2017-12-31 Thread tboegi

From: Lars Schneider 

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
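The rule can be exercised in isolation with the stand-alone sketch
below. It keeps the behaviour of the patch (any UTF-16 BOM is rejected
for both UTF-16BE and UTF-16LE, likewise for UTF-32), but the helper
any_bom() and the unsigned char tables are assumptions introduced for
this illustration.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char bom16_be[] = {0xFE, 0xFF};
static const unsigned char bom16_le[] = {0xFF, 0xFE};
static const unsigned char bom32_be[] = {0x00, 0x00, 0xFE, 0xFF};
static const unsigned char bom32_le[] = {0xFF, 0xFE, 0x00, 0x00};

/* Hypothetical helper: true if `data` starts with either of the two BOMs. */
static int any_bom(const char *data, size_t len,
		   const unsigned char *be, const unsigned char *le, size_t n)
{
	return len >= n && (!memcmp(data, be, n) || !memcmp(data, le, n));
}

/*
 * True if the encoding name already fixes the endianness
 * (UTF-16BE/LE, UTF-32BE/LE) but the data still starts with a BOM.
 */
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
	if (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE"))
		return any_bom(data, len, bom16_be, bom16_le, sizeof(bom16_be));
	if (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE"))
		return any_bom(data, len, bom32_be, bom32_le, sizeof(bom32_be));
	return 0;
}
```

For plain "UTF-16" or "UTF-32" the function returns 0, because a BOM is
allowed there.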
 utf8.c | 24 
 utf8.h |  9 +
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..776660ee12 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+const char utf16_be_bom[] = {0xFE, 0xFF};
+const char utf16_le_bom[] = {0xFF, 0xFE};
+const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.4.ga4e00d4fa4

[PATCH 5/5] convert: add support for 'checkout-encoding' attribute

2017-12-31 Thread tboegi

From: Lars Schneider 

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
Signed-off-by: Torsten Bögershausen 
---
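The documentation below warns that a conversion might not be round trip
safe. As a rough illustration of what that means, here is a sketch that
uses POSIX iconv(3) directly to convert a buffer to UTF-8 and back and
then compares the result with the input. The helpers reencode() and
roundtrips() are hypothetical and are not the functions added by this
patch; the output buffer size is a simple upper bound chosen for the
sketch.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert `len` bytes from encoding `from` to
 * encoding `to`. Returns a malloc'd buffer and stores its size in
 * *out_len, or returns NULL on any failure.
 */
char *reencode(const char *src, size_t len,
	       const char *from, const char *to, size_t *out_len)
{
	iconv_t cd = iconv_open(to, from);
	size_t cap = len * 4 + 4; /* generous bound for this sketch */
	size_t in_left = len, out_left = cap;
	char *inp = (char *)src;
	char *out, *outp;

	if (cd == (iconv_t)-1)
		return NULL;
	out = outp = malloc(cap);
	if (!out) {
		iconv_close(cd);
		return NULL;
	}
	/* Convert the input, then flush any pending shift state. */
	if (iconv(cd, &inp, &in_left, &outp, &out_left) == (size_t)-1 ||
	    iconv(cd, NULL, NULL, &outp, &out_left) == (size_t)-1 ||
	    in_left) {
		free(out);
		iconv_close(cd);
		return NULL;
	}
	iconv_close(cd);
	*out_len = cap - out_left;
	return out;
}

/* True if src survives enc -> UTF-8 -> enc byte for byte. */
int roundtrips(const char *src, size_t len, const char *enc)
{
	size_t utf8_len = 0, back_len = 0;
	char *utf8 = reencode(src, len, enc, "UTF-8", &utf8_len);
	char *back = utf8 ? reencode(utf8, utf8_len, "UTF-8", enc, &back_len) : NULL;
	int ok = back && back_len == len && !memcmp(back, src, len);

	free(utf8);
	free(back);
	return ok;
}
```

On some platforms iconv lives in a separate library and needs -liconv
at link time.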
 Documentation/gitattributes.txt |  59 
 convert.c   | 190 +-
 convert.h   |  11 ++-
 sha1_file.c |   2 +-
 t/t0028-checkout-encoding.sh| 197 
 5 files changed, 452 insertions(+), 7 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..0039bd38c3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,65 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`checkout-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure. On checkout the content is encoded back to the specified
+encoding.
+
+Please note that using the `checkout-encoding` attribute has a number
+of drawbacks:
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g. 'git checkout' or 'git add').
+
+- Git clients that do not support the `checkout-encoding` attribute or
+  the used encoding will checkout the respective files as UTF-8 encoded.
+  That means the content appears to be different which could cause
+  trouble. Affected clients are older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of January 2018).
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+
+*.txt  text checkout-encoding=UTF-16
+
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+
+*.txt  checkout-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index 5efcc3b73b..22c70d87e5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static struct encoding {
+   const char *name;
+   struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+struct strbuf *buf, struct encoding *enc, int 
die_on_failure)
+{
+   char *dst;
+   int dst_len;
+
+   /*
+* No encoding is specified or there is nothing to encode.
+* Tell the caller that the content was not modified.
+*/
+   if (!enc || (src && !src_len))
+   return 0;
+
+   /*
+* Looks like we got called from "would_convert_to_git()".
+* This means Git wants to know if it would encode (= modify!)
+* the content. Let's answer with "yes", since an encoding was
+* specified.
+*/
+   if (!buf && !src)
+   return 1;
+
+   if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+   const char *error_msg = _(
+   "BOM

[PATCH 1/5] convert_to_git(): checksafe becomes an integer

2017-12-31 Thread tboegi

From: Torsten Bögershausen 

When calling convert_to_git(), the checksafe parameter has been used to
check whether a commit would result in a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Today a small flaw is found in the code base:
An integer with the value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

The next commit needs to turn checksafe into a bitmap, which makes it
possible to tell convert_to_git() whether to obey the encoding attribute.

Signed-off-by: Torsten Bögershausen 
---
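A small sketch of why the comparisons in the hunks below change from
"==" to "&": once checksafe is a bitmap, a caller may set several bits
at once, and an equality test then no longer matches. The values for
bits 1 to 3 follow the convert.h hunk of the RFC 2/2 message in this
archive; SAFE_CRLF_FAIL as bit 0 and the two helper functions are
assumptions made for this illustration.

```c
#define SAFE_CRLF_FALSE       0
#define SAFE_CRLF_FAIL        (1<<0) /* assumed value */
#define SAFE_CRLF_WARN        (1<<1)
#define SAFE_CRLF_RENORMALIZE (1<<2)
#define SAFE_CRLF_KEEP_CRLF   (1<<3)

/* Old style: only true if WARN is the one and only value. */
int wants_warning_eq(int checksafe)
{
	return checksafe == SAFE_CRLF_WARN;
}

/* New style: true whenever the WARN bit is set, whatever else is set. */
int wants_warning_bit(int checksafe)
{
	return (checksafe & SAFE_CRLF_WARN) != 0;
}
```

With checksafe set to SAFE_CRLF_WARN | SAFE_CRLF_RENORMALIZE the first
helper returns 0 and the second returns 1.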
 apply.c   |  4 ++--
 convert.c | 20 ++--
 convert.h | 18 --
 diff.c|  4 ++--
 environment.c |  2 +-
 sha1_file.c   |  6 +++---
 6 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..a422516062 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,7 +2263,7 @@ static void show_stats(struct apply_state *state, struct 
patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 const char *path, struct strbuf *buf)
 {
-   enum safe_crlf safe_crlf = patch->crlf_in_old ?
+   int checksafe = patch->crlf_in_old ?
SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch 
*patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, checksafe);
return 0;
default:
return -1;
diff --git a/convert.c b/convert.c
index 1a41a48e15..5efcc3b73b 100644
--- a/convert.c
+++ b/convert.c
@@ -195,13 +195,13 @@ static enum eol output_eol(enum crlf_action crlf_action)
 
 static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
struct text_stat *old_stats, struct text_stat 
*new_stats,
-   enum safe_crlf checksafe)
+   int checksafe)
 {
if (old_stats->crlf && !new_stats->crlf ) {
/*
 * CRLFs would not be restored by checkout
 */
-   if (checksafe == SAFE_CRLF_WARN)
+   if (checksafe & SAFE_CRLF_WARN)
warning(_("CRLF will be replaced by LF in %s.\n"
  "The file will have its original line"
  " endings in your working directory."), path);
@@ -211,7 +211,7 @@ static void check_safe_crlf(const char *path, enum 
crlf_action crlf_action,
/*
 * CRLFs would be added by checkout
 */
-   if (checksafe == SAFE_CRLF_WARN)
+   if (checksafe & SAFE_CRLF_WARN)
warning(_("LF will be replaced by CRLF in %s.\n"
  "The file will have its original line"
  " endings in your working directory."), path);
@@ -268,7 +268,7 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
   const char *path, const char *src, size_t len,
   struct strbuf *buf,
-  enum crlf_action crlf_action, enum safe_crlf checksafe)
+  enum crlf_action crlf_action, int checksafe)
 {
struct text_stat stats;
char *dst;
@@ -298,12 +298,12 @@ static int crlf_to_git(const struct index_state *istate,
 * unless we want to renormalize in a merge or
 * cherry-pick.
 */
-   if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+   if ((!(checksafe & SAFE_CRLF_RENORMALIZE)) &&
has_crlf_in_index(istate, path))
convert_crlf_into_lf = 0;
}
-   if ((checksafe == SAFE_CRLF_WARN ||
-   (checksafe == SAFE_CRLF_FAIL)) && len) {
+   if (((checksafe & SAFE_CRLF_WARN) ||
+((checksafe & SAFE_CRLF_FAIL) && len))) {
struct text_stat new_stats;
memcpy(&new_stats, &stats, sizeof(new_stats));
/* simulate "git add" */
@@ -1129,7 +1129,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
   const char *path, const char *src, size_t len,
-   struct strbuf *dst, enum safe_crlf checksafe)
+  struct strbuf *dst, int checksafe)
 {
int ret = 0;
struct conv_attrs ca;
@@ -1144,7 +1144,7 @@ in

[PATCH/RFC 0/2] git diff --UTF-8

2017-12-29 Thread tboegi

From: Torsten Bögershausen 

RFC patch: convert files from e.g. UTF-16 into UTF-8 while running
"git diff".
The diff must be called with "git diff --UTF-8" and the "encoding"
attribute must be set for the file(s).

The commit messages may need some improvements, and a closer look
at diff.c, at how command line options are forwarded, is appreciated.

It may even be possible to integrate t4066 somewhere...

Torsten Bögershausen (2):
  convert_to_git(): checksafe becomes an integer
  git diff: Allow to reencode into UTF-8

 Documentation/diff-options.txt  |  4 ++
 Documentation/gitattributes.txt |  9 +
 apply.c |  4 +-
 convert.c   | 60 +++-
 convert.h   | 20 +-
 diff.c  | 40 +--
 diff.h  |  1 +
 diffcore.h  |  3 ++
 environment.c   |  2 +-
 sha1_file.c |  6 +--
 t/t4066-diff-encoding.sh| 86 +
 11 files changed, 205 insertions(+), 30 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

-- 
2.15.1.271.g1a4e40aa5d

[PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8

2017-12-29 Thread tboegi

From: Torsten Bögershausen 

When blobs are encoded in UTF-16, `git diff` treats them as binary.
Make it possible to show a user-readable diff encoded in UTF-8.
This allows running git diff and feeding the output into a web server.

Teach Git to look at the "encoding" attribute and to reencode the
content into UTF-8 before running the diff itself.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/diff-options.txt  |  4 ++
 Documentation/gitattributes.txt |  9 +
 convert.c   | 40 +++
 convert.h   |  2 +
 diff.c  | 38 --
 diff.h  |  1 +
 diffcore.h  |  3 ++
 t/t4066-diff-encoding.sh| 86 +
 8 files changed, 180 insertions(+), 3 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 9d1586b956..bf2f115f11 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -629,6 +629,10 @@ endif::git-format-patch[]
linkgit:git-log[1], but not for linkgit:git-format-patch[1] or
diff plumbing commands.
 
+--UTF-8::
+   Git converts the content into UTF-8 before running the diff when the
+   "encoding" attribute is defined. See linkgit:gitattributes[5]
+
 --ignore-submodules[=]::
Ignore changes to submodules in the diff generation.  can be
either "none", "untracked", "dirty" or "all", which is the default.
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..753a7c39b7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -881,6 +881,15 @@ advantages to choosing this method:
 3. Caching. Textconv caching can speed up repeated diffs, such as those
you might trigger by running `git log -p`.
 
+Running diff on UTF-16 encoded files
+
+
+Git can convert UTF-16 encoded files into UTF-8 before they are fed
+into the diff machinery: `diff --UTF-8 file.xxx`.
+
+
+file.xxx encoding=UTF-16
+
 
 Marking files as binary
 ^^^
diff --git a/convert.c b/convert.c
index 5efcc3b73b..45577ce504 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -734,6 +735,34 @@ static struct convert_driver {
int required;
 } *user_convert, **user_convert_tail;
 
+const char *get_encoding_attr(const char *path)
+{
+   static struct attr_check *check;
+   if (!check)
+   check = attr_check_initl("encoding", NULL);
+   if (!git_check_attr(path, check)) {
+   struct attr_check_item *ccheck = check->items;
+   const char *value;
+   value = ccheck->value;
+   if (ATTR_UNSET(value))
+   return NULL;
+   return value;
+   }
+   return NULL;
+}
+
+static int reencode_into_strbuf(const char *path, const char *src, size_t len,
+   struct strbuf *dst, const char *encoding)
+{
+   int outsz = 0;
+   char *buf;
+   buf = reencode_string_len(src, (int)len, "UTF-8", encoding, &outsz);
+   if (!buf)
+   return 0;
+   strbuf_attach(dst, buf, outsz, outsz);
+   return SAFE_CRLF_REENCODE;
+}
+
 static int apply_filter(const char *path, const char *src, size_t len,
int fd, struct strbuf *dst, struct convert_driver *drv,
const unsigned int wanted_capability,
@@ -1136,6 +1165,17 @@ int convert_to_git(const struct index_state *istate,
 
convert_attrs(&ca, path);
 
+   if (checksafe & SAFE_CRLF_REENCODE) {
+   const char *encoding = get_encoding_attr(path);
+   if (encoding) {
+   ret |= reencode_into_strbuf(path, src, len, dst,
+   encoding);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
+   }
+   }
ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN, NULL);
if (!ret && ca.drv && ca.drv->required)
die("%s: clean filter '%s' failed", path, ca.drv->name);
diff --git a/convert.h b/convert.h
index 532af00423..0b093715c9 100644
--- a/convert.h
+++ b/convert.h
@@ -13,6 +13,7 @@ struct index_state;
 #define SAFE_CRLF_WARN(1<<1)
 #define SAFE_CRLF_RENORMALIZE (1<<2)
 #define SAFE_CRLF_KEEP_CRLF   (1<<3)
+#define SAFE_CRLF_REENCODE(1<<4)
 
 extern int safe_crlf;
 
@@ -60,6 +61,7 @@ extern const char *get_cached_convert_stats_ascii(const 
struct index_state *ista

[PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer

2017-12-29 Thread tboegi

From: Torsten Bögershausen 

When calling convert_to_git(), the checksafe parameter has been used to
check whether a commit would result in a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Today a small flaw is found in the code base:
An integer with the value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

The next commit needs to turn checksafe into a bitmap, which makes it
possible to tell convert_to_git() whether to obey the encoding attribute.

Signed-off-by: Torsten Bögershausen 
---
 apply.c   |  4 ++--
 convert.c | 20 ++--
 convert.h | 18 --
 diff.c|  4 ++--
 environment.c |  2 +-
 sha1_file.c   |  6 +++---
 6 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..a422516062 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,7 +2263,7 @@ static void show_stats(struct apply_state *state, struct 
patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 const char *path, struct strbuf *buf)
 {
-   enum safe_crlf safe_crlf = patch->crlf_in_old ?
+   int checksafe = patch->crlf_in_old ?
SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch 
*patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, checksafe);
return 0;
default:
return -1;
diff --git a/convert.c b/convert.c
index 1a41a48e15..5efcc3b73b 100644
--- a/convert.c
+++ b/convert.c
@@ -195,13 +195,13 @@ static enum eol output_eol(enum crlf_action crlf_action)
 
 static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
struct text_stat *old_stats, struct text_stat 
*new_stats,
-   enum safe_crlf checksafe)
+   int checksafe)
 {
if (old_stats->crlf && !new_stats->crlf ) {
/*
 * CRLFs would not be restored by checkout
 */
-   if (checksafe == SAFE_CRLF_WARN)
+   if (checksafe & SAFE_CRLF_WARN)
warning(_("CRLF will be replaced by LF in %s.\n"
  "The file will have its original line"
  " endings in your working directory."), path);
@@ -211,7 +211,7 @@ static void check_safe_crlf(const char *path, enum 
crlf_action crlf_action,
/*
 * CRLFs would be added by checkout
 */
-   if (checksafe == SAFE_CRLF_WARN)
+   if (checksafe & SAFE_CRLF_WARN)
warning(_("LF will be replaced by CRLF in %s.\n"
  "The file will have its original line"
  " endings in your working directory."), path);
@@ -268,7 +268,7 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
   const char *path, const char *src, size_t len,
   struct strbuf *buf,
-  enum crlf_action crlf_action, enum safe_crlf checksafe)
+  enum crlf_action crlf_action, int checksafe)
 {
struct text_stat stats;
char *dst;
@@ -298,12 +298,12 @@ static int crlf_to_git(const struct index_state *istate,
 * unless we want to renormalize in a merge or
 * cherry-pick.
 */
-   if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+   if ((!(checksafe & SAFE_CRLF_RENORMALIZE)) &&
has_crlf_in_index(istate, path))
convert_crlf_into_lf = 0;
}
-   if ((checksafe == SAFE_CRLF_WARN ||
-   (checksafe == SAFE_CRLF_FAIL)) && len) {
+   if (((checksafe & SAFE_CRLF_WARN) ||
+((checksafe & SAFE_CRLF_FAIL) && len))) {
struct text_stat new_stats;
memcpy(&new_stats, &stats, sizeof(new_stats));
/* simulate "git add" */
@@ -1129,7 +1129,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
   const char *path, const char *src, size_t len,
-   struct strbuf *dst, enum safe_crlf checksafe)
+  struct strbuf *dst, int checksafe)
 {
int ret = 0;
struct conv_attrs ca;
@@ -1144,7 +1144,7 @@ in

[PATCH v3 1/1] check-non-portable-shell.pl: Quoted `wc -l` is not portable

2017-12-21 Thread tboegi

From: Torsten Bögershausen 

wc -l was used to count the number of lines in test scripts.
$ wc -l Makefile
gives a line like this:
105 Makefile
while Mac OS has 4 leading spaces:
 105 Makefile

And this means that shell expressions like
test "$(wc -l

[PATCH v2 1/1] check-non-portable-shell.pl: Quoted `wc -l` is not portable

2017-12-16 Thread tboegi

From: Torsten Bögershausen 

wc -l was used to count the number of lines in test scripts.
$ wc -l Makefile
gives a line like this:
105 Makefile
while Mac OS has 4 leading spaces:
 105 Makefile

And this means that shell expressions like
test "$(wc -l

[PATCH v1 1/1] check-non-portable-shell.pl: Quoted `wc -l` is not portable

2017-12-10 Thread tboegi

From: Torsten Bögershausen 

wc -l is used to count the number of lines in test scripts.
$ wc -l Makefile
gives a line like this:
105 Makefile
while Mac OS has 4 leading spaces:
 105 Makefile

And this means that shell expressions like
test "$(wc -l

[PATCH v1 1/2] t0027: Don't use git commit

2017-12-08 Thread tboegi

From: Torsten Bögershausen 

Replace `git commit -m "comment" ""` with `git commit -m "comment"` to
remove the empty path spec.

Signed-off-by: Torsten Bögershausen 
---
 t/t0027-auto-crlf.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index c2c208fdcd..97154f5c79 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -370,7 +370,7 @@ test_expect_success 'setup master' '
echo >.gitattributes &&
git checkout -b master &&
git add .gitattributes &&
-   git commit -m "add .gitattributes" "" &&
+   git commit -m "add .gitattributes" &&
printf "\$Id:  
\$\nLINEONE\nLINETWO\nLINETHREE" >LF &&
printf "\$Id:  
\$\r\nLINEONE\r\nLINETWO\r\nLINETHREE" >CRLF &&
printf "\$Id:  
\$\nLINEONE\r\nLINETWO\nLINETHREE"   >CRLF_mix_LF &&
-- 
2.15.1.271.g1a4e40aa5d

[PATCH v1 2/2] t0027: Adapt the new MIX tests to Windows

2017-12-08 Thread tboegi

From: Torsten Bögershausen 

The new MIX tests don't pass under Windows; adapt them
to use the correct native line ending.

Signed-off-by: Torsten Bögershausen 
---

 Sorry for the breakage.
 This needs to go on top of tb/check-crlf-for-safe-crlf
 
 t/t0027-auto-crlf.sh | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index 97154f5c79..8d929b76dc 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -170,22 +170,22 @@ commit_MIX_chkwrn () {
git -c core.autocrlf=$crlf add $fname 2>"${pfx}_$f.err"
done
 
-   test_expect_success "commit file with mixed EOL crlf=$crlf attr=$attr 
LF" '
+   test_expect_success "commit file with mixed EOL onto LF crlf=$crlf 
attr=$attr" '
check_warning "$lfwarn" ${pfx}_LF.err
'
-   test_expect_success "commit file with mixed EOL attr=$attr aeol=$aeol 
crlf=$crlf CRLF" '
+   test_expect_success "commit file with mixed EOL onto CLRF attr=$attr 
aeol=$aeol crlf=$crlf" '
check_warning "$crlfwarn" ${pfx}_CRLF.err
'
 
-   test_expect_success "commit file with mixed EOL attr=$attr aeol=$aeol 
crlf=$crlf CRLF_mix_LF" '
+   test_expect_success "commit file with mixed EOL onto CRLF_mix_LF 
attr=$attr aeol=$aeol crlf=$crlf" '
check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err
'
 
-   test_expect_success "commit file with mixed EOL attr=$attr aeol=$aeol 
crlf=$crlf LF_mix_cr" '
+   test_expect_success "commit file with mixed EOL onto LF_mix_cr 
attr=$attr aeol=$aeol crlf=$crlf " '
check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err
'
 
-   test_expect_success "commit file with mixed EOL attr=$attr aeol=$aeol 
crlf=$crlf CRLF_nul" '
+   test_expect_success "commit file with mixed EOL onto CRLF_nul 
attr=$attr aeol=$aeol crlf=$crlf" '
check_warning "$crlfnul" ${pfx}_CRLF_nul.err
'
 }
@@ -378,7 +378,7 @@ test_expect_success 'setup master' '
printf "\$Id:  
\$\r\nLINEONE\r\nLINETWO\rLINETHREE"   >CRLF_mix_CR &&
printf "\$Id:  
\$\r\nLINEONEQ\r\nLINETWO\r\nLINETHREE" | q_to_nul >CRLF_nul &&
printf "\$Id:  
\$\nLINEONEQ\nLINETWO\nLINETHREE" | q_to_nul >LF_nul &&
-   create_NNO_MIX_files CRLF_mix_LF CRLF_mix_LF CRLF_mix_LF CRLF_mix_LF 
CRLF_mix_LF &&
+   create_NNO_MIX_files &&
git -c core.autocrlf=false add NNO_*.txt MIX_*.txt &&
git commit -m "mixed line endings" &&
test_tick
@@ -441,13 +441,14 @@ test_expect_success 'commit files attr=crlf' '
 '
 
 # Commit "CRLFmixLF" on top of these files already in the repo:
-# LF, CRLF, CRLFmixLF LF_mix_CR CRLFNULL
+# mixed mixed mixed   
mixed   mixed
+# onto  onto  ontoonto 
   onto
 # attrLFCRLF  CRLFmixLF   
LF_mix_CR   CRLFNUL
 commit_MIX_chkwrn ""  ""  false   """"""  ""   
   ""
 commit_MIX_chkwrn ""  ""  true"LF_CRLF" """"  
"LF_CRLF"   "LF_CRLF"
 commit_MIX_chkwrn ""  ""  input   "CRLF_LF" """"  
"CRLF_LF"   "CRLF_LF"
 
-commit_MIX_chkwrn "auto"  ""  false   "CRLF_LF" """"  
"CRLF_LF"   "CRLF_LF"
+commit_MIX_chkwrn "auto"  ""  false   "$WAMIX"  """"  
"$WAMIX""$WAMIX"
 commit_MIX_chkwrn "auto"  ""  true"LF_CRLF" """"  
"LF_CRLF"   "LF_CRLF"
 commit_MIX_chkwrn "auto"  ""  input   "CRLF_LF" """"  
"CRLF_LF"   "CRLF_LF"
 
-- 
2.15.1.271.g1a4e40aa5d

[PATCH v2 1/1] convert: tighten the safe autocrlf handling

2017-11-26 Thread tboegi

From: Torsten Bögershausen 

When a text file has been committed with CRLF and the file is committed
again, the CRLF are kept if .gitattributes has "text=auto".
This is done by analyzing the content of the blob stored in the index:
If a '\r' is found, Git assumes that the blob was committed with CRLF.

The simple search for a '\r' does not always work as expected:
A file is encoded in UTF-16 with CRLF and committed. Git treats it as binary.
Now the content is converted into UTF-8. At the next commit Git treats the
file as text; the CRLF should be converted into LF, but isn't.

Replace has_cr_in_index() with has_crlf_in_index(). When no '\r' is found,
0 is returned directly; this is the most common case.
If a '\r' is found, the content is analyzed more deeply.
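
The two-stage check described above can be sketched as standalone C.
This is only an illustration under stated assumptions: the helper names
looks_binary(), count_crlf() and blob_has_crlf() are hypothetical and stand
in for Git's gather_convert_stats(), and "contains a NUL byte" is a
simplified binary heuristic, not Git's actual rule.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for binary detection: any NUL byte means binary. */
static int looks_binary(const char *buf, size_t len)
{
	return memchr(buf, '\0', len) != NULL;
}

/* Count "\r\n" pairs in the buffer. */
static size_t count_crlf(const char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i + 1 < len; i++)
		if (buf[i] == '\r' && buf[i + 1] == '\n')
			n++;
	return n;
}

/* Return 1 only if the blob looks like text and has at least one CRLF. */
static int blob_has_crlf(const char *buf, size_t len)
{
	assert(buf != NULL);
	/* Fast path: no '\r' at all, the most common case. */
	if (!memchr(buf, '\r', len))
		return 0;
	/* Slow path: a '\r' alone proves nothing, so look closer. */
	return !looks_binary(buf, len) && count_crlf(buf, len) > 0;
}
```

With this sketch, a UTF-16 blob (which contains NUL bytes next to each
'\r') is reported as "no CRLF", so a later UTF-8 version of the same file
would be normalized to LF.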

Reported-By: Ashish Negi 
Signed-off-by: Torsten Bögershausen 
---

Changes against v1:
- change "if (crp && (crp[1] == '\n'))" to "if (crp)"
  (Thanks Eric. The new patch is more straightforward, and there is no risk
   of reading past the end of the data)
- Remove the "Solution:" in the commit message
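
The concern behind the first change can be shown with a small sketch.
Assuming a plain buffer of exactly `len` bytes (the function name
first_cr_is_crlf() is hypothetical), looking at crp[1] is only safe when
the '\r' is not the last byte:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the first '\r' in buf is immediately followed by '\n'.
 * The length check keeps the lookahead inside the buffer.
 */
static int first_cr_is_crlf(const char *buf, size_t len)
{
	const char *end, *crp;

	assert(buf != NULL);
	end = buf + len;
	crp = memchr(buf, '\r', len);
	if (!crp)
		return 0;
	/* crp + 1 == end means '\r' is the last byte: do not read crp[1]. */
	return crp + 1 < end && crp[1] == '\n';
}
```

Dropping the lookahead altogether, as v2 does, avoids the question and
leaves the detailed analysis to the statistics pass.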

Note:
  The original function has_cr_in_index() is from this commit:
c4805393d73 (Finn Arne Gangstad   2010-05-12 00:37:57 +0200  225)
  And has this info:
>Change autocrlf to not do any conversions to files that in the
>repository already contain a CR. git with autocrlf set will never
>create such a file, or change a LF only file to contain CRs, so the
>(new) assumption is that if a file contains a CR, it is intentional,
>and autocrlf should not change that.
  So the original assumption was slightly optimistic (but did work for 7 years)



convert.c| 19 +
 t/t0027-auto-crlf.sh | 76 
 2 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/convert.c b/convert.c
index 20d7ab67bd..1a41a48e15 100644
--- a/convert.c
+++ b/convert.c
@@ -220,18 +220,27 @@ static void check_safe_crlf(const char *path, enum 
crlf_action crlf_action,
}
 }
 
-static int has_cr_in_index(const struct index_state *istate, const char *path)
+static int has_crlf_in_index(const struct index_state *istate, const char 
*path)
 {
unsigned long sz;
void *data;
-   int has_cr;
+   const char *crp;
+   int has_crlf = 0;
 
data = read_blob_data_from_index(istate, path, &sz);
if (!data)
return 0;
-   has_cr = memchr(data, '\r', sz) != NULL;
+
+   crp = memchr(data, '\r', sz);
+   if (crp) {
+   unsigned int ret_stats;
+   ret_stats = gather_convert_stats(data, sz);
+   if (!(ret_stats & CONVERT_STAT_BITS_BIN) &&
+   (ret_stats & CONVERT_STAT_BITS_TXT_CRLF))
+   has_crlf = 1;
+   }
free(data);
-   return has_cr;
+   return has_crlf;
 }
 
 static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
@@ -290,7 +299,7 @@ static int crlf_to_git(const struct index_state *istate,
 * cherry-pick.
 */
if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
-   has_cr_in_index(istate, path))
+   has_crlf_in_index(istate, path))
convert_crlf_into_lf = 0;
}
if ((checksafe == SAFE_CRLF_WARN ||
diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index 68108d956a..0af35cfb1f 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -43,19 +43,31 @@ create_gitattributes () {
} >.gitattributes
 }
 
-create_NNO_files () {
+# Create 2 sets of files:
+# The NNO files are "Not NOrmalized in the repo. We use CRLF_mix_LF and store
+#   it under different names for the different test cases, see ${pfx}
+#   Depending on .gitattributes they are normalized at the next commit (or not)
+# The MIX files have different contents in the repo.
+#   Depending on its contents, the "new safer autocrlf" may kick in.
+create_NNO_MIX_files () {
for crlf in false true input
do
for attr in "" auto text -text
do
for aeol in "" lf crlf
do
-   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf}
+   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf} &&
cp CRLF_mix_LF ${pfx}_LF.txt &&
cp CRLF_mix_LF ${pfx}_CRLF.txt &&
cp CRLF_mix_LF ${pfx}_CRLF_mix_LF.txt &&
cp CRLF_mix_LF ${pfx}_LF_mix_CR.txt &&
-   cp CRLF_mix_LF ${pfx}_CRLF_nul.txt
+   cp CRLF_mix_LF ${pfx}_CRLF_nul.txt &&
+   pfx=MIX_attr_${attr}_aeol_${aeol}_${crlf} &&
+   cp LF  ${pfx}_LF.txt &&
+   cp CRLF${pfx}_CRLF.txt &&
+   cp CRLF_mix_LF ${pfx}_CRLF_mix_LF.t

[PATCH 1/1] convert: tighten the safe autocrlf handling

2017-11-24 Thread tboegi

From: Torsten Bögershausen 

When a text file has been committed with CRLF and the file is committed
again, the CRLF are kept if .gitattributes has "text=auto".
This is done by analyzing the content of the blob stored in the index:
If a '\r' is found, Git assumes that the blob was committed with CRLF.

The simple search for a '\r' does not always work as expected:
A file is encoded in UTF-16 with CRLF and committed. Git treats it as binary.
Now the content is converted into UTF-8. At the next commit Git treats the
file as text; the CRLF should be converted into LF, but isn't.

Solution:
Replace has_cr_in_index() with has_crlf_in_index(). When no '\r' is found,
0 is returned directly; this is the most common case.
If a '\r' is found, the content is analyzed more deeply.

Reported-By: Ashish Negi 
Signed-off-by: Torsten Bögershausen 
---
 convert.c| 19 +
 t/t0027-auto-crlf.sh | 76 
 2 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/convert.c b/convert.c
index 20d7ab67bd..63ef799239 100644
--- a/convert.c
+++ b/convert.c
@@ -220,18 +220,27 @@ static void check_safe_crlf(const char *path, enum 
crlf_action crlf_action,
}
 }
 
-static int has_cr_in_index(const struct index_state *istate, const char *path)
+static int has_crlf_in_index(const struct index_state *istate, const char 
*path)
 {
unsigned long sz;
void *data;
-   int has_cr;
+   const char *crp;
+   int has_crlf = 0;
 
data = read_blob_data_from_index(istate, path, &sz);
if (!data)
return 0;
-   has_cr = memchr(data, '\r', sz) != NULL;
+
+   crp = memchr(data, '\r', sz);
+   if (crp && (crp[1] == '\n')) {
+   unsigned int ret_stats;
+   ret_stats = gather_convert_stats(data, sz);
+   if (!(ret_stats & CONVERT_STAT_BITS_BIN) &&
+   (ret_stats & CONVERT_STAT_BITS_TXT_CRLF))
+   has_crlf = 1;
+   }
free(data);
-   return has_cr;
+   return has_crlf;
 }
 
 static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
@@ -290,7 +299,7 @@ static int crlf_to_git(const struct index_state *istate,
 * cherry-pick.
 */
if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
-   has_cr_in_index(istate, path))
+   has_crlf_in_index(istate, path))
convert_crlf_into_lf = 0;
}
if ((checksafe == SAFE_CRLF_WARN ||
diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index 68108d956a..0af35cfb1f 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -43,19 +43,31 @@ create_gitattributes () {
} >.gitattributes
 }
 
-create_NNO_files () {
+# Create 2 sets of files:
+# The NNO files are "Not NOrmalized in the repo. We use CRLF_mix_LF and store
+#   it under different names for the different test cases, see ${pfx}
+#   Depending on .gitattributes they are normalized at the next commit (or not)
+# The MIX files have different contents in the repo.
+#   Depending on its contents, the "new safer autocrlf" may kick in.
+create_NNO_MIX_files () {
for crlf in false true input
do
for attr in "" auto text -text
do
for aeol in "" lf crlf
do
-   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf}
+   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf} &&
cp CRLF_mix_LF ${pfx}_LF.txt &&
cp CRLF_mix_LF ${pfx}_CRLF.txt &&
cp CRLF_mix_LF ${pfx}_CRLF_mix_LF.txt &&
cp CRLF_mix_LF ${pfx}_LF_mix_CR.txt &&
-   cp CRLF_mix_LF ${pfx}_CRLF_nul.txt
+   cp CRLF_mix_LF ${pfx}_CRLF_nul.txt &&
+   pfx=MIX_attr_${attr}_aeol_${aeol}_${crlf} &&
+   cp LF  ${pfx}_LF.txt &&
+   cp CRLF${pfx}_CRLF.txt &&
+   cp CRLF_mix_LF ${pfx}_CRLF_mix_LF.txt &&
+   cp LF_mix_CR   ${pfx}_LF_mix_CR.txt &&
+   cp CRLF_nul${pfx}_CRLF_nul.txt
done
done
done
@@ -136,6 +148,49 @@ commit_chk_wrnNNO () {
'
 }
 
+# Commit a file with mixed line endings on top of different files
+# in the index. Check for warnings
+commit_MIX_chkwrn () {
+   attr=$1 ; shift
+   aeol=$1 ; shift
+   crlf=$1 ; shift
+   lfwarn=$1 ; shift
+   crlfwarn=$1 ; shift
+   lfmixcrlf=$1 ; shift
+   lfmixcr=$1 ; shift
+   crlfnul=$1 ; shift
+   pfx=MIX_attr_${attr}_aeol_${aeol}_${crlf}
+   #Commit file with CLRF_mix_LF on top of existing file
+   create_gitattributes "$attr" $a

[PATCH v3 1/1] Introduce git add --renormalize .

2017-11-16 Thread tboegi

From: Torsten Bögershausen 

Make it safer to normalize the line endings in a repository:
Files that had been committed with CRLF will be committed with LF.

The old way to normalize a repo was like this:
 # Make sure that there are no untracked files
 $ echo "* text=auto" >.gitattributes
 $ git read-tree --empty
 $ git add .
 $ git commit -m "Introduce end-of-line normalization"

The user must make sure that there are no untracked files,
otherwise they would have been added and tracked from now on.

The new "add --renormalize" does not add untracked files:
 $ echo "* text=auto" >.gitattributes
 $ git add --renormalize .
 $ git commit -m "Introduce end-of-line normalization"

Note that "git add --renormalize <pathspec>" is the short form for
"git add -u --renormalize <pathspec>".
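
What renormalizing a single text file amounts to can be sketched in C.
The function below is a hypothetical illustration, not Git's crlf_to_git():
it assumes the content was already classified as text, rewrites CRLF to LF
in place and leaves a lone CR untouched.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert every "\r\n" in buf to "\n", in place.
 * Returns the new length; a lone '\r' is kept as-is.
 */
static size_t crlf_to_lf(char *buf, size_t len)
{
	size_t in, out = 0;

	assert(buf != NULL);
	for (in = 0; in < len; in++) {
		if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
			continue; /* drop the CR, the LF is copied next round */
		buf[out++] = buf[in];
	}
	return out;
}
```

Working in place is possible because the output is never longer than the
input; the caller only has to use the returned length.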

While at it, document that the same renormalization may be needed
whenever a clean filter is added or changed.

Helped-By: Junio C Hamano 
Signed-off-by: Torsten Bögershausen 
---

Changes since V2:
  Add line endings in t0025
  Use the <<-\EOF pattern
  Improve the documentation for "git add --renormalize"
  

Documentation/git-add.txt   |  9 -
 Documentation/gitattributes.txt |  6 --
 builtin/add.c   | 28 ++--
 cache.h |  1 +
 read-cache.c| 30 +++---
 sha1_file.c | 16 ++--
 t/t0025-crlf-renormalize.sh | 30 ++
 7 files changed, 102 insertions(+), 18 deletions(-)
 create mode 100755 t/t0025-crlf-renormalize.sh

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index b700beaff5..d50fa339dc 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 [verse]
 'git add' [--verbose | -v] [--dry-run | -n] [--force | -f] [--interactive | 
-i] [--patch | -p]
  [--edit | -e] [--[no-]all | --[no-]ignore-removal | [--update | -u]]
- [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing]
+ [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing] [--renormalize]
  [--chmod=(+|-)x] [--] [...]
 
 DESCRIPTION
@@ -175,6 +175,13 @@ for "git add --no-all ...", i.e. ignored removed 
files.
warning (e.g., if you are manually performing operations on
submodules).
 
+--renormalize::
+   Apply the "clean" process freshly to all tracked files to
+   forcibly add them again to the index.  This is useful after
+   changing `core.autocrlf` configuration or the `text` attribute
+   in order to correct files added with wrong CRLF/LF line endings.
+   This option implies `-u`.
+
 --chmod=(+|-)x::
Override the executable bit of the added files.  The executable
bit is only changed in the index, the files on disk are left
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 4c68bc19d5..30687de81a 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -232,8 +232,7 @@ From a clean working directory:
 
 -
 $ echo "* text=auto" >.gitattributes
-$ git read-tree --empty   # Clean index, force re-scan of working directory
-$ git add .
+$ git add --renormalize .
 $ git status# Show files that will be normalized
 $ git commit -m "Introduce end-of-line normalization"
 -
@@ -328,6 +327,9 @@ You can declare that a filter turns a content that by 
itself is unusable
 into a usable content by setting the filter..required configuration
 variable to `true`.
 
+Note: Whenever the clean filter is changed, the repo should be renormalized:
+$ git add --renormalize .
+
 For example, in .gitattributes, you would assign the `filter`
 attribute for paths.
 
diff --git a/builtin/add.c b/builtin/add.c
index a648cf4c56..c42b50f857 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -26,6 +26,7 @@ static const char * const builtin_add_usage[] = {
 };
 static int patch_interactive, add_interactive, edit_interactive;
 static int take_worktree_changes;
+static int add_renormalize;
 
 struct update_callback_data {
int flags;
@@ -123,6 +124,25 @@ int add_files_to_cache(const char *prefix,
return !!data.add_errors;
 }
 
+static int renormalize_tracked_files(const struct pathspec *pathspec, int 
flags)
+{
+   int i, retval = 0;
+
+   for (i = 0; i < active_nr; i++) {
+   struct cache_entry *ce = active_cache[i];
+
+   if (ce_stage(ce))
+   continue; /* do not touch unmerged paths */
+   if (!S_ISREG(ce->ce_mode) && !S_ISLNK(ce->ce_mode))
+   continue; /* do not touch non blobs */
+   if (pathspec && !ce_path_match(ce, pathspec, NULL))
+   continue;
+   retval |= add_file_to_cache(ce->name, flags | HASH_RENORMALIZE);
+   }
+
+   return retval;
+}
+
 s

[PATCH v2 1/1] Introduce git add --renormalize .

2017-10-30 Thread tboegi

From: Torsten Bögershausen 

Make it safer to normalize the line endings in a repository:
Files that had been committed with CRLF will be committed with LF.

The old way to normalize a repo was like this:
 # Make sure that there are no untracked files
 $ echo "* text=auto" >.gitattributes
 $ git read-tree --empty
 $ git add .
 $ git commit -m "Introduce end-of-line normalization"

The user must make sure that there are no untracked files,
otherwise they would have been added and tracked from now on.

The new "add --renormalize" does not add untracked files:
 $ echo "* text=auto" >.gitattributes
 $ git add --renormalize .
 $ git commit -m "Introduce end-of-line normalization"

Note that "git add --renormalize <pathspec>" is the short form for
"git add -u --renormalize <pathspec>".

While at it, document that the same renormalization may be needed
whenever a clean filter is added or changed.

Helped-By: Junio C Hamano 
Signed-off-by: Torsten Bögershausen 
---

Second version:
- Removed the global flag
- Make clearer that the clean filters may need renormalization
- commit message improved

Documentation/git-add.txt   |  8 +++-
 Documentation/gitattributes.txt |  6 --
 builtin/add.c   | 28 ++--
 cache.h |  1 +
 read-cache.c| 30 +++---
 sha1_file.c | 16 ++--
 t/t0025-crlf-renormalize.sh | 30 ++
 7 files changed, 101 insertions(+), 18 deletions(-)
 create mode 100755 t/t0025-crlf-renormalize.sh

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index b700beaff5..09a08ce4c1 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 [verse]
 'git add' [--verbose | -v] [--dry-run | -n] [--force | -f] [--interactive | 
-i] [--patch | -p]
  [--edit | -e] [--[no-]all | --[no-]ignore-removal | [--update | -u]]
- [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing]
+ [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing] [--renormalize]
  [--chmod=(+|-)x] [--] [...]
 
 DESCRIPTION
@@ -175,6 +175,12 @@ for "git add --no-all ...", i.e. ignored removed 
files.
warning (e.g., if you are manually performing operations on
submodules).
 
+--renormalize::
+   Normalizes the line endings from CRLF to LF of tracked files.
+   This applies to files which are either "text" or "text=auto"
+   in .gitattributes (or core.autocrlf is true or input)
+   --renormalize implies -u
+
 --chmod=(+|-)x::
Override the executable bit of the added files.  The executable
bit is only changed in the index, the files on disk are left
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 4c68bc19d5..30687de81a 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -232,8 +232,7 @@ From a clean working directory:
 
 -
 $ echo "* text=auto" >.gitattributes
-$ git read-tree --empty   # Clean index, force re-scan of working directory
-$ git add .
+$ git add --renormalize .
 $ git status# Show files that will be normalized
 $ git commit -m "Introduce end-of-line normalization"
 -
@@ -328,6 +327,9 @@ You can declare that a filter turns a content that by 
itself is unusable
 into a usable content by setting the filter..required configuration
 variable to `true`.
 
+Note: Whenever the clean filter is changed, the repo should be renormalized:
+$ git add --renormalize .
+
 For example, in .gitattributes, you would assign the `filter`
 attribute for paths.
 
diff --git a/builtin/add.c b/builtin/add.c
index a648cf4c56..c42b50f857 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -26,6 +26,7 @@ static const char * const builtin_add_usage[] = {
 };
 static int patch_interactive, add_interactive, edit_interactive;
 static int take_worktree_changes;
+static int add_renormalize;
 
 struct update_callback_data {
int flags;
@@ -123,6 +124,25 @@ int add_files_to_cache(const char *prefix,
return !!data.add_errors;
 }
 
+static int renormalize_tracked_files(const struct pathspec *pathspec, int 
flags)
+{
+   int i, retval = 0;
+
+   for (i = 0; i < active_nr; i++) {
+   struct cache_entry *ce = active_cache[i];
+
+   if (ce_stage(ce))
+   continue; /* do not touch unmerged paths */
+   if (!S_ISREG(ce->ce_mode) && !S_ISLNK(ce->ce_mode))
+   continue; /* do not touch non blobs */
+   if (pathspec && !ce_path_match(ce, pathspec, NULL))
+   continue;
+   retval |= add_file_to_cache(ce->name, flags | HASH_RENORMALIZE);
+   }
+
+   return retval;
+}
+
 static char *prune_directory(struct dir_struct *dir, struct pathspec 
*pathspec,

[PATCH v1 1/1] Introduce git add --renormalize .

2017-10-16 Thread tboegi

From: Torsten Bögershausen 

Make it safer to normalize the line endings in a repository:
Files that had been committed with CRLF will be committed with LF.
(Unless core.autocrlf and .gitattributes specify that Git
 should not do line ending conversions)

The old way to normalize a repo was like this:
 # Make sure that there are no untracked files
 $ echo "* text=auto" >.gitattributes
 $ git read-tree --empty
 $ git add .
 $ git commit -m "Introduce end-of-line normalization"

The new method is one step shorter, more intuitive and does not
add untracked files:
 $ echo "* text=auto" >.gitattributes
 $ git add --renormalize .
 $ git commit -m "Introduce end-of-line normalization"

Note that "git add --renormalize <pathspec>" is the short form for
"git add -u --renormalize <pathspec>".

Signed-off-by: Torsten Bögershausen 
---
 Documentation/git-add.txt   |  8 +++-
 Documentation/gitattributes.txt |  3 +--
 builtin/add.c   | 27 +--
 cache.h |  1 +
 convert.c   |  1 +
 environment.c   |  1 +
 read-cache.c| 24 ++--
 t/t0025-crlf-renormalize.sh | 30 ++
 8 files changed, 80 insertions(+), 15 deletions(-)
 create mode 100755 t/t0025-crlf-renormalize.sh

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index f4169fb1ec..b6e431903d 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 [verse]
 'git add' [--verbose | -v] [--dry-run | -n] [--force | -f] [--interactive | 
-i] [--patch | -p]
  [--edit | -e] [--[no-]all | --[no-]ignore-removal | [--update | -u]]
- [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing]
+ [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing] [--renormalize]
  [--chmod=(+|-)x] [--] [...]
 
 DESCRIPTION
@@ -172,6 +172,12 @@ for "git add --no-all ...", i.e. ignored removed 
files.
warning (e.g., if you are manually performing operations on
submodules).
 
+--renormalize::
+   Normalizes the line endings from CRLF to LF of tracked files.
+   This applies to files which are either "text" or "text=auto"
+   in .gitattributes (or core.autocrlf is true or input)
+--renormalize implies -u
+
 --chmod=(+|-)x::
Override the executable bit of the added files.  The executable
bit is only changed in the index, the files on disk are left
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 4c68bc19d5..071dec2bc4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -232,8 +232,7 @@ From a clean working directory:
 
 -
 $ echo "* text=auto" >.gitattributes
-$ git read-tree --empty   # Clean index, force re-scan of working directory
-$ git add .
+$ git add --renormalize .
 $ git status# Show files that will be normalized
 $ git commit -m "Introduce end-of-line normalization"
 -
diff --git a/builtin/add.c b/builtin/add.c
index a648cf4c56..ee8e756fdc 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -123,6 +123,25 @@ int add_files_to_cache(const char *prefix,
return !!data.add_errors;
 }
 
+static int renormalize_tracked_files(const struct pathspec *pathspec, int 
flags)
+{
+   int i, retval = 0;
+
+   for (i = 0; i < active_nr; i++) {
+   struct cache_entry *ce = active_cache[i];
+
+   if (ce_stage(ce))
+   continue; /* do not touch unmerged paths */
+   if (!S_ISREG(ce->ce_mode) && !S_ISLNK(ce->ce_mode))
+   continue; /* do not touch non blobs */
+   if (pathspec && !ce_path_match(ce, pathspec, NULL))
+   continue;
+   retval |= add_file_to_cache(ce->name, flags);
+   }
+
+   return retval;
+}
+
 static char *prune_directory(struct dir_struct *dir, struct pathspec 
*pathspec, int prefix)
 {
char *seen;
@@ -276,6 +295,7 @@ static struct option builtin_add_options[] = {
OPT_BOOL('e', "edit", &edit_interactive, N_("edit current diff and 
apply")),
OPT__FORCE(&ignored_too, N_("allow adding otherwise ignored files")),
OPT_BOOL('u', "update", &take_worktree_changes, N_("update tracked 
files")),
+   OPT_BOOL(0, "renormalize", &add_renormalize, N_("renormalize EOL of 
tracked files (implies -u)")),
OPT_BOOL('N', "intent-to-add", &intent_to_add, N_("record only the fact 
that the path will be added later")),
OPT_BOOL('A', "all", &addremove_explicit, N_("add changes from all 
tracked and untracked files")),
{ OPTION_CALLBACK, 0, "ignore-removal", &addremove_explicit,
@@ -406,7 +426,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
  chmod_arg[1] != 'x' || chmod_arg[2]))

[PATCH v1 1/1] test-lint: echo -e (or -E) is not portable

2017-09-16 Thread tboegi

From: Torsten Bögershausen 

Some implementations of `echo` support the '-e' option to enable
backslash interpretation of the following string.
In addition, they support '-E' to turn it off.

However, none of these are portable: POSIX doesn't even mention them,
and many implementations don't support them.

A check for '-n' is already done in check-non-portable-shell.pl;
extend it to cover '-n', '-e' or '-E'.
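
For readers who want to try the same check outside Perl, here is a rough
C equivalent using POSIX regcomp()/regexec(). It is a sketch, not part of
the patch: the function name uses_echo_option() is hypothetical, and
because POSIX extended regular expressions have no \b or \s, the word
boundary and whitespace are spelled out by hand.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the shell line uses echo with -n, -e or -E, else 0. */
static int uses_echo_option(const char *line)
{
	regex_t re;
	int hit;

	assert(line != NULL);
	/* "(^|non-word char)echo", whitespace, then -n, -e or -E. */
	if (regcomp(&re, "(^|[^[:alnum:]_])echo[[:space:]]+-[neE]",
		    REG_EXTENDED | REG_NOSUB))
		return 0; /* pattern failed to compile: report no match */
	hit = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

Like the Perl pattern, this only looks at the first option character, so
it flags any option that merely starts with n, e or E.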

Signed-off-by: Torsten Bögershausen 
---
 t/check-non-portable-shell.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl
index b170cbc045..03dc9d2852 100755
--- a/t/check-non-portable-shell.pl
+++ b/t/check-non-portable-shell.pl
@@ -17,7 +17,7 @@ sub err {
 while (<>) {
chomp;
/\bsed\s+-i/ and err 'sed -i is not portable';
-   /\becho\s+-n/ and err 'echo -n is not portable (please use printf)';
+   /\becho\s+-[neE]/ and err 'echo with option is not portable (please use 
printf)';
/^\s*declare\s+/ and err 'arrays/declare not portable';
/^\s*[^#]\s*which\s/ and err 'which is not portable (please use type)';
/\btest\s+[^=]*==/ and err '"test a == b" is not portable (please use 
=)';
-- 
2.14.1.145.gb3622a4ee9

[PATCH v4 2/2] File commited with CRLF should roundtrip diff and apply

2017-08-19 Thread tboegi

From: Torsten Bögershausen 

When a file had been committed with CRLF but now .gitattributes says
"* text=auto" (or core.autocrlf is true),
the following does not roundtrip; `git apply` fails:

printf "Added line\r\n" >>file &&
git diff >patch &&
git checkout -- . &&
git apply patch

Before applying the patch, the file from the working tree is converted into the
index format (clean filter, CRLF conversion, ...).
Here, when committed with CRLF, the line endings should not be converted.

Note that `git apply --index` or `git apply --cached` doesn't call
convert_to_git() because the source material is already in index format.

Analyze the patch to see whether a) any context line has CRLF,
or b) any line with CRLF is to be removed.
In this case the patch file `patch` has mixed line endings; for a)
it looks like this:

 diff --git a/one b/one
 index 533790e..c30dea8 100644
 --- a/one
 +++ b/one
 @@ -1 +1,2 @@
  a\r
 +b\r

And for b) it looks like this:

 diff --git a/one b/one
 index 533790e..485540d 100644
 --- a/one
 +++ b/one
 @@ -1 +1 @@
 -a\r
 +b\r

If `git apply` detects that the patch itself has CRLF, (look at the line
" a\r" or "-a\r" above), the new flag crlf_in_old is set in "struct patch"
and two things will happen:
- read_old_data() will not convert CRLF into LF by calling
  convert_to_git(..., SAFE_CRLF_KEEP_CRLF);
- The WS_CR_AT_EOL bit is set in the "white space rule",
  CRLF are no longer treated as white space.
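
The detection step can be sketched as standalone C. This is a simplified
illustration, not the code in apply.c: the function names are
hypothetical, the input is assumed to be the body of a single hunk (the
lines after the "@@ ... @@" header), and reverse application is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Does this single line (including its terminating '\n') end in CRLF? */
static int line_ends_with_crlf(const char *line, size_t len)
{
	return len >= 2 && line[len - 1] == '\n' && line[len - 2] == '\r';
}

/*
 * Return 1 if a context (' ') or removed ('-') line ends in CRLF,
 * i.e. the preimage the patch expects has CRLF line endings.
 */
static int hunk_old_side_has_crlf(const char *hunk)
{
	assert(hunk != NULL);
	while (*hunk) {
		const char *nl = strchr(hunk, '\n');
		size_t len = nl ? (size_t)(nl - hunk) + 1 : strlen(hunk);

		if ((hunk[0] == ' ' || hunk[0] == '-') &&
		    line_ends_with_crlf(hunk, len))
			return 1;
		hunk += len;
	}
	return 0;
}
```

Added ('+') lines are deliberately skipped: they describe the new
content and say nothing about the line endings of the file being patched.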

While at it, make clear that read_old_data() in apply.c
knows what it wants convert_to_git() to do with respect to CRLF.  In
fact, this codepath is about applying a patch to a file in the
filesystem, which may not exist in the index, or may exist but may
not match what is recorded in the index, or in the extreme case, we
may not even be in a Git repository.  If convert_to_git() peeked at
the index while doing its work, it *would* be a bug.

Pass NULL instead of &the_index to convert_to_git() to make this
explicit and to catch future bugs.

Update the test in t4124: split one test case into 3:
- Detect the " a\r" line in the patch
- Detect the "-a\r" line in the patch
- Use LF in repo and CRLF in the worktree.

Reported-by: Anthony Sottile 
Helped-by: Junio C Hamano 
Signed-off-by: Torsten Bögershausen 
---

Changes since v3:
- took apply.c from junio/tb/apply-with-crlf
- Remove the leading asterisk in the commit message, at the place
  where the "git diff" is cited.
- Mention "Pass NULL instead of &the_index to convert_to_git()"

apply.c  | 41 -
 t/t4124-apply-ws-rule.sh | 33 +++--
 2 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/apply.c b/apply.c
index f2d599141d..66c68f193a 100644
--- a/apply.c
+++ b/apply.c
@@ -220,6 +220,7 @@ struct patch {
unsigned int recount:1;
unsigned int conflicted_threeway:1;
unsigned int direct_to_threeway:1;
+   unsigned int crlf_in_old:1;
struct fragment *fragments;
char *result;
size_t resultsize;
@@ -1662,6 +1663,19 @@ static void check_whitespace(struct apply_state *state,
record_ws_error(state, result, line + 1, len - 2, state->linenr);
 }
 
+/*
+ * Check if the patch has context lines with CRLF or
+ * the patch wants to remove lines with CRLF.
+ */
+static void check_old_for_crlf(struct patch *patch, const char *line, int len)
+{
+   if (len >= 2 && line[len-1] == '\n' && line[len-2] == '\r') {
+   patch->ws_rule |= WS_CR_AT_EOL;
+   patch->crlf_in_old = 1;
+   }
+}
+
+
 /*
  * Parse a unified diff. Note that this really needs to parse each
  * fragment separately, since the only way to know the difference
@@ -1712,11 +1726,14 @@ static int parse_fragment(struct apply_state *state,
if (!deleted && !added)
leading++;
trailing++;
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action == correct_ws_error)
check_whitespace(state, line, len, 
patch->ws_rule);
break;
case '-':
+   if (!state->apply_in_reverse)
+   check_old_for_crlf(patch, line, len);
if (state->apply_in_reverse &&
state->ws_error_action != nowarn_ws_error)
check_whitespace(state, line, len, 
patch->ws_rule);
@@ -1725,6 +1742,8 @@ static int parse_fragment(struct apply_state *state,
trailing = 0;
break;
case '+':
+   if (state->apply_in_reverse)
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action != nowarn_ws

[PATCH v4 1/2] convert: Add SAFE_CRLF_KEEP_CRLF

2017-08-19 Thread tboegi

From: Torsten Bögershausen 

When convert_to_git() is called, the caller may want CRLF
to be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files that have
CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter,
skip the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen 
---
 convert.c | 10 ++
 convert.h |  3 ++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/convert.c b/convert.c
index deaf0ba7b3..040123b4fe 100644
--- a/convert.c
+++ b/convert.c
@@ -1104,10 +1104,12 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
-   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, 
checksafe);
-   if (ret && dst) {
-   src = dst->buf;
-   len = dst->len;
+   if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, 
checksafe);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
}
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
diff --git a/convert.h b/convert.h
index cecf59d1aa..cabd5ed6dd 100644
--- a/convert.h
+++ b/convert.h
@@ -10,7 +10,8 @@ enum safe_crlf {
SAFE_CRLF_FALSE = 0,
SAFE_CRLF_FAIL = 1,
SAFE_CRLF_WARN = 2,
-   SAFE_CRLF_RENORMALIZE = 3
+   SAFE_CRLF_RENORMALIZE = 3,
+   SAFE_CRLF_KEEP_CRLF = 4
 };
 
 extern enum safe_crlf safe_crlf;
-- 
2.14.0.rc1.15.gd40c2d4e85.dirty

[PATCH v3 1/2] convert: Add SAFE_CRLF_KEEP_CRLF

2017-08-17 Thread tboegi

From: Torsten Bögershausen 

When convert_to_git() is called, the caller may want CRLF
to be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files that have
CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter,
skip the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen 
---
 convert.c | 10 ++
 convert.h |  3 ++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/convert.c b/convert.c
index deaf0ba7b3..040123b4fe 100644
--- a/convert.c
+++ b/convert.c
@@ -1104,10 +1104,12 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
-   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
-   if (ret && dst) {
-   src = dst->buf;
-   len = dst->len;
+   if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
}
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
diff --git a/convert.h b/convert.h
index cecf59d1aa..cabd5ed6dd 100644
--- a/convert.h
+++ b/convert.h
@@ -10,7 +10,8 @@ enum safe_crlf {
SAFE_CRLF_FALSE = 0,
SAFE_CRLF_FAIL = 1,
SAFE_CRLF_WARN = 2,
-   SAFE_CRLF_RENORMALIZE = 3
+   SAFE_CRLF_RENORMALIZE = 3,
+   SAFE_CRLF_KEEP_CRLF = 4
 };
 
 extern enum safe_crlf safe_crlf;
-- 
2.14.1.145.gb3622a4ee9

[PATCH v3 2/2] File commited with CRLF should roundtrip diff and apply

2017-08-17 Thread tboegi

From: Torsten Bögershausen 

When a file has been committed with CRLF but .gitattributes now says
"* text=auto" (or core.autocrlf is true),
the following does not round-trip; `git apply` fails:

printf "Added line\r\n" >>file &&
git diff >patch &&
git checkout -- . &&
git apply patch

Before applying the patch, the file from the working tree is converted into
the index format (clean filter, CRLF conversion, ...).
Here, when the file was committed with CRLF, the line endings should not be
converted.

Note that `git apply --index` and `git apply --cached` do not call
convert_to_git() because the source material is already in index format.

Analyze the patch to see whether a) any context line has CRLF,
or b) any line with CRLF is to be removed.
In this case the patch file `patch` has mixed line endings. For a)
it looks like this (ignore the * at the beginning of each line):

* diff --git a/one b/one
* index 533790e..c30dea8 100644
* --- a/one
* +++ b/one
* @@ -1 +1,2 @@
*  a\r
* +b\r

And for b) it looks like this:

* diff --git a/one b/one
* index 533790e..485540d 100644
* --- a/one
* +++ b/one
* @@ -1 +1 @@
* -a\r
* +b\r

If `git apply` detects that the patch itself has CRLF (look at the line
" a\r" or "-a\r" above), the new flag crlf_in_old is set in "struct patch"
and two things will happen:
- read_old_data() will not convert CRLF into LF by calling
  convert_to_git(..., SAFE_CRLF_KEEP_CRLF);
- The WS_CR_AT_EOL bit is set in the "white space rule",
  CRLF are no longer treated as white space.
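
The detection step can be tried in isolation. This is a reduced sketch under
assumptions, not the code in apply.c: struct patch_sketch, the bit value of
SKETCH_WS_CR_AT_EOL and the function name are invented here, and the check of
the leading ' ' or '-' (done by the switch in parse_fragment() in the patch)
is folded into the helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the fields the patch touches. */
struct patch_sketch {
	unsigned ws_rule;
	unsigned crlf_in_old:1;
};

#define SKETCH_WS_CR_AT_EOL 01000u /* illustrative bit value, an assumption */

/*
 * line is one patch line including its leading ' ', '-' or '+'
 * and its trailing '\n'; len is the length of that line.
 */
void note_crlf_in_old(struct patch_sketch *p, const char *line, size_t len)
{
	assert(p && line);
	if (len < 2 || (line[0] != ' ' && line[0] != '-'))
		return; /* only context and removed lines describe the old file */
	if (line[len - 1] == '\n' && line[len - 2] == '\r') {
		p->ws_rule |= SKETCH_WS_CR_AT_EOL;
		p->crlf_in_old = 1;
	}
}
```

An added line such as "+b\r" alone does not set the flag, because it says
nothing about the line endings of the file the patch is applied to.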

Thanks to Junio C Hamano; his input became the basis for the changes in t4124.
One test case is split up into 3:
- Detect the " a\r" line in the patch
- Detect the "-a\r" line in the patch
- Use LF in repo and CRLF in the worktree.

Reported-by: Anthony Sottile 
Signed-off-by: Torsten Bögershausen 
---
Changes since v2:
- Manually integrated all code changes from Junio
  (Thanks, I hope that I didn't miss something)
- Having examples of "git diff" in the commit message confuses "git apply",
  so all examples of git diff output have a '*' at the beginning of the line
  (v2 used '$', which typically marks an example of a shell command)
- The official version to apply the CRLF-rules without having an index is
  SAFE_CRLF_RENORMALIZE, that is already working today.
- Now we have convert_to_git(NULL, ..., safe_crlf) with
  enum safe_crlf safe_crlf = patch->crlf_in_old ?
  SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;

apply.c  | 40 +++-
 t/t4124-apply-ws-rule.sh | 33 +++--
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/apply.c b/apply.c
index f2d599141d..691f47c783 100644
--- a/apply.c
+++ b/apply.c
@@ -220,6 +220,7 @@ struct patch {
unsigned int recount:1;
unsigned int conflicted_threeway:1;
unsigned int direct_to_threeway:1;
+   unsigned int crlf_in_old:1;
struct fragment *fragments;
char *result;
size_t resultsize;
@@ -1662,6 +1663,19 @@ static void check_whitespace(struct apply_state *state,
record_ws_error(state, result, line + 1, len - 2, state->linenr);
 }
 
+/*
+ * Check if the patch has context lines with CRLF or
+ * the patch wants to remove lines with CRLF.
+ */
+static void check_old_for_crlf(struct patch *patch, const char *line, int len)
+{
+   if (len >= 2 && line[len-1] == '\n' && line[len-2] == '\r') {
+   patch->ws_rule |= WS_CR_AT_EOL;
+   patch->crlf_in_old = 1;
+   }
+}
+
+
 /*
  * Parse a unified diff. Note that this really needs to parse each
  * fragment separately, since the only way to know the difference
@@ -1712,11 +1726,15 @@ static int parse_fragment(struct apply_state *state,
if (!deleted && !added)
leading++;
trailing++;
+   if (!state->apply_in_reverse)
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action == correct_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
break;
case '-':
+   if (!state->apply_in_reverse)
+   check_old_for_crlf(patch, line, len);
if (state->apply_in_reverse &&
state->ws_error_action != nowarn_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
@@ -2268,8 +2286,11 @@ static void show_stats(struct apply_state *state, struct patch *patch)
add, pluses, del, minuses);
 }
 
-static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
+static int read_old_data(struct stat *st, struct patch *patch,
+const char *path, struct strbuf *buf)
 {
+   enum safe_crlf safe_crlf = patch->crlf_in_old ?
+

[PATCH v2 1/2] convert: Add SAFE_CRLF_KEEP_CRLF

2017-08-16 Thread tboegi

From: Torsten Bögershausen 

When convert_to_git() is called, the caller may want CRLF
to be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files that have
CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter,
skip the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen 
---
 convert.c | 10 ++
 convert.h |  3 ++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/convert.c b/convert.c
index deaf0ba7b3..040123b4fe 100644
--- a/convert.c
+++ b/convert.c
@@ -1104,10 +1104,12 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
-   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
-   if (ret && dst) {
-   src = dst->buf;
-   len = dst->len;
+   if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
}
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
diff --git a/convert.h b/convert.h
index cecf59d1aa..cabd5ed6dd 100644
--- a/convert.h
+++ b/convert.h
@@ -10,7 +10,8 @@ enum safe_crlf {
SAFE_CRLF_FALSE = 0,
SAFE_CRLF_FAIL = 1,
SAFE_CRLF_WARN = 2,
-   SAFE_CRLF_RENORMALIZE = 3
+   SAFE_CRLF_RENORMALIZE = 3,
+   SAFE_CRLF_KEEP_CRLF = 4
 };
 
 extern enum safe_crlf safe_crlf;
-- 
2.14.1.145.gb3622a4ee9

[PATCH v2 2/2] File commited with CRLF should roundtrip diff and apply

2017-08-16 Thread tboegi

From: Torsten Bögershausen 

When a file has been committed with CRLF but .gitattributes now says
"* text=auto" (or core.autocrlf is true),
the following does not round-trip; `git apply` fails:

printf "Added line\r\n" >>file &&
git diff >patch &&
git checkout -- . &&
git apply patch

Before applying the patch, the file from the working tree is converted into
the index format (clean filter, CRLF conversion, ...).
Here, when the file was committed with CRLF, the line endings should not be
converted.

Note that `git apply --index` and `git apply --cached` do not call
convert_to_git() because the source material is already in index format.

Analyze the patch to see whether a) any context line has CRLF,
or b) any line with CRLF is to be removed.
In this case the patch file `patch` has mixed line endings. For a)
it looks like this (ignore the $ at the beginning of each line):

$ diff --git a/one b/one
$ index 533790e..c30dea8 100644
$ --- a/one
$ +++ b/one
$ @@ -1 +1,2 @@
$  a\r
$ +b\r

And for b) it looks like this:

$ diff --git a/one b/one
$ index 533790e..485540d 100644
$ --- a/one
$ +++ b/one
$ @@ -1 +1 @@
$ -a\r
$ +b\r

If `git apply` detects that the patch itself has CRLF (look at the line
" a\r" or "-a\r" above), the new flag has_crlf is set in "struct patch"
and two things will happen:
- read_old_data() will not convert CRLF into LF by calling
  convert_to_git(..., SAFE_CRLF_KEEP_CRLF);
- The WS_CR_AT_EOL bit is set in the "white space rule",
  CRLF are no longer treated as white space.

Thanks to Junio C Hamano; his input became the basis for the changes in t4124.
One test case is split up into 3:
- Detect the " a\r" line in the patch
- Detect the "-a\r" line in the patch
- Use LF in repo and CRLF in the worktree. (*)

* This one proves that convert_to_git(&the_index,...) still needs to pass
the &index, otherwise Git will crash.

Reported-by: Anthony Sottile 
Signed-off-by: Torsten Bögershausen 
---
 apply.c  | 28 +++-
 t/t4124-apply-ws-rule.sh | 33 +++--
 2 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/apply.c b/apply.c
index f2d599141d..bebb176099 100644
--- a/apply.c
+++ b/apply.c
@@ -220,6 +220,7 @@ struct patch {
unsigned int recount:1;
unsigned int conflicted_threeway:1;
unsigned int direct_to_threeway:1;
+   unsigned int has_crlf:1;
struct fragment *fragments;
char *result;
size_t resultsize;
@@ -1662,6 +1663,17 @@ static void check_whitespace(struct apply_state *state,
record_ws_error(state, result, line + 1, len - 2, state->linenr);
 }
 
+/* Check if the patch has context lines with CRLF or
+   the patch wants to remove lines with CRLF */
+static void check_old_for_crlf(struct patch *patch, const char *line, int len)
+{
+   if (len >= 2 && line[len-1] == '\n' && line[len-2] == '\r') {
+   patch->ws_rule |= WS_CR_AT_EOL;
+   patch->has_crlf = 1;
+   }
+}
+
+
 /*
  * Parse a unified diff. Note that this really needs to parse each
  * fragment separately, since the only way to know the difference
@@ -1712,11 +1724,13 @@ static int parse_fragment(struct apply_state *state,
if (!deleted && !added)
leading++;
trailing++;
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action == correct_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
break;
case '-':
+   check_old_for_crlf(patch, line, len);
if (state->apply_in_reverse &&
state->ws_error_action != nowarn_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
@@ -2268,8 +2282,11 @@ static void show_stats(struct apply_state *state, struct patch *patch)
add, pluses, del, minuses);
 }
 
-static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
+static int read_old_data(struct stat *st, struct patch *patch,
+const char *path, struct strbuf *buf)
 {
+   enum safe_crlf safe_crlf = patch->has_crlf ?
+   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_FALSE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2278,7 +2295,7 @@ static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
case S_IFREG:
if (strbuf_read_file(buf, path, st->st_size) != st->st_size)
return error(_("unable to open or read %s"), path);
-   convert_to_git(&the_index, path, buf->buf, buf->len, buf, 0);
+   convert_to_git(&the_index, path, buf->buf, buf->len, buf, safe_crlf);
return 0;
default:

[PATCH/RFC 2/2] File commited with CRLF should roundtrip diff and apply

2017-08-13 Thread tboegi

From: Torsten Bögershausen 

When a file has been committed with CRLF and core.autocrlf is true,
the following does not round-trip; `git apply` fails:

printf "Added line\r\n" >>file &&
git diff >patch &&
git checkout -- . &&
git apply patch

Before applying the patch, the file from the working tree is converted into
the index format (clean filter, CRLF conversion, ...).
Here, when the file was committed with CRLF, the line endings should not be
converted.

Analyze the patch to see whether any context line has CRLF,
or whether any line with CRLF is to be removed.

If yes, the new flag has_crlf is set in "struct patch", and two things
will happen:
- read_old_data() will not convert CRLF into LF by calling
  convert_to_git(..., SAFE_CRLF_KEEP_CRLF);
- The WS_CR_AT_EOL bit is set in the "white space rule",
  CRLF are no longer treated as white space.

Thanks to Junio C Hamano; his input became the basis for t4140.

Reported-by: Anthony Sottile 
Signed-off-by: Torsten Bögershausen 
---


The last version did not pass t4124, fix this.



apply.c  | 37 -
 apply.h  |  4 
 t/t4124-apply-ws-rule.sh |  3 +--
 t/t4140-apply-CRLF.sh| 46 ++
 4 files changed, 79 insertions(+), 11 deletions(-)
 create mode 100755 t/t4140-apply-CRLF.sh

diff --git a/apply.c b/apply.c
index f2d599141d..63455cd65f 100644
--- a/apply.c
+++ b/apply.c
@@ -220,6 +220,7 @@ struct patch {
unsigned int recount:1;
unsigned int conflicted_threeway:1;
unsigned int direct_to_threeway:1;
+   unsigned int has_crlf:1;
struct fragment *fragments;
char *result;
size_t resultsize;
@@ -1662,6 +1663,17 @@ static void check_whitespace(struct apply_state *state,
record_ws_error(state, result, line + 1, len - 2, state->linenr);
 }
 
+/* Check if the patch has context lines with CRLF or
+   the patch wants to remove lines with CRLF */
+static void check_old_for_crlf(struct patch *patch, const char *line, int len)
+{
+   if (len >= 2 && line[len-1] == '\n' && line[len-2] == '\r') {
+   patch->ws_rule |= WS_CR_AT_EOL;
+   patch->has_crlf = 1;
+   }
+}
+
+
 /*
  * Parse a unified diff. Note that this really needs to parse each
  * fragment separately, since the only way to know the difference
@@ -1712,11 +1724,13 @@ static int parse_fragment(struct apply_state *state,
if (!deleted && !added)
leading++;
trailing++;
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action == correct_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
break;
case '-':
+   check_old_for_crlf(patch, line, len);
if (state->apply_in_reverse &&
state->ws_error_action != nowarn_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
@@ -2268,8 +2282,10 @@ static void show_stats(struct apply_state *state, struct patch *patch)
add, pluses, del, minuses);
 }
 
-static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
+static int read_old_data(struct stat *st, const char *path, struct strbuf *buf, int flags)
 {
+   enum safe_crlf safe_crlf = flags & APPLY_FLAGS_CR_AT_EOL ?
+   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_FALSE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2278,7 +2294,7 @@ static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
case S_IFREG:
if (strbuf_read_file(buf, path, st->st_size) != st->st_size)
return error(_("unable to open or read %s"), path);
-   convert_to_git(&the_index, path, buf->buf, buf->len, buf, 0);
+   convert_to_git(&the_index, path, buf->buf, buf->len, buf, safe_crlf);
return 0;
default:
return -1;
@@ -3385,7 +3401,8 @@ static int load_patch_target(struct apply_state *state,
 const struct cache_entry *ce,
 struct stat *st,
 const char *name,
-unsigned expected_mode)
+unsigned expected_mode,
+int flags)
 {
if (state->cached || state->check_index) {
if (read_file_or_gitlink(ce, buf))
@@ -3399,7 +3416,7 @@ static int load_patch_target(struct apply_state *state,
} else if (has_symlink_leading_path(name, strlen(name))) {
return error(_("reading from '%s' beyond a symbolic link"), name);
} else {
-

[PATCH/RFC 1/2] convert: Add SAFE_CRLF_KEEP_CRLF

2017-08-13 Thread tboegi

From: Torsten Bögershausen 

When convert_to_git() is called, the caller may want CRLF
to be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files that have
CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter,
skip the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen 
---
convert.c | 10 ++
 convert.h |  3 ++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/convert.c b/convert.c
index deaf0ba7b3..040123b4fe 100644
--- a/convert.c
+++ b/convert.c
@@ -1104,10 +1104,12 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
-   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
-   if (ret && dst) {
-   src = dst->buf;
-   len = dst->len;
+   if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
}
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
diff --git a/convert.h b/convert.h
index cecf59d1aa..cabd5ed6dd 100644
--- a/convert.h
+++ b/convert.h
@@ -10,7 +10,8 @@ enum safe_crlf {
SAFE_CRLF_FALSE = 0,
SAFE_CRLF_FAIL = 1,
SAFE_CRLF_WARN = 2,
-   SAFE_CRLF_RENORMALIZE = 3
+   SAFE_CRLF_RENORMALIZE = 3,
+   SAFE_CRLF_KEEP_CRLF = 4
 };
 
 extern enum safe_crlf safe_crlf;
-- 
2.14.1.145.gb3622a4ee9

[PATCH/RFC] convert: Add SAFE_CRLF_KEEP_CRLF

2017-08-12 Thread tboegi

From: Torsten Bögershausen 

When convert_to_git() is called, the caller may want CRLF
to be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files that have
CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter,
skip the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen 
---
 convert.c | 10 ++
 convert.h |  3 ++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/convert.c b/convert.c
index deaf0ba7b3..040123b4fe 100644
--- a/convert.c
+++ b/convert.c
@@ -1104,10 +1104,12 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
-   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
-   if (ret && dst) {
-   src = dst->buf;
-   len = dst->len;
+   if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+   ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+   if (ret && dst) {
+   src = dst->buf;
+   len = dst->len;
+   }
}
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
diff --git a/convert.h b/convert.h
index cecf59d1aa..cabd5ed6dd 100644
--- a/convert.h
+++ b/convert.h
@@ -10,7 +10,8 @@ enum safe_crlf {
SAFE_CRLF_FALSE = 0,
SAFE_CRLF_FAIL = 1,
SAFE_CRLF_WARN = 2,
-   SAFE_CRLF_RENORMALIZE = 3
+   SAFE_CRLF_RENORMALIZE = 3,
+   SAFE_CRLF_KEEP_CRLF = 4
 };
 
 extern enum safe_crlf safe_crlf;
-- 
2.14.1.145.gb3622a4ee9

[PATCH/RFC] File commited with CRLF should roundtrip diff and apply

2017-08-12 Thread tboegi

From: Torsten Bögershausen 

When a file has been committed with CRLF and core.autocrlf is true,
the following does not round-trip; `git apply` fails:

printf "Added line\r\n" >>file &&
git diff >patch &&
git checkout -- . &&
git apply patch

Before applying the patch, the file from the working tree is converted into
the index format (clean filter, CRLF conversion, ...).
Here, when the file was committed with CRLF, the line endings should not be
converted.

Analyze the patch to see whether any context line has CRLF,
or whether any line with CRLF is to be removed.

If yes, the new flag has_crlf is set in "struct patch", and two things
will happen:
- read_old_data() will not convert CRLF into LF by calling
  convert_to_git(..., SAFE_CRLF_KEEP_CRLF);
- The WS_CR_AT_EOL bit is set in the "white space rule",
  CRLF are no longer treated as white space.

Thanks to Junio C Hamano; his input became the basis for t4140.

Reported-by: Anthony Sottile 
Signed-off-by: Torsten Bögershausen 
---
 apply.c   | 37 -
 apply.h   |  4 
 t/t4140-apply-CRLF.sh | 46 ++
 3 files changed, 78 insertions(+), 9 deletions(-)
 create mode 100755 t/t4140-apply-CRLF.sh

diff --git a/apply.c b/apply.c
index f2d599141d..63455cd65f 100644
--- a/apply.c
+++ b/apply.c
@@ -220,6 +220,7 @@ struct patch {
unsigned int recount:1;
unsigned int conflicted_threeway:1;
unsigned int direct_to_threeway:1;
+   unsigned int has_crlf:1;
struct fragment *fragments;
char *result;
size_t resultsize;
@@ -1662,6 +1663,17 @@ static void check_whitespace(struct apply_state *state,
record_ws_error(state, result, line + 1, len - 2, state->linenr);
 }
 
+/* Check if the patch has context lines with CRLF or
+   the patch wants to remove lines with CRLF */
+static void check_old_for_crlf(struct patch *patch, const char *line, int len)
+{
+   if (len >= 2 && line[len-1] == '\n' && line[len-2] == '\r') {
+   patch->ws_rule |= WS_CR_AT_EOL;
+   patch->has_crlf = 1;
+   }
+}
+
+
 /*
  * Parse a unified diff. Note that this really needs to parse each
  * fragment separately, since the only way to know the difference
@@ -1712,11 +1724,13 @@ static int parse_fragment(struct apply_state *state,
if (!deleted && !added)
leading++;
trailing++;
+   check_old_for_crlf(patch, line, len);
if (!state->apply_in_reverse &&
state->ws_error_action == correct_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
break;
case '-':
+   check_old_for_crlf(patch, line, len);
if (state->apply_in_reverse &&
state->ws_error_action != nowarn_ws_error)
check_whitespace(state, line, len, patch->ws_rule);
@@ -2268,8 +2282,10 @@ static void show_stats(struct apply_state *state, struct patch *patch)
add, pluses, del, minuses);
 }
 
-static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
+static int read_old_data(struct stat *st, const char *path, struct strbuf *buf, int flags)
 {
+   enum safe_crlf safe_crlf = flags & APPLY_FLAGS_CR_AT_EOL ?
+   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_FALSE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2278,7 +2294,7 @@ static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
case S_IFREG:
if (strbuf_read_file(buf, path, st->st_size) != st->st_size)
return error(_("unable to open or read %s"), path);
-   convert_to_git(&the_index, path, buf->buf, buf->len, buf, 0);
+   convert_to_git(&the_index, path, buf->buf, buf->len, buf, safe_crlf);
return 0;
default:
return -1;
@@ -3385,7 +3401,8 @@ static int load_patch_target(struct apply_state *state,
 const struct cache_entry *ce,
 struct stat *st,
 const char *name,
-unsigned expected_mode)
+unsigned expected_mode,
+int flags)
 {
if (state->cached || state->check_index) {
if (read_file_or_gitlink(ce, buf))
@@ -3399,7 +3416,7 @@ static int load_patch_target(struct apply_state *state,
} else if (has_symlink_leading_path(name, strlen(name))) {
return error(_("reading from '%s' beyond a symbolic link"), name);
} else {
-   if (read_old_data(st, name, buf))
+   if (read_old_dat

[PATCH v1 1/1] correct apply for files commited with CRLF

2017-08-02 Thread tboegi

From: Torsten Bögershausen 

git apply does not find the source lines when files have CRLF in the index
and core.autocrlf is true.
These files should not get their CRLF converted to LF. But because cmd_apply()
does not load the index, this does not work: CRLF is converted into LF
and apply fails.

Fix this in the spirit of commit a08feb8ef0b6,
"correct blame for files commited with CRLF" by loading the index.

As an optimization, skip read_cache() when no conversion is specified
for this path.

Reported-by: Anthony Sottile 
Signed-off-by: Torsten Bögershausen 
---
 apply.c |  2 ++
 t/t0020-crlf.sh | 12 
 2 files changed, 14 insertions(+)

diff --git a/apply.c b/apply.c
index f2d599141d..66b8387360 100644
--- a/apply.c
+++ b/apply.c
@@ -2278,6 +2278,8 @@ static int read_old_data(struct stat *st, const char *path, struct strbuf *buf)
case S_IFREG:
if (strbuf_read_file(buf, path, st->st_size) != st->st_size)
return error(_("unable to open or read %s"), path);
+   if (would_convert_to_git(&the_index, path))
+   read_cache();
convert_to_git(&the_index, path, buf->buf, buf->len, buf, 0);
return 0;
default:
diff --git a/t/t0020-crlf.sh b/t/t0020-crlf.sh
index 71350e0657..6611f8a6f6 100755
--- a/t/t0020-crlf.sh
+++ b/t/t0020-crlf.sh
@@ -386,4 +386,16 @@ test_expect_success 'New CRLF file gets LF in repo' '
test_cmp alllf alllf2
 '
 
+test_expect_success 'CRLF in repo, apply with autocrlf=true' '
+   git config core.autocrlf false &&
+   printf "1\r\n2\r\n" >crlf &&
+   git add crlf &&
+   git commit -m "commit crlf with crlf" &&
+   git config core.autocrlf true &&
+   printf "1\r\n2\r\n\r\n\r\n\r\n" >crlf &&
+   git diff >patch &&
+   git checkout -- . &&
+   git apply patch
+'
+
 test_done
-- 
2.13.2.533.ge0aaa1b

[PATCH v3 1/1] cygwin: Allow pushing to UNC paths

2017-07-03 Thread tboegi

From: Torsten Bögershausen 

 cygwin can use an UNC path like //server/share/repo
 $ cd //server/share/dir
 $ mkdir test
 $ cd test
 $ git init --bare

 However, when we try to push from a local Git repository to this repo,
 there is a problem: Git converts the leading "//" into a single "/".

 As cygwin handles an UNC path so well, Git can support them better:
 - Introduce cygwin_offset_1st_component() which keeps the leading "//",
   similar to what Git for Windows does.
 - Move CYGWIN out of the POSIX in the tests for path normalization in t0060
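
What "keeps the leading //" means for the offset can be checked with a
standalone variant of the function. This is a sketch for experimenting, not
the file added by the patch: is_sep() is a local stand-in that treats only
'/' as a separator (an assumption about the cygwin build), and the name
unc_offset_1st_component() is made up.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in for is_dir_sep(); assumes '/' is the only separator. */
static int is_sep(char c)
{
	return c == '/';
}

/*
 * Return the length of the path prefix that must be kept intact:
 * "//server/share/" for a UNC path, "/" for an absolute path,
 * nothing for a relative path.
 */
int unc_offset_1st_component(const char *path)
{
	const char *pos = path;

	assert(path);
	if (is_sep(pos[0]) && is_sep(pos[1])) {
		/* skip server name */
		pos = strchr(pos + 2, '/');
		if (!pos)
			return 0; /* malformed UNC path */
		/* skip share name */
		do {
			pos++;
		} while (*pos && *pos != '/');
	}
	return (int)(pos + is_sep(*pos) - path);
}
```

For "//server/share/repo" the offset covers "//server/share/", so path
normalization starts after the share and the double slash survives.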

Signed-off-by: Torsten Bögershausen 
---

I think I will skip all the changes in setup.c and cygwin_access() for the
moment:
- It is not clear what is a regression and what is an improvement
- It may be a problem that could be solved in cygwin itself
- I was able to push to a UNC path on a Windows server
  when the domain controller was reachable.


compat/cygwin.c   | 19 +++
 compat/cygwin.h   |  2 ++
 config.mak.uname  |  1 +
 git-compat-util.h |  3 +++
 t/t0060-path-utils.sh |  2 ++
 5 files changed, 27 insertions(+)
 create mode 100644 compat/cygwin.c
 create mode 100644 compat/cygwin.h

diff --git a/compat/cygwin.c b/compat/cygwin.c
new file mode 100644
index 000..b9862d6
--- /dev/null
+++ b/compat/cygwin.c
@@ -0,0 +1,19 @@
+#include "../git-compat-util.h"
+#include "../cache.h"
+
+int cygwin_offset_1st_component(const char *path)
+{
+   const char *pos = path;
+   /* unc paths */
+   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   /* skip server name */
+   pos = strchr(pos + 2, '/');
+   if (!pos)
+   return 0; /* Error: malformed unc path */
+
+   do {
+   pos++;
+   } while (*pos && pos[0] != '/');
+   }
+   return pos + is_dir_sep(*pos) - path;
+}
diff --git a/compat/cygwin.h b/compat/cygwin.h
new file mode 100644
index 000..8e52de4
--- /dev/null
+++ b/compat/cygwin.h
@@ -0,0 +1,2 @@
+int cygwin_offset_1st_component(const char *path);
+#define offset_1st_component cygwin_offset_1st_component
diff --git a/config.mak.uname b/config.mak.uname
index adfb90b..551e465 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -184,6 +184,7 @@ ifeq ($(uname_O),Cygwin)
UNRELIABLE_FSTAT = UnfortunatelyYes
SPARSE_FLAGS = -isystem /usr/include/w32api -Wno-one-bit-signed-bitfield
OBJECT_CREATION_USES_RENAMES = UnfortunatelyNeedsTo
+   COMPAT_OBJS += compat/cygwin.o
 endif
 ifeq ($(uname_S),FreeBSD)
NEEDS_LIBICONV = YesPlease
diff --git a/git-compat-util.h b/git-compat-util.h
index 047172d..db9c22d 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -189,6 +189,9 @@
 #include 
 #endif
 
+#if defined(__CYGWIN__)
+#include "compat/cygwin.h"
+#endif
 #if defined(__MINGW32__)
 /* pull in Windows compatibility stuff */
 #include "compat/mingw.h"
diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh
index 444b5a4..7ea2bb5 100755
--- a/t/t0060-path-utils.sh
+++ b/t/t0060-path-utils.sh
@@ -70,6 +70,8 @@ ancestor() {
 case $(uname -s) in
 *MINGW*)
;;
+*CYGWIN*)
+   ;;
 *)
test_set_prereq POSIX
;;
-- 
2.10.0

[PATCH v2 2/2] cygwin: Allow pushing to UNC paths

2017-07-01 Thread tboegi

From: Torsten Bögershausen 

 cygwin can use an UNC path like //server/share/repo
 $ cd //server/share/dir
 $ mkdir test
 $ cd test
 $ git init --bare

 However, when we try to push from a local Git repository to this repo,
 there is a problem: Git converts the leading "//" into a single "/".

 As cygwin handles an UNC path so well, Git can support them better:
 - Introduce cygwin_offset_1st_component() which keeps the leading "//",
   similar to what Git for Windows does.
 - Move CYGWIN out of the POSIX in the tests for path normalization in t0060.
---
 config.mak.uname  | 1 +
 git-compat-util.h | 3 +++
 t/t0060-path-utils.sh | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/config.mak.uname b/config.mak.uname
index adfb90b..551e465 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -184,6 +184,7 @@ ifeq ($(uname_O),Cygwin)
UNRELIABLE_FSTAT = UnfortunatelyYes
SPARSE_FLAGS = -isystem /usr/include/w32api -Wno-one-bit-signed-bitfield
OBJECT_CREATION_USES_RENAMES = UnfortunatelyNeedsTo
+   COMPAT_OBJS += compat/cygwin.o
 endif
 ifeq ($(uname_S),FreeBSD)
NEEDS_LIBICONV = YesPlease
diff --git a/git-compat-util.h b/git-compat-util.h
index 047172d..db9c22d 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -189,6 +189,9 @@
 #include 
 #endif
 
+#if defined(__CYGWIN__)
+#include "compat/cygwin.h"
+#endif
 #if defined(__MINGW32__)
 /* pull in Windows compatibility stuff */
 #include "compat/mingw.h"
diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh
index 444b5a4..7ea2bb5 100755
--- a/t/t0060-path-utils.sh
+++ b/t/t0060-path-utils.sh
@@ -70,6 +70,8 @@ ancestor() {
 case $(uname -s) in
 *MINGW*)
;;
+*CYGWIN*)
+   ;;
 *)
test_set_prereq POSIX
;;
-- 
2.10.0

[PATCH v2 1/2] Check DB_ENVIRONMENT using is_directory()

2017-07-01 Thread tboegi

From: Torsten Bögershausen 

In setup.c, is_git_directory() checks a Git directory using access(X_OK).
This does not check whether the path is a file or a directory.
Check path with is_directory() instead.
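
For readers without the Git source at hand, a directory check of this kind
can be written with stat(). The helper below is a sketch with a made-up name
(is_directory_sketch) and is only assumed to behave like is_directory(); it
is not copied from Git.

```c
#include <assert.h>
#include <sys/stat.h>

/* Returns 1 if path names an existing directory, 0 otherwise. */
int is_directory_sketch(const char *path)
{
	struct stat st;

	assert(path);
	/* stat() follows symlinks, so a symlink to a directory counts too */
	return !stat(path, &st) && S_ISDIR(st.st_mode);
}
```

Unlike access(path, X_OK), which also succeeds for an executable regular
file, this returns 0 for anything that is not a directory.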
---
After all the discussions (and lots of tests) I found that this patch
works for my setup.
All in all, the error reporting could be improved for is_git_directory(),
as there may be "access denied", "not a directory" or other errors, but
that is for another day.

setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.c b/setup.c
index 358fbc2..5a7ee2e 100644
--- a/setup.c
+++ b/setup.c
@@ -321,7 +321,7 @@ int is_git_directory(const char *suspect)
 
/* Check non-worktree-related signatures */
if (getenv(DB_ENVIRONMENT)) {
-   if (access(getenv(DB_ENVIRONMENT), X_OK))
+   if (!is_directory(getenv(DB_ENVIRONMENT)))
goto done;
}
else {
-- 
2.10.0

[PATCH/RFC v1 1/1] cygwin: Allow pushing to UNC paths

2017-06-28 Thread tboegi

From: Torsten Bögershausen 

cygwin can use an UNC path like //server/share/repo
$ cd //server/share/dir
$ mkdir test
$ cd test
$ git init --bare

However, when we try to push from a local Git repository to this repo,
there are 2 problems:
- Git converts the leading "//" into a single "/".
- The remote repo is not accepted because setup.c calls
  access(getenv(DB_ENVIRONMENT), X_OK)
  and this call fails. In other words, checking the executable bit
  of a directory mounted on a SAMBA share is not reliable (and not needed).

As cygwin handles an UNC path so well, Git can support them better.
- Introduce cygwin_offset_1st_component() which keeps the leading "//",
  similar to what Git for Windows does.
- Move CYGWIN out of the POSIX in the tests for path normalization in t0060.
- Use cygwin_access() with a relaxed test for the executable bit on
  a directory pointed out by an UNC path.
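
The relaxed test boils down to masking one bit out of the mode argument.
The following sketch shows only that step; relaxed_access_mode() is a
hypothetical helper that computes the mode which would be handed to
access(), it does not touch the file system.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Return the mode to pass to access(): for a path starting with "//"
 * the X_OK bit is dropped, all other bits are kept.
 */
int relaxed_access_mode(const char *filename, int mode)
{
	assert(filename);
	if (filename[0] == '/' && filename[1] == '/')
		return mode & ~X_OK;
	return mode;
}
```

Paths that do not start with "//" keep the full mode, so local repositories
are checked exactly as before.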

Signed-off-by: Torsten Bögershausen 
---
 compat/cygwin.c   | 29 +
 compat/cygwin.h   |  7 +++
 config.mak.uname  |  1 +
 git-compat-util.h |  3 +++
 t/t0060-path-utils.sh |  2 ++
 5 files changed, 42 insertions(+)
 create mode 100644 compat/cygwin.c
 create mode 100644 compat/cygwin.h

diff --git a/compat/cygwin.c b/compat/cygwin.c
new file mode 100644
index 000..d98e877
--- /dev/null
+++ b/compat/cygwin.c
@@ -0,0 +1,29 @@
+#include "../git-compat-util.h"
+#include "../cache.h"
+
+int cygwin_offset_1st_component(const char *path)
+{
+   const char *pos = path;
+   /* unc paths */
+   if (is_dir_sep(pos[0]) && is_dir_sep(pos[1])) {
+   /* skip server name */
+   pos = strchr(pos + 2, '/');
+   if (!pos)
+   return 0; /* Error: malformed unc path */
+
+   do {
+   pos++;
+   } while (*pos && pos[0] != '/');
+   }
+   return pos + is_dir_sep(*pos) - path;
+}
+
+#undef access
+int cygwin_access(const char *filename, int mode)
+{
+   /* the execute bit does not work on SAMBA drives */
+   if (filename[0] == '/' && filename[1] == '/' )
+   return access(filename, mode & ~X_OK);
+   else
+   return access(filename, mode);
+}
diff --git a/compat/cygwin.h b/compat/cygwin.h
new file mode 100644
index 000..efa12ad
--- /dev/null
+++ b/compat/cygwin.h
@@ -0,0 +1,7 @@
+int cygwin_access(const char *filename, int mode);
+#undef access
+#define access cygwin_access
+
+
+int cygwin_offset_1st_component(const char *path);
+#define offset_1st_component cygwin_offset_1st_component
diff --git a/config.mak.uname b/config.mak.uname
index adfb90b..551e465 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -184,6 +184,7 @@ ifeq ($(uname_O),Cygwin)
UNRELIABLE_FSTAT = UnfortunatelyYes
SPARSE_FLAGS = -isystem /usr/include/w32api -Wno-one-bit-signed-bitfield
OBJECT_CREATION_USES_RENAMES = UnfortunatelyNeedsTo
+   COMPAT_OBJS += compat/cygwin.o
 endif
 ifeq ($(uname_S),FreeBSD)
NEEDS_LIBICONV = YesPlease
diff --git a/git-compat-util.h b/git-compat-util.h
index 047172d..db9c22d 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -189,6 +189,9 @@
 #include 
 #endif
 
+#if defined(__CYGWIN__)
+#include "compat/cygwin.h"
+#endif
 #if defined(__MINGW32__)
 /* pull in Windows compatibility stuff */
 #include "compat/mingw.h"
diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh
index 444b5a4..7ea2bb5 100755
--- a/t/t0060-path-utils.sh
+++ b/t/t0060-path-utils.sh
@@ -70,6 +70,8 @@ ancestor() {
 case $(uname -s) in
 *MINGW*)
;;
+*CYGWIN*)
+   ;;
 *)
test_set_prereq POSIX
;;
-- 
2.10.0

[PATCH v3 1/1] t0027: tests are not expensive; remove t0025

2017-05-10 Thread tboegi

From: Torsten Bögershausen 

The purpose of t0027 is to test all CRLF related conversions at "git checkout"
and "git add".

Running t0027 under Git for Windows takes 3-4 minutes, so the whole script had
been marked as "EXPENSIVE".

The source code for "Git for Windows" overrides this since 2014:
"t0027 is marked expensive, but really, for MinGW we want to run these
tests always."

Recent "stress" tests show that t0025 is flaky, reported by Lars Schneider,
larsxschnei...@gmail.com

All tests in t0025 are covered by t0027 already, so t0025 can be retired.
t0027 takes less than 14 seconds under Linux, and 63 seconds under Mac OS X,
and this is more or less the same with an SSD or a spinning disk.

Acked-by: Johannes Schindelin 
Signed-off-by: Torsten Bögershausen 
---
 t/t0025-crlf-auto.sh | 181 ---
 t/t0027-auto-crlf.sh |   6 --
 2 files changed, 187 deletions(-)
 delete mode 100755 t/t0025-crlf-auto.sh

diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
deleted file mode 100755
index 89826c5..000
--- a/t/t0025-crlf-auto.sh
+++ /dev/null
@@ -1,181 +0,0 @@
-#!/bin/sh
-
-test_description='CRLF conversion'
-
-. ./test-lib.sh
-
-has_cr() {
-   tr '\015' Q <"$1" | grep Q >/dev/null
-}
-
-test_expect_success setup '
-
-   git config core.autocrlf false &&
-
-   for w in Hello world how are you; do echo $w; done >LFonly &&
-   for w in I am very very fine thank you; do echo ${w}Q; done | q_to_cr 
>CRLFonly &&
-   for w in Oh here is a QNUL byte how alarming; do echo ${w}; done | 
q_to_nul >LFwithNUL &&
-   git add . &&
-
-   git commit -m initial &&
-
-   LFonly=$(git rev-parse HEAD:LFonly) &&
-   CRLFonly=$(git rev-parse HEAD:CRLFonly) &&
-   LFwithNUL=$(git rev-parse HEAD:LFwithNUL) &&
-
-   echo happy.
-'
-
-test_expect_success 'default settings cause no changes' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git read-tree --reset -u HEAD &&
-
-   ! has_cr LFonly &&
-   has_cr CRLFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   LFwithNULdiff=$(git diff LFwithNUL) &&
-   test -z "$LFonlydiff" -a -z "$CRLFonlydiff" -a -z "$LFwithNULdiff"
-'
-
-test_expect_success 'crlf=true causes a CRLF file to be normalized' '
-
-   # Backwards compatibility check
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   echo "CRLFonly crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   # Note, "normalized" means that git will normalize it if added
-   has_cr CRLFonly &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   test -n "$CRLFonlydiff"
-'
-
-test_expect_success 'text=true causes a CRLF file to be normalized' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   echo "CRLFonly text" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   # Note, "normalized" means that git will normalize it if added
-   has_cr CRLFonly &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   test -n "$CRLFonlydiff"
-'
-
-test_expect_success 'eol=crlf gives a normalized file CRLFs with 
autocrlf=false' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf false &&
-   echo "LFonly eol=crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'eol=crlf gives a normalized file CRLFs with 
autocrlf=input' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf input &&
-   echo "LFonly eol=crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'eol=lf gives a normalized file LFs with autocrlf=true' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   echo "LFonly eol=lf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   ! has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'autocrlf=true does not normalize CRLF files' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   has_cr CRLFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   LFwithNULdiff=$(git diff LFwithNUL) &&
-   test -z "$LFonlydiff" -a -z "$CRLFonlydiff" -a -z "$LFwithNULdiff"
-'
-
-test_expect_success 'text=auto, autocrlf=true does not normalize CRLF files' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   echo "* text=auto" > .gitattributes &&
-   gi

[PATCH v2 1/1] t0027: tests are not expensive; remove t0025

2017-05-02 Thread tboegi

From: Torsten Bögershausen 

The purpose of t0027 is to test all CRLF related conversions at
"git checkout" and "git add".

Running t0027 under Git for Windows takes 3-4 minutes, so the whole script
had been marked as "EXPENSIVE".

The source code for "Git for Windows" overrides this since 2014:
"t0027 is marked expensive, but really, for MinGW we want to run these
tests always."

Recent "stress" tests show that t0025 is flaky, reported by Lars Schneider,
larsxschnei...@gmail.com

All tests from t0025 are covered in t0027 already, so t0025 can be
retired.
The execution time for t0027 is 14 seconds under Linux,
and 63 seconds under Mac OS X.
And in case you ask, things are not significantly faster with an SSD
instead of a spinning disk.

Signed-off-by: Torsten Bögershausen 
---
 t/t0025-crlf-auto.sh | 181 ---
 t/t0027-auto-crlf.sh |   6 --
 2 files changed, 187 deletions(-)
 delete mode 100755 t/t0025-crlf-auto.sh

diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
deleted file mode 100755
index 89826c5..000
--- a/t/t0025-crlf-auto.sh
+++ /dev/null
@@ -1,181 +0,0 @@
-#!/bin/sh
-
-test_description='CRLF conversion'
-
-. ./test-lib.sh
-
-has_cr() {
-   tr '\015' Q <"$1" | grep Q >/dev/null
-}
-
-test_expect_success setup '
-
-   git config core.autocrlf false &&
-
-   for w in Hello world how are you; do echo $w; done >LFonly &&
-   for w in I am very very fine thank you; do echo ${w}Q; done | q_to_cr 
>CRLFonly &&
-   for w in Oh here is a QNUL byte how alarming; do echo ${w}; done | 
q_to_nul >LFwithNUL &&
-   git add . &&
-
-   git commit -m initial &&
-
-   LFonly=$(git rev-parse HEAD:LFonly) &&
-   CRLFonly=$(git rev-parse HEAD:CRLFonly) &&
-   LFwithNUL=$(git rev-parse HEAD:LFwithNUL) &&
-
-   echo happy.
-'
-
-test_expect_success 'default settings cause no changes' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git read-tree --reset -u HEAD &&
-
-   ! has_cr LFonly &&
-   has_cr CRLFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   LFwithNULdiff=$(git diff LFwithNUL) &&
-   test -z "$LFonlydiff" -a -z "$CRLFonlydiff" -a -z "$LFwithNULdiff"
-'
-
-test_expect_success 'crlf=true causes a CRLF file to be normalized' '
-
-   # Backwards compatibility check
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   echo "CRLFonly crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   # Note, "normalized" means that git will normalize it if added
-   has_cr CRLFonly &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   test -n "$CRLFonlydiff"
-'
-
-test_expect_success 'text=true causes a CRLF file to be normalized' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   echo "CRLFonly text" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   # Note, "normalized" means that git will normalize it if added
-   has_cr CRLFonly &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   test -n "$CRLFonlydiff"
-'
-
-test_expect_success 'eol=crlf gives a normalized file CRLFs with 
autocrlf=false' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf false &&
-   echo "LFonly eol=crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'eol=crlf gives a normalized file CRLFs with 
autocrlf=input' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf input &&
-   echo "LFonly eol=crlf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'eol=lf gives a normalized file LFs with autocrlf=true' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   echo "LFonly eol=lf" > .gitattributes &&
-   git read-tree --reset -u HEAD &&
-
-   ! has_cr LFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   test -z "$LFonlydiff"
-'
-
-test_expect_success 'autocrlf=true does not normalize CRLF files' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   git read-tree --reset -u HEAD &&
-
-   has_cr LFonly &&
-   has_cr CRLFonly &&
-   LFonlydiff=$(git diff LFonly) &&
-   CRLFonlydiff=$(git diff CRLFonly) &&
-   LFwithNULdiff=$(git diff LFwithNUL) &&
-   test -z "$LFonlydiff" -a -z "$CRLFonlydiff" -a -z "$LFwithNULdiff"
-'
-
-test_expect_success 'text=auto, autocrlf=true does not normalize CRLF files' '
-
-   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   git config core.autocrlf true &&
-   echo "* text=auto" > .gitattr

[PATCH/RFC 1/1] t0027: Some tests are not expensive

2017-04-29 Thread tboegi

From: Torsten Bögershausen 

The purpose of t0027 is to test all CRLF related conversions at
"git checkout" and "git add".

Running t0027 under Git for Windows takes 3-4 minutes, so the whole script
had been marked as "EXPENSIVE".

The source code for "Git for Windows" overrides this since 2014:
"t0027 is marked expensive, but really, for MinGW we want to run these
tests always."

Recent "stress" tests show that t0025 is flaky, reported by Lars Schneider,
larsxschnei...@gmail.com

All tests from t0025 are covered in t0027 as well, so that t0025 can be
retired later.

Split the tests in t0027 into 2 groups: expensive and not expensive.
Expensive are all tests which check the CRLF conversion warnings and
all tests which activate the Git internal "ident" filter.

All other tests are now run on all platforms, which allows removing
the flaky t0025 in the next commit.

The execution time for the non-expensive part is 6..8 seconds under Linux,
and 32 seconds under Mac OS X.

Running the "expensive" version roughly doubles the time.

And in case you ask, things are not significantly faster with an SSD
instead of a spinning disk.

Signed-off-by: Torsten Bögershausen 
PS: The removal of t0025 is not included (yet)
---
 t/t0027-auto-crlf.sh | 100 ++-
 1 file changed, 59 insertions(+), 41 deletions(-)

diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index 90db54c..2c5aff6 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -4,10 +4,12 @@ test_description='CRLF conversion all combinations'
 
 . ./test-lib.sh
 
-if ! test_have_prereq EXPENSIVE
+if ! test_have_prereq EXPENSIVE && ! test_have_prereq MINGW
 then
-   skip_all="EXPENSIVE not set"
-   test_done
+   say "# EXPENSIVE or MINGW not set, skipping ident and warning tests"
+else
+   EXPENSIVE0027=t
+   export EXPENSIVE0027
 fi
 
 compare_files () {
@@ -95,11 +97,14 @@ commit_check_warn () {
git -c core.autocrlf=$crlf add $fname 2>"${pfx}_$f.err"
done &&
git commit -m "core.autocrlf $crlf" &&
-   check_warning "$lfname" ${pfx}_LF.err &&
-   check_warning "$crlfname" ${pfx}_CRLF.err &&
-   check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err &&
-   check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err &&
-   check_warning "$crlfnul" ${pfx}_CRLF_nul.err
+   if test "$EXPENSIVE0027" = t
+   then
+   check_warning "$lfname" ${pfx}_LF.err &&
+   check_warning "$crlfname" ${pfx}_CRLF.err &&
+   check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err &&
+   check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err &&
+   check_warning "$crlfnul" ${pfx}_CRLF_nul.err
+   fi
 }
 
 commit_chk_wrnNNO () {
@@ -122,24 +127,27 @@ commit_chk_wrnNNO () {
git -c core.autocrlf=$crlf add $fname 2>"${pfx}_$f.err"
done
 
-   test_expect_success "commit NNO files crlf=$crlf attr=$attr LF" '
-   check_warning "$lfwarn" ${pfx}_LF.err
-   '
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF" '
-   check_warning "$crlfwarn" ${pfx}_CRLF.err
-   '
-
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF_mix_LF" '
-   check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err
-   '
-
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
LF_mix_cr" '
-   check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err
-   '
-
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF_nul" '
-   check_warning "$crlfnul" ${pfx}_CRLF_nul.err
-   '
+   if test "$EXPENSIVE0027" = t
+   then
+   test_expect_success "commit NNO files crlf=$crlf attr=$attr LF" 
'
+   check_warning "$lfwarn" ${pfx}_LF.err
+   '
+   test_expect_success "commit NNO files attr=$attr aeol=$aeol 
crlf=$crlf CRLF" '
+   check_warning "$crlfwarn" ${pfx}_CRLF.err
+   '
+
+   test_expect_success "commit NNO files attr=$attr aeol=$aeol 
crlf=$crlf CRLF_mix_LF" '
+   check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err
+   '
+
+   test_expect_success "commit NNO files attr=$attr aeol=$aeol 
crlf=$crlf LF_mix_cr" '
+   check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err
+   '
+
+   test_expect_success "commit NNO files attr=$attr aeol=$aeol 
crlf=$crlf CRLF_nul" '
+   check_warning "$crlfnul" ${pfx}_CRLF_nul.err
+   '
+   fi
 }
 
 stats_ascii () {
@@ -250,21 +258,24 @@ checkout_files () {
fi
done
 
-   test_expect_success "ls-files --eol attr=$attr $ident aeol=$aeol 
core.autocrlf=$crlf core.eol=$ceol" '
-   test_when_finished "rm expect actual" &&
-   sort <<-EOF >expect &&
-

[PATCH v2 1/1] Document how to normalize the line endings

2017-04-12 Thread tboegi

From: Torsten Bögershausen 

The instructions on how to normalize the line endings should have been updated
as part of commit 6523728499e 'convert: unify the "auto" handling of CRLF',
(but that part never made it into the commit).

Update the documentation in Documentation/gitattributes.txt
and add a test case in t0025.

Reported by Kristian Adrup
https://github.com/git-for-windows/git/issues/954

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt |  6 ++
 t/t0025-crlf-auto.sh| 26 ++
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 976243a..3b76687 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -227,11 +227,9 @@ From a clean working directory:
 
 -
 $ echo "* text=auto" >.gitattributes
-$ rm .git/index # Remove the index to force Git to
-$ git reset # re-scan the working directory
+$ rm .git/index # Remove the index to re-scan the working directory
+$ git add .
 $ git status# Show files that will be normalized
-$ git add -u
-$ git add .gitattributes
 $ git commit -m "Introduce end-of-line normalization"
 -
 
diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
index d0bee08..89826c5 100755
--- a/t/t0025-crlf-auto.sh
+++ b/t/t0025-crlf-auto.sh
@@ -152,4 +152,30 @@ test_expect_success 'eol=crlf _does_ normalize binary 
files' '
test -z "$LFwithNULdiff"
 '
 
+test_expect_success 'prepare unnormalized' '
+   > .gitattributes &&
+   git config core.autocrlf false &&
+   printf "LINEONE\nLINETWO\r\n" >mixed &&
+   git add mixed .gitattributes &&
+   git commit -m "Add mixed" &&
+   git ls-files --eol | egrep "i/crlf" &&
+   git ls-files --eol | egrep "i/mixed"
+'
+
+test_expect_success 'normalize unnormalized' '
+   echo "* text=auto" >.gitattributes &&
+   rm .git/index &&
+   git add . &&
+   git commit -m "Introduce end-of-line normalization" &&
+   git ls-files --eol | tr "\\t" " " | sort >act &&
+cat >exp <

[PATCH v1 1/1] git diff --quiet exits with 1 on clean tree with CRLF conversions

2017-03-01 Thread tboegi

From: Junio C Hamano 

git diff --quiet may take a short-cut to see if a file is changed
in the working tree:
Whenever the file size differs from what is recorded in the index,
the file is assumed to be changed and git diff --quiet exits
with code 1.

This shortcut must be suppressed whenever the line endings are converted
or a filter is in use.
The attributes say "* text=auto" and a file has
"Hello\nWorld\n" in the index with a length of 12.
The file in the working tree has "Hello\r\nWorld\r\n" with a length of 14.
(Or even "Hello\r\nWorld\n").
In this case "git add" will not make any changes to the index, and
"git diff --quiet" should exit 0.

Add calls to would_convert_to_git() before blindly saying that a different
size means different content.
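
The example above can be checked with a small standalone sketch. It is
not the code from diff.c: crlf_to_lf(), same_after_conversion() and
size_proves_change() are hypothetical helpers, and the would_convert
flag merely stands in for the result of would_convert_to_git().

```c
#include <stddef.h>
#include <string.h>

/* Copy src to dst, turning every CRLF into LF; returns the new length.
 * dst must have room for at least len bytes. */
size_t crlf_to_lf(const char *src, size_t len, char *dst)
{
	size_t i, out = 0;

	for (i = 0; i < len; i++) {
		if (src[i] == '\r' && i + 1 < len && src[i + 1] == '\n')
			continue; /* drop the CR of a CRLF pair */
		dst[out++] = src[i];
	}
	return out;
}

/* Returns 1 if the working tree content equals the index content once
 * CRLF has been converted to LF.  Sketch only: small inputs. */
int same_after_conversion(const char *idx, size_t idx_len,
			  const char *wt, size_t wt_len)
{
	char buf[256];
	size_t n;

	if (wt_len > sizeof(buf))
		return 0;
	n = crlf_to_lf(wt, wt_len, buf);
	return n == idx_len && !memcmp(buf, idx, n);
}

/* Returns 1 if a size difference alone proves that the content differs.
 * would_convert is a stand-in for would_convert_to_git(). */
int size_proves_change(size_t index_size, size_t worktree_size,
		       int would_convert)
{
	return !would_convert && index_size != worktree_size;
}
```

With conversion in play, sizes 12 and 14 prove nothing, and the contents
have to be compared after conversion.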

Reported-By: Mike Crowe 
Signed-off-by: Torsten Bögershausen 
---
This is what I can come up with, collecting all the loose ends.
I'm not sure if Mike wants to have the Reported-By together with a
Signed-off-by?
The other question is whether the commit message summarizes the discussion
well enough.

diff.c| 18 ++
 t/t0028-diff-converted.sh | 27 +++
 2 files changed, 41 insertions(+), 4 deletions(-)
 create mode 100755 t/t0028-diff-converted.sh

diff --git a/diff.c b/diff.c
index 051761b..c264758 100644
--- a/diff.c
+++ b/diff.c
@@ -4921,9 +4921,10 @@ static int diff_filespec_check_stat_unmatch(struct 
diff_filepair *p)
 *differences.
 *
 * 2. At this point, the file is known to be modified,
-*with the same mode and size, and the object
-*name of one side is unknown.  Need to inspect
-*the identical contents.
+*with the same mode and size, the object
+*name of one side is unknown, or size comparison
+*cannot be depended upon.  Need to inspect the
+*contents.
 */
if (!DIFF_FILE_VALID(p->one) || /* (1) */
!DIFF_FILE_VALID(p->two) ||
@@ -4931,7 +4932,16 @@ static int diff_filespec_check_stat_unmatch(struct 
diff_filepair *p)
(p->one->mode != p->two->mode) ||
diff_populate_filespec(p->one, CHECK_SIZE_ONLY) ||
diff_populate_filespec(p->two, CHECK_SIZE_ONLY) ||
-   (p->one->size != p->two->size) ||
+
+   /*
+* only if eol and other conversions are not involved,
+* we can say that two contents of different sizes
+* cannot be the same without checking their contents.
+*/
+   (!would_convert_to_git(p->one->path) &&
+!would_convert_to_git(p->two->path) &&
+(p->one->size != p->two->size)) ||
+
!diff_filespec_is_identical(p->one, p->two)) /* (2) */
p->skip_stat_unmatch_result = 1;
return p->skip_stat_unmatch_result;
diff --git a/t/t0028-diff-converted.sh b/t/t0028-diff-converted.sh
new file mode 100755
index 000..3d5ab95
--- /dev/null
+++ b/t/t0028-diff-converted.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+#
+# Copyright (c) 2017 Mike Crowe
+#
+# These tests ensure that files changing line endings in the presence
+# of .gitattributes to indicate that line endings should be ignored
+# don't cause 'git diff' or 'git diff --quiet' to think that they have
+# been changed.
+
+test_description='git diff with files that require CRLF conversion'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+   echo "* text=auto" >.gitattributes &&
+   printf "Hello\r\nWorld\r\n" >crlf.txt &&
+   git add .gitattributes crlf.txt &&
+   git commit -m "initial"
+'
+
+test_expect_success 'quiet diff works on file with line-ending change that has 
no effect on repository' '
+   printf "Hello\r\nWorld\n" >crlf.txt &&
+   git status &&
+   git diff --quiet
+'
+
+test_done
-- 
2.10.0

[PATCH v2 1/1] convert: git cherry-pick -Xrenormalize did not work

2016-11-30 Thread tboegi

From: Torsten Bögershausen 

Working with a repo that used to be all CRLF. At some point it
was changed to all LF, with `text=auto` in .gitattributes.
Trying to cherry-pick a commit from before the switchover fails:

$ git cherry-pick -Xrenormalize 
fatal: CRLF would be replaced by LF in [path]

Commit 65237284 "unify the "auto" handling of CRLF" introduced
a regression:

Whenever crlf_action is CRLF_TEXT_XXX and not CRLF_AUTO_XXX,
SAFE_CRLF_RENORMALIZE was fed into check_safe_crlf().
This is wrong because there everything other than SAFE_CRLF_WARN is
treated as SAFE_CRLF_FAIL.

Call check_safe_crlf() only if checksafe is SAFE_CRLF_WARN or SAFE_CRLF_FAIL.
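
The difference between the old and the new condition can be sketched in
isolation. The numeric enum values below are assumptions made for this
illustration; what matters is that SAFE_CRLF_FALSE is zero and the other
values are not. runs_check_old() and runs_check_new() are hypothetical
names.

```c
/* Numeric values are assumed for illustration; only "FALSE is zero"
 * is relied upon. */
enum safe_crlf {
	SAFE_CRLF_FALSE = 0,
	SAFE_CRLF_FAIL = 1,
	SAFE_CRLF_WARN = 2,
	SAFE_CRLF_RENORMALIZE = 3
};

/* Old condition: any non-zero checksafe triggers the safety check. */
int runs_check_old(enum safe_crlf checksafe, unsigned long len)
{
	return checksafe && len;
}

/* New condition: only WARN and FAIL trigger the safety check. */
int runs_check_new(enum safe_crlf checksafe, unsigned long len)
{
	return (checksafe == SAFE_CRLF_WARN ||
		checksafe == SAFE_CRLF_FAIL) && len;
}
```

Under these assumptions SAFE_CRLF_RENORMALIZE reaches the check with the
old condition and skips it with the new one, while WARN and FAIL behave
the same in both.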

Reported-by: Eevee (Lexy Munroe) 
Signed-off-by: Torsten Bögershausen 
---
 convert.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/convert.c b/convert.c
index be91358..f8e4dfe 100644
--- a/convert.c
+++ b/convert.c
@@ -281,13 +281,13 @@ static int crlf_to_git(const char *path, const char *src, 
size_t len,
/*
 * If the file in the index has any CR in it, do not convert.
 * This is the new safer autocrlf handling.
+  - unless we want to renormalize in a merge or cherry-pick
 */
-   if (checksafe == SAFE_CRLF_RENORMALIZE)
-   checksafe = SAFE_CRLF_FALSE;
-   else if (has_cr_in_index(path))
+   if ((checksafe != SAFE_CRLF_RENORMALIZE) && 
has_cr_in_index(path))
convert_crlf_into_lf = 0;
}
-   if (checksafe && len) {
+   if ((checksafe == SAFE_CRLF_WARN ||
+   (checksafe == SAFE_CRLF_FAIL)) && len) {
struct text_stat new_stats;
memcpy(&new_stats, &stats, sizeof(new_stats));
/* simulate "git add" */
-- 
2.10.0

[PATCH v1 1/1] convert: git cherry-pick -Xrenormalize did not work

2016-11-29 Thread tboegi

From: Torsten Bögershausen 

Working with a repo that used to be all CRLF. At some point it
was changed to all LF, with `text=auto` in .gitattributes.
Trying to cherry-pick a commit from before the switchover fails:

$ git cherry-pick -Xrenormalize 
fatal: CRLF would be replaced by LF in [path]

Whenever crlf_action is CRLF_TEXT_XXX and not CRLF_AUTO_XXX,
SAFE_CRLF_RENORMALIZE must be turned into SAFE_CRLF_FALSE.

Reported-by: Eevee (Lexy Munroe) 
Signed-off-by: Torsten Bögershausen 
---

Thanks for reporting.
Here is a less invasive patch.
Please let me know, if the patch is OK for you
(email address, does it work..)

 convert.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/convert.c b/convert.c
index be91358..526ec1d 100644
--- a/convert.c
+++ b/convert.c
@@ -286,7 +286,9 @@ static int crlf_to_git(const char *path, const char *src, 
size_t len,
checksafe = SAFE_CRLF_FALSE;
else if (has_cr_in_index(path))
convert_crlf_into_lf = 0;
-   }
+   } else if (checksafe == SAFE_CRLF_RENORMALIZE)
+   checksafe = SAFE_CRLF_FALSE;
+
if (checksafe && len) {
struct text_stat new_stats;
memcpy(&new_stats, &stats, sizeof(new_stats));
-- 
2.10.0

[PATCH/RFC v1 1/1] New way to normalize the line endings

2016-11-27 Thread tboegi

From: Torsten Bögershausen 

Since commit 6523728499e7 'convert: unify the "auto" handling of CRLF'
the normalization instructions in Documentation/gitattributes.txt
no longer work.

Update the documentation and add a test case.

Reported by Kristian Adrup
https://github.com/git-for-windows/git/issues/954

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt |  7 +++
 t/t0025-crlf-auto.sh| 29 +
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 976243a..1f7529a 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -227,11 +227,10 @@ From a clean working directory:
 
 -
 $ echo "* text=auto" >.gitattributes
-$ rm .git/index # Remove the index to force Git to
-$ git reset # re-scan the working directory
+$ git ls-files --eol | egrep "i/(crlf|mixed)" # find not normalized files
+$ rm .git/index # Remove the index to re-scan the working directory
+$ git add .
 $ git status# Show files that will be normalized
-$ git add -u
-$ git add .gitattributes
 $ git commit -m "Introduce end-of-line normalization"
 -
 
diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
index d0bee08..4ad4d02 100755
--- a/t/t0025-crlf-auto.sh
+++ b/t/t0025-crlf-auto.sh
@@ -152,4 +152,33 @@ test_expect_success 'eol=crlf _does_ normalize binary 
files' '
test -z "$LFwithNULdiff"
 '
 
+test_expect_success 'prepare unnormalized' '
+
+   > .gitattributes &&
+   git config core.autocrlf false &&
+   printf "LINEONE\nLINETWO\r\n" >mixed &&
+   git add mixed .gitattributes &&
+   git commit -m "Add mixed" &&
+   git ls-files --eol | egrep "i/crlf" &&
+   git ls-files --eol | egrep "i/mixed"
+
+'
+
+test_expect_success 'normalize unnormalized' '
+   echo "* text=auto" >.gitattributes &&
+   rm .git/index &&
+   git add . &&
+   git commit -m "Introduce end-of-line normalization" &&
+   git ls-files --eol | tr "\\t" " " | sort >act &&
+cat >exp <

[PATCH v2 2/2] convert.c: stream and fast search for binary

2016-10-12 Thread tboegi

From: Torsten Bögershausen 

When statistics are gathered for the autocrlf handling, the search in
the content can be stopped early, e.g. when
- a search for binary is done, and a NUL character is found
- a search for CRLF is done, and the first CRLF is found.

Similarly, when statistics for binary vs non-binary are gathered:
Whenever a lone CR or NUL is found, the search can be aborted.

When checking out files in "auto" mode, any file that has a "lone CR"
or a CRLF will not be converted, so the search can be aborted early.

Add the new bit, CONVERT_STAT_BITS_ANY_CR,
which is set for either lone CR or CRLF.

Many binary files have a NUL very early and it is often not necessary
to load the whole content of a file or blob into memory.

Split gather_stats() into gather_all_stats() and gather_stats_partly()
to do a streaming handling for blobs and files in the worktree.
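
The early exit can be sketched as follows. This is a simplified
illustration, not the function from the patch: scan_eol_bits() and
struct scan_result are hypothetical, the printable/nonprintable counters
are left out, and only the bit values are taken from the diff below.

```c
#include <stddef.h>

#define CONVERT_STAT_BITS_TXT_LF   0x1
#define CONVERT_STAT_BITS_TXT_CRLF 0x2
#define CONVERT_STAT_BITS_BIN      0x4
#define CONVERT_STAT_BITS_ANY_CR   0x8

/* Hypothetical result type: the bits found and the bytes looked at. */
struct scan_result {
	unsigned bits;
	size_t examined;
};

/*
 * Scan buf and collect the stat bits.  Stop as soon as one of the bits
 * in search_only has been seen; search_only == 0 scans everything.
 */
struct scan_result scan_eol_bits(const char *buf, size_t size,
				 unsigned search_only)
{
	struct scan_result res = { 0, 0 };
	size_t i;

	for (i = 0; i < size; i++) {
		unsigned char c = (unsigned char)buf[i];

		if (c == '\r') {
			res.bits |= CONVERT_STAT_BITS_ANY_CR;
			if (i + 1 < size && buf[i + 1] == '\n') {
				res.bits |= CONVERT_STAT_BITS_TXT_CRLF;
				i++; /* the LF belongs to this CRLF */
			} else {
				/* a lone CR is treated as binary */
				res.bits |= CONVERT_STAT_BITS_BIN;
			}
		} else if (c == '\n') {
			res.bits |= CONVERT_STAT_BITS_TXT_LF;
		} else if (c == '\0') {
			res.bits |= CONVERT_STAT_BITS_BIN;
		}
		if (res.bits & search_only) {
			i++; /* count the byte that ended the search */
			break;
		}
	}
	res.examined = i;
	return res;
}
```

One limitation of this sketch: when it is called once per chunk of a
stream, a CR at the very end of one chunk and an LF at the start of the
next are reported as a lone CR, so a real streaming caller has to handle
that boundary.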

Signed-off-by: Torsten Bögershausen 
---
 convert.c | 191 ++
 1 file changed, 129 insertions(+), 62 deletions(-)

diff --git a/convert.c b/convert.c
index 077f5e6..2396fe5 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "streaming.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -13,10 +14,12 @@
  * translation when the "text" attribute or "auto_crlf" option is set.
  */
 
-/* Stat bits: When BIN is set, the txt bits are unset */
 #define CONVERT_STAT_BITS_TXT_LF0x1
 #define CONVERT_STAT_BITS_TXT_CRLF  0x2
 #define CONVERT_STAT_BITS_BIN   0x4
+#define CONVERT_STAT_BITS_ANY_CR0x8
+
+#define STREAM_BUFFER_SIZE (1024*16)
 
 enum crlf_action {
CRLF_UNDEFINED,
@@ -31,30 +34,36 @@ enum crlf_action {
 
 struct text_stat {
/* NUL, CR, LF and CRLF counts */
-   unsigned nul, lonecr, lonelf, crlf;
+   unsigned stat_bits, lonecr, lonelf, crlf;
 
/* These are just approximations! */
unsigned printable, nonprintable;
 };
 
-static void gather_stats(const char *buf, unsigned long size, struct text_stat 
*stats)
+static void gather_stats_partly(const char *buf, unsigned long size,
+   struct text_stat *stats, unsigned search_only)
 {
unsigned long i;
 
-   memset(stats, 0, sizeof(*stats));
-
+   if (!buf || !size)
+   return;
for (i = 0; i < size; i++) {
unsigned char c = buf[i];
if (c == '\r') {
+   stats->stat_bits |= CONVERT_STAT_BITS_ANY_CR;
if (i+1 < size && buf[i+1] == '\n') {
stats->crlf++;
i++;
-   } else
+   stats->stat_bits |= CONVERT_STAT_BITS_TXT_CRLF;
+   } else {
stats->lonecr++;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
+   }
continue;
}
if (c == '\n') {
stats->lonelf++;
+   stats->stat_bits |= CONVERT_STAT_BITS_TXT_LF;
continue;
}
if (c == 127)
@@ -67,7 +76,7 @@ static void gather_stats(const char *buf, unsigned long size, 
struct text_stat *
stats->printable++;
break;
case 0:
-   stats->nul++;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
/* fall through */
default:
stats->nonprintable++;
@@ -75,6 +84,8 @@ static void gather_stats(const char *buf, unsigned long size, 
struct text_stat *
}
else
stats->printable++;
+   if (stats->stat_bits & search_only)
+   break; /* We found what we have been searching for */
}
 
/* If file ends with EOF then don't count this EOF as non-printable. */
@@ -86,41 +97,62 @@ static void gather_stats(const char *buf, unsigned long 
size, struct text_stat *
  * The same heuristics as diff.c::mmfile_is_binary()
  * We treat files with bare CR as binary
  */
-static int convert_is_binary(unsigned long size, const struct text_stat *stats)
+static void convert_nonprintable(struct text_stat *stats)
 {
-   if (stats->lonecr)
-   return 1;
-   if (stats->nul)
-   return 1;
if ((stats->printable >> 7) < stats->nonprintable)
-   return 1;
-   return 0;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
 }
 
-static unsigned int gather_convert_stats(const char *data, unsigned long size)
+static void gather_all_stats(const char *buf, unsigned long size,
+struct text_stat *stats, unsigned sear

[PATCH v2 1/2] read-cache: factor out get_sha1_from_index() helper

2016-10-12 Thread tboegi

From: Torsten Bögershausen 

Factor out the retrieval of the sha1 for a given path in
read_blob_data_from_index() into the function get_sha1_from_index().

This will be used in the next commit, when convert.c can do the
analysis for "text=auto" without slurping the whole blob into memory
at once.

Add a wrapper definition get_sha1_from_cache().

Signed-off-by: Torsten Bögershausen 
---
 cache.h  |  3 +++
 read-cache.c | 29 ++---
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/cache.h b/cache.h
index 1604e29..04de209 100644
--- a/cache.h
+++ b/cache.h
@@ -380,6 +380,7 @@ extern void free_name_hash(struct index_state *istate);
 #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
 #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
 #define read_blob_data_from_cache(path, sz) 
read_blob_data_from_index(&the_index, (path), (sz))
+#define get_sha1_from_cache(path)  get_sha1_from_index (&the_index, (path))
 #endif
 
 enum object_type {
@@ -1089,6 +1090,8 @@ static inline void *read_sha1_file(const unsigned char 
*sha1, enum object_type *
return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
 }
 
+const unsigned char *get_sha1_from_index(struct index_state *istate, const 
char *path);
+
 /*
  * This internal function is only declared here for the benefit of
  * lookup_replace_object().  Please do not call it directly.
diff --git a/read-cache.c b/read-cache.c
index 38d67fa..5a1df14 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2290,13 +2290,27 @@ int index_name_is_other(const struct index_state 
*istate, const char *name,
 
 void *read_blob_data_from_index(struct index_state *istate, const char *path, 
unsigned long *size)
 {
-   int pos, len;
+   const unsigned char *sha1;
unsigned long sz;
enum object_type type;
void *data;
 
-   len = strlen(path);
-   pos = index_name_pos(istate, path, len);
+   sha1 = get_sha1_from_index(istate, path);
+   if (!sha1)
+   return NULL;
+   data = read_sha1_file(sha1, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return NULL;
+   }
+   if (size)
+   *size = sz;
+   return data;
+}
+
+const unsigned char *get_sha1_from_index(struct index_state *istate, const 
char *path)
+{
+   int pos = index_name_pos(istate, path, strlen(path));
if (pos < 0) {
/*
 * We might be in the middle of a merge, in which
@@ -2312,14 +2326,7 @@ void *read_blob_data_from_index(struct index_state 
*istate, const char *path, un
}
if (pos < 0)
return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   if (size)
-   *size = sz;
-   return data;
+   return istate->cache[pos]->oid.hash;
 }
 
 void stat_validity_clear(struct stat_validity *sv)
-- 
2.10.0

[PATCH v2 0/2] Stream and fast search

2016-10-12 Thread tboegi

From: Torsten Bögershausen 

Changes since v1:
- Rename earlyout into search_only
- Increase buffer from 2KiB to 16KiB
- s/mask/eol_bits/
- Reduce the "noise"
- Document "split gather_stats() into gather_all_stats()/gather_stats_partly()"

Torsten Bögershausen (2):
  read-cache: factor out get_sha1_from_index() helper
  convert.c: stream and fast search for binary

 cache.h  |   3 +
 convert.c| 191 ---
 read-cache.c |  29 +
 3 files changed, 150 insertions(+), 73 deletions(-)

-- 
2.10.0

[PATCH v1 1/2] read-cache: factor out get_sha1_from_index() helper

2016-10-09 Thread tboegi

From: Torsten Bögershausen 

Factor out the retrieval of the sha1 for a given path in
read_blob_data_from_index() into the function get_sha1_from_index().

This will be used in the next commit, when convert.c can do the
analysis for "text=auto" without slurping the whole blob into memory
at once.

Add a wrapper definition get_sha1_from_cache().
---
 cache.h  |  3 +++
 read-cache.c | 29 ++---
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/cache.h b/cache.h
index 1604e29..04de209 100644
--- a/cache.h
+++ b/cache.h
@@ -380,6 +380,7 @@ extern void free_name_hash(struct index_state *istate);
 #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
 #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
 #define read_blob_data_from_cache(path, sz) 
read_blob_data_from_index(&the_index, (path), (sz))
+#define get_sha1_from_cache(path)  get_sha1_from_index (&the_index, (path))
 #endif
 
 enum object_type {
@@ -1089,6 +1090,8 @@ static inline void *read_sha1_file(const unsigned char 
*sha1, enum object_type *
return read_sha1_file_extended(sha1, type, size, LOOKUP_REPLACE_OBJECT);
 }
 
+const unsigned char *get_sha1_from_index(struct index_state *istate, const 
char *path);
+
 /*
  * This internal function is only declared here for the benefit of
  * lookup_replace_object().  Please do not call it directly.
diff --git a/read-cache.c b/read-cache.c
index 38d67fa..5a1df14 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2290,13 +2290,27 @@ int index_name_is_other(const struct index_state 
*istate, const char *name,
 
 void *read_blob_data_from_index(struct index_state *istate, const char *path, 
unsigned long *size)
 {
-   int pos, len;
+   const unsigned char *sha1;
unsigned long sz;
enum object_type type;
void *data;
 
-   len = strlen(path);
-   pos = index_name_pos(istate, path, len);
+   sha1 = get_sha1_from_index(istate, path);
+   if (!sha1)
+   return NULL;
+   data = read_sha1_file(sha1, &type, &sz);
+   if (!data || type != OBJ_BLOB) {
+   free(data);
+   return NULL;
+   }
+   if (size)
+   *size = sz;
+   return data;
+}
+
+const unsigned char *get_sha1_from_index(struct index_state *istate, const 
char *path)
+{
+   int pos = index_name_pos(istate, path, strlen(path));
if (pos < 0) {
/*
 * We might be in the middle of a merge, in which
@@ -2312,14 +2326,7 @@ void *read_blob_data_from_index(struct index_state 
*istate, const char *path, un
}
if (pos < 0)
return NULL;
-   data = read_sha1_file(istate->cache[pos]->oid.hash, &type, &sz);
-   if (!data || type != OBJ_BLOB) {
-   free(data);
-   return NULL;
-   }
-   if (size)
-   *size = sz;
-   return data;
+   return istate->cache[pos]->oid.hash;
 }
 
 void stat_validity_clear(struct stat_validity *sv)
-- 
2.10.0

[PATCH v1 0/2] convert: stream and early out

2016-10-09 Thread tboegi

From: Torsten Bögershausen 

An optimization for the binary/text detection that runs when autocrlf
is used, or when git ls-files --eol analyzes the content of files or blobs.

Torsten Bögershausen (2):
  read-cache: factor out get_sha1_from_index() helper
  convert.c: stream and early out

 cache.h  |   3 +
 convert.c| 195 +++
 read-cache.c |  29 +
 3 files changed, 151 insertions(+), 76 deletions(-)

-- 
2.10.0

[PATCH v1 2/2] convert.c: stream and early out

2016-10-09 Thread tboegi

From: Torsten Bögershausen 

When statistics are gathered for the autocrlf handling, the search in
the content can be stopped early, e.g. when
- a search for binary is done, and a NUL character is found
- a search for CRLF is done, and the first CRLF is found.

Similarly, when statistics for binary vs non-binary are gathered:
whenever a lone CR or NUL is found, the search can be aborted.

When checking out files in "auto" mode, any file that has a "lone CR"
or a CRLF will not be converted, so the search can be aborted early.

Add the new bit, CONVERT_STAT_BITS_ANY_CR,
which is set for either lone CR or CRLF.

Many binary files have a NUL very early (within the first few bytes,
at the latest within the first 1..2K).
It is often not necessary to load the whole content of a file or blob
into memory.

Use streaming for blobs and files in the worktree.
---
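
The early-out idea from the commit message can be sketched in isolation
as below. This is only an illustration under stated assumptions, not the
code from the patch: the function name scan_early_out(), the STAT_* flag
names and the "scanned" out-parameter are hypothetical; only the flag
values mirror the CONVERT_STAT_BITS_* defines. It handles a single
buffer, so a CR that is the last byte of the buffer is counted as a
lone CR.

```c
#include <stddef.h>

/* Hypothetical flags; the values mirror CONVERT_STAT_BITS_* in the patch. */
#define STAT_TXT_LF   0x1u
#define STAT_TXT_CRLF 0x2u
#define STAT_BIN      0x4u
#define STAT_ANY_CR   0x8u

/*
 * Scan buf and return the accumulated flag bits.
 * The scan stops as soon as any bit in stop_mask has been set;
 * *scanned receives the number of bytes consumed up to that point.
 */
static unsigned scan_early_out(const char *buf, size_t len,
                               unsigned stop_mask, size_t *scanned)
{
	unsigned bits = 0;
	size_t i = 0;

	while (i < len) {
		unsigned char c = (unsigned char)buf[i++];

		if (c == '\r') {
			bits |= STAT_ANY_CR;
			if (i < len && buf[i] == '\n') {
				bits |= STAT_TXT_CRLF;
				i++; /* consume the LF of the CRLF pair */
			} else {
				bits |= STAT_BIN; /* a lone CR is treated as binary */
			}
		} else if (c == '\n') {
			bits |= STAT_TXT_LF;
		} else if (c == '\0') {
			bits |= STAT_BIN;
		}
		if (bits & stop_mask)
			break; /* found what the caller was searching for */
	}
	*scanned = i;
	return bits;
}
```

A caller that only wants to know whether content is binary would pass
STAT_BIN as stop_mask and would typically stop after a few bytes of a
binary file; passing 0 scans the whole buffer.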
 convert.c | 195 +-
 1 file changed, 130 insertions(+), 65 deletions(-)

diff --git a/convert.c b/convert.c
index 077f5e6..6a625e5 100644
--- a/convert.c
+++ b/convert.c
@@ -3,6 +3,7 @@
 #include "run-command.h"
 #include "quote.h"
 #include "sigchain.h"
+#include "streaming.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -13,10 +14,10 @@
  * translation when the "text" attribute or "auto_crlf" option is set.
  */
 
-/* Stat bits: When BIN is set, the txt bits are unset */
 #define CONVERT_STAT_BITS_TXT_LF0x1
 #define CONVERT_STAT_BITS_TXT_CRLF  0x2
 #define CONVERT_STAT_BITS_BIN   0x4
+#define CONVERT_STAT_BITS_ANY_CR0x8
 
 enum crlf_action {
CRLF_UNDEFINED,
@@ -31,30 +32,36 @@ enum crlf_action {
 
 struct text_stat {
/* NUL, CR, LF and CRLF counts */
-   unsigned nul, lonecr, lonelf, crlf;
+   unsigned stat_bits, lonecr, lonelf, crlf;
 
/* These are just approximations! */
unsigned printable, nonprintable;
 };
 
-static void gather_stats(const char *buf, unsigned long size, struct text_stat 
*stats)
+static void gather_stats_partly(const char *buf, unsigned long len,
+   struct text_stat *stats, unsigned earlyout)
 {
unsigned long i;
 
-   memset(stats, 0, sizeof(*stats));
-
-   for (i = 0; i < size; i++) {
+   if (!buf || !len)
+   return;
+   for (i = 0; i < len; i++) {
unsigned char c = buf[i];
if (c == '\r') {
-   if (i+1 < size && buf[i+1] == '\n') {
+   stats->stat_bits |= CONVERT_STAT_BITS_ANY_CR;
+   if (i+1 < len && buf[i+1] == '\n') {
stats->crlf++;
i++;
-   } else
+   stats->stat_bits |= CONVERT_STAT_BITS_TXT_CRLF;
+   } else {
stats->lonecr++;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
+   }
continue;
}
if (c == '\n') {
stats->lonelf++;
+   stats->stat_bits |= CONVERT_STAT_BITS_TXT_LF;
continue;
}
if (c == 127)
@@ -67,7 +74,7 @@ static void gather_stats(const char *buf, unsigned long size, 
struct text_stat *
stats->printable++;
break;
case 0:
-   stats->nul++;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
/* fall through */
default:
stats->nonprintable++;
@@ -75,10 +82,12 @@ static void gather_stats(const char *buf, unsigned long 
size, struct text_stat *
}
else
stats->printable++;
+   if (stats->stat_bits & earlyout)
+   break; /* We found what we have been searching for */
}
 
/* If file ends with EOF then don't count this EOF as non-printable. */
-   if (size >= 1 && buf[size-1] == '\032')
+   if (len >= 1 && buf[len-1] == '\032')
stats->nonprintable--;
 }
 
@@ -86,41 +95,62 @@ static void gather_stats(const char *buf, unsigned long 
size, struct text_stat *
  * The same heuristics as diff.c::mmfile_is_binary()
  * We treat files with bare CR as binary
  */
-static int convert_is_binary(unsigned long size, const struct text_stat *stats)
+static void convert_nonprintable(struct text_stat *stats)
 {
-   if (stats->lonecr)
-   return 1;
-   if (stats->nul)
-   return 1;
if ((stats->printable >> 7) < stats->nonprintable)
-   return 1;
-   return 0;
+   stats->stat_bits |= CONVERT_STAT_BITS_BIN;
 }
 
-static unsigned int gather_convert_stats(const char *data, unsigned long s

[PATCH v2 0/2] Adjust the documentation to the unified "auto" handling

2016-08-26 Thread tboegi

From: Torsten Bögershausen 

Changes since v1:
- 1/2 is left unchanged
- 2/2 is rewritten and should read more consistently.

Torsten Bögershausen (2):
  git ls-files: text=auto eol=lf is supported in Git 2.10
  gitattributes: Document the unified "auto" handling

 Documentation/git-ls-files.txt  |  3 +--
 Documentation/gitattributes.txt | 58 +
 2 files changed, 25 insertions(+), 36 deletions(-)

-- 
2.9.0.243.g5c589a7

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/2] gitattributes: Document the unified "auto" handling

2016-08-26 Thread tboegi

From: Torsten Bögershausen 

Update the documentation about text=auto:
text=auto now follows the core.autocrlf handling when files are not
normalized in the repository.

For a cross-platform project, recommend the use of attributes for
line-ending conversions.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt | 58 +
 1 file changed, 24 insertions(+), 34 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 807577a..7aff940 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -182,23 +182,6 @@ While Git normally leaves file contents alone, it can be 
configured to
 normalize line endings to LF in the repository and, optionally, to
 convert them to CRLF when files are checked out.
 
-Here is an example that will make Git normalize .txt, .vcproj and .sh
-files, ensure that .vcproj files have CRLF and .sh files have LF in
-the working directory, and prevent .jpg files from being normalized
-regardless of their content.
-
-
-*   text=auto
-*.txt  text
-*.vcproj   text eol=crlf
-*.sh   text eol=lf
-*.jpg  -text
-
-
-Other source code management systems normalize all text files in their
-repositories, and there are two ways to enable similar automatic
-normalization in Git.
-
 If you simply want to have CRLF line endings in your working directory
 regardless of the repository you are working with, you can set the
 config variable "core.autocrlf" without using any attributes.
@@ -208,35 +191,42 @@ config variable "core.autocrlf" without using any 
attributes.
autocrlf = true
 
 
-This does not force normalization of all text files, but does ensure
+This does not force normalization of text files, but does ensure
 that text files that you introduce to the repository have their line
 endings normalized to LF when they are added, and that files that are
 already normalized in the repository stay normalized.
 
-If you want to interoperate with a source code management system that
-enforces end-of-line normalization, or you simply want all text files
-in your repository to be normalized, you should instead set the `text`
-attribute to "auto" for _all_ files.
+If you want to ensure that text files that any contributor introduces to
+the repository have their line endings normalized, you can set the
+`text` attribute to "auto" for _all_ files.
 
 
 *  text=auto
 
 
-This ensures that all files that Git considers to be text will have
-normalized (LF) line endings in the repository.  The `core.eol`
-configuration variable controls which line endings Git will use for
-normalized files in your working directory; the default is to use the
-native line ending for your platform, or CRLF if `core.autocrlf` is
-set.
+The attributes allow a fine-grained control, how the line endings
+are converted.
+Here is an example that will make Git normalize .txt, .vcproj and .sh
+files, ensure that .vcproj files have CRLF and .sh files have LF in
+the working directory, and prevent .jpg files from being normalized
+regardless of their content.
+
+
+*   text=auto
+*.txt  text
+*.vcproj   text eol=crlf
+*.sh   text eol=lf
+*.jpg  -text
+
+
+NOTE: When `text=auto` conversion is enabled in a cross-platform
+project using push and pull to a central repository the text files
+containing CRLFs should be normalized.
 
-NOTE: When `text=auto` normalization is enabled in an existing
-repository, any text files containing CRLFs should be normalized.  If
-they are not they will be normalized the next time someone tries to
-change them, causing unfortunate misattribution.  From a clean working
-directory:
+From a clean working directory:
 
 -
-$ echo "* text=auto" >>.gitattributes
+$ echo "* text=auto" >.gitattributes
 $ rm .git/index # Remove the index to force Git to
 $ git reset # re-scan the working directory
 $ git status# Show files that will be normalized
-- 
2.9.0.243.g5c589a7


[PATCH v2 1/2] git ls-files: text=auto eol=lf is supported in Git 2.10

2016-08-26 Thread tboegi

From: Torsten Bögershausen 

The man page for `git ls-files --eol` describes the combinations
of text attributes "text=auto eol=lf" and "text=auto eol=crlf" as not
supported yet, with a note that this may change in the future.
They are supported now.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/git-ls-files.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
index 078b556..0d933ac 100644
--- a/Documentation/git-ls-files.txt
+++ b/Documentation/git-ls-files.txt
@@ -159,8 +159,7 @@ not accessible in the working tree.
 +
  is the attribute that is used when checking out or committing,
 it is either "", "-text", "text", "text=auto", "text eol=lf", "text eol=crlf".
-Note: Currently Git does not support "text=auto eol=lf" or "text=auto 
eol=crlf",
-that may change in the future.
+Since Git 2.10 "text=auto eol=lf" and "text=auto eol=crlf" are supported.
 +
 Both the  in the index ("i/")
 and in the working tree ("w/") are shown for regular files,
-- 
2.9.0.243.g5c589a7


[PATCH v1 0/3] Update eol documentation

2016-08-25 Thread tboegi

From: Torsten Bögershausen 

Sorry for posting this so late:
While reviewing another patch I realized that the eol related
documentation was not updated as it should be.

Torsten Bögershausen (2):
  git ls-files: text=auto eol=lf is supported in Git 2.10
  gitattributes: Document the unified "auto" handling

 Documentation/git-ls-files.txt  |  3 +--
 Documentation/gitattributes.txt | 24 
 2 files changed, 17 insertions(+), 10 deletions(-)

-- 
2.9.3.599.g2376d31.dirty


[PATCH v1 1/2] git ls-files: text=auto eol=lf is supported in Git 2.10

2016-08-25 Thread tboegi

From: Torsten Bögershausen 

The man page for `git ls-files --eol` describes the combinations
of text attributes "text=auto eol=lf" and "text=auto eol=crlf" as not
supported yet, with a note that this may change in the future.
They are supported now.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/git-ls-files.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
index 078b556..0d933ac 100644
--- a/Documentation/git-ls-files.txt
+++ b/Documentation/git-ls-files.txt
@@ -159,8 +159,7 @@ not accessible in the working tree.
 +
  is the attribute that is used when checking out or committing,
 it is either "", "-text", "text", "text=auto", "text eol=lf", "text eol=crlf".
-Note: Currently Git does not support "text=auto eol=lf" or "text=auto 
eol=crlf",
-that may change in the future.
+Since Git 2.10 "text=auto eol=lf" and "text=auto eol=crlf" are supported.
 +
 Both the  in the index ("i/")
 and in the working tree ("w/") are shown for regular files,
-- 
2.9.3.599.g2376d31.dirty


[PATCH v1 2/2] gitattributes: Document the unified "auto" handling

2016-08-25 Thread tboegi

From: Torsten Bögershausen 

Update the documentation about text=auto:
text=auto now follows the core.autocrlf handling when files are not
normalized in the repository.

For a cross-platform project, recommend the use of attributes for
line-ending conversions.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 807577a..4012661 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -213,27 +213,35 @@ that text files that you introduce to the repository have 
their line
 endings normalized to LF when they are added, and that files that are
 already normalized in the repository stay normalized.
 
+If you want to ensure that text files that any contributor introduces to
+the repository have their line endings normalized, you could set the
+`text` attribute to "auto" for _all_ files.
+
+
+*  text=auto
+
+
 If you want to interoperate with a source code management system that
 enforces end-of-line normalization, or you simply want all text files
 in your repository to be normalized, you should instead set the `text`
-attribute to "auto" for _all_ files.
+attribute to "text" for text files.
 
 
-*  text=auto
+*.txt  text
 
 
-This ensures that all files that Git considers to be text will have
+This ensures that all files marked as text will have
 normalized (LF) line endings in the repository.  The `core.eol`
 configuration variable controls which line endings Git will use for
 normalized files in your working directory; the default is to use the
 native line ending for your platform, or CRLF if `core.autocrlf` is
 set.
 
-NOTE: When `text=auto` normalization is enabled in an existing
-repository, any text files containing CRLFs should be normalized.  If
-they are not they will be normalized the next time someone tries to
-change them, causing unfortunate misattribution.  From a clean working
-directory:
+NOTE: When you have a cross-platform project using push and pull
+to a central repository the text files containing CRLFs should be
+normalized. All text files should have a text attribute, either
+`text` or `text=auto`.
+From a clean working directory:
 
 -
 $ echo "* text=auto" >>.gitattributes
-- 
2.9.3.599.g2376d31.dirty


[PATCH v1 0/1] Rename NotNormalized (NNO) into CRLF in index

2016-08-19 Thread tboegi

From: Torsten Bögershausen 

Here comes the promised cleanup of t0027:
- The wording NNO is removed and replaced by CRI
- No code changes
- Needs to go on top of next or pu or tb/t0027-raciness-fix

Torsten Bögershausen (1):
  t0027: Rename NotNormalized (NNO) into CRLF in index

 t/t0027-auto-crlf.sh | 122 +--
 1 file changed, 61 insertions(+), 61 deletions(-)

-- 
2.9.3.599.g2376d31.dirty


[PATCH v1 1/1] t0027: Rename NotNormalized (NNO) into CRLF in index

2016-08-19 Thread tboegi

From: Torsten Bögershausen 

Originally, NNO stood for content that had been committed
"Not NOrmalized", in other words files with CRLF in the index.

Make it clearer what is tested:
- Commit a file with CRLF into the index
- Change the content in the working tree
- Run "git add" and check for the conversion warnings
- Repeat for different content (text, LF, CRLF, mixed) and
  binary (LF and lone CR, CRLF with NUL)

Rename commit_chk_wrnNNO() into CRI_add_chk_wrn() and rename NNO into CRI.

Integrate create_NNO_files() into 'setup master'.

Signed-off-by: Torsten Bögershausen 
---
 t/t0027-auto-crlf.sh | 122 +--
 1 file changed, 61 insertions(+), 61 deletions(-)

diff --git a/t/t0027-auto-crlf.sh b/t/t0027-auto-crlf.sh
index 90db54c..bfcf14b 100755
--- a/t/t0027-auto-crlf.sh
+++ b/t/t0027-auto-crlf.sh
@@ -49,24 +49,6 @@ create_gitattributes () {
} >.gitattributes
 }
 
-create_NNO_files () {
-   for crlf in false true input
-   do
-   for attr in "" auto text -text
-   do
-   for aeol in "" lf crlf
-   do
-   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf}
-   cp CRLF_mix_LF ${pfx}_LF.txt &&
-   cp CRLF_mix_LF ${pfx}_CRLF.txt &&
-   cp CRLF_mix_LF ${pfx}_CRLF_mix_LF.txt &&
-   cp CRLF_mix_LF ${pfx}_LF_mix_CR.txt &&
-   cp CRLF_mix_LF ${pfx}_CRLF_nul.txt
-   done
-   done
-   done
-}
-
 check_warning () {
case "$1" in
LF_CRLF) echo "warning: LF will be replaced by CRLF" >"$2".expect ;;
@@ -102,7 +84,7 @@ commit_check_warn () {
check_warning "$crlfnul" ${pfx}_CRLF_nul.err
 }
 
-commit_chk_wrnNNO () {
+CRI_add_chk_wrn () {
attr=$1 ; shift
aeol=$1 ; shift
crlf=$1 ; shift
@@ -111,7 +93,7 @@ commit_chk_wrnNNO () {
lfmixcrlf=$1 ; shift
lfmixcr=$1 ; shift
crlfnul=$1 ; shift
-   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf}
+   pfx=CRI_attr_${attr}_aeol_${aeol}_${crlf}
#Commit files on top of existing file
create_gitattributes "$attr" $aeol &&
for f in LF CRLF CRLF_mix_LF LF_mix_CR CRLF_nul
@@ -122,22 +104,22 @@ commit_chk_wrnNNO () {
git -c core.autocrlf=$crlf add $fname 2>"${pfx}_$f.err"
done
 
-   test_expect_success "commit NNO files crlf=$crlf attr=$attr LF" '
+   test_expect_success "CRLF in index add file crlf=$crlf attr=$attr LF" '
check_warning "$lfwarn" ${pfx}_LF.err
'
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF" '
+   test_expect_success "CRLF in index add file attr=$attr aeol=$aeol 
crlf=$crlf CRLF" '
check_warning "$crlfwarn" ${pfx}_CRLF.err
'
 
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF_mix_LF" '
+   test_expect_success "CRLF in index add file attr=$attr aeol=$aeol 
crlf=$crlf CRLF_mix_LF" '
check_warning "$lfmixcrlf" ${pfx}_CRLF_mix_LF.err
'
 
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
LF_mix_cr" '
+   test_expect_success "CRLF in index add file attr=$attr aeol=$aeol 
crlf=$crlf LF_mix_cr" '
check_warning "$lfmixcr" ${pfx}_LF_mix_CR.err
'
 
-   test_expect_success "commit NNO files attr=$attr aeol=$aeol crlf=$crlf 
CRLF_nul" '
+   test_expect_success "CRLF in index add file attr=$attr aeol=$aeol 
crlf=$crlf CRLF_nul" '
check_warning "$crlfnul" ${pfx}_CRLF_nul.err
'
 }
@@ -199,7 +181,7 @@ check_files_in_repo () {
compare_files $crlfnul ${pfx}CRLF_nul.txt
 }
 
-check_in_repo_NNO () {
+check_in_repo_CRI () {
attr=$1 ; shift
aeol=$1 ; shift
crlf=$1 ; shift
@@ -208,7 +190,7 @@ check_in_repo_NNO () {
lfmixcrlf=$1 ; shift
lfmixcr=$1 ; shift
crlfnul=$1 ; shift
-   pfx=NNO_attr_${attr}_aeol_${aeol}_${crlf}
+   pfx=CRI_attr_${attr}_aeol_${aeol}_${crlf}
test_expect_success "compare_files $lfname ${pfx}_LF.txt" '
compare_files $lfname ${pfx}_LF.txt
'
@@ -329,8 +311,22 @@ test_expect_success 'setup master' '
printf "\$Id:  
\$\r\nLINEONE\r\nLINETWO\rLINETHREE"   >CRLF_mix_CR &&
printf "\$Id:  
\$\r\nLINEONEQ\r\nLINETWO\r\nLINETHREE" | q_to_nul >CRLF_nul &&
printf "\$Id:  
\$\nLINEONEQ\nLINETWO\nLINETHREE" | q_to_nul >LF_nul &&
-   create_NNO_files CRLF_mix_LF CRLF_mix_LF CRLF_mix_LF CRLF_mix_LF 
CRLF_mix_LF &&
-   git -c core.autocrlf=false add NNO_*.txt &&
+   for crlf in false true input
+   do
+   for attr in "" auto text -text
+
