[PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-24 Thread Nguyễn Thái Ngọc Duy
Too large files may lead to failure to allocate memory. If it happens
here, it could impact quite a few commands that involve
diff. Moreover, too large files are inefficient to compare anyway (and
most likely non-text), so mark them binary and skip looking at their
content.

Noticed-by: Dale R. Worley 
Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Documentation/config.txt|  3 ++-
 Documentation/gitattributes.txt |  4 ++--
 diff.c  | 26 ++
 diffcore.h  |  1 +
 t/t1050-large.sh|  4 
 5 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9f467d3..a865850 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -499,7 +499,8 @@ core.bigFileThreshold::
Files larger than this size are stored deflated, without
attempting delta compression.  Storing large files without
delta compression avoids excessive memory usage, at the
-   slight expense of increased disk usage.
+   slight expense of increased disk usage. Additionally files
+   larger than this size are allways treated as binary.
 +
 Default is 512 MiB on all platforms.  This should be reasonable
 for most projects as source code and other text files can still
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 643c1ba..9b45bda 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -440,8 +440,8 @@ Unspecified::
 
A path to which the `diff` attribute is unspecified
first gets its contents inspected, and if it looks like
-   text, it is treated as text.  Otherwise it would
-   generate `Binary files differ`.
+   text and is smaller than core.bigFileThreshold, it is treated
+   as text. Otherwise it would generate `Binary files differ`.
 
 String::
 
diff --git a/diff.c b/diff.c
index a489540..7a977aa 100644
--- a/diff.c
+++ b/diff.c
@@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
one->is_binary = one->driver->binary;
else {
if (!one->data && DIFF_FILE_VALID(one))
-   diff_populate_filespec(one, 0);
-   if (one->data)
+   diff_populate_filespec(one, 
DIFF_POPULATE_IS_BINARY);
+   if (one->is_binary == -1 && one->data)
one->is_binary = buffer_is_binary(one->data,
one->size);
if (one->is_binary == -1)
@@ -2721,6 +2721,11 @@ int diff_populate_filespec(struct diff_filespec *s, 
unsigned int flags)
}
if (size_only)
return 0;
+   if ((flags & DIFF_POPULATE_IS_BINARY) &&
+   s->size > big_file_threshold && s->is_binary == -1) {
+   s->is_binary = 1;
+   return 0;
+   }
fd = open(s->path, O_RDONLY);
if (fd < 0)
goto err_empty;
@@ -2742,16 +2747,21 @@ int diff_populate_filespec(struct diff_filespec *s, 
unsigned int flags)
}
else {
enum object_type type;
-   if (size_only) {
+   if (size_only || (flags & DIFF_POPULATE_IS_BINARY)) {
type = sha1_object_info(s->sha1, &s->size);
if (type < 0)
die("unable to read %s", sha1_to_hex(s->sha1));
-   } else {
-   s->data = read_sha1_file(s->sha1, &type, &s->size);
-   if (!s->data)
-   die("unable to read %s", sha1_to_hex(s->sha1));
-   s->should_free = 1;
+   if (size_only)
+   return 0;
+   if (s->size > big_file_threshold && s->is_binary == -1) 
{
+   s->is_binary = 1;
+   return 0;
+   }
}
+   s->data = read_sha1_file(s->sha1, &type, &s->size);
+   if (!s->data)
+   die("unable to read %s", sha1_to_hex(s->sha1));
+   s->should_free = 1;
}
return 0;
 }
diff --git a/diffcore.h b/diffcore.h
index a186d7c..e7760d9 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -56,6 +56,7 @@ extern void fill_filespec(struct diff_filespec *, const 
unsigned char *,
  int, unsigned short);
 
 #define DIFF_POPULATE_SIZE_ONLY 1
+#define DIFF_POPULATE_IS_BINARY 2
 extern int diff_populate_filespec(struct diff_filespec *, unsigned int);
 extern void diff_free_filespec_data(struct diff_filespec *);
 extern void diff_free_filespec_blob(struct diff_filespec *);
diff --git a/t/t1050-large.sh b/t/t105

Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-26 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> Too large files may lead to failure to allocate memory. If it happens
> here, it could impact quite a few commands that involve
> diff. Moreover, too large files are inefficient to compare anyway (and
> most likely non-text), so mark them binary and skip looking at their
> content.
> ...
> diff --git a/diff.c b/diff.c
> index a489540..7a977aa 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>   one->is_binary = one->driver->binary;
>   else {
>   if (!one->data && DIFF_FILE_VALID(one))
> - diff_populate_filespec(one, 0);
> - if (one->data)
> + diff_populate_filespec(one, 
> DIFF_POPULATE_IS_BINARY);
> + if (one->is_binary == -1 && one->data)
>   one->is_binary = buffer_is_binary(one->data,
>   one->size);
>   if (one->is_binary == -1)

The name is misleading and forced me to read it twice before I
realized that this is "populating the is-binary bit".  It might make
it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
the other bit may want to be also renamed from SIZE_ONLY to either

 (1) CHECK_SIZE_ONLY

 (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
 make SIZE_ONLY the union of two

I do not have strong preference either way; the latter may be more
logical in that "not loading contents" and "check size" are sort of
orthogonal in that you can later choose to check, without loading
contents, only the binary-ness without checking size, but no calles
that passes a non-zero flag to the populate-filespec function will
want to slurp in the contents in practice, so in that sense we could
declare that the NO_CONENTS bit is implied.

But more importantly, would this patch actually help?  For one
thing, this wouldn't (and shouldn't) help if the user wants --binary
diff out of us anyway, I suspect, but I wonder what the following
codepath in the builtin_diff() function would do:

...
} else if (!DIFF_OPT_TST(o, TEXT) &&
( (!textconv_one && diff_filespec_is_binary(one)) ||
  (!textconv_two && diff_filespec_is_binary(two)) )) {
if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
die("unable to read files to diff");
/* Quite common confusing case */
if (mf1.size == mf2.size &&
!memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
if (must_show_header)
fprintf(o->file, "%s", header.buf);
goto free_ab_and_return;
}
fprintf(o->file, "%s", header.buf);
strbuf_reset(&header);
if (DIFF_OPT_TST(o, BINARY))
emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
else
fprintf(o->file, "%sBinary files %s and %s differ\n",
line_prefix, lbl[0], lbl[1]);
o->found_changes = 1;
} else {
...

If we weren't told with --text/-a to force textual output, and
at least one of the sides is marked as binary (and this patch marks
a large blob as binary unless attributes says otherwise), we still
call fill_mmfile() on them to slurp the contents to be compared, no?

And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
check if the sides are identical, so...

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-27 Thread Thomas Braun
Am 26.06.2014 19:55, schrieb Junio C Hamano:
> Nguyễn Thái Ngọc Duy   writes:
> 
>> Too large files may lead to failure to allocate memory. If it happens
>> here, it could impact quite a few commands that involve
>> diff. Moreover, too large files are inefficient to compare anyway (and
>> most likely non-text), so mark them binary and skip looking at their
>> content.
>> ...
>> diff --git a/diff.c b/diff.c
>> index a489540..7a977aa 100644
>> --- a/diff.c
>> +++ b/diff.c
>> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>>  one->is_binary = one->driver->binary;
>>  else {
>>  if (!one->data && DIFF_FILE_VALID(one))
>> -diff_populate_filespec(one, 0);
>> -if (one->data)
>> +diff_populate_filespec(one, 
>> DIFF_POPULATE_IS_BINARY);
>> +if (one->is_binary == -1 && one->data)
>>  one->is_binary = buffer_is_binary(one->data,
>>  one->size);
>>  if (one->is_binary == -1)
> 
> The name is misleading and forced me to read it twice before I
> realized that this is "populating the is-binary bit".  It might make
> it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
> perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
> the other bit may want to be also renamed from SIZE_ONLY to either
> 
>  (1) CHECK_SIZE_ONLY
> 
>  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
>  make SIZE_ONLY the union of two
> 
> I do not have strong preference either way; the latter may be more
> logical in that "not loading contents" and "check size" are sort of
> orthogonal in that you can later choose to check, without loading
> contents, only the binary-ness without checking size, but no calles
> that passes a non-zero flag to the populate-filespec function will
> want to slurp in the contents in practice, so in that sense we could
> declare that the NO_CONENTS bit is implied.
> 
> But more importantly, would this patch actually help?  For one
> thing, this wouldn't (and shouldn't) help if the user wants --binary
> diff out of us anyway, I suspect, but I wonder what the following
> codepath in the builtin_diff() function would do:
> 
>   ...
>   } else if (!DIFF_OPT_TST(o, TEXT) &&
>   ( (!textconv_one && diff_filespec_is_binary(one)) ||
> (!textconv_two && diff_filespec_is_binary(two)) )) {
>   if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
>   die("unable to read files to diff");
>   /* Quite common confusing case */
>   if (mf1.size == mf2.size &&
>   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
>   if (must_show_header)
>   fprintf(o->file, "%s", header.buf);
>   goto free_ab_and_return;
>   }
>   fprintf(o->file, "%s", header.buf);
>   strbuf_reset(&header);
>   if (DIFF_OPT_TST(o, BINARY))
>   emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
>   else
>   fprintf(o->file, "%sBinary files %s and %s differ\n",
>   line_prefix, lbl[0], lbl[1]);
>   o->found_changes = 1;
>   } else {
>   ...
> 
> If we weren't told with --text/-a to force textual output, and
> at least one of the sides is marked as binary (and this patch marks
> a large blob as binary unless attributes says otherwise), we still
> call fill_mmfile() on them to slurp the contents to be compared, no?
> 
> And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
> check if the sides are identical, so...

Good point. So how about an additional change roughly sketched as

@@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
diff_filespec *one)
return userdiff_get_textconv(one->driver);
 }

+/* read the files in small chunks into memory and compare them */
+static int filecmp_chunked(struct diff_filespec *one,
+   struct diff_filespec *two)
+{
+   // TODO add implementation
+   return 0;
+}
+
 static void builtin_diff(const char *name_a,
 const char *name_b,
 struct diff_filespec *one,
@@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a,
} else if (!DIFF_OPT_TST(o, TEXT) &&
( (!textconv_one && diff_filespec_is_binary(one)) ||
  (!textconv_two && diff_filespec_is_binary(two)) )) {
-   if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2,two)< 0)
-   die("unable to read files to diff");
+
+   unsigned long size1 = diff_filespec_size(one);
+   unsigned long size2 = diff_filespec_size(two);
+
+   if (size1 == 0 || size2 == 

Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-28 Thread Duy Nguyen
On Sat, Jun 28, 2014 at 1:56 AM, Thomas Braun
 wrote:
>> The name is misleading and forced me to read it twice before I
>> realized that this is "populating the is-binary bit".  It might make
>> it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
>> perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
>> the other bit may want to be also renamed from SIZE_ONLY to either
>>
>>  (1) CHECK_SIZE_ONLY
>>
>>  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
>>  make SIZE_ONLY the union of two
>>
>> I do not have strong preference either way; the latter may be more
>> logical in that "not loading contents" and "check size" are sort of
>> orthogonal in that you can later choose to check, without loading
>> contents, only the binary-ness without checking size, but no calles
>> that passes a non-zero flag to the populate-filespec function will
>> want to slurp in the contents in practice, so in that sense we could
>> declare that the NO_CONENTS bit is implied.

Will do (and probably go with (1) as I still prefer zero as "good defaults")

>> But more importantly, would this patch actually help?

Well yes as demonstrated by the new test ;-) Unfortunately the scope
of help is limited to --stat.. I should have done more thorough
testing.

>> For one
>> thing, this wouldn't (and shouldn't) help if the user wants --binary
>> diff out of us anyway, I suspect, but I wonder what the following
>> codepath in the builtin_diff() function would do:
>>
>>   ...
>>   } else if (!DIFF_OPT_TST(o, TEXT) &&
>>   ( (!textconv_one && diff_filespec_is_binary(one)) ||
>> (!textconv_two && diff_filespec_is_binary(two)) )) {
>>   if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
>>   die("unable to read files to diff");
>>   /* Quite common confusing case */
>>   if (mf1.size == mf2.size &&
>>   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
>>   if (must_show_header)
>>   fprintf(o->file, "%s", header.buf);
>>   goto free_ab_and_return;
>>   }
>>   fprintf(o->file, "%s", header.buf);
>>   strbuf_reset(&header);
>>   if (DIFF_OPT_TST(o, BINARY))
>>   emit_binary_diff(o->file, &mf1, &mf2, line_prefix);
>>   else
>>   fprintf(o->file, "%sBinary files %s and %s differ\n",
>>   line_prefix, lbl[0], lbl[1]);
>>   o->found_changes = 1;
>>   } else {
>>   ...
>>
>> If we weren't told with --text/-a to force textual output, and
>> at least one of the sides is marked as binary (and this patch marks
>> a large blob as binary unless attributes says otherwise), we still
>> call fill_mmfile() on them to slurp the contents to be compared, no?
>>
>> And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
>> check if the sides are identical, so...
>
> Good point. So how about an additional change roughly sketched as
>
> @@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
> diff_filespec *one)
> return userdiff_get_textconv(one->driver);
>  }
>
> +/* read the files in small chunks into memory and compare them */
> +static int filecmp_chunked(struct diff_filespec *one,
> +   struct diff_filespec *two)
> +{
> +   // TODO add implementation
> +   return 0;
> +}
> +


We have object streaming interface to do similar like this. In fact
index-pack already does large file memcmp() for hash collision test. I
think I can move some code around and support large file in this code
path..
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html