Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-02-03 Thread Junio C Hamano
Junio C Hamano  writes:

> The change to t0023 is merely an example that shows that existing
> tests assume the convert_to_git() way of defining the dirtyness of
> the working tree.  It used to be OK to have core.autocrlf set to true,
> have LF terminated file on the working tree and add it to the index,
> and the resulting state was "We just added it to the index, and
> nobody touched the index nor the working tree file--by definition
> the working tree IS CLEAN".  With your updated semantics, that no
> longer is true.  "We just added it, but if we check it out, we would
> normalize the line ending to be CRLF on the working tree, so the
> working tree is dirty" is what happens.
>
> There are tons of tests that would break the same way all of which
> needs to be looked at and fixed if we were to go in this direction.

That made me think further aloud.  I haven't thought things through,
but I wonder what happens if we do both.  That is, we define the
working tree file is clean if either:

  * the result of running convert_to_git() on the working tree
contents matches what is in the index (because that would mean
doing another "git add" on the path is a no-op); OR

  * the result of running convert_to_working_tree() on the content
in the index matches what is in the working tree (because that
would mean doing another "git checkout -f" on the path is a
no-op).

A possible downside (but again, I haven't thought things through, so
this may be a non-issue) of doing this is that it may make it even
harder to "fix" an index entry or a working tree file that is
inconsistent with the user's conversion settings.  Even when "git
add" would allow the user to fix an index entry by applying (an
updated) convert_to_git() filter to the working tree file, because
of the new rule that works in the opposite direction, we would end
up saying "the working tree file is clean, and there is no point
doing 'git add'".  And vice versa for fixing a working tree file by
running "git checkout".

Also this will make "update-index --refresh" potentially take twice
as long for paths that are not known to be clean and indeed dirty,
as they would need to be processed twice.

An updated patch to do so would look like this.  At least we don't
have to update the expectation t0023 makes with this approach.

 read-cache.c | 61 
 1 file changed, 61 insertions(+)

diff --git a/read-cache.c b/read-cache.c
index 84616c8..42d9452 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -156,17 +156,78 @@ void fill_stat_cache_info(struct cache_entry *ce, struct 
stat *st)
ce_mark_uptodate(ce);
 }
 
+/*
+ * Compare the data in buf with the data in the file pointed by fd and
+ * return 0 if they are identical, and non-zero if they differ.
+ */
+static int compare_with_fd(const char *input, ssize_t len, int fd)
+{
+   for (;;) {
+   char buf[1024 * 16];
+   ssize_t chunk_len, read_len;
+
+   chunk_len = sizeof(buf) < len ? sizeof(buf) : len;
+   read_len = xread(fd, buf, chunk_len ? chunk_len : 1);
+
+   if (!read_len)
+   /* EOF on the working tree file */
+   return !len ? 0 : -1;
+
+   if (!len)
+   /* we expected there is nothing left */
+   return -1;
+
+   if (memcmp(buf, input, read_len))
+   return -1;
+   input += read_len;
+   len -= read_len;
+   }
+}
+
+/*
+ * Does the file in the working tree match what is in the index?
+ */
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
 {
int match = -1;
int fd = open(ce->name, O_RDONLY);
 
+   /*
+* Would another "git add" on the path change what is in the
+* index for the path?
+*/
if (fd >= 0) {
unsigned char sha1[20];
if (!index_fd(sha1, fd, st, OBJ_BLOB, ce->name, 0))
match = hashcmp(sha1, ce->sha1);
/* index_fd() closed the file descriptor already */
}
+   if (!match)
+   return match;
+
+   /*
+* Would another "git checkout -f" out of the index change
+* what is in the working tree file?
+*/
+   fd = open(ce->name, O_RDONLY);
+   if (fd >= 0) {
+   enum object_type type;
+   unsigned long size;
+   void *data = read_sha1_file(ce->sha1, , );
+
+   if (type == OBJ_BLOB) {
+   struct strbuf worktree = STRBUF_INIT;
+   if (convert_to_working_tree(ce->name, data, size,
+   )) {
+   free(data);
+   data = strbuf_detach(, );
+   }
+   if 

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-02-02 Thread Junio C Hamano
Clemens Buchacher  writes:

> On Mon, Feb 01, 2016 at 10:17:24AM -0800, Junio C Hamano wrote:
>> 
>> Your proposal is to redefine "is the working tree dirty?"; it would
>> check if "git checkout -f" would change what is in the working tree.
>
> I like this definition. Sounds obviously right.

So this is an illustration.  The change to ce_compare_data() has a
room for improvement in that it assumes that it can always slurp the
whole blob in-core; it should try to use the streaming interface
when it makes sense.  Otherwise we would not be able to handle a
blob that we used to be able to (as index_fd() streams), which would
be a regression.

The change to t0023 is merely an example that shows that existing
tests assume the convert_to_git() way of defining the dirtyness of
the working tree.  It used to be OK to have core.autocrlf set to true,
have LF terminated file on the working tree and add it to the index,
and the resulting state was "We just added it to the index, and
nobody touched the index nor the working tree file--by definition
the working tree IS CLEAN".  With your updated semantics, that no
longer is true.  "We just added it, but if we check it out, we would
normalize the line ending to be CRLF on the working tree, so the
working tree is dirty" is what happens.

There are tons of tests that would break the same way all of which
needs to be looked at and fixed if we were to go in this direction.


 read-cache.c   | 53 +
 t/t0023-crlf-am.sh |  2 +-
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 84616c8..c284f78 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -156,16 +156,61 @@ void fill_stat_cache_info(struct cache_entry *ce, struct 
stat *st)
ce_mark_uptodate(ce);
 }
 
+/*
+ * Compare the data in buf with the data in the file pointed by fd and
+ * return 0 if they are identical, and non-zero if they differ.
+ */
+static int compare_with_fd(const char *input, ssize_t len, int fd)
+{
+   for (;;) {
+   char buf[1024 * 16];
+   ssize_t chunk_len, read_len;
+
+   chunk_len = sizeof(buf) < len ? sizeof(buf) : len;
+   read_len = xread(fd, buf, chunk_len ? chunk_len : 1);
+
+   if (!read_len)
+   /* EOF on the working tree file */
+   return !len ? 0 : -1;
+
+   if (!len)
+   /* we expected there is nothing left */
+   return -1;
+
+   if (memcmp(buf, input, read_len))
+   return -1;
+   input += read_len;
+   len -= read_len;
+   }
+}
+
+/*
+ * Does the file in the working tree match what is in the index?
+ * That is, do we lose any data from the working tree copy if we
+ * did a new "git checkout" of that path out of the index?
+ */
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
 {
int match = -1;
int fd = open(ce->name, O_RDONLY);
 
if (fd >= 0) {
-   unsigned char sha1[20];
-   if (!index_fd(sha1, fd, st, OBJ_BLOB, ce->name, 0))
-   match = hashcmp(sha1, ce->sha1);
-   /* index_fd() closed the file descriptor already */
+   enum object_type type;
+   unsigned long size;
+   void *data = read_sha1_file(ce->sha1, , );
+
+   if (type == OBJ_BLOB) {
+   struct strbuf worktree = STRBUF_INIT;
+   if (convert_to_working_tree(ce->name, data, size,
+   )) {
+   free(data);
+   data = strbuf_detach(, );
+   }
+   if (!compare_with_fd(data, size, fd))
+   match = 0;
+   }
+   free(data);
+   close(fd);
}
return match;
 }
diff --git a/t/t0023-crlf-am.sh b/t/t0023-crlf-am.sh
index f9bbb91..5c086b4 100755
--- a/t/t0023-crlf-am.sh
+++ b/t/t0023-crlf-am.sh
@@ -27,7 +27,7 @@ EOF
 test_expect_success 'setup' '
 
git config core.autocrlf true &&
-   echo foo >bar &&
+   printf "%s\r\n" foo >bar &&
git add bar &&
test_tick &&
git commit -m initial

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-02-01 Thread Clemens Buchacher
On Mon, Feb 01, 2016 at 10:17:24AM -0800, Junio C Hamano wrote:
> 
> Your proposal is to redefine "is the working tree dirty?"; it would
> check if "git checkout -f" would change what is in the working tree.

I like this definition. Sounds obviously right.

> > Regarding performance impact: We only need to do this extra check if the
> > usual check convert_to_git(work tree) against index state fails, and
> > conversion is in effect.
> 
> How would you detect the failure, though?  Having contents in the
> index that contradicts the attributes and eol settings affects the
> cleanliness both ways.  Running the working tree contents via to-git
> conversion and hashing may match the blob in the index, declaring
> that the index entry is "clean", but passing the blob to to-worktree
> conversion may produce result different from what is in the
> worktree, which is "falsely clean".

True. But this is what we do today, and I thought at first that we have
to keep this behavior. The following enables eol conversion on git add,
but not on checkout:

 printf 'line 1\r\n' >dos.txt
 echo '* text' >.gitattributes
 git add dos.txt
 git commit

After git add the worktree is considered clean, even though dos.txt
still has CRLF line endings, and rm dos.txt && git checkout dos.txt
re-creates dos.txt with LF line endings. If we change the definition as
proposed above, then the worktree would be dirty even though we just
did git add and git commit.

So I concluded that we have to treat the worktree clean if either git
add -u does not change the index state, _or_ git checkout -f does not
change the worktree state.

But doing only the git checkout -f check makes much more sense. Maybe we
can handle the above situation better by doing an implicit
git checkout -f  after git commit. After all, I would
expect git commit to give me exactly the same state that I get later
when I do git checkout  for the same commit.

> > On the other hand, a user who simply follows an upstream repository by
> > doing git pull all the time, and who does not make changes to their
> > configuration, can still run into this issue, because upstream could
> > change .gitattributes. This part we could actually detect by hashing the
> > attributes for each index entry, and if that changes we re-evaluate the
> > file state.
> 
> If this has to bloat each index entry, I do not think solving the
> problem is worth that cost of that overhead.  I'd rather just say
> "if you have inconsistent data, here is a workaround using 'reset'
> and then 'reset --hard'" and be done with it.

Works for me.

> > This is also an issue only if a smudge filter is in place. The eol
> > conversion which only acts in the convert_to_git direction is not
> > affected.
> 
> IIRC, autocrlf=true would strip CR at the end of line in to-git
> conversion, and would add CR in to-worktree conversion.  So some eol
> conversion may only act in to-git, but some others do affect both,
> and without needing you to touch attributes.

I was somehow under the impression that autocrlf=true is discouraged,
and setting the text attribute to true is the new recommended way to
configure eol conversion. But I see that the Git for Windows installer
still offers autocrlf=true as the default option, so clearly we need to
support it well.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-02-01 Thread Torsten Bögershausen
On 2016-02-01 19.17, Junio C Hamano wrote:
> Clemens Buchacher  writes:
[]
> 
> IIRC, autocrlf=true would strip CR at the end of line in to-git
> conversion, and would add CR in to-worktree conversion.  So some eol
> conversion may only act in to-git, but some others do affect both,
> and without needing you to touch attributes.
That depends, which version of Git you are running.
It has changed from the first version of core.autocrlf:

commit c4805393d73425a5f467f10fa434fb99bfba17ac
Author: Finn Arne Gangstad 
Date:   Wed May 12 00:37:57 2010 +0200

autocrlf: Make it work also for un-normalized repositories

Previously, autocrlf would only work well for normalized
repositories. Any text files that contained CRLF in the repository
would cause problems, and would be modified when handled with
core.autocrlf set.

Change autocrlf to not do any conversions to files that in the
repository already contain a CR. git with autocrlf set will never
create such a file, or change a LF only file to contain CRs, so the
(new) assumption is that if a file contains a CR, it is intentional,
and autocrlf should not change that.

The following sequence should now always be a NOP even with autocrlf
set (assuming a clean working directory):

git checkout 
touch *
git add -A .(will add nothing)
git commit  (nothing to commit)

Previously this would break for any text file containing a CR.

Some of you may have been folowing Eyvind's excellent thread about
trying to make end-of-line translation in git a bit smoother.

I decided to attack the problem from a different angle: Is it possible
to make autocrlf behave non-destructively for all the previous problem 
cases?

Stealing the problem from Eyvind's initial mail (paraphrased and
summarized a bit):

1. Setting autocrlf globally is a pain since autocrlf does not work well
   with CRLF in the repo
2. Setting it in individual repos is hard since you do it "too late"
   (the clone will get it wrong)
3. If someone checks in a file with CRLF later, you get into problems again
4. If a repository once has contained CRLF, you can't tell autocrlf
   at which commit everything is sane again
5. autocrlf does needless work if you know that all your users want
   the same EOL style.

I belive that this patch makes autocrlf a safe (and good) default
setting for Windows, and this solves problems 1-4 (it solves 2 by being
set by default, which is early enough for clone).

I implemented it by looking for CR charactes in the index, and
aborting any conversion attempt if this is found.
---
And my intention is to do a similar fix for the attributes.
More patches coming.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-02-01 Thread Junio C Hamano
Clemens Buchacher  writes:

> Ok, then let's take a step back. I do not actually care if git diff and
> friends say the worktree is clean or not.

You may not, but many existing scripts people have do.

> But I know that I did not make
> any modifications to the worktree, because I just did git reset --hard.
> And now I want to use commands like cherry-pick and checkout without
> failure. But they can fail, because they essentially use git diff to
> check if there are worktree changes, and if so refuse to overwrite them.

Yes, exactly.

> So, if the check "Am I allowed to modify the worktree file?", would go
> the extra mile to also check if the worktree is clean in the sense that
> convert_to_worktree(index state) matches the worktree. If this is the
> case, then it is safe to modify the file because it is the committed
> state, and can be recovered.

So in essense, the proposed "fix" is "let's fix it in the right
way"?

The way we defined "would we lose some changes that are only in the
working tree?", aka "is the working tree dirty wrt the index?", has
been to check if "git add -u" would change the states in the index.
And for scripted Porcelains and end-user scripts, "git diff-files",
aka "what change would 'git add -u' make to the states in the
index?", has been the command to do the same check.

Your proposal is to redefine "is the working tree dirty?"; it would
check if "git checkout -f" would change what is in the working tree.

I agree that indeed is "would we lose some changes that are only in
the working tree", and I think we can do that transparently for
"internal" commands, i.e. without any end-user impact, as the new
check would behave identically when they have sane contents--the
difference between the current check and the new check only exists
when the contents in the index contradicts what the user specifies
for to-git conversion via eol or clean filter.

We would need a way for our scripted Porcelains and end-user scripts
to ask that new question, though, but I think that is not something
insurmountable.  A new option to "diff-files" or something, perhaps,
would be workable, but having a new "git require-clean-work-tree"
plumbing, which would replace require_clean_work_tree shell helper
in git-sh-setup, may be conceptually much cleaner, because the new
definition of "working tree being clean" is no longer tied to what
"diff" should say.

I like that as a general direction.

> Regarding performance impact: We only need to do this extra check if the
> usual check convert_to_git(work tree) against index state fails, and
> conversion is in effect.

How would you detect the failure, though?  Having contents in the
index that contradicts the attributes and eol settings affects the
cleanliness both ways.  Running the working tree contents via to-git
conversion and hashing may match the blob in the index, declaring
that the index entry is "clean", but passing the blob to to-worktree
conversion may produce result different from what is in the
worktree, which is "falsely clean".  That is an equally important
case that is opposite from what we have been primarily discussing,
which is "falsely dirty".

>> Besides, I do not think the above approach really solves the issue,
>> either.  After "git reset --hard" to have the contents in the index
>> dumped to the working tree, if your core.autocrlf is flipped,
>
> Indeed, if the user configuration changes, then we cannot always detect
> this (at least if the filter is an external program, and the behavior of
> that changes). But the user is in control of that, and we can document
> this limitation.

That argument does not result in a very useful result, though.
Because the user is in control of what attributes and eol settings
are in effect in her repository, we can just document that the
current check will give unspecified result if the indexed contents
contradict with that setting, e.g. when you have CRLF encoded data
in the index but the eol conversion assumes LF in the repository.
But this discussion is an attempt to do better than that, no?

> On the other hand, a user who simply follows an upstream repository by
> doing git pull all the time, and who does not make changes to their
> configuration, can still run into this issue, because upstream could
> change .gitattributes. This part we could actually detect by hashing the
> attributes for each index entry, and if that changes we re-evaluate the
> file state.

If this has to bloat each index entry, I do not think solving the
problem is worth that cost of that overhead.  I'd rather just say
"if you have inconsistent data, here is a workaround using 'reset'
and then 'reset --hard'" and be done with it.

> This is also an issue only if a smudge filter is in place. The eol
> conversion which only acts in the convert_to_git direction is not
> affected.

IIRC, autocrlf=true would strip CR at the end of line in to-git
conversion, and would add CR in to-worktree conversion.  So some eol

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-30 Thread Clemens Buchacher
On Thu, Jan 28, 2016 at 01:32:30PM -0800, Junio C Hamano wrote:
> Clemens Buchacher  writes:
> 
> > If we do this, then git diff should show the diff between
> > convert_to_worktree(index state) and the worktree state.
> 
> And that unfortunately is a very good reason why this approach
> should not be taken.

Ok, then let's take a step back. I do not actually care if git diff and
friends say the worktree is clean or not. But I know that I did not make
any modifications to the worktree, because I just did git reset --hard.
And now I want to use commands like cherry-pick and checkout without
failure. But they can fail, because they essentially use git diff to
check if there are worktree changes, and if so refuse to overwrite them.

So, if the check "Am I allowed to modify the worktree file?", would go
the extra mile to also check if the worktree is clean in the sense that
convert_to_worktree(index state) matches the worktree. If this is the
case, then it is safe to modify the file because it is the committed
state, and can be recovered.

Regarding performance impact: We only need to do this extra check if the
usual check convert_to_git(work tree) against index state fails, and
conversion is in effect.

> Besides, I do not think the above approach really solves the issue,
> either.  After "git reset --hard" to have the contents in the index
> dumped to the working tree, if your core.autocrlf is flipped,

Indeed, if the user configuration changes, then we cannot always detect
this (at least if the filter is an external program, and the behavior of
that changes). But the user is in control of that, and we can document
this limitation.

On the other hand, a user who simply follows an upstream repository by
doing git pull all the time, and who does not make changes to their
configuration, can still run into this issue, because upstream could
change .gitattributes. This part we could actually detect by hashing the
attributes for each index entry, and if that changes we re-evaluate the
file state.

This is also an issue only if a smudge filter is in place. The eol
conversion which only acts in the convert_to_git direction is not
affected.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-28 Thread Junio C Hamano
Clemens Buchacher  writes:

> On Wed, Jan 27, 2016 at 12:49:31PM -0800, Junio C Hamano wrote:
>> Junio C Hamano  writes:
>> 
>> > I wonder what would break if we ask this question instead:
>> >
>> > We do not know if the working tree file and the indexed data
>> > match.  Let's see if "git checkout" of that path would leave the
>> > same data as what currently is in the working tree file.
>
> If we do this, then git diff should show the diff between
> convert_to_worktree(index state) and the worktree state.

I agree with you that, when ce_compare_data(), i.e. "does the index
match the working tree?", says "they match", "git diff" (show me the
change to go from the index to the worknig tree) should show empty
to be consistent, and for that to happen under the above definition
of ce_compare_data(), "git diff" needs to be comparing the data in
the index after converting it to the working tree representation
with the data in the working tree.

And that unfortunately is a very good reason why this approach
should not be taken.  "git diff" (show me the change to go from the
index to the working tree) is a preview of what we would see in "git
diff --cached" (show me the change to go from HEAD to the index) if
we did "git add", and it is a preview of what we would see in "git
show" (show me the change of what the last commit did) if we did
"git commit -a".  It is crazy for these latter comparisons to happen
in the working tree (aka "smudged") representation of the data, IOW,
these two must compare the "clean" representation.  It also is crazy
for "git diff" to be using different representation from these two.
This alone makes the above idea a non-starter X-<.

Besides, I do not think the above approach really solves the issue,
either.  After "git reset --hard" to have the contents in the index
dumped to the working tree, if your core.autocrlf is flipped, "git
checkout" of the same path would result in a working tree
representation of the data that is different from what you have in
the working tree, so we would declare that the working tree is not
clean, even though nobody actually touched them in the meantime.
This is less of an issue than having data in the index that is
inconsistent with the convert_to_git() setting (i.e. eol and clean
filter conversion that happens when you "git add"), but it still is
fundamentally the same issue.

Oh, bummer, I thought it was a nice approach.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-27 Thread Clemens Buchacher
On Wed, Jan 27, 2016 at 12:49:31PM -0800, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > I wonder what would break if we ask this question instead:
> >
> > We do not know if the working tree file and the indexed data
> > match.  Let's see if "git checkout" of that path would leave the
> > same data as what currently is in the working tree file.

If we do this, then git diff should show the diff between
convert_to_worktree(index state) and the worktree state. That would be
nice, because the diff would actually show what we have in the worktree.
It keeps confusing me that with eol conversion enabled, git diff does
not actually show me the worktree state.

However, even if the file is clean in that direction, there could be a
mismatch between convert_to_git(worktree state) and the index state.
This will happen for example in t0025.4, where we have a CRLF file in
the index and the worktree, but convert_to_git converts it to a file
with LF line endings. Still, I do not see a problem if we also provide a
command like git add --fix-index, which will force normalization of all
files.

> > If we did this, "reset --hard HEAD" followed by "diff HEAD" will by
> > definition always report "is clean" as long as nobody changes files
> > in the working tree, even with the inconsistent data in the index.

Yes, this is a more elegant and a more complete solution to the problem
which prompted me to submit the GIT_ATTRIBUTES_DISABLED patch.

> > This still requires that convert_to_working_tree(), i.e. your smudge
> > filter, is deterministic, though, but I think that is a sensible
> > assumption for sane people, even for those with inconsistent data in
> > the index.

Deterministic, yes. But not unchanging. When a smudge filter is added,
or modified, or if the filter program changes, we still have to remove
the index before we can trust git diff again. The only way to avoid this
would be to somehow detect if the conversion itself changes. One could
hash the attributes, but changes to the filter configuration or the
filter itself are hard to detect. So I think we have to live with this.

> [...] Doing the other check will have to
> inflate the blob data and apply the convert_to_working_tree()
> processing, and also read the whole thing from the filesystem and
> compare, which is more work at runtime.

If we assume that the smudge filter is deterministic, then we could also
hash the output of convert_to_working_tree, and store the hash in the
index. With this optimization, the comparision would be less work,
because we do not have to apply a filter again, whereas currently we
have to apply convert_to_git.

> IOW, I am saying that the "add --fix-index" lunchbreak patch I sent
> earlier in the thread that has to hold the data in-core while
> processing is not a production quality patch ;-)

Ok. The existing implementation in renormalize_buffer (convert.c) works
for me, though.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-27 Thread Junio C Hamano
Clemens Buchacher  writes:

> Coming back to "[PATCH] optionally disable gitattributes": The topics
> are related, because they both deal with the situation where the work
> tree has files which are not normalized according to gitattributes. But
> my patch is more about saying: ok, I know I may have files which need to
> be normalized, but I want to ignore this issue for now. Please disable
> gitattributes for now, because I want to work with the files as they are
> committed. Conversely, the discussion here is about how to reliably
> detect and fix files which are not normalized.

I primarily wanted to make sure that you understood the underlying
issue, so that I do not have to go back to the basics in the other
thread.  And it is clear that you obviously do, which is good.

Here, you seem to think that what t0025 wants to see happen is
sensible, judging by the fact that you call "rm .git/index && git
reset" a "fix".

My take on this is quite different.  After a "reset --hard HEAD", we
should be able to trust the cached stat information and have "diff
HEAD" say "no changes".  That is what you essentially want in the
other thread, if I understand you correctly, and in an ideal world
where the filesystem timestamp has infinite precision, that is what
would happen in t0025, always "breaking" its expectation.  The real
world has much coarser timestamp granularity than ideal, and that is
why the test appear to be "flaky", failing to give "correct" outcome
some of the time--but I'd say that it is expecting a wrong thing.

An index entry that has data that does not round-trip when it goes
through convert_to_working_tree() and then convert_to_git() "breaks"
this arrangement, and I'd view it as the user having an inconsistent
data.  It is like you are in a repository that still has an unmerged
paths--you cannot proceed before you resolve them.

Anyway.

As to your patch in the other thread, here is what I think:

 (1) When you know (or perhaps your CI knows) that the working tree
 has never been modified since you did "reset --hard HEAD" (or
 its equivalent, like "git checkout $branch" from a clean
 state), these paths with inconsistent data would break the
 usual check to ask "is the working tree clean?"  That is a
 problem and we need a way to ensure that the working tree is
 always judged to be clean immediately after "reset --hard
 HEAD".  IOW, I agree with you that the issue you are trying to
 solve is worth solving.

 (2) Regardless of the "inconsistent data breaking the cleanliness
 check" issue, it may be handy to have a way to temporarily
 disable the attributes, i.e. allow us to ask "what happens if
 there is no attributes defined?"  IOW, I am saying that the
 change in the patch is not without merit.

In addition to (1), I further think that this sequence should not
report that the path F is modified:

 # Write F from HEAD to the working tree, after passing it
 # through convert_to_working_tree()
 $ git reset --hard HEAD

 # Force the re-reading, without changing the contents at all
 $ cp F F.new
 $ mv F.new F

 $ git diff HEAD

which is broken by paths with inconsistent data.  Your CI would want
a way to make that happen.

However, I do not think disabling attributes (i.e. (2)) is a
solution to the issue (i.e. (1)), which we just agreed to be an
issue that is worth solving, for at least two reasons.

 * Even without any attributes, core.autocrlf setting can get the
   data in your index (whose lines can be terminated with CRLF) into
   the same "inconsistent data" situation.  Disabling attribute
   handling would not have any effect on that codepath, I think.

 * The indexed data and the contents in the working tree file may
   match only because the clean/smudge transformation is done.  If
   you disable attributes, re-checking by passing the working tree
   contents through convert_to_git() and comparing the result with
   what is in the index would tell you that they are different, even
   if the clean/smudge filter pair implements round-trip operations
   correctly.

One way to solve (1) I can think of is to change the definition of
ce_compare_data(), which is called by the code that does not trust
the cached stat data (including but not limited to the Racy Git
codepath).  The current semantics of that function asks this
question:

We do not know if the working tree file and the indexed data
match.  Let's see if "git add" of that path would record the
data that is identical to what is in the index.

This definition was cast in stone by 29e4d363 (Racy GIT, 2005-12-20)
and has been with us since Git v1.0.0.  But that does not have to be
the only sensible definition of this check.  I wonder what would
break if we ask this question instead:

We do not know if the working tree file and the indexed data
match.  Let's see if "git checkout" of that path would leave the
same data as what 

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-27 Thread Junio C Hamano
Junio C Hamano  writes:

> One way to solve (1) I can think of is to change the definition of
> ce_compare_data(), which is called by the code that does not trust
> the cached stat data (including but not limited to the Racy Git
> codepath).  The current semantics of that function asks this
> question:
>
> We do not know if the working tree file and the indexed data
> match.  Let's see if "git add" of that path would record the
> data that is identical to what is in the index.
>
> This definition was cast in stone by 29e4d363 (Racy GIT, 2005-12-20)
> and has been with us since Git v1.0.0.  But that does not have to be
> the only sensible definition of this check.  I wonder what would
> break if we ask this question instead:
>
> We do not know if the working tree file and the indexed data
> match.  Let's see if "git checkout" of that path would leave the
> same data as what currently is in the working tree file.
>
> If we did this, "reset --hard HEAD" followed by "diff HEAD" will by
> definition always report "is clean" as long as nobody changes files
> in the working tree, even with the inconsistent data in the index.
>
> This still requires that convert_to_working_tree(), i.e. your smudge
> filter, is deterministic, though, but I think that is a sensible
> assumption for sane people, even for those with inconsistent data in
> the index.

Just a few additional comments.

The primary reason why I originally chose "does 'git add' of what is
in the working tree give us the same blob in the index?" as opposed
to "does 'git checkout' from the index again will give the same
result in the working tree?" is because it is a lot less resource
intensive and also is simpler.  Back then I do not think we had a
streaming interface to hash huge contents from a file in the working
tree, but it requires us to read the entire file from the filesystem
just once, apply the convert_to_git() processing and then hash the
result, whether we keep the whole thing in core at once or process
the data in streaming fashion.  Doing the other check will have to
inflate the blob data and apply the convert_to_working_tree()
processing, and also read the whole thing from the filesystem and
compare, which is more work at runtime.  And for a sane set-up where
the data in the index does not contradict with the clean/smudge
filter and EOL settings, both would yield the same result.

If we were to switch the semantics of ce_compare_data(), we would
want a new sibling interface next to stream_blob_to_fd() that takes
a file descriptor opened on the file in the working tree for reading
(fd), the object name (sha1), and the output filter, and works very
similarly to stream_blob_to_fd().  The difference would be that we
would be reading from the fd (i.e. the file in the working tree) as
we read from the istream (i.e. the contents of the blob in the
index, after passing the convert_to_working_tree() filter) and
comparing them in the main loop.  The filter parameter to the
function would be obtained by calling get_stream_filter() just like
how write_entry() uses it to prepare the filter parameter to call
streaming_write_entry() with.  That way, we can rely on future
improvement of the streaming interface to make sure we keep the data
we have to keep in core to the minimum.

IOW, I am saying that the "add --fix-index" lunchbreak patch I sent
earlier in the thread that has to hold the data in-core while
processing is not a production quality patch ;-)



--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-27 Thread Clemens Buchacher
I think Junio pointed me to this thread from "[PATCH] optionally disable
gitattributes". Since I am not sure I am following everything correctly
in this thread, allow me to recapitulate what I understood so far.

Firstly, I think the racy'ness of t0025 is understood. It is due to the
is_racy_timestamp check in read-cache.c's ie_match_stat. But for the
moment I would like to put this aside, because the issue can be
reproduced reliably with this change to t0025:

diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
index c164b46..e30e9b3 100755
--- a/t/t0025-crlf-auto.sh
+++ b/t/t0025-crlf-auto.sh
@@ -55,8 +55,11 @@ test_expect_success 'crlf=true causes a CRLF file to be 
normalized' '
 test_expect_success 'text=true causes a CRLF file to be normalized' '
 
rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
-   echo "CRLFonly text" > .gitattributes &&
git read-tree --reset -u HEAD &&
+   sleep 1 &&
+   rm .git/index &&
+   git reset &&
+   echo "CRLFonly text" > .gitattributes &&
 
# Note, "normalized" means that git will normalize it if added
has_cr CRLFonly &&

I intentionally wait for one second and then I remove and re-read the
index. Now the timestamps of CRLFonly and .git/index are different, so
we avoid the is_racy_timestamp check. From now on Git will not read the
contents of CRLFonly from disk again until either the index entry or the
mtime of CRLFonly changes (maybe we also check the size, I am not sure).

Now we add .gitattributes. This does not change the index entry, nor
does it change the mtime of CRLFonly. Therefore the subsequent git diff
turns out empty, and the test fails.

I believe this behavior is expected. In gitattributes(5) we therefore
recommend using rm .git/index and git reset to "force Git to rescan the
working directory." The test should be fixed accordingly, something
like:

diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
index c164b46..2917591 100755
--- a/t/t0025-crlf-auto.sh
+++ b/t/t0025-crlf-auto.sh
@@ -57,6 +57,8 @@ test_expect_success 'text=true causes a CRLF file to be 
normalized' '
rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
echo "CRLFonly text" > .gitattributes &&
git read-tree --reset -u HEAD &&
+   rm .git/index &&
+   git reset &&
 
# Note, "normalized" means that git will normalize it if added
has_cr CRLFonly &&



On Mon, Jan 25, 2016 at 01:52:18PM -0800, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > I do not think there is a canned command to help dealing with these
> > broken paths right now.

I think (rm .git/index && git reset) works well enough in most cases,
but not all:

> We could go even fancier and attempt the round-trip twice or more.
> It is possible that the in-index representation will not converge
> when you use a misconfigured pair of clean/smudge filters

This can also happen with eol conversion, for example if you have files
with CRCRLF line endings. The eol conversion will remove only one CR.
Two conversions would be needed to achieve a normalized format. But
iterating (rm .git/index && git reset) does not help. Since we do not
touch the file on disk, after the first round, we have CRCRLF on disk
and CRLF in the index. During the second round, Git reads CRCRLF from
disk again, converts it to CRLF, which matches the index. Even
git reset --hard will not checkout the CRLF version to the worktree.

A possible solution is to iterate (rm -r * && git checkout -- . && git
add -u) until the work tree is clean. Quite ugly.

A command like git add --fix-index would make this conversion less
painful.  It should be ok if the user has to run it several times in
corner cases like CRCRLF, but it would be nice to issue a warning if the
index is still not normalized after running git add --fix-index.

Regarding the name of the option, maybe git add --renormalize-index
would be more consistent, since we also have the related merge option
"renormalize", which is very similar. In fact possibly you can share
some code with it.

Your patch looks good to me otherwise.


Coming back to "[PATCH] optionally disable gitattributes": The topics
are related, because they both deal with the situation where the work
tree has files which are not normalized according to gitattributes. But
my patch is more about saying: ok, I know I may have files which need to
be normalized, but I want to ignore this issue for now. Please disable
gitattributes for now, because I want to work with the files as they are
committed. Conversely, the discussion here is about how to reliably
detect and fix files which are not normalized.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-25 Thread Thomas Gummerer
On 01/24, Junio C Hamano wrote:
> Junio C Hamano  writes:
>
> > Sorry, but I am confused by all of the above.
> >
> > We write the thing out with write_entry(), possibly applying smudge
> > filters and eol conversion to the "git" representation to create the
> > "working tree" representation in this codepath, right?  The resulting
> > file matches what the user's configuration told us to produce.
> >
> > Until the working tree file is changed by somebody after the above
> > happens, we shouldn't have to check the contents of the file to see
> > if there is a difference.  By definition, that has to match the
> > contents expected to be there by Git.
> >
> > The only case I can think of that the above does not hold is when
> > the smuge/clean and the eol conversion are not a correct round-trip
> > operation pairs, but that would be a misconfiguration.  Otherwise,
> > we'd be _always_ comparing the contents without relying on the
> > cached stat info for any paths whose in-core and working tree
> > representations are different, not just those that are configured
> > with misbehaving smudge/clean pair, no?
>
> To put it differently, if a blob stored at path has CRLF line
> endings and .gitattributes is changed after the fact to say that it
> must have LF line endings, we should treat it as a broken transitory
> state.

Right, I wasn't considering this as a broken state, because t0025 uses
just this to transition between the states.

> There may have to be a way to "fix" an already "wrong" blob
> in the index that is milder than "rm --cached && add .", but I do
> not think write_entry(), which is shared by all the normal codepaths
> that writes out to the working tree, is the best place to do so, if
> doing so forces the contents of the paths to be always re-checked,
> just in case the user is in such a broken transitory state.

Maybe I'm misunderstanding something, but the contents of the paths
are only re-checked if we are in such a broken transition state, and
the file stored in git has crlf line endings, and thus would be
normalized.  In this case we currently re-check the contents of that
file anyway, at least when the file and the index have the same mtime,
and we actually show the correct output.

I'm not too familiar with the eol conversion code, so I might be
missing something.

--
Thomas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-25 Thread Junio C Hamano
Thomas Gummerer  writes:

> On 01/24, Junio C Hamano wrote:
>> To put it differently, if a blob stored at path has CRLF line
>> endings and .gitattributes is changed after the fact to say that it
>> must have LF line endings, we should treat it as a broken transitory
>> state.
>
> Right, I wasn't considering this as a broken state, because t0025 uses
> just this to transition between the states.
>
>> There may have to be a way to "fix" an already "wrong" blob
>> in the index that is milder than "rm --cached && add .", but I do
>> not think write_entry(), which is shared by all the normal codepaths
>> that writes out to the working tree, is the best place to do so, if
>> doing so forces the contents of the paths to be always re-checked,
>> just in case the user is in such a broken transitory state.
>
> Maybe I'm misunderstanding something, but the contents of the paths
> are only re-checked if we are in such a broken transition state, and

What I do not understand here is how the added check ensures that
"only if in such a broken transition state".  would_convert_to_git()
does not take the contents but is called only with the pathname to
key into the attributes, so in a typical cross platform project
where all the source files are "text" and the repository can set
core.eol to adjust the end of line convention for its working tree,
the added check has no way to differentiate the paths that are
recorded with CRLF line endings in the object database by mistake,
i.e. the ones in the broken transitory state, and all the other
paths that are following the "text" and storing their blobs with LF
line endings.  The added check would trigger "is it racy" check to
re-reads the contents that we have written out from the working tree
for the paths with "wrong" blobs, but how would it avoid doing so
for the paths whose blobs are already stored correctly?

The new code affects not just "reset --hard", but everybody who
writes out from the object database to the working tree and records
that these paths are checked out in the index.  How does the new
code avoid destroying the performance for all paths?

> the file stored in git has crlf line endings, and thus would be
> normalized.  In this case we currently re-check the contents of that
> file anyway, at least when the file and the index have the same mtime,
> and we actually show the correct output.

The right way to at that "correct output", I think, is that it
happens to be shown that way by accident.  It is questionable that
it is correct to report that such a path is modified.  Immediately
after you check out a path to the working tree out of the index, via
"reset --hard" and "checkout $path", by definition it ought to be
the same between the working tree file and the index.

Unless it is in this transititory broken state, that is.

The "by accident" happens only because racy-git avoidance is being
implemented in one particular way.  Is about protecting people from
making changes to the working tree files immediately after their
previous contents are registered to the index (and the index files
written to the disk), and immediately after we write things out of
the index and by definition the result and the indexed blob ought to
match there is no reason to re-read and recheck unless the working
tree files are further edited.

The current way the racy-git avoidance works by re-reading and
re-checking the contents when the timestamps match is merely one
possible implementation, and that is the only thing that produces
your "correct" output most of the time in this test.  We could have
waited after writing the index time for a second before giving the
control back to the user instead, which would have also allowed us
to implement the racy-git avoidance, and in that alternate world,
all these transitory broken paths would have been correctly reported
as unmodified.

IOW, I would think the test in question is insisting a single
outcome for an operation whose result is undefined, and it is
harmful to twist the system by pessimizing the common cases just
to cater to this transititory broken state.

I am not saying that we shouldn't have support for users to fix
their repository and get out of this transititory broken state.  A
recent work by Torsten Bögershausen to have ls-files report the end
of line convention used in the blob in the index and the settings
that affect conversion for each path (among other things) is a step
in the right direction.  With a support like that, those who noticed
that they by mistake added CRLF files to the index as-is when they
wanted their project to be cross platform can recover from it by
setting necessary attributes (i.e. mark them as "text") and then
find paths that are broken out of "ls-files --eol" output to see
which ones are not using lf end-of-line in the index.

I do not think there is a canned command to help dealing with these
broken paths right now.  You would have to check them out of the
index (you would 

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-25 Thread Junio C Hamano
Junio C Hamano  writes:

> I am not saying that we shouldn't have support for users to fix
> their repository and get out of this transititory broken state.  A
> recent work by Torsten Bögershausen to have ls-files report the end
> of line convention used in the blob in the index and the settings
> that affect conversion for each path (among other things) is a step
> in the right direction.  With a support like that, those who noticed
> that they by mistake added CRLF files to the index as-is when they
> wanted their project to be cross platform can recover from it by
> setting necessary attributes (i.e. mark them as "text") and then
> find paths that are broken out of "ls-files --eol" output to see
> which ones are not using lf end-of-line in the index.
>
> I do not think there is a canned command to help dealing with these
> broken paths right now.  You would have to check them out of the
> index (you would get a CRLF file in the working tree in the example
> we are discussing), fix the line endings (you would run dos2unix on
> it in this example, as you would want "text=true" attribute) and
> "git add" them to recover manually, but I can imagine that Torsten's
> work can be extended to do all of these, without molesting the
> working tree files, with minimum work by the end user.  That is:
>
>  * Reuse Torsten's "ls-files --eol" code to find paths that record
>the blob in the index that does not follow the eol convention
>specified for the path;
>
>  * For each of these index entries, run convert_to_working_tree() on
>the indexed contents, and then on the result of it, run
>convert_to_git().  The result is the blob that the index ought to
>have had, if it were to be consistent with the attribute
>settings.  So add that to the index.
>
>  * Write the index out.
>
>  * Tell the user to commit or commit it automatically with a canned
>log message "fix broken encoding" or something.

Here is what I whipped up as a lunch-break hack.  I do not claim
that "git add" would be the best place to do this, but it should be
sufficient to illustrate the overall idea.

The user can say "git add --fix-index" and have a simplified version
of the above happen, i.e. for each path in the index, if the
contents recorded there does not round-trip to the identical
contents when first converted to the working tree representation
(i.e. passing through core.eol and smudge filter conversion) and
then converted back to the Git blob representation (i.e. clean
filter and core.crlf), and when the result is different from what we
started from, we know we have an unnormalized blob registered in the
index, so we replace it.  After this, "git diff --cached" would show
the correction made by this operation, and committing it would let
you fix the earlier mistake that added CRLF content when the path
was marked with text=true attribute.

We could go even fancier and attempt the round-trip twice or more.
It is possible that the in-index representation will not converge
when you use a misconfigured pair of clean/smudge filters (e.g.
using "gzip -d -c" as the smudge filter, and then using "gzip -c"
without "-n" option as the clean filter would most likely make the
in-index representation fuzzy, as each time the cycle is run, the
compressed contents will be made with different timestamps, even
though the working tree representation will be the same), and an
operation "we screwed up the filters, please repair the damage!"
like this "add --fix-index" is probably the best place to catch such
a misconfiguration.

 builtin/add.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/builtin/add.c b/builtin/add.c
index 145f06e..36d3915 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -233,6 +233,7 @@ N_("The following paths are ignored by one of your 
.gitignore files:\n");
 
 static int verbose, show_only, ignored_too, refresh_only;
 static int ignore_add_errors, intent_to_add, ignore_missing;
+static int fix_index;
 
 #define ADDREMOVE_DEFAULT 1
 static int addremove = ADDREMOVE_DEFAULT;
@@ -263,6 +264,7 @@ static struct option builtin_add_options[] = {
OPT_BOOL( 0 , "refresh", _only, N_("don't add, only refresh the 
index")),
OPT_BOOL( 0 , "ignore-errors", _add_errors, N_("just skip files 
which cannot be added because of errors")),
OPT_BOOL( 0 , "ignore-missing", _missing, N_("check if - even 
missing - files are ignored in dry run")),
+   OPT_BOOL( 0 , "fix-index", _index, N_("fix contents in the index 
that is inconsistent with the eol and clean/smudge filters")),
OPT_END(),
 };
 
@@ -297,6 +299,64 @@ static int add_files(struct dir_struct *dir, int flags)
return exit_status;
 }
 
+static int fix_index_roundtrip(int ac, const char **av, const char *prefix)
+{
+   int i;
+
+   if (ac)
+   die(_("git add --fix-index does not take any other argument"));
+
+   if (read_cache() 

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-25 Thread Thomas Gummerer
On 01/25, Junio C Hamano wrote:
> Thomas Gummerer  writes:
>
> > On 01/24, Junio C Hamano wrote:
> >> To put it differently, if a blob stored at path has CRLF line
> >> endings and .gitattributes is changed after the fact to say that it
> >> must have LF line endings, we should treat it as a broken transitory
> >> state.
> >
> > Right, I wasn't considering this as a broken state, because t0025 uses
> > just this to transition between the states.
> >
> >> There may have to be a way to "fix" an already "wrong" blob
> >> in the index that is milder than "rm --cached && add .", but I do
> >> not think write_entry(), which is shared by all the normal codepaths
> >> that writes out to the working tree, is the best place to do so, if
> >> doing so forces the contents of the paths to be always re-checked,
> >> just in case the user is in such a broken transitory state.
> >
> > Maybe I'm misunderstanding something, but the contents of the paths
> > are only re-checked if we are in such a broken transition state, and
>
> What I do not understand here is how the added check ensures that
> "only if in such a broken transition state".  would_convert_to_git()
> does not take the contents but is called only with the pathname to
> key into the attributes, so in a typical cross platform project
> where all the source files are "text" and the repository can set
> core.eol to adjust the end of line convention for its working tree,
> the added check has no way to differentiate the paths that are
> recorded with CRLF line endings in the object database by mistake,
> i.e. the ones in the broken transitory state, and all the other
> paths that are following the "text" and storing their blobs with LF
> line endings.  The added check would trigger "is it racy" check to
> re-reads the contents that we have written out from the working tree
> for the paths with "wrong" blobs, but how would it avoid doing so
> for the paths whose blobs are already stored correctly?
>
> The new code affects not just "reset --hard", but everybody who
> writes out from the object database to the working tree and records
> that these paths are checked out in the index.  How does the new
> code avoid destroying the performance for all paths?

I misunderstood the way would_convert_to_git() works, I should have
actually read the code, instead of just relying on my test, which was
even wrong.  Sorry about the confusion, my patch does indeed hurt
the performance.

> > the file stored in git has crlf line endings, and thus would be
> > normalized.  In this case we currently re-check the contents of that
> > file anyway, at least when the file and the index have the same mtime,
> > and we actually show the correct output.
>
> The right way to at that "correct output", I think, is that it
> happens to be shown that way by accident.  It is questionable that
> it is correct to report that such a path is modified.  Immediately
> after you check out a path to the working tree out of the index, via
> "reset --hard" and "checkout $path", by definition it ought to be
> the same between the working tree file and the index.
>
> Unless it is in this transititory broken state, that is.
>
> The "by accident" happens only because racy-git avoidance is being
> implemented in one particular way.  Is about protecting people from
> making changes to the working tree files immediately after their
> previous contents are registered to the index (and the index files
> written to the disk), and immediately after we write things out of
> the index and by definition the result and the indexed blob ought to
> match there is no reason to re-read and recheck unless the working
> tree files are further edited.
>
> The current way the racy-git avoidance works by re-reading and
> re-checking the contents when the timestamps match is merely one
> possible implementation, and that is the only thing that produces
> your "correct" output most of the time in this test.  We could have
> waited after writing the index time for a second before giving the
> control back to the user instead, which would have also allowed us
> to implement the racy-git avoidance, and in that alternate world,
> all these transitory broken paths would have been correctly reported
> as unmodified.
>
> IOW, I would think the test in question is insisting a single
> outcome for an operation whose result is undefined, and it is
> harmful to twist the system by pessimizing the common cases just
> to cater to this transititory broken state.
>
> I am not saying that we shouldn't have support for users to fix
> their repository and get out of this transititory broken state.  A
> recent work by Torsten Bögershausen to have ls-files report the end
> of line convention used in the blob in the index and the settings
> that affect conversion for each path (among other things) is a step
> in the right direction.  With a support like that, those who noticed
> that they by mistake added CRLF files to the index 

Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-24 Thread Thomas Gummerer
On 01/22, Jeff King wrote:
> On Fri, Jan 22, 2016 at 12:52:55AM -0500, Jeff King wrote:
>
> > I get a few of the threads failing (in test 4) after 2-3 minutes. The
> > "-v" output is pretty unenlightening, though. I don't see anything
> > racy-looking in the test unless it is something with "read-tree" and
> > stat mtimes.
>
> And indeed, doing this:
>
> diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
> index c164b46..d34775b 100755
> --- a/t/t0025-crlf-auto.sh
> +++ b/t/t0025-crlf-auto.sh
> @@ -56,6 +56,7 @@ test_expect_success 'text=true causes a CRLF file to be 
> normalized' '
>
>   rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
>   echo "CRLFonly text" > .gitattributes &&
> + sleep 1 &&
>   git read-tree --reset -u HEAD &&
>
>   # Note, "normalized" means that git will normalize it if added
>
> let me run for over 5 minutes with no failures in test 4 (I eventually
> did hit one in test 9). I don't claim to understand what is going on,
> though.

I don't think this is the right solution though, I think this mostly
lessens the load on the filesystem so the flakiness doesn't occur,
especially on your system, where it seems hard to trigger in the first
place :)

I actually hit the same problem occasionally when running the test
suite before, but was always to lazy to try to reproduce it.  Thanks
to your reproduction I think I was able to track the underlying
problem down.

My analysis is in the commit message below.

--->8---
Subject: [PATCH] entry: fix up to date marking

write_entry always marks cache entries up to date when
state->refresh_cache is set.  This is however not always accurate,
if core.autocrlf is set in the config, a file with cr and lf line
endings exists and the file attribute is set to text or crlf in the
gitattributes.

Most notably this makes t0025 flaky.  When calling deleting the files
that will be adjusted through the automated crlf handling, and then
calling `git read-tree --reset -u HEAD`, this leads to a race between
git read-tree and the filesystem.  The test currently only passes
most of the time, because the filesystem usually isn't synced between
the call to unpack_trees() and write_locked_index().

Currently the sequence of 1) remove files with cr and lf as line
endings, 2) `git read-tree --reset -u HEAD` 3) checking the status of
the changed files succeeds, because the index and the files are written
at the same time, so they have the same mtime.  Thus when reading the
index the next time, the files are recognized as racy, and the actual
contents on the disk are checked for changes.

If the index and the files have different mtimes however, the entry is
written to the index as up to date because of the flag set in
entry.c:write_entry(), and the contents on the filesystem are not
actually checked again, because the stat data in the index matches.

The failures in t0025 can be consistently reproduced by introducing a
call to sync() between the call to unpack_trees() and
write_index_locked().

Instead of blindly marking and entry up to date in write_entry(), check
if the contents may change on disk first, and strip the CE_UPTODATE flag
in that case.  Because the flag is not set, the cache entry will go
through the racy check when writing the index the first time, and
smudged if appropriate.

This fixes the flaky test as well as the underlying problem.

Signed-off-by: Thomas Gummerer 
---
 entry.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/entry.c b/entry.c
index 582c400..102fdfa 100644
--- a/entry.c
+++ b/entry.c
@@ -214,6 +214,8 @@ finish:
if (!fstat_done)
lstat(ce->name, );
fill_stat_cache_info(ce, );
+   if (would_convert_to_git(ce->name))
+   ce->ce_flags &= ~CE_UPTODATE;
ce->ce_flags |= CE_UPDATE_IN_BASE;
state->istate->cache_changed |= CE_ENTRY_CHANGED;
}
--
2.7.0.75.g3ee1e0f.dirty
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-24 Thread Junio C Hamano
Thomas Gummerer  writes:

> My analysis is in the commit message below.
>
> --->8---
> Subject: [PATCH] entry: fix up to date marking
>
> write_entry always marks cache entries up to date when
> state->refresh_cache is set.  This is however not always accurate,
> if core.autocrlf is set in the config, a file with cr and lf line
> endings exists and the file attribute is set to text or crlf in the
> gitattributes.
>
> Most notably this makes t0025 flaky.  When calling deleting the files
> that will be adjusted through the automated crlf handling, and then
> calling `git read-tree --reset -u HEAD`, this leads to a race between
> git read-tree and the filesystem.  The test currently only passes
> most of the time, because the filesystem usually isn't synced between
> the call to unpack_trees() and write_locked_index().
>
> Currently the sequence of 1) remove files with cr and lf as line
> endings, 2) `git read-tree --reset -u HEAD` 3) checking the status of
> the changed files succeeds, because the index and the files are written
> at the same time, so they have the same mtime.  Thus when reading the
> index the next time, the files are recognized as racy, and the actual
> contents on the disk are checked for changes.
>
> If the index and the files have different mtimes however, the entry is
> written to the index as up to date because of the flag set in
> entry.c:write_entry(), and the contents on the filesystem are not
> actually checked again, because the stat data in the index matches.
>
> The failures in t0025 can be consistently reproduced by introducing a
> call to sync() between the call to unpack_trees() and
> write_index_locked().
>
> Instead of blindly marking and entry up to date in write_entry(), check
> if the contents may change on disk first, and strip the CE_UPTODATE flag
> in that case.  Because the flag is not set, the cache entry will go
> through the racy check when writing the index the first time, and
> smudged if appropriate.

Sorry, but I am confused by all of the above.

We write the thing out with write_entry(), possibly applying smudge
filters and eol conversion to the "git" representation to create the
"working tree" representation in this codepath, right?  The resulting
file matches what the user's configuration told us to produce.

Until the working tree file is changed by somebody after the above
happens, we shouldn't have to check the contents of the file to see
if there is a difference.  By definition, that has to match the
contents expected to be there by Git.

The only case I can think of that the above does not hold is when
the smuge/clean and the eol conversion are not a correct round-trip
operation pairs, but that would be a misconfiguration.  Otherwise,
we'd be _always_ comparing the contents without relying on the
cached stat info for any paths whose in-core and working tree
representations are different, not just those that are configured
with misbehaving smudge/clean pair, no?

Puzzled...  In this case, my hunch says that the patch is correct,
your analysis also is and it is only me who is missing some crucial
bits in the analysis and getting confused.

Enlightenment, please?

>
> This fixes the flaky test as well as the underlying problem.
>
> Signed-off-by: Thomas Gummerer 
> ---
>  entry.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/entry.c b/entry.c
> index 582c400..102fdfa 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -214,6 +214,8 @@ finish:
>   if (!fstat_done)
>   lstat(ce->name, );
>   fill_stat_cache_info(ce, );
> + if (would_convert_to_git(ce->name))
> + ce->ce_flags &= ~CE_UPTODATE;
>   ce->ce_flags |= CE_UPDATE_IN_BASE;
>   state->istate->cache_changed |= CE_ENTRY_CHANGED;
>   }
> --
> 2.7.0.75.g3ee1e0f.dirty
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-24 Thread Junio C Hamano
Junio C Hamano  writes:

> Sorry, but I am confused by all of the above.
>
> We write the thing out with write_entry(), possibly applying smudge
> filters and eol conversion to the "git" representation to create the
> "working tree" representation in this codepath, right?  The resulting
> file matches what the user's configuration told us to produce.
>
> Until the working tree file is changed by somebody after the above
> happens, we shouldn't have to check the contents of the file to see
> if there is a difference.  By definition, that has to match the
> contents expected to be there by Git.
>
> The only case I can think of that the above does not hold is when
> the smuge/clean and the eol conversion are not a correct round-trip
> operation pairs, but that would be a misconfiguration.  Otherwise,
> we'd be _always_ comparing the contents without relying on the
> cached stat info for any paths whose in-core and working tree
> representations are different, not just those that are configured
> with misbehaving smudge/clean pair, no?

To put it differently, if a blob stored at path has CRLF line
endings and .gitattributes is changed after the fact to say that it
must have LF line endings, we should treat it as a broken transitory
state.  There may have to be a way to "fix" an already "wrong" blob
in the index that is milder than "rm --cached && add .", but I do
not think write_entry(), which is shared by all the normal codepaths
that writes out to the working tree, is the best place to do so, if
doing so forces the contents of the paths to be always re-checked,
just in case the user is in such a broken transitory state.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-21 Thread brian m. carlson
On Wed, Jan 20, 2016 at 10:22:16AM +0100, Lars Schneider wrote:
> I tested different settings and found that running prove with "-j5" seems to 
> be
> the fastest option for the Travis CI machines. However, I also noticed that
> I got more test failures with higher parallelism (Dscho reported similar
> observations [1]).
> 
> Especially t0025-crlf-auto.sh failed multiple times ([2], [3]) on the OS X 
> builds
> when I increase the parallelism:
> 
> not ok 4 - text=true causes a CRLF file to be normalized
> not ok 9 - text=auto, autocrlf=true _does_ normalize CRLF files
> 
> Anyone an idea why that might be the case?

I've seen this on my personal box too[0] when running make -j4 all test.
I wasn't able to pin down why it was occurring, but if we're going to
run the tests in parallel, it's probably worth spending some time
figuring it out.

[0] Debian amd64/sid, ThinkPad X220.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


signature.asc
Description: PGP signature


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-21 Thread Jeff King
On Fri, Jan 22, 2016 at 12:52:55AM -0500, Jeff King wrote:

> I get a few of the threads failing (in test 4) after 2-3 minutes. The
> "-v" output is pretty unenlightening, though. I don't see anything
> racy-looking in the test unless it is something with "read-tree" and
> stat mtimes.

And indeed, doing this:

diff --git a/t/t0025-crlf-auto.sh b/t/t0025-crlf-auto.sh
index c164b46..d34775b 100755
--- a/t/t0025-crlf-auto.sh
+++ b/t/t0025-crlf-auto.sh
@@ -56,6 +56,7 @@ test_expect_success 'text=true causes a CRLF file to be 
normalized' '
 
rm -f .gitattributes tmp LFonly CRLFonly LFwithNUL &&
echo "CRLFonly text" > .gitattributes &&
+   sleep 1 &&
git read-tree --reset -u HEAD &&
 
# Note, "normalized" means that git will normalize it if added

let me run for over 5 minutes with no failures in test 4 (I eventually
did hit one in test 9). I don't claim to understand what is going on,
though.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

2016-01-20 Thread Junio C Hamano
Lars Schneider  writes:

> On 19 Jan 2016, at 21:00, Junio C Hamano  wrote:
>
>> IOW, I am confused by the beginning of the log message that says
>> this is taking advantage of "the Travis-CI cache feature".  This
>> improvement looks to me like using the feature of "prove" that
>> allows us to run slower tests first, and does not have much to do
>> with Travis.
>> 
>> You are relying on the assumption that things under $HOME/ is stable
>> while things under t/ (or in our source tree in general) are not,
>> and I think that is a sensible thing to take advantage of, but are
>> we sure that they are running in an environment where "ln -s" would
>> work?  Otherwise, it may be more robust to copy $HOME/.prove to
>> t/.prove before starting to test and then copy it back once the
>> tests are done.
>
> OK, looks like my wording was not ideal. One important thing to know is that 
> $HOME is *not* stable. These TravisCI machines start *always* in a completely 
> clean state.

Ah, that is what I missed.  Travis makes everything transient by
default (which is a sensible thing to do for CI), but it lets you
declare some things are to be made stable, and that is the "cache"
feature you are taking advantage of in Travis.

The log message needs to be clarified in a reroll, but thanks for
clarifying it for me in advance ;-)

That only leaves one question from me: Is 'ln -s' safe enough?
Would copying back and forth make it more robust?

I am guessint the answers are Yes and No, in which case the patch
text can (and should) stay as-is.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html