Re: GIT get corrupted on lustre
Hi, On 01/23/2013 01:34 PM, Sébastien Boisvert wrote: Hello, Here is a patch (with git format-patch) that removes any timer if NO_SETITIMER is set. Even with the patch, I finally got an error... :-/ Here are the log (strace -f) of a clean execution and one with the error: http://www.giref.ulaval.ca/~ericc/NO_SETITIMER-patch_bin_git_noerror.txt.gz http://www.giref.ulaval.ca/~ericc/NO_SETITIMER-patch_bin_git_with_error.txt.gz Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Hello, Here is a patch (with git format-patch) that removes any timer if NO_SETITIMER is set. Éric: To test it with your workflow: $ module load apps/git/1.8.1.1.348.g78eb407-NO_SETITIMER-patch $ git clone ... Sébastien On 01/22/2013 05:14 PM, Thomas Rast wrote: Eric Chamberland writes: So, hum, do we have some sort of conclusion? Shall it be a fix for git to get around that lustre "behavior"? If something can be done in git it would be great: it is a *lot* easier to change git than the lustre filesystem software for a cluster in running in production mode... (words from cluster team) :-/ I thought you already established that simply disabling the progress display is a sufficient workaround? If that doesn't help, you can try patching out all use of SIGALRM within git. Other than that I agree with Junio, from what we've seen so far, Lustre returns EINTR on all sorts of calls that simply aren't allowed to do so. -- --- Spécialiste en granularité (1 journée / semaine) Calcul Québec / Calcul Canada Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada >From 78eb4075d98eb9cdc57210c63b8d8de8a3d0cd9e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?S=C3=A9bastien=20Boisvert?= Date: Wed, 23 Jan 2013 13:10:57 -0500 Subject: [PATCH] don't use timers if NO_SETITIMER is set MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit With NO_SETITIMER, the user experience on legacy Lustre is fixed, but there is no early progress. The patch has no effect on the resulting git executable if NO_SETITIMER is not set (the default). So by default this patch has no effect at all, which is good. git tests: $ make clean $ make NO_SETITIMER=YesPlease $ make test NO_SETITIMER=YesPlease &> make-test.log $ grep "^not ok" make-test.log |grep -v "# TODO known breakage"|wc -l 0 $ grep "^ok" make-test.log |wc -l 9531 $ grep "^not ok" make-test.log |wc -l 65 No timers with NO_SETITIMER: $ objdump -d ./git|grep setitimer|wc -l 0 $ objdump -d ./git|grep alarm|wc -l 0 Timers without NO_SETITIMER: $ objdump -d /software/apps/git/1.8.1/bin/git|grep setitimer|wc -l 5 $ objdump -d /software/apps/git/1.8.1/bin/git|grep alarm|wc -l 0 Signed-off-by: Sébastien Boisvert --- builtin/log.c |7 +++ daemon.c |6 ++ progress.c|8 upload-pack.c |2 ++ 4 files changed, 23 insertions(+), 0 deletions(-) diff --git a/builtin/log.c b/builtin/log.c index 8f0b2e8..f8321c7 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -198,7 +198,9 @@ static void show_early_header(struct rev_info *rev, const char *stage, int nr) printf(_("Final output: %d %s\n"), nr, stage); } +#ifndef NO_SETITIMER static struct itimerval early_output_timer; +#endif static void log_show_early(struct rev_info *revs, struct commit_list *list) { @@ -240,9 +242,12 @@ static void log_show_early(struct rev_info *revs, struct commit_list *list) * trigger every second even if we're blocked on a * reader! */ + + #ifndef NO_SETITIMER early_output_timer.it_value.tv_sec = 0; early_output_timer.it_value.tv_usec = 50; setitimer(ITIMER_REAL, &early_output_timer, NULL); + #endif } static void early_output(int signal) @@ -274,9 +279,11 @@ static void setup_early_output(struct rev_info *rev) * * This is a one-time-only trigger. */ + #ifndef NO_SETITIMER early_output_timer.it_value.tv_sec = 0; early_output_timer.it_value.tv_usec = 10; setitimer(ITIMER_REAL, &early_output_timer, NULL); + #endif } static void finish_early_output(struct rev_info *rev) diff --git a/daemon.c b/daemon.c index 4602b46..eb82c19 100644 --- a/daemon.c +++ b/daemon.c @@ -611,9 +611,15 @@ static int execute(void) if (addr) loginfo("Connection from %s:%s", addr, port); + #ifndef NO_SETITIMER alarm(init_timeout ? init_timeout : timeout); + #endif + pktlen = packet_read_line(0, line, sizeof(line)); + + #ifndef NO_SETITIMER alarm(0); + #endif len = strlen(line); if (pktlen != len) diff --git a/progress.c b/progress.c index 3971f49..b84ccc7 100644 --- a/progress.c +++ b/progress.c @@ -45,7 +45,10 @@ static void progress_interval(int signum) static void set_progress_signal(void) { struct sigaction sa; + + #ifndef NO_SETITIMER struct itimerval v; + #endif progress_update = 0; @@ -55,16 +58,21 @@ static void set_progress_signal(void) sa.sa_flags = SA_RESTART; sigaction(SIGALRM, &sa, NULL); + #ifndef NO_SETITIMER v.it_interval.tv_sec = 1; v.it_interval.tv_usec = 0; v.it_value = v.it_interval; setitimer(ITIMER_REAL, &v, NULL); + #endif } static void clear_progress_signal(void) { + #ifndef NO_SETITIMER struct itimerval v = {{0,},}; setitimer(ITIMER_REAL, &v, NULL); + #endif + signal(SIGALRM, SIG_IGN); progress_update = 0; } diff --git a/upload-pack.c b/upload-pack.c index 95d8313..e0b8b32 100644 --- a/upload-pack.c +++ b/upload-pack.c @@ -47,7 +47,9 @@ static int stateless_rpc; static void
Re: GIT get corrupted on lustre
Thomas Rast wrote: > Taken together this should mean that the bug is in fact simply that the > calls do not *restart*. They are (like you say) allowed to return EINTR > despite not being specified to, *but* SA_RESTART should restart it. > > Now, does that make it a lustre bug or a glibc bug? :-) The kernel takes care of SA_RESTART, if I remember correctly. (See arch/x86/kernel/signal.c::handle_signal() case -ERESTARTSYS.) -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On Wed, Jan 23, 2013 at 4:44 PM, Thomas Rast wrote: > Erik Faye-Lund writes: > >> On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast wrote: >>> Erik Faye-Lund writes: >>> POSIX allows error codes to be generated other than those defined. From http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html: "Implementations may support additional errors not included in this list, *may generate errors included in this list under circumstances other than those described here*, or may contain extensions or limitations that prevent some errors from occurring." >>> >>> That same page says, however: >>> >>> For functions under the Threads option for which [EINTR] is not listed >>> as a possible error condition in this volume of IEEE Std 1003.1-2001, >>> an implementation shall not return an error code of [EINTR]. >> >> Yes, but surely that's for pthreads functions, no? utime is not one of >> those functions... > > Ah, my bad. In fact in > > http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html > > there is a paragraph "Signal Effects on Other Functions", which says > > > > Taken together this should mean that the bug is in fact simply that the > calls do not *restart*. They are (like you say) allowed to return EINTR > despite not being specified to, *but* SA_RESTART should restart it. > Right, thanks for clearing that up. > Now, does that make it a lustre bug or a glibc bug? :-) That's kind of uninteresting, the important bit is that it is indeed a bug (outside of Git). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Erik Faye-Lund writes: > On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast wrote: >> Erik Faye-Lund writes: >> >>> POSIX allows error codes >>> to be generated other than those defined. From >>> http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html: >>> >>> "Implementations may support additional errors not included in this >>> list, *may generate errors included in this list under circumstances >>> other than those described here*, or may contain extensions or >>> limitations that prevent some errors from occurring." >> >> That same page says, however: >> >> For functions under the Threads option for which [EINTR] is not listed >> as a possible error condition in this volume of IEEE Std 1003.1-2001, >> an implementation shall not return an error code of [EINTR]. > > Yes, but surely that's for pthreads functions, no? utime is not one of > those functions... Ah, my bad. In fact in http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html there is a paragraph "Signal Effects on Other Functions", which says The most common behavior of an interrupted function after a signal-catching function returns is for the interrupted function to give an [EINTR] error unless the SA_RESTART flag is in effect for the signal. However, there are a number of specific exceptions, including sleep() and certain situations with read() and write(). The historical implementations of many functions defined by IEEE Std 1003.1-2001 are not interruptible[...] Functions not mentioned explicitly as interruptible may be so on some implementations, possibly as an extension where the function gives an [EINTR] error. There are several functions (for example, getpid(), getuid()) that are specified as never returning an error, which can thus never be extended in this way. If a signal-catching function returns while the SA_RESTART flag is in effect, an interrupted function is restarted at the point it was interrupted. Conforming applications cannot make assumptions about the internal behavior of interrupted functions, even if the functions are async-signal-safe. For example, suppose the read() function is interrupted with SA_RESTART in effect, the signal-catching function closes the file descriptor being read from and returns, and the read() function is then restarted; in this case the application cannot assume that the read() function will give an [EBADF] error, since read() might have checked the file descriptor for validity before being interrupted. Taken together this should mean that the bug is in fact simply that the calls do not *restart*. They are (like you say) allowed to return EINTR despite not being specified to, *but* SA_RESTART should restart it. Now, does that make it a lustre bug or a glibc bug? :-) -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On Wed, Jan 23, 2013 at 4:32 PM, Thomas Rast wrote: > Erik Faye-Lund writes: > >> On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast wrote: >>> Eric Chamberland writes: >>> >>> Other than that I agree with Junio, from what we've seen so far, Lustre >>> returns EINTR on all sorts of calls that simply aren't allowed to do so. >> >> I don't think this analysis is 100% accurate, POSIX allows error codes >> to be generated other than those defined. From >> http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html: >> >> "Implementations may support additional errors not included in this >> list, *may generate errors included in this list under circumstances >> other than those described here*, or may contain extensions or >> limitations that prevent some errors from occurring." > > That same page says, however: > > For functions under the Threads option for which [EINTR] is not listed > as a possible error condition in this volume of IEEE Std 1003.1-2001, > an implementation shall not return an error code of [EINTR]. Yes, but surely that's for pthreads functions, no? utime is not one of those functions... -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Erik Faye-Lund writes: > On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast wrote: >> Eric Chamberland writes: >> >> Other than that I agree with Junio, from what we've seen so far, Lustre >> returns EINTR on all sorts of calls that simply aren't allowed to do so. > > I don't think this analysis is 100% accurate, POSIX allows error codes > to be generated other than those defined. From > http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html: > > "Implementations may support additional errors not included in this > list, *may generate errors included in this list under circumstances > other than those described here*, or may contain extensions or > limitations that prevent some errors from occurring." That same page says, however: For functions under the Threads option for which [EINTR] is not listed as a possible error condition in this volume of IEEE Std 1003.1-2001, an implementation shall not return an error code of [EINTR]. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On Tue, Jan 22, 2013 at 11:14 PM, Thomas Rast wrote: > Eric Chamberland writes: > > Other than that I agree with Junio, from what we've seen so far, Lustre > returns EINTR on all sorts of calls that simply aren't allowed to do so. I don't think this analysis is 100% accurate, POSIX allows error codes to be generated other than those defined. From http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_03.html: "Implementations may support additional errors not included in this list, *may generate errors included in this list under circumstances other than those described here*, or may contain extensions or limitations that prevent some errors from occurring." So I don't think Lustre violates POSIX by erroring with errno=EINTR, but I also think guarding every single function call for EINTR just to be safe to be insane :) However, looking at Eric's log, I can't see that being what has happened here - grepping it for EINTR does not produce a single match. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
[I forgot to subscribe to the git mailing list, sorry for that] On 01/22/2013 05:14 PM, Thomas Rast wrote: Eric Chamberland writes: So, hum, do we have some sort of conclusion? Shall it be a fix for git to get around that lustre "behavior"? If something can be done in git it would be great: it is a *lot* easier to change git than the lustre filesystem software for a cluster in running in production mode... (words from cluster team) :-/ I thought you already established that simply disabling the progress display is a sufficient workaround? If that doesn't help, you can try patching out all use of SIGALRM within git. In git (9591fcc6d66), I have found these SIGALRM signal handling: builtin/log.c:268:sigaction(SIGALRM, &sa, NULL); builtin/log.c:285:signal(SIGALRM, SIG_IGN); compat/mingw.c:1590:mingw_raise(SIGALRM); compat/mingw.c:1666:if (sig != SIGALRM) compat/mingw.c:1668:error("sigaction only implemented for SIGALRM"); compat/mingw.c:1683:case SIGALRM: compat/mingw.c:1702:case SIGALRM: compat/mingw.c:1706:exit(128 + SIGALRM); compat/mingw.c:1708:timer_fn(SIGALRM); compat/mingw.h:42:#define SIGALRM 14 perl/Git/SVN.pm:2121:SIGALRM, SIGUSR1, SIGUSR2); progress.c:56:sigaction(SIGALRM, &sa, NULL); progress.c:68:signal(SIGALRM, SIG_IGN); I suppose that compat/mingw.{h,c} and SVN.pm can be ignored as our patch to work around this problem won't be pushed upstream because the real problem is not in git, right ? If I understand correctly, some VFS system calls get interrupted by SIGALRM, but when they resume (via SA_RESTART) they return EINTR. Thomas said that these failed calls may need to be retried, but that open(O_CREAT|O_EXCL) is still tricky around this case. progress.c SIGALRM code paths are for progress and therefore are required, right ? builtin/log.c SIGALRM code paths are for early output, and the comments in the code say that "If we can get the whole output in less than a tenth of a second, don't even bother doing the early-output thing." So where do I start for the patch ? Other than that I agree with Junio, from what we've seen so far, Lustre returns EINTR on all sorts of calls that simply aren't allowed to do so. -- --- Spécialiste en granularité (1 journée / semaine) Calcul Québec / Calcul Canada Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On 01/22/2013 05:14 PM, Thomas Rast wrote: Eric Chamberland writes: So, hum, do we have some sort of conclusion? Shall it be a fix for git to get around that lustre "behavior"? If something can be done in git it would be great: it is a *lot* easier to change git than the lustre filesystem software for a cluster in running in production mode... (words from cluster team) :-/ I thought you already established that simply disabling the progress display is a sufficient workaround? If that doesn't help, you can try patching out all use of SIGALRM within git. In git (9591fcc6d66), I have found these SIGALRM signal handling: builtin/log.c:268: sigaction(SIGALRM, &sa, NULL); builtin/log.c:285: signal(SIGALRM, SIG_IGN); compat/mingw.c:1590:mingw_raise(SIGALRM); compat/mingw.c:1666:if (sig != SIGALRM) compat/mingw.c:1668:error("sigaction only implemented for SIGALRM"); compat/mingw.c:1683:case SIGALRM: compat/mingw.c:1702:case SIGALRM: compat/mingw.c:1706:exit(128 + SIGALRM); compat/mingw.c:1708:timer_fn(SIGALRM); compat/mingw.h:42:#define SIGALRM 14 perl/Git/SVN.pm:2121: SIGALRM, SIGUSR1, SIGUSR2); progress.c:56: sigaction(SIGALRM, &sa, NULL); progress.c:68: signal(SIGALRM, SIG_IGN); I suppose that compat/mingw.{h,c} and SVN.pm can be ignored as our patch to work around this problem won't be pushed upstream because the real problem is not in git, right ? If I understand correctly, some VFS system calls get interrupted by SIGALRM, but when they resume (via SA_RESTART) they return EINTR. Thomas said that these failed calls may need to be retried, but that open(O_CREAT|O_EXCL) is still tricky around this case. progress.c SIGALRM code paths are for progress and therefore are required, right ? builtin/log.c SIGALRM code paths are for early output, and the comments in the code say that "If we can get the whole output in less than a tenth of a second, don't even bother doing the early-output thing." So where do I start for the patch ? Other than that I agree with Junio, from what we've seen so far, Lustre returns EINTR on all sorts of calls that simply aren't allowed to do so. -- --- Spécialiste en granularité (1 journée / semaine) Calcul Québec / Calcul Canada Pavillon Adrien-Pouliot, Université Laval, Québec (Québec), Canada -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On 01/22/2013 05:14 PM, Thomas Rast wrote: Eric Chamberland writes: So, hum, do we have some sort of conclusion? Shall it be a fix for git to get around that lustre "behavior"? If something can be done in git it would be great: it is a *lot* easier to change git than the lustre filesystem software for a cluster in running in production mode... (words from cluster team) :-/ I thought you already established that simply disabling the progress display is a sufficient workaround? If that doesn't help, you can try patching out all use of SIGALRM within git. I tried that solution after Brian told me to try it, but it didn't solved the problem for me! :-( Other than that I agree with Junio, from what we've seen so far, Lustre returns EINTR on all sorts of calls that simply aren't allowed to do so. Ok, so now the "good" move would be to have all this reported to lustre development team? Brian, have you seen anything new from what you have already reported? I have to admit that I am not a fs expert... And I also agree with Junio point of view: The problem may impact mission critical applications Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Eric Chamberland writes: > So, hum, do we have some sort of conclusion? > > Shall it be a fix for git to get around that lustre "behavior"? > > If something can be done in git it would be great: it is a *lot* > easier to change git than the lustre filesystem software for a cluster > in running in production mode... (words from cluster team) :-/ I thought you already established that simply disabling the progress display is a sufficient workaround? If that doesn't help, you can try patching out all use of SIGALRM within git. Other than that I agree with Junio, from what we've seen so far, Lustre returns EINTR on all sorts of calls that simply aren't allowed to do so. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Eric Chamberland writes: > So, hum, do we have some sort of conclusion? > > Shall it be a fix for git to get around that lustre "behavior"? > > If something can be done in git it would be great: it is a *lot* > easier to change git than the lustre filesystem software for a cluster > in running in production mode... (words from cluster team) :-/ Do we know the real cause of the symptom? I did not follow the thread carefully, but the impression I was getting was that the filesystem is broken around EINTR, and even if you "fix"ed Git, your other more mission critical applications will be broken by the same filesystem behaviour, no? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
So, hum, do we have some sort of conclusion? Shall it be a fix for git to get around that lustre "behavior"? If something can be done in git it would be great: it is a *lot* easier to change git than the lustre filesystem software for a cluster in running in production mode... (words from cluster team) :-/ I hope this subject will not die in the list... :-/ Thanks, Eric On 01/21/2013 02:29 PM, Thomas Rast wrote: Please don't drop the Cc list! "Brian J. Murrell" writes: What's odd is that while I cannot reproduce the original problem, there seems to be another issue/bug with utime(): I wonder if this is related to http://jira.whamcloud.com/browse/LU-305. That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I saw it on 2.1.1 and added a comment to the above ticket about that. Aha, that's a very interesting bug report. My observations support yours: I managed to get EINTR during utime(). In the absence of it, wouldn't we in theory have to write a simple loop-on-EINTR wrapper for *all* syscalls? IIUC, that's what SA_RESTART is all about. Yes, but there's precious little clear language on when SA_RESTART is supposed to act. In all cases? The wording on http://www.delorie.com/gnu/docs/glibc/libc_485.html http://www.delorie.com/gnu/docs/glibc/libc_498.html leads me to believe that SA_RESTART is actually used on the glibc side of things, so that any glibc syscall wrapper not specifically equipped with the restarting behavior would return EINTR unmodified. This might explain why utime() doesn't restart like it should (assuming we work on the theory that POSIX doesn't allow an EINTR from utime() to begin with). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Please don't drop the Cc list! "Brian J. Murrell" writes: >> What's odd is that while I cannot reproduce the original problem, there >> seems to be another issue/bug with utime(): > > I wonder if this is related to http://jira.whamcloud.com/browse/LU-305. > That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I > saw it on 2.1.1 and added a comment to the above ticket about that. Aha, that's a very interesting bug report. My observations support yours: I managed to get EINTR during utime(). >> In the absence of it, wouldn't we in theory have to write a simple >> loop-on-EINTR wrapper for *all* syscalls? > > IIUC, that's what SA_RESTART is all about. Yes, but there's precious little clear language on when SA_RESTART is supposed to act. In all cases? The wording on http://www.delorie.com/gnu/docs/glibc/libc_485.html http://www.delorie.com/gnu/docs/glibc/libc_498.html leads me to believe that SA_RESTART is actually used on the glibc side of things, so that any glibc syscall wrapper not specifically equipped with the restarting behavior would return EINTR unmodified. This might explain why utime() doesn't restart like it should (assuming we work on the theory that POSIX doesn't allow an EINTR from utime() to begin with). -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On 13-01-21 11:11 AM, Thomas Rast wrote: > > What's odd is that while I cannot reproduce the original problem, there > seems to be another issue/bug with utime(): I wonder if this is related to http://jira.whamcloud.com/browse/LU-305. That was reported as fixed in Lustre 2.0.0 and 2.1.0 but I thought I saw it on 2.1.1 and added a comment to the above ticket about that. > In the absence of it, wouldn't we in theory have to write a simple > loop-on-EINTR wrapper for *all* syscalls? IIUC, that's what SA_RESTART is all about. > Of course there's the added problem that when open(O_CREAT|O_EXCL) fails > with EINTR, it's hard to tell whether a file that may now exist is > indeed yours or some other process's. Or whether it's in a "half created" state such as I hypothesize in http://jira.whamcloud.com/browse/LU-2276. b. signature.asc Description: OpenPGP digital signature
Re: GIT get corrupted on lustre
On 01/21/2013 12:07 PM, Eric Chamberland wrote: Hi, It just happened again. Now I have the "strace -f" output gzipped here: http://www.giref.ulaval.ca/~ericc/strace-f_git_error.txt.gz I added the "strace -f" output when non error occurs... http://www.giref.ulaval.ca/~ericc/strace-f_git_no_error.txt.gz a "kdiff3" can show the differences just before the error... Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Hi, It just happened again. Now I have the "strace -f" output gzipped here: http://www.giref.ulaval.ca/~ericc/strace-f_git_error.txt.gz thanks, Eric On 01/21/2013 08:29 AM, Erik Faye-Lund wrote: On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland wrote: Good idea! I did a strace and here is the output with the error: http://www.giref.ulaval.ca/~ericc/strace_git_error.txt Hope it will be insightful! This trace doesn't seem to contain child-processes, but instead having their stderr inlined into the log. Try using "strace -f" instead... -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Maxime Boissonneault writes: > Hi Thomas, > Can you tell me what is the version of the lustre servers and the > lustre clients ? $ uname -a Linux brutus4.ethz.ch 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux $ cat /proc/fs/lustre/version lustre: 2.3.0 kernel: patchless_client build: 2.3.0-RC6--PRISTINE-2.6.32-279.14.1.el6.x86_64 I have no idea what the servers are running, I only have client access. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Hi Thomas, Can you tell me what is the version of the lustre servers and the lustre clients ? Thanks, Maxime Boissonneault HPC specialist @ Calcul Québec. Le 2013-01-21 11:11, Thomas Rast a écrit : Erik Faye-Lund writes: On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland wrote: Good idea! I did a strace and here is the output with the error: http://www.giref.ulaval.ca/~ericc/strace_git_error.txt Hope it will be insightful! This trace doesn't seem to contain child-processes, but instead having their stderr inlined into the log. Try using "strace -f" instead... I happen to have access to a lustre FS on the brutus cluster of ETH Zurich, so I figured I could give it a shot. What's odd is that while I cannot reproduce the original problem, there seems to be another issue/bug with utime(): $ strace -f -o ~/gc.trace git gc warning: failed utime() on .git/objects/68/tmp_obj_sCAEVc: Interrupted system call warning: failed utime() on .git/objects/a6/tmp_obj_3cdB2c: Interrupted system call warning: failed utime() on .git/objects/69/tmp_obj_lbU3Xc: Interrupted system call warning: failed utime() on .git/objects/c3/tmp_obj_EU97Wc: Interrupted system call warning: failed utime() on .git/objects/3e/tmp_obj_tb2j3c: Interrupted system call warning: failed utime() on .git/objects/15/tmp_obj_e6zMXc: Interrupted system call warning: failed utime() on .git/objects/54/tmp_obj_ExOJVc: Interrupted system call warning: failed utime() on .git/objects/e3/tmp_obj_GtPw4c: Interrupted system call warning: failed utime() on .git/objects/21/tmp_obj_Xex32c: Interrupted system call warning: failed utime() on .git/objects/1a/tmp_obj_CzwsZc: Interrupted system call warning: failed utime() on .git/objects/18/tmp_obj_o6fp3c: Interrupted system call warning: failed utime() on .git/objects/32/tmp_obj_Ih0G4c: Interrupted system call warning: failed utime() on .git/objects/41/tmp_obj_1RXV1c: Interrupted system call Counting objects: 137744, done. Delta compression using up to 48 threads. Compressing objects: 100% (36510/36510), done. Writing objects: 100% (137744/137744), done. Total 137744 (delta 101591), reused 135512 (delta 99472) The trace is here (2.1MB compressed): http://thomasrast.ch/download/gc.trace.bz2 For the test I used a clone of another git.git I had around. I think the error is from sha1_file.c:2564. While that doesn't look too important (see ca11b212 for context), it does raise the question: what other system calls that we never expect to EINTR can do this on sufficiently arcane system/FS combinations? Peff's test ran without any apparent issue for a few minutes. I also ran an extended version (at the end) that sets alarms, so as to actually get interrupted. That proved more interesting. I had to fix verify() and write_in_full() to account for EINTR in read()/write(), as those seem likely to fail. I also got link() to fail: $ ~/lustre-peff-reproducer unable to create hard link: Interrupted system call unable to open index file: No such file or directory but it took a long time. Unfortunately, when running it with strace I managed to panic the host I ran it on: $ strace -o ~/peff-reproducer.trace ~/lustre-peff-reproducer Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) LBUG Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:Kernel panic - not syncing: LBUG Yay for now having to explain this to the cluster team. I tried finding a standard that limits the syscalls to which EINTR applies, without too much success. I'm not sure how far I should trust my manpages, but while some of them explicitly list EINTR as a possible error (read, write, etc.) link() does not. (And the linux manpages agree with the POSOIX ones for once.) If somebody finds such a standard, we could of course use it to blame lustre instead :-) In the absence of it, wouldn't we in theory have to write a simple loop-on-EINTR wrapper for *all* syscalls? Of course there's the added problem that when open(O_CREAT|O_EXCL) fails with EINTR, it's hard to tell whether a file that may now exist is indeed yours or some other process's. --- 8< #include #include #include #include #include #include #include #include struct itimerval itv; static int randomize(unsigned char *buf, int len) { int i; len = rand() % len; for (i = 0; i < len; i++) buf[i] = rand() & 0xff; return len; } static int check_eof(int fd) { int ch; int r = read(fd, &ch, 1); if (r < 0) { perror("read error after expected EOF"); return -1; } if (r > 0) { fprintf(stderr, "extra byte after expected EOF"); return -1; } return 0; } static int
Re: GIT get corrupted on lustre
Erik Faye-Lund writes: > On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland > wrote: >> Good idea! >> >> I did a strace and here is the output with the error: >> >> http://www.giref.ulaval.ca/~ericc/strace_git_error.txt >> >> Hope it will be insightful! > > This trace doesn't seem to contain child-processes, but instead having > their stderr inlined into the log. Try using "strace -f" instead... I happen to have access to a lustre FS on the brutus cluster of ETH Zurich, so I figured I could give it a shot. What's odd is that while I cannot reproduce the original problem, there seems to be another issue/bug with utime(): $ strace -f -o ~/gc.trace git gc warning: failed utime() on .git/objects/68/tmp_obj_sCAEVc: Interrupted system call warning: failed utime() on .git/objects/a6/tmp_obj_3cdB2c: Interrupted system call warning: failed utime() on .git/objects/69/tmp_obj_lbU3Xc: Interrupted system call warning: failed utime() on .git/objects/c3/tmp_obj_EU97Wc: Interrupted system call warning: failed utime() on .git/objects/3e/tmp_obj_tb2j3c: Interrupted system call warning: failed utime() on .git/objects/15/tmp_obj_e6zMXc: Interrupted system call warning: failed utime() on .git/objects/54/tmp_obj_ExOJVc: Interrupted system call warning: failed utime() on .git/objects/e3/tmp_obj_GtPw4c: Interrupted system call warning: failed utime() on .git/objects/21/tmp_obj_Xex32c: Interrupted system call warning: failed utime() on .git/objects/1a/tmp_obj_CzwsZc: Interrupted system call warning: failed utime() on .git/objects/18/tmp_obj_o6fp3c: Interrupted system call warning: failed utime() on .git/objects/32/tmp_obj_Ih0G4c: Interrupted system call warning: failed utime() on .git/objects/41/tmp_obj_1RXV1c: Interrupted system call Counting objects: 137744, done. Delta compression using up to 48 threads. Compressing objects: 100% (36510/36510), done. Writing objects: 100% (137744/137744), done. Total 137744 (delta 101591), reused 135512 (delta 99472) The trace is here (2.1MB compressed): http://thomasrast.ch/download/gc.trace.bz2 For the test I used a clone of another git.git I had around. I think the error is from sha1_file.c:2564. While that doesn't look too important (see ca11b212 for context), it does raise the question: what other system calls that we never expect to EINTR can do this on sufficiently arcane system/FS combinations? Peff's test ran without any apparent issue for a few minutes. I also ran an extended version (at the end) that sets alarms, so as to actually get interrupted. That proved more interesting. I had to fix verify() and write_in_full() to account for EINTR in read()/write(), as those seem likely to fail. I also got link() to fail: $ ~/lustre-peff-reproducer unable to create hard link: Interrupted system call unable to open index file: No such file or directory but it took a long time. Unfortunately, when running it with strace I managed to panic the host I ran it on: $ strace -o ~/peff-reproducer.trace ~/lustre-peff-reproducer Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) LBUG Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:Kernel panic - not syncing: LBUG Yay for now having to explain this to the cluster team. I tried finding a standard that limits the syscalls to which EINTR applies, without too much success. I'm not sure how far I should trust my manpages, but while some of them explicitly list EINTR as a possible error (read, write, etc.) link() does not. (And the linux manpages agree with the POSOIX ones for once.) If somebody finds such a standard, we could of course use it to blame lustre instead :-) In the absence of it, wouldn't we in theory have to write a simple loop-on-EINTR wrapper for *all* syscalls? Of course there's the added problem that when open(O_CREAT|O_EXCL) fails with EINTR, it's hard to tell whether a file that may now exist is indeed yours or some other process's. --- 8< #include #include #include #include #include #include #include #include struct itimerval itv; static int randomize(unsigned char *buf, int len) { int i; len = rand() % len; for (i = 0; i < len; i++) buf[i] = rand() & 0xff; return len; } static int check_eof(int fd) { int ch; int r = read(fd, &ch, 1); if (r < 0) { perror("read error after expected EOF"); return -1; } if (r > 0) { fprintf(stderr, "extra byte after expected EOF"); return -1; } return 0; } static int verify(int fd, const unsigned char *buf, int len) { while (len) { char to_check[4096]; int got = read(fd, to_check, len < sizeof(to_check) ?
Re: GIT get corrupted on lustre
On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland wrote: > Good idea! > > I did a strace and here is the output with the error: > > http://www.giref.ulaval.ca/~ericc/strace_git_error.txt > > Hope it will be insightful! This trace doesn't seem to contain child-processes, but instead having their stderr inlined into the log. Try using "strace -f" instead... -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Good idea! I did a strace and here is the output with the error: http://www.giref.ulaval.ca/~ericc/strace_git_error.txt Hope it will be insightful! Eric On 01/17/2013 12:17 PM, Pyeron, Jason J CTR (US) wrote: Sorry, I am in cygwin mode, and I had crossed wires in my head. s/ProcessMon/strace/ -Original Message- From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On Behalf Of Maxime Boissonneault Sent: Thursday, January 17, 2013 11:41 AM To: Pyeron, Jason J CTR (US) Cc: Eric Chamberland; Philippe Vaucher; git@vger.kernel.org; Sébastien Boisvert Subject: Re: GIT get corrupted on lustre I don't know of any lustre filesystem that is used on Windows. Barely anybody uses Windows in the HPC industry. This is a Linux cluster. Maxime Boissonneault Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit : -Original Message- From: Eric Chamberland Sent: Thursday, January 17, 2013 11:31 AM On 01/17/2013 09:23 AM, Philippe Vaucher wrote: Anyone has a new idea? Did you try Jeff King's code to confirm his idea? Philippe Yes I did, but it was running without any problem I find that my test case is "simple" (fresh git clone then "git gc" in a crontab), I bet anyone who has access to a Lustre filesystem can reproduce the problem... The problem is to have such a filesystem to do the tests Stabbing in the dark, but can you log the details with ProcessMon? http://technet.microsoft.com/en-us/sysinternals/bb896645 But I am available to do it... -Jason -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: GIT get corrupted on lustre
Sorry, I am in cygwin mode, and I had crossed wires in my head. s/ProcessMon/strace/ > -Original Message- > From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On > Behalf Of Maxime Boissonneault > Sent: Thursday, January 17, 2013 11:41 AM > To: Pyeron, Jason J CTR (US) > Cc: Eric Chamberland; Philippe Vaucher; git@vger.kernel.org; Sébastien > Boisvert > Subject: Re: GIT get corrupted on lustre > > I don't know of any lustre filesystem that is used on Windows. Barely > anybody uses Windows in the HPC industry. > This is a Linux cluster. > > Maxime Boissonneault > > Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit : > >> -Original Message- > >> From: Eric Chamberland > >> Sent: Thursday, January 17, 2013 11:31 AM > >> > >> On 01/17/2013 09:23 AM, Philippe Vaucher wrote: > >>>> Anyone has a new idea? > >>> Did you try Jeff King's code to confirm his idea? > >>> > >>> Philippe > >>> > >> Yes I did, but it was running without any problem > >> > >> I find that my test case is "simple" (fresh git clone then "git gc" > in > >> a > >> crontab), I bet anyone who has access to a Lustre filesystem can > >> reproduce the problem... The problem is to have such a filesystem > to > >> do > >> the tests > > Stabbing in the dark, but can you log the details with ProcessMon? > > > > http://technet.microsoft.com/en-us/sysinternals/bb896645 > > > >> But I am available to do it... > > -Jason > > > -- > - > Maxime Boissonneault > Analyste de calcul - Calcul Québec, Université Laval > Ph. D. en physique > > -- > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html smime.p7s Description: S/MIME cryptographic signature
Re: GIT get corrupted on lustre
I don't know of any lustre filesystem that is used on Windows. Barely anybody uses Windows in the HPC industry. This is a Linux cluster. Maxime Boissonneault Le 2013-01-17 11:40, Pyeron, Jason J CTR (US) a écrit : -Original Message- From: Eric Chamberland Sent: Thursday, January 17, 2013 11:31 AM On 01/17/2013 09:23 AM, Philippe Vaucher wrote: Anyone has a new idea? Did you try Jeff King's code to confirm his idea? Philippe Yes I did, but it was running without any problem I find that my test case is "simple" (fresh git clone then "git gc" in a crontab), I bet anyone who has access to a Lustre filesystem can reproduce the problem... The problem is to have such a filesystem to do the tests Stabbing in the dark, but can you log the details with ProcessMon? http://technet.microsoft.com/en-us/sysinternals/bb896645 But I am available to do it... -Jason -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: GIT get corrupted on lustre
> -Original Message- > From: Eric Chamberland > Sent: Thursday, January 17, 2013 11:31 AM > > On 01/17/2013 09:23 AM, Philippe Vaucher wrote: > >> Anyone has a new idea? > > > > Did you try Jeff King's code to confirm his idea? > > > > Philippe > > > > Yes I did, but it was running without any problem > > I find that my test case is "simple" (fresh git clone then "git gc" in > a > crontab), I bet anyone who has access to a Lustre filesystem can > reproduce the problem... The problem is to have such a filesystem to > do > the tests Stabbing in the dark, but can you log the details with ProcessMon? http://technet.microsoft.com/en-us/sysinternals/bb896645 > > But I am available to do it... -Jason smime.p7s Description: S/MIME cryptographic signature
Re: GIT get corrupted on lustre
On 01/17/2013 09:23 AM, Philippe Vaucher wrote: Anyone has a new idea? Did you try Jeff King's code to confirm his idea? Philippe Yes I did, but it was running without any problem I find that my test case is "simple" (fresh git clone then "git gc" in a crontab), I bet anyone who has access to a Lustre filesystem can reproduce the problem... The problem is to have such a filesystem to do the tests But I am available to do it... Thanks, Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Hi! I still have the corruption problems We just compiled a git without threads to try... (by the way, --without-pthreads doesn't work, you have to do a --disable-pthreads instead). And to remove the warnings about threads at "git gc" execution, I did a: git config --local pack.threads 1 and cloned a repository and started to do: git gc once every hour. Then this night (at 05:35:02 exactly), the same error as usual occurred: error: index file .git/objects/pack/pack-bf0748cee64a1964be0a1061c82aca51c993b825.idx is too small error: refs/heads/master does not point to a valid object! So now I am convinced that it is not a thread problem I am kind of discouraged, we like to use git, but in this case we have this error which seems unsolvable! Anyone has a new idea? Thanks, Eric On 01/09/2013 04:20 PM, Eric Chamberland wrote: Hi Brian, On 01/08/2013 11:11 AM, Eric Chamberland wrote: On 12/24/2012 10:11 AM, Brian J. Murrell wrote: Have you tried adding a "-q" to the git command line to quiet down git's "feedback" messages? I moved to git 1.8.1 and added the "-q" to the command "git gc" but it occured to return an error, so the "-q" option is not avoiding the problem here... :-/ command in crontab: cd /rap/jsf-051-aa/ericc/tests_git_clones/GIREF && for i in seq 10; do /software/apps/git/1.8.1/bin/git gc -q || true;done results: error: index file .git/objects/pack/pack-1f09879c88cd71a15dcc891713cf038d249830ad.idx is too small error: refs/remotes/origin/BIB_Branche_1_4_x does not point to a valid object! and this clone was a "clean" clone in which only "git qc -q" has been run on I still have a doubt on threads Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
Hi Brian, On 01/08/2013 11:11 AM, Eric Chamberland wrote: On 12/24/2012 10:11 AM, Brian J. Murrell wrote: Have you tried adding a "-q" to the git command line to quiet down git's "feedback" messages? I moved to git 1.8.1 and added the "-q" to the command "git gc" but it occured to return an error, so the "-q" option is not avoiding the problem here... :-/ command in crontab: cd /rap/jsf-051-aa/ericc/tests_git_clones/GIREF && for i in seq 10; do /software/apps/git/1.8.1/bin/git gc -q || true;done results: error: index file .git/objects/pack/pack-1f09879c88cd71a15dcc891713cf038d249830ad.idx is too small error: refs/remotes/origin/BIB_Branche_1_4_x does not point to a valid object! and this clone was a "clean" clone in which only "git qc -q" has been run on I still have a doubt on threads Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On 12/24/2012 10:11 AM, Brian J. Murrell wrote: Have you tried adding a "-q" to the git command line to quiet down git's "feedback" messages? Ok, I have modified my crontab to use "-q" and I will wait to see if the problem occurs from now. I discovered other oddities with using git on Lustre which I described in this thread: http://thread.gmane.org/gmane.comp.version-control.git/208886 I found that by simply disabling the feedback (which disables the copious SIGALRM processing) I could alleviate the issue. I wonder if your issues are more of the same. I filed Lustre bug LU-2276 about it at: http://jira.whamcloud.com/browse/LU-2276 Thank you for these informations. I see the bug is unresolved!... Eric -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
On Mon, Dec 24, 2012 at 09:08:46AM -0500, Eric Chamberland wrote: > Doing a "git clone" always work fine, but when we "git pull" or "git > gc" or "git fsck", often (1/5) the local repository get corrupted. > for example, I got this error two days ago while doing "git gc": > > error: index file > .git/objects/pack/pack-7b43b1c613a851392aaf4f66916dff2577931576.idx is too > small > error: refs/heads/mail_seekable does not point to a valid object! > [...] > We think it could be related to the fact that we are on a *Lustre* > filesystem, which I think doesn't fully support file locking. I don't think locking is a problem here. The problem is that you have a corrupt .idx file (the second error is almost certainly an effect of the first one; git cannot look in the packfile, and therefore cannot find the object the ref points to). But we do not ever lock the .idx files. They are generated in tmpfiles and then atomically moved into place using a hard link. So if anything, I would suspect that lustre has trouble with the write/fsync/close/link sequence. Is it possible that it does not keep the ordering, and readers might see a linked file that is missing some data? If you wait (or do some synchronizing operation on the filesystem, like "sync", or an unmount/mount), does the repo later work, or is it broken forever? > #1) However, how can we *test* the filesystem (lustre) compatibility > with git? (Is there a unit test we can run?) Running "make test" in git.git would be a good start. You could also try running the C program I'm including below. It repeatedly runs a write/close/fsync/link sequence like the one that index-pack runs, and then verifies the result. If it does not run forever without error, that would be a sign of the possible ordering problem I mentioned above. > #2) Is there a way to compile GIT to be compatible with lustre? (ex: > no threads?) This isn't a known issue, so I don't know offhand what compile flags might help. The complete list is at the top of Makefile. You might try with NO_PTHREADS=Yes, but I kind of doubt that threads are at work here. > #3) If you *know* your filesystem doesn't allow file locking, how > would you configure/compile GIT to work on it? I think locking is a red herring here, as it is not used to create the .idx files at all (and we don't do flock locking anyway; everything happens via O_EXCL creation). -Peff -- >8 -- #include #include #include #include #include static int randomize(unsigned char *buf, int len) { int i; len = rand() % len; for (i = 0; i < len; i++) buf[i] = rand() & 0xff; return len; } static int check_eof(int fd) { int ch; int r = read(fd, &ch, 1); if (r < 0) { perror("read error after expected EOF"); return -1; } if (r > 0) { fprintf(stderr, "extra byte after expected EOF"); return -1; } return 0; } static int verify(int fd, const unsigned char *buf, int len) { while (len) { char to_check[4096]; int got = read(fd, to_check, len < sizeof(to_check) ? len : sizeof(to_check)); if (got < 0) { perror("unable to read"); return -1; } if (got == 0) { fprintf(stderr, "premature EOF (%d bytes remaining)", len); return -1; } if (memcmp(buf, to_check, got)) { fprintf(stderr, "bytes differ"); return -1; } buf += got; len -= got; } return check_eof(fd); } int write_in_full(int fd, const unsigned char *buf, int len) { while (len) { int r = write(fd, buf, len); if (r < 0) return -1; buf += r; len -= r; } return 0; } int move_into_place(const char *old, const char *new) { if (link(old, new) < 0) { perror("unable to create hard link"); return 1; } unlink(old); return 0; } int main(void) { while (1) { static unsigned char junk[1024*1024]; int len = randomize(junk, sizeof(junk)); int fd; /* clean up from any previous round */ unlink("tmpfile"); unlink("final.idx"); fd = open("tmpfile", O_WRONLY|O_CREAT, 0666); if (fd < 0) { perror("unable to open tmpfile"); return 1; } if (write_in_full(fd, junk, len) < 0 || fsync(fd) < 0 || close(fd) < 0) { perror("unable to write"); return 1; } if (move_into_place("tmpfile", "final.idx") < 0) return 1; fd = open("final.idx", O_RDONLY); if (fd < 0) { perror("unable to open index file"); return 1; } if (verify(fd, junk, len) < 0) return 1; close(fd); } } -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GIT get corrupted on lustre
we are using git since may and all is working fine for all of us (almost 20 people) on our workstations. However, when we clone our repositories to the cluster, only and only there we are having many problems similiar to this post: What filesystem tests have you run on lustre? I would run every test you can find, and lustre should have a robust test suite. It's really hard to be certain, but given how many filesystems git is used with, your experience points to a lustre bug. I would also suggest using ktrace/ktruss/strace and perhaps poring over the logs to see if you can spot any bad behavior. pgp6qmR6soDBX.pgp Description: PGP signature
Re: GIT get corrupted on lustre
On 12-12-24 09:08 AM, Eric Chamberland wrote: > Hi, Hi, > Doing a "git clone" always work fine, but when we "git pull" or "git gc" > or "git fsck", often (1/5) the local repository get corrupted. Have you tried adding a "-q" to the git command line to quiet down git's "feedback" messages? I discovered other oddities with using git on Lustre which I described in this thread: http://thread.gmane.org/gmane.comp.version-control.git/208886 I found that by simply disabling the feedback (which disables the copious SIGALRM processing) I could alleviate the issue. I wonder if your issues are more of the same. I filed Lustre bug LU-2276 about it at: http://jira.whamcloud.com/browse/LU-2276 Cheers, b. signature.asc Description: OpenPGP digital signature
Re: GIT get corrupted on lustre
Eric Chamberland writes: > #1) However, how can we *test* the filesystem (lustre) compatibility with > git? (Is there a unit test we can run?) Have you considered running git's testsuite? Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html