[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422618#comment-13422618 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

Todd, thanks for clarifying my concerns with sync. By my comment about PrintWriter close not affecting flush/sync, I meant that calling flush and an FD-level sync on the underlying LocalFSFileOutputStream object before calling PrintWriter close would still sync the file.

> Should not use PrintWriter to write taskjvm.sh
> --
>
> Key: MAPREDUCE-2374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2374
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: 0.22.1
>
> Attachments: failed_taskjvmsh.strace, mapreduce-2374-branch-1.patch,
> mapreduce-2374-on-20sec.txt, mapreduce-2374.txt, mapreduce-2374.txt,
> successfull_taskjvmsh.strace
>
>
> Our use of PrintWriter in TaskController.writeCommand is unsafe, since that
> class swallows all IO exceptions. We're not currently checking for errors,
> which I'm seeing result in occasional task failures with the message "Text
> file busy" - assumedly because the close() call is failing silently for some
> reason.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
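The issue description says that PrintWriter swallows all IO exceptions. A minimal sketch of that behavior follows; it is not the Hadoop code. The class SwallowedErrorDemo and its FailingWriter are hypothetical names made up for illustration, with FailingWriter standing in for a disk that rejects writes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;

final class SwallowedErrorDemo {
    /** Hypothetical Writer whose every write fails, standing in for a broken disk. */
    static final class FailingWriter extends Writer {
        @Override
        public void write(char[] buf, int off, int len) throws IOException {
            throw new IOException("simulated write failure");
        }

        @Override
        public void flush() {
            // nothing buffered here
        }

        @Override
        public void close() {
            // nothing to release
        }
    }

    /** Returns true when PrintWriter threw nothing and only recorded the error internally. */
    static boolean printWriterHidesFailure() {
        PrintWriter pw = new PrintWriter(new FailingWriter());
        pw.println("echo hello"); // the IOException is caught inside PrintWriter
        pw.close();
        return pw.checkError();   // the only way to learn that something went wrong
    }

    /** Returns true when a plain BufferedWriter reports the same failure as an IOException. */
    static boolean bufferedWriterThrows() {
        try (BufferedWriter w = new BufferedWriter(new FailingWriter())) {
            w.write("echo hello");
            w.newLine();
            w.flush();            // pushes the buffer to FailingWriter, which throws
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```

Either checking checkError() after close or writing through a Writer that throws would make a failed write visible to the caller.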
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422572#comment-13422572 ]

Todd Lipcon commented on MAPREDUCE-2374:

You've got the layering mixed up there -- execve is perfectly happy to read out of the buffer cache regardless of whether the underlying file is on persistent media or not. So, you don't need to sync anything. The Java close() API does forward the 'close' through to the underlying writers, and that also triggers a flush of any buffered streams. So I think Andy's patch is sufficient to fix this issue.

Andy: would you mind updating the patch with a comment above your change indicating that it's important to run "bash foo.sh" instead of just chmod 755 and execing "foo.sh" or running "bash -c foo.sh", due to this race condition? It's subtle and I can imagine it regressing if someone later redoes this code at all.

Also, please double-check that the LinuxTaskExecutor code path doesn't have the same problem, if you can. We'll also need to look at the equivalent code in trunk (MR2), since it probably has the same issue.

Nice work, everyone, tracking down this long-standing bug.
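The difference between the two invocations discussed in this thread can be sketched as two argument lists. This is only an illustration of the idea, not the contents of the attached patch; the class ScriptCommand and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ScriptCommand {
    /**
     * "bash foo.sh": bash opens and reads the file itself, so the kernel
     * never execve()s the script and cannot return ETXTBSY for it.
     */
    static List<String> viaInterpreter(String scriptPath) {
        return Arrays.asList("bash", scriptPath);
    }

    /**
     * "bash -c foo.sh": bash treats the path as a command to run and ends up
     * calling execve() on the script, which fails with "Text file busy" if
     * any process still holds the file open for writing.
     */
    static List<String> viaExec(String scriptPath) {
        return Arrays.asList("bash", "-c", scriptPath);
    }
}
```

A caller would pass the first form to a launcher, for example `new ProcessBuilder(ScriptCommand.viaInterpreter("/path/to/taskjvm.sh"))`.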
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422518#comment-13422518 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

No luck with strace output yet. I will try to collect a failing case using short-running jobs such as piEstimator.

While I was initially debugging this issue, I was not able to see any process in the lsof output for the taskjvm.sh file, so I wondered whether ext4 delayed allocation had something to do with the "Text file busy" error. I thought that this would be more of an issue with the -c bash switch change if the whole file doesn't get committed to disk by the time the executor thread is scheduled. This theory could be wrong, though, as indicated by Todd and Andy.

I had to remove the BufferedOutputStream wrapper so that the fdos.sync call would succeed, as LocalFSFileOutputStream implements the Syncable interface whereas BufferedOutputStream doesn't. I thought PrintWriter close wouldn't affect the flush and sync of LocalFSFileOutputStream. Maybe I am missing something?
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421956#comment-13421956 ]

Todd Lipcon commented on MAPREDUCE-2374:

Last I looked, I think ProcessBuilder did do that.. no?
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421943#comment-13421943 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. Andy - could you please clarify about your comment on ProcessBuilder bug? You should be able to file a bug on bugs.sun.com.

I am suggesting that {{ProcessBuilder}} should ensure that there are no file descriptors unintentionally left open in the child before execing the target executable. There are some systems where {{system(3)}} does this by simply {{close(2)}}ing every file descriptor from 3 up to the maxfiles ulimit. It turns out glibc's {{system(3)}} doesn't (anymore?), though -- if you have an open fd and call {{system("ls /proc/self/fd")}} you can see that {{ls}} still has a copy of the open fd. So that's evidence that maybe it's not considered part of the interface contract.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421922#comment-13421922 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. I have tested Andy's suggestion of not using -c switch. It does resolve the issue on our test cluster.

Thanks for testing! This is great news.

bq. to avoid potential data loss issues (from delayed allocation by file systems like ext4) I have made some changes so that the IO buffer contents of taskjvm.sh file are committed to the underlying storage before shell executor is called.

I'm a bit confused why you're concerned about crash consistency here. AFAIK, ext4 delayed allocation is *completely* invisible to application code, unless the machine crashes and you're recovering afterwards. (OK, that's not quite true since you could use {{filefrag}} to find out where the allocations are, or maybe you could use hires timers to notice seek-induced IO timing discontinuities, but those are not relevant to this discussion.) Since the taskjvm.sh script is just used immediately and not reused across a kernel crash, why do you care that the IO buffer is synced to stable storage?

Looking at mapreduce-2374-branch-1.patch, I see that you've made two related changes, one getting rid of the {{BufferedOutputStream}} from {{RawLocalFileSystem#create}} and another adding calls to {{FSDataOutputStream#flush}} and {{#sync}}. I don't see how either of those changes can make much of a difference given that we call {{w.close}} 8 lines down in the finally block.

bq. with strace the frequency of Text busy errors goes down drastically

Not too surprising; to get ETXTBSY we have to get to the relevant {{execve}} before the forked process that's holding the fd exits. Adding a strace before that slows down the shell and adds another child process into the scheduling mix, making the race harder to win.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421911#comment-13421911 ]

Todd Lipcon commented on MAPREDUCE-2374:

bq. Since we are thinking of removing the -c switch, to avoid potential data loss issues (from delayed allocation by file systems like ext4) I have made some changes so that the IO buffer contents of taskjvm.sh file are committed to the underlying storage before shell executor is

That shouldn't be necessary -- the exec code is perfectly OK to read out of the OS buffer cache. Adding an fsync here will just hurt the latency of task launch. Though I agree that a flush() is necessary to make sure it is pushed out of any Java-side buffers.
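The distinction drawn here, that a flush() out of Java-side buffers is needed while an fsync is not, can be seen in a small sketch using only java.io and java.nio. The class FlushVersusSync is a hypothetical name; it assumes the text is smaller than the default BufferedOutputStream buffer, so nothing reaches the file until flush() is called, and it never syncs to stable storage.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FlushVersusSync {
    /**
     * Writes text through a Java-side buffer and returns what a second reader
     * of the same file sees before flush() (index 0) and after flush() (index 1).
     * No fsync is issued: the reader is served from the OS buffer cache.
     */
    static String[] visibleBeforeAndAfterFlush(Path file, String text) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file.toFile()))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
            // Still sitting in the Java buffer, so the file looks empty.
            String before = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            out.flush();
            // Now handed to the OS, so other readers can see it without any sync.
            String after = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return new String[] { before, after };
        }
    }
}
```

Since close() on a buffered stream also flushes, closing the writer before launching the script has the same effect as the explicit flush() here.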
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421801#comment-13421801 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

I have tested Andy's suggestion of not using the -c switch. It does resolve the issue on our test cluster.

Since we are thinking of removing the -c switch, to avoid potential data loss issues (from delayed allocation by file systems like ext4) I have made some changes so that the IO buffer contents of the taskjvm.sh file are committed to the underlying storage before the shell executor is called. The mapreduce-2374-branch-1.patch patch also removes the redundant set-permission call. It also includes the -c switch change that Andy has suggested. If you all find these changes useful, please take a look at the attached patch. In any case, it would be nice if Andy's patch could be committed.

Regarding other queries from Andy and Colin: I have been collecting the strace output files by prepending "strace -o /path/to/strace.output" to the same shell executor command that calls "bash -c /path/to/taskjvm.sh". With this change, the frequency of "Text file busy" errors goes down drastically, to the level where it may occur once in several runs. I will post strace output as soon as I see a failing case. Running strace outside of Hadoop code is difficult since it is hard to tell which of the JVM processes will have a failing task. Let me know if you have any tricks for capturing strace output.

Andy - could you please clarify your comment about the ProcessBuilder bug? You should be able to file a bug on bugs.sun.com.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421731#comment-13421731 ]

Andy Isaacson commented on MAPREDUCE-2374:

This is really a bug in {{java.lang.ProcessBuilder}}, that it does not close open files in the forked child before execing the target command. Where does one report such bugs?
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421730#comment-13421730 ]

Andy Isaacson commented on MAPREDUCE-2374:

It's true that open(O_CLOEXEC) isn't in RHEL5, but fcntl to set FD_CLOEXEC is. It doesn't completely close the race but it does reduce the window substantially. (The fact that there's an unfixable race window there is why O_CLOEXEC was added.) It would still require NativeIO. It's orthogonal to my proposed fix, though.

Colin's theory also clears up this worry:

bq. Now, suppose the undiscovered but hypothesized race condition in writeCommand does exist, and affects the write as well as the close.

Since there's no race condition in writeCommand, there's no chance the script will be incomplete when we run it.

I'd love to see a strace -tttf showing the entire failure scenario just to be sure, but I'm confident enough that Colin nailed it that I think we should just merge my "no more -c" patch.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421662#comment-13421662 ]

Todd Lipcon commented on MAPREDUCE-2374:

Yea, I buy that theory. Good thinking, Colin.

So it seems like to fix this issue we could do one or more of the following:

1) If the NativeIO library is present and we're on a system that has it defined, we could use open() with O_CLOEXEC to open the taskjvm.sh files.
2) We could switch from "bash -c" to "bash" to run the shell script. Can anyone think of an unintended consequence of this? How does the taskjvm.sh get execced with the linux task controller at play?
3) We could add a simple retry or two on Text File Busy.

#1 probably sucks because O_CLOEXEC isn't present on RHEL5, and it will require native support. Between #2 and #3 I don't have a strong opinion.
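Option 3, a simple retry or two on "Text file busy", might look roughly like the sketch below. Everything in it is hypothetical: the TextFileBusyRetry class, the Launcher interface, and in particular the assumption that the failure reaches the caller as an IOException whose message contains "Text file busy". With "bash -c" the error may instead appear only in the shell's output and exit status, in which case the detection would have to look there.

```java
import java.io.IOException;

final class TextFileBusyRetry {
    /** Hypothetical stand-in for whatever actually launches the script. */
    interface Launcher {
        void launch() throws IOException;
    }

    /** Assumption: ETXTBSY is reported through the exception message. */
    static boolean isTextFileBusy(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Text file busy");
    }

    /**
     * Runs the launcher, retrying up to maxRetries extra times when the
     * failure looks like ETXTBSY. Returns the number of attempts used.
     * Any other IOException is rethrown immediately.
     */
    static int launchWithRetry(Launcher launcher, int maxRetries, long backoffMillis)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                launcher.launch();
                return attempt;
            } catch (IOException e) {
                if (!isTextFileBusy(e) || attempt > maxRetries) {
                    throw e;
                }
                // Give the forked child holding the fd time to exec or exit.
                Thread.sleep(backoffMillis);
            }
        }
    }
}
```

A retry only narrows the window; unlike option 2 it does not remove the execve of the script, so it trades a rare failure for an even rarer one plus some added launch latency.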
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421605#comment-13421605 ]

Andy Isaacson commented on MAPREDUCE-2374:

Colin, I think you've got a great theory! That fork-from-another-thread scenario would explain the intermittent problems.

bq. It may be helpful to use strace -f to follow forks. I realize there's a lot of output, but those are the lines we need, I think.

My favorite strace commandline is {{strace -tttfo /tmp/foo -p <pid>}}, with -ttt for microsecond-resolution timestamps (printed as seconds since the epoch), -f to follow forks, and -o to put the output in a file.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421198#comment-13421198 ]

Colin Patrick McCabe commented on MAPREDUCE-2374:

I just thought of something. Suppose that the JVM is holding blahblahblah.sh open for write, and meanwhile another thread forks a bash process (or something). After the fork completes, that process will hold blahblahblah.sh open for write with O_WRONLY. At the very least, this is a race condition that could lead to "mysterious" failures, since you don't know when the fork'ed process will next get scheduled in relation to the parent process.

The O_CLOEXEC flag was introduced in Linux 2.6.23 to solve this problem, by letting open() atomically mark the FD close-on-exec so that a forked child drops it as soon as it execs. However, I didn't see it being used in the strace output you posted. And it's certainly not around on RHEL5 and earlier.

If this is true, then I guess the solution Andy posted earlier is probably the best way to go. Just get rid of the -c and this behavior will be masked.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421196#comment-13421196 ]

Colin Patrick McCabe commented on MAPREDUCE-2374:

Thanks for the strace output, Shrinivas. Unfortunately, it doesn't seem to show the place where you're opening up /local5/sj_mrv1_trunk/hadoop-local/ttprivate/taskTracker/root/jobcache/job_201207131350_0001/attempt_201207131350_0001_m_000201_0/taskjvm.sh for writing.

If there is no file descriptor leak, you'd expect to see something like this:

{code}
open("blahblahblah.sh", O_WRONLY|O_CREAT|O_TRUNC, 0700) = 5
close(5) = 0
...
execve("blahblahblah.sh") ...
{code}

On the other hand, if there is a leak, there should be no corresponding close() call. Things can get more complicated than that because of dup() and stuff like that, but that's the basic idea... In the absence of that, we can't really draw any conclusions either way.

It may be helpful to use strace -f to follow forks. I realize there's a lot of output, but those are the lines we need, I think.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421169#comment-13421169 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

I have attached two files containing strace output. In our analysis, they did not show what caused the ETXTBSY error. This is on RHEL Server release 6.2 (2.6.32-220.el6.x86_64).
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421036#comment-13421036 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. I have noticed ENOEXEC errors for both successful as well as unsuccessful (ETXTBSY) cases of bash -c /path/to/taskjvm.sh calls through strace output. So I am not sure if this is the root cause. However, it doesn't hurt to add the #!/bin/sh construct.

If you have strace output showing the ETXTBSY, could you upload it to the bug? That would greatly help in isolating the problem, especially if it also shows the open()/write()/close() of the script file. The specific kernel version where the problem has been seen would be great too!
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420970#comment-13420970 ]

Colin Patrick McCabe commented on MAPREDUCE-2374:

Ah, you're absolutely right. Without the -c, bash won't execve the file itself. It will read the file into an already running interpreter. So no ETXTBSY.

I'm a little nervous about this proposed "fix" because it doesn't really seem to fix the problem. Are we absolutely sure that the file is closed and not still being written to? If it isn't closed, we could be getting file descriptor leaks, partially written scripts, and all the usual evils.

I guess Shrinivas did some testing with lsof. Shrinivas, can you try deliberately keeping the file open for write and verifying that your lsof test detects it? Also, it would be interesting to see if the same problem shows up when using the raw FileChannel API (as opposed to the FileWriter API).

If this is a real kernel bug, I guess we need to file it upstream on the Red Hat bug tracker and/or kernel mailing list.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420956#comment-13420956 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. I agree that it would be simpler without the -c, but I don't see how this "avoids the code path that triggers the ETXTBSY."

Currently we are invoking {{bash -c /path/to/script}}. Bash treats the argument after -c as input to its command processor. The string ends up being parsed as a command to execute. Bash tries to execve the script, and either gets ENOEXEC (and then falls back to trying to read the file as a script) or ETXTBSY (and fails the command). The execve that we sometimes see returning ETXTBSY is {{execve("/path/to/script", ...)}}.

I'm suggesting we remove the -c, so we would invoke {{bash /path/to/script}}. According to {{bash(1)}}, "If ... neither the -c nor the -s option have been supplied, the first argument is assumed to be the name of a file containing shell commands. ... Bash reads and executes commands from this file, then exits." The intermittently failing execve is never called, per the documentation, and I verified with a standalone testcase that the workaround works correctly. Also it's instructive to observe bash with strace.

bq. We're still calling execve at some point

It's true that we're still calling {{execve("/bin/bash")}} but that is not relevant to the ETXTBSY failure. Without -c we do not call execve on the shell script.
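As a rough illustration of the two invocations Andy contrasts, here is a minimal Java sketch of how the argument vectors differ. The class and method names (ScriptLauncher, withDashC, withoutDashC, run) are hypothetical and are not taken from TaskController or DefaultTaskController; the sketch only assumes that a bash binary is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: contrasts the two bash invocations.
final class ScriptLauncher {
    /** Current form: bash parses the path as a command and tries to execve() the script itself. */
    static List<String> withDashC(String scriptPath) {
        return Arrays.asList("bash", "-c", scriptPath);
    }

    /** Proposed form: bash reads commands from the file; the script is never passed to execve(). */
    static List<String> withoutDashC(String scriptPath) {
        return Arrays.asList("bash", scriptPath);
    }

    /** Runs the script using the proposed form and returns bash's exit code. */
    static int run(String scriptPath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(withoutDashC(scriptPath)).inheritIO().start();
        return p.waitFor();
    }
}
```

With the second form the script does not need a {{#!/bin/sh}} line or the execute bit, since bash only has to open and read the file.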
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420950#comment-13420950 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

SELinux is disabled on all the nodes of our cluster. Another thing I have noticed with this issue is that it is more pronounced with a large number of map processes.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420918#comment-13420918 ]

Colin Patrick McCabe commented on MAPREDUCE-2374:

bq. We can avoid the ETXTBSY by avoiding the execve. If I'm reading launchTask correctly, the script we're execing isn't even a valid shell script anyways -- it's just a sequence of shell commands, without the leading "#!/bin/sh" header. By running it bash -c "/path/to/script" we're relying on the ancient pre-Bourne shell script convention that if execve() fails with ENOEXEC, the shell tries to interpret the file as a script.

bq. Instead, we can ask bash to directly run the script as a script by running bash "/path/to/script" leaving out the -c. This avoids the code path that triggers the ETXTBSY failure and is slightly less reliant on random backwards compatibility kludges. And it doesn't break if we do have the #!/bin/sh line since that's just a comment.

That's a very interesting analysis, Andy. I agree that it would be simpler without the -c, but I don't see how this "avoids the code path that triggers the ETXTBSY." We're still calling execve at some point, and if someone has that FD open for write it will squawk. (All the other exec flavors are just wrappers around execve and return the same error codes.) Am I missing something?

As you pointed out, it's possible that SELinux is doing something odd. Shrinivas, can you confirm that SELinux is off during your testing? Just run this command as root and then we will be sure:

{code}
/usr/sbin/setenforce 0
{code}

It will stay in effect until you reboot.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420895#comment-13420895 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

I have tried collecting lsof output before and after the call to shExec.execute() and I do not see any process holding on to the handle for the taskjvm.sh file. I agree with Andy, the problem here does not seem to be because of an open FD.

I have tried making changes where RawLocalFileSystem.sync() gets explicitly called for the underlying output data stream after the write to taskjvm.sh happens. I assume that the sync() method calls the fsync system call, which will ensure that the data is committed to the underlying storage.

I have noticed ENOEXEC errors for both successful as well as unsuccessful (ETXTBSY) cases of bash -c /path/to/taskjvm.sh calls through strace output. So I am not sure if this is the root cause. However, it doesn't hurt to add the #!/bin/sh construct.

One thing I am planning to do is to capture strace output for child processes forked from the taskjvm.sh script. Another thing that I need to try is to call sync on the whole directory containing taskjvm.sh.

There is probably something internal at the kernel level that is causing this issue. As I mentioned earlier, this problem is much more pronounced on RHEL 6.2 (stock kernel) compared to Ubuntu 11.04 server.
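For readers unfamiliar with an FD-level sync, the following is a minimal sketch of what "flush, then fsync, then close" might look like with plain java.io. It is not the RawLocalFileSystem code Shrinivas modified; the class name SyncedScriptWriter and the method writeAndSync are hypothetical, and the thread does not establish that fsync has any bearing on ETXTBSY.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: write a script and sync it before closing.
final class SyncedScriptWriter {
    static void writeAndSync(File target, String contents) throws IOException {
        FileOutputStream out = new FileOutputStream(target);
        try {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
            out.flush();
            // Ask the OS to push the file's data to the storage device.
            out.getFD().sync();
        } finally {
            // Any IOException from close() propagates to the caller.
            out.close();
        }
    }
}
```

Unlike PrintWriter, every step here can throw, so a failed write, sync, or close is visible to the caller.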
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420888#comment-13420888 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. (I'll post another comment relevant to SELinux.)

There are some examples in the RHEL bugzilla of SELinux causing mysterious ETXTBSY failures. Such failures should leave a trace in {{/var/log/audit/audit.log}} on the node which experienced the failure. Could someone who can reproduce this failure please check the audit log on the node when a failure occurs? Even if you think that SELinux is turned off, sometimes it's on by accident...

It would also potentially be helpful to compare an {{audit.log}} from a node that runs the same code and does not experience the ETXTBSY failure.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420883#comment-13420883 ]

Andy Isaacson commented on MAPREDUCE-2374:

bq. in the case of exceptions being thrown, the FileChannel is not closed.

Ah, but in this case the caller doesn't catch that exception, so we wouldn't have gotten to the {{shExec.execute()}} if {{writeCommand}}'s invocation of {{w.close()}} had thrown.
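For contrast with a writer whose close() throws, the issue summary's point about PrintWriter can be shown with a small self-contained sketch. FailingStream is a hypothetical stand-in for an output stream whose writes fail; nothing here is taken from the Hadoop code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;

// Hypothetical demo, not part of Hadoop: PrintWriter hides I/O failures.
final class PrintWriterErrorDemo {
    /** Stream that fails every write, standing in for a failing file. */
    static final class FailingStream extends OutputStream {
        @Override
        public void write(int b) throws IOException {
            throw new IOException("simulated write failure");
        }
    }

    /** Returns true if PrintWriter recorded an error that it never threw. */
    static boolean writeAndCheck() {
        PrintWriter w = new PrintWriter(new FailingStream());
        w.println("echo hello"); // no exception surfaces here
        w.close();               // nor here, although the underlying write fails
        return w.checkError();   // the only way to learn that something went wrong
    }
}
```

Neither println() nor close() declares IOException, so unless the caller polls checkError(), execution continues as if the script had been written successfully.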
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420877#comment-13420877 ]

Andy Isaacson commented on MAPREDUCE-2374:

Supposing that SELinux is not involved, here's some analysis under traditional UNIX semantics. (I'll post another comment relevant to SELinux.)

The ETXTBSY error code from execve(2) happens when the executable (in this case, a shell script) is still opened for write by another process. Reading the kernel code, there does not appear to be any chance for a "slight delay"; when the writer calls close() the atomic {{inode->i_writecount}} is decremented, and {{open_exec()}} correctly tests the atomic against zero.

So I strongly suspect that the writer does, in fact, still have the file descriptor open for write when the execve happens. I don't see how that's possible, though -- we clearly call {{w.close()}} in the {{finally}} clause of {{writeCommand}}, and I don't see anywhere that we leak a {{FileDescriptor}} to prevent {{FileOutputStream#close}} from triggering the underlying {{close(2)}}.

But... We can avoid the {{ETXTBSY}} by avoiding the {{execve}}. If I'm reading {{launchTask}} correctly, the script we're execing isn't even a valid shell script anyways -- it's just a sequence of shell commands, without the leading "#!/bin/sh" header. By running it {{bash -c "/path/to/script"}} we're relying on the ancient pre-Bourne shell script convention that if execve() fails with {{ENOEXEC}}, the shell tries to interpret the file as a script.

Instead, we can ask bash to directly run the script as a script by running {{bash "/path/to/script"}} leaving out the {{-c}}. This avoids the code path that triggers the ETXTBSY failure and is slightly less reliant on random backwards compatibility kludges. And it doesn't break if we do have the {{#!/bin/sh}} line since that's just a comment.

Now, suppose the undiscovered but hypothesized race condition in {{writeCommand}} does exist, and affects the {{write}} as well as the {{close}}. Then removing {{-c}} does not remove the race, and when we lose the race the failure will probably be more silent. The command file might not be completely written and running the script might fail.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420830#comment-13420830 ]

Colin Patrick McCabe commented on MAPREDUCE-2374:

In OpenJDK 6 at least, FileWriter#close doesn't always close the file descriptor.

Starting here: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/FileWriter.java I went to OutputStreamWriter. OutputStreamWriter#close calls StreamEncoder#close, which has:

{code}
void implClose() throws IOException {
    flushLeftoverChar(null, true);
    try {
        for (;;) {
            CoderResult cr = encoder.flush(bb);
            if (cr.isUnderflow())
                break;
            if (cr.isOverflow()) {
                assert bb.position() > 0;
                writeBytes();
                continue;
            }
            cr.throwException();
        }

        if (bb.position() > 0)
            writeBytes();
        if (ch != null)
            ch.close();
        else
            out.close();
    } catch (IOException x) {
        encoder.reset();
        throw x;
    }
}
{code}

So you can see that in the case of exceptions being thrown, the FileChannel is not closed. The finalizer probably takes care of it, but that's cold comfort to you in this case.

Maybe it would be best to use the FileChannel API directly rather than using FileWriter.
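Colin's suggestion of using FileChannel directly could look roughly like the sketch below. It is an illustration only: the class name ChannelScriptWriter and the method writeScript are hypothetical, and it assumes Java 7 or later for try-with-resources and java.nio.file, which guarantees that close() runs even when a write throws.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Hadoop: write a script through FileChannel.
final class ChannelScriptWriter {
    static void writeScript(Path target, String contents) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(contents.getBytes(StandardCharsets.UTF_8));
        // try-with-resources closes the channel on both the normal and the exceptional path.
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // A single write() may be partial, so loop until the buffer is drained.
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }
}
```

Because there is no encoder layer between the caller and the channel, there is no intermediate step that can throw before the descriptor is closed.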
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294698#comment-13294698 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

Hi Todd, I spent some time looking into this, and I have a question. Is there any particular reason why the rawFs.setPermission method is called (in the DefaultTaskController class) when the permission has already been set by the writeCommand method (of the TaskController class)?

With the call to rawFs.setPermission commented out, I have not seen these errors occur in my testing. I am in the process of doing additional testing. It may not be the ideal solution for this issue; however, it does seem to address it to a great extent. If you think there is value in this change then let me know, and I can create the 1-line patch and run unit tests before submitting it.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292946#comment-13292946 ]

Todd Lipcon commented on MAPREDUCE-2374:

Yea, I've noticed it more on RHEL6 than RHEL5, I think... but I haven't looked closely recently. I saw it on Java6, so it's not Java7-specific.

It does seem like some kind of kernel level issue: if you close() a file, it should be immediately available to execute, but it seems on some OSes there's a slight delay before it becomes executable, or something.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292886#comment-13292886 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

Hi Todd, thanks for the info. We see these exceptions once or twice every few runs, so it may be a little easier for us to reproduce them. I will update this thread once we have more info. In the meantime, let us know if there is any specific debug info or trace that you think would help find the root cause of this issue.

BTW, we notice these on a RHEL 6.2 64-bit platform with Oracle JDK 7 update 4. We have not yet encountered these exceptions in an Ubuntu 11.04 64-bit environment though. Do you think this might be a platform-specific issue?
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292071#comment-13292071 ]

Todd Lipcon commented on MAPREDUCE-2374:

Hi Shrinivas, Yes, I still see them on very rare occasion, though I don't have a good explanation as to why. And since they're rare, it's pretty hard to repro and investigate further.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291984#comment-13291984 ]

Shrinivas Joshi commented on MAPREDUCE-2374:

We are noticing these "Text file busy" exceptions in a Hadoop 1.0.3 environment, where they cause task failures with exit status 126. These exceptions are non-deterministic. They occur when the shExec.execute() call is made in the org.apache.hadoop.mapred.DefaultTaskController class. The string returned by shExec.getOutput() is an empty line. We captured the details of this exception by logging the stack trace in the catch block. Here is what the stack trace looks like:

org.apache.hadoop.util.Shell$ExitCodeException: bash: /local4/1_0_3/hadoop-local/ttprivate/taskTracker/root/jobcache/job_201206081243_0001/attempt_201206081243_0001_m_24_0/taskjvm.sh: Text file busy
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:131)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:498)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:472)

Also, all the log/work files corresponding to the failed tasks get cleaned up, which makes it harder to debug the underlying cause.

The above patch has helped reduce these failures; however, they still occur. I am not sure whether the "Text file busy" exception is a Hadoop issue or some other issue though. I just thought sharing this information here would benefit someone.
[jira] [Commented] (MAPREDUCE-2374) Should not use PrintWriter to write taskjvm.sh
[ https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016061#comment-13016061 ] Tom White commented on MAPREDUCE-2374: -- +1 > Should not use PrintWriter to write taskjvm.sh > -- > > Key: MAPREDUCE-2374 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2374 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 0.22.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > Fix For: 0.22.0 > > Attachments: mapreduce-2374-on-20sec.txt > > > Our use of PrintWriter in TaskController.writeCommand is unsafe, since that > class swallows all IO exceptions. We're not currently checking for errors, > which I'm seeing result in occasional task failures with the message "Text > file busy" - assumedly because the close() call is failing silently for some > reason. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira