Re: Child Error

2013-05-28 Thread Jim Twensky
d restart my
>>>>>> cluster?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello, I have a 20 node Hadoop cluster where each node has 8GB
>>>>>>> memory and an 8-core processor. I sometimes get the following error on a
>>>>>>> random basis:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> Exception in thread "main" java.io.IOException: Exception reading 
>>>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>>> at 
>>>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>>> at 
>>>>>>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>>> at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>>>>> Caused by: java.io.IOException: failure to login
>>>>>>> at 
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>>>>> at 
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>>>>> at 
>>>>>>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>>>> at 
>>>>>>> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>>>> at 
>>>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>>>> ... 2 more
>>>>>>> Caused by: javax.security.auth.login.LoginException: 
>>>>>>> java.lang.NullPointerException: invalid null input: name
>>>>>>> at 
>>>>>>> com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>>>>> at 
>>>>>>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>> at 
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>> at 
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>
>>>>>>> ..
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> This does not always happen, but I see a pattern: when the intermediate
>>>>>>> data is larger, it tends to occur more frequently. In the web log, I can
>>>>>>> see the following:
>>>>>>>
>>>>>>> java.lang.Throwable: Child Error
>>>>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>>>> Caused by: java.io.IOException: Task process exit with nonzero status 
>>>>>>> of 1.
>>>>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>>>>
>>>>>>> From what I read online, a possible cause is not having enough memory
>>>>>>> for all the JVMs. My mapred-site.xml is set up to allocate 1100MB for
>>>>>>> each child, and the maximum numbers of map and reduce tasks are each set
>>>>>>> to 3, so that is 6600MB for the child JVMs + (500MB * 2) for the data
>>>>>>> node and task tracker (as I set HADOOP_HEAP to 500MB). I feel like
>>>>>>> memory is not the cause, but I haven't been able to avoid the error so far.
>>>>>>> In case it helps, here are the relevant sections of my
>>>>>>> mapred-site.xml
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>>   <value>3</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>>   <value>3</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>mapred.child.java.opts</name>
>>>>>>>   <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>mapred.reduce.parallel.copies</name>
>>>>>>>   <value>5</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>tasktracker.http.threads</name>
>>>>>>>   <value>80</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> My jobs still complete most of the time though they occasionally
>>>>>>> fail and I'm really puzzled at this point. I'd appreciate any help or 
>>>>>>> ideas.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Child Error

2013-05-28 Thread Jean-Marc Spaggiari
java version "1.7.0_21"
>>> OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
>>> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>>>
>>> I don't get any OOME errors, and this error happens on random nodes, not
>>> on a particular one. Usually all tasks running on a particular node fail and
>>> that node gets blacklisted. However, the same node works just fine during
>>> the next or previous jobs. Could it be a problem with the SSH keys? What
>>> else can cause an IOException with a "failure to login" message? I've been
>>> digging into this for two days but I'm almost clueless.
>>>
>>> Thanks,
>>> Jim
>>>
>>>
>>>
>>>
>>> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
>>> jean-m...@spaggiari.org> wrote:
>>>
>>>> Hi Jim,
>>>>
>>>> Which JVM are you using?
>>>>
>>>> I don't think you have a memory issue; otherwise you would have gotten
>>>> some OOMEs...
>>>>
>>>> JM
>>>>
>>>>
>>>> 2013/5/24 Jim Twensky 
>>>>
>>>>> Hi again. Following up on my previous post, I was able to get some error
>>>>> logs from the task tracker/data node this morning, and it looks like it
>>>>> might be a Jetty issue:
>>>>>
>>>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed
>>>>> to retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>>>> java.io.IOException: Owner 'jim' for path
>>>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>>>> did not match expected owner '10929'
>>>>>   at
>>>>> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>>>   at
>>>>> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>>>   at
>>>>> org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>>>   at
>>>>> org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>>>   at
>>>>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>>>
>>>>>
>>>>> I am wondering if I am hitting MAPREDUCE-2389
>>>>> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
>>>>> downgrade my Jetty version? Should I just replace the Jetty jar file in
>>>>> the lib directory with an earlier version and restart my cluster?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky wrote:
>>>>>
>>>>>> Hello, I have a 20 node Hadoop cluster where each node has 8GB memory
>>>>>> and an 8-core processor. I sometimes get the following error on a random
>>>>>> basis:
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Exception in thread "main" java.io.IOException: Exception reading 
>>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>>  at 
>>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>>  at 
>>>>>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>>  at org.apach

Re: Child Error

2013-05-26 Thread Jim Twensky
>>
>> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
>> jean-m...@spaggiari.org> wrote:
>>
>>> Hi Jim,
>>>
>>> Which JVM are you using?
>>>
>>> I don't think you have a memory issue; otherwise you would have gotten
>>> some OOMEs...
>>>
>>> JM
>>>
>>>
>>> 2013/5/24 Jim Twensky 
>>>
>>>> Hi again. Following up on my previous post, I was able to get some error
>>>> logs from the task tracker/data node this morning, and it looks like it
>>>> might be a Jetty issue:
>>>>
>>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed
>>>> to retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>>> java.io.IOException: Owner 'jim' for path
>>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>>> did not match expected owner '10929'
>>>>   at
>>>> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>>   at
>>>> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>>   at
>>>> org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>>   at
>>>> org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>>   at
>>>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>>   at
>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>>   at
>>>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>>   at
>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>>   at
>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>>
>>>>
>>>> I am wondering if I am hitting MAPREDUCE-2389
>>>> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
>>>> downgrade my Jetty version? Should I just replace the Jetty jar file in
>>>> the lib directory with an earlier version and restart my cluster?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky wrote:
>>>>
>>>>> Hello, I have a 20 node Hadoop cluster where each node has 8GB memory
>>>>> and an 8-core processor. I sometimes get the following error on a random
>>>>> basis:
>>>>>
>>>>>
>>>>>
>>>>> ---
>>>>>
>>>>> Exception in thread "main" java.io.IOException: Exception reading 
>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>   at 
>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>   at 
>>>>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>   at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>>> Caused by: java.io.IOException: failure to login
>>>>>   at 
>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>>>   at 
>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>>>   at 
>>>>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>>   at 
>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>>   ... 2 more
>>>>> Caused by: javax.security.auth.login.LoginException: 
>>>>> java.lang.NullPointerException: invalid null input: name
>>>>>   at com.sun.secu

Re: Child Error

2013-05-25 Thread Jean-Marc Spaggiari
.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>at 
>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>... 2 more
>>>> Caused by: javax.security.auth.login.LoginException: 
>>>> java.lang.NullPointerException: invalid null input: name
>>>>at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>>at 
>>>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>at 
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>at 
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> ..
>>>>
>>>>
>>>> ---
>>>>
>>>> This does not always happen, but I see a pattern: when the intermediate
>>>> data is larger, it tends to occur more frequently. In the web log, I can
>>>> see the following:
>>>>
>>>> java.lang.Throwable: Child Error
>>>>at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>>>at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>
>>>> From what I read online, a possible cause is not having enough memory for
>>>> all the JVMs. My mapred-site.xml is set up to allocate 1100MB for each
>>>> child, and the maximum numbers of map and reduce tasks are each set to 3,
>>>> so that is 6600MB for the child JVMs + (500MB * 2) for the data node and
>>>> task tracker (as I set HADOOP_HEAP to 500MB). I feel like memory is not
>>>> the cause, but I haven't been able to avoid the error so far.
>>>> In case it helps, here are the relevant sections of my mapred-site.xml
>>>>
>>>>
>>>> ---
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>   <value>3</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>   <value>3</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.child.java.opts</name>
>>>>   <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.reduce.parallel.copies</name>
>>>>   <value>5</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>tasktracker.http.threads</name>
>>>>   <value>80</value>
>>>> </property>
>>>>
>>>> ---
>>>>
>>>> My jobs still complete most of the time though they occasionally fail
>>>> and I'm really puzzled at this point. I'd appreciate any help or ideas.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>


Re: Child Error

2013-05-25 Thread Jim Twensky
at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>> at 
>>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> ..
>>>
>>>
>>> ---
>>>
>>> This does not always happen, but I see a pattern: when the intermediate
>>> data is larger, it tends to occur more frequently. In the web log, I can
>>> see the following:
>>>
>>> java.lang.Throwable: Child Error
>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>
>>> From what I read online, a possible cause is not having enough memory for
>>> all the JVMs. My mapred-site.xml is set up to allocate 1100MB for each
>>> child, and the maximum numbers of map and reduce tasks are each set to 3,
>>> so that is 6600MB for the child JVMs + (500MB * 2) for the data node and
>>> task tracker (as I set HADOOP_HEAP to 500MB). I feel like memory is not
>>> the cause, but I haven't been able to avoid the error so far.
>>> In case it helps, here are the relevant sections of my mapred-site.xml
>>>
>>>
>>> ---
>>>
>>> <property>
>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>   <value>3</value>
>>> </property>
>>>
>>> <property>
>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>   <value>3</value>
>>> </property>
>>>
>>> <property>
>>>   <name>mapred.child.java.opts</name>
>>>   <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>> </property>
>>>
>>> <property>
>>>   <name>mapred.reduce.parallel.copies</name>
>>>   <value>5</value>
>>> </property>
>>>
>>> <property>
>>>   <name>tasktracker.http.threads</name>
>>>   <value>80</value>
>>> </property>
>>>
>>> ---
>>>
>>> My jobs still complete most of the time though they occasionally fail
>>> and I'm really puzzled at this point. I'd appreciate any help or ideas.
>>>
>>> Thanks
>>>
>>
>>
>


Re: Child Error

2013-05-24 Thread Jean-Marc Spaggiari
Hi Jim,

Which JVM are you using?

I don't think you have a memory issue; otherwise you would have gotten some OOMEs...

JM

2013/5/24 Jim Twensky 

> Hi again. Following up on my previous post, I was able to get some error
> logs from the task tracker/data node this morning, and it looks like it
> might be a Jetty issue:
>
> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to
> retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
> java.io.IOException: Owner 'jim' for path
> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
> did not match expected owner '10929'
>   at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>   at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>   at
> org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>   at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>   at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>
>
> I am wondering if I am hitting MAPREDUCE-2389
> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
> downgrade my Jetty version? Should I just replace the Jetty jar file in the
> lib directory with an earlier version and restart my cluster?
>
> Thank you.
>
>
>
>
> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky wrote:
>
>> Hello, I have a 20 node Hadoop cluster where each node has 8GB memory and
>> an 8-core processor. I sometimes get the following error on a random basis:
>>
>>
>>
>> ---
>>
>> Exception in thread "main" java.io.IOException: Exception reading 
>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>  at 
>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>  at 
>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>  at org.apache.hadoop.mapred.Child.main(Child.java:92)
>> Caused by: java.io.IOException: failure to login
>>  at 
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>  at 
>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>  at 
>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>  at 
>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>  ... 2 more
>> Caused by: javax.security.auth.login.LoginException: 
>> java.lang.NullPointerException: invalid null input: name
>>  at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>  at 
>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> ..
>>
>>
>> ---
>>
>> This does not always happen, but I see a pattern: when the intermediate
>> data is larger, it tends to occur more frequently. In the web log, I can
>> see the following:
>>
>> java.lang.Throwable: Child Error
>>  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>
>> Fr

Re: Child Error

2013-05-24 Thread Jim Twensky
Hi again. Following up on my previous post, I was able to get some error
logs from the task tracker/data node this morning, and it looks like it
might be a Jetty issue:

2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to
retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
java.io.IOException: Owner 'jim' for path
/var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
did not match expected owner '10929'
  at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
  at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
  at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
  at
org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
  at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
  at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
  at
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
  at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
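
The owner mismatch above is worth reading closely: 'jim' is a user name,
while '10929' looks like a numeric UID, so the check appears to be comparing
two different representations of the same account. As a hedged illustration
only, not Hadoop's actual SecureIOUtils source, the failing check is of this
shape (OwnerCheck is a made-up name):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class OwnerCheck {
        // Compare the owner reported by the filesystem with the expected one.
        static void checkOwner(Path path, String expectedOwner) throws IOException {
            String actual = Files.getOwner(path).getName(); // resolved via the OS user database
            if (!actual.equals(expectedOwner)) {
                throw new IOException("Owner '" + actual + "' for path " + path
                        + " did not match expected owner '" + expectedOwner + "'");
            }
        }

        public static void main(String[] args) throws IOException {
            // If the expected owner is recorded as a UID string ("10929") while
            // the lookup returns a name ("jim"), the check fails even though
            // both may refer to the same account.
            Path p = Files.createTempFile("stdout-sample", ".log");
            checkOwner(p, "10929");
        }
    }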


I am wondering if I am hitting MAPREDUCE-2389
(https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
downgrade my Jetty version? Should I just replace the Jetty jar file in the
lib directory with an earlier version and restart my cluster?

Thank you.




On Thu, May 23, 2013 at 7:14 PM, Jim Twensky  wrote:

> Hello, I have a 20 node Hadoop cluster where each node has 8GB memory and
> an 8-core processor. I sometimes get the following error on a random basis:
>
>
>
> ---
>
> Exception in thread "main" java.io.IOException: Exception reading 
> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>   at 
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>   at org.apache.hadoop.mapred.Child.main(Child.java:92)
> Caused by: java.io.IOException: failure to login
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>   at 
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>   ... 2 more
> Caused by: javax.security.auth.login.LoginException: 
> java.lang.NullPointerException: invalid null input: name
>   at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>   at 
> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> ..
>
>
> ---
>
> This does not always happen, but I see a pattern: when the intermediate
> data is larger, it tends to occur more frequently. In the web log, I can
> see the following:
>
> java.lang.Throwable: Child Error
>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> From what I read online, a possible cause is not having enough memory for
> all the JVMs. My mapred-site.xml is set up to allocate 1100MB for each
> child, and the maximum numbers of map and reduce tasks are each set to 3,
> so that is 6600MB for the child JVMs + (500MB * 2) for the data node and
> task tracker (as I set HADOOP_HEAP to 500MB). I feel like memory is not
> the cause, but I haven't been able to avoid the error so far.
> In case it helps, here are the relevant sections of my mapred-site.xml
>
>
> ---

Child Error

2013-05-23 Thread Jim Twensky
Hello, I have a 20 node Hadoop cluster where each node has 8GB memory and
an 8-core processor. I sometimes get the following error on a random basis:


---

Exception in thread "main" java.io.IOException: Exception reading
file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
at 
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
at 
org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
at org.apache.hadoop.mapred.Child.main(Child.java:92)
Caused by: java.io.IOException: failure to login
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
... 2 more
Caused by: javax.security.auth.login.LoginException:
java.lang.NullPointerException: invalid null input: name
at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
at 
com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

..

---
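
A side note on the root cause chain: the innermost frame rejects a null user
name, so the JAAS Unix login fails whenever the child JVM cannot resolve the
user it is running as. That much can be reproduced outside Hadoop; the probe
below is a hedged sketch (LoginProbe is a made-up name, and it assumes a Unix
JDK where com.sun.security.auth.UnixPrincipal is available):

    import com.sun.security.auth.UnixPrincipal;

    public class LoginProbe {
        public static void main(String[] args) {
            // A JVM that cannot resolve its user ends up with a null name.
            System.out.println("user.name = " + System.getProperty("user.name"));
            // Throws java.lang.NullPointerException: invalid null input: name
            new UnixPrincipal(null);
        }
    }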

This does not always happen, but I see a pattern: when the intermediate data
is larger, it tends to occur more frequently. In the web log, I can see the
following:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

From what I read online, a possible cause is not having enough memory for
all the JVMs. My mapred-site.xml is set up to allocate 1100MB for each child,
and the maximum numbers of map and reduce tasks are each set to 3, so that is
6600MB for the child JVMs + (500MB * 2) for the data node and task tracker
(as I set HADOOP_HEAP to 500MB). I feel like memory is not the cause, but I
haven't been able to avoid the error so far.
In case it helps, here are the relevant sections of my mapred-site.xml

---

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
</property>

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>5</value>
</property>

<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>
</property>
---
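
A quick back-of-envelope check of the numbers above (plain arithmetic from
the message; MemoryBudget is a made-up name for illustration):

    public class MemoryBudget {
        public static void main(String[] args) {
            int taskSlots = 3 + 3;    // map + reduce task maximums
            int childHeapMb = 1100;   // -Xmx1100M per child JVM
            int daemonsMb = 2 * 500;  // DataNode + TaskTracker (HADOOP_HEAP=500MB)
            int totalMb = taskSlots * childHeapMb + daemonsMb;
            // Prints "7600 MB of heap on an 8192 MB node"
            System.out.println(totalMb + " MB of heap on an 8192 MB node");
            // -Xmx bounds only the Java heap; permgen, thread stacks, and native
            // memory come on top, so the real footprint is higher and leaves
            // little headroom for the OS on an 8 GB machine.
        }
    }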

My jobs still complete most of the time though they occasionally fail and
I'm really puzzled at this point. I'd appreciate any help or ideas.

Thanks


Re: Child error

2013-03-13 Thread Amit Sela
10x

On Wed, Mar 13, 2013 at 1:56 PM, Azuryy Yu  wrote:

> Don't wait for the patch, it's a very simple fix. Just do it.
> On Mar 13, 2013 5:04 PM, "Amit Sela"  wrote:
>
>> But the patch will work on 1.0.4 correct ?
>>
>> On Wed, Mar 13, 2013 at 4:57 AM, George Datskos <
>> george.dats...@jp.fujitsu.com> wrote:
>>
>>>  Leo
>>>
>>> That JIRA says "fix version=1.0.4" but it is not correct.  The real JIRA
>>> is MAPREDUCE-2374.
>>>
>>> The actual fix version for this bug is 1.1.2.
>>>
>>>
>>> George
>>>
>>>
>>>   or https://issues.apache.org/jira/browse/MAPREDUCE-4857
>>>
>>> Which is fixed in 1.0.4
>>>
>>> From: Amit Sela [mailto:am...@infolinks.com]
>>> Sent: Tuesday, March 12, 2013 5:08 AM
>>> To: user@hadoop.apache.org
>>> Subject: Re: Child error
>>>
>>> Hi Jean-Marc,
>>>
>>> I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
>>> exit status in that issue is 126, and sometimes I get 255.
>>>
>>> Any idea what these status codes mean?
>>>
>>> Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such
>>> an upgrade (it shouldn't differ from 1.0.3 that much, no)?
>>>
>>> Thanks!
>>>
>>> On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari <
>>> jean-m...@spaggiari.org> wrote:
>>>
>>> Hi Amit,
>>>
>>> Which Hadoop version are you using?
>>>
>>> I have been told it's because of
>>> https://issues.apache.org/jira/browse/MAPREDUCE-2374
>>>
>>> JM
>>>
>>> 2013/3/12 Amit Sela :
>>>
>>> > Hi all,
>>> >
>>> > I have a weird failure occurring every now and then during a MapReduce
>>> job.
>>> >
>>> > This is the error:
>>> >
>>> > java.lang.Throwable: Child Error
>>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>> > Caused by: java.io.IOException: Task process exit with nonzero status
>>> of
>>> > 255.
>>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>> >
>>> > And sometimes it's the same but with status of 126.
>>> >
>>> > Any ideas ?
>>> >
>>> > Thanks.
>>>
>>>
>>>
>>>
>>


Re: Child error

2013-03-13 Thread Azuryy Yu
Don't wait for the patch, it's a very simple fix. Just do it.
On Mar 13, 2013 5:04 PM, "Amit Sela"  wrote:

> But the patch will work on 1.0.4 correct ?
>
> On Wed, Mar 13, 2013 at 4:57 AM, George Datskos <
> george.dats...@jp.fujitsu.com> wrote:
>
>>  Leo
>>
>> That JIRA says "fix version=1.0.4" but it is not correct.  The real JIRA
>> is MAPREDUCE-2374.
>>
>> The actual fix version for this bug is 1.1.2.
>>
>>
>> George
>>
>>
>>   or https://issues.apache.org/jira/browse/MAPREDUCE-4857
>>
>> Which is fixed in 1.0.4
>>
>> From: Amit Sela [mailto:am...@infolinks.com]
>> Sent: Tuesday, March 12, 2013 5:08 AM
>> To: user@hadoop.apache.org
>> Subject: Re: Child error
>>
>> Hi Jean-Marc,
>>
>> I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
>> exit status in that issue is 126, and sometimes I get 255.
>>
>> Any idea what these status codes mean?
>>
>> Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such
>> an upgrade (it shouldn't differ from 1.0.3 that much, no)?
>>
>> Thanks!
>>
>> On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari <
>> jean-m...@spaggiari.org> wrote:
>>
>> Hi Amit,
>>
>> Which Hadoop version are you using?
>>
>> I have been told it's because of
>> https://issues.apache.org/jira/browse/MAPREDUCE-2374
>>
>> JM
>>
>> 2013/3/12 Amit Sela :
>>
>> > Hi all,
>> >
>> > I have a weird failure occurring every now and then during a MapReduce
>> job.
>> >
>> > This is the error:
>> >
>> > java.lang.Throwable: Child Error
>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>> > Caused by: java.io.IOException: Task process exit with nonzero status of
>> > 255.
>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>> >
>> > And sometimes it's the same but with status of 126.
>> >
>> > Any ideas ?
>> >
>> > Thanks.
>>
>>
>>
>>
>


Re: Child error

2013-03-13 Thread Azuryy Yu
Yes, you are right.
On Mar 13, 2013 5:04 PM, "Amit Sela"  wrote:

> But the patch will work on 1.0.4 correct ?
>
> On Wed, Mar 13, 2013 at 4:57 AM, George Datskos <
> george.dats...@jp.fujitsu.com> wrote:
>
>>  Leo
>>
>> That JIRA says "fix version=1.0.4" but it is not correct.  The real JIRA
>> is MAPREDUCE-2374.
>>
>> The actual fix version for this bug is 1.1.2.
>>
>>
>> George
>>
>>
>>   or https://issues.apache.org/jira/browse/MAPREDUCE-4857
>>
>> Which is fixed in 1.0.4
>>
>> From: Amit Sela [mailto:am...@infolinks.com]
>> Sent: Tuesday, March 12, 2013 5:08 AM
>> To: user@hadoop.apache.org
>> Subject: Re: Child error
>>
>> Hi Jean-Marc,
>>
>> I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
>> exit status in that issue is 126, and sometimes I get 255.
>>
>> Any idea what these status codes mean?
>>
>> Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such
>> an upgrade (it shouldn't differ from 1.0.3 that much, no)?
>>
>> Thanks!
>>
>> On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari <
>> jean-m...@spaggiari.org> wrote:
>>
>> Hi Amit,
>>
>> Which Hadoop version are you using?
>>
>> I have been told it's because of
>> https://issues.apache.org/jira/browse/MAPREDUCE-2374
>>
>> JM
>>
>> 2013/3/12 Amit Sela :
>>
>> > Hi all,
>> >
>> > I have a weird failure occurring every now and then during a MapReduce
>> job.
>> >
>> > This is the error:
>> >
>> > java.lang.Throwable: Child Error
>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>> > Caused by: java.io.IOException: Task process exit with nonzero status of
>> > 255.
>> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>> >
>> > And sometimes it's the same but with status of 126.
>> >
>> > Any ideas ?
>> >
>> > Thanks.
>>
>>
>>
>>
>


Re: Child error

2013-03-13 Thread Amit Sela
But the patch will work on 1.0.4 correct ?

On Wed, Mar 13, 2013 at 4:57 AM, George Datskos <
george.dats...@jp.fujitsu.com> wrote:

>  Leo
>
> That JIRA says "fix version=1.0.4" but it is not correct.  The real JIRA
> is MAPREDUCE-2374.
>
> The actual fix version for this bug is 1.1.2.
>
>
> George
>
>
>   or https://issues.apache.org/jira/browse/MAPREDUCE-4857
>
> Which is fixed in 1.0.4
>
> From: Amit Sela [mailto:am...@infolinks.com]
> Sent: Tuesday, March 12, 2013 5:08 AM
> To: user@hadoop.apache.org
> Subject: Re: Child error
>
> Hi Jean-Marc,
>
> I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
> exit status in that issue is 126, and sometimes I get 255.
>
> Any idea what these status codes mean?
>
> Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such
> an upgrade (it shouldn't differ from 1.0.3 that much, no)?
>
> Thanks!
>
> On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> Hi Amit,
>
> Which Hadoop version are you using?
>
> I have been told it's because of
> https://issues.apache.org/jira/browse/MAPREDUCE-2374
>
> JM
>
> 2013/3/12 Amit Sela :
>
> > Hi all,
> >
> > I have a weird failure occurring every now and then during a MapReduce
> job.
> >
> > This is the error:
> >
> > java.lang.Throwable: Child Error
> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> > Caused by: java.io.IOException: Task process exit with nonzero status of
> > 255.
> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >
> > And sometimes it's the same but with status of 126.
> >
> > Any ideas ?
> >
> > Thanks.
>
>
>
>


Re: Child error

2013-03-12 Thread George Datskos

Leo

That JIRA says "fix version=1.0.4" but it is not correct.  The real JIRA 
is MAPREDUCE-2374.


The actual fix version for this bug is 1.1.2.


George



or https://issues.apache.org/jira/browse/MAPREDUCE-4857

Which is fixed in 1.0.4

From: Amit Sela [mailto:am...@infolinks.com]
Sent: Tuesday, March 12, 2013 5:08 AM
To: user@hadoop.apache.org
Subject: Re: Child error

Hi Jean-Marc,

I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
exit status in that issue is 126, and sometimes I get 255.

Any idea what these status codes mean?

Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such an
upgrade (it shouldn't differ from 1.0.3 that much, no)?


Thanks!

On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari
<jean-m...@spaggiari.org> wrote:


Hi Amit,

Which Hadoop version are you using?

I have been told it's because of
https://issues.apache.org/jira/browse/MAPREDUCE-2374

JM

2013/3/12 Amit Sela <am...@infolinks.com>:

> Hi all,
>
> I have a weird failure occurring every now and then during a 
MapReduce job.

>
> This is the error:
>
> java.lang.Throwable: Child Error
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of
> 255.
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> And sometimes it's the same but with status of 126.
>
> Any ideas ?
>
> Thanks.





RE: Child error

2013-03-12 Thread Leo Leung
or https://issues.apache.org/jira/browse/MAPREDUCE-4857
Which is fixed in 1.0.4


From: Amit Sela [mailto:am...@infolinks.com]
Sent: Tuesday, March 12, 2013 5:08 AM
To: user@hadoop.apache.org
Subject: Re: Child error

Hi Jean-Marc,
I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
exit status in that issue is 126, and sometimes I get 255.
Any idea what these status codes mean?
Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such an
upgrade (it shouldn't differ from 1.0.3 that much, no)?

Thanks!


On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari
<jean-m...@spaggiari.org> wrote:
Hi Amit,

Which Hadoop version are you using?

I have been told it's because of
https://issues.apache.org/jira/browse/MAPREDUCE-2374

JM

2013/3/12 Amit Sela <am...@infolinks.com>:
> Hi all,
>
> I have a weird failure occurring every now and then during a MapReduce job.
>
> This is the error:
>
> java.lang.Throwable: Child Error
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of
> 255.
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> And sometimes it's the same but with status of 126.
>
> Any ideas ?
>
> Thanks.



Re: Child error

2013-03-12 Thread Amit Sela
Hi Jean-Marc,
I am running Hadoop 1.0.3, and I did see the issue you mentioned, but the
exit status in that issue is 126, and sometimes I get 255.
Any idea what these status codes mean?
Did you suffer this issue and upgrade to 1.0.4? If so, how smooth is such an
upgrade (it shouldn't differ from 1.0.3 that much, no)?

Thanks!



On Tue, Mar 12, 2013 at 1:40 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi Amit,
>
> Which Hadoop version are you using?
>
> I have been told it's because of
> https://issues.apache.org/jira/browse/MAPREDUCE-2374
>
> JM
>
> 2013/3/12 Amit Sela :
> > Hi all,
> >
> > I have a weird failure occurring every now and then during a MapReduce
> job.
> >
> > This is the error:
> >
> > java.lang.Throwable: Child Error
> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> > Caused by: java.io.IOException: Task process exit with nonzero status of
> > 255.
> > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >
> > And sometimes it's the same but with status of 126.
> >
> > Any ideas ?
> >
> > Thanks.
>


Re: Child error

2013-03-12 Thread Jean-Marc Spaggiari
Hi Amit,

Which Hadoop version are you using?

I have been told it's because of
https://issues.apache.org/jira/browse/MAPREDUCE-2374

JM

2013/3/12 Amit Sela :
> Hi all,
>
> I have a weird failure occurring every now and then during a MapReduce job.
>
> This is the error:
>
> java.lang.Throwable: Child Error
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of
> 255.
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> And sometimes it's the same but with status of 126.
>
> Any ideas ?
>
> Thanks.


Child error

2013-03-12 Thread Amit Sela
Hi all,

I have a weird failure occurring every now and then during a MapReduce job.

This is the error:

java.lang.Throwable: Child Error
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of
255.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

And sometimes it's the same but with status of 126.
Any ideas ?

Thanks.
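
On the two status codes in this thread (255 and 126): TaskRunner reports the
child process's raw exit status, so ordinary process conventions apply. As an
editorial assumption rather than anything stated in the thread, 126 is the
shell's "command found but not executable" status, and 255 is what an exit
value of -1 looks like after being truncated to a byte. A minimal probe
(ExitCodeProbe is a made-up name; assumes /bin/sh exists):

    import java.io.IOException;

    public class ExitCodeProbe {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Spawn a child that exits with 255 and read back its status,
            // the same value TaskRunner would report.
            Process p = new ProcessBuilder("/bin/sh", "-c", "exit 255").start();
            System.out.println("exit status = " + p.waitFor()); // prints 255
        }
    }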


Getting many Child Error : Could not reserve enough space for object heap

2013-02-14 Thread Vincent Etter
Dear all,

I am setting up and configuring a small Hadoop cluster of 11 nodes for
teaching purposes. All machines are identical, and have the following specs:

   - 4-core Intel(R) Xeon(R) CPU E3-1270 (3.5 GHz)
   - 16 GB of RAM
   - Debian Squeeze

I use a version of Hadoop 0.20.2 packaged by
Cloudera (hadoop-0.20.2-cdh3u5).

The significant configuration options I changed are:

   - mapred.tasktracker.map.tasks.maximum : 4
   - mapred.tasktracker.reduce.tasks.maximum : 2
   - mapred.child.java.opts : -Xmx1500m
   - mapred.child.ulimit : 450
   - io.sort.mb : 200
   - io.sort.factor : 64
   - io.file.buffer.size : 65536
   - mapred.jobtracker.taskScheduler
   : org.apache.hadoop.mapred.FairScheduler
   - mapred.reduce.tasks : 10
   - mapred.reduce.parallel.copies : 10
   - mapred.reduce.slowstart.completed.maps : 0.8

Most of these values were taken from the "Hadoop Operations" book.

My problem is the following: when running jobs on the cluster, I often get
the following errors in my mappers:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)

Error occurred during initialization of VM
Could not reserve enough space for object heap

At first I had a ulimit of 300, and then increased it to 450, with no
change. I don't understand why I get these memory errors: as I understand
it, each node should use at most 1 + 1 + 4*1.5 + 2*1.5 = 11 GB of RAM,
leaving plenty of margin (the first 2 GB are for the TaskTracker and
DataNode processes).

Of course, no other software is running on these machines. The JobTracker
and NameNode are on two separated machines, not part of these 11 workers.

Do any of you have any advice on how I could prevent these errors from
happening? All jobs run fine though; it's just that these failures slow
things down a bit and leave me with the impression that I got something
wrong.

Are there any issues with my configuration options, given the hardware
specs of my machines?

Thanks in advance for any help/pointer!

Cheers,

Vincent
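
One hedged observation connecting this to a later message in this digest:
mapred.child.ulimit is expressed in kilobytes, so a small value caps the
child's virtual memory far below the requested -Xmx and produces exactly
"Could not reserve enough space for object heap", even when physical RAM is
free. The symptom can be reproduced outside Hadoop (UlimitDemo is a made-up
name; assumes bash and a java binary on the PATH):

    import java.io.IOException;

    public class UlimitDemo {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Cap virtual memory at ~440 MB, then ask for a 1500 MB heap; the
            // child JVM typically fails to start with the heap-reservation error.
            Process p = new ProcessBuilder("bash", "-c",
                    "ulimit -v 450000; java -Xmx1500m -version")
                    .inheritIO().start();
            System.out.println("exit status = " + p.waitFor());
        }
    }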


RE: Map Reduce "Child Error" task failure

2012-08-21 Thread Joshi, Shrinivas
Hi Matt,

You are most probably seeing this:
https://issues.apache.org/jira/browse/MAPREDUCE-2374 

There is a single line fix for this issue. See the latest patch attached to the 
above JIRA entry.

-Shrinivas

-Original Message-
From: Matt Kennedy [mailto:stinkym...@gmail.com] 
Sent: Tuesday, August 21, 2012 2:15 PM
To: user@hadoop.apache.org
Subject: Map Reduce "Child Error" task failure

I'm encountering a sporadic error while running MapReduce jobs, it shows up in 
the console output as follows:

12/08/21 14:56:05 INFO mapred.JobClient: Task Id :
attempt_201208211430_0001_m_003538_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/08/21 14:56:05 WARN mapred.JobClient: Error reading task 
outputhttp://:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stdout
12/08/21 14:56:05 WARN mapred.JobClient: Error reading task 
outputhttp://:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stderr

The conditions look exactly like those described in:
https://issues.apache.org/jira/browse/MAPREDUCE-4003

Unfortunately, this issue is marked as closed for Apache Hadoop version 1.0.3, 
but that's the version that I'm running into this issue with.

There does seem to be a correlation between the frequency of these errors and 
the number of concurrent map tasks being executed, however the hardware 
resources on the cluster do not appear to be near their limits. I'm assuming 
that there is a knob somewhere that is maladjusted that is causing this error, 
however I haven't found it.

I did find this discussion
(https://groups.google.com/a/cloudera.org/d/topic/cdh-user/NlhvHapf3pk/discussion)
on CDH users list describing the exact same problem and the advice was to 
increase the value of the mapred.child.ulimit setting. However, I had this 
value initially unset, which should mean that the value is unlimited if my 
research is correct. Then I set the value to 3 GB (3x my setting for 
mapred.map.child.java.opts) and it still did not resolve the problem. Finally, 
out of frustration, I just added a zero at the end and now the value is 
31457280 (the unit for the setting is in KB) which is 30GB. I'm still having 
the problem.

Is anybody else seeing this issue or have an idea for a workaround?
Right now my workaround is to set the allowed failures to be very high before a 
tasktracker is blacklisted, but this has the unintended side effect of taking a 
very long time to evict legitimately messed up tasktrackers. If this error is 
indicative of some other configuration problem, I'd like to try to resolve it.

Ideas? Or should I re-open the JIRA?

Thank you for your time,
Matt




Map Reduce "Child Error" task failure

2012-08-21 Thread Matt Kennedy
I'm encountering a sporadic error while running MapReduce jobs, it
shows up in the console output as follows:

12/08/21 14:56:05 INFO mapred.JobClient: Task Id :
attempt_201208211430_0001_m_003538_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/08/21 14:56:05 WARN mapred.JobClient: Error reading task
outputhttp://:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stdout
12/08/21 14:56:05 WARN mapred.JobClient: Error reading task
outputhttp://:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stderr

The conditions look exactly like those described in:
https://issues.apache.org/jira/browse/MAPREDUCE-4003

Unfortunately, this issue is marked as closed for Apache Hadoop
version 1.0.3, but that's the version that I'm running into this issue
with.

There does seem to be a correlation between the frequency of these
errors and the number of concurrent map tasks being executed, however
the hardware resources on the cluster do not appear to be near their
limits. I'm assuming that there is a knob somewhere that is
maladjusted that is causing this error, however I haven't found it.

I did find this discussion
(https://groups.google.com/a/cloudera.org/d/topic/cdh-user/NlhvHapf3pk/discussion)
on CDH users list describing the exact same problem and the advice was
to increase the value of the mapred.child.ulimit setting. However, I
had this value initially unset, which should mean that the value is
unlimited if my research is correct. Then I set the value to 3 GB (3x
my setting for mapred.map.child.java.opts) and it still did not
resolve the problem. Finally, out of frustration, I just added a zero
at the end and now the value is 31457280 (the unit for the setting is
in KB) which is 30GB. I'm still having the problem.

Is anybody else seeing this issue or have an idea for a workaround?
Right now my workaround is to set the allowed failures to be very high
before a tasktracker is blacklisted, but this has the unintended side
effect of taking a very long time to evict legitimately messed up
tasktrackers. If this error is indicative of some other configuration
problem, I'd like to try to resolve it.

Ideas? Or should I re-open the JIRA?

Thank you for your time,
Matt