[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-11-04 Thread Alexandre Linte (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636276#comment-15636276
 ] 

Alexandre Linte commented on HIVE-15017:


Nothing new here?

> Random job failures with MapReduce and Tez
> --
>
> Key: HIVE-15017
> URL: https://issues.apache.org/jira/browse/HIVE-15017
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.0
> Environment: Hadoop 2.7.2, Hive 2.1.0
>Reporter: Alexandre Linte
>Priority: Critical
> Attachments: debug_yarn_container_mr_job_datanode03.log, 
> debug_yarn_container_mr_job_datanode05.log, hive-site.xml, hive_cli_mr.txt, 
> hive_cli_tez.txt, nodemanager_logs_mr_job.txt, 
> yarn_container_tez_job_datanode05.txt, yarn_container_tez_job_datanode06.txt, 
> yarn_syslog_mr_job.txt, yarn_syslog_tez_job.txt
>
>
> Since Hive 2.1.0, we have been facing a blocking issue on our cluster. All 
> jobs are failing randomly on both MapReduce and Tez. 
> In both cases, there is no ERROR or WARN message in the logs. You can 
> find attached:
> - hive cli output errors 
> - yarn logs for a tez and mapreduce job
> - nodemanager logs (mr only, we have the same logs with tez)
> Note: This issue does not occur with Pig jobs (mr + tez) or Spark jobs (mr), 
> so this cannot be a Hadoop / YARN issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-26 Thread Alexandre Linte (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608809#comment-15608809
 ] 

Alexandre Linte commented on HIVE-15017:


I attached the DEBUG-level container logs. 

I don't see any line matching:
{noformat}
LOG.debug("initApplication: " + Arrays.toString(commandArray));
{noformat}



[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-24 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602905#comment-15602905
 ] 

Sergey Shelukhin commented on HIVE-15017:
-

The Hadoop logs. And no, that is not wrong: container-executor should be there. 
It is (presumably) the binary executed by the command that fails, assuming 
HADOOP_YARN_HOME is set and yarn.nodemanager.linux-container-executor.path 
doesn't override it, so I wanted to double-check that the file exists.



[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-24 Thread Alexandre Linte (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601270#comment-15601270
 ] 

Alexandre Linte commented on HIVE-15017:


Which logs do you need in DEBUG mode, Hadoop or Hive?
I added the hive-site.xml to help.
The HADOOP_YARN_HOME environment variable is properly set on every datanode. I checked.
There is no "yarn.nodemanager.linux-container-executor.path" property set in 
any of the Hadoop configuration files, so the default value must be in use.
Yes, bin/container-executor is under the Yarn home. Is that wrong?



[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592620#comment-15592620
 ] 

Sergey Shelukhin commented on HIVE-15017:
-

This looks like a setup issue. Can you enable debug logging so that this line shows up: 
{noformat}
LOG.debug("initApplication: " + Arrays.toString(commandArray));
{noformat}
Is HADOOP_YARN_HOME set for the container?
Is yarn.nodemanager.linux-container-executor.path set to something non-default?
Otherwise, is there a file under Yarn home called bin/container-executor?
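
The three checks above can be scripted on a node. The following is a minimal sketch, not NodeManager code: it assumes the lookup order described in this thread (the yarn.nodemanager.linux-container-executor.path property first, otherwise bin/container-executor under HADOOP_YARN_HOME). The function names and the plain-dict representation of the configuration are hypothetical.

```python
import os
from typing import Mapping, Optional

# Property name taken from the thread; everything else here is illustrative.
CONTAINER_EXECUTOR_PROPERTY = "yarn.nodemanager.linux-container-executor.path"


def resolve_container_executor(env: Mapping[str, str],
                               conf: Mapping[str, str]) -> Optional[str]:
    """Return the container-executor path implied by env and conf, or None."""
    override = conf.get(CONTAINER_EXECUTOR_PROPERTY)
    if override:
        # A configured path takes precedence over the default location.
        return override
    yarn_home = env.get("HADOOP_YARN_HOME")
    if not yarn_home:
        # Without HADOOP_YARN_HOME there is no default location to derive.
        return None
    return os.path.join(yarn_home, "bin", "container-executor")


def check_container_executor(env: Mapping[str, str],
                             conf: Mapping[str, str]) -> str:
    """Summarise whether the resolved binary exists and is executable."""
    path = resolve_container_executor(env, conf)
    if path is None:
        return "HADOOP_YARN_HOME is not set and no override is configured"
    if not os.path.isfile(path):
        return f"missing: {path}"
    if not os.access(path, os.X_OK):
        return f"not executable: {path}"
    return f"ok: {path}"


if __name__ == "__main__":
    # Run as the NodeManager user so the environment matches the daemon's.
    print(check_container_executor(os.environ, {}))
```

Running it on each datanode with the NodeManager's environment gives one line per node that can be compared across the cluster.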



[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-20 Thread Alexandre Linte (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591369#comment-15591369
 ] 

Alexandre Linte commented on HIVE-15017:


Hi [~sershe],
The "yarn logs" command doesn't return the logs as you can see below.
{noformat}
[root@namenode01 ~]# yarn logs -applicationId application_1475850791417_0105
/Products/YARN/logs/hdfs/logs/application_1475850791417_0105 does not exist.
Log aggregation has not completed or is not enabled.
{noformat}
So I decided to dig into the logs manually. I found interesting things on both 
datanode05 and datanode06. The error "255" appears regularly, and I think this 
is the cause of the container crash.

I uploaded the relevant part of the logs.
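
Since "yarn logs" reports that log aggregation has not completed or is not enabled, one thing worth ruling out is the yarn.log-aggregation-enable setting. The sketch below is only an illustration: it parses a Hadoop-style configuration XML and reads that property. The helper names and the SAMPLE document are hypothetical, and a real check would read the cluster's own yarn-site.xml.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Mapping


def read_hadoop_properties(xml_text: str) -> Dict[str, str]:
    """Collect <property><name>/<value> pairs from a Hadoop configuration XML."""
    root = ET.fromstring(xml_text)
    props: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def log_aggregation_enabled(props: Mapping[str, str]) -> bool:
    """Treat a missing property as disabled, matching the shipped default of false."""
    return props.get("yarn.log-aggregation-enable", "false").lower() == "true"


# Hypothetical sample; replace with the contents of the real yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(log_aggregation_enabled(read_hadoop_properties(SAMPLE)))
```

The same property reader can be pointed at any other Hadoop configuration file to confirm whether a given property is set at all.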



[jira] [Commented] (HIVE-15017) Random job failures with MapReduce and Tez

2016-10-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589425#comment-15589425
 ] 

Sergey Shelukhin commented on HIVE-15017:
-

The app logs appear to be for the AMs.
Can you download the full application logs for the Tez or MR apps?
If they don't have anything for the problematic container (e.g. 
containerId=container_1475850791417_0105_01_02, 
nodeId=datanode06.bigdata.fr:60737), it might be possible to go to the node and 
try to find the container log directory to see its output.
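
To locate that directory by hand, the application ID can be derived from the container ID. The sketch below assumes the usual NodeManager layout of one directory per application with one subdirectory per container under a local log root (yarn.nodemanager.log-dirs). The root path used in the example is hypothetical, and the container ID is the one quoted above, used exactly as written.

```python
import os
import re

# container_[e<epoch>_]<clusterTimestamp>_<appSequence>_<attempt>_<containerNumber>
_CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def application_id_for(container_id: str) -> str:
    """Derive the application ID from a container ID."""
    match = _CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError(f"unrecognised container id: {container_id!r}")
    cluster_timestamp, app_sequence = match.groups()
    return f"application_{cluster_timestamp}_{app_sequence}"


def container_log_dir(nm_log_root: str, container_id: str) -> str:
    """Build <nm_log_root>/<applicationId>/<containerId>."""
    return os.path.join(nm_log_root, application_id_for(container_id), container_id)


if __name__ == "__main__":
    # "/var/log/yarn/userlogs" is a placeholder; use one of the configured log dirs.
    print(container_log_dir("/var/log/yarn/userlogs",
                            "container_1475850791417_0105_01_02"))
```

If the directory exists on the node named in nodeId, the stdout, stderr and syslog files inside it are the container output in question.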
