Re: Review Request: HIVE-2156: Improve Execution Error Messages

Ning Zhang Thu, 26 May 2011 10:57:12 -0700


> On 2011-05-24 20:49:24, Ning Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java, line 
> > 571
> > <https://reviews.apache.org/r/777/diff/2/?file=19556#file19556line571>
> >
> >     error code -101 is also used in TaskRunner.java to indicate OOM 
> > exception. We should define all these error code in a centralized place.
> 
> Syed Albiz wrote:
>     This was just used as something to initialize the exitVal to, that 
> specific value should never be returned unless the call to 
> runningJob.waitFor() returns the same value. I can change it to something 
> else just to avoid the collision, but should we do both the consolidation of 
> exit codes and the change to showJobDebugInfo in the same patch? They seem 
> like different changes, and consolidating the exit codes would require 
> touching several other parts of MapredLocalTask, MapRedTask and ExecDriver. 
> Would these changes fit better in a separate patch?


Yes, changing it to something else will be fine for now. We should probably 
consider consolidating all error codes into a centralized place in a separate 
JIRA. 
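
For illustration only, here is a minimal sketch of what a centralized place for 
exit codes might look like. Nothing in it comes from the patch: the class name 
ExitCodes, the constant names, and the value -102 are all hypothetical. Only 
the value -101 is taken from this thread, where it is described as the code 
TaskRunner.java uses for an OOM exception.

```java
/**
 * Hypothetical central registry of task exit codes (not part of the patch).
 * Only the value -101 comes from the review thread; all names are assumed.
 */
final class ExitCodes {
    private ExitCodes() {}

    /** Value the thread says TaskRunner.java uses to indicate an OOM exception. */
    static final int OUT_OF_MEMORY = -101;

    /** Illustrative "exit value not set yet" marker; the value -102 is assumed. */
    static final int UNINITIALIZED = -102;

    /** Maps a known exit code to a short description for error messages. */
    static String describe(int code) {
        switch (code) {
            case OUT_OF_MEMORY:
                return "out of memory";
            case UNINITIALIZED:
                return "exit value never set";
            default:
                return "unknown exit code " + code;
        }
    }
}
```

With something like this, the initial exitVal and the OOM code would be 
distinct named constants, so a collision of the kind noted above would show up 
in one file.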


> On 2011-05-24 20:49:24, Ning Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java, line 110
> > <https://reviews.apache.org/r/777/diff/2/?file=19557#file19557line110>
> >
> >     Do you have some numbers on how long it takes to get all the 
> > TaskCompletionEvents? There are cases that a job may have more than 10k 
> > tasks and all of them failed with the same error.
> >     
> >     If it takes too long you may want to consider adding a threshold to the 
> > time spent in getting all the TaskCompleteEvents.
> 
> Syed Albiz wrote:
>     I have only tested it on some of the queries in the NegativeCliDriver 
> tests, where it usually only takes <10s running in miniMR cluster mode. There 
> is a coarse timeout (default 5 minutes, configurable in 
> HiveConf.ConfVars.JOB_DEBUG_TIMEOUT) to get all TaskCompletionEvents before 
> we stop that is enforced by HadoopJobExecHelper, but it would make sense to 
> timeout grabbing TaskCompletionEvents specifically, and then print out the 
> information obtained so far instead of what this patch does, which is just 
> throw away the taskCompletionEvents gathered so far and return the "could not 
> obtain debugging info". Does that sound reasonable, or do you think the 
> coarse timeout would be sufficient?

I think 5 minutes is too long to spend getting the TaskCompletionEvents. And if 
the timeout happens, we won't get any error message from the task tracker. Can 
you get a sense of how long it takes to get a small number of 
TaskCompletionEvents in a real cluster, and then extrapolate to a large number 
(say 30k) of mappers? If that's too long, we should restrict the time spent 
fetching TaskCompletionEvents to a few seconds and spend some time retrieving 
the task logs. 
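
A rough sketch of the time-bounded fetch, assuming events arrive in batches: 
stop when the budget runs out and keep whatever was collected instead of 
throwing it away. The class BoundedEventFetcher, the method fetchWithin, and 
the fetchBatch callback are hypothetical names. The callback stands in for 
whatever call returns the next batch of events from a given start index, and 
this is not the code in JobDebugger.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical helper; not part of the patch under review. */
final class BoundedEventFetcher {
    private BoundedEventFetcher() {}

    /**
     * Pulls batches until the source is exhausted or the time budget is spent,
     * and returns everything collected so far in either case.
     *
     * @param fetchBatch   assumed callback: given a start index, returns the
     *                     next batch of events (empty or null when none remain)
     * @param budgetMillis maximum wall-clock time to spend starting new fetches
     */
    static <E> List<E> fetchWithin(IntFunction<List<E>> fetchBatch, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        List<E> collected = new ArrayList<>();
        // The deadline is checked only between batches, so one slow call can
        // still overrun the budget.
        while (System.nanoTime() - deadline < 0) {
            List<E> batch = fetchBatch.apply(collected.size());
            if (batch == null || batch.isEmpty()) {
                break; // no more events to fetch
            }
            collected.addAll(batch);
        }
        return collected; // partial results survive a timeout
    }
}
```

The caller can then print error information for the events it did get and use 
the remaining time to retrieve task logs.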


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/777/#review711
-----------------------------------------------------------


On 2011-05-24 04:29:32, Syed Albiz wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/777/
> -----------------------------------------------------------
> 
> (Updated 2011-05-24 04:29:32)
> 
> 
> Review request for hive and John Sichi.
> 
> 
> Summary
> -------
> 
> - Add local error messages to point to job logs and provide TaskIDs
> - Add a timeout to the fetching of task logs and errors
> 
> 
> This addresses bug HIVE-2156.
>     https://issues.apache.org/jira/browse/HIVE-2156
> 
> 
> Diffs
> -----
> 
>   build-common.xml 00c3680 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java dc96a1f 
>   conf/hive-default.xml 159d825 
>   ql/build.xml 449b47a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java 4717c25 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapredLocalTask.java 691f038 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 9cb407c 
>   ql/src/test/queries/clientnegative/minimr_broken_pipe.q PRE-CREATION 
>   ql/src/test/results/clientnegative/dyn_part3.q.out 5f4df65 
>   ql/src/test/results/clientnegative/minimr_broken_pipe.q.out PRE-CREATION 
>   ql/src/test/results/clientnegative/script_broken_pipe1.q.out d33d2cc 
>   ql/src/test/results/clientnegative/script_broken_pipe2.q.out afbaa44 
>   ql/src/test/results/clientnegative/script_broken_pipe3.q.out fe8f757 
>   ql/src/test/results/clientnegative/script_error.q.out c72d780 
>   ql/src/test/results/clientnegative/udf_reflect_neg.q.out f2082a3 
>   ql/src/test/results/clientnegative/udf_test_error.q.out 5fd9a00 
>   ql/src/test/results/clientnegative/udf_test_error_reduce.q.out ddc5e5b 
>   ql/src/test/templates/TestNegativeCliDriver.vm ec13f79 
> 
> Diff: https://reviews.apache.org/r/777/diff
> 
> 
> Testing
> -------
> 
> Tested TestNegativeCliDriver in both local and miniMR mode
> 
> 
> Thanks,
> 
> Syed
> 
>
