> On 2011-05-24 20:49:24, Ning Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java, line 571
> > <https://reviews.apache.org/r/777/diff/2/?file=19556#file19556line571>
> >
> > error code -101 is also used in TaskRunner.java to indicate OOM
> > exception. We should define all these error code in a centralized place.
>
> Syed Albiz wrote:
>     This was just used as something to initialize the exitVal to, that
>     specific value should never be returned unless the call to
>     runningJob.waitFor() returns the same value. I can change it to something
>     else just to avoid the collision, but should we do both the consolidation of
>     exit codes and the change to showJobDebugInfo in the same patch? They seem
>     like different changes, and consolidating the exit codes would require
>     touching several other parts of MapredLocalTask, MapRedTask and ExecDriver.
>     Would these changes fit better in a separate patch?

Yes, changing it to something else will be fine for now. We should probably consider consolidating all error codes into a centralized place in a separate JIRA.


> On 2011-05-24 20:49:24, Ning Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java, line 110
> > <https://reviews.apache.org/r/777/diff/2/?file=19557#file19557line110>
> >
> > Do you have some numbers on how long it takes to get all the
> > TaskCompletionEvents? There are cases that a job may have more than 10k
> > tasks and all of them failed with the same error.
> >
> > If it takes too long you may want to consider adding a threshold to the
> > time spent in getting all the TaskCompleteEvents.
>
> Syed Albiz wrote:
>     I have only tested it on some of the queries in the NegativeCliDriver
>     tests, where it usually only takes <10s running in miniMR cluster mode. There
>     is a coarse timeout (default 5 minutes, configurable in
>     HiveConf.ConfVars.JOB_DEBUG_TIMEOUT) to get all TaskCompletionEvents before
>     we stop that is enforced by HadoopJobExecHelper, but it would make sense to
>     timeout grabbing TaskCompletionEvents specifically, and then print out the
>     information obtained so far instead of what this patch does, which is just
>     throw away the taskCompletionEvents gathered so far and return the "could not
>     obtain debugging info". Does that sound reasonable, or do you think the
>     coarse timeout would be sufficient?

I think 5 mins is too long for getting the TaskCompletionEvents. And if the timeout happens, we won't get any error message from the task tracker. Can you get a sense of how long it takes to get a small number of TaskCompletionEvents in a real cluster, and then extrapolate to a large number (say 30k) of mappers? If that's too long, we should restrict the time spent fetching TaskCompletionEvents to a few seconds and spend some time retrieving the task logs.


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/777/#review711
-----------------------------------------------------------


On 2011-05-24 04:29:32, Syed Albiz wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/777/
> -----------------------------------------------------------
>
> (Updated 2011-05-24 04:29:32)
>
>
> Review request for hive and John Sichi.
>
>
> Summary
> -------
>
> - Add local error messages to point to job logs and provide TaskIDs
> - Add a timeout to the fetching of task logs and errors
>
>
> This addresses bug HIVE-2156.
>     https://issues.apache.org/jira/browse/HIVE-2156
>
>
> Diffs
> -----
>
>   build-common.xml 00c3680
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java dc96a1f
>   conf/hive-default.xml 159d825
>   ql/build.xml 449b47a
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java 4717c25
>   ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapredLocalTask.java 691f038
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 9cb407c
>   ql/src/test/queries/clientnegative/minimr_broken_pipe.q PRE-CREATION
>   ql/src/test/results/clientnegative/dyn_part3.q.out 5f4df65
>   ql/src/test/results/clientnegative/minimr_broken_pipe.q.out PRE-CREATION
>   ql/src/test/results/clientnegative/script_broken_pipe1.q.out d33d2cc
>   ql/src/test/results/clientnegative/script_broken_pipe2.q.out afbaa44
>   ql/src/test/results/clientnegative/script_broken_pipe3.q.out fe8f757
>   ql/src/test/results/clientnegative/script_error.q.out c72d780
>   ql/src/test/results/clientnegative/udf_reflect_neg.q.out f2082a3
>   ql/src/test/results/clientnegative/udf_test_error.q.out 5fd9a00
>   ql/src/test/results/clientnegative/udf_test_error_reduce.q.out ddc5e5b
>   ql/src/test/templates/TestNegativeCliDriver.vm ec13f79
>
> Diff: https://reviews.apache.org/r/777/diff
>
>
> Testing
> -------
>
> Tested TestNegativeCliDriver in both local and miniMR mode
>
>
> Thanks,
>
> Syed
>
>
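
For reference, the idea raised in the JobDebugger.java comment (stop fetching TaskCompletionEvents after a short time budget, but keep what was gathered so far instead of throwing it away) could look roughly like the sketch below. This is not code from the patch. BoundedEventFetcher, EventSource, Result and fetchWithBudget are hypothetical names, and EventSource is only a stand-in loosely modeled on a paged call such as RunningJob.getTaskCompletionEvents(int startFrom); checked exceptions are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch: page through events until exhausted or a time budget is spent. */
final class BoundedEventFetcher {

    /** Hypothetical paged source; returns an empty list once there is nothing left. */
    interface EventSource<T> {
        List<T> fetch(int startFrom);
    }

    /** Events gathered so far, plus a flag saying whether the budget ran out first. */
    static final class Result<T> {
        final List<T> events;
        final boolean timedOut;

        Result(List<T> events, boolean timedOut) {
            this.events = events;
            this.timedOut = timedOut;
        }
    }

    /**
     * Fetches batches until the source is exhausted or budgetMillis has elapsed.
     * The clock is injected (an assumption made for testability). The deadline is
     * only checked between batches, so one slow fetch call can still overrun it.
     */
    static <T> Result<T> fetchWithBudget(EventSource<T> source, long budgetMillis,
                                         LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        List<T> gathered = new ArrayList<>();
        while (true) {
            if (clockMillis.getAsLong() >= deadline) {
                // Budget spent: return the partial list rather than discarding it.
                return new Result<>(gathered, true);
            }
            List<T> batch = source.fetch(gathered.size());
            if (batch.isEmpty()) {
                return new Result<>(gathered, false);
            }
            gathered.addAll(batch);
        }
    }
}
```

A caller would pass System::currentTimeMillis as the clock and, when timedOut is set, report the partial events and then move on to retrieving the task logs.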