Sahil Takiar created HIVE-20512:
-----------------------------------
Summary: Improve record and memory usage logging in
SparkRecordHandler
Key: HIVE-20512
URL: https://issues.apache.org/jira/browse/HIVE-20512
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Sahil Takiar
We currently log memory usage and the number of records processed in Spark tasks, but we should improve how frequently this info gets logged. Currently we use the following code to compute the next logging threshold:
{code:java}
private long getNextLogThreshold(long currentThreshold) {
  // A very simple counter to keep track of the number of rows processed by
  // the reducer. It dumps every 1 million times, and quickly before that.
  if (currentThreshold >= 1000000) {
    return currentThreshold + 1000000;
  }
  return 10 * currentThreshold;
}
{code}
The issue is that the 10x growth factor means that, after a while, a huge number of records must be processed before the next log line gets triggered.
A better approach would be to log this info at a fixed time interval. This would help in debugging tasks that appear to be hung.
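One possible shape for the interval-based approach is a small helper that tracks the time of the last log line and fires when a fixed interval has elapsed. This is only a sketch to illustrate the proposal; the class and method names below are assumptions, not part of Hive's actual SparkRecordHandler API.

{code:java}
// Hypothetical sketch: time-based logging trigger instead of a row-count
// threshold. Names (IntervalLogger, shouldLog) are illustrative only.
public class IntervalLogger {
  private final long logIntervalMs;
  private long lastLogTimeMs;

  public IntervalLogger(long logIntervalMs, long nowMs) {
    this.logIntervalMs = logIntervalMs;
    this.lastLogTimeMs = nowMs;
  }

  // Returns true (and resets the timer) once the interval has elapsed,
  // regardless of how many records were processed in the meantime.
  public boolean shouldLog(long nowMs) {
    if (nowMs - lastLogTimeMs >= logIntervalMs) {
      lastLogTimeMs = nowMs;
      return true;
    }
    return false;
  }
}
{code}

Called once per record with {{System.currentTimeMillis()}}, this would emit a log line at a steady rate even for slow or hung tasks, where the current row-count threshold may never be reached.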
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)