[
https://issues.apache.org/jira/browse/HIVE-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112513#comment-14112513
]
Venki Korukanti commented on HIVE-7843:
---------------------------------------
As part of {{FileSinkOperator.getDynOutPaths()}} we create {{FSPath}} objects
for a given partition and bucket number. An {{FSPath}} contains the output file
name, a RecordWriter, and a few other fields. The RecordWriter in a given
{{FSPath}} is non-null while the FileSinkOperator is writing to the output file
in that {{FSPath}}; once writing is complete, the RecordWriter is closed and not
opened again. {{valToPaths}} holds the mapping from a partition/bucket value
(such as "state=Ca/000000") to its {{FSPath}}. For Spark we run into a condition
where we try to reuse an {{FSPath}} that has already finished writing and end up
with an NPE on its RecordWriter.
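Roughly, the failure looks like the sketch below; the class and field names are simplified stand-ins for the FileSinkOperator internals, not the actual Hive code:
{code}
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for an FSPath entry: it holds a writer that is set to
// null once the operator has finished writing that partition/bucket.
class FakeFSPath {
  StringBuilder writer = new StringBuilder(); // stand-in for the RecordWriter

  void closeWriter() {
    writer = null; // after closing, the writer is never reopened
  }
}

public class FSPathReuseSketch {
  public static void main(String[] args) {
    Map<String, FakeFSPath> valToPaths = new HashMap<>();

    // Writing to partition/bucket "state=Ca/000000" for the first time.
    FakeFSPath current = new FakeFSPath();
    valToPaths.put("state=Ca/000000", current);
    current.writer.append("row1");
    current.closeWriter(); // this partition/bucket is considered done

    // A later lookup resolves to the same, already-closed entry.
    FakeFSPath reused = valToPaths.get("state=Ca/000000");
    reused.writer.append("row2"); // NullPointerException, as in the bug
  }
}
{code}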
In {{FileSinkOperator.getDynOutPaths()}}, we call
{{Utilities.replaceTaskIdFromFilename(taskId, buckNum)}} to get the bucket file
name. It returns a bucket file name of the same length as the given task Id by
prefixing "0"s (see {{Utilities.replaceTaskId}} for details). For example, if
the task Id is 000001_00 and the bucket number is 2, the returned file name is
"000002_00". The task Id given to {{replaceTaskIdFromFilename}} comes from
"mapred.task.id" (see {{Utilities.getTaskId}} for details); if "mapred.task.id"
is not present in the conf, a random number is returned instead. In the case of
Spark, "mapred.task.id" is not set, so we get a random number, and the bucket
file name derived from it is different each time for the same bucket number.
This leads to more than one {{FSPath}} being created for the same bucket. When a
new {{FSPath}} is created, we close the current one, and the closed {{FSPath}}
can be retrieved later and cause an NPE on its RecordWriter.
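For illustration, here is a rough, self-contained paraphrase of the naming behaviour described above; the method bodies are simplified stand-ins, not the actual {{Utilities}} code:
{code}
import java.util.Random;

public class BucketFileNameSketch {
  private static final Random randGen = new Random();

  // Paraphrase of the padding behaviour: keep any suffix after the first '_'
  // and replace the leading numeric part with the bucket number, left-padded
  // with '0's to the same length as the original numeric part.
  static String bucketFileName(String taskId, int bucketNum) {
    int us = taskId.indexOf('_');
    String prefix = (us < 0) ? taskId : taskId.substring(0, us);
    String suffix = (us < 0) ? "" : taskId.substring(us);
    String bucket = String.valueOf(bucketNum);
    StringBuilder name = new StringBuilder();
    for (int i = bucket.length(); i < prefix.length(); i++) {
      name.append('0');
    }
    return name.append(bucket).append(suffix).toString();
  }

  // Stand-in for Utilities.getTaskId when "mapred.task.id" is missing:
  // every call produces a different random "task id" of varying width.
  static String randomTaskId() {
    return "" + Math.abs(randGen.nextInt());
  }

  public static void main(String[] args) {
    // With a stable MapReduce task id, bucket 2 always maps to the same name.
    System.out.println(bucketFileName("000001_00", 2)); // 000002_00
    System.out.println(bucketFileName("000001_00", 2)); // 000002_00

    // With a random task id (the Spark case), the derived name varies in
    // width from call to call, so the same bucket gets different keys.
    System.out.println(bucketFileName(randomTaskId(), 2));
    System.out.println(bucketFileName(randomTaskId(), 2));
  }
}
{code}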
I changed {{Utilities.getTaskId}} as follows to always return a random number
between 1000000 and 9999999 (a fixed 7-digit width, so the bucket file name
derived from it is the same on every call) and the test passed, but we still
need to find a proper fix:
{code}
 public static String getTaskId(Configuration hconf) {
   String taskid = (hconf == null) ? null : hconf.get("mapred.task.id");
   if ((taskid == null) || taskid.equals("")) {
-    return ("" + Math.abs(randGen.nextInt()));
+    // Generate a 7-digit random number (the first digit is always > 0).
+    // TODO: this is not efficient. Need to find if we have a task.id equivalent in Spark.
+    int rand = 1000000 + randGen.nextInt(9000000);
+    return ("" + rand);
   } else {
{code}
> orc_analyze.q fails with an assertion in FileSinkOperator [Spark Branch]
> ------------------------------------------------------------------------
>
> Key: HIVE-7843
> URL: https://issues.apache.org/jira/browse/HIVE-7843
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Venki Korukanti
> Assignee: Venki Korukanti
> Labels: Spark-M1
> Fix For: spark-branch
>
>
> {code}
> java.lang.AssertionError: data length is different from num of DP columns
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynPartDirectory(FileSinkOperator.java:809)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:730)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
> org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:502)
> org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:525)
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:198)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> {code}