[
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Szehon Ho updated HIVE-8639:
----------------------------
Affects Version/s: spark-branch
Status: Patch Available (was: Open)
I have a patch for this JIRA.
Instead of making a SMB -> MapJoin path, I introduce a new unified join
processor called 'SparkJoinOptimizer' in the logical layer. It calls the
SMB or MapJoin optimizers in a certain order, depending on which flags are
set and which conversion works. Thus there is no need to write a separate
SMB -> MapJoin path.
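The "certain order" idea above can be sketched as follows. This is a hypothetical illustration, not the actual Hive code: SparkJoinOptimizerSketch, its enum, and the boolean flags are all stand-ins (the real optimizer would consult settings such as hive.auto.convert.join), and the map-join-first ordering is assumed from the note below that choosing MapJoins over SMB join may produce a BucketMapJoin.

```java
// Hypothetical sketch of a unified join processor that tries each
// conversion in a fixed order and keeps the first one that works.
// Names and flags are illustrative, not the real Hive classes/configs.
public class SparkJoinOptimizerSketch {

    enum JoinStrategy { MAP_JOIN, SMB_JOIN, COMMON_JOIN }

    // Map join is tried first, then SMB join; if neither conversion is
    // both enabled and applicable, fall back to a common (shuffle) join.
    static JoinStrategy chooseStrategy(boolean mapJoinEnabled,
                                       boolean mapJoinApplies,
                                       boolean smbEnabled,
                                       boolean smbApplies) {
        if (mapJoinEnabled && mapJoinApplies) {
            return JoinStrategy.MAP_JOIN;
        }
        if (smbEnabled && smbApplies) {
            return JoinStrategy.SMB_JOIN;
        }
        return JoinStrategy.COMMON_JOIN;
    }

    public static void main(String[] args) {
        System.out.println(chooseStrategy(true, true, true, true));    // MAP_JOIN
        System.out.println(chooseStrategy(false, false, true, true));  // SMB_JOIN
        System.out.println(chooseStrategy(false, false, false, false)); // COMMON_JOIN
    }
}
```

Keeping the ordering inside one processor is what removes the need for a dedicated SMB -> MapJoin translation step.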
Two issues have come up so far during this refactoring:
1. The NonBlockingOpDeDupProc optimizer does not update joinContext, which
prevents any SMB optimizer from running after it. Submitted a patch in
HIVE-9060 which should be committed to trunk, but it is also included in this
spark branch patch.
2. auto_sortmerge_join_9 failure. This was passing until yesterday, when
bucket-map join was enabled in HIVE-8638. As expected, by choosing MapJoins
over SMB join, a join may become a BucketMapJoin. Some of the more complicated
queries there get converted to BucketMapJoin and fail. Can probably file a new
JIRA to fix this test, as it's a BucketMapJoin issue. Might need the help of
[~jxiang] on this one.
Exception is below:
{noformat}
2014-12-10 15:31:38,527 WARN [task-result-getter-3]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 1.0 in stage 50.0 (TID 80, 172.19.8.203): java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:207)
    at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:87)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:185)
    ... 15 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.plan.BucketMapJoinContext.getMappingBigFile(BucketMapJoinContext.java:187)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.flushToFile(SparkHashTableSinkOperator.java:100)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:81)
    ... 21 more
{noformat}
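The root-cause NPE in BucketMapJoinContext.getMappingBigFile looks like a missing bucket-file mapping. A hypothetical reconstruction of that failure pattern, using illustrative names rather than the real BucketMapJoinContext internals:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the NPE pattern in the trace above: a two-level
// bucket-file lookup whose inner map was never populated for an alias.
// Names and structure are illustrative, not the actual Hive class.
public class BucketMappingNpeSketch {

    // alias -> (small-table bucket file -> big-table bucket file)
    static final Map<String, Map<String, String>> aliasBucketFileNameMapping =
            new HashMap<>();

    static String getMappingBigFile(String alias, String smallFile) {
        // If the plan became a bucket map join without this mapping being
        // filled in, get(alias) returns null and the chained get(smallFile)
        // throws java.lang.NullPointerException, as seen above.
        return aliasBucketFileNameMapping.get(alias).get(smallFile);
    }

    public static void main(String[] args) {
        try {
            getMappingBigFile("b", "bucket_0");
        } catch (NullPointerException e) {
            System.out.println("NPE: mapping never populated for alias 'b'");
        }
    }
}
```

If this reading is right, the fix would belong in whichever step converts the plan to a BucketMapJoin, ensuring the mapping is populated before SparkHashTableSinkOperator flushes to file.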
> Convert SMBJoin to MapJoin [Spark Branch]
> -----------------------------------------
>
> Key: HIVE-8639
> URL: https://issues.apache.org/jira/browse/HIVE-8639
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Szehon Ho
> Assignee: Szehon Ho
> Attachments: HIVE-8639.1-spark.patch
>
>
> HIVE-8202 supports auto-conversion of SMB Join. However, if the tables are
> partitioned, there could be a slowdown, as each mapper would need to get a
> very small chunk of a partition which has a single key. Thus, in some
> scenarios it's beneficial to convert an SMB join to a map join.
> The task is to research and support the conversion from SMB join to map join
> for the Spark execution engine. See the MapReduce equivalent in
> SortMergeJoinResolver.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)