[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-8639:
Resolution: Fixed
Fix Version/s: spark-branch
Status: Resolved (was: Patch Available)

Committed to Spark branch. Thanks to Szehon for this nice piece.

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Fix For: spark-branch
Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.4-spark.patch

HIVE-8202 supports auto-conversion of SMB Join. However, if the tables are partitioned, there could be a slowdown, as each mapper would need to get a very small chunk of a partition which has a single key. Thus, in some scenarios it's beneficial to convert SMB join to map join. The task is to research and support the conversion from SMB join to map join for the Spark execution engine. See the MapReduce equivalent in SortMergeJoinResolver.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Attachment: HIVE-8639.2-spark.patch

Address review comments, update some golden files, and fix another issue.

The issue is that if SMBJoin and MapJoin operators are in the same tree, they trigger some code in SparkReduceSinkMapJoinProc and GenSparkWork that corrupts the graph. In particular, those processors had assumed that a MapJoin op is visited only once from a non-RS path (the big-table path), but this no longer holds if the big table is a child of an SMBJoin, as the SMBJoin itself has multiple non-RS parents.

The additional fix is to make sure we walk down from the SMBJoinOp only once, along the big-table path. We skip further walking when arriving from a small table, since no further processing is necessary there anyway.

RB is not working for me at the moment; I will upload there once it is.

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch
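The walk described in this comment can be illustrated with a minimal sketch. It is not the code from the patch: `Op`, `connect`, `walk`, and `visitOrder` are hypothetical stand-ins rather than Hive classes, and the tree shape (two table scans feeding an SMB join whose output feeds a MapJoin) is an assumed example. The point it shows is that stopping at the SMB join when arriving from a small-table parent means everything below the join, including the MapJoin, is visited exactly once.

```java
import java.util.ArrayList;
import java.util.List;

class BigTableWalkSketch {
    /** Hypothetical operator node; a stand-in, not Hive's Operator class. */
    static final class Op {
        final String name;
        final int bigTablePos; // index of the big-table parent, or -1 if this is not an SMB join
        final List<Op> parents = new ArrayList<>();
        final List<Op> children = new ArrayList<>();

        Op(String name, int bigTablePos) {
            this.name = name;
            this.bigTablePos = bigTablePos;
        }

        boolean isSmbJoin() {
            return bigTablePos >= 0;
        }
    }

    static void connect(Op parent, Op child) {
        parent.children.add(child);
        child.parents.add(parent);
    }

    /** Walks down from op; at an SMB join, continues only when arriving from the big-table parent. */
    static void walk(Op op, Op from, List<String> visited) {
        if (op.isSmbJoin() && from != op.parents.get(op.bigTablePos)) {
            return; // small-table path: nothing further to process below the join
        }
        visited.add(op.name);
        for (Op child : op.children) {
            walk(child, op, visited);
        }
    }

    /** Builds the assumed example tree and returns the order in which operators are visited. */
    static List<String> visitOrder() {
        Op tsBig = new Op("TS_big", -1);
        Op tsSmall = new Op("TS_small", -1);
        Op smb = new Op("SMB", 0); // parent 0 (TS_big) is the big table
        Op mapJoin = new Op("MAPJOIN", -1);
        Op fs = new Op("FS", -1);
        connect(tsBig, smb);
        connect(tsSmall, smb);
        connect(smb, mapJoin);
        connect(mapJoin, fs);

        List<String> visited = new ArrayList<>();
        walk(tsBig, null, visited);
        walk(tsSmall, null, visited);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(visitOrder()); // prints [TS_big, SMB, MAPJOIN, FS, TS_small]
    }
}
```

Without the early return, the walk from TS_small would pass through the SMB join and reach the MapJoin a second time, which is the double visit the comment describes.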
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Attachment: HIVE-8639.3-spark.patch

Fix some import statements.

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, HIVE-8639.3-spark.patch
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Attachment: HIVE-8639.3-spark.patch

I think there was a temporary issue downloading a dependency for the build, so I am attaching the same patch again. Also updated the review board.

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Attachment: HIVE-8639.4-spark.patch

Update more golden files. A ton of SMB joins got converted to mapjoin/bucket mapjoin. Also, due to forward-walking, some of the operator numbers have changed for the cross-product check.

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.4-spark.patch
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Attachment: HIVE-8639.1-spark.patch

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Szehon Ho
Assignee: Chinna Rao Lalam
Attachments: HIVE-8639.1-spark.patch
[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szehon Ho updated HIVE-8639:
Affects Version/s: spark-branch
Status: Patch Available (was: Open)

I have a patch for this JIRA. Instead of adding an SMB-to-MapJoin conversion path, I introduce a new unified join processor called 'SparkJoinOptimizer' in the logical layer. It calls the SMB or MapJoin optimizers in an order that depends on which flags are set and which optimizer applies, so there is no need to write an SMB-to-MapJoin path.

Two issues so far during this refactoring:

1. The NonBlockingOpDeDupProc optimizer does not update joinContext, so no SMB optimizer is able to run after it. I submitted a patch in HIVE-9060, which should be committed to trunk, but I am also including it in this Spark branch patch.

2. auto_sortmerge_join_9 failure. This test was passing until yesterday, when bucket-map join was enabled in HIVE-8638. As expected, when MapJoin is chosen over SMB join, the join may become a BucketMapJoin. Some of the more complicated queries in that test get converted to BucketMapJoin and fail. We can probably file a new JIRA to fix this test, as it's a BucketMapJoin issue. Might need the help of [~jxiang] on this one.

Exception is below:

{noformat}
2014-12-10 15:31:38,527 WARN [task-result-getter-3]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 1.0 in stage 50.0 (TID 80, 172.19.8.203): java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:207)
    at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:87)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:185)
    ... 15 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.plan.BucketMapJoinContext.getMappingBigFile(BucketMapJoinContext.java:187)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.flushToFile(SparkHashTableSinkOperator.java:100)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:81)
    ... 21 more
{noformat}

Convert SMBJoin to MapJoin [Spark Branch]
Key: HIVE-8639
URL: https://issues.apache.org/jira/browse/HIVE-8639
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
Attachments: HIVE-8639.1-spark.patch
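
The idea of a single processor that tries the MapJoin and SMB optimizers in a flag-dependent order can be sketched roughly as follows. This is only an illustration under assumptions, not the SparkJoinOptimizer from the patch: the `Strategy` enum, the `JoinInfo` fields, the three boolean flags, and the applicability checks are all hypothetical, and the actual ordering rules in Hive may differ.

```java
import java.util.ArrayList;
import java.util.List;

class JoinStrategyOrderSketch {
    enum Strategy { MAP_JOIN, SMB_JOIN, COMMON_JOIN }

    /** Hypothetical summary of a join; the real optimizers inspect the operator tree instead. */
    static final class JoinInfo {
        final boolean smallTablesFitInMemory;
        final boolean bucketedAndSortedOnKeys;

        JoinInfo(boolean smallTablesFitInMemory, boolean bucketedAndSortedOnKeys) {
            this.smallTablesFitInMemory = smallTablesFitInMemory;
            this.bucketedAndSortedOnKeys = bucketedAndSortedOnKeys;
        }
    }

    /** Assumed applicability checks, standing in for "which one works". */
    static boolean applies(Strategy strategy, JoinInfo join) {
        switch (strategy) {
            case MAP_JOIN:
                return join.smallTablesFitInMemory;
            case SMB_JOIN:
                return join.bucketedAndSortedOnKeys;
            default:
                return true; // a common (shuffle) join is always possible
        }
    }

    /** Tries the enabled strategies in a flag-dependent order and returns the first that applies. */
    static Strategy choose(boolean mapJoinEnabled, boolean smbEnabled,
                           boolean preferMapJoin, JoinInfo join) {
        List<Strategy> order = new ArrayList<>();
        if (mapJoinEnabled) {
            order.add(Strategy.MAP_JOIN);
        }
        if (smbEnabled) {
            if (preferMapJoin) {
                order.add(Strategy.SMB_JOIN);    // SMB is tried after MapJoin
            } else {
                order.add(0, Strategy.SMB_JOIN); // SMB is tried first
            }
        }
        for (Strategy strategy : order) {
            if (applies(strategy, join)) {
                return strategy;
            }
        }
        return Strategy.COMMON_JOIN;
    }

    public static void main(String[] args) {
        JoinInfo join = new JoinInfo(true, true);
        System.out.println(choose(true, true, true, join));  // prints MAP_JOIN
        System.out.println(choose(true, true, false, join)); // prints SMB_JOIN
    }
}
```

Because each strategy is attempted directly against the original join, no separate step that rewrites an already-converted SMB join into a MapJoin is needed, which matches the motivation given in the comment.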