[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8639:
--
   Resolution: Fixed
Fix Version/s: spark-branch
       Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks to Szehon for this nice piece.

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Fix For: spark-branch

 Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, 
 HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.4-spark.patch


 HIVE-8202 supports auto-conversion of SMB Join.  However, if the tables are 
 partitioned, there could be a slowdown, as each mapper would need to read a 
 very small chunk of a partition which has a single key. Thus, in some 
 scenarios it's beneficial to convert an SMB join to a map join.
 The task is to research and support the conversion from SMB join to map join 
 for the Spark execution engine.  See the MapReduce equivalent in 
 SortMergeJoinResolver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-17 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Attachment: HIVE-8639.2-spark.patch

Address review comments, update some golden files, and fix another issue.  The 
issue is that if SMBJoin and MapJoin operators are in the same tree, they 
trigger some code in SparkReduceSinkMapJoinProc and GenSparkWork that corrupts 
the graph.  In particular, those processors assumed that a MapJoin op is 
visited only once from a non-RS path (the big table), but this no longer holds 
if the big table is a child of an SMBJoin, as that itself has multiple non-RS 
parents.

The additional fix is to make sure we walk down from the SMBJoinOp only once, 
along the big-table path.  We skip further walking if we arrive from a 
small table, as no further processing is necessary there anyway.
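
To illustrate the idea of walking past the SMB join only along the big-table 
path, here is a minimal sketch.  It is not the code from the patch: the Op 
class, the bigTablePos field and the walk method are hypothetical stand-ins 
for Hive's operator tree and graph walker, and the only assumption taken from 
the comment above is that an SMB join has several non-RS parents, exactly one 
of which is the big table.

```java
import java.util.ArrayList;
import java.util.List;

public class BigTableOnlyWalk {
    /** Hypothetical stand-in for a Hive operator; not the real Operator class. */
    public static final class Op {
        final String name;
        final boolean isSmbJoin;
        final List<Op> parents = new ArrayList<>();
        final List<Op> children = new ArrayList<>();
        int bigTablePos = -1; // index into parents; only meaningful for SMB joins

        Op(String name, boolean isSmbJoin) {
            this.name = name;
            this.isSmbJoin = isSmbJoin;
        }

        static void connect(Op parent, Op child) {
            parent.children.add(child);
            child.parents.add(parent);
        }
    }

    /** Depth-first walk that passes an SMB join only when arriving from its big-table parent. */
    static void walk(Op from, Op current, List<String> visited) {
        if (current.isSmbJoin && from != null
                && current.parents.indexOf(from) != current.bigTablePos) {
            return; // arrived from a small table: stop, nothing more to process
        }
        visited.add(current.name);
        for (Op child : current.children) {
            walk(current, child, visited);
        }
    }

    /** Builds TS_big, TS_small -> SMBJOIN -> MAPJOIN and walks from both scans. */
    public static List<String> demo() {
        Op bigScan = new Op("TS_big", false);
        Op smallScan = new Op("TS_small", false);
        Op smb = new Op("SMBJOIN", true);
        Op mapJoin = new Op("MAPJOIN", false);
        Op.connect(bigScan, smb);
        Op.connect(smallScan, smb);
        smb.bigTablePos = 0; // TS_big is the big-table parent
        Op.connect(smb, mapJoin);

        List<String> visited = new ArrayList<>();
        walk(null, bigScan, visited);
        walk(null, smallScan, visited);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy graph the MapJoin below the SMB join is reached once, through the 
big-table scan, which is the invariant the processors named above rely on.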

Review Board (RB) is not working for me at the moment; I will upload the patch 
there once it is.

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch







[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-17 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Attachment: HIVE-8639.3-spark.patch

Fix some import statements.

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, 
 HIVE-8639.3-spark.patch







[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-17 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Attachment: HIVE-8639.3-spark.patch

I think there was a temporary issue downloading a build dependency, so I am 
attaching the same patch again.

I have also updated Review Board.

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, 
 HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch







[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-17 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Attachment: HIVE-8639.4-spark.patch

Update more golden files.  A large number of SMB joins got converted to 
mapjoin/bucket mapjoin.  Also, due to forward-walking, some of the operator 
numbers have changed in the cross-product check output.

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch, 
 HIVE-8639.3-spark.patch, HIVE-8639.3-spark.patch, HIVE-8639.4-spark.patch







[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-10 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Attachment: HIVE-8639.1-spark.patch

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Szehon Ho
Assignee: Chinna Rao Lalam
 Attachments: HIVE-8639.1-spark.patch







[jira] [Updated] (HIVE-8639) Convert SMBJoin to MapJoin [Spark Branch]

2014-12-10 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:

Affects Version/s: spark-branch
           Status: Patch Available  (was: Open)

I have a patch for this JIRA.

Instead of making an SMB -> MapJoin path, I introduce a new unified join 
processor called 'SparkJoinOptimizer' in the logical layer.  It calls the 
SMB or MapJoin optimizers in a certain order, depending on which flags are 
set and which conversion succeeds.  Thus there is no need to write an 
SMB -> MapJoin path.
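
As a rough illustration of such a unified dispatch, the sketch below tries 
each enabled strategy in a fixed order and stops at the first one that 
succeeds.  Everything in it is hypothetical: JoinInfo, Strategy, the 
"COMMON_JOIN" fallback label and the MapJoin-first ordering are assumptions 
for the example, not the classes, flags or ordering used by the actual 
SparkJoinOptimizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class UnifiedJoinSketch {
    /** Hypothetical summary of a join; the real optimizer inspects the operator tree. */
    public static final class JoinInfo {
        final boolean smallTableFitsInMemory;
        final boolean bucketedAndSorted;

        public JoinInfo(boolean smallTableFitsInMemory, boolean bucketedAndSorted) {
            this.smallTableFitsInMemory = smallTableFitsInMemory;
            this.bucketedAndSorted = bucketedAndSorted;
        }
    }

    /** Hypothetical strategy: a name, an on/off flag, and an attempt that reports success. */
    public static final class Strategy {
        final String name;
        final boolean enabled;
        final Predicate<JoinInfo> tryConvert;

        public Strategy(String name, boolean enabled, Predicate<JoinInfo> tryConvert) {
            this.name = name;
            this.enabled = enabled;
            this.tryConvert = tryConvert;
        }
    }

    /** Returns the name of the first enabled strategy that converts the join. */
    public static String optimize(JoinInfo join, List<Strategy> ordered) {
        for (Strategy s : ordered) {
            if (s.enabled && s.tryConvert.test(join)) {
                return s.name;
            }
        }
        return "COMMON_JOIN"; // placeholder label: no conversion applied
    }

    /** One possible ordering (an assumption): prefer MapJoin, then fall back to SMB. */
    public static List<Strategy> mapJoinFirst(boolean mapJoinEnabled, boolean smbEnabled) {
        return Arrays.asList(
            new Strategy("MAPJOIN", mapJoinEnabled, j -> j.smallTableFitsInMemory),
            new Strategy("SMBJOIN", smbEnabled, j -> j.bucketedAndSorted));
    }

    public static void main(String[] args) {
        JoinInfo both = new JoinInfo(true, true);
        System.out.println(optimize(both, mapJoinFirst(true, true)));
        System.out.println(optimize(both, mapJoinFirst(false, true)));
    }
}
```

Because the order is just a list, a separate conversion step from one join 
type to another is unnecessary in this model: changing the flags or the list 
order changes which join is chosen.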

Two issues so far during this refactoring:
1.  The NonBlockingOpDeDupProc optimizer does not update joinContext, so no 
SMB optimizer can run after it. I submitted a patch in HIVE-9060, which 
should be committed to trunk, but I am also including it in this Spark branch 
patch.

2.  auto_sortmerge_join_9 failure.  This test was passing until yesterday, 
when bucket-map join was enabled in HIVE-8638.  As expected, when a MapJoin is 
chosen over an SMB join, it may become a BucketMapJoin.  Some of the more 
complicated queries there get converted to BucketMapJoin and fail.  We can 
probably file a new JIRA to fix this test, as it's a BucketMapJoin issue.  I 
might need the help of [~jxiang] on this one.

Exception is below:
{noformat}
2014-12-10 15:31:38,527 WARN  [task-result-getter-3]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 1.0 in stage 50.0 (TID 80, 172.19.8.203): java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:207)
    at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:57)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:87)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
    at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:185)
    ... 15 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.plan.BucketMapJoinContext.getMappingBigFile(BucketMapJoinContext.java:187)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.flushToFile(SparkHashTableSinkOperator.java:100)
    at org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.closeOp(SparkHashTableSinkOperator.java:81)
    ... 21 more
{noformat}

 Convert SMBJoin to MapJoin [Spark Branch]
 -

 Key: HIVE-8639
 URL: https://issues.apache.org/jira/browse/HIVE-8639
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Szehon Ho
Assignee: Szehon Ho
 Attachments: HIVE-8639.1-spark.patch


 HIVE-8202 supports auto-conversion of SMB Join.  However, if the tables are 
 partitioned, there could be a slow down as each mapper would need to get a 
 very small chunk of a partition which has a single key. Thus, in some 
 scenarios it's beneficial to convert SMB join to map join.
 The task is to research and support the conversion from SMB join to map join 
 for Spark execution engine.  See the equivalent of MapReduce