[jira] [Commented] (HIVE-8701) Combine nested map joins into the parent map join if possible [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218875#comment-14218875 ]

Suhas Satish commented on HIVE-8701:

[~szehon] - Can you illustrate, with an example, how you plan to optimize for lower memory utilization in the case of nested map joins?

Combine nested map joins into the parent map join if possible [Spark Branch]
Key: HIVE-8701
URL: https://issues.apache.org/jira/browse/HIVE-8701
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho

With the work in HIVE-8616 enabled, the generated plan shows that the nested map join operator isn't merged into its parent when possible. This is demonstrated in auto_join2.q. The MR plan shows that this optimization is in place. We should do the same for Spark.
{code}
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Map 2 <- Map 3 (NONE, 0)
        Map 3 <- Map 1 (NONE, 0)
      DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: key (type: string)
                      sort order: +
                      Map-reduce partition columns: key (type: string)
                      Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
        Map 2
            Map Operator Tree:
                TableScan
                  alias: src3
                  Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: UDFToDouble(key) is not null (type: boolean)
                    Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {_col0}
                        1 {value}
                      keys:
                        0 (_col0 + _col5) (type: double)
                        1 UDFToDouble(key) (type: double)
                      outputColumnNames: _col0, _col11
                      input vertices:
                        0 Map 3
                      Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col11 (type: string)
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Map 3
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key}
                        1 {key}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col5
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 31 Data size: 3196 Basic stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type:
{code}
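The merge being requested can be sketched in a toy form. The vertex model and names below are illustrative stand-ins, not Hive's real classes: consecutive map-join vertices linked by a NONE (no-shuffle) edge along the big-table stream are collapsed into a single vertex.

```python
def merge_chained_map_joins(stream):
    """stream: list of (vertex_name, operator_names) in big-table order,
    where consecutive entries are linked by NONE (no-shuffle) edges.
    A vertex starting with a MapJoinOperator that is fed by a vertex
    ending with one needs no stage boundary, so the two are folded."""
    merged = [(stream[0][0], list(stream[0][1]))]
    for name, ops in stream[1:]:
        prev_name, prev_ops = merged[-1]
        if prev_ops[-1] == "MapJoinOperator" and ops[0] == "MapJoinOperator":
            # fold the downstream map-join work into the parent vertex
            merged[-1] = (prev_name, prev_ops + list(ops))
        else:
            merged.append((name, list(ops)))
    return merged

# shaped loosely after the auto_join2.q plan above (names invented)
stream = [
    ("Map 3", ["TableScan src1", "FilterOperator", "MapJoinOperator"]),
    ("Map 2", ["MapJoinOperator", "SelectOperator", "FileSinkOperator"]),
]
merged = merge_chained_map_joins(stream)
```

After the fold, a single vertex carries both map joins, which is the shape the MR plan already produces.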
[jira] [Commented] (HIVE-8548) Integrate with remote Spark context after HIVE-8528 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208562#comment-14208562 ]

Suhas Satish commented on HIVE-8548:

Hi [~xuefuz] - Regarding unit testing the remote Spark context with local-cluster mode, we will need to use either YARN or Mesos as the cluster manager. Is that going to be our test setup? The reason is that currently, if spark.master=local, it implies a *spark-standalone cluster*, which only supports the *client* deploy mode and not the *local-cluster* deployment mode.

Integrate with remote Spark context after HIVE-8528 [Spark Branch]
Key: HIVE-8548
URL: https://issues.apache.org/jira/browse/HIVE-8548
Project: Hive
Issue Type: Bug
Components: Spark
Reporter: Xuefu Zhang
Assignee: Chengxiang Li

With HIVE-8528, HiveServer2 should use the remote Spark context to submit jobs, monitor progress, etc. This is necessary if Hive runs on a standalone cluster, YARN, or Mesos. If Hive runs with spark.master=local, we should continue using SparkContext in the current way. We take this as the root JIRA to track all Remote Spark Context integration subtasks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8548) Integrate with remote Spark context after HIVE-8528 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208778#comment-14208778 ]

Suhas Satish commented on HIVE-8548:

Thanks for clarifying [~xuefuz] and [~vanzin]. I had some misconceptions about the naming conventions.
[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202496#comment-14202496 ]

Suhas Satish commented on HIVE-8622:

[~csun] - We already have a map of each BaseWork containing the map join to its parent ReduceSinks. This exists as {{linkWorkWithReduceSinkMap}} in {{GenSparkProcContext}}. Do you think we can leverage that in some way, or replace the RSs in that map with the HashTableSinks that we introduced? It looks like we should still propagate the whole GenSparkProcContext to the {{SparkMapJoinResolver}} through {{SparkCompiler.generateTaskTree(...)}} and {{SparkCompiler.optimizeTaskPlan}}. All the state information stored there will make life a lot easier.

Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
Key: HIVE-8622
URL: https://issues.apache.org/jira/browse/HIVE-8622
Project: Hive
Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Chao
Attachments: HIVE-8622.2-spark.patch, HIVE-8622.3-spark.patch, HIVE-8622.patch

This is a sub-task of map join for Spark: https://issues.apache.org/jira/browse/HIVE-7613
It can use the baseline patch for map join: https://issues.apache.org/jira/browse/HIVE-8616
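As an illustration of the replacement idea discussed above, a plain dict can stand in for {{linkWorkWithReduceSinkMap}}; none of the shapes below are Hive's real APIs, just a sketch of swapping the tracked ReduceSinks for HashTableSinks in place.

```python
def replace_reduce_sinks(link_work_with_rs, op_trees):
    """link_work_with_rs: map-join work name -> ids of the ReduceSinks
    feeding it (a stand-in for linkWorkWithReduceSinkMap).
    op_trees: work name -> list of (op_id, op_kind) pairs.
    Rewrites each tracked ReduceSinkOperator to a HashTableSinkOperator."""
    tracked = {rs for ids in link_work_with_rs.values() for rs in ids}
    for tree in op_trees.values():
        for i, (op_id, kind) in enumerate(tree):
            if op_id in tracked and kind == "ReduceSinkOperator":
                tree[i] = (op_id, "HashTableSinkOperator")
    return op_trees

# invented example: RS_1 feeds the map join in Map 3
trees = {
    "Map 1": [("TS_0", "TableScan"), ("RS_1", "ReduceSinkOperator")],
    "Map 3": [("TS_2", "TableScan"), ("MJ_3", "MapJoinOperator")],
}
link = {"Map 3": ["RS_1"]}
replace_reduce_sinks(link, trees)
```

Only the sinks recorded in the map are rewritten; the map-join side is untouched.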
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201198#comment-14201198 ]

Suhas Satish commented on HIVE-8700:

Have a patch which now generates the HashTableSinkOperators as follows. Will be uploading a patch soon.
{code}
explain select table1.key, table2.value, table3.value from table1 join table2 on table1.key=table2.key join table3 on table1.key=table3.key;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Map 3 <- Map 1 (NONE, 0), Map 2 (NONE, 0)
      DagName: ssatish_20141106152828_299c0f54-40a8-4cf5-91f4-ecb1f420955f:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: table1
                  Statistics: Num rows: 1453 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 727 Data size: 2908 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      condition expressions:
                        0 {key}
                        1 {value}
                        2 {value}
                      keys:
                        0 key (type: int)
                        1 key (type: int)
                        2 key (type: int)
        Map 2
            Map Operator Tree:
                TableScan
                  alias: table3
                  Statistics: Num rows: 2 Data size: 216 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      condition expressions:
                        0 {key}
                        1 {value}
                        2 {value}
                      keys:
                        0 key (type: int)
                        1 key (type: int)
                        2 key (type: int)
        Map 3
            Map Operator Tree:
                TableScan
                  alias: table2
                  Statistics: Num rows: 55 Data size: 5791 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 28 Data size: 2948 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      condition expressions:
                        0 {key}
                        1 {value}
                        2 {value}
                      keys:
                        0 key (type: int)
                        1 key (type: int)
                        2 key (type: int)
                      outputColumnNames: _col0, _col6, _col11
                      input vertices:
                        0 Map 1
                        2 Map 2
                      Statistics: Num rows: 1599 Data size: 6397 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: int), _col6 (type: string), _col11 (type: string)
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 1599 Data size: 6397 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1599 Data size: 6397 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish updated HIVE-8700:
Attachment: HIVE-8700.2-spark.patch

Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
Key: HIVE-8700
URL: https://issues.apache.org/jira/browse/HIVE-8700
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, HIVE-8700.patch

With HIVE-8616 enabled, the new plan has a ReduceSinkOperator for the small tables. For example, the following represents the operator plan for the small table dec1, derived from the query
{code}
explain select /*+ MAPJOIN(dec)*/ * from dec join dec1 on dec.value=dec1.d;
{code}
{code}
Map 2
    Map Operator Tree:
        TableScan
          alias: dec1
          Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: d is not null (type: boolean)
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Reduce Output Operator
              key expressions: d (type: decimal(5,2))
              sort order: +
              Map-reduce partition columns: d (type: decimal(5,2))
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              value expressions: i (type: int)
{code}

With the new design for broadcasting small tables, we need to replace the ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201319#comment-14201319 ]

Suhas Satish commented on HIVE-8700:

It was an optimization suggested by Eclipse to catch any ClassCastExceptions at compile time instead of surprises at runtime. I think it was introduced in Java 7: http://docs.oracle.com/javase/7/docs/api/java/lang/SafeVarargs.html
I can remove it if you don't like it, but I think it offers some additional type safety during casting.
[jira] [Commented] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201344#comment-14201344 ]

Suhas Satish commented on HIVE-8621:

[~jxiang] - Are you sure any tags are being set and read on this line: {{mapJoinTableSerdes[tag]}}? Maybe a Review Board link will help. Also, the current patch does not change any default replication_number-related settings, right?

Dump small table join data for map-join [Spark Branch]
Key: HIVE-8621
URL: https://issues.apache.org/jira/browse/HIVE-8621
Project: Hive
Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Jimmy Xiang
Fix For: spark-branch
Attachments: HIVE-8621.1-spark.patch

This JIRA aims to reuse a slightly modified map-reduce distributed-cache approach in Spark to dump map-joined small tables as hash tables onto the Spark cluster's DFS. This is a sub-task of map join for Spark: https://issues.apache.org/jira/browse/HIVE-7613
It can use the baseline patch for map join: https://issues.apache.org/jira/browse/HIVE-8616

The original thought process was to use the broadcast-variable concept in Spark for the small tables. The number of broadcast variables that must be created is m x n, where 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of buckets per table (n=1 if unbucketed). But it was discovered that objects compressed with Kryo serialization on disk can occupy 20x or more memory when deserialized. For bucket join, the Spark driver has to hold all the buckets (for bucketed tables) in memory (to provide fault tolerance against executor failures), although the executors only need individual buckets in their memory. So the broadcast-variable approach may not be the right one.
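The m x n count and the deserialization blow-up above can be made concrete with back-of-envelope arithmetic. The 20x factor is the one quoted in the description; the table and bucket sizes below are invented purely for illustration.

```python
def broadcast_footprint(m, n, bucket_mb_on_disk, blowup=20):
    """m small tables with n buckets each -> m * n broadcast variables.
    The driver would hold all of them deserialized, so its footprint is
    roughly the total on-disk size times the deserialization blow-up."""
    count = m * n
    driver_mb = count * bucket_mb_on_disk * blowup
    return count, driver_mb

# e.g. 2 small tables in a 3-way join, 32 buckets each, 10 MB per bucket
count, driver_mb = broadcast_footprint(2, 32, 10)
```

Even with modest inputs, the driver-side total grows quickly, which is the core of the concern with the broadcast-variable approach.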
[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish updated HIVE-8700:
Attachment: HIVE-8700.3-spark.patch
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201354#comment-14201354 ]

Suhas Satish commented on HIVE-8700:

Removed in HIVE-8700.3-spark.patch.
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201486#comment-14201486 ]

Suhas Satish commented on HIVE-8700:

Hi [~csun], I thought the dummyStoreOperators were already introduced and taken care of in the SparkMapJoinOptimizer, but that portion of the code is commented out there. I will enable it as part of this JIRA and post an updated patch soon. Thanks for bringing it up.
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201604#comment-14201604 ]

Suhas Satish commented on HIVE-8700:

Ah yes, thanks [~csun]. Regarding the test failures, three of them seem unrelated:
{code}
org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampUtils.testTimezone
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
org.apache.hive.minikdc.TestJdbcWithMiniKdc.testNegativeTokenAuth
{code}
Does anyone know if this one is a known failure? {{org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook}}
I see it fail in HIVE-8621 as well.
[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198723#comment-14198723 ]

Suhas Satish commented on HIVE-8622:

I saw this condition in your patch:
{code}
if (containsOp(work, MapJoinOperator.class)) {
  if (containsOp(parentWork, HashTableSinkOperator.class)) {
{code}
This means that HIVE-8621, which introduces *replaceReduceSinkWithHashTableSink(..)*, should be called before this stage. To create a HashTableSinkOperator, we need to pass in the MapJoinOperator associated with it. This is available in *GenSparkProcContext*, but that doesn't get passed into the physical resolvers. We have to either pass it in or find another way to extract this information from the available physicalContext inside *SparkMapJoinResolver*, and pass it into *replaceReduceSinkWithHashTableSink(..)*.
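For illustration, a containsOp-style check is just a walk over a work's operator DAG. The dict-based node model below is a stand-in for Hive's operator classes, not the real implementation.

```python
def contains_op(work_roots, kind):
    """Depth-first search over a work's operator DAG for a node of the
    given kind; 'children' edges point at downstream operators."""
    stack, seen = list(work_roots), set()
    while stack:
        op = stack.pop()
        if id(op) in seen:          # guard against shared sub-DAGs
            continue
        seen.add(id(op))
        if op["kind"] == kind:
            return True
        stack.extend(op.get("children", []))
    return False

# invented work: TableScan -> FilterOperator -> MapJoinOperator
mj = {"kind": "MapJoinOperator", "children": []}
work = [{"kind": "TableScan",
         "children": [{"kind": "FilterOperator", "children": [mj]}]}]
```

The two nested checks in the patch then amount to asking whether the current work contains a map join and its parent work already ends in a hash-table sink.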
[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish updated HIVE-8700:
Attachment: HIVE-8700-spark.patch
[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish updated HIVE-8700:
Status: Patch Available (was: Open)
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199000#comment-14199000 ]

Suhas Satish commented on HIVE-8700:

Attaching a patch that leverages changes from Chao's HIVE-8622.patch. ReduceSinks are now converted to HashTableSinks, but the condition check *if (currentTask.getTaskTag() == Task.CONVERTED_MAPJOIN)* is currently disabled (until we decide where to enable it - either in CommonJoinResolver or somewhere else). Will also send a review request soon.
[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199002#comment-14199002 ] Suhas Satish commented on HIVE-8622: Thanks Chao, I have leveraged some of your work in this patch and uploaded a patch to HIVE-8700 to unblock you. You can continue working off that. Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch] Key: HIVE-8622 URL: https://issues.apache.org/jira/browse/HIVE-8622 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Chao Attachments: HIVE-8622.patch This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199011#comment-14199011 ] Suhas Satish commented on HIVE-8700: https://reviews.apache.org/r/27640/
[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Assignee: (was: Suhas Satish) Dump small table join data for map-join [Spark Branch] -- Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish This jira aims to re-use a slightly modified approach of the map-reduce distributed cache in Spark to dump map-joined small tables as hash tables onto the Spark DFS cluster. This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 The original thought process was to use the broadcast variable concept in Spark for the small tables. The number of broadcast variables that must be created is m x n, where 'm' is the number of small tables in the (m+1)-way join and 'n' is the number of buckets per table. If unbucketed, n = 1. But it was discovered that objects compressed with Kryo serialization on disk can occupy 20X or more memory when deserialized. For a bucket join, the Spark driver has to hold all the buckets (for bucketed tables) in memory (to provide fault-tolerance against executor failures), although the executors only need individual buckets in their memory. So the broadcast variable approach may not be the right one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
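The m x n broadcast-variable count described above is simple arithmetic; here is a minimal sketch (the class and method names are ours, purely illustrative):

```java
public class BroadcastCountSketch {
    // m small tables in an (m+1)-way join, n buckets per table (n = 1 if unbucketed).
    static int broadcastVariables(int smallTables, int bucketsPerTable) {
        return smallTables * bucketsPerTable;
    }

    public static void main(String[] args) {
        // 3-way join (2 small tables), each bucketed into 4: 2 x 4 = 8 variables.
        System.out.println(broadcastVariables(2, 4)); // prints 8
        // Same join over unbucketed tables: n = 1, so just m variables.
        System.out.println(broadcastVariables(2, 1)); // prints 2
    }
}
```

The count grows with both the join width and the bucketing, which is part of why the comment above questions the broadcast-variable approach.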
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14197096#comment-14197096 ] Suhas Satish commented on HIVE-8700: Hi Szehon, The patch still needs some work. This includes calling *physicalOptimizer.optimize()* in SparkCompiler to activate the SparkMapJoinResolver, and also making sure the CommonJoinResolver portion is commented out so that it does not interfere and throw ClassCastExceptions. There might be more hidden issues behind that, but I will try to come up with something soon.
[jira] [Assigned] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish reassigned HIVE-8700: -- Assignee: Suhas Satish (was: Szehon Ho)
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194854#comment-14194854 ] Suhas Satish commented on HIVE-8700: Thank you [~szehon]
[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8700: --- Attachment: HIVE-8700.patch
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195403#comment-14195403 ] Suhas Satish commented on HIVE-8700: Sure [~szehon]. Attaching my changeset as a patch. This compiles; I was testing at runtime, so I didn't follow the naming conventions like HIVE-8700-spark.patch, as I don't want unit tests triggered just yet.
[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193895#comment-14193895 ] Suhas Satish commented on HIVE-8700: Hi Xuefu, I was working on this as a part of HIVE-8621. Do you want to assign this task to me and swap HIVE-8621 to Szehon instead?
[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Summary: Dump small table join data for map-join [Spark Branch] (was: Dump small table join data into appropriate number of broadcast variables [Spark Branch])
[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Description: This jira aims to re-use a slightly modified approach of map-reduce distributed cache in spark to dump map-joined small tables as hash tables onto spark DFS cluster. This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 The original thought process was to use broadcast variable concept in spark, for the small tables. The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 But it was discovered that objects compressed with kryo serialization on disk, can occupy 20X or more when deserialized in-memory. For bucket join, the spark Driver has to hold all the buckets (for bucketed tables) in-memory (to provide for fault-tolerance against Executor failures) although the executors only need individual buckets in their memory. So the broadcast variable approach may not be the right approach. was: The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616
[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8616: --- Attachment: HIVE-8616.2-spark.patch convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch] - Key: HIVE-8616 URL: https://issues.apache.org/jira/browse/HIVE-8616 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Suhas Satish Assignee: Suhas Satish Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch This is a sub-task of map join on spark. The parent jira is https://issues.apache.org/jira/browse/HIVE-7613 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188061#comment-14188061 ] Suhas Satish commented on HIVE-8616: Addressed review board comments and uploaded an updated patch.
[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189370#comment-14189370 ] Suhas Satish commented on HIVE-8616: Hi Xuefu, Yes, most of these failing tests set hive.auto.convert.join = true and convert a common join into a map-join where possible. But since we don't have the HashTable sinking and SparkHashTableLoader yet, they fail downstream. I am commenting out the triggering rules in SparkCompiler and resubmitting my patch.
[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8616: --- Attachment: HIVE-8616.3-spark.patch
[jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189531#comment-14189531 ] Suhas Satish commented on HIVE-8621: So far in the Spark implementation, we are not tagging the small tables, but I realized that we need to tag them to be able to use different broadcast variables for different tables. Also, we have 2 ReduceSinks (RS) for the 2 small tables in a 3-way map-join. In M/R, we have only one HashTableSinkOperator (HTS) for all small tables combined. This conversion from RS to HTS happens in LocalMapJoinProcFactory and is triggered by rule R7 (MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the TaskCompiler.optimizeTaskPlan phase. Using logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we will end up with 2 HashTableSinks (or in general, (n-1) HTS for an n-way join). Each of these will generate its own broadcast variable. After going through Sandy Ryza's Spark presentation here, http://www.slideshare.net/SandyRyza/spark-job-failures-talk it looks like the recommended way to distribute compute in Spark is to have a large number of SparkTasks. So I think it's better to have each MapWork from each small table as a separate SparkTask. This can be tackled independently in this jira if you guys agree: https://issues.apache.org/jira/browse/HIVE-8622
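The (n-1) HashTableSink count discussed above can be illustrated with a toy n-way map-join in plain Java. All names here are hypothetical sketches, not Hive's LocalMapJoinProcFactory or SparkMapJoinResolver: each small table is materialized into its own hash table (one "sink" per table), and the single big table is streamed once, probing all n-1 of them.

```java
import java.util.*;

// Toy model of an n-way inner map-join: (n-1) hash tables for the small
// tables, one streaming pass over the big table. Illustrative names only.
public class MapJoinSketch {

    // One hash table ("sink") per small table: join key -> value.
    static List<Map<Integer, String>> buildSinks(List<Map<Integer, String>> smallTables) {
        // In a real engine each sink would serialize its table for shipping;
        // here we just copy each one into a probe-ready HashMap.
        List<Map<Integer, String>> sinks = new ArrayList<>();
        for (Map<Integer, String> t : smallTables) sinks.add(new HashMap<>(t));
        return sinks;
    }

    // Stream the big table once, probing every sink; emit joined rows.
    static List<String> mapJoin(Map<Integer, String> bigTable, List<Map<Integer, String>> sinks) {
        List<String> out = new ArrayList<>();
        outer:
        for (Map.Entry<Integer, String> row : bigTable.entrySet()) {
            StringBuilder joined = new StringBuilder(row.getValue());
            for (Map<Integer, String> sink : sinks) {
                String match = sink.get(row.getKey());
                if (match == null) continue outer;  // inner join: drop unmatched rows
                joined.append(',').append(match);
            }
            out.add(joined.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> big = new LinkedHashMap<>();
        big.put(1, "a1"); big.put(2, "a2"); big.put(3, "a3");
        // 3-way join => 2 sinks, matching the (n-1) count from the comment above.
        List<Map<Integer, String>> sinks =
            buildSinks(List.of(Map.of(1, "b1", 2, "b2"), Map.of(2, "c2", 3, "c3")));
        System.out.println(mapJoin(big, sinks));  // only key 2 matches both small tables
    }
}
```

If each sink's table is shipped separately (as the comment proposes, one broadcast variable or one SparkTask per MapWork), the sinks can be built and distributed independently of one another.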
[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189548#comment-14189548 ] Suhas Satish commented on HIVE-8616: Even with this merged into the spark branch, the following 2 rules in SparkCompiler.java need to be enabled for the dependent map-join follow-up jiras: {code} opRules.put(new RuleRegExp(new String("Convert Join to Map-join"), JoinOperator.getOperatorName() + "%"), new SparkMapJoinOptimizer()); opRules.put(new RuleRegExp("No more walking on ReduceSink-MapJoin", MapJoinOperator.getOperatorName() + "%"), new SparkReduceSinkMapJoinProc()); {code}
[jira] [Work started] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-8621 started by Suhas Satish.
[jira] [Created] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only
Suhas Satish created HIVE-8616: -- Summary: convert joinOp to MapJoinOp and generate MapWorks only Key: HIVE-8616 URL: https://issues.apache.org/jira/browse/HIVE-8616 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map join on spark. The parent jira is https://issues.apache.org/jira/browse/HIVE-7613 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8616: --- Attachment: HIVE-8616-spark.patch
[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8616: --- Status: Patch Available (was: Open) Attached a patch which addresses this sub-task. With this patch applied, this is the explain plan for a 3-way join. explain select * from table1 join table2 on (table1.key = table2.key) join table3 on table1.key = table3.key; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Spark Edges: Map 1 - Map 2 (NONE, 0), Map 3 (NONE, 0) DagName: ssatish_20141027131919_0ab004f6-5495-44b4-b7b1-16bf8ca15473:2 Vertices: Map 1 Map Operator Tree: TableScan alias: table1 Statistics: Num rows: 55 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 28 Data size: 2958 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 Inner Join 0 to 2 condition expressions: 0 {key} {value} 1 {key} {value} 2 {key} {value} keys: 0 key (type: int) 1 key (type: int) 2 key (type: int) outputColumnNames: _col0, _col1, _col5, _col6, _col10, _col11 input vertices: 1 Map 3 2 Map 2 Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col5 (type: int), _col6 (type: string), _col10 (type: int), _col11 (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Map 2 Map Operator Tree: TableScan alias: table3 Statistics: Num rows: 1 Data 
size: 140 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 1 Data size: 140 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: key (type: int) sort order: + Map-reduce partition columns: key (type: int) Statistics: Num rows: 1 Data size: 140 Basic stats: COMPLETE Column stats: NONE value expressions: value (type: string) Map 3 Map Operator Tree: TableScan alias: table2 Statistics: Num rows: 55 Data size: 5791 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 28 Data size: 2948 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: key (type: int) sort order: + Map-reduce partition columns: key (type: int) Statistics: Num rows: 28 Data size: 2948 Basic stats: COMPLETE Column stats: NONE value expressions: value (type: string) Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185878#comment-14185878 ] Suhas Satish commented on HIVE-7613: Submitted patch for HIVE-8616. This can be used as the baseline patch for subsequent sub-tasks. Research optimization of auto convert join to map join [Spark branch] - Key: HIVE-7613 URL: https://issues.apache.org/jira/browse/HIVE-7613 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Suhas Satish Priority: Minor Attachments: HIve on Spark Map join background.docx ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment replicate join) when possible. We need to research how to make it workable with Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only
[ https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185886#comment-14185886 ] Suhas Satish commented on HIVE-8616: Review board: https://reviews.apache.org/r/27247/
[jira] [Created] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable
Suhas Satish created HIVE-8621: -- Summary: Aggregate all small table join data into 1 broadcast variable Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Bug Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages.
Suhas Satish created HIVE-8622: -- Summary: Split map-join plan into 2 SparkTasks in 3 stages. Key: HIVE-8622 URL: https://issues.apache.org/jira/browse/HIVE-8622 Project: Hive Issue Type: Bug Reporter: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages.
[ https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8622: --- Issue Type: Sub-task (was: Bug) Parent: HIVE-7292 Split map-join plan into 2 SparkTasks in 3 stages. --- Key: HIVE-8622 URL: https://issues.apache.org/jira/browse/HIVE-8622 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8623) Implement SparkHashTableLoader for map-join broadcast variable read
Suhas Satish created HIVE-8623: -- Summary: Implement SparkHashTableLoader for map-join broadcast variable read Key: HIVE-8623 URL: https://issues.apache.org/jira/browse/HIVE-8623 Project: Hive Issue Type: Task Reporter: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8623) Implement SparkHashTableLoader for map-join broadcast variable read
[ https://issues.apache.org/jira/browse/HIVE-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8623: --- Issue Type: Sub-task (was: Task) Parent: HIVE-7292 Implement SparkHashTableLoader for map-join broadcast variable read --- Key: HIVE-8623 URL: https://issues.apache.org/jira/browse/HIVE-8623 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Issue Type: Sub-task (was: Bug) Parent: HIVE-7292 Aggregate all small table join data into 1 broadcast variable - Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185960#comment-14185960 ] Suhas Satish commented on HIVE-8621: Hi Szehon, Yes what you say makes sense. I had not looked too deep into MapJoinOperator when I created this jira. Thanks for pointing it out. We can rename the jira accordingly. Aggregate all small table join data into 1 broadcast variable - Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into broadcast variables
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Summary: Aggregate all small table join data into broadcast variables (was: Aggregate all small table join data into 1 broadcast variable) Aggregate all small table join data into broadcast variables Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into mxn broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Summary: Aggregate all small table join data into mxn broadcast variables [Spark Branch] (was: Aggregate all small table join data into broadcast variables [Spark Branch]) Aggregate all small table join data into mxn broadcast variables [Spark Branch] --- Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into mxn broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Description: In the title of the jira, 'm' is the number of small tables in the (m+1)- way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 was: This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Aggregate all small table join data into mxn broadcast variables [Spark Branch] --- Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish In the title of the jira, 'm' is the number of small tables in the (m+1)- way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into m x n broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Summary: Aggregate all small table join data into m x n broadcast variables [Spark Branch] (was: Aggregate all small table join data into mxn broadcast variables [Spark Branch]) Aggregate all small table join data into m x n broadcast variables [Spark Branch] - Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish In the title of the jira, 'm' is the number of small tables in the (m+1)- way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Aggregate all small table join data into m x n broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Description: The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 was: In the title of the jira, 'm' is the number of small tables in the (m+1)- way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Aggregate all small table join data into m x n broadcast variables [Spark Branch] - Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8621: --- Summary: Dump small table join data into appropriate number of broadcast variables [Spark Branch] (was: Aggregate all small table join data into m x n broadcast variables [Spark Branch]) Dump small table join data into appropriate number of broadcast variables [Spark Branch] Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186003#comment-14186003 ] Suhas Satish commented on HIVE-8621: Agreed, there was no confusion. Have updated the title and the description. Dump small table join data into appropriate number of broadcast variables [Spark Branch] Key: HIVE-8621 URL: https://issues.apache.org/jira/browse/HIVE-8621 Project: Hive Issue Type: Sub-task Reporter: Suhas Satish Assignee: Suhas Satish The number of broadcast variables that must be created is m x n where 'm' is the number of small tables in the (m+1) way join and n is the number of buckets of tables. If unbucketed, n=1 This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
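The m x n count in the HIVE-8621 description can be made concrete with a tiny helper: one broadcast variable per (small table, bucket) pair, with n = 1 for unbucketed tables. The class and method names here are illustrative, not part of Hive:

```java
// Hypothetical helper illustrating the count from the JIRA description:
// an (m+1)-way map join with m small tables, each split into n buckets,
// needs one broadcast variable per (small table, bucket) pair, i.e. m * n.
// Unbucketed tables are treated as having a single bucket (n = 1).
public class BroadcastCount {
    public static int broadcastVariables(int smallTables, int buckets) {
        if (smallTables < 0 || buckets < 1) {
            throw new IllegalArgumentException("need m >= 0 and n >= 1");
        }
        return smallTables * buckets;
    }

    public static void main(String[] args) {
        // 2 small tables in a 3-way join, 4 buckets each -> 8 broadcasts
        System.out.println(broadcastVariables(2, 4));
        // unbucketed case: n = 1, so just m broadcasts
        System.out.println(broadcastVariables(2, 1));
    }
}
```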
[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183290#comment-14183290 ] Suhas Satish commented on HIVE-7916: Not sure what solved it for you, but setting this seems to work for me on a Mac OS X - export HADOOP_OPTS=-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS Snappy-java error when running hive query on spark [Spark Branch] - Key: HIVE-7916 URL: https://issues.apache.org/jira/browse/HIVE-7916 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Labels: Spark-M1 Recently spark branch upgraded its dependency on Spark to 1.1.0-SNAPSHOT. While the new version addressed some lib conflicts (such as guava), I'm afraid that it also introduced new problems. The following might be one, when I set the master URL to be a spark standalone cluster: {code} hive set hive.execution.engine=spark; hive set spark.serializer=org.apache.spark.serializer.KryoSerializer; hive set spark.master=spark://xzdt:7077; hive select name, avg(value) from dec group by name; 14/08/28 16:41:52 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 333.0 KB, free 128.0 MB) java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219) at org.xerial.snappy.Snappy.clinit(Snappy.java:44) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:124) at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.rdd.HadoopRDD.init(HadoopRDD.scala:116) at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:541) at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88) at org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1537) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1304) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1116) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:940) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:930) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:198) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:408) at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at
[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175712#comment-14175712 ] Suhas Satish commented on HIVE-7916: I also hit the following snappy lib exceptions - I am using snappy snappy-java-1.0.5.jar. Let me try upgrading to snappy 1.1.1.3 2014-10-17 16:18:01,977 ERROR [Executor task launch worker-0]: executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0) org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229) at org.xerial.snappy.Snappy.clinit(Snappy.java:44) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1083) at org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579) at org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:126) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:731) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:789) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:731) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:727) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:727) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- 2014-10-17 16:18:02,021 INFO [main]: scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 0 failed: foreach at SparkPlan.java:80, took 3.389683 s 2014-10-17 16:18:02,021 ERROR [main]: spark.SparkClient (SparkClient.java:execute(166)) - Error executing Spark Plan org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229) org.xerial.snappy.Snappy.clinit(Snappy.java:44) org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1083) org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579) org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:126) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192) 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732) org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:731) scala.collection.Iterator$class.foreach(Iterator.scala:727) org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:789) org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:731) org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:727)
[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175771#comment-14175771 ] Suhas Satish commented on HIVE-7916: Hitting the same problem with snappy 1.1.1.3 as well. Using hive tar ball as of today (fri, oct 17, 2014) with spark.master=local Snappy-java error when running hive query on spark [Spark Branch] - Key: HIVE-7916 URL: https://issues.apache.org/jira/browse/HIVE-7916 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Labels: Spark-M1 Recently spark branch upgraded its dependency on Spark to 1.1.0-SNAPSHOT. While the new version addressed some lib conflicts (such as guava), I'm afraid that it also introduced new problems. The following might be one, when I set the master URL to be a spark standalone cluster: {code} hive set hive.execution.engine=spark; hive set spark.serializer=org.apache.spark.serializer.KryoSerializer; hive set spark.master=spark://xzdt:7077; hive select name, avg(value) from dec group by name; 14/08/28 16:41:52 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 333.0 KB, free 128.0 MB) java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219) at org.xerial.snappy.Snappy.clinit(Snappy.java:44) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:124) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.rdd.HadoopRDD.init(HadoopRDD.scala:116) at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:541) at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88) at org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1537) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1304) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1116) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:940) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:930) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:198) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:408) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675) at 
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860) at
[jira] [Commented] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167997#comment-14167997 ] Suhas Satish commented on HIVE-7551: Xuefu is right; feel free to work on this, ~Chengxiang. expand spark accumulator to support hive counter [Spark Branch] Key: HIVE-7551 URL: https://issues.apache.org/jira/browse/HIVE-7551 Project: Hive Issue Type: New Feature Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M3 Hive collects some operator statistics through counters; we need to support the MR/Tez counter counterpart through Spark accumulators. NO PRECOMMIT TESTS. This is for spark branch only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8243) clone SparkWork for join optimization
Suhas Satish created HIVE-8243: -- Summary: clone SparkWork for join optimization Key: HIVE-8243 URL: https://issues.apache.org/jira/browse/HIVE-8243 Project: Hive Issue Type: Bug Components: Spark Reporter: Suhas Satish Map-join optimization needs to clone the SparkWork containing the operator tree to make changes to it. For MapredWork, this is done through Kryo serialization/deserialization in https://issues.apache.org/jira/browse/HIVE-5263 Something similar should be done for SparkWork. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8243) clone SparkWork for join optimization
[ https://issues.apache.org/jira/browse/HIVE-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8243: --- Issue Type: Sub-task (was: Bug) Parent: HIVE-7292 clone SparkWork for join optimization - Key: HIVE-8243 URL: https://issues.apache.org/jira/browse/HIVE-8243 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Suhas Satish Labels: https://issues.apache.org/jira/browse/HIVE-5263 Map-join optimization needs to clone the SparkWork containing the operator tree to make changes to it. For MapredWork, this is done thru kryo serialization/deserialization in https://issues.apache.org/jira/browse/HIVE-5263 Something similar should be done for SparkWork -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8243) clone SparkWork for join optimization
[ https://issues.apache.org/jira/browse/HIVE-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145705#comment-14145705 ] Suhas Satish commented on HIVE-8243: Cloning via kryo.copy(), as suggested for MapredWork in https://issues.apache.org/jira/browse/HIVE-4396, may be a good approach here. clone SparkWork for join optimization - Key: HIVE-8243 URL: https://issues.apache.org/jira/browse/HIVE-8243 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Suhas Satish Labels: https://issues.apache.org/jira/browse/HIVE-5263 Map-join optimization needs to clone the SparkWork containing the operator tree to make changes to it. For MapredWork, this is done through Kryo serialization/deserialization in https://issues.apache.org/jira/browse/HIVE-5263 Something similar should be done for SparkWork. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
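The clone-by-serialization pattern proposed for SparkWork — serialize the work object, then deserialize a fully independent copy — can be sketched with JDK serialization. Hive itself uses Kryo (e.g. kryo.copy() per HIVE-4396), but the shape is the same; `Work` below is a hypothetical stand-in for SparkWork:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of deep-cloning a plan object via serialization. This uses JDK
// serialization only so the example is self-contained; the actual Hive
// code uses Kryo for the same round trip.
public class CloneSketch {
    static class Work implements Serializable {  // stand-in for SparkWork
        private static final long serialVersionUID = 1L;
        String name;
        Work(String name) { this.name = name; }
    }

    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T obj) {
        try {
            // serialize into an in-memory buffer ...
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            // ... and deserialize an independent copy
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (T) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("clone failed", e);
        }
    }

    public static void main(String[] args) {
        Work original = new Work("map-join plan");
        Work copy = deepClone(original);
        copy.name = "mutated";             // mutating the clone ...
        System.out.println(original.name); // ... leaves the original intact
    }
}
```

This is why the optimizer can freely rewrite the cloned operator tree without corrupting the original plan.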
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138567#comment-14138567 ] Suhas Satish commented on HIVE-7613: Hi Xuefu, that's a good idea. I was thinking along the lines of calling SparkContext's addFile method in each of the N-1 Spark jobs in HashTableSinkOperator.java to write the hash tables as files and then read them in the map-only join job in MapJoinOperator. But that doesn't involve RDDs. Research optimization of auto convert join to map join [Spark branch] - Key: HIVE-7613 URL: https://issues.apache.org/jira/browse/HIVE-7613 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Suhas Satish Priority: Minor Attachments: HIve on Spark Map join background.docx ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment-replicate join) when possible. We need to research how to make it workable with Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139611#comment-14139611 ] Suhas Satish commented on HIVE-7613: {{ConvertJoinMapJoin}} heavily uses {{OptimizeTezProcContext}}. Although we do have an equivalent {{OptimizeSparkProcContext}}, the two are not derived from any common ancestor class. We will need some class-hierarchy redesign/refactoring to make ConvertJoinMapJoin more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned {{SparkConvertJoinMapJoin}} class using {{OptimizeSparkProcContext}}. We might need to open a JIRA for this refactoring. Research optimization of auto convert join to map join [Spark branch] - Key: HIVE-7613 URL: https://issues.apache.org/jira/browse/HIVE-7613 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Suhas Satish Priority: Minor Attachments: HIve on Spark Map join background.docx ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment-replicate join) when possible. We need to research how to make it workable with Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks
Suhas Satish created HIVE-8183: -- Summary: make ConvertJoinMapJoin optimization pluggable for different execution frameworks Key: HIVE-8183 URL: https://issues.apache.org/jira/browse/HIVE-8183 Project: Hive Issue Type: Improvement Components: Physical Optimizer Affects Versions: 0.13.1, 0.14.0, spark-branch Reporter: Suhas Satish Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext. Although we do have an equivalent OptimizeSparkProcContext, the two are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin class using OptimizeSparkProcContext. We might need to open a JIRA for this refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
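The refactoring proposed in HIVE-8183 amounts to giving the Tez and Spark optimizer contexts a common ancestor so ConvertJoinMapJoin can be written once against the shared type instead of being cloned per framework. A hypothetical sketch of that shape is below; every name except ConvertJoinMapJoin's role is invented for illustration and is not Hive's actual class hierarchy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common ancestor for the per-framework optimizer contexts.
interface OptimizeProcContext {
    Map<String, String> getConf();          // shared: access to session config
    long getSmallTableThresholdBytes();     // shared: map-join size cutoff
}

// Per-framework contexts now differ only in implementation details.
class OptimizeTezProcContextSketch implements OptimizeProcContext {
    public Map<String, String> getConf() { return new HashMap<>(); }
    public long getSmallTableThresholdBytes() { return 10_000_000L; }
}

class OptimizeSparkProcContextSketch implements OptimizeProcContext {
    public Map<String, String> getConf() { return new HashMap<>(); }
    public long getSmallTableThresholdBytes() { return 10_000_000L; }
}

// The optimization depends only on the ancestor type, so no cloned
// SparkConvertJoinMapJoin is needed.
public class ConvertJoinMapJoinSketch {
    static boolean shouldConvert(OptimizeProcContext ctx, long smallTableSizeBytes) {
        return smallTableSizeBytes <= ctx.getSmallTableThresholdBytes();
    }

    public static void main(String[] args) {
        System.out.println(shouldConvert(new OptimizeSparkProcContextSketch(), 1_000L));
    }
}
```

The cloned-class approach taken in the interim trades this abstraction for a faster landing; the sketch shows what the follow-up JIRA would converge toward.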
[jira] [Updated] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks
[ https://issues.apache.org/jira/browse/HIVE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8183: --- Description: Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks . was: Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin class using OptimizeSparkProcContext make ConvertJoinMapJoin optimization pluggable for different execution frameworks - Key: HIVE-8183 URL: https://issues.apache.org/jira/browse/HIVE-8183 Project: Hive Issue Type: Improvement Components: Physical Optimizer Affects Versions: 0.14.0, 0.13.1, spark-branch Reporter: Suhas Satish Labels: spark Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks
[ https://issues.apache.org/jira/browse/HIVE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-8183: --- Description: Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin class using OptimizeSparkProcContext was: Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin class using OptimizeSparkProcContext We might need to open a jira for this refactoring. make ConvertJoinMapJoin optimization pluggable for different execution frameworks - Key: HIVE-8183 URL: https://issues.apache.org/jira/browse/HIVE-8183 Project: Hive Issue Type: Improvement Components: Physical Optimizer Affects Versions: 0.14.0, 0.13.1, spark-branch Reporter: Suhas Satish Labels: spark Originally introduced for Tez, ConvertJoinMapJoin heavily uses OptimizeTezProcContext . Although we do have an equivalent OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. We will need some class hierarchy redesign/refactoring to make ConvertJoinMapJoin be more generic to support multiple execution frameworks. For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin class using OptimizeSparkProcContext -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]
[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120697#comment-14120697 ] Suhas Satish commented on HIVE-7613: as a part of this work, we should also enable auto_sortmerge_join_1.q which currently fails with {code:title=auto_sortmerge_join_1.stackTrace|borderStyle=solid} 2014-09-03 16:12:59,607 ERROR [main]: spark.SparkClient (SparkClient.java:execute(166)) - Error executing Spark Plan org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 1, localhost): java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {key:0,value:val_0,ds:2008-04-08} org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:151) org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:47) org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:28) org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:99) scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1177) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1166) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1165) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1165) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1383) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} Research optimization of auto convert join to map join [Spark branch] - Key: HIVE-7613 URL: https://issues.apache.org/jira/browse/HIVE-7613 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Szehon Ho Priority: Minor Attachments: HIve on Spark Map join 
background.docx ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment replicate join) when possible. We need to research how to make it work with Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7952) Investigate query failures (1)
[ https://issues.apache.org/jira/browse/HIVE-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120706#comment-14120706 ] Suhas Satish commented on HIVE-7952: auto_sortmerge_join_1 and auto_sortmerge_join13 are covered under existing jira on Map join and the stackTrace from the test failure is listed here - https://issues.apache.org/jira/browse/HIVE-7613 Investigate query failures (1) -- Key: HIVE-7952 URL: https://issues.apache.org/jira/browse/HIVE-7952 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish I ran all q-file tests and the following failed with an exception: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-SPARK-ALL-TESTS-Build/lastCompletedBuild/testReport/ we don't necessary want to run all these tests as part of the spark tests, but we should understand why they failed with an exception. This JIRA is to look into these failures and document them with one of: * New JIRA * Covered under existing JIRA * More investigation required Tests: {noformat} org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_13 2.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_tez_fsstat 1.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynpart_sort_opt_vectorization 5.3 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_14 6.3 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_using 0.34 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_create_func1 0.96 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample_islocalmode_hook 11 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_set_show_current_role 1.4 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_owner_actions_db 0.42 sec2 
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_8 5.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_lock21.8 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_1_sql_std 2.7 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_exim_19_part_external_location 3.9 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats_empty_partition 0.67 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant1 3.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant2 2.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_show_grant 3.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_smb_mapjoin_14 2.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query1 0.93 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query4 0.26 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1 10 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_7 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7869) Build long running HS2 test framework
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118721#comment-14118721 ] Suhas Satish commented on HIVE-7869: thanks Brock. I will continue to add queries to this. Build long running HS2 test framework - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Fix For: 0.14.0 Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
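The HIVE-7869 description above outlines the long-running harness: create a number of tables once, then run queries in an endless loop until the JDBC client is killed. The control flow can be sketched as below; the JDBC work is stubbed out as Runnables so the skeleton is self-contained, and in the real program each task would execute a statement over a java.sql.Connection to HiveServer2. All names here are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Skeleton of the long-running HS2 test loop: one-time setup, then
// query forever until told to stop (in the real harness, until killed).
public class LongRunningHarnessSketch {
    static long run(int numTables, Runnable createTable, Runnable query,
                    Supplier<Boolean> keepGoing) {
        for (int i = 0; i < numTables; i++) {
            createTable.run();             // e.g. CREATE TABLE t<i> ...
        }
        long iterations = 0;
        while (keepGoing.get()) {          // real harness: loop until killed
            query.run();                   // e.g. SELECT over one of the tables
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        AtomicLong budget = new AtomicLong(5); // stand-in for "until killed"
        long n = run(3, () -> {}, () -> {}, () -> budget.decrementAndGet() >= 0);
        System.out.println(n); // prints 5
    }
}
```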
[jira] [Assigned] (HIVE-7952) Investigate query failures (1)
[ https://issues.apache.org/jira/browse/HIVE-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish reassigned HIVE-7952: -- Assignee: Suhas Satish Investigate query failures (1) -- Key: HIVE-7952 URL: https://issues.apache.org/jira/browse/HIVE-7952 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish I ran all q-file tests and the following failed with an exception: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-SPARK-ALL-TESTS-Build/lastCompletedBuild/testReport/ we don't necessary want to run all these tests as part of the spark tests, but we should understand why they failed with an exception. This JIRA is to look into these failures and document them with one of: * New JIRA * Covered under existing JIRA * More investigation required Tests: {noformat} org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_13 2.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_tez_fsstat 1.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynpart_sort_opt_vectorization 5.3 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_14 6.3 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_using 0.34 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_create_func1 0.96 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample_islocalmode_hook 11 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_set_show_current_role 1.4 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_owner_actions_db 0.42 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_8 5.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_lock21.8 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_1_sql_std 2.7 sec 2 
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_exim_19_part_external_location 3.9 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats_empty_partition 0.67 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant1 3.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant2 2.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_show_grant 3.5 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_smb_mapjoin_14 2.6 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query1 0.93 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query4 0.26 sec2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1 10 sec 2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_7 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115369#comment-14115369 ] Suhas Satish commented on HIVE-7551: Assigning to myself after talking to Na. Is this for milestone Spark-M3, as the dependent JIRAs are labeled? expand spark accumulator to support hive counter [Spark Branch] Key: HIVE-7551 URL: https://issues.apache.org/jira/browse/HIVE-7551 Project: Hive Issue Type: New Feature Components: Spark Reporter: Chengxiang Li Assignee: Na Yang Hive collects some operator statistics through counters; we need to support the MR/Tez counter counterpart through Spark accumulators. NO PRECOMMIT TESTS. This is for spark branch only. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish reassigned HIVE-7551: -- Assignee: Suhas Satish (was: Na Yang) expand spark accumulator to support hive counter [Spark Branch] Key: HIVE-7551 URL: https://issues.apache.org/jira/browse/HIVE-7551 Project: Hive Issue Type: New Feature Components: Spark Reporter: Chengxiang Li Assignee: Suhas Satish Hive collects some operator statistics through counters; we need to support the MR/Tez counter counterpart through Spark accumulators. NO PRECOMMIT TESTS. This is for spark branch only. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7775) enable sample8.q.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115380#comment-14115380 ] Suhas Satish commented on HIVE-7775: What kind of join did Szehon enable? Does Hive on Spark support full outer join? enable sample8.q.[Spark Branch] --- Key: HIVE-7775 URL: https://issues.apache.org/jira/browse/HIVE-7775 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Fix For: spark-branch Attachments: HIVE-7775.1-spark.patch, HIVE-7775.2-spark.patch, HIVE-7775.3-spark.additional.patch sample8.q contains a join query; this qtest should be enabled after Hive on Spark supports join operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7869: --- Attachment: HIVE-7869.2-spark.patch Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115909#comment-14115909 ] Suhas Satish commented on HIVE-7869: addressed review board comments Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7869: --- Status: Patch Available (was: Open) Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7869: --- Attachment: HIVE-7869-spark.patch Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114665#comment-14114665 ] Suhas Satish commented on HIVE-7869: https://reviews.apache.org/r/25177/ Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114696#comment-14114696 ] Suhas Satish commented on HIVE-7869: Test failures are not related to the patch. Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7869-spark.patch I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a standalone Java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (HIVE-7869) Long running tests (1) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish reassigned HIVE-7869: -- Assignee: Suhas Satish Long running tests (1) [Spark Branch] - Key: HIVE-7869 URL: https://issues.apache.org/jira/browse/HIVE-7869 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish I have noticed when running the full test suite locally that the test JVM eventually crashes. We should do some testing (not part of the unit tests) which starts up a HS2 and runs queries on it continuously for 24 hours or so. In this JIRA let's create a stand alone java program which connects to a HS2 over JDBC, creates a bunch of tables (say 100) and then runs queries until the JDBC client is killed. This will allow us to run long running tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Status: Patch Available (was: Open) StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: HIVE-7821.patch StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107319#comment-14107319 ] Suhas Satish commented on HIVE-7821: groupby4 has a deterministic order, so the output ordering when run on Spark is the same across test runs, but may not match the ordering in the q.out from the corresponding test run on MapReduce. StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: HIVE-7821-spark.2.patch StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.2.patch, HIVE-7821-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107376#comment-14107376 ] Suhas Satish commented on HIVE-7821: Attached updated patch generated with git diff --no-prefix StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.2.patch, HIVE-7821-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: (was: HIVE-7821-spark.2.patch) StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: HIVE-7821.2-spark.patch StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.patch, HIVE-7821.2-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: HIVE-7821.3-spark.patch StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suhas Satish updated HIVE-7821: --- Attachment: (was: HIVE-7821.2-spark.patch) StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107639#comment-14107639 ] Suhas Satish commented on HIVE-7821: rebasing patch StarterProject: enable groupby4.q - Key: HIVE-7821 URL: https://issues.apache.org/jira/browse/HIVE-7821 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Suhas Satish Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish updated HIVE-7821:
    Attachment: HIVE-7821.4-spark.patch

StarterProject: enable groupby4.q
    Key: HIVE-7821
    URL: https://issues.apache.org/jira/browse/HIVE-7821
    Project: Hive
    Issue Type: Sub-task
    Components: Spark
    Reporter: Brock Noland
    Assignee: Suhas Satish
    Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch, HIVE-7821.4-spark.patch
[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107795#comment-14107795 ]

Suhas Satish commented on HIVE-7821:
    4 of the 5 test failures are unrelated to SparkCliDriver. The one relevant failure, groupby4.q.out, had a SORT_BEFORE_DIFF left over from an experimental run. Attaching a clean patch without it.

StarterProject: enable groupby4.q
    Key: HIVE-7821
    URL: https://issues.apache.org/jira/browse/HIVE-7821
    Project: Hive
    Issue Type: Sub-task
    Components: Spark
    Reporter: Brock Noland
    Assignee: Suhas Satish
    Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch, HIVE-7821.4-spark.patch
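For context, SORT_BEFORE_DIFF is a directive in Hive's qfile test framework: the actual and expected .q.out results are line-sorted before being diffed, so nondeterministic row order does not cause spurious failures. A rough illustration of the comparison idea (not Hive's actual test-harness code):

```java
import java.util.Arrays;

public class SortBeforeDiffSketch {
    // Compare two query outputs ignoring row order, the way a qfile
    // result marked SORT_BEFORE_DIFF is compared against its golden file.
    static boolean sameIgnoringOrder(String expected, String actual) {
        String[] e = expected.split("\n");
        String[] a = actual.split("\n");
        Arrays.sort(e);  // line-sort both sides before diffing
        Arrays.sort(a);
        return Arrays.equals(e, a);
    }

    public static void main(String[] args) {
        // Same rows, different order: equal under sort-before-diff.
        System.out.println(sameIgnoringOrder("1\t10\n2\t20", "2\t20\n1\t10"));
    }
}
```

This is why a stray SORT_BEFORE_DIFF in a golden file matters: it changes how every subsequent run's output is compared, not what the query returns.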
[jira] [Assigned] (HIVE-7821) StarterProject: enable groupby4.q
[ https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suhas Satish reassigned HIVE-7821:
    Assignee: Suhas Satish (was: Chinna Rao Lalam)

StarterProject: enable groupby4.q
    Key: HIVE-7821
    URL: https://issues.apache.org/jira/browse/HIVE-7821
    Project: Hive
    Issue Type: Sub-task
    Components: Spark
    Reporter: Brock Noland
    Assignee: Suhas Satish
[jira] [Commented] (HIVE-5351) Secure-Socket-Layer (SSL) support for HiveServer2
[ https://issues.apache.org/jira/browse/HIVE-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050663#comment-14050663 ]

Suhas Satish commented on HIVE-5351:
    I have set the 3 properties above and started my HiveServer2, which is now using SSL. But how do I connect to it from the Beeline client? There doesn't seem to be any information about it. I am trying something like this:

        !connect jdbc:hive2://127.0.0.1:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;

    but when it prompts for username and password, it fails to connect even after I enter the correct ssl_truststore password:

        Enter username for jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123: mapr
        Enter password for jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123:
        Error: Invalid URL: jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123 (state=08S01,code=0)

    Is my JDBC connect string the right way to connect?

Secure-Socket-Layer (SSL) support for HiveServer2
    Key: HIVE-5351
    URL: https://issues.apache.org/jira/browse/HIVE-5351
    Project: Hive
    Issue Type: Improvement
    Components: Authorization, HiveServer2, JDBC
    Affects Versions: 0.11.0, 0.12.0
    Reporter: Prasad Mujumdar
    Assignee: Prasad Mujumdar
    Fix For: 0.13.0
    Attachments: HIVE-5301.test-binary-files.tar, HIVE-5351.3.patch, HIVE-5351.5.patch

HiveServer2 and the JDBC driver should support encrypted communication using SSL.
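For reference, the documented shape of an SSL connect string is `jdbc:hive2://host:port/db;ssl=true;sslTrustStore=path;sslTrustStorePassword=pw`, with session variables separated by semicolons after the database name. A minimal sketch of building such a URL from Java; the host, port, truststore path, and password here are placeholders, not values from this thread, and the commented-out connection requires the Hive JDBC driver on the classpath:

```java
public class Hs2SslExample {
    // Build an HS2 SSL JDBC URL; session variables follow the database
    // name, semicolon-separated, with no trailing semicolon.
    static String buildUrl(String host, int port, String db,
                           String trustStore, String trustStorePassword) {
        return String.format(
            "jdbc:hive2://%s:%d/%s;ssl=true;sslTrustStore=%s;sslTrustStorePassword=%s",
            host, port, db, trustStore, trustStorePassword);
    }

    public static void main(String[] args) {
        String url = buildUrl("hs2.example.com", 10000, "default",
                              "/opt/mapr/conf/ssl_truststore", "changeit");
        System.out.println(url);
        // Against a live server, the connection itself would look like:
        //   Class.forName("org.apache.hive.jdbc.HiveDriver");
        //   try (java.sql.Connection conn =
        //            java.sql.DriverManager.getConnection(url, "user", "pass")) { ... }
    }
}
```

One thing worth checking against the error above is any trailing semicolon after the last session variable, which some driver versions reject as an invalid URL.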
[jira] [Commented] (HIVE-4629) HS2 should support an API to retrieve query logs
[ https://issues.apache.org/jira/browse/HIVE-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926260#comment-13926260 ]

Suhas Satish commented on HIVE-4629:
    It would be great to have this JIRA accepted into the Hive trunk; I have been waiting on this for a long time as well.

HS2 should support an API to retrieve query logs
    Key: HIVE-4629
    URL: https://issues.apache.org/jira/browse/HIVE-4629
    Project: Hive
    Issue Type: Sub-task
    Components: HiveServer2
    Reporter: Shreepadma Venugopalan
    Assignee: Shreepadma Venugopalan
    Attachments: HIVE-4629-no_thrift.1.patch, HIVE-4629.1.patch, HIVE-4629.2.patch

HiveServer2 should support an API to retrieve query logs. This is particularly relevant because HiveServer2 supports async execution but doesn't provide a way to report progress. Providing an API to retrieve query logs will help report progress to the client.

-- This message was sent by Atlassian JIRA (v6.2#6252)
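On the client side, progress reporting built on such a log-retrieval API would typically be a poll loop: repeatedly fetch whatever log lines the server has accumulated while the async query runs. A minimal sketch of that pattern; `LogSource` and `fetchLogChunk` are stand-ins invented here for illustration, not any real HiveServer2 or Thrift API:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class LogPollSketch {
    // Stand-in for a server-side log buffer; a real client would make a
    // log-retrieval RPC to HiveServer2 here instead.
    static class LogSource {
        private final Queue<String> pending = new ArrayDeque<>();
        void append(String line) { pending.add(line); }
        // Returns the next unread log line, or null when caught up.
        String fetchLogChunk() { return pending.poll(); }
    }

    // Drain everything currently available into `out`; returns the
    // number of lines read, so the caller knows whether to keep polling.
    static int drain(LogSource src, StringBuilder out) {
        int n = 0;
        for (String line; (line = src.fetchLogChunk()) != null; n++) {
            out.append(line).append('\n');
        }
        return n;
    }

    public static void main(String[] args) {
        LogSource src = new LogSource();
        src.append("Stage-1 map = 0%, reduce = 0%");
        src.append("Stage-1 map = 100%, reduce = 0%");
        StringBuilder progress = new StringBuilder();
        System.out.println(drain(src, progress)); // prints 2
    }
}
```

A real implementation would interleave these drains with polling the operation's status until the query completes, which is exactly the gap the JIRA describes.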