[jira] [Commented] (HIVE-8701) Combine nested map joins into the parent map join if possible [Spark Branch]

2014-11-19 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218875#comment-14218875
 ] 

Suhas Satish commented on HIVE-8701:


[~szehon] - Can you illustrate, with an example, how you plan to optimize for 
lower memory utilization in the case of nested map-joins? 

 Combine nested map joins into the parent map join if possible [Spark Branch]
 

 Key: HIVE-8701
 URL: https://issues.apache.org/jira/browse/HIVE-8701
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho

 With the work in HIVE-8616 enabled, the generated plan shows that the nested 
 map join operator isn't merged into its parent when possible. This is 
 demonstrated in auto_join2.q. The MR plan shows that this optimization is in 
 place. We should do the same for Spark.
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Spark
   Edges:
Map 2 <- Map 3 (NONE, 0)
Map 3 <- Map 1 (NONE, 0)
   DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: src2
   Statistics: Num rows: 58 Data size: 5812 Basic stats: 
 COMPLETE Column stats: NONE
   Filter Operator
 predicate: key is not null (type: boolean)
 Statistics: Num rows: 29 Data size: 2906 Basic stats: 
 COMPLETE Column stats: NONE
 Reduce Output Operator
   key expressions: key (type: string)
   sort order: +
   Map-reduce partition columns: key (type: string)
   Statistics: Num rows: 29 Data size: 2906 Basic stats: 
 COMPLETE Column stats: NONE
 Map 2 
 Map Operator Tree:
 TableScan
   alias: src3
   Statistics: Num rows: 29 Data size: 5812 Basic stats: 
 COMPLETE Column stats: NONE
   Filter Operator
 predicate: UDFToDouble(key) is not null (type: boolean)
 Statistics: Num rows: 15 Data size: 3006 Basic stats: 
 COMPLETE Column stats: NONE
 Map Join Operator
   condition map:
Inner Join 0 to 1
   condition expressions:
 0 {_col0}
 1 {value}
   keys:
 0 (_col0 + _col5) (type: double)
 1 UDFToDouble(key) (type: double)
   outputColumnNames: _col0, _col11
   input vertices:
 0 Map 3
   Statistics: Num rows: 17 Data size: 1813 Basic stats: 
 COMPLETE Column stats: NONE
   Select Operator
 expressions: _col0 (type: string), _col11 (type: 
 string)
 outputColumnNames: _col0, _col1
 Statistics: Num rows: 17 Data size: 1813 Basic stats: 
 COMPLETE Column stats: NONE
 File Output Operator
   compressed: false
   Statistics: Num rows: 17 Data size: 1813 Basic 
 stats: COMPLETE Column stats: NONE
   table:
   input format: 
 org.apache.hadoop.mapred.TextInputFormat
   output format: 
 org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
   serde: 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 Map 3 
 Map Operator Tree:
 TableScan
   alias: src1
   Statistics: Num rows: 58 Data size: 5812 Basic stats: 
 COMPLETE Column stats: NONE
   Filter Operator
 predicate: key is not null (type: boolean)
 Statistics: Num rows: 29 Data size: 2906 Basic stats: 
 COMPLETE Column stats: NONE
 Map Join Operator
   condition map:
Inner Join 0 to 1
   condition expressions:
 0 {key}
 1 {key}
   keys:
 0 key (type: string)
 1 key (type: string)
   outputColumnNames: _col0, _col5
   input vertices:
 1 Map 1
   Statistics: Num rows: 31 Data size: 3196 Basic stats: 
 COMPLETE Column stats: NONE
   Filter Operator
 predicate: (_col0 + _col5) is not null (type: 

[jira] [Commented] (HIVE-8548) Integrate with remote Spark context after HIVE-8528 [Spark Branch]

2014-11-12 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208562#comment-14208562
 ] 

Suhas Satish commented on HIVE-8548:


Hi [~xuefuz] - Regarding unit testing the remote Spark context with 
local-cluster mode, we will need to use either YARN or Mesos as the cluster 
manager. Is that going to be our test setup? 

The reason is that currently, if spark.master=local, it implies a 
*spark-standalone cluster*, which only supports the *client* deploy mode and 
not the *local-cluster* deploy mode.
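
For reference, a minimal sketch (Java; illustrative master values and app name, 
not from any Hive code) of how the master URL selects between the modes 
discussed above:
{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterModeSketch {
  public static void main(String[] args) {
    // "local[*]"                - single-JVM local mode, no separate executors
    // "local-cluster[2,1,1024]" - test-only mode that forks 2 executor JVMs
    //                             (1 core, 1024 MB each) on the local machine
    // "spark://host:7077"       - standalone cluster, client deploy mode
    SparkConf conf = new SparkConf()
        .setAppName("hive-on-spark-test")
        .setMaster("local-cluster[2,1,1024]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    sc.stop();
  }
}
{code}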

 Integrate with remote Spark context after HIVE-8528 [Spark Branch]
 --

 Key: HIVE-8548
 URL: https://issues.apache.org/jira/browse/HIVE-8548
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Chengxiang Li

 With HIVE-8528, HiveServer2 should use the remote Spark context to submit jobs 
 and monitor progress, etc. This is necessary if Hive runs on a standalone 
 cluster, YARN, or Mesos. If Hive runs with spark.master=local, we should 
 continue using SparkContext in the current way.
 We take this as the root JIRA to track all Remote Spark Context 
 integration-related subtasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8548) Integrate with remote Spark context after HIVE-8528 [Spark Branch]

2014-11-12 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208778#comment-14208778
 ] 

Suhas Satish commented on HIVE-8548:


Thanks for clarifying [~xuefuz] and [~vanzin]. I had some misconceptions about 
the naming conventions. 

 Integrate with remote Spark context after HIVE-8528 [Spark Branch]
 --

 Key: HIVE-8548
 URL: https://issues.apache.org/jira/browse/HIVE-8548
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Chengxiang Li

 With HIVE-8528, HiveServer2 should use the remote Spark context to submit jobs 
 and monitor progress, etc. This is necessary if Hive runs on a standalone 
 cluster, YARN, or Mesos. If Hive runs with spark.master=local, we should 
 continue using SparkContext in the current way.
 We take this as the root JIRA to track all Remote Spark Context 
 integration-related subtasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-07 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202496#comment-14202496
 ] 

Suhas Satish commented on HIVE-8622:


[~csun] - We already have a map of each BaseWork containing a map-join to its 
parent ReduceSinks. 
This exists as {{linkWorkWithReduceSinkMap}} in {{GenSparkProcContext}}.

Do you think we can leverage that in some way, or replace the RSs in that map 
with the HashTableSinks that we introduced? It looks like we should still 
propagate the whole GenSparkProcContext to the {{SparkMapJoinResolver}} through 
{{SparkCompiler.generateTaskTree(...)}} and {{SparkCompiler.optimizeTaskPlan}}. 

All the state information stored there will make life a lot easier. 
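
For illustration, a rough sketch of the kind of lookup this would enable 
(assuming the field has type Map<BaseWork, List<ReduceSinkOperator>>; the 
getter and the replace helper named here are hypothetical):
{code}
// Illustrative fragment only, not from any patch. Assumes
// linkWorkWithReduceSinkMap maps each BaseWork containing a map-join to the
// ReduceSinkOperators feeding it.
Map<BaseWork, List<ReduceSinkOperator>> linkMap =
    context.getLinkWorkWithReduceSinkMap();  // hypothetical getter
for (Map.Entry<BaseWork, List<ReduceSinkOperator>> entry : linkMap.entrySet()) {
  for (ReduceSinkOperator rs : entry.getValue()) {
    // replaceReduceSinkWithHashTableSink(..) is the helper from HIVE-8621
    replaceReduceSinkWithHashTableSink(entry.getKey(), rs);
  }
}
{code}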

 Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
 

 Key: HIVE-8622
 URL: https://issues.apache.org/jira/browse/HIVE-8622
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Chao
 Attachments: HIVE-8622.2-spark.patch, HIVE-8622.3-spark.patch, 
 HIVE-8622.patch


 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201198#comment-14201198
 ] 

Suhas Satish commented on HIVE-8700:


I have a patch that now generates the HashTableSinkOperators as follows; I will 
upload it soon. 

{code}
explain select table1.key, table2.value, table3.value from table1 join table2 
on table1.key=table2.key join table3 on table1.key=table3.key;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Map 3 <- Map 1 (NONE, 0), Map 2 (NONE, 0)
  DagName: ssatish_20141106152828_299c0f54-40a8-4cf5-91f4-ecb1f420955f:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: table1
  Statistics: Num rows: 1453 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 727 Data size: 2908 Basic stats: 
COMPLETE Column stats: NONE
HashTable Sink Operator
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
Map 2 
Map Operator Tree:
TableScan
  alias: table3
  Statistics: Num rows: 2 Data size: 216 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 1 Data size: 108 Basic stats: 
COMPLETE Column stats: NONE
HashTable Sink Operator
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
Map 3 
Map Operator Tree:
TableScan
  alias: table2
  Statistics: Num rows: 55 Data size: 5791 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 28 Data size: 2948 Basic stats: 
COMPLETE Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
   Inner Join 0 to 2
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
  outputColumnNames: _col0, _col6, _col11
  input vertices:
0 Map 1
2 Map 2
  Statistics: Num rows: 1599 Data size: 6397 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: int), _col6 (type: string), 
_col11 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1599 Data size: 6397 Basic stats: 
COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1599 Data size: 6397 Basic 
stats: COMPLETE Column stats: NONE
  table:
  input format: 
org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
{code}


 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query 

[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201199#comment-14201199
 ] 

Suhas Satish commented on HIVE-8700:


{code}
explain select table1.key, table2.value, table3.value from table1 join table2 
on table1.key=table2.key join table3 on table1.key=table3.key;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Map 3 <- Map 1 (NONE, 0), Map 2 (NONE, 0)
  DagName: ssatish_20141106152828_299c0f54-40a8-4cf5-91f4-ecb1f420955f:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: table1
  Statistics: Num rows: 1453 Data size: 5812 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 727 Data size: 2908 Basic stats: 
COMPLETE Column stats: NONE
HashTable Sink Operator
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
Map 2 
Map Operator Tree:
TableScan
  alias: table3
  Statistics: Num rows: 2 Data size: 216 Basic stats: COMPLETE 
Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 1 Data size: 108 Basic stats: 
COMPLETE Column stats: NONE
HashTable Sink Operator
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
Map 3 
Map Operator Tree:
TableScan
  alias: table2
  Statistics: Num rows: 55 Data size: 5791 Basic stats: 
COMPLETE Column stats: NONE
  Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 28 Data size: 2948 Basic stats: 
COMPLETE Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
   Inner Join 0 to 2
  condition expressions:
0 {key}
1 {value}
2 {value}
  keys:
0 key (type: int)
1 key (type: int)
2 key (type: int)
  outputColumnNames: _col0, _col6, _col11
  input vertices:
0 Map 1
2 Map 2
  Statistics: Num rows: 1599 Data size: 6397 Basic stats: 
COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: int), _col6 (type: string), 
_col11 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1599 Data size: 6397 Basic stats: 
COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1599 Data size: 6397 Basic 
stats: COMPLETE Column stats: NONE
  table:
  input format: 
org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
{code}

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 

[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8700:
---
Attachment: HIVE-8700.2-spark.patch

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201319#comment-14201319
 ] 

Suhas Satish commented on HIVE-8700:


It was an optimization suggested by Eclipse to catch any ClassCastExceptions at 
compile time instead of surprises at runtime. I think it was introduced in 
Java 7: 
http://docs.oracle.com/javase/7/docs/api/java/lang/SafeVarargs.html

I can remove it if you don't like it, but I think it offers some additional type 
safety during casting.
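
For context, a minimal self-contained example of the annotation (hypothetical 
class, not from the patch):
{code}
import java.util.Arrays;
import java.util.List;

public class SafeVarargsDemo {
  // @SafeVarargs asserts that the method body does nothing unsafe with its
  // generic varargs array, so javac suppresses the "unchecked generic array
  // creation" warning at every call site.
  @SafeVarargs
  static <T> List<T> listOf(T... items) {
    return Arrays.asList(items);
  }

  public static void main(String[] args) {
    List<String> tables = listOf("dec", "dec1");
    System.out.println(tables);
  }
}
{code}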

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8621) Dump small table join data for map-join [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201344#comment-14201344
 ] 

Suhas Satish commented on HIVE-8621:


[~jxiang] - Are you sure any tags are being set and read on this line?
{{mapJoinTableSerdes[tag]}}

Maybe a review board link would help. Also, the current patch does not change 
any default replication_number-related settings, right?

 Dump small table join data for map-join [Spark Branch]
 --

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Jimmy Xiang
 Fix For: spark-branch

 Attachments: HIVE-8621.1-spark.patch


 This jira aims to re-use a slightly modified form of the map-reduce 
 distributed-cache approach in Spark to dump map-joined small tables as hash 
 tables onto the Spark cluster's DFS. 
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616
 The original thought process was to use the broadcast-variable concept in 
 Spark for the small tables. 
 The number of broadcast variables that must be created is m x n, where
 'm' is the number of small tables in the (m+1)-way join and n is the number 
 of buckets of the tables. If unbucketed, n=1.
 But it was discovered that objects compressed with Kryo serialization on 
 disk can occupy 20x or more memory when deserialized. For a bucket join, 
 the Spark driver has to hold all the buckets (for bucketed tables) in memory 
 (to provide fault-tolerance against executor failures), although the 
 executors only need individual buckets in their memory. So the 
 broadcast-variable approach may not be the right approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8700:
---
Attachment: HIVE-8700.3-spark.patch

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.3-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201354#comment-14201354
 ] 

Suhas Satish commented on HIVE-8700:


Removed in HIVE-8700.3-spark.patch.

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.3-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201486#comment-14201486
 ] 

Suhas Satish commented on HIVE-8700:


Hi [~csun],
I thought the dummyStoreOperators were already introduced and taken care of in 
the SparkMapJoinOptimizer, but that portion of the code is commented out there. 
I will enable it as part of this jira and post an updated patch soon. Thanks 
for bringing it up. 


 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.3-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-06 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201604#comment-14201604
 ] 

Suhas Satish commented on HIVE-8700:


Ah yes, thanks [~csun].

Regarding the test failures, three of these seem unrelated:
{code}
org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampUtils.testTimezone
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
org.apache.hive.minikdc.TestJdbcWithMiniKdc.testNegativeTokenAuth
{code}


Does anyone know if this one is a known failure?
{{org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook}}

I have seen that failure in HIVE-8621 as well. 


 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.2-spark.patch, 
 HIVE-8700.3-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198723#comment-14198723
 ] 

Suhas Satish commented on HIVE-8622:


I saw this condition in your patch:
{code}
if (containsOp(work, MapJoinOperator.class)) {
  if (containsOp(parentWork, HashTableSinkOperator.class)) {
{code}

This means that HIVE-8621, which introduces 
*replaceReduceSinkWithHashTableSink(..)*, should be called before this stage. 
To create a HashTableSinkOperator, we need to pass in the MapJoinOperator 
associated with it. This is available in *GenSparkProcContext*, but that 
doesn't get passed into the physical resolvers. We have to either pass it in or 
find another way to extract this information from the available physicalContext 
inside *SparkMapJoinResolver* and pass it into 
*replaceReduceSinkWithHashTableSink(..)*.

 Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
 

 Key: HIVE-8622
 URL: https://issues.apache.org/jira/browse/HIVE-8622
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Chao
 Attachments: HIVE-8622.patch


 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8700:
---
Attachment: HIVE-8700-spark.patch

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8700:
---
Status: Patch Available  (was: Open)

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199000#comment-14199000
 ] 

Suhas Satish commented on HIVE-8700:


Attaching a patch that leverages changes from Chao's HIVE-8622.patch.

ReduceSinks are now converted to HashTableSinks. But the condition check 
*if (currentTask.getTaskTag() == Task.CONVERTED_MAPJOIN)* is currently disabled 
(until we decide where to enable it - either in CommonJoinResolver or somewhere 
else). I will also send a review request soon. 
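
For reference, the shape of the disabled check (a fragment; the surrounding 
resolver code is elided):
{code}
// Sketch of the currently-disabled guard discussed above. The tag
// Task.CONVERTED_MAPJOIN marks a task whose common join was converted to a
// map-join, which is the only case where the RS -> HTS rewrite applies.
if (currentTask.getTaskTag() == Task.CONVERTED_MAPJOIN) {
  // rewrite the parent ReduceSinks of the map-join into HashTableSinks
}
{code}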

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199002#comment-14199002
 ] 

Suhas Satish commented on HIVE-8622:


Thanks Chao, I have leveraged some of your work in this patch and uploaded a 
patch to HIVE-8700 to unblock you. You can continue working off that. 

 Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
 

 Key: HIVE-8622
 URL: https://issues.apache.org/jira/browse/HIVE-8622
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Chao
 Attachments: HIVE-8622.patch


 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-05 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199011#comment-14199011
 ] 

Suhas Satish commented on HIVE-8700:


https://reviews.apache.org/r/27640/

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700-spark.patch, HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]

2014-11-04 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Assignee: (was: Suhas Satish)

 Dump small table join data for map-join [Spark Branch]
 --

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish

 This jira aims to re-use a slightly modified form of the map-reduce 
 distributed-cache approach in Spark to dump map-joined small tables as hash 
 tables onto the Spark cluster's DFS. 
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616
 The original thought process was to use the broadcast-variable concept in 
 Spark for the small tables. 
 The number of broadcast variables that must be created is m x n, where
 'm' is the number of small tables in the (m+1)-way join and n is the number 
 of buckets of the tables. If unbucketed, n=1.
 But it was discovered that objects compressed with Kryo serialization on 
 disk can occupy 20x or more memory when deserialized. For a bucket join, 
 the Spark driver has to hold all the buckets (for bucketed tables) in memory 
 (to provide fault-tolerance against executor failures), although the 
 executors only need individual buckets in their memory. So the 
 broadcast-variable approach may not be the right approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-04 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197096#comment-14197096
 ] 

Suhas Satish commented on HIVE-8700:


Hi Szehon, the patch still needs some work. This includes calling 
*physicalOptimizer.optimize()* in SparkCompiler to activate the 
SparkMapJoinResolver, and also making sure the CommonJoinResolver portion is 
commented out so that it does not interfere and throw ClassCastExceptions. 
There might be more hidden issues behind that, but I will try to come up with 
something soon. 
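
For illustration, a rough sketch (a fragment; the PhysicalContext construction 
is elided, as its arguments depend on the compiler state) of the call being 
discussed:
{code}
// Sketch only: run the physical resolver chain from SparkCompiler so that
// SparkMapJoinResolver gets a chance to rewrite the plan.
PhysicalOptimizer physicalOptimizer =
    new PhysicalOptimizer(physicalContext, conf);
physicalOptimizer.optimize();
{code}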

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-03 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish reassigned HIVE-8700:
--

Assignee: Suhas Satish  (was: Szehon Ho)

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish

 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-03 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194854#comment-14194854
 ] 

Suhas Satish commented on HIVE-8700:


Thank you [~szehon]

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish

 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-03 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8700:
---
Attachment: HIVE-8700.patch

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-03 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195403#comment-14195403
 ] 

Suhas Satish commented on HIVE-8700:


Sure [~szehon], attaching my changeset as a patch. It compiles; I was testing it 
at runtime, so I didn't follow naming conventions like HIVE-8700-spark.patch, 
as I don't want unit tests triggered just yet. 

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Suhas Satish
 Attachments: HIVE-8700.patch


 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8700) Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]

2014-11-02 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193895#comment-14193895
 ] 

Suhas Satish commented on HIVE-8700:


Hi Xuefu,
I was working on this as a part of HIVE-8621. Do you want to assign this task 
to me and reassign HIVE-8621 to Szehon instead?

 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
 --

 Key: HIVE-8700
 URL: https://issues.apache.org/jira/browse/HIVE-8700
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho

 With HIVE-8616 enabled, the new plan has ReduceSinkOperator for the small 
 tables. For example, the following represents the operator plan for the small 
 table dec1 derived from query {code}explain select /*+ MAPJOIN(dec)*/ * from 
 dec join dec1 on dec.value=dec1.d;{code}
 {code}
 Map 2 
 Map Operator Tree:
 TableScan
   alias: dec1
   Statistics: Num rows: 0 Data size: 107 Basic stats: PARTIAL 
 Column stats: NONE
   Filter Operator
 predicate: d is not null (type: boolean)
 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
 Reduce Output Operator
   key expressions: d (type: decimal(5,2))
   sort order: +
   Map-reduce partition columns: d (type: decimal(5,2))
   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
 Column stats: NONE
   value expressions: i (type: int)
 {code}
 With the new design for broadcasting small tables, we need to replace the 
 ReduceSinkOperator with a HashTableSinkOperator or equivalent in the new plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]

2014-10-31 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Summary: Dump small table join data for map-join [Spark Branch]  (was: Dump 
small table join data into appropriate number of broadcast variables [Spark 
Branch])

 Dump small table join data for map-join [Spark Branch]
 --

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n, where
 'm' is the number of small tables in the (m+1)-way join and n is the number 
 of buckets of the tables. If unbucketed, n=1.
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Dump small table join data for map-join [Spark Branch]

2014-10-31 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Description: 
This jira aims to re-use a slightly modified form of the map-reduce 
distributed-cache approach in Spark to dump map-joined small tables as hash 
tables onto the Spark cluster's DFS. 
This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616

The original thought process was to use the broadcast-variable concept in Spark 
for the small tables. 
The number of broadcast variables that must be created is m x n, where
'm' is the number of small tables in the (m+1)-way join and n is the number of 
buckets of the tables. If unbucketed, n=1.

But it was discovered that objects compressed with Kryo serialization on disk 
can occupy 20x or more memory when deserialized. For a bucket join, the Spark 
driver has to hold all the buckets (for bucketed tables) in memory (to provide 
fault-tolerance against executor failures), although the executors only need 
individual buckets in their memory. So the broadcast-variable approach may not 
be the right approach. 


  was:
The number of broadcast variables that must be created is m x n where
'm' is  the number of small tables in the (m+1) way join and n is the number of 
buckets of tables. If unbucketed, n=1

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616



 Dump small table join data for map-join [Spark Branch]
 --

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 This jira aims to re-use a slightly modified form of the map-reduce 
 distributed-cache approach in Spark to dump map-joined small tables as hash 
 tables onto the Spark cluster's DFS. 
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616
 The original thought process was to use the broadcast-variable concept in 
 Spark for the small tables. 
 The number of broadcast variables that must be created is m x n, where
 'm' is the number of small tables in the (m+1)-way join and n is the number 
 of buckets of the tables. If unbucketed, n=1.
 But it was discovered that objects compressed with Kryo serialization on 
 disk can occupy 20x or more memory when deserialized. For a bucket join, 
 the Spark driver has to hold all the buckets (for bucketed tables) in memory 
 (to provide fault-tolerance against executor failures), although the 
 executors only need individual buckets in their memory. So the 
 broadcast-variable approach may not be the right approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8616:
---
Attachment: HIVE-8616.2-spark.patch

 convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
 -

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188061#comment-14188061
 ] 

Suhas Satish commented on HIVE-8616:


Addressed the review board comments and uploaded an updated patch.

 convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
 -

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189370#comment-14189370
 ] 

Suhas Satish commented on HIVE-8616:


Hi Xuefu, 
Yes, most of these failing tests set hive.auto.convert.join=true and convert a 
common join into a map-join where possible. But since we don't have the 
HashTable sinking and the SparkHashTableLoader yet, they fail downstream. 

I am commenting out the triggering rules in SparkCompiler and resubmitting my 
patch. 

 convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
 -

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8616:
---
Attachment: HIVE-8616.3-spark.patch

 convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
 -

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch, 
 HIVE-8616.3-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189531#comment-14189531
 ] 

Suhas Satish commented on HIVE-8621:


So far in the spark implementation, we are not tagging the small tables, but 
I realized that we need to tag them to be able to use different broadcast 
variables for different tables. 

Also, we have 2 reduce sinks (RS) for the 2 small tables in a 3-way map-join. 

In M/R, we have only one HashTableSink Operator (HTS) for all small tables 
combined. This conversion from RS -> HTS happens in LocalMapJoinProcFactory 
and is triggered by rule R7 
(MapReduceCompiler: MapJoinFactory.getTableScanMapJoin) in the 
TaskCompiler.optimizeTaskPlan phase. 

Using logic similar to LocalMapJoinProcFactory in SparkMapJoinResolver, we 
will end up with 2 HashTableSinks (or in general, (n-1) HTS for an n-way 
join). Each of these will generate its own broadcast variable; a sketch of 
this tagging idea follows below. 

After going through Sandy Ryza's spark presentation here, 
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
it looks like the recommended way to distribute compute in spark is to have a 
large number of SparkTasks. So I think it's better to have each MapWork from 
each small table as a separate SparkTask. This can be tackled independently 
in this jira, if you guys agree: 
https://issues.apache.org/jira/browse/HIVE-8622
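
A toy sketch of the tagging idea (names and types are invented here, not 
taken from any patch): each of the (n-1) HashTableSinks publishes its hash 
table under its own join-input tag, yielding one broadcast variable per sink.

{code}
// Toy sketch of the tagging idea (names/types invented, not from the patch).
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class TaggedBroadcastSketch {
  private final Map<Byte, Broadcast<Object>> byTag = new HashMap<>();

  // Called once per small-table work; 'tag' identifies the join input
  // (assumption: tags 1..n-1, with 0 reserved for the big table).
  public void publish(JavaSparkContext sc, byte tag, Object hashTable) {
    byTag.put(tag, sc.broadcast(hashTable));
  }

  // Called from the map-join side to fetch the hash table for an input.
  public Object lookup(byte tag) {
    return byTag.get(tag).value();
  }
}
{code}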


 Dump small table join data into appropriate number of broadcast variables 
 [Spark Branch]
 

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n where
 'm' is  the number of small tables in the (m+1) way join and n is the number 
 of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189548#comment-14189548
 ] 

Suhas Satish commented on HIVE-8616:


Even with this merged into the spark branch, the following 2 rules in 
SparkCompiler.java need to be enabled for the dependent map-join follow-up 
jiras - 

SparkCompiler.java - 
opRules.put(new RuleRegExp(new String("Convert Join to Map-join"),
    JoinOperator.getOperatorName() + "%"), new SparkMapJoinOptimizer());

opRules.put(new RuleRegExp("No more walking on ReduceSink-MapJoin",
    MapJoinOperator.getOperatorName() + "%"), new SparkReduceSinkMapJoinProc());


 convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch]
 -

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Fix For: spark-branch

 Attachments: HIVE-8616-spark.patch, HIVE-8616.2-spark.patch, 
 HIVE-8616.3-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]

2014-10-29 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-8621 started by Suhas Satish.
--
 Dump small table join data into appropriate number of broadcast variables 
 [Spark Branch]
 

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n where
 'm' is  the number of small tables in the (m+1) way join and n is the number 
 of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only

2014-10-27 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8616:
--

 Summary: convert joinOp to MapJoinOp and generate MapWorks only
 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish


This is a sub-task of map join on spark. 
The parent jira is
https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8616:
---
Attachment: HIVE-8616-spark.patch

 convert joinOp to MapJoinOp and generate MapWorks only
 --

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8616:
---
Status: Patch Available  (was: Open)

Attached a patch which addresses this sub-task. With this patch applied, this 
is the explain plan for a 3-way join. 

explain select * from table1 join table2 on (table1.key = table2.key) join 
table3 on table1.key = table3.key;

OK

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Map 1 <- Map 2 (NONE, 0), Map 3 (NONE, 0)
      DagName: ssatish_20141027131919_0ab004f6-5495-44b4-b7b1-16bf8ca15473:2
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: table1
                  Statistics: Num rows: 55 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 28 Data size: 2958 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      condition expressions:
                        0 {key} {value}
                        1 {key} {value}
                        2 {key} {value}
                      keys:
                        0 key (type: int)
                        1 key (type: int)
                        2 key (type: int)
                      outputColumnNames: _col0, _col1, _col5, _col6, _col10, _col11
                      input vertices:
                        1 Map 3
                        2 Map 2
                      Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: int), _col1 (type: string), _col5 (type: int), _col6 (type: string), _col10 (type: int), _col11 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                        Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 61 Data size: 6507 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: table3
                  Statistics: Num rows: 1 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: key (type: int)
                      sort order: +
                      Map-reduce partition columns: key (type: int)
                      Statistics: Num rows: 1 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                      value expressions: value (type: string)
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: table2
                  Statistics: Num rows: 55 Data size: 5791 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 28 Data size: 2948 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: key (type: int)
                      sort order: +
                      Map-reduce partition columns: key (type: int)
                      Statistics: Num rows: 28 Data size: 2948 Basic stats: COMPLETE Column stats: NONE
                      value expressions: value (type: string)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink



 convert joinOp to MapJoinOp and generate MapWorks only
 --

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch


 This is a sub-task of map join 

[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]

2014-10-27 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185878#comment-14185878
 ] 

Suhas Satish commented on HIVE-7613:


Submitted patch for HIVE-8616. This can be used as the baseline patch for 
subsequent sub-tasks. 

 Research optimization of auto convert join to map join [Spark branch]
 -

 Key: HIVE-7613
 URL: https://issues.apache.org/jira/browse/HIVE-7613
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Chengxiang Li
Assignee: Suhas Satish
Priority: Minor
 Attachments: HIve on Spark Map join background.docx


 ConvertJoinMapJoin is an optimization the replaces a common join(aka shuffle 
 join) with a map join(aka broadcast or fragment replicate join) when 
 possible. we need to research how to make it workable with Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8616) convert joinOp to MapJoinOp and generate MapWorks only

2014-10-27 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185886#comment-14185886
 ] 

Suhas Satish commented on HIVE-8616:


Review board:
https://reviews.apache.org/r/27247/

 convert joinOp to MapJoinOp and generate MapWorks only
 --

 Key: HIVE-8616
 URL: https://issues.apache.org/jira/browse/HIVE-8616
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
Assignee: Suhas Satish
 Attachments: HIVE-8616-spark.patch


 This is a sub-task of map join on spark. 
 The parent jira is
 https://issues.apache.org/jira/browse/HIVE-7613



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable

2014-10-27 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8621:
--

 Summary: Aggregate all small table join data into 1 broadcast 
variable
 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Bug
Reporter: Suhas Satish
Assignee: Suhas Satish


This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages.

2014-10-27 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8622:
--

 Summary: Split map-join plan into 2 SparkTasks in 3 stages. 
 Key: HIVE-8622
 URL: https://issues.apache.org/jira/browse/HIVE-8622
 Project: Hive
  Issue Type: Bug
Reporter: Suhas Satish


This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8622) Split map-join plan into 2 SparkTasks in 3 stages.

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8622:
---
Issue Type: Sub-task  (was: Bug)
Parent: HIVE-7292

 Split map-join plan into 2 SparkTasks in 3 stages. 
 ---

 Key: HIVE-8622
 URL: https://issues.apache.org/jira/browse/HIVE-8622
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8623) Implement SparkHashTableLoader for map-join broadcast variable read

2014-10-27 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8623:
--

 Summary: Implement SparkHashTableLoader for map-join broadcast 
variable read
 Key: HIVE-8623
 URL: https://issues.apache.org/jira/browse/HIVE-8623
 Project: Hive
  Issue Type: Task
Reporter: Suhas Satish


This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613
This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8623) Implement SparkHashTableLoader for map-join broadcast variable read

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8623:
---
Issue Type: Sub-task  (was: Task)
Parent: HIVE-7292

 Implement SparkHashTableLoader for map-join broadcast variable read
 ---

 Key: HIVE-8623
 URL: https://issues.apache.org/jira/browse/HIVE-8623
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Issue Type: Sub-task  (was: Bug)
Parent: HIVE-7292

 Aggregate all small table join data into 1 broadcast variable
 -

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8621) Aggregate all small table join data into 1 broadcast variable

2014-10-27 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185960#comment-14185960
 ] 

Suhas Satish commented on HIVE-8621:


Hi Szehon, yes, what you say makes sense. I had not looked too deeply into 
MapJoinOperator when I created this jira. Thanks for pointing it out. We can 
rename the jira accordingly. 

 Aggregate all small table join data into 1 broadcast variable
 -

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into broadcast variables

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Summary: Aggregate all small table join data into broadcast variables  
(was: Aggregate all small table join data into 1 broadcast variable)

 Aggregate all small table join data into broadcast variables
 

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into mxn broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Summary: Aggregate all small table join data into mxn broadcast variables 
[Spark Branch]  (was: Aggregate all small table join data into broadcast 
variables [Spark Branch])

 Aggregate all small table join data into mxn broadcast variables [Spark 
 Branch]
 ---

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into mxn broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Description: 
In the title of the jira, 'm' is  the number of small tables in the (m+1)- way 
join and n is the number of buckets of tables. If unbucketed, n=1

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


  was:
This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616



 Aggregate all small table join data into mxn broadcast variables [Spark 
 Branch]
 ---

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 In the title of the jira, 'm' is  the number of small tables in the (m+1)- 
 way join and n is the number of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into m x n broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Summary: Aggregate all small table join data into m x n broadcast variables 
[Spark Branch]  (was: Aggregate all small table join data into mxn broadcast 
variables [Spark Branch])

 Aggregate all small table join data into m x n broadcast variables [Spark 
 Branch]
 -

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 In the title of the jira, 'm' is  the number of small tables in the (m+1)- 
 way join and n is the number of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Aggregate all small table join data into m x n broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Description: 
The number of broadcast variables that must be created is m x n where
'm' is  the number of small tables in the (m+1) way join and n is the number of 
buckets of tables. If unbucketed, n=1

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616


  was:
In the title of the jira, 'm' is  the number of small tables in the (m+1)- way 
join and n is the number of buckets of tables. If unbucketed, n=1

This is a sub-task of map-join for spark 
https://issues.apache.org/jira/browse/HIVE-7613

This can use the baseline patch for map-join
https://issues.apache.org/jira/browse/HIVE-8616



 Aggregate all small table join data into m x n broadcast variables [Spark 
 Branch]
 -

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n where
 'm' is  the number of small tables in the (m+1) way join and n is the number 
 of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8621:
---
Summary: Dump small table join data into appropriate number of broadcast 
variables [Spark Branch]  (was: Aggregate all small table join data into m x n 
broadcast variables [Spark Branch])

 Dump small table join data into appropriate number of broadcast variables 
 [Spark Branch]
 

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n where
 'm' is  the number of small tables in the (m+1) way join and n is the number 
 of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8621) Dump small table join data into appropriate number of broadcast variables [Spark Branch]

2014-10-27 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186003#comment-14186003
 ] 

Suhas Satish commented on HIVE-8621:


Agreed, there was no confusion. Have updated the title and the description. 

 Dump small table join data into appropriate number of broadcast variables 
 [Spark Branch]
 

 Key: HIVE-8621
 URL: https://issues.apache.org/jira/browse/HIVE-8621
 Project: Hive
  Issue Type: Sub-task
Reporter: Suhas Satish
Assignee: Suhas Satish

 The number of broadcast variables that must be created is m x n where
 'm' is  the number of small tables in the (m+1) way join and n is the number 
 of buckets of tables. If unbucketed, n=1
 This is a sub-task of map-join for spark 
 https://issues.apache.org/jira/browse/HIVE-7613
 This can use the baseline patch for map-join
 https://issues.apache.org/jira/browse/HIVE-8616



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]

2014-10-24 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183290#comment-14183290
 ] 

Suhas Satish commented on HIVE-7916:


Not sure what solved it for you, but setting this seems to work for me on 
Mac OS X -
export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"


 Snappy-java error when running hive query on spark [Spark Branch]
 -

 Key: HIVE-7916
 URL: https://issues.apache.org/jira/browse/HIVE-7916
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang
  Labels: Spark-M1

 Recently spark branch upgraded its dependency on Spark to 1.1.0-SNAPSHOT. 
 While the new version addressed some lib conflicts (such as guava), I'm 
 afraid that it also introduced new problems. The following might be one, when 
 I set the master URL to be a spark standalone cluster:
 {code}
 hive> set hive.execution.engine=spark;
 hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
 hive> set spark.master=spark://xzdt:7077;
 hive> select name, avg(value) from dec group by name;
 14/08/28 16:41:52 INFO storage.MemoryStore: Block broadcast_0 stored as 
 values in memory (estimated size 333.0 KB, free 128.0 MB)
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
 at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
 at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
 at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
 at 
 org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:124)
 at 
 org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
 at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
 at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
 at 
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
 at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:116)
 at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:541)
 at 
 org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156)
 at 
 org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
 at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1537)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1304)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1116)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:940)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:930)
 at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:198)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:408)
 at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
 at 

[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]

2014-10-17 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175712#comment-14175712
 ] 

Suhas Satish commented on HIVE-7916:


I also hit the following snappy lib exceptions - I am using 
snappy-java-1.0.5.jar. Let me try upgrading to snappy 1.1.1.3.

2014-10-17 16:18:01,977 ERROR [Executor task launch worker-0]: 
executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 
0.0 (TID 0)
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at 
org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at 
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
at 
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1083)
at 
org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)
at 
org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:126)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:731)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:789)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:731)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:727)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:727)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

--
2014-10-17 16:18:02,021 INFO  [main]: scheduler.DAGScheduler 
(Logging.scala:logInfo(59)) - Job 0 failed: foreach at SparkPlan.java:80, took 
3.389683 s
2014-10-17 16:18:02,021 ERROR [main]: spark.SparkClient 
(SparkClient.java:execute(166)) - Error executing Spark Plan
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] 
null
org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)

org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)

org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1083)

org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)

org.apache.spark.storage.BlockManager$$anonfun$7.apply(BlockManager.scala:579)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:126)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:731)
scala.collection.Iterator$class.foreach(Iterator.scala:727)

org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:789)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:731)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:727)

[jira] [Commented] (HIVE-7916) Snappy-java error when running hive query on spark [Spark Branch]

2014-10-17 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175771#comment-14175771
 ] 

Suhas Satish commented on HIVE-7916:


Hitting the same problem with snappy 1.1.1.3 as well. Using the hive tarball 
as of today (Fri, Oct 17, 2014) with spark.master=local


 Snappy-java error when running hive query on spark [Spark Branch]
 -

 Key: HIVE-7916
 URL: https://issues.apache.org/jira/browse/HIVE-7916
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang
  Labels: Spark-M1

 Recently spark branch upgraded its dependency on Spark to 1.1.0-SNAPSHOT. 
 While the new version addressed some lib conflicts (such as guava), I'm 
 afraid that it also introduced new problems. The following might be one, when 
 I set the master URL to be a spark standalone cluster:
 {code}
 hive> set hive.execution.engine=spark;
 hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
 hive> set spark.master=spark://xzdt:7077;
 hive> select name, avg(value) from dec group by name;
 14/08/28 16:41:52 INFO storage.MemoryStore: Block broadcast_0 stored as 
 values in memory (estimated size 333.0 KB, free 128.0 MB)
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
 at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
 at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
 at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
 at 
 org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:124)
 at 
 org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
 at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
 at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
 at 
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
 at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:116)
 at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:541)
 at 
 org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156)
 at 
 org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52)
 at 
 org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
 at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1537)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1304)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1116)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:940)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:930)
 at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:198)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:408)
 at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
 at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
 at 

[jira] [Commented] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]

2014-10-10 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167997#comment-14167997
 ] 

Suhas Satish commented on HIVE-7551:


Xuefu is right; feel free to work on this, ~Chengxiang.

 expand spark accumulator  to support hive counter [Spark Branch]
 

 Key: HIVE-7551
 URL: https://issues.apache.org/jira/browse/HIVE-7551
 Project: Hive
  Issue Type: New Feature
  Components: Spark
Reporter: Chengxiang Li
Assignee: Chengxiang Li
  Labels: Spark-M3

 hive collect some operator statistic information through counter, we need to 
 support MR/Tez counter counterpart through spark accumulator.
 NO PRECOMMIT TESTS. This is for spark branch only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8243) clone SparkWork for join optimization

2014-09-23 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8243:
--

 Summary: clone SparkWork for join optimization
 Key: HIVE-8243
 URL: https://issues.apache.org/jira/browse/HIVE-8243
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Suhas Satish


Map-join optimization needs to clone the SparkWork containing the operator tree 
to make changes to it. For MapredWork, this is done thru kryo 
serialization/deserialization in 
https://issues.apache.org/jira/browse/HIVE-5263

Something similar should be done for SparkWork
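
For reference, a kryo serialize/deserialize round trip of the kind HIVE-5263 
uses for MapredWork looks roughly like the sketch below. It assumes a plain 
Kryo 2.x setup; Hive's actual serializer configuration and registrations 
differ, and the class name is invented.

{code}
// Reference sketch of a kryo serialize/deserialize deep copy, along the
// lines of HIVE-5263 for MapredWork. Plain Kryo 2.x setup assumed.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class PlanCloneSketch {
  public static <T> T deepCopy(Kryo kryo, T work, Class<T> clazz) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    Output out = new Output(bos);
    kryo.writeObject(out, work);          // serialize the whole operator tree
    out.close();
    Input in = new Input(new ByteArrayInputStream(bos.toByteArray()));
    T copy = kryo.readObject(in, clazz);  // deserialize into a fresh copy
    in.close();
    return copy;
  }
}
{code}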




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8243) clone SparkWork for join optimization

2014-09-23 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8243:
---
Issue Type: Sub-task  (was: Bug)
Parent: HIVE-7292

 clone SparkWork for join optimization
 -

 Key: HIVE-8243
 URL: https://issues.apache.org/jira/browse/HIVE-8243
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
  Labels: https://issues.apache.org/jira/browse/HIVE-5263

 Map-join optimization needs to clone the SparkWork containing the operator 
 tree to make changes to it. For MapredWork, this is done thru kryo 
 serialization/deserialization in 
 https://issues.apache.org/jira/browse/HIVE-5263
 Something similar should be done for SparkWork



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8243) clone SparkWork for join optimization

2014-09-23 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14145705#comment-14145705
 ] 

Suhas Satish commented on HIVE-8243:


Cloning via kryo.copy(), as suggested for MapredWork in 
https://issues.apache.org/jira/browse/HIVE-4396, 
may be a good approach here. 
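
A minimal sketch of that kryo.copy() route, which skips the byte-array round 
trip entirely; it assumes the Kryo instance already carries whatever 
registrations the plan objects need, and the class name is invented.

{code}
// Minimal sketch of the kryo.copy() route: a deep copy of the object graph
// without an intermediate serialize/deserialize step.
import com.esotericsoftware.kryo.Kryo;

public class CopyViaKryoSketch {
  public static <T> T deepCopy(Kryo kryo, T work) {
    return kryo.copy(work);  // deep copy via Kryo's copy support
  }
}
{code}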

 clone SparkWork for join optimization
 -

 Key: HIVE-8243
 URL: https://issues.apache.org/jira/browse/HIVE-8243
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Suhas Satish
  Labels: https://issues.apache.org/jira/browse/HIVE-5263

 Map-join optimization needs to clone the SparkWork containing the operator 
 tree to make changes to it. For MapredWork, this is done thru kryo 
 serialization/deserialization in 
 https://issues.apache.org/jira/browse/HIVE-5263
 Something similar should be done for SparkWork



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]

2014-09-18 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138567#comment-14138567
 ] 

Suhas Satish commented on HIVE-7613:


Hi Xuefu, 
that's a good idea. I was thinking along the lines of calling SparkContext's 
addFile method in each of the N-1 spark jobs in HashTableSinkOperator.java to 
write the hash tables as files and then read them in the map-only join job in 
MapJoinOperator. But that doesn't involve RDDs. 
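
A minimal sketch of that addFile idea, purely to make the mechanics concrete 
(file naming and hash-table serialization are hand-waved; this is not what 
any posted patch does):

{code}
// Minimal sketch of the addFile idea; all names here are illustrative.
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class HashTableShippingSketch {
  // In each small-table job, after dumping its hash table to 'localPath':
  public static void ship(JavaSparkContext sc, String localPath) {
    sc.addFile(localPath);  // Spark copies the file to every executor
  }

  // In the map-only join job, inside the task, resolve the local copy:
  public static String locate(String fileName) {
    return SparkFiles.get(fileName);  // executor-local path to the shipped file
  }
}
{code}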

 Research optimization of auto convert join to map join [Spark branch]
 -

 Key: HIVE-7613
 URL: https://issues.apache.org/jira/browse/HIVE-7613
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Chengxiang Li
Assignee: Suhas Satish
Priority: Minor
 Attachments: HIve on Spark Map join background.docx


 ConvertJoinMapJoin is an optimization the replaces a common join(aka shuffle 
 join) with a map join(aka broadcast or fragment replicate join) when 
 possible. we need to research how to make it workable with Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]

2014-09-18 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139611#comment-14139611
 ] 

Suhas Satish commented on HIVE-7613:


{{ConvertJoinMapJoin}} heavily uses {{OptimizeTezProcContext}}. Although we do 
have an equivalent {{OptimizeSparkProcContext}}, the 2 are not derived from any 
common ancestor class. We will need some class hierarchy redesign/refactoring 
to make {{ConvertJoinMapJoin}} more generic so it can support multiple 
execution frameworks. 

For now, I am thinking of proceeding with a cloned {{SparkConvertJoinMapJoin}} 
class using {{OptimizeSparkProcContext}}. 
We might need to open a jira for this refactoring; a sketch of the direction 
follows below.
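
To make the refactoring direction concrete, the common ancestor could be as 
small as the interface below. Every name here is invented for this sketch, 
not an actual Hive class or method.

{code}
// Illustrative only: a minimal shared contract that both
// OptimizeTezProcContext and OptimizeSparkProcContext could implement, so a
// generic ConvertJoinMapJoin would not need an engine-specific context.
public interface JoinOptimizeContextSketch {
  boolean isAutoConvertJoinEnabled();  // hive.auto.convert.join
  long getSmallTableSizeThreshold();   // max bytes allowed on the broadcast side
}
{code}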


 Research optimization of auto convert join to map join [Spark branch]
 -

 Key: HIVE-7613
 URL: https://issues.apache.org/jira/browse/HIVE-7613
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Chengxiang Li
Assignee: Suhas Satish
Priority: Minor
 Attachments: HIve on Spark Map join background.docx


 ConvertJoinMapJoin is an optimization the replaces a common join(aka shuffle 
 join) with a map join(aka broadcast or fragment replicate join) when 
 possible. we need to research how to make it workable with Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks

2014-09-18 Thread Suhas Satish (JIRA)
Suhas Satish created HIVE-8183:
--

 Summary: make ConvertJoinMapJoin optimization pluggable for 
different execution frameworks
 Key: HIVE-8183
 URL: https://issues.apache.org/jira/browse/HIVE-8183
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 0.13.1, 0.14.0, spark-branch
Reporter: Suhas Satish


Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
OptimizeTezProcContext . Although we do have an equivalent 
OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. 
We will need some class hierarchy redesign/refactoring to make 
ConvertJoinMapJoin be more generic to support multiple execution frameworks.
For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin 
class using OptimizeSparkProcContext
We might need to open a jira for this refactoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks

2014-09-18 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8183:
---
Description: 
Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
OptimizeTezProcContext . Although we do have an equivalent 
OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. 
We will need some class hierarchy redesign/refactoring to make 
ConvertJoinMapJoin be more generic to support multiple execution frameworks .



  was:
Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
OptimizeTezProcContext . Although we do have an equivalent 
OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. 
We will need some class hierarchy redesign/refactoring to make 
ConvertJoinMapJoin be more generic to support multiple execution frameworks.
For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin 
class using OptimizeSparkProcContext



 make ConvertJoinMapJoin optimization pluggable for different execution 
 frameworks
 -

 Key: HIVE-8183
 URL: https://issues.apache.org/jira/browse/HIVE-8183
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 0.14.0, 0.13.1, spark-branch
Reporter: Suhas Satish
  Labels: spark

 Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
 OptimizeTezProcContext . Although we do have an equivalent 
 OptimizeSparkProcContext, the 2 are not derived from any common ancestor 
 class. We will need some class hierarchy redesign/refactoring to make 
 ConvertJoinMapJoin be more generic to support multiple execution frameworks .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8183) make ConvertJoinMapJoin optimization pluggable for different execution frameworks

2014-09-18 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-8183:
---
Description: 
Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
OptimizeTezProcContext . Although we do have an equivalent 
OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. 
We will need some class hierarchy redesign/refactoring to make 
ConvertJoinMapJoin be more generic to support multiple execution frameworks.
For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin 
class using OptimizeSparkProcContext


  was:
Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
OptimizeTezProcContext . Although we do have an equivalent 
OptimizeSparkProcContext, the 2 are not derived from any common ancestor class. 
We will need some class hierarchy redesign/refactoring to make 
ConvertJoinMapJoin be more generic to support multiple execution frameworks.
For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin 
class using OptimizeSparkProcContext
We might need to open a jira for this refactoring.


 make ConvertJoinMapJoin optimization pluggable for different execution 
 frameworks
 -

 Key: HIVE-8183
 URL: https://issues.apache.org/jira/browse/HIVE-8183
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 0.14.0, 0.13.1, spark-branch
Reporter: Suhas Satish
  Labels: spark

 Originally introduced for Tez, ConvertJoinMapJoin heavily uses 
 OptimizeTezProcContext . Although we do have an equivalent 
 OptimizeSparkProcContext, the 2 are not derived from any common ancestor 
 class. We will need some class hierarchy redesign/refactoring to make 
 ConvertJoinMapJoin be more generic to support multiple execution frameworks.
 For now, I am thinking of proceeding with a cloned SparkConvertJoinMapJoin 
 class using OptimizeSparkProcContext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]

2014-09-03 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120697#comment-14120697
 ] 

Suhas Satish commented on HIVE-7613:


As a part of this work, we should also enable auto_sortmerge_join_1.q, which 
currently fails with 

{code:title=auto_sortmerge_join_1.stackTrace|borderStyle=solid}
2014-09-03 16:12:59,607 ERROR [main]: spark.SparkClient 
(SparkClient.java:execute(166)) - Error executing Spark Plan
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 
1, localhost): java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
processing row {key:0,value:val_0,ds:2008-04-08}

org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:151)

org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:47)

org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:28)

org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:99)

scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1177)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1166)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1165)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1165)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1383)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

{code}

 Research optimization of auto convert join to map join [Spark branch]
 -

 Key: HIVE-7613
 URL: https://issues.apache.org/jira/browse/HIVE-7613
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Chengxiang Li
Assignee: Szehon Ho
Priority: Minor
 Attachments: HIve on Spark Map join background.docx


 ConvertJoinMapJoin is an optimization the replaces a common join(aka shuffle 
 join) with a map join(aka broadcast or fragment replicate join) when 
 possible. we need to research how to make it workable with Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7952) Investigate query failures (1)

2014-09-03 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120706#comment-14120706
 ] 

Suhas Satish commented on HIVE-7952:


auto_sortmerge_join_1 and auto_sortmerge_join_13 are covered under the 
existing jira on map join, and the stack trace from the test failure is 
listed here - 
https://issues.apache.org/jira/browse/HIVE-7613

 Investigate query failures (1)
 --

 Key: HIVE-7952
 URL: https://issues.apache.org/jira/browse/HIVE-7952
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish

 I ran all q-file tests and the following failed with an exception:
 http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-SPARK-ALL-TESTS-Build/lastCompletedBuild/testReport/
 we don't necessarily want to run all these tests as part of the Spark tests, 
 but we should understand why they failed with an exception. This JIRA is to 
 look into these failures and document them with one of:
 * New JIRA
 * Covered under existing JIRA
 * More investigation required
 Tests:
 {noformat}
  
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_13    2.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_tez_fsstat    1.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynpart_sort_opt_vectorization    5.3 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_14    6.3 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_using    0.34 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_create_func1    0.96 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample_islocalmode_hook    11 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_set_show_current_role    1.4 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_owner_actions_db    0.42 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_8    5.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_lock2    1.8 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_1_sql_std    2.7 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_exim_19_part_external_location    3.9 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats_empty_partition    0.67 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant1    3.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant2    2.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_show_grant    3.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_smb_mapjoin_14    2.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query1    0.93 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query4    0.26 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1    10 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_7
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7869) Build long running HS2 test framework

2014-09-02 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118721#comment-14118721
 ] 

Suhas Satish commented on HIVE-7869:


Thanks, Brock. I will continue to add queries to this. 

 Build long running HS2 test framework
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Fix For: 0.14.0

 Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.
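
 A minimal sketch of such a client, assuming the standard Hive JDBC driver and 
 a local HS2 on the default port; the URL, credentials, table layout, and query 
 are placeholders, not a committed design.
 {code}
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.Statement;

 // Long-running HS2 client sketch: create N tables, then loop over simple
 // queries until the process is killed externally.
 public class LongRunningHs2Client {
     public static void main(String[] args) throws Exception {
         // JDBC 4 drivers self-register, but loading explicitly is harmless.
         Class.forName("org.apache.hive.jdbc.HiveDriver");
         String url = "jdbc:hive2://localhost:10000/default"; // placeholder host/port

         try (Connection conn = DriverManager.getConnection(url, "hive", "");
              Statement stmt = conn.createStatement()) {
             // Create a bunch of tables (say 100).
             for (int i = 0; i < 100; i++) {
                 stmt.execute("CREATE TABLE IF NOT EXISTS longrun_t" + i
                     + " (key INT, value STRING)");
             }
             // Run queries round-robin until the client is killed.
             for (int i = 0; ; i = (i + 1) % 100) {
                 try (ResultSet rs = stmt.executeQuery(
                         "SELECT COUNT(*) FROM longrun_t" + i)) {
                     while (rs.next()) {
                         System.out.println("longrun_t" + i + ": " + rs.getLong(1));
                     }
                 }
             }
         }
     }
 }
 {code}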



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-7952) Investigate query failures (1)

2014-09-02 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish reassigned HIVE-7952:
--

Assignee: Suhas Satish

 Investigate query failures (1)
 --

 Key: HIVE-7952
 URL: https://issues.apache.org/jira/browse/HIVE-7952
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish

 I ran all q-file tests and the following failed with an exception:
 http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-SPARK-ALL-TESTS-Build/lastCompletedBuild/testReport/
 we don't necessarily want to run all these tests as part of the Spark tests, 
 but we should understand why they failed with an exception. This JIRA is to 
 look into these failures and document them with one of:
 * New JIRA
 * Covered under existing JIRA
 * More investigation required
 Tests:
 {noformat}
  
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_13    2.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_tez_fsstat    1.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dynpart_sort_opt_vectorization    5.3 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_14    6.3 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_udf_using    0.34 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_create_func1    0.96 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_sample_islocalmode_hook    11 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_set_show_current_role    1.4 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_owner_actions_db    0.42 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_8    5.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_lock2    1.8 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_1_sql_std    2.7 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_exim_19_part_external_location    3.9 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats_empty_partition    0.67 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant1    3.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_role_grant2    2.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_show_grant    3.5 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_smb_mapjoin_14    2.6 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query1    0.93 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_dbtxnmgr_query4    0.26 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_1    10 sec    2
 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_7
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]

2014-08-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115369#comment-14115369
 ] 

Suhas Satish commented on HIVE-7551:


Assigning to myself after talking to Na. Is this for milestone Spark-M3, as the 
dependent JIRAs are labeled?

 expand spark accumulator  to support hive counter [Spark Branch]
 

 Key: HIVE-7551
 URL: https://issues.apache.org/jira/browse/HIVE-7551
 Project: Hive
  Issue Type: New Feature
  Components: Spark
Reporter: Chengxiang Li
Assignee: Na Yang

 Hive collects some operator statistics through counters; we need to support an 
 MR/Tez counter counterpart through Spark accumulators.
 NO PRECOMMIT TESTS. This is for spark branch only.
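
 For illustration, counter-style bookkeeping with the Spark 1.x Java 
 accumulator API of the era might look like the sketch below; the "rows 
 processed" metric and the RDD contents are hypothetical, not Hive's actual 
 counter wiring.
 {code}
 import java.util.Arrays;

 import org.apache.spark.Accumulator;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 // Each executor task increments the accumulator as rows pass through;
 // the driver reads the aggregated value after the action completes.
 public class AccumulatorCounterSketch {
     public static void main(String[] args) {
         SparkConf conf = new SparkConf().setAppName("counter-sketch").setMaster("local");
         JavaSparkContext sc = new JavaSparkContext(conf);

         Accumulator<Integer> rowsProcessed = sc.accumulator(0);

         JavaRDD<String> rows = sc.parallelize(Arrays.asList("a", "b", "c"));
         rows.foreach(row -> rowsProcessed.add(1)); // executor side: one tick per row

         // Driver side: the value is only reliable once the action has finished.
         System.out.println("rows processed = " + rowsProcessed.value());
         sc.stop();
     }
 }
 {code}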



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (HIVE-7551) expand spark accumulator to support hive counter [Spark Branch]

2014-08-29 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish reassigned HIVE-7551:
--

Assignee: Suhas Satish  (was: Na Yang)

 expand spark accumulator  to support hive counter [Spark Branch]
 

 Key: HIVE-7551
 URL: https://issues.apache.org/jira/browse/HIVE-7551
 Project: Hive
  Issue Type: New Feature
  Components: Spark
Reporter: Chengxiang Li
Assignee: Suhas Satish

 Hive collects some operator statistics through counters; we need to support an 
 MR/Tez counter counterpart through Spark accumulators.
 NO PRECOMMIT TESTS. This is for spark branch only.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7775) enable sample8.q.[Spark Branch]

2014-08-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115380#comment-14115380
 ] 

Suhas Satish commented on HIVE-7775:


What kind of join did Szehon enable? Does Hive on Spark support full outer 
join? 

 enable sample8.q.[Spark Branch]
 ---

 Key: HIVE-7775
 URL: https://issues.apache.org/jira/browse/HIVE-7775
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Chengxiang Li
Assignee: Chengxiang Li
 Fix For: spark-branch

 Attachments: HIVE-7775.1-spark.patch, HIVE-7775.2-spark.patch, 
 HIVE-7775.3-spark.additional.patch


 sample8.q contains a join query; this qtest should be enabled after Hive on 
 Spark supports the join operation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-29 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7869:
---

Attachment: HIVE-7869.2-spark.patch

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-29 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115909#comment-14115909
 ] 

Suhas Satish commented on HIVE-7869:


Addressed Review Board comments.

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch, HIVE-7869.2-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-28 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7869:
---

Status: Patch Available  (was: Open)

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-28 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7869:
---

Attachment: HIVE-7869-spark.patch

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-28 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114665#comment-14114665
 ] 

Suhas Satish commented on HIVE-7869:


https://reviews.apache.org/r/25177/

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-28 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114696#comment-14114696
 ] 

Suhas Satish commented on HIVE-7869:


Test failures are not related to the patch. 

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7869-spark.patch


 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (HIVE-7869) Long running tests (1) [Spark Branch]

2014-08-25 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish reassigned HIVE-7869:
--

Assignee: Suhas Satish

 Long running tests (1) [Spark Branch]
 -

 Key: HIVE-7869
 URL: https://issues.apache.org/jira/browse/HIVE-7869
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish

 I have noticed when running the full test suite locally that the test JVM 
 eventually crashes. We should do some testing (not part of the unit tests) 
 which starts up an HS2 and runs queries on it continuously for 24 hours or so.
 In this JIRA let's create a stand-alone Java program which connects to an HS2 
 over JDBC, creates a bunch of tables (say 100) and then runs queries until 
 the JDBC client is killed. This will allow us to run long-running tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Status: Patch Available  (was: Open)

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: HIVE-7821.patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107319#comment-14107319
 ] 

Suhas Satish commented on HIVE-7821:


groupby4 has a deterministic order, so the output ordering when run on Spark is 
the same across test runs, but may not match the order in q.out from the 
corresponding test run on map-reduce.  

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: HIVE-7821-spark.2.patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.2.patch, HIVE-7821-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107376#comment-14107376
 ] 

Suhas Satish commented on HIVE-7821:


Attached updated patch generated with git diff --no-prefix

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.2.patch, HIVE-7821-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: (was: HIVE-7821-spark.2.patch)

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: HIVE-7821.2-spark.patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.2-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: HIVE-7821.3-spark.patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: (was: HIVE-7821.2-spark.patch)

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107639#comment-14107639
 ] 

Suhas Satish commented on HIVE-7821:


rebasing patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish updated HIVE-7821:
---

Attachment: HIVE-7821.4-spark.patch

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch, 
 HIVE-7821.4-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-22 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107795#comment-14107795
 ] 

Suhas Satish commented on HIVE-7821:


4 of the 5 test failures are unrelated to SparkCliDriver. 

The 1 relevant failure, groupby4.q.out, had a SORT_BEFORE_DIFF from an 
experimental run. 
Attaching a clean one without it. 

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish
 Attachments: HIVE-7821-spark.patch, HIVE-7821.3-spark.patch, 
 HIVE-7821.4-spark.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (HIVE-7821) StarterProject: enable groupby4.q

2014-08-21 Thread Suhas Satish (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suhas Satish reassigned HIVE-7821:
--

Assignee: Suhas Satish  (was: Chinna Rao Lalam)

 StarterProject: enable groupby4.q
 -

 Key: HIVE-7821
 URL: https://issues.apache.org/jira/browse/HIVE-7821
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Brock Noland
Assignee: Suhas Satish





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-5351) Secure-Socket-Layer (SSL) support for HiveServer2

2014-07-02 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050663#comment-14050663
 ] 

Suhas Satish commented on HIVE-5351:


I have used the 3 properties above and started my HiveServer2, which is now 
using SSL. 

But how do I connect to it from the Beeline client? There doesn't seem to be any 
information about it. 

I am trying to use something like this - 

!connect 
jdbc:hive2://127.0.0.1:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;

but when it prompts for username and password, it fails to connect even after I 
enter the correct ssl_truststore password.

Enter username for 
jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123:
 mapr
Enter password for 
jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123:
 
Error: Invalid URL: 
jdbc:hive2://10.10.30.181:1/default;ssl=true;sslTrustStore=/opt/mapr/conf/ssl_truststore;sslTrustStorePassword=mapr123
 (state=08S01,code=0)

Is my JDBC connect string the right way to connect?
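
For comparison, a minimal plain-JDBC sketch of the same SSL connection is below; 
the host, port, truststore path, and passwords are placeholders. Note that the 
username/password passed to the driver are the HS2 credentials, not the 
truststore password.

{code}
import java.sql.Connection;
import java.sql.DriverManager;

// Connect to an SSL-enabled HiveServer2 with the truststore given as
// JDBC URL parameters. All concrete values below are placeholders.
public class SslJdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default"
            + ";ssl=true"
            + ";sslTrustStore=/path/to/ssl_truststore"
            + ";sslTrustStorePassword=changeit";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "hivepass")) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}
{code}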



 Secure-Socket-Layer (SSL) support for HiveServer2
 -

 Key: HIVE-5351
 URL: https://issues.apache.org/jira/browse/HIVE-5351
 Project: Hive
  Issue Type: Improvement
  Components: Authorization, HiveServer2, JDBC
Affects Versions: 0.11.0, 0.12.0
Reporter: Prasad Mujumdar
Assignee: Prasad Mujumdar
 Fix For: 0.13.0

 Attachments: HIVE-5301.test-binary-files.tar, HIVE-5351.3.patch, 
 HIVE-5351.5.patch


 HiveServer2 and JDBC driver should support encrypted communication using SSL



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-4629) HS2 should support an API to retrieve query logs

2014-03-10 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13926260#comment-13926260
 ] 

Suhas Satish commented on HIVE-4629:


It would be great to have this JIRA accepted into Hive trunk; I have been 
waiting on this for a long time. 

 HS2 should support an API to retrieve query logs
 

 Key: HIVE-4629
 URL: https://issues.apache.org/jira/browse/HIVE-4629
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: HIVE-4629-no_thrift.1.patch, HIVE-4629.1.patch, 
 HIVE-4629.2.patch


 HiveServer2 should support an API to retrieve query logs. This is 
 particularly relevant because HiveServer2 supports async execution but 
 doesn't provide a way to report progress. Providing an API to retrieve query 
 logs will help report progress to the client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)