[jira] [Commented] (HIVE-8036) PTest SSH Options

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130044#comment-14130044
 ] 

Xuefu Zhang commented on HIVE-8036:
---

+1, looks good to me.

> PTest SSH Options
> -
>
> Key: HIVE-8036
> URL: https://issues.apache.org/jira/browse/HIVE-8036
> Project: Hive
>  Issue Type: Improvement
>Reporter: Brock Noland
>Assignee: Brock Noland
> Attachments: HIVE-8036.patch
>
>
> I'd like to be able to specify the following options:
> {noformat}
> StrictHostKeyChecking no
> ConnectionAttempts 3
> ServerAliveInterval 1
> {noformat}
> as a config param in the ptest config file as opposed to depending on them 
> set in the env.
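These are standard OpenSSH client options; whichever way the ptest config ends
up exposing them, they amount to the same thing as passing -o flags on the ssh
command line (the host name below is a placeholder):
{noformat}
ssh -o StrictHostKeyChecking=no -o ConnectionAttempts=3 -o ServerAliveInterval=1 some-slave-host
{noformat}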



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8041) Hadoop-2 build is broken with JDK6

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130047#comment-14130047
 ] 

Xuefu Zhang commented on HIVE-8041:
---

I saw this with Oracle's JDK6 on Ubuntu.

> Hadoop-2 build is broken with JDK6
> --
>
> Key: HIVE-8041
> URL: https://issues.apache.org/jira/browse/HIVE-8041
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 0.14.0
>Reporter: Xuefu Zhang
>
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hive-exec: Compilation failure
> [ERROR] 
> /home/xzhang/apache/hive7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIf.java:[81,1]
>  illegal start of expression
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-11 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8054:
-

 Summary: Disable hive.optimize.union.remove when 
hive.execution.engine=spark [Spark Branch]
 Key: HIVE-8054
 URL: https://issues.apache.org/jira/browse/HIVE-8054
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Reporter: Xuefu Zhang


Option hive.optimize.union.remove, introduced in HIVE-3276, removes union 
operators from the operator graph in certain cases as an optimization to reduce 
the number of MR jobs. While this makes sense for MR, the optimization is 
actually harmful to an execution engine such as Spark, which natively supports 
union without requiring additional jobs. Removing the union operator creates 
disjoint operator graphs, each of which generates a job, so the optimization 
ends up requiring more jobs to run the query, not to mention the additional 
complexity of handling linked FS descriptors.

I propose that we disable such optimization when the execution engine is Spark.
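As a rough sketch of the proposed guard (property lookups are by name, and 
where the check would live in the compiler/optimizer is illustrative, not the 
actual patch):
{code}
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch only: union removal should apply only when the engine is not Spark.
public class UnionRemoveGuard {
  public static boolean unionRemoveApplies(HiveConf conf) {
    boolean removeUnions = conf.getBoolean("hive.optimize.union.remove", false);
    String engine = conf.get("hive.execution.engine", "mr");
    return removeUnions && !"spark".equalsIgnoreCase(engine);
  }
}
{code}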



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8055) Code cleanup after HIVE-8054 [Spark Branch]

2014-09-11 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8055:
-

 Summary: Code cleanup after HIVE-8054 [Spark Branch]
 Key: HIVE-8055
 URL: https://issues.apache.org/jira/browse/HIVE-8055
 Project: Hive
  Issue Type: Task
  Components: Spark
Reporter: Xuefu Zhang


There is quite some code handling union removal optimization in SparkCompiler 
and related classes. We need to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8041) Hadoop-2 build is broken with JDK6

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131053#comment-14131053
 ] 

Xuefu Zhang commented on HIVE-8041:
---

+1. Build passes with the patch. Thanks, Navis.

> Hadoop-2 build is broken with JDK6
> --
>
> Key: HIVE-8041
> URL: https://issues.apache.org/jira/browse/HIVE-8041
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 0.14.0
>Reporter: Xuefu Zhang
> Attachments: HIVE-8041.1.patch.txt
>
>
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hive-exec: Compilation failure
> [ERROR] 
> /home/xzhang/apache/hive7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIf.java:[81,1]
>  illegal start of expression
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-8041) Hadoop-2 build is broken with JDK6

2014-09-11 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-8041:
-

Assignee: Navis

> Hadoop-2 build is broken with JDK6
> --
>
> Key: HIVE-8041
> URL: https://issues.apache.org/jira/browse/HIVE-8041
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 0.14.0
>Reporter: Xuefu Zhang
>Assignee: Navis
> Attachments: HIVE-8041.1.patch.txt
>
>
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hive-exec: Compilation failure
> [ERROR] 
> /home/xzhang/apache/hive7/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIf.java:[81,1]
>  illegal start of expression
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8040) Commit for HIVE-7925 breaks hadoop-1 build

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131061#comment-14131061
 ] 

Xuefu Zhang commented on HIVE-8040:
---

The build still seems broken even after this commit, with a different error:

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project hive-metastore: Compilation failure: 
Compilation failure:
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[14,30]
 cannot find symbol
[ERROR] symbol  : class ExitUtil
[ERROR] location: package org.apache.hadoop.util
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[55,25]
 package ExitUtil does not exist
{code}

> Commit for HIVE-7925 breaks hadoop-1 build
> --
>
> Key: HIVE-8040
> URL: https://issues.apache.org/jira/browse/HIVE-8040
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 0.14.0
>Reporter: Xuefu Zhang
> Attachments: HIVE-8040.1.patch.txt
>
>
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hive-metastore: Compilation failure
> [ERROR] 
> /home/xzhang/apache/hive7/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:[45,37]
>  package org.apache.commons.math3.stat does not exist
> [ERROR] -> [Help 1]
> {code}
> Missing pom file changes maybe?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-8040) Commit for HIVE-7925 breaks hadoop-1 build

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131061#comment-14131061
 ] 

Xuefu Zhang edited comment on HIVE-8040 at 9/12/14 3:25 AM:


The build still seems broken even after this commit, with a different error 
(caused by HIVE-7223 maybe):

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project hive-metastore: Compilation failure: 
Compilation failure:
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[14,30]
 cannot find symbol
[ERROR] symbol  : class ExitUtil
[ERROR] location: package org.apache.hadoop.util
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[55,25]
 package ExitUtil does not exist
{code}


was (Author: xuefuz):
Build seems still broken even after this commit, with a different error:

{code}
ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project hive-metastore: Compilation failure: 
Compilation failure:
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[14,30]
 cannot find symbol
[ERROR] symbol  : class ExitUtil
[ERROR] location: package org.apache.hadoop.util
[ERROR] 
/home/xzhang/apache/hive7/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStorePartitionSpecs.java:[55,25]
 package ExitUtil does not exist
{code}

> Commit for HIVE-7925 breaks hadoop-1 build
> --
>
> Key: HIVE-8040
> URL: https://issues.apache.org/jira/browse/HIVE-8040
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 0.14.0
>Reporter: Xuefu Zhang
> Attachments: HIVE-8040.1.patch.txt
>
>
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hive-metastore: Compilation failure
> [ERROR] 
> /home/xzhang/apache/hive7/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:[45,37]
>  package org.apache.commons.math3.stat does not exist
> [ERROR] -> [Help 1]
> {code}
> Missing pom file changes maybe?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8017) Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch]

2014-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131121#comment-14131121
 ] 

Xuefu Zhang commented on HIVE-8017:
---

[~ruili] I think it might be better to update union_remove_25.q, so we will see 
one less failure every time the test runs. What do you think?

> Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark 
> Branch]
> ---
>
> Key: HIVE-8017
> URL: https://issues.apache.org/jira/browse/HIVE-8017
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-8017-spark.patch, HIVE-8017.2-spark.patch, 
> HIVE-8017.3-spark.patch, HIVE-8017.4-spark.patch
>
>
> HiveKey should be used as the key type because it holds the hash code for 
> partitioning. While BytesWritable serves partitioning well for simple cases, 
> we have to use {{HiveKey.hashCode}} for more complicated ones, e.g. join, 
> bucketed table, etc.
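For illustration, a partitioner that relies on the cached hash could look 
roughly like the sketch below (class and field names are illustrative, not the 
actual spark-branch code):
{code}
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.spark.Partitioner;

// Illustrative only: partition records by the hash HiveKey carries, which Hive
// computes from the partitioning columns (e.g. for joins and bucketed tables).
public class HiveKeyHashPartitioner extends Partitioner {
  private final int numPartitions;

  public HiveKeyHashPartitioner(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  @Override
  public int numPartitions() {
    return numPartitions;
  }

  @Override
  public int getPartition(Object key) {
    // Mask the sign bit so the result is a valid partition index.
    return (((HiveKey) key).hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
{code}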



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8017) Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch]

2014-09-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131603#comment-14131603
 ] 

Xuefu Zhang commented on HIVE-8017:
---

{quote}
do you think we need a JIRA to track this difference so we can find the cause 
when we have time
{quote}

Yes, please.

> Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark 
> Branch]
> ---
>
> Key: HIVE-8017
> URL: https://issues.apache.org/jira/browse/HIVE-8017
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-8017-spark.patch, HIVE-8017.2-spark.patch, 
> HIVE-8017.3-spark.patch, HIVE-8017.4-spark.patch, HIVE-8017.5-spark.patch
>
>
> HiveKey should be used as the key type because it holds the hash code for 
> partitioning. While BytesWritable serves partitioning well for simple cases, 
> we have to use {{HiveKey.hashCode}} for more complicated ones, e.g. join, 
> bucketed table, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-8017) Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch]

2014-09-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131603#comment-14131603
 ] 

Xuefu Zhang edited comment on HIVE-8017 at 9/12/14 2:30 PM:


{quote}
do you think we need a JIRA to track this difference so we can find the cause 
when we have time
{quote}

Yes, please.

I will commit this patch shortly.


was (Author: xuefuz):
{quote}
do you think we need a JIRA to track this difference so we can find the cause 
when we have time
{quote}

Yes, please.

> Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark 
> Branch]
> ---
>
> Key: HIVE-8017
> URL: https://issues.apache.org/jira/browse/HIVE-8017
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-8017-spark.patch, HIVE-8017.2-spark.patch, 
> HIVE-8017.3-spark.patch, HIVE-8017.4-spark.patch, HIVE-8017.5-spark.patch
>
>
> HiveKey should be used as the key type because it holds the hash code for 
> partitioning. While BytesWritable serves partitioning well for simple cases, 
> we have to use {{HiveKey.hashCode}} for more complicated ones, e.g. join, 
> bucketed table, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8017) Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch]

2014-09-12 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8017:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to spark branch. Thanks to Rui for the contribution.

> Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark 
> Branch]
> ---
>
> Key: HIVE-8017
> URL: https://issues.apache.org/jira/browse/HIVE-8017
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Fix For: spark-branch
>
> Attachments: HIVE-8017-spark.patch, HIVE-8017.2-spark.patch, 
> HIVE-8017.3-spark.patch, HIVE-8017.4-spark.patch, HIVE-8017.5-spark.patch
>
>
> HiveKey should be used as the key type because it holds the hash code for 
> partitioning. While BytesWritable serves partitioning well for simple cases, 
> we have to use {{HiveKey.hashCode}} for more complicated ones, e.g. join, 
> bucketed table, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8017) Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch]

2014-09-12 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8017:
--
Labels: Spark-M1  (was: )

> Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark 
> Branch]
> ---
>
> Key: HIVE-8017
> URL: https://issues.apache.org/jira/browse/HIVE-8017
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8017-spark.patch, HIVE-8017.2-spark.patch, 
> HIVE-8017.3-spark.patch, HIVE-8017.4-spark.patch, HIVE-8017.5-spark.patch
>
>
> HiveKey should be used as the key type because it holds the hash code for 
> partitioning. While BytesWritable serves partitioning well for simple cases, 
> we have to use {{HiveKey.hashCode}} for more complicated ones, e.g. join, 
> bucketed table, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8042) Optionally allow move tasks to run in parallel

2014-09-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131721#comment-14131721
 ] 

Xuefu Zhang commented on HIVE-8042:
---

HiveConf.java actually doesn't say anything about task types. It uses the general 
term "job", which still seems fine even after this patch. The wiki can of course 
supply more info.

> Optionally allow move tasks to run in parallel
> --
>
> Key: HIVE-8042
> URL: https://issues.apache.org/jira/browse/HIVE-8042
> Project: Hive
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Fix For: 0.14.0
>
> Attachments: HIVE-8042.1.patch, HIVE-8042.2.patch, HIVE-8042.3.patch
>
>
> hive.exec.parallel allows one to run different stages of a query in parallel. 
> However that applies only to map-reduce tasks. When using large multi insert 
> queries there are many MoveTasks that are all executed in sequence on the 
> client. There's no real reason for that - they could be run in parallel as 
> well (i.e.: the stage graph captures the dependencies and knows which tasks 
> can happen in parallel).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8073) Go thru all operator plan optimizations and disable those that are not suitable for Spark [Spark Branch]

2014-09-12 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8073:
-

 Summary: Go thru all operator plan optimizations and disable those 
that are not suitable for Spark [Spark Branch]
 Key: HIVE-8073
 URL: https://issues.apache.org/jira/browse/HIVE-8073
 Project: Hive
  Issue Type: Task
  Components: Spark
Reporter: Xuefu Zhang


I have seen some optimizations done in the logical plan that are not applicable 
to Spark, such as the one in HIVE-8054. We should go through all such 
optimizations and identify any others.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-5690) Support subquery for single sourced multi query

2014-09-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133033#comment-14133033
 ] 

Xuefu Zhang commented on HIVE-5690:
---

Sorry for chiming in late, but I'm not sure if I understand the use case 
correctly. Couldn't the above query be rewritten as
{code}
explain from (select distinct key, value from src) X
insert overwrite table x1 select * order by key
insert overwrite table x2 select * order by value;
{code}

Since a multi-insert inserts from a single source specified by the from-clause, 
allowing a subquery in a later select clause seems to defeat that purpose.


> Support subquery for single sourced multi query
> ---
>
> Key: HIVE-5690
> URL: https://issues.apache.org/jira/browse/HIVE-5690
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Navis
>Assignee: Navis
>Priority: Minor
> Attachments: D13791.1.patch, HIVE-5690.10.patch.txt, 
> HIVE-5690.11.patch.txt, HIVE-5690.12.patch.txt, HIVE-5690.2.patch.txt, 
> HIVE-5690.3.patch.txt, HIVE-5690.4.patch.txt, HIVE-5690.5.patch.txt, 
> HIVE-5690.6.patch.txt, HIVE-5690.7.patch.txt, HIVE-5690.8.patch.txt, 
> HIVE-5690.9.patch.txt
>
>
> Single sourced multi (insert) query is very useful for various ETL processes 
> but it does not allow subqueries included. For example, 
> {noformat}
> explain from src 
> insert overwrite table x1 select * from (select distinct key,value) b order 
> by key
> insert overwrite table x2 select * from (select distinct key,value) c order 
> by value;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-09-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133064#comment-14133064
 ] 

Xuefu Zhang commented on HIVE-7822:
---

[~wangmeng] JIRA is for reporting issues or requesting features. Questions like 
the one you presented are better sent to the user list.

I'm closing this JIRA.

> how to merge two  hive metastores' metadata  stored in different databases 
> (such as mysql)
> --
>
> Key: HIVE-7822
> URL: https://issues.apache.org/jira/browse/HIVE-7822
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> Hi, What is a good way to merge  two hive metadata stored in different 
> databases(such as mysql)?
> Is there any way to get all history Hqls  from metastore?  I think  I need  
> to  run these  Hqls   in  another hive  metadata database  again.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-09-13 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved HIVE-7822.
---
Resolution: Invalid

> how to merge two  hive metastores' metadata  stored in different databases 
> (such as mysql)
> --
>
> Key: HIVE-7822
> URL: https://issues.apache.org/jira/browse/HIVE-7822
> Project: Hive
>  Issue Type: Improvement
>Reporter: wangmeng
>
> Hi, What is a good way to merge  two hive metadata stored in different 
> databases(such as mysql)?
> Is there any way to get all history Hqls  from metastore?  I think  I need  
> to  run these  Hqls   in  another hive  metadata database  again.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-14 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8054:
--
Status: Patch Available  (was: Open)

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8054-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133227#comment-14133227
 ] 

Xuefu Zhang commented on HIVE-8054:
---

Hi [~nyang], thank you very much for working on this. The patch looks good, and 
I just submitted it to let the test run.

+1 pending on test result.

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8054-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133568#comment-14133568
 ] 

Xuefu Zhang commented on HIVE-8054:
---

[~nyang], it looks like some test output needs to be updated. Thanks.

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8054-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-5690) Support subquery for single sourced multi query

2014-09-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133572#comment-14133572
 ] 

Xuefu Zhang commented on HIVE-5690:
---

[~navis], thanks for your explanation. I think my only concern is the complexity 
of the syntax and the possible confusion it might introduce. For instance, what 
if I need to join with another table in the subquery?

At least, we should clearly define the syntax to avoid possible confusion.

> Support subquery for single sourced multi query
> ---
>
> Key: HIVE-5690
> URL: https://issues.apache.org/jira/browse/HIVE-5690
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Navis
>Assignee: Navis
>Priority: Minor
> Attachments: D13791.1.patch, HIVE-5690.10.patch.txt, 
> HIVE-5690.11.patch.txt, HIVE-5690.12.patch.txt, HIVE-5690.13.patch.txt, 
> HIVE-5690.2.patch.txt, HIVE-5690.3.patch.txt, HIVE-5690.4.patch.txt, 
> HIVE-5690.5.patch.txt, HIVE-5690.6.patch.txt, HIVE-5690.7.patch.txt, 
> HIVE-5690.8.patch.txt, HIVE-5690.9.patch.txt
>
>
> Single sourced multi (insert) query is very useful for various ETL processes 
> but it does not allow subqueries included. For example, 
> {noformat}
> explain from src 
> insert overwrite table x1 select * from (select distinct key,value) b order 
> by key
> insert overwrite table x2 select * from (select distinct key,value) c order 
> by value;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8106) Enable vectorization for spark [spark branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134356#comment-14134356
 ] 

Xuefu Zhang commented on HIVE-8106:
---

Hi [~chinnalalam], thanks for working on this. I briefly looked at your patch 
and wanted to let you know that SparkWork.getMapWork() is obsolete. Please 
refer to how Tez gets all of its MapWorks. Also, please take a look at the test 
failures. Thanks.
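For illustration, the Tez-style traversal looks roughly like this, assuming 
SparkWork exposes a getAllWork() graph walk the way TezWork does (the helper 
class here is illustrative, not part of the patch):
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.plan.BaseWork;
import org.apache.hadoop.hive.ql.plan.MapWork;
import org.apache.hadoop.hive.ql.plan.SparkWork;

// Collect every MapWork in the work graph instead of assuming a single one.
public class MapWorkCollector {
  public static List<MapWork> allMapWorks(SparkWork sparkWork) {
    List<MapWork> result = new ArrayList<MapWork>();
    for (BaseWork work : sparkWork.getAllWork()) {
      if (work instanceof MapWork) {
        result.add((MapWork) work);
      }
    }
    return result;
  }
}
{code}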

> Enable vectorization for spark [spark branch]
> -
>
> Key: HIVE-8106
> URL: https://issues.apache.org/jira/browse/HIVE-8106
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Chinna Rao Lalam
>Assignee: Chinna Rao Lalam
> Attachments: HIVE-8106-spark.patch
>
>
> Enable the vectorization optimization on spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134426#comment-14134426
 ] 

Xuefu Zhang commented on HIVE-8054:
---

Hi [~nyang], is the load_dyn_part13 failure related to your patch? It seems to 
have a different test output. Thanks.

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8054-spark.patch, HIVE-8054.2-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8054:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to spark branch. Thanks to Na for the contribution.

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8054-spark.patch, HIVE-8054.2-spark.patch, 
> HIVE-8054.3-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8118:
-

 Summary: SparkMapRecorderHandler and SparkReduceRecordHandler 
should be initialized with multiple result collectors[Spark Branch]
 Key: HIVE-8118
 URL: https://issues.apache.org/jira/browse/HIVE-8118
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang


In the current implementation, both SparkMapRecordHandler and 
SparkReduceRecordHandler take only one result collector, which limits the 
corresponding map or reduce task to a single child. It's very common in 
multi-insert queries for a map/reduce task to have more than one child. A query 
like the following has two map tasks as parents:

{code}
select name, sum(value) from dec group by name union all select name, value 
from dec order by name
{code}

It's possible that in the future an optimization may be implemented so that a 
map work is followed by two reduce works that are then connected to a union 
work.

Thus, we should accommodate this. Tez currently provides a collector for each 
child operator in the map-side or reduce-side operator tree.

Likely this is a big change. With it, we can have a simpler and cleaner 
multi-insert implementation.

This is also the problem observed in HIVE-7731.
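A minimal sketch of the shape this could take (the class, the keying by output 
name, and the wiring are all illustrative, not a concrete design):
{code}
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.mapred.OutputCollector;

// Illustrative only: a record handler that keeps one collector per child work,
// so each branch of the operator tree can emit to its own downstream work.
public class MultiCollectorRecordHandler {
  private final Map<String, OutputCollector<Object, Object>> collectors;

  public MultiCollectorRecordHandler(Map<String, OutputCollector<Object, Object>> collectors) {
    this.collectors = collectors;
  }

  // Route output produced for the branch named 'outputName' to its collector.
  public void collect(String outputName, Object key, Object value) throws IOException {
    collectors.get(outputName).collect(key, value);
  }
}
{code}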



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8118:
--
Labels: Spark-M1  (was: )

> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> 
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible in the future an optimation may be implemented so that a map 
> work is followed by two reduce works and then connected to a union work.
> Thus, we should accommodate this. Tez is currently providing a collector for 
> each child operator in the map-side or reduce side operator tree.
> Likely this is a big change. With this, we can have a simpler and clean 
> multi-insert implementation.
> This is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-8118:
-

Assignee: Venki Korukanti

> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> 
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Venki Korukanti
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible in the future an optimation may be implemented so that a map 
> work is followed by two reduce works and then connected to a union work.
> Thus, we should accommodate this. Tez is currently providing a collector for 
> each child operator in the map-side or reduce side operator tree.
> Likely this is a big change. With this, we can have a simpler and clean 
> multi-insert implementation.
> This is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-15 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8118:
--
Description: 
In the current implementation, both SparkMapRecordHandler and 
SparkReduceRecordHandler take only one result collector, which limits the 
corresponding map or reduce task to a single child. It's very common in 
multi-insert queries for a map/reduce task to have more than one child. A query 
like the following has two map tasks as parents:

{code}
select name, sum(value) from dec group by name union all select name, value 
from dec order by name
{code}

It's possible that in the future an optimization may be implemented so that a 
map work is followed by two reduce works that are then connected to a union 
work.

Thus, we should take this as a general case. Tez currently provides a collector 
for each child operator in the map-side or reduce-side operator tree. We can 
take Tez as a reference.

Likely this is a big change and subtasks are possible. 

With this, we can have a simpler and cleaner multi-insert implementation. This 
is also the problem observed in HIVE-7731.

  was:
In the current implementation, both SparkMapRecordHandler and 
SparkReduceRecorderHandler takes only one result collector, which limits that 
the corresponding map or reduce task can have only one child. It's very comment 
in multi-insert queries where a map/reduce task has more than one children. A 
query like the following has two map tasks as parents:

{code}
select name, sum(value) from dec group by name union all select name, value 
from dec order by name
{code}

It's possible in the future an optimation may be implemented so that a map work 
is followed by two reduce works and then connected to a union work.

Thus, we should accommodate this. Tez is currently providing a collector for 
each child operator in the map-side or reduce side operator tree.

Likely this is a big change. With this, we can have a simpler and clean 
multi-insert implementation.

This is also the problem observed in HIVE-7731.


> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> 
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Venki Korukanti
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible in the future an optimation may be implemented so that a map 
> work is followed by two reduce works and then connected to a union work.
> Thus, we should take this as a general case. Tez is currently providing a 
> collector for each child operator in the map-side or reduce side operator 
> tree. We can take Tez as a reference.
> Likely this is a big change and subtasks are possible. 
> With this, we can have a simpler and clean multi-insert implementation. This 
> is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8054) Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135415#comment-14135415
 ] 

Xuefu Zhang commented on HIVE-8054:
---

Thank you for the catch, [~leftylev].

> Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark 
> Branch]
> --
>
> Key: HIVE-8054
> URL: https://issues.apache.org/jira/browse/HIVE-8054
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1, TODOC-SPARK
> Fix For: spark-branch
>
> Attachments: HIVE-8054-spark.patch, HIVE-8054.2-spark.patch, 
> HIVE-8054.3-spark.patch
>
>
> Option hive.optimize.union.remove introduced in HIVE-3276 removes union 
> operators from the operator graph in certain cases as an optimization reduce 
> the number of MR jobs. While making sense in MR, this optimization is 
> actually harmful to an execution engine such as Spark, which natives supports 
> union without requiring additional jobs. This is because removing union 
> operator creates disjointed operator graphs, each graph generating a job, and 
> thus this optimization requires more jobs to run the query. Not to mention 
> the additional complexity handling linked FS descriptors.
> I propose that we disable such optimization when the execution engine is 
> Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135517#comment-14135517
 ] 

Xuefu Zhang commented on HIVE-8118:
---

Hi [~chengxiang li],

Thank you for your input. I'm not sure I understand your thought correctly. Let 
me clarify the problem by giving a SparkWork like this:
{code}
MapWork1 -> ReduceWork1
 \-> ReduceWork2
{code}
It means that MapWork1 will generate different datasets to feed ReduceWork1 and 
ReduceWork2. In the case of multi-insert, ReduceWork1 and ReduceWork2 will each 
have a FS operator. Inside MapWork1, there will be two operator branches 
consuming the same data and pushing different data sets to two RS operators. 
(ReduceWork1 and ReduceWork2 have different HiveReduceFunctions.)

However, the current implementation takes only the first data set and feeds it 
to both reduce works. The same problem can also occur if MapWork1 were a reduce 
work following another ReduceWork or MapWork.

Given this, I'm not sure how we can get around it without letting MapWork1 
generate two output RDDs, one for each following reduce work. Potentially, we 
could duplicate MapWork1 and have the following diagram:
{code}
MapWork11 -> ReduceWork1
MapWork12 -> ReduceWork2
{code}
where MapWork11 and MapWork12 consume the same input table (as an RDD), feeding 
their output RDDs to ReduceWork1 and ReduceWork2 respectively. This has its 
complexity, but more importantly, there will be wasted READ (unless Spark is 
smart enough to cache the input table, which is unlikely) and COMPUTATION 
(computing the data twice). I feel it's unlikely we'll get such optimizations 
from the Spark framework in the near term.

Thus, I think we have to take into consideration that a map work or a reduce 
work might generate multiple RDDs, one feeding each of its children. Since 
SparkMapRecordHandler and SparkReduceRecordHandler do the data processing on 
the map and reduce side, they need a way to generate multiple outputs.

Please correct me if I understood you wrong. Thanks.


> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> 
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Venki Korukanti
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible in the future an optimation may be implemented so that a map 
> work is followed by two reduce works and then connected to a union work.
> Thus, we should take this as a general case. Tez is currently providing a 
> collector for each child operator in the map-side or reduce side operator 
> tree. We can take Tez as a reference.
> Likely this is a big change and subtasks are possible. 
> With this, we can have a simpler and clean multi-insert implementation. This 
> is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7870) Insert overwrite table query does not generate correct task plan [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7870:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Fixed via HIVE-8017.

> Insert overwrite table query does not generate correct task plan [Spark 
> Branch]
> ---
>
> Key: HIVE-7870
> URL: https://issues.apache.org/jira/browse/HIVE-7870
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Na Yang
>Assignee: Na Yang
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch, 
> HIVE-7870.3-spark.patch, HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
>
> Insert overwrite table query does not generate correct task plan when 
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON. 
> {noformat}
> set hive.optimize.union.remove=true
> set hive.merge.sparkfiles=true
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result
> {noformat}
> 1 1
> 1 2
> 2 1
> 2 2
> 3 1
> 3 2
> 7 1
> 7 2
> 8 2
> 8 2
> 8 2
> {noformat}
> expected result:
> {noformat}
> 1 1
> 1 1
> 1 2
> 2 1
> 2 1
> 2 2
> 3 1
> 3 1
> 3 2
> 7 1
> 7 1
> 7 2
> 8 1
> 8 1
> 8 2
> 8 2
> 8 2
> {noformat}
> Move work is not working properly and some data are missing during move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-7870) Insert overwrite table query does not generate correct task plan [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135532#comment-14135532
 ] 

Xuefu Zhang edited comment on HIVE-7870 at 9/16/14 2:36 PM:


Fixed via HIVE-8054.


was (Author: xuefuz):
Fixed via HIVE-8017.

> Insert overwrite table query does not generate correct task plan [Spark 
> Branch]
> ---
>
> Key: HIVE-7870
> URL: https://issues.apache.org/jira/browse/HIVE-7870
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Na Yang
>Assignee: Na Yang
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch, 
> HIVE-7870.3-spark.patch, HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
>
> Insert overwrite table query does not generate correct task plan when 
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON. 
> {noformat}
> set hive.optimize.union.remove=true
> set hive.merge.sparkfiles=true
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result
> {noformat}
> 1 1
> 1 2
> 2 1
> 2 2
> 3 1
> 3 2
> 7 1
> 7 2
> 8 2
> 8 2
> 8 2
> {noformat}
> expected result:
> {noformat}
> 1 1
> 1 1
> 1 2
> 2 1
> 2 1
> 2 2
> 3 1
> 3 1
> 3 2
> 7 1
> 7 1
> 7 2
> 8 1
> 8 1
> 8 2
> 8 2
> 8 2
> {noformat}
> Move work is not working properly and some data are missing during move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors[Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135517#comment-14135517
 ] 

Xuefu Zhang edited comment on HIVE-8118 at 9/16/14 4:02 PM:


Hi [~chengxiang li],

Thank you for your input. I'm not sure I understand your thought correctly. Let 
me clarify the problem by giving a SparkWork like this:
{code}
MapWork1 -> ReduceWork1
  \-> ReduceWork2
{code}
It means that MapWork1 will generate different datasets to feed ReduceWork1 and 
ReduceWork2. In the case of multi-insert, ReduceWork1 and ReduceWork2 will each 
have a FS operator. Inside MapWork1, there will be two operator branches 
consuming the same data and pushing different data sets to two RS operators. 
(ReduceWork1 and ReduceWork2 have different HiveReduceFunctions.)

However, the current implementation takes only the first data set and feeds it 
to both reduce works. The same problem can also occur if MapWork1 were a reduce 
work following another ReduceWork or MapWork.

Given this, I'm not sure how we can get around it without letting MapWork1 
generate two output RDDs, one for each following reduce work. Potentially, we 
could duplicate MapWork1 and have the following diagram:
{code}
MapWork11 -> ReduceWork1
MapWork12 -> ReduceWork2
{code}
where MapWork11 and MapWork12 consume the same input table (as an RDD), feeding 
their output RDDs to ReduceWork1 and ReduceWork2 respectively. This has its 
complexity, but more importantly, there will be wasted READ (unless Spark is 
smart enough to cache the input table, which is unlikely) and COMPUTATION 
(computing the data twice). I feel it's unlikely we'll get such optimizations 
from the Spark framework in the near term.

Thus, I think we have to take into consideration that a map work or a reduce 
work might generate multiple RDDs, one feeding each of its children. Since 
SparkMapRecordHandler and SparkReduceRecordHandler do the data processing on 
the map and reduce side, they need a way to generate multiple outputs.

Please correct me if I understood you wrong. Thanks.



was (Author: xuefuz):
Hi [~chengxiang li],

Thank you for your input. I'm not sure if I understand your thought right. Let 
me clarify the problem  by giving a SparkWork like this:
{code}
MapWork1 -> ReduceWork1
 \-> ReduceWork2
{code}
it means that MapWork1 will generate different datasets to feed to ReduceWork1 
and ReduceWork2. In case of multi-insert, ReduceWork1 and ReduceWork2 will have 
a FS operator. Inside MapWork1, there will be two operator branches consuming 
the same data, and push different data sets to two RS operators. (ReduceWork1 
and ReduceWork2 have different HiveReduceFunctions.)

However, current implemenation only takes the first data set and feed it to 
both reduce works. The same problem can happen also if MapWork1 were a reduce 
work following other ReduceWork or MapWork.

With this problem, I'm not sure how we can get around without letting MapWork1 
generate two output RDDs, one for each following reduce work. Potentially, we 
can duplicate MapWork1 and have the following diagram:
{code}
MapWork11 -> ReduceWork1
MapWork12 -> ReduceWork2
{code}
where MapWork11 and MapWork12 consume the same input table (input table as 
RDD), and feed its first output RDD to ReduceWork1 and the second to 
ReduceWork2. This has its complexity, but more importantly, there will be 
wasted READ (unless SPark is smart enough to cache the input table, which is 
unlikely) and COMPUTATION (computing data twice). I feel that it's unlikely to 
get such optimizations from Spark framework in the near term.

Thus, I think we have to take into consideration that a map work or a reduce 
work might generate multiple RDDs, one feeds to each of its children. Since 
SparkMapRecorderHandler and SparkReduceRecordHandler are doing the data 
processing on map and reduce side, they need to have a way to generate multiple 
outputs.

Please correct me if I understood you wrong. Thanks.


> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors[Spark Branch]
> 
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Venki Korukanti
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two

[jira] [Commented] (HIVE-8118) SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135849#comment-14135849
 ] 

Xuefu Zhang commented on HIVE-8118:
---

[~chengxiang li] and I had an offline discussion; there was just a little bit of 
confusion in understanding the problem, and now we are on the same page. To 
summarize, the problem comes when a map work or reduce work is connected to 
multiple reduce works. Currently a map work or reduce work is wired with only 
one collector, which collects all data regardless of the branch. That data set 
feeds all subsequent child reduce works.
 
I also noted that Tez provides a map of named outputs to its record handlers. 
However, for us, we may not be able to do that, due to the limitations of 
Spark's RDD transformation APIs.


> SparkMapRecorderHandler and SparkReduceRecordHandler should be initialized 
> with multiple result collectors [Spark Branch]
> -
>
> Key: HIVE-8118
> URL: https://issues.apache.org/jira/browse/HIVE-8118
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Venki Korukanti
>  Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and 
> SparkReduceRecorderHandler takes only one result collector, which limits that 
> the corresponding map or reduce task can have only one child. It's very 
> comment in multi-insert queries where a map/reduce task has more than one 
> children. A query like the following has two map tasks as parents:
> {code}
> select name, sum(value) from dec group by name union all select name, value 
> from dec order by name
> {code}
> It's possible in the future an optimation may be implemented so that a map 
> work is followed by two reduce works and then connected to a union work.
> Thus, we should take this as a general case. Tez is currently providing a 
> collector for each child operator in the map-side or reduce side operator 
> tree. We can take Tez as a reference.
> Likely this is a big change and subtasks are possible. 
> With this, we can have a simpler and clean multi-insert implementation. This 
> is also the problem observed in HIVE-7731.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8055) Code cleanup after HIVE-8054 [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135948#comment-14135948
 ] 

Xuefu Zhang commented on HIVE-8055:
---

Patch looks good. +1 pending on test.

> Code cleanup after HIVE-8054 [Spark Branch]
> ---
>
> Key: HIVE-8055
> URL: https://issues.apache.org/jira/browse/HIVE-8055
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8055-spark.patch
>
>
> There is quite some code handling union removal optimization in SparkCompiler 
> and related classes. We need to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8055) Code cleanup after HIVE-8054 [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8055:
--
Status: Patch Available  (was: Open)

> Code cleanup after HIVE-8054 [Spark Branch]
> ---
>
> Key: HIVE-8055
> URL: https://issues.apache.org/jira/browse/HIVE-8055
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8055-spark.patch
>
>
> There is quite some code handling union removal optimization in SparkCompiler 
> and related classes. We need to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8106) Enable vectorization for spark [spark branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135995#comment-14135995
 ] 

Xuefu Zhang commented on HIVE-8106:
---

Hi [~chinnalalam], if the patch is ready, please click the button above to 
trigger the test run. Thanks.


> Enable vectorization for spark [spark branch]
> -
>
> Key: HIVE-8106
> URL: https://issues.apache.org/jira/browse/HIVE-8106
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Chinna Rao Lalam
>Assignee: Chinna Rao Lalam
> Attachments: HIVE-8106-spark.patch, HIVE-8106.1-spark.patch
>
>
> Enable the vectorization optimization on spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8140:
-

 Summary: Remove obsolete code from SparkWork [Spark Branch]
 Key: HIVE-8140
 URL: https://issues.apache.org/jira/browse/HIVE-8140
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang


There is old code in SparkWork for get/set map/reduce work. It's from the POC 
code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8140:
--
Issue Type: Sub-task  (was: Bug)
Parent: HIVE-7292

> Remove obsolete code from SparkWork [Spark Branch]
> --
>
> Key: HIVE-8140
> URL: https://issues.apache.org/jira/browse/HIVE-8140
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> There is old code in SparkWork for get/set map/reduce work. It's from the POC 
> code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136271#comment-14136271
 ] 

Xuefu Zhang commented on HIVE-8140:
---

Sure. I didn't realize that was in the source.

> Remove obsolete code from SparkWork [Spark Branch]
> --
>
> Key: HIVE-8140
> URL: https://issues.apache.org/jira/browse/HIVE-8140
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> There is old code in SparkWork for get/set map/reduce work. It's from the POC 
> code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136300#comment-14136300
 ] 

Xuefu Zhang commented on HIVE-8043:
---

[~lirui] The current Hive on Spark code borrows Tez's code for merging 
small files. It basically falls back to MR's way of doing this; please refer 
to GenSparkUtils.processFileSinkOperators() for details. I think we can take a 
look at HIVE-7704 to see if there is anything that we can do similarly. Please 
do the research and write down your findings. We don't need to implement it 
right away, as it's not critical for our M1.
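
For reference, the session settings involved look roughly like this (a hedged example; the thresholds shown are only illustrative defaults and should be double-checked against HiveConf):

{code}
set hive.merge.sparkfiles=true;              -- merge small files at the end of a Spark job (HIVE-7810)
set hive.merge.smallfiles.avgsize=16000000;  -- trigger a merge pass when the average output file is smaller than this
set hive.merge.size.per.task=256000000;      -- target size of the merged files
{code}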

> Support merging small files [Spark Branch]
> --
>
> Key: HIVE-8043
> URL: https://issues.apache.org/jira/browse/HIVE-8043
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> hive.merge.sparkfiles was already introduced in HIVE-7810. To make it work, we 
> might need a little more research and design on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8140:
--
Status: Patch Available  (was: Open)

> Remove obsolete code from SparkWork [Spark Branch]
> --
>
> Key: HIVE-8140
> URL: https://issues.apache.org/jira/browse/HIVE-8140
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao
>  Labels: Spark-M1
> Attachments: HIVE-8140.1-spark.patch
>
>
> There is old code in SparkWork for get/set map/reduce work. It's from the POC 
> code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7647) Beeline does not honor --headerInterval and --color when executing with "-e"

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136496#comment-14136496
 ] 

Xuefu Zhang commented on HIVE-7647:
---

+1

> Beeline does not honor --headerInterval and --color when executing with "-e"
> 
>
> Key: HIVE-7647
> URL: https://issues.apache.org/jira/browse/HIVE-7647
> Project: Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.14.0
>Reporter: Naveen Gangam
>Assignee: Naveen Gangam
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: HIVE-7647.1.patch
>
>
> --showHeader is being honored
> [root@localhost ~]# beeline --showHeader=false -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> -hiveconf (No such file or directory)
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> | 11-2011  | Advertising and promotions managers  | 36100  | 94720   |
> | 11-2021  | Marketing managers   | 166790 | 118160  |
> | 11-2022  | Sales managers   | 333910 | 110390  |
> | 11-2031  | Public relations managers| 51730  | 101220  |
> | 11-3011  | Administrative services managers | 246930 | 79500   |
> +--+--++-+
> 10 rows selected (0.838 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> --outputFormat is being honored.
> [root@localhost ~]# beeline --outputFormat=csv -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 'code','description','total_emp','salary'
> '00-','All Occupations','135185230','42270'
> '11-','Management occupations','6152650','100310'
> '11-1011','Chief executives','301930','160440'
> '11-1021','General and operations managers','1697690','107970'
> '11-1031','Legislators','64650','37980'
> '11-2011','Advertising and promotions managers','36100','94720'
> '11-2021','Marketing managers','166790','118160'
> '11-2022','Sales managers','333910','110390'
> '11-2031','Public relations managers','51730','101220'
> '11-3011','Administrative services managers','246930','79500'
> 10 rows selected (0.664 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> both --color & --headerInterval are being honored when executing using "-f" 
> option (reads query from a file rather than the commandline) (cannot really 
> see the color here but use the terminal colors)
> [root@localhost ~]# beeline --showheader=true --color=true --headerInterval=5 
> -u 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -f /tmp/tmp.sql  
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> 0: jdbc:hive2://localhost> select * from sample_07 limit 8;
> +--+--++-+
> |   code   | description  | total_emp  | salary  |
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> +--+--++-+
> |   code   | description  | total_emp  | salary  |
> +--+--+-

[jira] [Updated] (HIVE-8055) Code cleanup after HIVE-8054 [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8055:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Na for the contribution.

> Code cleanup after HIVE-8054 [Spark Branch]
> ---
>
> Key: HIVE-8055
> URL: https://issues.apache.org/jira/browse/HIVE-8055
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Na Yang
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8055-spark.patch
>
>
> There is quite some code handling union removal optimization in SparkCompiler 
> and related classes. We need to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136542#comment-14136542
 ] 

Xuefu Zhang commented on HIVE-8140:
---

+1

> Remove obsolete code from SparkWork [Spark Branch]
> --
>
> Key: HIVE-8140
> URL: https://issues.apache.org/jira/browse/HIVE-8140
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao
>  Labels: Spark-M1
> Attachments: HIVE-8140.1-spark.patch
>
>
> There is old code in SparkWork for get/set map/reduce work. It's from the POC 
> code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8140) Remove obsolete code from SparkWork [Spark Branch]

2014-09-16 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8140:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Chao for the contribution.

> Remove obsolete code from SparkWork [Spark Branch]
> --
>
> Key: HIVE-8140
> URL: https://issues.apache.org/jira/browse/HIVE-8140
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8140.1-spark.patch
>
>
> There is old code in SparkWork for get/set map/reduce work. It's from the POC 
> code and isn't applicable any more. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136572#comment-14136572
 ] 

Xuefu Zhang commented on HIVE-7100:
---

+1 pending on test result.

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, HIVE-7100.5.patch, 
> HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7980) Hive on spark issue..

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137302#comment-14137302
 ] 

Xuefu Zhang commented on HIVE-7980:
---

[~alton.jung] Thanks for reporting the problem. I'll find a developer to look 
at this issue.

> Hive on spark issue..
> -
>
> Key: HIVE-7980
> URL: https://issues.apache.org/jira/browse/HIVE-7980
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Spark
>Affects Versions: spark-branch
> Environment: Test Environment is..
> . hive 0.14.0(spark branch version)
> . spark 
> (http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar)
> . hadoop 2.4.0 (yarn)
>Reporter: alton.jung
> Fix For: spark-branch
>
>
> I followed this 
> guide(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) 
> and compiled Hive from the Spark branch. In the next step I hit the error 
> below.
> (I typed the Hive query in Beeline, using a simple query with "order 
> by" to invoke the parallel work, 
> ex) select * from test where id = 1 order by id;
> )
> [Error list is]
> 2014-09-04 02:58:08,796 ERROR spark.SparkClient 
> (SparkClient.java:execute(158)) - Error generating Spark Plan
> java.lang.NullPointerException
>   at 
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
>   at 
> org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
>   at 
> org.apache.spark.SparkContext.hadoopRDD$default$5(SparkContext.scala:537)
>   at 
> org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
> 2014-09-04 02:58:11,108 ERROR ql.Driver (SessionState.java:printError(696)) - 
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> 2014-09-04 02:58:11,182 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824527954 end=1409824691182 duration=163228 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,223 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(108)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,224 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824691223 end=1409824691224 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,306 ERROR operation.Operation 
> (SQLOperation.java:run(199)) - Error running hive query: 
> org.apache.hive.service.cli.HiveSQLException: Error while processing 
> statement: FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:284)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:146)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:508)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> 2014-09-04 02:58:11,634 INFO  exec.ListSinkOperator 
> (Operator.java:close(580)) - 47 finished. closing... 
> 2014-09-04 02:58:11,683 INFO  exec.ListSinkOperator 
> (Op

[jira] [Assigned] (HIVE-7980) Hive on spark issue..

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-7980:
-

Assignee: Chao

> Hive on spark issue..
> -
>
> Key: HIVE-7980
> URL: https://issues.apache.org/jira/browse/HIVE-7980
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Spark
>Affects Versions: spark-branch
> Environment: Test Environment is..
> . hive 0.14.0(spark branch version)
> . spark 
> (http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar)
> . hadoop 2.4.0 (yarn)
>Reporter: alton.jung
>Assignee: Chao
> Fix For: spark-branch
>
>
> I followed this 
> guide(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) 
> and compiled Hive from the Spark branch. In the next step I hit the error 
> below.
> (I typed the Hive query in Beeline, using a simple query with "order 
> by" to invoke the parallel work, 
> ex) select * from test where id = 1 order by id;
> )
> [Error list is]
> 2014-09-04 02:58:08,796 ERROR spark.SparkClient 
> (SparkClient.java:execute(158)) - Error generating Spark Plan
> java.lang.NullPointerException
>   at 
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
>   at 
> org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
>   at 
> org.apache.spark.SparkContext.hadoopRDD$default$5(SparkContext.scala:537)
>   at 
> org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
> 2014-09-04 02:58:11,108 ERROR ql.Driver (SessionState.java:printError(696)) - 
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> 2014-09-04 02:58:11,182 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824527954 end=1409824691182 duration=163228 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,223 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(108)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,224 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824691223 end=1409824691224 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,306 ERROR operation.Operation 
> (SQLOperation.java:run(199)) - Error running hive query: 
> org.apache.hive.service.cli.HiveSQLException: Error while processing 
> statement: FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:284)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:146)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:508)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> 2014-09-04 02:58:11,634 INFO  exec.ListSinkOperator 
> (Operator.java:close(580)) - 47 finished. closing... 
> 2014-09-04 02:58:11,683 INFO  exec.ListSinkOperator 
> (Operator.java:close(598)) - 47 Close done
> 2014-09-04 02:58:12,190 INFO  log.PerfLogger 
> (PerfLog

[jira] [Commented] (HIVE-8141) Refactor the GraphTran code by moving union handling logic to UnionTran [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137322#comment-14137322
 ] 

Xuefu Zhang commented on HIVE-8141:
---

+1

> Refactor the GraphTran code by moving union handling logic to UnionTran 
> [Spark Branch]
> --
>
> Key: HIVE-8141
> URL: https://issues.apache.org/jira/browse/HIVE-8141
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Na Yang
>Assignee: Na Yang
>  Labels: Spark-M1
> Attachments: HIVE-8141.1-spark.patch
>
>
> In the current hive on spark code, union logic is handled in the GraphTran 
> class. The Union logic could be moved to the UnionTran class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8141) Refactor the GraphTran code by moving union handling logic to UnionTran [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8141:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Na for the contribution.

> Refactor the GraphTran code by moving union handling logic to UnionTran 
> [Spark Branch]
> --
>
> Key: HIVE-8141
> URL: https://issues.apache.org/jira/browse/HIVE-8141
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Na Yang
>Assignee: Na Yang
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8141.1-spark.patch
>
>
> In the current hive on spark code, union logic is handled in the GraphTran 
> class. The Union logic could be moved to the UnionTran class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8160:
-

 Summary: Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
 Key: HIVE-8160
 URL: https://issues.apache.org/jira/browse/HIVE-8160
 Project: Hive
  Issue Type: Task
  Components: Spark
Reporter: Xuefu Zhang
Priority: Minor


Hive on Spark needs SPARK-2978, which is now available in the latest Spark main 
branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8083) Authorization DDLs should not enforce hive identifier syntax for user or group

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137581#comment-14137581
 ] 

Xuefu Zhang commented on HIVE-8083:
---

+1 pending on test result.

> Authorization DDLs should not enforce hive identifier syntax for user or group
> --
>
> Key: HIVE-8083
> URL: https://issues.apache.org/jira/browse/HIVE-8083
> Project: Hive
>  Issue Type: Bug
>  Components: SQL, SQLStandardAuthorization
>Affects Versions: 0.13.0, 0.13.1
>Reporter: Prasad Mujumdar
>Assignee: Prasad Mujumdar
> Attachments: HIVE-8083.1.patch, HIVE-8083.2.patch, HIVE-8083.3.patch
>
>
> The compiler expects principals (user, group and role) as hive identifiers 
> for authorization DDLs. The user and group are entities that belong to 
> external namespace and we can't expect those to follow hive identifier syntax 
> rules. For example, a userid or group can contain '-' which is not allowed by 
> compiler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137618#comment-14137618
 ] 

Xuefu Zhang commented on HIVE-8043:
---

[~lirui] Thanks for your detailed analysis. I think we need to verify the 
following:

1. File merging (either through DDL or hive settings) needs to work for all data 
formats regardless of execution engine type. That includes RC, ORC, and other 
formats. Please verify that file merging works with Spark. If not, check MR.

2. The improvement made in HIVE-7704 might be Tez only. If this is the case, 
please identify the work that needs to be done to support it, but we don't 
have to implement it now, as it's an optimization, which can be done in later 
milestones.

Thanks.

> Support merging small files [Spark Branch]
> --
>
> Key: HIVE-8043
> URL: https://issues.apache.org/jira/browse/HIVE-8043
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> hive.merge.sparkfiles was already introduced in HIVE-7810. To make it work, we 
> might need a little more research and design on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-8160:
-

Assignee: Xuefu Zhang

> Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
> -
>
> Key: HIVE-8160
> URL: https://issues.apache.org/jira/browse/HIVE-8160
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Minor
> Attachments: HIVE-8160.1-spark.patch
>
>
> Hive on Spark needs SPARK-2978, which is now available in latest Spark main 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8160:
--
Attachment: HIVE-8160.1-spark.patch

> Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
> -
>
> Key: HIVE-8160
> URL: https://issues.apache.org/jira/browse/HIVE-8160
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Minor
> Attachments: HIVE-8160.1-spark.patch
>
>
> Hive on Spark needs SPARK-2978, which is now available in latest Spark main 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8160:
--
Labels: Spark-M1  (was: )

> Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
> -
>
> Key: HIVE-8160
> URL: https://issues.apache.org/jira/browse/HIVE-8160
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Minor
>  Labels: Spark-M1
> Attachments: HIVE-8160.1-spark.patch
>
>
> Hive on Spark needs SPARK-2978, which is now available in latest Spark main 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8160:
--
Status: Patch Available  (was: Open)

> Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
> -
>
> Key: HIVE-8160
> URL: https://issues.apache.org/jira/browse/HIVE-8160
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Minor
> Attachments: HIVE-8160.1-spark.patch
>
>
> Hive on Spark needs SPARK-2978, which is now available in latest Spark main 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8160) Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8160:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch.

> Upgrade Spark dependency to 1.2.0-SNAPSHOT [Spark Branch]
> -
>
> Key: HIVE-8160
> URL: https://issues.apache.org/jira/browse/HIVE-8160
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Minor
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8160.1-spark.patch
>
>
> Hive on Spark needs SPARK-2978, which is now available in latest Spark main 
> branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7647) Beeline does not honor --headerInterval and --color when executing with "-e"

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138264#comment-14138264
 ] 

Xuefu Zhang commented on HIVE-7647:
---

[~ngangam], it looks like your patch needs to be rebased.

> Beeline does not honor --headerInterval and --color when executing with "-e"
> 
>
> Key: HIVE-7647
> URL: https://issues.apache.org/jira/browse/HIVE-7647
> Project: Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.14.0
>Reporter: Naveen Gangam
>Assignee: Naveen Gangam
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: HIVE-7647.1.patch
>
>
> --showHeader is being honored
> [root@localhost ~]# beeline --showHeader=false -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> -hiveconf (No such file or directory)
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> | 11-2011  | Advertising and promotions managers  | 36100  | 94720   |
> | 11-2021  | Marketing managers   | 166790 | 118160  |
> | 11-2022  | Sales managers   | 333910 | 110390  |
> | 11-2031  | Public relations managers| 51730  | 101220  |
> | 11-3011  | Administrative services managers | 246930 | 79500   |
> +--+--++-+
> 10 rows selected (0.838 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> --outputFormat is being honored.
> [root@localhost ~]# beeline --outputFormat=csv -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 'code','description','total_emp','salary'
> '00-','All Occupations','135185230','42270'
> '11-','Management occupations','6152650','100310'
> '11-1011','Chief executives','301930','160440'
> '11-1021','General and operations managers','1697690','107970'
> '11-1031','Legislators','64650','37980'
> '11-2011','Advertising and promotions managers','36100','94720'
> '11-2021','Marketing managers','166790','118160'
> '11-2022','Sales managers','333910','110390'
> '11-2031','Public relations managers','51730','101220'
> '11-3011','Administrative services managers','246930','79500'
> 10 rows selected (0.664 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> both --color & --headerInterval are being honored when executing using "-f" 
> option (reads query from a file rather than the commandline) (cannot really 
> see the color here but use the terminal colors)
> [root@localhost ~]# beeline --showheader=true --color=true --headerInterval=5 
> -u 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -f /tmp/tmp.sql  
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> 0: jdbc:hive2://localhost> select * from sample_07 limit 8;
> +--+--++-+
> |   code   | description  | total_emp  | salary  |
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> +--+--++-+
> |   code   | description  | total_emp  | salary  |

[jira] [Commented] (HIVE-7613) Research optimization of auto convert join to map join [Spark branch]

2014-09-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138560#comment-14138560
 ] 

Xuefu Zhang commented on HIVE-7613:
---

Here is what I have in mind:

1. For an N-way join being converted to a map join, we can run N-1 Spark jobs, 
one for each small input to the join (assuming transformation is needed; if 
not, we don't need a Spark job). Each job generates an RDD at the end, 
so we have N-1 RDDs in total.

2. Dump the content of those RDDs into the data structure (hash tables) needed 
by MapJoinOperator.

3. Call SparkContext.broadcast() on that data structure. This will broadcast 
the data structure to all nodes.

4. Then, we can launch the map-only join job, which can load the broadcast 
data structure via the HashTableLoader interface (a rough sketch follows below).

For more information about Spark's broadcast variables, please refer to 
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables.
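
A rough sketch of steps 2-4 with Spark's Java API. Only broadcast()/value() are the real Spark calls; the hash-table type and the loading hook are simplified placeholders, not Hive's actual structures:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Sketch only: the hash table is simplified to a plain Map; in Hive the
// structure consumed by MapJoinOperator (via HashTableLoader) is richer.
public class MapJoinBroadcastSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "map-join-sketch");

    // Step 2: the content of the N-1 small-table RDDs, collected on the
    // driver, is dumped into hash tables (stubbed out with an empty map here).
    List<Map<Object, Object>> hashTables = new ArrayList<Map<Object, Object>>();
    hashTables.add(new HashMap<Object, Object>());

    // Step 3: broadcast the hash tables to all nodes.
    Broadcast<List<Map<Object, Object>>> broadcastTables = sc.broadcast(hashTables);

    // Step 4: inside the map-only join job, a HashTableLoader-style hook
    // would read broadcastTables.value() on each executor instead of
    // loading the small tables from files.
    System.out.println("tables broadcast: " + broadcastTables.value().size());

    sc.stop();
  }
}
{code}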

> Research optimization of auto convert join to map join [Spark branch]
> -
>
> Key: HIVE-7613
> URL: https://issues.apache.org/jira/browse/HIVE-7613
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Chengxiang Li
>Assignee: Suhas Satish
>Priority: Minor
> Attachments: HIve on Spark Map join background.docx
>
>
> ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle 
> join) with a map join (aka broadcast or fragment replicate join) when 
> possible. We need to research how to make it workable with Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]

2014-09-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139931#comment-14139931
 ] 

Xuefu Zhang commented on HIVE-8043:
---

[~lirui] Thanks for providing further details. I guess "alter table ... 
concatenate" is a very old feature with some newer elements, and it seems 
incomplete in many ways. The lack of documentation is therefore understandable. 
I'm not sure of its adoption. Please feel free to create JIRAs for those issues. 
Awesome research!
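
For reference, the statement under discussion looks like this (the table and partition names are made up for illustration):

{code}
-- merge the many small files of one partition in place
ALTER TABLE sample_orc_table PARTITION (dt='2014-09-18') CONCATENATE;
{code}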

> Support merging small files [Spark Branch]
> --
>
> Key: HIVE-8043
> URL: https://issues.apache.org/jira/browse/HIVE-8043
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
> Attachments: HIVE-8043.1-spark.patch, HIVE-8043.2-spark.patch
>
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> hive.merge.sparkfiles was already introduced in HIVE-7810. To make it work, we 
> might need a little more research and design on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-7382) Create a MiniSparkCluster and set up a testing framework [Spark Branch]

2014-09-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-7382:
-

Assignee: Xuefu Zhang  (was: Szehon Ho)

> Create a MiniSparkCluster and set up a testing framework [Spark Branch]
> ---
>
> Key: HIVE-7382
> URL: https://issues.apache.org/jira/browse/HIVE-7382
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>  Labels: Spark-M1
>
> To automatically test Hive functionality over the Spark execution engine, we need 
> to create a test framework that can execute Hive queries with Spark as the 
> backend. For that, we should create a MiniSparkCluster, similar to what exists for 
> other execution engines.
> Spark has a way to create a local cluster with a few processes on the local 
> machine, each process being a worker node. It's fairly close to a real Spark 
> cluster. Our mini cluster can be based on that.
> For more info, please refer to the design doc on wiki.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-7382) Create a MiniSparkCluster and set up a testing framework [Spark Branch]

2014-09-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-7382:
-

Assignee: Rui Li  (was: Xuefu Zhang)

Assigned to Rui to do further research.

> Create a MiniSparkCluster and set up a testing framework [Spark Branch]
> ---
>
> Key: HIVE-7382
> URL: https://issues.apache.org/jira/browse/HIVE-7382
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
>
> To automatically test Hive functionality over the Spark execution engine, we need 
> to create a test framework that can execute Hive queries with Spark as the 
> backend. For that, we should create a MiniSparkCluster, similar to what exists for 
> other execution engines.
> Spark has a way to create a local cluster with a few processes on the local 
> machine, each process being a worker node. It's fairly close to a real Spark 
> cluster. Our mini cluster can be based on that.
> For more info, please refer to the design doc on wiki.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8083) Authorization DDLs should not enforce hive identifier syntax for user or group

2014-09-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8083:
--
   Resolution: Fixed
Fix Version/s: 0.14.0
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks to Prasad.

> Authorization DDLs should not enforce hive identifier syntax for user or group
> --
>
> Key: HIVE-8083
> URL: https://issues.apache.org/jira/browse/HIVE-8083
> Project: Hive
>  Issue Type: Bug
>  Components: SQL, SQLStandardAuthorization
>Affects Versions: 0.13.0, 0.13.1
>Reporter: Prasad Mujumdar
>Assignee: Prasad Mujumdar
> Fix For: 0.14.0
>
> Attachments: HIVE-8083.1.patch, HIVE-8083.2.patch, HIVE-8083.3.patch
>
>
> The compiler expects principals (user, group and role) as hive identifiers 
> for authorization DDLs. The user and group are entities that belong to 
> external namespace and we can't expect those to follow hive identifier syntax 
> rules. For example, a userid or group can contain '-' which is not allowed by 
> compiler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140006#comment-14140006
 ] 

Xuefu Zhang commented on HIVE-7100:
---

[~dbsalti], would you like to address the above question/concern? Thanks.

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, HIVE-7100.5.patch, 
> HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7980) Hive on spark issue..

2014-09-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140778#comment-14140778
 ] 

Xuefu Zhang commented on HIVE-7980:
---

[~alton.jung] For Hive, you need the latest from the Spark branch. For Spark, you 
can also use the latest from their master branch. Since both are under 
development, issues can arise. Could you describe what you are trying to do 
and how to reproduce your issue(s)? Thanks.

> Hive on spark issue..
> -
>
> Key: HIVE-7980
> URL: https://issues.apache.org/jira/browse/HIVE-7980
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Spark
>Affects Versions: spark-branch
> Environment: Test Environment is..
> . hive 0.14.0(spark branch version)
> . spark 
> (http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar)
> . hadoop 2.4.0 (yarn)
>Reporter: alton.jung
>Assignee: Chao
> Fix For: spark-branch
>
>
> I followed this 
> guide(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) 
> and compiled Hive from the Spark branch. In the next step I hit the error 
> below.
> (I typed the Hive query in Beeline, using a simple query with "order 
> by" to invoke the parallel work, 
> ex) select * from test where id = 1 order by id;
> )
> [Error list is]
> 2014-09-04 02:58:08,796 ERROR spark.SparkClient 
> (SparkClient.java:execute(158)) - Error generating Spark Plan
> java.lang.NullPointerException
>   at 
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
>   at 
> org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
>   at 
> org.apache.spark.SparkContext.hadoopRDD$default$5(SparkContext.scala:537)
>   at 
> org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:318)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateRDD(SparkPlanGenerator.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:88)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkClient.execute(SparkClient.java:156)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:52)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:77)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:161)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
> 2014-09-04 02:58:11,108 ERROR ql.Driver (SessionState.java:printError(696)) - 
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> 2014-09-04 02:58:11,182 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824527954 end=1409824691182 duration=163228 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,223 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogBegin(108)) -  from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,224 INFO  log.PerfLogger 
> (PerfLogger.java:PerfLogEnd(135)) -  start=1409824691223 end=1409824691224 duration=1 
> from=org.apache.hadoop.hive.ql.Driver>
> 2014-09-04 02:58:11,306 ERROR operation.Operation 
> (SQLOperation.java:run(199)) - Error running hive query: 
> org.apache.hive.service.cli.HiveSQLException: Error while processing 
> statement: FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:284)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:146)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:508)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at ja

[jira] [Created] (HIVE-8202) Support SMB Join for Hive on Spark [Spark Branch]

2014-09-19 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8202:
-

 Summary: Support SMB Join for Hive on Spark [Spark Branch]
 Key: HIVE-8202
 URL: https://issues.apache.org/jira/browse/HIVE-8202
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang


SMB joins are used wherever the tables are sorted and bucketed. It's a 
reduce-side join. The join boils down to just merging the already sorted 
tables, allowing this operation to be faster than an ordinary map join. 
However, if the tables are partitioned, there could be a slowdown, as each 
mapper would need to get a very small chunk of a partition which has a single 
key. Thus, in some scenarios it's beneficial to convert an SMB join to an SMB map 
join as well.

The task is to research and support the conversion from a regular SMB join to an 
SMB map join for the Spark execution engine.
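
For reference, these are the settings that govern SMB joins and their map-join conversion with the MR engine today; listed here as a hedged starting point, and the exact set should be double-checked against HiveConf:

{code}
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
-- controls converting an SMB join into an SMB map join
set hive.auto.convert.sortmerge.join.to.mapjoin=true;
{code}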



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7503) Support Hive's multi-table insert query with Spark [Spark Branch]

2014-09-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141760#comment-14141760
 ] 

Xuefu Zhang commented on HIVE-7503:
---

+1 :)

> Support Hive's multi-table insert query with Spark [Spark Branch]
> -
>
> Key: HIVE-7503
> URL: https://issues.apache.org/jira/browse/HIVE-7503
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao
>  Labels: spark-m1
> Attachments: HIVE-7503.1-spark.patch, HIVE-7503.2-spark.patch, 
> HIVE-7503.3-spark.patch, HIVE-7503.4-spark.patch, HIVE-7503.5-spark.patch, 
> HIVE-7503.6-spark.patch, HIVE-7503.7-spark.patch, HIVE-7503.8-spark.patch, 
> HIVE-7503.9-spark.patch
>
>
> For Hive's multi insert query 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there 
> may be an MR job for each insert.  When we achieve this with Spark, it would 
> be nice if all the inserts can happen concurrently.
> It seems that this functionality isn't available in Spark. To make things 
> worse, the source of the insert may be re-computed unless it's staged. Even 
> with this, the inserts will happen sequentially, making the performance 
> suffer.
> This task is to find out what it takes in Spark to enable this without requiring 
> staging of the source or sequential insertion. If this has to be solved in 
> Hive, find out an optimal way to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7503) Support Hive's multi-table insert query with Spark [Spark Branch]

2014-09-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7503:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Chao for the contribution.

> Support Hive's multi-table insert query with Spark [Spark Branch]
> -
>
> Key: HIVE-7503
> URL: https://issues.apache.org/jira/browse/HIVE-7503
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Chao
>  Labels: spark-m1
> Fix For: spark-branch
>
> Attachments: HIVE-7503.1-spark.patch, HIVE-7503.2-spark.patch, 
> HIVE-7503.3-spark.patch, HIVE-7503.4-spark.patch, HIVE-7503.5-spark.patch, 
> HIVE-7503.6-spark.patch, HIVE-7503.7-spark.patch, HIVE-7503.8-spark.patch, 
> HIVE-7503.9-spark.patch
>
>
> For Hive's multi insert query 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), there 
> may be an MR job for each insert.  When we achieve this with Spark, it would 
> be nice if all the inserts can happen concurrently.
> It seems that this functionality isn't available in Spark. To make things 
> worse, the source of the insert may be re-computed unless it's staged. Even 
> with this, the inserts will happen sequentially, making the performance 
> suffer.
> This task is to find out what it takes in Spark to enable this without requiring 
> staging of the source or sequential insertion. If this has to be solved in 
> Hive, find out an optimal way to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8083) Authorization DDLs should not enforce hive identifier syntax for user or group

2014-09-20 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8083:
--
Labels: TODOC14  (was: )

> Authorization DDLs should not enforce hive identifier syntax for user or group
> --
>
> Key: HIVE-8083
> URL: https://issues.apache.org/jira/browse/HIVE-8083
> Project: Hive
>  Issue Type: Bug
>  Components: SQL, SQLStandardAuthorization
>Affects Versions: 0.13.0, 0.13.1
>Reporter: Prasad Mujumdar
>Assignee: Prasad Mujumdar
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-8083.1.patch, HIVE-8083.2.patch, HIVE-8083.3.patch
>
>
> The compiler expects principals (user, group and role) as hive identifiers 
> for authorization DDLs. The user and group are entities that belong to 
> external namespace and we can't expect those to follow hive identifier syntax 
> rules. For example, a userid or group can contain '-' which is not allowed by 
> compiler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8083) Authorization DDLs should not enforce hive identifier syntax for user or group

2014-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141988#comment-14141988
 ] 

Xuefu Zhang commented on HIVE-8083:
---

Thanks, Lefty. It does seem that this has doc impact, especially regarding 
hive.support.quoted.identifiers. [~prasadm], could you please comment on this?

> Authorization DDLs should not enforce hive identifier syntax for user or group
> --
>
> Key: HIVE-8083
> URL: https://issues.apache.org/jira/browse/HIVE-8083
> Project: Hive
>  Issue Type: Bug
>  Components: SQL, SQLStandardAuthorization
>Affects Versions: 0.13.0, 0.13.1
>Reporter: Prasad Mujumdar
>Assignee: Prasad Mujumdar
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-8083.1.patch, HIVE-8083.2.patch, HIVE-8083.3.patch
>
>
> The compiler expects principals (user, group and role) as hive identifiers 
> for authorization DDLs. The user and group are entities that belong to 
> external namespace and we can't expect those to follow hive identifier syntax 
> rules. For example, a userid or group can contain '-' which is not allowed by 
> compiler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7674) Update to Spark 1.2 [Spark Branch]

2014-09-20 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7674:
--
Description: In HIVE-8160 we added a custom repo to use Spark 1.2. Once 
1.2is released we need to remove this repo.  (was: In HIVE-7540 we added a 
custom repo to use Spark 1.1. Once 1.1 is released we need to remove this repo.)
Summary: Update to Spark 1.2 [Spark Branch]  (was: Update to Spark 1.1 
[Spark Branch])

Updated the JIRA to reflect the status quo.

> Update to Spark 1.2 [Spark Branch]
> --
>
> Key: HIVE-7674
> URL: https://issues.apache.org/jira/browse/HIVE-7674
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Brock Noland
>Priority: Blocker
>
> In HIVE-8160 we added a custom repo to use Spark 1.2. Once 1.2is released we 
> need to remove this repo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7674) Update to Spark 1.2 [Spark Branch]

2014-09-20 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7674:
--
Description: In HIVE-8160 we added a custom repo to use Spark 1.2. Once 1.2 
is released we need to remove this repo.  (was: In HIVE-8160 we added a custom 
repo to use Spark 1.2. Once 1.2is released we need to remove this repo.)

> Update to Spark 1.2 [Spark Branch]
> --
>
> Key: HIVE-7674
> URL: https://issues.apache.org/jira/browse/HIVE-7674
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Brock Noland
>Priority: Blocker
>
> In HIVE-8160 we added a custom repo to use Spark 1.2. Once 1.2 is released we 
> need to remove this repo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142002#comment-14142002
 ] 

Xuefu Zhang commented on HIVE-7100:
---

{quote}
What should the behavior of drop table be for an immutable table? With and 
without the purge option?
{quote}
There should be no difference in drop-table behavior for immutable 
tables, according to HIVE-6406. Being "immutable" doesn't prevent "dropping". It 
only blocks "updating" when content exists.

Purge is an option for dropping. Thus, there shouldn't be any relationship 
between "immutable" and "purge".


> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, HIVE-7100.5.patch, 
> HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.
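
At the HDFS level the request boils down to the following; the path and the boolean switch are illustrative, not Hive's actual implementation:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class DropTableDataSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path tableDir = new Path("/user/hive/warehouse/t");   // illustrative table location
    boolean skipTrash = Boolean.getBoolean("skipTrash");  // illustrative switch

    if (skipTrash) {
      // Data is removed immediately, so the quota is freed right away.
      fs.delete(tableDir, true /* recursive */);
    } else {
      // Default behavior: data moves to the user's .Trash and still counts against the quota.
      Trash.moveToAppropriateTrash(fs, tableDir, conf);
    }
  }
}
{code}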



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142003#comment-14142003
 ] 

Xuefu Zhang commented on HIVE-7100:
---

[~dbsalti] The patch needs to be rebased as it doesn't apply to latest trunk 
any more.

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, HIVE-7100.5.patch, 
> HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7842) load_dyn_part1.q fails with an assertion [Spark Branch]

2014-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142252#comment-14142252
 ] 

Xuefu Zhang commented on HIVE-7842:
---

[~vkorukanti], could you please verify and enable the test if it hasn't been 
enabled? Thanks.

> load_dyn_part1.q fails with an assertion [Spark Branch]
> ---
>
> Key: HIVE-7842
> URL: https://issues.apache.org/jira/browse/HIVE-7842
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: spark-branch
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>  Labels: Spark-M1
> Fix For: spark-branch
>
>
> On spark branch, load_dyn_part1.q fails with following assertion. Looks like 
> SerDe is receiving invalid ByteWritable buffer.
> {code}
> java.lang.AssertionError
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:205)"
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:187)"
> "org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:186)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)"
> "scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)"
> "scala.collection.Iterator$class.foreach(Iterator.scala:727)"
> "scala.collection.AbstractIterator.foreach(Iterator.scala:1157)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)"
> "org.apache.spark.scheduler.Task.run(Task.scala:54)"
> "org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)"
> "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)"
> "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)"
> "java.lang.Thread.run(Thread.java:744)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]

2014-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142337#comment-14142337
 ] 

Xuefu Zhang commented on HIVE-8043:
---

Patch looks good to me. +1

> Support merging small files [Spark Branch]
> --
>
> Key: HIVE-8043
> URL: https://issues.apache.org/jira/browse/HIVE-8043
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
> Attachments: HIVE-8043.1-spark.patch, HIVE-8043.2-spark.patch, 
> HIVE-8043.3-spark.patch
>
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> hive.merge.sparkfiles was already introduced in HIVE-7810. To make it work, we 
> might need a little more research and design on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8043) Support merging small files [Spark Branch]

2014-09-20 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8043:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Rui for the great contribution.

> Support merging small files [Spark Branch]
> --
>
> Key: HIVE-8043
> URL: https://issues.apache.org/jira/browse/HIVE-8043
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Rui Li
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-8043.1-spark.patch, HIVE-8043.2-spark.patch, 
> HIVE-8043.3-spark.patch
>
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> hive.merge.sparkfiles was already introduced in HIVE-7810. To make it work, we 
> might need a little more research and design on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7647) Beeline does not honor --headerInterval and --color when executing with "-e"

2014-09-21 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7647:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Naveen for the contribution.

> Beeline does not honor --headerInterval and --color when executing with "-e"
> 
>
> Key: HIVE-7647
> URL: https://issues.apache.org/jira/browse/HIVE-7647
> Project: Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.14.0
>Reporter: Naveen Gangam
>Assignee: Naveen Gangam
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: HIVE-7647.1.patch, HIVE-7647.2.patch
>
>
> --showHeader is being honored
> [root@localhost ~]# beeline --showHeader=false -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> -hiveconf (No such file or directory)
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> | 11-2011  | Advertising and promotions managers  | 36100  | 94720   |
> | 11-2021  | Marketing managers   | 166790 | 118160  |
> | 11-2022  | Sales managers   | 333910 | 110390  |
> | 11-2031  | Public relations managers| 51730  | 101220  |
> | 11-3011  | Administrative services managers | 246930 | 79500   |
> +--+--++-+
> 10 rows selected (0.838 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> --outputFormat is being honored.
> [root@localhost ~]# beeline --outputFormat=csv -u 
> 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -e "select * from sample_07 limit 10;"
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 'code','description','total_emp','salary'
> '00-','All Occupations','135185230','42270'
> '11-','Management occupations','6152650','100310'
> '11-1011','Chief executives','301930','160440'
> '11-1021','General and operations managers','1697690','107970'
> '11-1031','Legislators','64650','37980'
> '11-2011','Advertising and promotions managers','36100','94720'
> '11-2021','Marketing managers','166790','118160'
> '11-2022','Sales managers','333910','110390'
> '11-2031','Public relations managers','51730','101220'
> '11-3011','Administrative services managers','246930','79500'
> 10 rows selected (0.664 seconds)
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> Closing: org.apache.hive.jdbc.HiveConnection
> both --color & --headerInterval are being honored when executing using "-f" 
> option (reads query from a file rather than the commandline) (cannot really 
> see the color here but use the terminal colors)
> [root@localhost ~]# beeline --showheader=true --color=true --headerInterval=5 
> -u 'jdbc:hive2://localhost:1/default' -n hive -d 
> org.apache.hive.jdbc.HiveDriver -f /tmp/tmp.sql  
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
> Driver: Hive JDBC (version 0.12.0-cdh5.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 0.12.0-cdh5.1.0 by Apache Hive
> 0: jdbc:hive2://localhost> select * from sample_07 limit 8;
> +--+--++-+
> |   code   | description  | total_emp  | salary  |
> +--+--++-+
> | 00-  | All Occupations  | 135185230  | 42270   |
> | 11-  | Management occupations   | 6152650| 100310  |
> | 11-1011  | Chief executives | 301930 | 160440  |
> | 11-1021  | General and operations managers  | 1697690| 107970  |
> | 11-1031  | Legislators  | 64650  | 37980   |
> +--+--++-+
> |   code   | description 

[jira] [Commented] (HIVE-7946) CBO: Merge CBO changes to Trunk

2014-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142828#comment-14142828
 ] 

Xuefu Zhang commented on HIVE-7946:
---

Thanks for the good work. I only briefly went over the patch. One thing that 
caught my eye was some code style issues. It would be nice to be consistent 
with existing code. Moreover, it'd be nice to have a review board link so that 
the review can be more effective.

> CBO: Merge CBO changes to Trunk
> ---
>
> Key: HIVE-7946
> URL: https://issues.apache.org/jira/browse/HIVE-7946
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Laljo John Pullokkaran
>Assignee: Laljo John Pullokkaran
> Attachments: HIVE-7946.1.patch, HIVE-7946.10.patch, 
> HIVE-7946.11.patch, HIVE-7946.12.patch, HIVE-7946.13.patch, 
> HIVE-7946.14.patch, HIVE-7946.2.patch, HIVE-7946.3.patch, HIVE-7946.4.patch, 
> HIVE-7946.5.patch, HIVE-7946.6.patch, HIVE-7946.7.patch, HIVE-7946.8.patch, 
> HIVE-7946.9.patch, HIVE-7946.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7842) Enable qtest load_dyn_part1.q [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143506#comment-14143506
 ] 

Xuefu Zhang commented on HIVE-7842:
---

Thanks, Venki. Patch looks good. +1 pending on test result.

> Enable qtest load_dyn_part1.q [Spark Branch]
> 
>
> Key: HIVE-7842
> URL: https://issues.apache.org/jira/browse/HIVE-7842
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: spark-branch
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-7842.1-spark.patch
>
>
> On spark branch, load_dyn_part1.q fails with following assertion. Looks like 
> SerDe is receiving invalid ByteWritable buffer.
> {code}
> java.lang.AssertionError
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:205)"
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:187)"
> "org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:186)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)"
> "scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)"
> "scala.collection.Iterator$class.foreach(Iterator.scala:727)"
> "scala.collection.AbstractIterator.foreach(Iterator.scala:1157)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)"
> "org.apache.spark.scheduler.Task.run(Task.scala:54)"
> "org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)"
> "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)"
> "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)"
> "java.lang.Thread.run(Thread.java:744)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7842) Enable qtest load_dyn_part1.q [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7842:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Venki for the contribution.

> Enable qtest load_dyn_part1.q [Spark Branch]
> 
>
> Key: HIVE-7842
> URL: https://issues.apache.org/jira/browse/HIVE-7842
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: spark-branch
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-7842.1-spark.patch
>
>
> On spark branch, load_dyn_part1.q fails with following assertion. Looks like 
> SerDe is receiving invalid ByteWritable buffer.
> {code}
> java.lang.AssertionError
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:205)"
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:187)"
> "org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:186)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)"
> "scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)"
> "scala.collection.Iterator$class.foreach(Iterator.scala:727)"
> "scala.collection.AbstractIterator.foreach(Iterator.scala:1157)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)"
> "org.apache.spark.scheduler.Task.run(Task.scala:54)"
> "org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)"
> "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)"
> "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)"
> "java.lang.Thread.run(Thread.java:744)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7100:
--
   Resolution: Fixed
Fix Version/s: 0.14.0
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks, David.

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.11.patch, HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, 
> HIVE-7100.5.patch, HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7100:
--
Component/s: Query Processor

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.11.patch, HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, 
> HIVE-7100.5.patch, HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7100) Users of hive should be able to specify skipTrash when dropping tables.

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-7100:
--
Labels: TODOC14  (was: )

> Users of hive should be able to specify skipTrash when dropping tables.
> ---
>
> Key: HIVE-7100
> URL: https://issues.apache.org/jira/browse/HIVE-7100
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Ravi Prakash
>Assignee: david serafini
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-7100.1.patch, HIVE-7100.10.patch, 
> HIVE-7100.11.patch, HIVE-7100.2.patch, HIVE-7100.3.patch, HIVE-7100.4.patch, 
> HIVE-7100.5.patch, HIVE-7100.8.patch, HIVE-7100.9.patch, HIVE-7100.patch
>
>
> Users of our clusters are often running up against their quota limits because 
> of Hive tables. When they drop tables, they have to then manually delete the 
> files from HDFS using skipTrash. This is cumbersome and unnecessary. We 
> should enable users to skipTrash directly when dropping tables.
> We should also be able to provide this functionality without polluting SQL 
> syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8219) Multi-Insert optimization, don't sink the source into a file [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8219:
-

 Summary: Multi-Insert optimization, don't sink the source into a 
file [Spark Branch]
 Key: HIVE-8219
 URL: https://issues.apache.org/jira/browse/HIVE-8219
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Xuefu Zhang


The current implementation splits the operator plan at the lowest common ancestor by 
inserting one FileSinkOperator and a list of TableScanOperators. Writing to a 
file (by the FS) is expensive. We should be able to insert a ReduceSinkOperator 
instead. The result RDD from the first job can be cached and referred to in 
subsequent Spark jobs.

This is a followup for HIVE-7503.
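
A rough sketch of the caching idea with Spark's Java API; the type parameters and the count() placeholders are assumptions standing in for the insert branches:
{code}
import org.apache.hadoop.io.BytesWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

public class CachedSplitPointSketch {
  // `shared` stands for the RDD produced up to the lowest common ancestor of the inserts.
  static void runInsertBranches(JavaPairRDD<BytesWritable, BytesWritable> shared) {
    // Keep the result of the first job around instead of writing it out via a FileSinkOperator.
    shared.persist(StorageLevel.MEMORY_AND_DISK());
    long branch1 = shared.count();   // placeholder for the first insert branch
    long branch2 = shared.count();   // placeholder for the second insert branch
    System.out.println(branch1 + " " + branch2);
    shared.unpersist();
  }
}
{code}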



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8219) Multi-Insert optimization, don't sink the source into a file [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8219:
--
Labels: Spark-M1  (was: )

> Multi-Insert optimization, don't sink the source into a file [Spark Branch]
> ---
>
> Key: HIVE-8219
> URL: https://issues.apache.org/jira/browse/HIVE-8219
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> The current implementation splits the operator plan at the lowest common ancestor 
> by inserting one FileSinkOperator and a list of TableScanOperators. Writing 
> to a file (by the FS) is expensive. We should be able to insert a 
> ReduceSinkOperator instead. The result RDD from the first job can be cached 
> and referred to in subsequent Spark jobs.
> This is a followup for HIVE-7503.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8219) Multi-Insert optimization, don't sink the source into a file [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8219:
--
Issue Type: Improvement  (was: Bug)

> Multi-Insert optimization, don't sink the source into a file [Spark Branch]
> ---
>
> Key: HIVE-8219
> URL: https://issues.apache.org/jira/browse/HIVE-8219
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> The current implementation splits the operator plan at the lowest common ancestor 
> by inserting one FileSinkOperator and a list of TableScanOperators. Writing 
> to a file (by the FS) is expensive. We should be able to insert a 
> ReduceSinkOperator instead. The result RDD from the first job can be cached 
> and referred to in subsequent Spark jobs.
> This is a followup for HIVE-7503.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8220) Refactor multi-insert code such that plan splitting and task generation are modular and reusable [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8220:
--
Description: 
This is a followup for HIVE-7503. Currently the code to split the operator tree 
and to generate tasks is mingled and thus hard to understand and maintain. 
Logically the two seem independent. This can be improved by modularizing both. 
The following might be helpful:
{code}
@Override
protected void generateTaskTree(List<Task<? extends Serializable>> rootTasks, ParseContext pCtx,
    List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs, Set<WriteEntity> outputs)
    throws SemanticException {
  // 1. Identify if the plan is for multi-insert and split the plan if necessary
  List<Set<Operator<?>>> operatorSets = multiInsertSplit(...);
  // 2. For each operator set, generate a task.
  for (Set<Operator<?>> topOps : operatorSets) {
    SparkTask task = generateTask(topOps);
    ...
  }
  // 3. wire up the tasks
  ...
}
{code}

  was:
This is a followup for HIVE-7503. Currently the code to split the operator tree 
and to generate tasks is mingled and thus hard to understand and maintain. 
Logically the two seem independent. This can be improved by modularizing both. 
The following might be helpful:
{code}
  @Override
  protected void generateTaskTree(List<Task<? extends Serializable>> rootTasks, ParseContext pCtx,
      List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs, Set<WriteEntity> outputs)
      throws SemanticException {
    // 1. Identify if the plan is for multi-insert and split the plan if necessary
    List<Set<Operator<?>>> operatorSets = multiInsertSplit(...);
    // 2. For each operator set, generate a task.
    for (Set<Operator<?>> topOps : operatorSets) {
      SparkTask task = generateTask(topOps);
      ...
    }
    // 3. wire up the tasks
    ...
  }
{code}


> Refactor multi-insert code such that plan splitting and task generation are 
> modular and reusable [Spark Branch]
> ---
>
> Key: HIVE-8220
> URL: https://issues.apache.org/jira/browse/HIVE-8220
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>  Labels: Spark-M1
>
> This is a followup for HIVE-7503. Currently the code to split the operator 
> tree and to generate tasks is mingled and thus hard to understand and 
> maintain. Logically the two seem independent. This can be improved by 
> modularizing both. The following might be helpful:
> {code}
> @Override
> protected void generateTaskTree(List<Task<? extends Serializable>> rootTasks, ParseContext pCtx,
>     List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs, Set<WriteEntity> outputs)
>     throws SemanticException {
>   // 1. Identify if the plan is for multi-insert and split the plan if necessary
>   List<Set<Operator<?>>> operatorSets = multiInsertSplit(...);
>   // 2. For each operator set, generate a task.
>   for (Set<Operator<?>> topOps : operatorSets) {
>     SparkTask task = generateTask(topOps);
>     ...
>   }
>   // 3. wire up the tasks
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8220) Refactor multi-insert code such that plan splitting and task generation are modular and reusable [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-8220:
-

 Summary: Refactor multi-insert code such that plan splitting and 
task generation are modular and reusable [Spark Branch]
 Key: HIVE-8220
 URL: https://issues.apache.org/jira/browse/HIVE-8220
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Reporter: Xuefu Zhang


This is a followup for HIVE-7503. Currently the code to split the operator tree 
and to generate tasks is mingled and thus hard to understand and maintain. 
Logically the two seem independent. This can be improved by modularizing both. 
The following might be helpful:
{code}
  @Override
  protected void generateTaskTree(List<Task<? extends Serializable>> rootTasks, ParseContext pCtx,
      List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs, Set<WriteEntity> outputs)
      throws SemanticException {
    // 1. Identify if the plan is for multi-insert and split the plan if necessary
    List<Set<Operator<?>>> operatorSets = multiInsertSplit(...);
    // 2. For each operator set, generate a task.
    for (Set<Operator<?>> topOps : operatorSets) {
      SparkTask task = generateTask(topOps);
      ...
    }
    // 3. wire up the tasks
    ...
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7842) Enable qtest load_dyn_part1.q [Spark Branch]

2014-09-22 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144071#comment-14144071
 ] 

Xuefu Zhang commented on HIVE-7842:
---

Thanks for pointing it out, [~vkorukanti]. I just added that file.

> Enable qtest load_dyn_part1.q [Spark Branch]
> 
>
> Key: HIVE-7842
> URL: https://issues.apache.org/jira/browse/HIVE-7842
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Affects Versions: spark-branch
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>  Labels: Spark-M1
> Fix For: spark-branch
>
> Attachments: HIVE-7842.1-spark.patch
>
>
> On spark branch, load_dyn_part1.q fails with following assertion. Looks like 
> SerDe is receiving invalid ByteWritable buffer.
> {code}
> java.lang.AssertionError
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:205)"
> "org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:187)"
> "org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:186)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)"
> "org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)"
> "scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)"
> "scala.collection.Iterator$class.foreach(Iterator.scala:727)"
> "scala.collection.AbstractIterator.foreach(Iterator.scala:1157)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)"
> "org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)"
> "org.apache.spark.scheduler.Task.run(Task.scala:54)"
> "org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)"
> "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)"
> "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)"
> "java.lang.Thread.run(Thread.java:744)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8224) Support Char, Varchar in AvroSerDe

2014-09-22 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144170#comment-14144170
 ] 

Xuefu Zhang commented on HIVE-8224:
---

Patch looks good to me. +1

> Support Char, Varchar in AvroSerDe
> --
>
> Key: HIVE-8224
> URL: https://issues.apache.org/jira/browse/HIVE-8224
> Project: Hive
>  Issue Type: Task
>  Components: Serializers/Deserializers
>Reporter: Mohit Sabharwal
>Assignee: Mohit Sabharwal
>  Labels: Avro
> Attachments: HIVE-8224.patch
>
>
> Both Char and Varchar are represented as the String primitive type in Avro. 
> Char is persisted without padding, if any.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8205) Using strings in group type fails in ParquetSerDe

2014-09-23 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8205:
--
Attachment: HIVE-8205.1.patch

> Using strings in group type fails in ParquetSerDe
> -
>
> Key: HIVE-8205
> URL: https://issues.apache.org/jira/browse/HIVE-8205
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Mohit Sabharwal
>Assignee: Mohit Sabharwal
>  Labels: parquet
> Attachments: HIVE-8205.1.patch, HIVE-8205.1.patch, HIVE-8205.patch
>
>
> In HIVE-7735, schema info was plumbed to ETypeConverter to disambiguate 
> between hive Char, Varchar and String types, which are all represented as 
> PrimitiveType "binary" and OriginalType "utf8" in parquet.
> However, this does not work for parquet nested types (that map to hive Array, 
> Map, etc.) containing these values, because schema lookup for nested values 
> was not implemented.  It's also non-trivial to do that in the current parquet 
> serde implementation. Instead of plumbing in the schema, we should convert 
> these types to the same Text writeable and let the object inspectors handle 
> the final conversion.
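
The idea in the last sentence, as a minimal sketch (the method is hypothetical; only the uniform use of Text is the point):
{code}
import org.apache.hadoop.io.Text;

public class Utf8BinarySketch {
  // Hypothetical converter step: every parquet "binary (utf8)" value becomes a Text,
  // regardless of whether the Hive column is string, char(n) or varchar(n).
  static Text toWritable(String utf8Value) {
    return new Text(utf8Value);
  }

  public static void main(String[] args) {
    // The object inspector that knows the declared Hive type performs the final conversion later.
    System.out.println(toWritable("abc"));   // prints: abc
  }
}
{code}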



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8205) Using strings in group type fails in ParquetSerDe

2014-09-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144945#comment-14144945
 ] 

Xuefu Zhang commented on HIVE-8205:
---

Patch looks good to me. Re-uploaded the patch to trigger another test run.

> Using strings in group type fails in ParquetSerDe
> -
>
> Key: HIVE-8205
> URL: https://issues.apache.org/jira/browse/HIVE-8205
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Mohit Sabharwal
>Assignee: Mohit Sabharwal
>  Labels: parquet
> Attachments: HIVE-8205.1.patch, HIVE-8205.1.patch, HIVE-8205.patch
>
>
> In HIVE-7735, schema info was plumbed to ETypeConverter to disambiguate 
> between hive Char, Varchar and String types, which are all represented as 
> PrimitiveType "binary" and OriginalType "utf8" in parquet.
> However, this does not work for parquet nested types (that map to hive Array, 
> Map, etc.) containing these values, because schema lookup for nested values 
> was not implemented.  It's also non-trivial to do that in the current parquet 
> serde implementation. Instead of plumbing in the schema, we should convert 
> these types to the same Text writeable and let the object inspectors handle 
> the final conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8205) Using strings in group type fails in ParquetSerDe

2014-09-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145548#comment-14145548
 ] 

Xuefu Zhang commented on HIVE-8205:
---

+1

> Using strings in group type fails in ParquetSerDe
> -
>
> Key: HIVE-8205
> URL: https://issues.apache.org/jira/browse/HIVE-8205
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Mohit Sabharwal
>Assignee: Mohit Sabharwal
>  Labels: parquet
> Attachments: HIVE-8205.1.patch, HIVE-8205.1.patch, HIVE-8205.patch
>
>
> In HIVE-7735, schema info was plumbed to ETypeConverter to disambiguate 
> between hive Char, Varchar and String types, which are all represented as 
> PrimitiveType "binary" and OriginalType "utf8" in parquet.
> However, this does not work for parquet nested types (that map to hive Array, 
> Map, etc.) containing these values, because schema lookup for nested values 
> was not implemented.  It's also non-trivial to do that in the current parquet 
> serde implementation. Instead of plumbing in the schema, we should convert 
> these types to the same Text writeable and let the object inspectors handle 
> the final conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8207) Add .q tests for multi-table insertion [Spark Branch]

2014-09-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145814#comment-14145814
 ] 

Xuefu Zhang commented on HIVE-8207:
---

+1

> Add .q tests for multi-table insertion [Spark Branch]
> -
>
> Key: HIVE-8207
> URL: https://issues.apache.org/jira/browse/HIVE-8207
> Project: Hive
>  Issue Type: Test
>  Components: Spark
>Reporter: Chao
>Assignee: Chao
> Attachments: HIVE-8207.1-spark.patch, HIVE-8207.2-spark.patch, 
> HIVE-8207.3-spark.patch
>
>
> Now that multi-table insertion is committed to branch, we should enable those 
> related qtests.
> Here is a list of qfiles that should be activated (some of them may already 
> be activated).
> The list may not be comprehensive.
> {noformat}
> add_part_multiple.q
> auto_smb_mapjoin_14.q
> bucket5.q
> column_access_stats.q
> date_udf.q
> groupby10.q
> groupby11.q
> groupby3_map_multi_distinct.q
> groupby3_map.q
> groupby3_map_skew.q
> groupby3_noskew_multi_distinct.q
> groupby3_noskew.q
> groupby7_map_multi_single_reducer.q
> groupby7_map.q
> groupby7_map_skew.q
> groupby7_noskew_multi_single_reducer.q
> groupby7_noskew.q
> groupby7.q
> groupby8_map.q
> groupby8_map_skew.q
> groupby8_noskew.q
> groupby8.q
> groupby9.q
> groupby_complex_types_multi_single_reducer.q
> groupby_complex_types.q
> groupby_cube1.q
> groupby_map_ppr_multi_distinct.q
> groupby_map_ppr.q
> groupby_multi_insert_common_distinct.q
> groupby_multi_single_reducer2.q
> groupby_multi_single_reducer3.q
> groupby_multi_single_reducer.q
> groupby_position.q
> groupby_ppr.q
> groupby_rollup1.q
> groupby_sort_1_23.q
> groupby_sort_1.q
> groupby_sort_skew_1_23.q
> infer_bucket_sort_multi_insert.q
> innerjoin.q
> input12_hadoop20.q
> input12.q
> input13.q
> input14.q
> input17.q
> input18.q
> input1_limit.q
> input_part2.q
> insert_into3.q
> join_nullsafe.q
> load_dyn_part8.q
> metadata_only_queries_with_filters.q
> multigroupby_singlemr.q
> multi_insert_gby2.q
> multi_insert_gby3.q
> multi_insert_gby.q
> multi_insert_lateral_view.q
> multi_insert_move_tasks_share_dependencies.q
> multi_insert.q
> parallel.q
> partition_date2.q
> pcr.q
> ppd_multi_insert.q
> ppd_transform.q
> smb_mapjoin_11.q
> smb_mapjoin_12.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> smb_mapjoin_16.q
> stats4.q
> subquery_multiinsert.q
> table_access_keys_stats.q
> tez_dml.q
> udaf_percentile_approx_20.q
> udaf_percentile_approx_23.q
> union17.q
> union18.q
> union19.q
> {noformat}
>   
> There are some tests that cannot be enabled right now, due to various reasons:
> 1. ForwardOperator Issue, including
> {noformat}
> groupby7_noskew_multi_single_reducer.q
> groupby8_map.q
> groupby8_map_skew.q
> groupby8_noskew.q
> groupby8.q
> groupby9.q
> groupby10.q
> groupby_multi_insert_common_distinct.q 
> union17.q
> {noformat}
> *Reason*: currently, if the node to break in the operator tree is a 
> ForwardOperator, we simply do nothing. However, we may have the following 
> case:
> {noformat}
>   ...
>   RS_0
>|
>   FOR
>|
>  /   \
>GBY_1  GBY_2
> | |
>...   ...
> | |
>RS_1  RS_2
> | |
>...   ...
> | |
>FS_1  FS_2
> {noformat}
> which may result in:
> {noformat}
>   RW
>  /  \
>RWRW
> {noformat}
> and because of the issue in HIVE-7731 and HIVE-8118, both downstream branches 
> will get duplicated (and same) inputs.
> 2. Stats issue, including:
> {noformat}
> bucket5.q
> infer_bucket_sort_multi_insert.q
> stats4.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> {noformat}
> *Reason*: In these tests, I get diff error because {{numRows}} and 
> {{rawDataSize}} are -1, but they are expected to be some positive value. I 
> don't think this is related to multi-insertion.
> 3. Join/SMB Join Issue, including
> {noformat}
> auto_smb_mapjoin_14.q
> auto_sortmerge_join_13.q
> smb_mapjoin_11.q
> smb_mapjoin_12.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> smb_mapjoin_16.q
> {noformat}
> *Reason*: These tests either failed with exception or failed with diff. I 
> think it's because SMB Join (HIVE-8202) isn't supported right now.
> 4. Result doesn't match, including
> {noformat}
> groupby3_map_skew.q
> groupby_map_ppr_multi_distinct.q
> groupby_complex_types_multi_single_reducer.q
> groupby_map_ppr.q
> partition_date2.q
> udaf_percentile_approx_23.q
> {noformat}
> *Reason*: The results from these tests are different from MR's. For instance, 
> test for groupby3_map_skew.q failed because:
> {noformat}
> < 130091.0  260.182 256.10355987055016  98.00.0 
> 142.92680950752379  143.06995106518903  20428.07288 20469.0109
> ---
> > 130091.0  260.182 256.10355987055016  98.00.0 
> > 142.92680950

[jira] [Updated] (HIVE-8207) Add .q tests for multi-table insertion [Spark Branch]

2014-09-23 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8207:
--
   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Patch committed to Spark branch. Thanks to Chao for the contribution.

> Add .q tests for multi-table insertion [Spark Branch]
> -
>
> Key: HIVE-8207
> URL: https://issues.apache.org/jira/browse/HIVE-8207
> Project: Hive
>  Issue Type: Test
>  Components: Spark
>Reporter: Chao
>Assignee: Chao
> Fix For: spark-branch
>
> Attachments: HIVE-8207.1-spark.patch, HIVE-8207.2-spark.patch, 
> HIVE-8207.3-spark.patch
>
>
> Now that multi-table insertion is committed to branch, we should enable those 
> related qtests.
> Here is a list of qfiles that should be activated (some of them may already 
> be activated).
> The list may not be comprehensive.
> {noformat}
> add_part_multiple.q
> auto_smb_mapjoin_14.q
> bucket5.q
> column_access_stats.q
> date_udf.q
> groupby10.q
> groupby11.q
> groupby3_map_multi_distinct.q
> groupby3_map.q
> groupby3_map_skew.q
> groupby3_noskew_multi_distinct.q
> groupby3_noskew.q
> groupby7_map_multi_single_reducer.q
> groupby7_map.q
> groupby7_map_skew.q
> groupby7_noskew_multi_single_reducer.q
> groupby7_noskew.q
> groupby7.q
> groupby8_map.q
> groupby8_map_skew.q
> groupby8_noskew.q
> groupby8.q
> groupby9.q
> groupby_complex_types_multi_single_reducer.q
> groupby_complex_types.q
> groupby_cube1.q
> groupby_map_ppr_multi_distinct.q
> groupby_map_ppr.q
> groupby_multi_insert_common_distinct.q
> groupby_multi_single_reducer2.q
> groupby_multi_single_reducer3.q
> groupby_multi_single_reducer.q
> groupby_position.q
> groupby_ppr.q
> groupby_rollup1.q
> groupby_sort_1_23.q
> groupby_sort_1.q
> groupby_sort_skew_1_23.q
> infer_bucket_sort_multi_insert.q
> innerjoin.q
> input12_hadoop20.q
> input12.q
> input13.q
> input14.q
> input17.q
> input18.q
> input1_limit.q
> input_part2.q
> insert_into3.q
> join_nullsafe.q
> load_dyn_part8.q
> metadata_only_queries_with_filters.q
> multigroupby_singlemr.q
> multi_insert_gby2.q
> multi_insert_gby3.q
> multi_insert_gby.q
> multi_insert_lateral_view.q
> multi_insert_move_tasks_share_dependencies.q
> multi_insert.q
> parallel.q
> partition_date2.q
> pcr.q
> ppd_multi_insert.q
> ppd_transform.q
> smb_mapjoin_11.q
> smb_mapjoin_12.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> smb_mapjoin_16.q
> stats4.q
> subquery_multiinsert.q
> table_access_keys_stats.q
> tez_dml.q
> udaf_percentile_approx_20.q
> udaf_percentile_approx_23.q
> union17.q
> union18.q
> union19.q
> {noformat}
>   
> There are some tests that cannot be enabled right now, due to various reasons:
> 1. ForwardOperator Issue, including
> {noformat}
> groupby7_noskew_multi_single_reducer.q
> groupby8_map.q
> groupby8_map_skew.q
> groupby8_noskew.q
> groupby8.q
> groupby9.q
> groupby10.q
> groupby_multi_insert_common_distinct.q 
> union17.q
> {noformat}
> *Reason*: currently, if the node to break in the operator tree is a 
> ForwardOperator, we simply do nothing. However, we may have the following 
> case:
> {noformat}
>   ...
>   RS_0
>|
>   FOR
>|
>  /   \
>GBY_1  GBY_2
> | |
>...   ...
> | |
>RS_1  RS_2
> | |
>...   ...
> | |
>FS_1  FS_2
> {noformat}
> which may result in:
> {noformat}
>   RW
>  /  \
>RWRW
> {noformat}
> and because of the issue in HIVE-7731 and HIVE-8118, both downstream branches 
> will get duplicated (and same) inputs.
> 2. Stats issue, including:
> {noformat}
> bucket5.q
> infer_bucket_sort_multi_insert.q
> stats4.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> {noformat}
> *Reason*: In these tests, I get diff error because {{numRows}} and 
> {{rawDataSize}} are -1, but they are expected to be some positive value. I 
> don't think this is related to multi-insertion.
> 3. Join/SMB Join Issue, including
> {noformat}
> auto_smb_mapjoin_14.q
> auto_sortmerge_join_13.q
> smb_mapjoin_11.q
> smb_mapjoin_12.q
> smb_mapjoin_13.q
> smb_mapjoin_15.q
> smb_mapjoin_16.q
> {noformat}
> *Reason*: These tests either failed with exception or failed with diff. I 
> think it's because SMB Join (HIVE-8202) isn't supported right now.
> 4. Result doesn't match, including
> {noformat}
> groupby3_map_skew.q
> groupby_map_ppr_multi_distinct.q
> groupby_complex_types_multi_single_reducer.q
> groupby_map_ppr.q
> partition_date2.q
> udaf_percentile_approx_23.q
> {noformat}
> *Reason*: The results from these tests are different from MR's. For instance, 
> test for groupby3_map_skew.q failed because:
> {noformat}
> < 130091.0  260.182 256.10355987055016  98.00.0 
> 142.926

[jira] [Updated] (HIVE-8224) Support Char, Varchar in AvroSerDe

2014-09-24 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-8224:
--
   Resolution: Fixed
Fix Version/s: 0.14.0
 Release Note: To document support of char/varchar for avro.
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks to Mohit for the contribution.

> Support Char, Varchar in AvroSerDe
> --
>
> Key: HIVE-8224
> URL: https://issues.apache.org/jira/browse/HIVE-8224
> Project: Hive
>  Issue Type: Task
>  Components: Serializers/Deserializers
>Reporter: Mohit Sabharwal
>Assignee: Mohit Sabharwal
>  Labels: Avro
> Fix For: 0.14.0
>
> Attachments: HIVE-8224.1.patch, HIVE-8224.patch
>
>
> Both Char and Varchar are represented as the String primitive type in Avro. 
> Char is persisted without padding, if any.
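
A small sketch of what the padding note means, using Hive's HiveChar helper; the AvroSerDe wiring itself is not shown and the usage here is only illustrative:
{code}
import org.apache.hadoop.hive.common.type.HiveChar;

public class AvroCharSketch {
  public static void main(String[] args) {
    // A char(10) value is space-padded inside Hive but written to Avro as a plain string.
    HiveChar c = new HiveChar("ab", 10);
    String storedInAvro = c.getStrippedValue();           // "ab" -> what the Avro record carries
    HiveChar readBack = new HiveChar(storedInAvro, 10);   // declared length re-applied on read
    System.out.println("[" + readBack.getValue() + "]");  // padded back out to 10 characters
  }
}
{code}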



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

