[jira] [Created] (HIVE-17049) hive doesn't support chinese comments for columns

2017-07-05 Thread liugaopeng (JIRA)
liugaopeng created HIVE-17049:
-

 Summary: hive doesn't support chinese comments for columns
 Key: HIVE-17049
 URL: https://issues.apache.org/jira/browse/HIVE-17049
 Project: Hive
  Issue Type: Bug
  Components: CLI
Affects Versions: 1.2.1
 Environment: hive 1.2.1 in HDP 
Reporter: liugaopeng


1.  alter table stg.test_chinese change chinesetitle chinesetitle tinyint 
comment '中文';
2.  desc stg.test_chinese;
Result: the Chinese comment "中文" becomes "??".

Also, if I modify the comment via the Hive view, it still displays the garbled 
text "??".

I did some testing but could not fix it. I tried:
1. Changing hive.COLUMNS_V2 to the UTF-8 charset.
2. Appending characterEncoding=UTF-8 to the Hive-to-MySQL metadata URL.

I found suggestions to apply patches to fix this, but they all seem to target 
0.x versions, and I use version 1.2.1.

Please give some guidance.
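
A "??" result of this kind is what a lossy charset conversion typically produces, although the report does not establish where along the path it happens. As a minimal Java sketch (the class and method names are hypothetical, and this is not Hive or metastore code), encoding the comment with a single-byte charset such as ISO-8859-1 replaces each unmappable character with "?", while UTF-8 round-trips cleanly:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CommentEncodingDemo {
    // Encode the text with the given charset and decode it again, mimicking a
    // store/load cycle through a column or connection that uses that charset.
    static String roundTrip(String text, Charset charset) {
        byte[] stored = text.getBytes(charset); // unmappable characters become '?'
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        String comment = "\u4e2d\u6587"; // the two-character comment from the report
        System.out.println(roundTrip(comment, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(comment, StandardCharsets.UTF_8));
    }
}
```

If this is the cause, every hop (table and column charset, connection settings, client) would need to agree on UTF-8; which hop is at fault here is not known from the report.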



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[Announce] New committer: Teddy Choi

2017-07-05 Thread Ashutosh Chauhan
The Project Management Committee (PMC) for Apache Hive has invited Teddy
Choi to become a committer and we are pleased to announce that he has
accepted.

Welcome, Teddy!

Thanks,
Ashutosh


[Announce] New committer: Peter Vary

2017-07-05 Thread Ashutosh Chauhan
The Project Management Committee (PMC) for Apache Hive has invited Peter
Vary to become a committer and we are pleased to announce that he has
accepted.

Welcome, Peter!

Thanks,
Ashutosh


[Announce] New committer: Vihang Karajgaonkar

2017-07-05 Thread Ashutosh Chauhan
The Project Management Committee (PMC) for Apache Hive has invited Vihang
Karajgaonkar to become a committer and we are pleased to announce that he
has accepted.

Welcome, Vihang!

Thanks,
Ashutosh


[Announce] New committer: Sahil Takiar

2017-07-05 Thread Ashutosh Chauhan
The Project Management Committee (PMC) for Apache Hive has invited Sahil
Takiar to become a committer and we are pleased to announce that he has
accepted.

Welcome, Sahil!

Thanks,
Ashutosh


[Announce] New committer: Deepesh Khandelwal

2017-07-05 Thread Ashutosh Chauhan
The Project Management Committee (PMC) for Apache Hive has invited Deepesh
Khandelwal to become a committer and we are pleased to announce that he has
accepted.

Welcome, Deepesh!

Thanks,
Ashutosh


[jira] [Created] (HIVE-17048) Pass HiveOperation info to HiveSemanticAnalyzerHook through HiveSemanticAnalyzerHookContext

2017-07-05 Thread Aihua Xu (JIRA)
Aihua Xu created HIVE-17048:
---

 Summary: Pass HiveOperation info to HiveSemanticAnalyzerHook 
through HiveSemanticAnalyzerHookContext
 Key: HIVE-17048
 URL: https://issues.apache.org/jira/browse/HIVE-17048
 Project: Hive
  Issue Type: Improvement
  Components: Hooks
Affects Versions: 2.1.1
Reporter: Aihua Xu
Assignee: Aihua Xu


Currently Hive passes the following info to HiveSemanticAnalyzerHook through 
HiveSemanticAnalyzerHookContext (see 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java#L553).
But the operation type (HiveOperation) is also needed in some cases, e.g., 
when integrating with Sentry.

{noformat}
hookCtx.setConf(conf);
hookCtx.setUserName(userName);
hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
hookCtx.setCommand(command);
{noformat}
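
To illustrate the request, here is a minimal, self-contained sketch of a context that also carries the operation type. Everything in it is hypothetical: the `Context` class is a simplified stand-in rather than the real HiveSemanticAnalyzerHookContext, `setOperation` is only an assumed name for the proposed setter, and the enum values are illustrative.

```java
final class HookContextSketch {
    // Hypothetical stand-in for the operation type; values are illustrative only.
    enum Operation { QUERY, CREATETABLE, DROPTABLE }

    // Hypothetical, simplified context; NOT the real HiveSemanticAnalyzerHookContext.
    static final class Context {
        private String userName;
        private String command;
        private Operation operation;

        void setUserName(String userName) { this.userName = userName; }
        void setCommand(String command) { this.command = command; }
        // Assumed name for the proposed addition.
        void setOperation(Operation operation) { this.operation = operation; }

        String getUserName() { return userName; }
        String getCommand() { return command; }
        Operation getOperation() { return operation; }
    }

    // Example hook logic: decide from the operation type instead of re-parsing the command.
    static boolean isDestructive(Context ctx) {
        return ctx.getOperation() == Operation.DROPTABLE;
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        ctx.setUserName("alice");
        ctx.setCommand("DROP TABLE t");
        ctx.setOperation(Operation.DROPTABLE);
        System.out.println(isDestructive(ctx));
    }
}
```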





[jira] [Created] (HIVE-17047) Allow table property to be populated to jobConf to make FixedLengthInputFormat work

2017-07-05 Thread Zhiyuan Yang (JIRA)
Zhiyuan Yang created HIVE-17047:
---

 Summary: Allow table property to be populated to jobConf to make 
FixedLengthInputFormat work
 Key: HIVE-17047
 URL: https://issues.apache.org/jira/browse/HIVE-17047
 Project: Hive
  Issue Type: Bug
Reporter: Zhiyuan Yang
Assignee: Zhiyuan Yang
 Fix For: 1.2.1


To make FixedLengthInputFormat work in Hive, we need a table-specific value for 
the configuration "fixedlengthinputformat.record.length". Right now the best 
place for it would be a table property. Unfortunately, table properties are not 
always populated to InputFormat configurations because of this in HiveInputFormat:
{code}
PartitionDesc part = pathToPartitionInfo.get(hsplit.getPath().toString());
if ((part != null) && (part.getTableDesc() != null)) {
{code}
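
The intended effect can be sketched independently of Hive as a merge of table properties into the job configuration. This is only an illustration under stated assumptions: plain `Map` and `Properties` objects stand in for the job configuration and the table descriptor, the class and method names are hypothetical, and letting explicitly set job values win over table properties is a design assumption, not something the ticket specifies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class TablePropsToConfSketch {
    // Returns a copy of the job configuration with the table properties added.
    // Keys already present in the job configuration are left untouched (assumption).
    static Map<String, String> withTableProperties(Map<String, String> jobConf,
                                                   Properties tableProps) {
        Map<String, String> merged = new HashMap<>(jobConf);
        for (String key : tableProps.stringPropertyNames()) {
            merged.putIfAbsent(key, tableProps.getProperty(key));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        Properties tableProps = new Properties();
        tableProps.setProperty("fixedlengthinputformat.record.length", "128");
        System.out.println(withTableProperties(jobConf, tableProps));
    }
}
```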





[jira] [Created] (HIVE-17046) Flaky test: TestCliDriver[ppd_windowing2]

2017-07-05 Thread Janaki Lahorani (JIRA)
Janaki Lahorani created HIVE-17046:
--

 Summary: Flaky test: TestCliDriver[ppd_windowing2]
 Key: HIVE-17046
 URL: https://issues.apache.org/jira/browse/HIVE-17046
 Project: Hive
  Issue Type: Sub-task
Reporter: Janaki Lahorani








Re: [DISCUSS] Separating out the metastore as its own TLP

2017-07-05 Thread Xuefu Zhang
I think Edward's concern is valid. While I voiced my support for this
proposal, that was more because of the benefits for the whole Hadoop
ecosystem; I don't see equal benefits for Hive. Instead, it may even create
more overhead for Hive. I'd really like to take time to see what the
roadblocks are for other projects to use HMS as it is. The issue of Spark
including a Hive fork, which was brought up some time back, is certainly not
one of them.

Thanks,
Xuefu


Re: [DISCUSS] Separating out the metastore as its own TLP

2017-07-05 Thread Edward Capriolo
On Wed, Jul 5, 2017 at 1:51 PM, Alan Gates  wrote:



"You seem to be asserting that doing this doesn’t really help non-Hive based
systems that are using or would like to use the metastore.  But it is
interesting that people from three of those systems have commented in the
thread so far, and all are positive (Dmitrias from Impala, Dain from
Presto, and Sriharsha from the schema registry project)."

I notice that Impala has a syntax for caching.

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html

Notice how the cache syntax did not make its way into Hive? It would make sense if
this feature 

Re: Review Request 60589: HIVE-17001: Insert overwrite table doesn't clean partition directory on HDFS if partition is missing from HMS

2017-07-05 Thread Vihang Karajgaonkar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60589/#review179680
---




ql/src/test/queries/clientpositive/insert_overwrite_table.q
Lines 1-10 (patched)


    I don't understand this test case completely. The table is defined as 
external, so it is expected that the drop partition will not delete the HDFS 
file. The DFS operation is performed without the knowledge of Hive, so when it 
returns 2 rows instead of 1, isn't that the expected behavior?

    I think the right way to solve this problem is to throw an exception when 
we do an insert overwrite on an external table. Just as the truncate table 
command doesn't work on an external table, I think insert overwrite should also 
fail on an external table. The behavior of external tables is inconsistent in my 
opinion: we allow them to be overwritten but not truncated.

    When the table is a managed table, the test works as expected since Hive 
cleans up the directory after the drop partition command.


- Vihang Karajgaonkar


On July 3, 2017, 9:05 a.m., Barna Zsombor Klara wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60589/
> ---
> 
> (Updated July 3, 2017, 9:05 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-17001: Insert overwrite table doesn't clean partition directory on HDFS 
> if partition is missing from HMS
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 
> 73710a7c2917b5268f788f22baaee2d87846961b 
>   ql/src/test/queries/clientpositive/insert_overwrite_table.q PRE-CREATION 
>   ql/src/test/results/clientpositive/insert_overwrite_table.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/60589/diff/1/
> 
> 
> Testing
> ---
> 
> Manual testing and qtests.
> 
> 
> Thanks,
> 
> Barna Zsombor Klara
> 
>



Re: Review Request 60445: HIVE-16935: Hive should strip comments from input before choosing which CommandProcessor to run.

2017-07-05 Thread Andrew Sherman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60445/
---

(Updated July 5, 2017, 6:09 p.m.)


Review request for hive and Sahil Takiar.


Changes
---

Use Guava's Splitter


Bugs: HIVE-16935
https://issues.apache.org/jira/browse/HIVE-16935


Repository: hive-git


Description
---

We strip SQL comments from a command string. The stripped command is used to 
determine which CommandProcessor will execute the command. If the 
CommandProcessorFactory does not select a special CommandProcessor, then we 
execute the original unstripped command so that the SQL parser can remove 
comments.
Move BeeLine's comment stripping code to HiveStringUtils and change BeeLine to 
call it from there.
Add a better test with separate tokens for "set role" in 
TestCommandProcessorFactory.
Add a test case for comment removal in set_processor_namespaces.q using an 
indented comment, as unindented comments are removed by the test driver.
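
The idea in the description can be sketched as follows. This is a simplified illustration, not the patch itself: the class and method names are hypothetical, only whole-line `--` comments are handled, and `--` inside string literals or after a statement on the same line is deliberately ignored.

```java
import java.util.Locale;

final class CommentStripSketch {
    // Drop every line whose first non-blank characters are "--".
    static String stripLineComments(String command) {
        StringBuilder sb = new StringBuilder();
        for (String line : command.split("\n")) {
            if (line.trim().startsWith("--")) {
                continue; // whole-line comment, possibly indented
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(line);
        }
        return sb.toString();
    }

    // First token of the stripped command: what a processor lookup would key on.
    static String firstToken(String command) {
        String stripped = stripLineComments(command).trim();
        if (stripped.isEmpty()) {
            return "";
        }
        return stripped.split("\\s+")[0].toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String command = "  -- switch role first\nSET role admin";
        System.out.println(firstToken(command));
    }
}
```

Only the stripped copy is used for the lookup; the original command string would still be handed to the SQL parser unchanged, as the description states.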

Change-Id: I166dc1e7588ec9802ba373d88e69e716aecd33c2


Diffs (updated)
-

  beeline/src/java/org/apache/hive/beeline/Commands.java 
3b2d72ed79771e6198e62c47060a7f80665dbcb2 
  beeline/src/test/org/apache/hive/beeline/TestCommands.java 
04c939a04c7a56768286743c2bb9c9797507e3aa 
  cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java 
27fd66d35ea89b0de0d17763625fbf564584fcca 
  common/src/java/org/apache/hive/common/util/HiveStringUtils.java 
4a6413a7c376ffb4de6d20d24707ac5bf89ebc0c 
  common/src/test/org/apache/hive/common/util/TestHiveStringUtils.java 
6bd7037152c6f809daec8af42708693c05fe00cf 
  
  ql/src/test/org/apache/hadoop/hive/ql/processors/TestCommandProcessorFactory.java 
21bdcf44436a02b11f878fa439e916d4b55ac63d 
  ql/src/test/queries/clientpositive/set_processor_namespaces.q 
612807f0c871b1881446d088e1c2c399d1afe970 
  ql/src/test/results/clientpositive/set_processor_namespaces.q.out 
c05ce4d61d00a9ee6671d97f2fd178f18d44cc8c 
  
  service/src/java/org/apache/hive/service/cli/operation/ExecuteStatementOperation.java 
2dd90b69b3bf789b1a3928129cf801b17884033f 


Diff: https://reviews.apache.org/r/60445/diff/4/

Changes: https://reviews.apache.org/r/60445/diff/3-4/


Testing
---

Added new test case.
Hand tested with Hue and Jdbc.


Thanks,

Andrew Sherman



Re: [DISCUSS] Separating out the metastore as its own TLP

2017-07-05 Thread Alan Gates
On Mon, Jul 3, 2017 at 10:17 AM, Dain Sundstrom  wrote:

> +1
>
> I work on Presto and I think this is the right direction for our users.  We
> have several users running Presto without Hive and anything we can do to
> help simplify the Metastore experience would be a good help.
>
> When I read proposals like this, one thing I like to see is a vision
> (scope) for the project.  In this case, I’d like to understand if the plan
> is to limit the scope of the system to what Hive can support.  For example,
> the system will clearly support schemas (databases) with tables and views
> as defined by Hive, but will there be support for additional types like a
> Presto view, which is incompatible with a Hive view due to the language
> differences?  Currently, in Presto we create a Hive view to reserve a spot
> in the "tables namespace", and then we put our view data in table
> properties.  I would like to formalize this kind of system, so if a Hive
> user queries a Presto view, they get a proper error message. I have similar
> concerns about data types, compression, and data organization (e.g.,
> different bucketing strategies).
>

We tried to lay out the scope in the wiki page [1]. Details will need to be
worked out by the new project.  But I’ll give you my view on it.  I don’t
see the value of breaking this out of Hive if it isn’t willing to take
non-Hive features.  If it’s still Hive-only in its focus, why pay the cost
of having separate projects?  So, as long as Presto style views don’t break
Hive style views or make the system horribly complicated and someone is
willing to add them, +1.

A related area that we will need to work out is the metastore connection to
the Hive physical layout.  Today, when a user says “create table”, the
metastore creates a directory in HDFS.  This ties the metastore to a Hive
style data layout.  How should that be handled going forward?  We could
assert that having a standard data layout is good, and all users of this
metadata system should use this layout.  We could make the physical
operations pluggable, providing the Hive style operations as an option, but
allowing users to bring others. We could completely remove the physical
operations, leave them all in Hive, and say that any system using this
should do their own physical operations.  I don't like the last option
because it makes it hard to share data across tools, but I can think of pro
and con arguments for the first two.


> Another aspect of this is what the vision is for the specification of the
> Metastore.  Is the vision to have a very open end-user extensible design
> (e.g., just a name and a bag of properties), or is the vision to have a
> project-specified common set of properties with “rules” for proper extension?
>

Again, just my opinion, but I would say the latter.  The utility of a name
and a bag of properties turns out to be pretty limited and pretty easy to
implement if that’s all you want.  The current metastore can do a lot more
than that.


>
> I would also be very interested in documentation for the Metastore APIs
> (and can help). We currently reverse engineer proper metastore interaction
> by reading the Hive code, and writing a lot of experimental programs, and I
> would really just like to know the "right way".  Also, we end up missing
> out on new features in the Metastore due to the work required to understand
> how they work.
>

+1 to better documentation regardless of where the metastore code lives.

Alan.


1. https://cwiki.apache.org/confluence/display/Hive/Metastore+TLP+Proposal


Re: [DISCUSS] Separating out the metastore as its own TLP

2017-07-05 Thread Alan Gates
On Mon, Jul 3, 2017 at 6:20 AM, Edward Capriolo 
wrote:

>
> We already have things in the meta-store not directly tied to language
> features. For example hive metastore has a "retention" property which is
> not actively in use by anything. In reality, we rarely say 'no' or -1 to
> much. Which in part is why I believe our release process is grinding
> slower: we have so many things in flight I do not feel that any one person
> can keep track. You are working on porting the metastore to hbase.
> https://issues.apache.org/jira/browse/HIVE-9452 did you get a -1 or 'No'
> along the way? When I first noticed this I pointed out that someone has
> already ported the metastore to Cassandra
> https://github.com/riptano/brisk/blob/master/src/java/src/org/apache/cassandra/hadoop/hive/metastore/SchemaManagerService.java,
> but I was more exciting/rational for this multi-year approach using hbase
> so I let everyone 'have at it'.
>
Your example and mine are not equivalent.  The HBase metastore is still a
Hive feature, even if some thought it not worthwhile.  That is different
than people bringing features that will never interest Hive or that Hive
could never use (e.g. Dain’s desire for the metastore to support Presto
style views).

I forgot to mention the issue these would-be non-Hive contributors have
with releases if they contribute their features to the metastore while it’s
inside Hive.  Is Hive going to do a release just to push out features in
the metastore that it doesn’t care about?

You seem to be asserting that doing this doesn’t really help non-Hive based
systems that are using or would like to use the metastore.  But it is
interesting that people from three of those systems have commented in the
thread so far, and all are positive (Dmitrias from Impala, Dain from
Presto, and Sriharsha from the schema registry project).


> I am going to give a hypothetical but real-world situation. Suppose I want
> to add the statement "CREATE permanent macro xyz". This feature, I believe,
> would cross-cut Calcite, Hive, and the Hive metastore. To build this feature I
> would need to orchestrate the change across 3 separate groups of Hive
> 'subcommittees', for lack of a better word: 3 git repos, 3 JIRAs, 3
> releases. That is not counting if we run into some bug or misfeature (maybe
> with Tez or something else), so that brings in 4-5 upstream releases to
> add a feature to Hive. This does not take into account normal process
> mess-ups. For example, say you get the metastore done, but now the people
> doing the Calcite/ANTLR work suggest the feature have different syntax because
> they did not read the 3-4 linked tickets when the process started? Now you
> have to loop back around the process. Finding 1 person in 1 project to
> usher along the feature you want is difficult; having to find and clear
> time with 3 people across three projects is going to be more difficult, along
> with then 'pushing' them all to kick out a release so you can finally use
> said feature.
>

I partially agree with you.  On the reviews, JIRAs, etc. I don’t think it
adds much, if any, overhead.  Hive is a big project and no one person knows
all the code anymore.  If you wanted to add a permanent macros feature you
would need reviews from someone who knows the parser (probably Pengcheng),
people who know the optimizer (Jesus, Ashutosh, …), and someone who knows
the metastore (me, Thejas, …).  And any large feature is going to be
implemented over multiple JIRAs, all of which are linkable regardless of
whether the JIRAs start with METASTORE- or HIVE-.   I also don’t think it
makes the feature disagreement any worse.  If the optimizer team absolutely
insists it has to have some feature and the metastore team insists that it
can’t have that feature you’re going to have to work through the issue
whether they all are in Hive or in two separate projects.

Where I agree the split adds cost is releases.  Before your macro feature
could go live you need releases from each of the components.  And while in
development the components need to use snapshot versions of the other
components.  My assertion is that the benefits outweigh this cost.

Alan.


[jira] [Created] (HIVE-17045) Add HyperLogLog as an UDAF

2017-07-05 Thread Pengcheng Xiong (JIRA)
Pengcheng Xiong created HIVE-17045:
--

 Summary: Add HyperLogLog as an UDAF
 Key: HIVE-17045
 URL: https://issues.apache.org/jira/browse/HIVE-17045
 Project: Hive
  Issue Type: Sub-task
Reporter: Pengcheng Xiong
Assignee: Pengcheng Xiong








[jira] [Created] (HIVE-17044) Transform LEFT/RIGHT OUTER into INNER join in the presence of FK relationship

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17044:
--

 Summary: Transform LEFT/RIGHT OUTER into INNER join in the 
presence of FK relationship
 Key: HIVE-17044
 URL: https://issues.apache.org/jira/browse/HIVE-17044
 Project: Hive
  Issue Type: Sub-task
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez


Consider we are executing a LEFT OUTER join on two tables using their FK-UK/PK 
relationship. We might be able to transform the OUTER into an INNER join if the 
FK columns (OUTER relationship) are not nullable. Similarly for RIGHT OUTER 
join.





[jira] [Created] (HIVE-17043) Remove non unique columns from group by keys if not referenced later

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17043:
--

 Summary: Remove non unique columns from group by keys if not 
referenced later
 Key: HIVE-17043
 URL: https://issues.apache.org/jira/browse/HIVE-17043
 Project: Hive
  Issue Type: Sub-task
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Ashutosh Chauhan


Group by keys may be a mix of unique (or primary) keys and regular columns. In 
such cases the presence of a regular column won't alter the cardinality of the 
groups. So, if regular columns are not referenced later, they can be dropped 
from the group by keys. Depending on the operator tree, this may result in 
those columns not being read from disk at all in the best case. In the worst 
case, we will avoid shuffling and sorting regular columns from mapper to 
reducer, which could still be a substantial CPU and network saving.
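
A minimal sketch of the pruning decision follows. It is an illustration under assumptions rather than the optimizer rule: columns are plain strings, the class and method names are hypothetical, and the unique key is assumed to be NOT NULL as well (a UNIQUE key that admits NULLs would not guarantee single-row groups).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class GroupByKeyPruning {
    // Keep the unique-key columns plus any regular column that is referenced later.
    // If the grouping columns do not cover the whole unique key, nothing is pruned.
    static List<String> pruneKeys(List<String> groupKeys,
                                  Set<String> uniqueKey,
                                  Set<String> referencedLater) {
        if (uniqueKey.isEmpty() || !groupKeys.containsAll(uniqueKey)) {
            return new ArrayList<>(groupKeys);
        }
        List<String> kept = new ArrayList<>();
        for (String key : groupKeys) {
            if (uniqueKey.contains(key) || referencedLater.contains(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pk", "value1", "value2");
        System.out.println(pruneKeys(keys,
                Collections.singleton("pk"),
                Collections.singleton("value1")));
    }
}
```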





[jira] [Created] (HIVE-17042) Expose NOT NULL constraint in optimizer so constant folding can take advantage of it

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17042:
--

 Summary: Expose NOT NULL constraint in optimizer so constant 
folding can take advantage of it
 Key: HIVE-17042
 URL: https://issues.apache.org/jira/browse/HIVE-17042
 Project: Hive
  Issue Type: Sub-task
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez


We need to set the type to be not nullable for those columns.

Among others, it would be useful to simplify IS NOT NULL, NVL, and COALESCE 
predicates.





[jira] [Created] (HIVE-17041) Aggregate elimination with UNIQUE and NOT NULL column

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17041:
--

 Summary: Aggregate elimination with UNIQUE and NOT NULL column
 Key: HIVE-17041
 URL: https://issues.apache.org/jira/browse/HIVE-17041
 Project: Hive
  Issue Type: Sub-task
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez


If columns are part of a GROUP BY expression and they are UNIQUE and do not 
accept NULL values, i.e. PK or UK+NOTNULL, the _Aggregate_ operator can be 
transformed into a Project operator, as each row will end up in a different 
group.

For instance, given that _pk_ is the PRIMARY KEY for the table, the GROUP BY 
could be removed from the following query:
{code:sql}
SELECT pk, value1
FROM table_1
GROUP BY value1, pk, value2;
{code}
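
The applicability test for this rewrite can be sketched as a check that the grouping columns cover some UNIQUE and NOT NULL key. The sketch is hypothetical (names and the string-based column model are assumptions) and covers only the case without aggregate function calls, as in the query above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AggregateEliminationSketch {
    // True when the grouping columns contain every column of at least one
    // UNIQUE + NOT NULL key, so each input row forms its own group.
    static boolean groupsAreSingleRows(List<String> groupKeys,
                                       List<Set<String>> uniqueNotNullKeys) {
        Set<String> grouping = new HashSet<>(groupKeys);
        for (Set<String> key : uniqueNotNullKeys) {
            if (!key.isEmpty() && grouping.containsAll(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> keys = Collections.singletonList(Collections.singleton("pk"));
        // GROUP BY value1, pk, value2 with pk as PRIMARY KEY
        System.out.println(groupsAreSingleRows(Arrays.asList("value1", "pk", "value2"), keys));
    }
}
```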






[jira] [Created] (HIVE-17040) Join elimination in the presence of FK relationship

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17040:
--

 Summary: Join elimination in the presence of FK relationship
 Key: HIVE-17040
 URL: https://issues.apache.org/jira/browse/HIVE-17040
 Project: Hive
  Issue Type: Sub-task
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez


If the PK/UK table is not filtered, we can safely remove the join.

A simple example:
{code:sql}
SELECT c_current_cdemo_sk
FROM customer JOIN customer_address
ON c_current_addr_sk = ca_address_sk;
{code}

As a Calcite rule, we could implement this rewriting by 1) matching a Project 
on top of a Join operator, 2) checking that only columns from the FK are used 
in the Project, 3) checking that the join condition matches the FK - PK/UK 
relationship, 4) pulling all the predicates from the PK/UK side and checking 
that the input is not filtered, and 5) removing the join, possibly adding an IS 
NOT NULL condition on the join column from the FK side.

If the PK/UK table is filtered, we should still transform the Join into a 
SemiJoin operator.
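
To make the two cases concrete, here is a sketch of what the rewritten queries 
might look like for the example above. It assumes that c_current_addr_sk is 
declared as a FOREIGN KEY referencing ca_address_sk with RELY; the filter on 
ca_state in the second query is made up for illustration.
{code:sql}
-- Unfiltered PK/UK side: the join could be removed, keeping only FK rows
-- that would have found a match.
SELECT c_current_cdemo_sk
FROM customer
WHERE c_current_addr_sk IS NOT NULL;

-- Filtered PK/UK side: the join could become a semijoin instead.
SELECT c_current_cdemo_sk
FROM customer
LEFT SEMI JOIN (
  SELECT ca_address_sk
  FROM customer_address
  WHERE ca_state = 'CA'
) ca
ON c_current_addr_sk = ca.ca_address_sk;
{code}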





[jira] [Created] (HIVE-17039) Implement optimization rewritings that rely on database SQL constraints

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17039:
--

 Summary: Implement optimization rewritings that rely on database 
SQL constraints
 Key: HIVE-17039
 URL: https://issues.apache.org/jira/browse/HIVE-17039
 Project: Hive
  Issue Type: New Feature
  Components: Logical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez


Hive already supports declaring several SQL constraints (PRIMARY KEY, 
FOREIGN KEY, UNIQUE, and NOT NULL). Although these constraints cannot 
currently be enforced on the data, they can be made available to the optimizer 
by using the 'RELY' keyword.
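
As a sketch of how such constraints might be declared, the statements below 
use the DISABLE NOVALIDATE RELY form from Hive's DDL. The table layouts and the 
constraint name fk_customer_address are assumptions chosen for illustration; 
they are not taken from this ticket.
{code:sql}
CREATE TABLE customer_address (
  ca_address_sk INT,
  ca_state STRING,
  PRIMARY KEY (ca_address_sk) DISABLE NOVALIDATE RELY
);

CREATE TABLE customer (
  c_customer_sk INT,
  c_current_cdemo_sk INT,
  c_current_addr_sk INT,
  CONSTRAINT fk_customer_address FOREIGN KEY (c_current_addr_sk)
    REFERENCES customer_address (ca_address_sk) DISABLE NOVALIDATE RELY
);
{code}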

This ticket is an umbrella for all the rewriting optimizations based on SQL 
constraints that we will be including in Hive.





[jira] [Created] (HIVE-17038) invalid CAST result

2017-07-05 Thread Jim Hopper (JIRA)
Jim Hopper created HIVE-17038:
-

 Summary: invalid CAST result
 Key: HIVE-17038
 URL: https://issues.apache.org/jira/browse/HIVE-17038
 Project: Hive
  Issue Type: Bug
Reporter: Jim Hopper


When casting incorrect date literals to the DATE data type, Hive returns wrong 
values instead of NULL.

{code}
SELECT CAST('2017-05-31' AS DATE);
{code}






[jira] [Created] (HIVE-17037) Extend join algorithm selection to avoid unnecessary input data shuffle

2017-07-05 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-17037:
--

 Summary: Extend join algorithm selection to avoid unnecessary 
input data shuffle
 Key: HIVE-17037
 URL: https://issues.apache.org/jira/browse/HIVE-17037
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 3.0.0
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez


As an example, consider the following query:

{code:sql}
SELECT *
FROM (
  SELECT a.value
  FROM src1 a
  JOIN src1 b
  ON (a.value = b.value)
  GROUP BY a.value
) a
JOIN src
ON (a.value = src.value);
{code}

Currently, the plan generated for Tez will contain an unnecessary shuffle 
operation between the subquery and the join, since the records produced by the 
subquery are already sorted by the value.

This issue is to extend join algorithm selection to be able to shuffle only 
some of the inputs for a given join and avoid unnecessary shuffle operations.





[jira] [Created] (HIVE-17036) Lineage: Minor CPU/Mem optimization for lineage transform

2017-07-05 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created HIVE-17036:
---

 Summary: Lineage: Minor CPU/Mem optimization for lineage transform
 Key: HIVE-17036
 URL: https://issues.apache.org/jira/browse/HIVE-17036
 Project: Hive
  Issue Type: Bug
  Components: lineage
Reporter: Rajesh Balamohan
Priority: Minor








[jira] [Created] (HIVE-17035) Optimizer: Lineage transform() should be invoked after rest of the optimizers are invoked

2017-07-05 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created HIVE-17035:
---

 Summary: Optimizer: Lineage transform() should be invoked after 
rest of the optimizers are invoked
 Key: HIVE-17035
 URL: https://issues.apache.org/jira/browse/HIVE-17035
 Project: Hive
  Issue Type: Bug
  Components: Logical Optimizer
Reporter: Rajesh Balamohan
Priority: Minor


In a fairly large query with tens of left joins, creating the lineageInfo 
alone took 1500+ seconds. This is because the table had lots of columns and, 
during processing, it ended up with 7000+ value columns in 
{{ReduceSinkLineage}}. 

It would be good to invoke the lineage transform after the rest of the 
optimizers in {{Optimizer}} have been invoked. This would help improve the 
runtime.





[jira] [Created] (HIVE-17034) The spark tar for itests is downloaded every time if md5sum is not installed

2017-07-05 Thread Rui Li (JIRA)
Rui Li created HIVE-17034:
-

 Summary: The spark tar for itests is downloaded every time if 
md5sum is not installed
 Key: HIVE-17034
 URL: https://issues.apache.org/jira/browse/HIVE-17034
 Project: Hive
  Issue Type: Test
Reporter: Rui Li
Assignee: Rui Li


I think we should either skip verifying the md5 checksum, or fail the build to 
let the developer know that md5sum is required.


