[jira] [Created] (HIVE-12230) custom UDF configure() not called in Vectorization mode

2015-10-21 Thread Matt McCline (JIRA)
Matt McCline created HIVE-12230:
---

 Summary: custom UDF configure() not called in Vectorization mode
 Key: HIVE-12230
 URL: https://issues.apache.org/jira/browse/HIVE-12230
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Matt McCline
Assignee: Matt McCline
Priority: Critical


PROBLEM:

A custom UDF that overrides configure():

{code}
@Override
public void configure(MapredContext context) {
greeting = "Hello ";
}

{code}

In vectorization mode, configure() is not called.
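Until the bug is fixed, a common workaround is not to depend on configure() for initialization and instead initialize state lazily on first use. A minimal plain-Java sketch of the idea (not the real Hive GenericUDF API; class and method names here are illustrative only):

```java
// Hypothetical sketch: instead of relying on configure(), which the
// vectorized code path may never invoke, fall back to lazy initialization
// inside evaluate().
public class GreetingUdf {
    private String greeting;             // set by configure() in row mode only

    public void configure(Object mapredContext) {
        greeting = "Hello ";             // may never run under vectorization
    }

    public String evaluate(String name) {
        if (greeting == null) {          // lazy fallback for the vectorized path
            greeting = "Hello ";
        }
        return greeting + name;
    }
}
```

With this guard the UDF produces the same result whether or not configure() was invoked first.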



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Vaibhav Gumashta
Congrats Sid!

-Vaibhav

On 10/21/15, 8:30 PM, "Matthew McCline"  wrote:

>
>Congratulations!
>
>
>From: Chetna C 
>Sent: Wednesday, October 21, 2015 8:27 PM
>To: dev@hive.apache.org
>Cc: Siddharth Seth
>Subject: Re: [ANNOUNCE] New Hive Committer - Siddharth Seth
>
>Congratulations !!
>On Oct 22, 2015 5:13 AM, "Pengcheng Xiong"  wrote:
>
>> Congrats Sid!
>>
>> On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin
>>
>> wrote:
>>
>> > The Apache Hive PMC has voted to make Siddharth Seth a committer on
>>the
>> > Apache Hive Project.
>> >
>> > Please join me in congratulating Sid!
>> >
>> > Thanks,
>> > Sergey.
>> >
>> >
>>
>



Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Vaibhav Gumashta
Congrats Aihua!

-Vaibhav

On 10/21/15, 4:42 PM, "Pengcheng Xiong"  wrote:

>Congrats Aihua!
>
>On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:
>
>> The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
>> Hive Project.
>>
>> Please join me in congratulating Aihua!
>>
>> Thanks,
>> Szehon
>>



deprecating MR in the first release of Hive 2.0

2015-10-21 Thread Sergey Shelukhin
We have discussed removing hadoop-1 and MR support in the Hive 2 line in the 
past.
Hadoop-1 removal seems to be non-controversial and on track; before we cut the 
first release of Hive 2, I propose we deprecate MR.

The Tez and Spark engines provide vast perf improvements over MR.
Most execution optimization work by contributors has for a long time targeted 
these engines and is not portable to MR, so MR is languishing further.
At the same time, supporting the additional code has development costs for 
new features and bug fixes, plus we have to run its tests both in Apache and 
for local changes, and deploy the code.

However, MR is hard to remove outright. It may also provide a baseline for 
reproducing bugs in other engines (which is not bulletproof, since MR logic 
can itself be incorrect), or a point of comparison in perf benchmarks.

Therefore, I propose that for now we add deprecation warnings suggesting the 
other alternatives:

  *   to Hive configuration documentation.
  *   to Hive wiki.
  *   to release notes on Hive 2.
  *   in Beeline and CLI when using MR.

Additionally, I propose we remove the Minimr test driver from HiveQA runs for 
master.

What do you think?


Re: Refactor code for avoiding comparing Strings

2015-10-21 Thread Ashutosh Chauhan
Seems like you are on a really old version of Hive. Do you have the following
patches in your tree?

https://issues.apache.org/jira/browse/HIVE-3616
https://issues.apache.org/jira/browse/HIVE-6095
https://issues.apache.org/jira/browse/HIVE-6116
https://issues.apache.org/jira/browse/HIVE-6121
https://issues.apache.org/jira/browse/HIVE-6171
https://issues.apache.org/jira/browse/HIVE-6197
https://issues.apache.org/jira/browse/HIVE-6228

Thanks,
Ashutosh

On Wed, Oct 21, 2015 at 5:17 AM, Алина Абрамова 
wrote:

> Hi,
> I propose some code refactoring in Apache Hive. In many places, String is
> used to represent a path, and this sometimes causes issues. Strings need to
> be compared with equals(), but comparing Strings is often not the right way
> to compare paths.
> Using Path from org.apache.hadoop.fs would be more correct.
> What do you think about my proposal?
>
> Thanks.
> Alina
>


[jira] [Created] (HIVE-12220) LLAP: Usability issues with hive.llap.io.cache.orc.size

2015-10-21 Thread Carter Shanklin (JIRA)
Carter Shanklin created HIVE-12220:
--

 Summary: LLAP: Usability issues with hive.llap.io.cache.orc.size
 Key: HIVE-12220
 URL: https://issues.apache.org/jira/browse/HIVE-12220
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: llap
Reporter: Carter Shanklin


In the llap-daemon site you need to set, among other things,

llap.daemon.memory.per.instance.mb
and
hive.llap.io.cache.orc.size

The use of hive.llap.io.cache.orc.size caused me some unnecessary problems: 
initially I entered the value in MB rather than in bytes. Operator error, you 
could say, but I think of this value as a fraction of the other value, which 
is in MB.

Second, is this really tied to ORC? E.g., when we have the vectorized text 
reader, will that data be cached as well? Or might it be in the future?

I would like to propose using hive.llap.io.cache.size.mb for this setting 
instead.
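Since the reporter thinks of the cache size as a fraction of the daemon memory setting (which is denominated in MB), a sketch of the arithmetic a less error-prone configuration could use; the method and its parameters are hypothetical, not an existing Hive API:

```java
// Hypothetical sketch: derive the cache size in bytes from the daemon memory
// setting (already in MB) and a fraction, instead of asking the operator for
// a raw byte count next to an MB-denominated setting.
public class CacheSizeSketch {
    static long cacheBytes(long daemonMemoryMb, double cacheFraction) {
        // e.g. 4096 MB daemon memory with a 0.5 cache fraction -> 2 GiB cache
        return (long) (daemonMemoryMb * cacheFraction) * 1024L * 1024L;
    }
}
```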





Re: Profiling the internal of Hive

2015-10-21 Thread Sergey Shelukhin
For Tez containers, you can specify JVM args via the hive.tez.java.opts config
setting; there is probably a similar setting for MR. You can add the profiler
agent to this setting, e.g. for YourKit something like:
"-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=disablej2ee,disabletracing,dir=/tmp/ykdumps,sampling"
or, for JMC, which is built into some JVM versions, something like:
"-XX:+UnlockCommercialFeatures -XX:+FlightRecorder
-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/jmcdumps,disk=true,repository=/tmp/jmcdumps".
Then look at the profiler dumps.
 


On 15/10/19, 23:45, "Justin Wong"  wrote:

>Hi there,
>
>I'm currently doing a research project on Hive, and I need to profile the
>internal performance of Hive.
>
>I've managed to run Hive over Hadoop's LocalJobRunner and attached JProfiler
>to it, but the CPU hotspot stopped at the "runJar" level. I also tried
>Hadoop's built-in profiling, but no profiling log was generated.
>
>Could you give me some advice on profiling Hive internals?
>
>Cheers,
>-- 
>Justin Wong
>
>Blog: https://bigeagle.me/
>Fingerprint: 15CC 6A61 738B 1599 0095  E256 CB67 DA7A 865B AC3A



Refactor code for avoiding comparing Strings

2015-10-21 Thread Алина Абрамова
Hi,
I propose some code refactoring in Apache Hive. In many places, String is used
to represent a path, and this sometimes causes issues. Strings need to be
compared with equals(), but comparing Strings is often not the right way to
compare paths.
Using Path from org.apache.hadoop.fs would be more correct.
What do you think about my proposal?

Thanks.
Alina
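A plain-Java illustration of why comparing paths as raw Strings is fragile. In Hive the proposed fix is org.apache.hadoop.fs.Path, which normalizes the path on construction; the normalize() helper below is a simplified, hypothetical stand-in for that behavior, not Hadoop code:

```java
// Sketch: two Strings naming the same directory can compare unequal, while
// their normalized forms compare equal. normalize() mimics (in simplified
// form) what org.apache.hadoop.fs.Path does when constructed.
public class PathCompare {
    static String normalize(String p) {
        String n = p.replaceAll("/+", "/");       // collapse duplicate slashes
        if (n.length() > 1 && n.endsWith("/")) {
            n = n.substring(0, n.length() - 1);   // drop a trailing slash
        }
        return n;
    }
}
```

For example, "/user/hive/warehouse/".equals("/user/hive//warehouse") is false even though both name the same directory, while the normalized forms compare equal.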


[jira] [Created] (HIVE-12221) Concurrency issue in HCatUtil.getHiveMetastoreClient()

2015-10-21 Thread Roshan Naik (JIRA)
Roshan Naik created HIVE-12221:
--

 Summary: Concurrency issue in HCatUtil.getHiveMetastoreClient() 
 Key: HIVE-12221
 URL: https://issues.apache.org/jira/browse/HIVE-12221
 Project: Hive
  Issue Type: Bug
Reporter: Roshan Naik


HCatUtil.getHiveMetastoreClient() uses the double-checked locking pattern
to implement a singleton, which is a broken pattern.
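For context, the standard safe alternatives in Java are a volatile field (valid since the Java 5 memory model) or the initialization-on-demand holder idiom. A minimal sketch of the latter; the class name is a stand-in, not the actual HCatalog code:

```java
// Sketch of safe lazy singleton initialization. Without volatile, classic
// double-checked locking can publish a partially constructed object to other
// threads; the holder idiom below avoids the problem entirely, because the
// JVM guarantees the nested class is initialized exactly once, lazily.
public class ClientCacheSingleton {

    private static class Holder {
        static final ClientCacheSingleton INSTANCE = new ClientCacheSingleton();
    }

    public static ClientCacheSingleton getInstance() {
        return Holder.INSTANCE;   // triggers Holder init on first call only
    }

    private ClientCacheSingleton() {}
}
```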





[jira] [Created] (HIVE-12224) Remove HOLD_DDLTIME

2015-10-21 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-12224:
---

 Summary: Remove HOLD_DDLTIME
 Key: HIVE-12224
 URL: https://issues.apache.org/jira/browse/HIVE-12224
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan


This arcane feature was introduced long ago via HIVE-1394. It was broken as 
soon as it landed (HIVE-1442) and is thus useless. The fact that no one has 
fixed it since suggests it is not really used by anyone. Better to remove it 
so that no one hits the bug of HIVE-1442.





[jira] [Created] (HIVE-12223) Filter on Grouping__ID does not work properly

2015-10-21 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-12223:
--

 Summary: Filter on Grouping__ID does not work properly
 Key: HIVE-12223
 URL: https://issues.apache.org/jira/browse/HIVE-12223
 Project: Hive
  Issue Type: Bug
  Components: Logical Optimizer
Affects Versions: 1.3.0, 2.0.0
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez
 Fix For: 1.3.0, 2.0.0


Consider the following query:

{noformat}
SELECT key, value, GROUPING__ID, count(*)
FROM T1
GROUP BY key, value
GROUPING SETS ((), (key))
HAVING GROUPING__ID = 1
{noformat}

This query will not return results. The reason is that a "constant" placeholder 
is introduced by SemanticAnalyzer for the GROUPING__ID column. At execution 
time, this placeholder is replaced by the actual value of the GROUPING__ID. As 
it is a constant, the Hive optimizer will evaluate statically whether the 
condition is met or not, leading to incorrect results. A possible solution is 
to transform the placeholder constant into a function over the grouping keys.
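The proposed fix amounts to computing GROUPING__ID from which grouping keys are active in the current grouping set, rather than from a compile-time constant the optimizer can fold away. A toy sketch of that computation; the bit convention used here (bit i set when key i is present) is illustrative only, since Hive's actual GROUPING__ID bit order has differed across versions:

```java
// Toy sketch of the proposed fix: derive the grouping ID at runtime from the
// active grouping keys, so the optimizer cannot constant-fold the HAVING
// predicate away.
public class GroupingIdSketch {
    static long groupingId(boolean[] keyPresent) {
        long id = 0;
        for (int i = 0; i < keyPresent.length; i++) {
            if (keyPresent[i]) {
                id |= 1L << i;   // mark key i as participating in this set
            }
        }
        return id;
    }
}
```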





Re: Temporary table problem

2015-10-21 Thread Pengcheng Xiong
Hi Lingang,

(1) Could you first specify what is the difference between the results
in details? If you can provide the schema of the table, and also some data
of the rows of the table that can help reproduce the problem, that is very
helpful.
(2) And, what Hive version are you using and settings, configurations?
(3) If possible, could you also post the result when you run explain?

That would help us better help you. Thanks!

Best
Pengcheng Xiong

On Wed, Oct 21, 2015 at 4:13 AM, Deng Lingang(上海_技术部_架构部_基础架构_邓林钢) <
dengling...@yhd.com> wrote:

> Hi,all
> I created 3 tables, a_110, b_110, and c_110, and then the
> script becomes
>
> SELECT count(*)
> FROM testtmp.a_110 a
> LEFT OUTER JOIN testtmp.b_110 b ON a.cms_id = b.cms_id AND a.pltfm_id =
> b.pltfm_id
> LEFT OUTER JOIN testtmp.c_110 c ON b.cms_id = c.cms_id AND b.categ_lvl2_id
> = c.categ_lvl2_id AND b.pltfm_id = c.pltfm_id
> LEFT OUTER JOIN dw.dim_cms dim ON a.cms_id= dim.cms_id
> and GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
> AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-10-18')
> where GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
>   AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-09-01') ;
>
> --11524
>
>
> But a problem arises: the two scripts' results are different. Who can tell
> me the reason? Thanks.
>
>
> SELECT count(*)
> FROM
>   (SELECT nav_page_value AS cms_id,
>   pltfm_id,
>   COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN t.sessn_id
> ELSE NULL END)) AS cms_vstrs,
>   COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0
>  AND t.nav_next_tracker_id > 0 THEN t.sessn_id
> ELSE NULL END)) AS cms_click_vstrs,
>   COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN
> t.nav_tracker_id ELSE NULL END)) AS cms_pv,
>   COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0
>  AND t.sessn_pv > 1 THEN t.sessn_id ELSE NULL
> END)) AS cms_sec_vstrs,
>   COUNT(DISTINCT(CASE WHEN (t.detl_tracker_id > 0
> AND (length(t.detl_button_position) = 0
>  OR t.detl_button_position IS NULL
>  OR t.detl_button_position =
> 'null'))
>  OR (t.cart_tracker_id > 0
>  AND length(t.detl_tracker_id) = 0) THEN
> t.sessn_id ELSE NULL END)) AS cms_detl_vstrs,
>   COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN t.ordr_code
> ELSE NULL END)) AS cms_ordr_num
>FROM dw.fct_traffic_cms_detl t
>WHERE ds= '2015-10-18'
>  AND t.nav_page_value IS NOT NULL
>GROUP BY nav_page_value,
> pltfm_id) a
> LEFT OUTER JOIN
>   (SELECT nvl(t3.mg_brand_id,-99) AS mg_brand_id,
>   nvl(t1.nav_page_value,-1) AS cms_id,
>   hc.categ_lvl2_id ,
>   t1.pltfm_id,
>   nvl(COUNT(DISTINCT(CASE WHEN t1.detl_pv > 0
>  OR dirct_cart_pv>0 THEN t1.sessn_id ELSE NULL
> END)), 0) AS detl_vstrs,
>   COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN
> parnt_ordr_id ELSE NULL END)) AS ordr_num,
>   COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN
> t1.end_user_id ELSE NULL END)) AS cust_num
>FROM dw.fct_traffic_prdt_cart_path t1 LEFT
>OUTER JOIN dw.dim_prod t2 ON t1.prod_id = t2.prod_id
>AND t2.cur_flag = 1 LEFT
>OUTER JOIN dw.hier_categ hc ON t2.categ_lvl_id = hc.categ_lvl_id INNER
>JOIN dw.dim_mrchnt b ON t1.mrchnt_id = b.mrchnt_id
>AND b.cur_flag = 1 LEFT
>OUTER JOIN dw.dim_brand t3 ON t2.brand_id = t3.brand_id
>AND t3.cur_flag = 1
>WHERE t1.ds = '2015-10-18'
>  AND b.biz_unit = 1
>  AND t1.nav_page_categ_id = 1
>  AND t1.nav_page_value>0
>GROUP BY nvl(t3.mg_brand_id,-99),
> nvl(t1.nav_page_value,-1) ,
> hc.categ_lvl2_id ,
> t1.pltfm_id) b ON a.cms_id = b.cms_id
> AND a.pltfm_id = b.pltfm_id
> LEFT OUTER JOIN
>   (SELECT nvl(t1.nav_page_value,-1) AS cms_id,
>   hc.categ_lvl2_id ,
>   t1.pltfm_id,
>   nvl(COUNT(DISTINCT(CASE WHEN t1.detl_pv > 0
>  OR dirct_cart_pv>0 THEN t1.sessn_id ELSE NULL
> END)), 0) AS categ_lvl2_cms_detl_vstrs,
>   COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN
> parnt_ordr_id ELSE NULL END)) AS categ_lvl2_cms_ordr_num
>FROM dw.fct_traffic_prdt_cart_path t1 LEFT
>OUTER JOIN dw.dim_prod t2 ON t1.prod_id = t2.prod_id
>AND t2.cur_flag = 1 LEFT
>OUTER JOIN dw.hier_categ hc ON t2.categ_lvl_id = hc.categ_lvl_id INNER
>JOIN dw.dim_mrchnt b ON t1.mrchnt_id = b.mrchnt_id
>AND b.cur_flag = 1 LEFT
>OUTER JOIN dw.dim_brand t3 ON t2.brand_id = t3.brand_id
>AND t3.cur_flag = 1
>WHERE t1.ds = '2015-10-18'
>  AND b.biz_unit = 1
>  AND t1.nav_page_categ_id = 1
>  AND t1.nav_page_value>0
>GROUP BY nvl(t1.nav_page_value,-1) ,
> 

[jira] [Created] (HIVE-12222) Define port range in property for RPCServer

2015-10-21 Thread Andrew Lee (JIRA)
Andrew Lee created HIVE-12222:
-

 Summary: Define port range in property for RPCServer
 Key: HIVE-12222
 URL: https://issues.apache.org/jira/browse/HIVE-12222
 Project: Hive
  Issue Type: Improvement
  Components: CLI
Affects Versions: 1.2.1
 Environment: Apache Hadoop 2.7.0
Apache Hive 1.2.1
Apache Spark 1.5.1

Reporter: Andrew Lee


Creating this JIRA after discussing with Xuefu on the dev mailing list. I 
would need some help reviewing and updating the fields in this JIRA ticket, 
thanks.

I notice that in 
./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java

The port number is assigned 0, which means a random port is used every time 
the RPC Server is created to talk to Spark in the same session.
Because of this, it is hard to configure a firewall between the HiveCLI RPC 
Server and Spark due to the unpredictable port numbers. In other words, users 
need to open the whole range of Hive ports 
from Data Node => HiveCLI (edge node).

{code}
 this.channel = new ServerBootstrap()
  .group(group)
  .channel(NioServerSocketChannel.class)
  .childHandler(new ChannelInitializer<SocketChannel>() {
  @Override
  public void initChannel(SocketChannel ch) throws Exception {
SaslServerHandler saslHandler = new SaslServerHandler(config);
final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, group);
saslHandler.rpc = newRpc;

Runnable cancelTask = new Runnable() {
@Override
public void run() {
  LOG.warn("Timed out waiting for hello from client.");
  newRpc.close();
}
};
saslHandler.cancelTask = group.schedule(cancelTask,
RpcServer.this.config.getServerConnectTimeoutMs(),
TimeUnit.MILLISECONDS);

  }
  })
{code}

2 Main reasons.

- Most users (from what I see and encounter) use HiveCLI as a command-line 
tool, and in order to use it, they need to log in to the edge node (via SSH). 
Now, here comes the interesting part.
This is what I observe and encounter from time to time: many users will abuse 
the resources on that edge node (increasing HADOOP_HEAPSIZE, dumping output 
to local disk, running huge Python workflows, etc.), which may cause the HS2 
process to run into an OOME, choke and die, etc., along with various other 
resource issues (login, etc.).

- Analysts connect to Hive via HS2 + ODBC, so HS2 needs to be highly 
available. It makes sense to run it on a gateway node or a service node, 
separated from the HiveCLI.
The logs are located in a different location, and monitoring and auditing are 
easier when HS2 runs under a daemon user account, etc., so we don't want 
users to run HiveCLI where HS2 is running.
It's better to isolate the resources this way to avoid memory, file handle, 
and disk space issues.

From a security standpoint, 

- Since users can log in to the edge node (via SSH), the security on the edge 
node needs to be fortified and enhanced. That is where all the firewall rules 
and auditing come in.

- Regulation/compliance auditing is another requirement for monitoring all 
traffic; specifying and locking down the ports makes this easier, since we 
can focus on a range to monitor and audit.
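A sketch of what the requested behavior could look like: instead of binding port 0 (an ephemeral port), walk a configured range until a bind succeeds, so a firewall rule can be written for that range. Any config key such as a port-range property is hypothetical here, not an existing Hive setting:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch of a port-range bind: try each port in [lo, hi] until one is free,
// so the firewall only needs to allow that range instead of 1024-65535.
public class PortRangeBinder {
    public static ServerSocket bindInRange(int lo, int hi) throws IOException {
        for (int port = lo; port <= hi; port++) {
            try {
                return new ServerSocket(port);   // bound successfully
            } catch (IOException busy) {
                // port in use; try the next one in the range
            }
        }
        throw new IOException("No free port in range " + lo + "-" + hi);
    }
}
```

The real RpcServer would feed the bound socket (or its channel equivalent) into the Netty ServerBootstrap shown above instead of letting it pick port 0.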





parallelizing tests on ptest

2015-10-21 Thread Sergey Shelukhin
Hi. 
We have merged the llap branch into master and would like to parallelize
the llap test the same way the CliDriver, Minimr, Spark, HBase, etc. tests
are parallelized.
The test currently only runs ~20 tests from a separate variable (otherwise
it times out).

https://issues.apache.org/jira/browse/HIVE-12014 brings the test config to
its final form, the relevant settings are

 

Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark

2015-10-21 Thread Andrew Lee
Hi Xuefu,

Thanks, I'll create a JIRA. By the way, since HiveCLI will be replaced by 
Beeline or another design later,
I'm hoping the same philosophy can be considered if another CLI uses the 
RPCServer as well, or shares the same source code at some point.

Shall the Issue Type of the JIRA ticket be "Improvement" or "New Feature"?


From: Xuefu Zhang 
Sent: Tuesday, October 20, 2015 6:39 PM
To: dev@hive.apache.org
Subject: Re: Hard Coded 0 to assign RPC Server port number when 
hive.execution.engine=spark

Thanks, Andrew! You have a point. However, we're trying to sunset Hive CLI.
In the meantime, I guess it doesn't hurt to give admin more control over
the ports to be used. Please put your proposal in a JIRA and we can go from
there.

--Xuefu

On Tue, Oct 20, 2015 at 7:54 AM, Andrew Lee  wrote:

> Hi Xuefu,
>
> 2 Main reasons.
>
> - Most users (what I see and encounter) use HiveCLI as a command line
> tool, and in order to use that, they need to login to the edge node (via
> SSH). Now, here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time
> to time. Most users will abuse the resource on that edge node (increasing
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python
> workflow, etc), this may cause the HS2 process to run into OOME, choke and
> die, etc. various resource issues including others like login, etc.
>
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly
> available. This makes sense to run it on the gateway node or a service node
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is
> easier to run HS2 with a daemon user account, etc. so we don't want users
> to run HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file
> handlers, disk space, issues.
>
> From a security standpoint,
>
> - Since users can login to edge node (via SSH), the security on the edge
> node needs to be fortified and enhanced. Therefore, all the FW comes in and
> auditing.
>
> - Regulation/compliance for auditing is another requirement to monitor all
> traffic, specifying ports and locking down the ports makes it easier since
> we can focus
> on a range to monitor and audit.
>
> Hope this explains the reason why we are asking for this feature.
>
>
> 
> From: Xuefu Zhang 
> Sent: Monday, October 19, 2015 9:37 PM
> To: dev@hive.apache.org
> Subject: Re: Hard Coded 0 to assign RPC Server port number when
> hive.execution.engine=spark
>
> Hi Andrew,
>
> I understand your policy on edge node. However, I'm wondering why you
> cannot require that Hive CLI run only on gateway nodes, similar to HS2? In
> essence, Hive CLI is a client with embedded hive server, so it seems
> reasonable to have a similar requirement as it for HS2.
>
> I'm not defending against your request. Rather, I'm interested in the
> rationale behind your policy.
>
> Thanks,
> Xuefu
>
> On Mon, Oct 19, 2015 at 9:12 PM, Andrew Lee  wrote:
>
> > Hi Xuefu,
> >
> > I agree for HS2 since HS2 usually runs on a gateway or service node
> inside
> > the cluster environment.
> > In my case, it is actually additional security.
> > A separate edge node (not running HS2, HS2 runs on another box) is used
> > for HiveCLI.
> > We don't allow data/worker nodes to talk to the edge node on random
> ports.
> > All ports must be registered or explicitly specified and monitored.
> > That's why I am asking for this feature. Otherwise, opening up 1024-65535
> > from data/worker node to edge node is actually
> > a bad idea and bad practice for network security.  :(
> >
> >
> >
> > 
> > From: Xuefu Zhang 
> > Sent: Monday, October 19, 2015 1:12 PM
> > To: dev@hive.apache.org
> > Subject: Re: Hard Coded 0 to assign RPC Server port number when
> > hive.execution.engine=spark
> >
> > Hi Andrew,
> >
> > RpcServer is an instance launched for each user session. In case of Hive
> > CLI, which is for a single user, what you said makes sense and the port
> > number can be configurable. In the context of HS2, however, there are
> > multiple user sessions and the total is unknown in advance. While +1
> scheme
> > works, there can be still a band of ports that might be eventually
> opened.
> >
> > On a different perspective, we expect that either Hive CLI or HS2 resides
> > on a gateway node, which are in the same network with the data/worker
> > nodes. In this configuration, firewall issue you mentioned doesn't apply.
> > Such configuration is what we usually see in our enterprise customers,
> > which is what we recommend. I'm not sure why you would want your Hive
> users
> > to launch Hive CLI anywhere outside your cluster, which doesn't seem
> secure
> > if security is your concern.
> >
> > 

Re: Review Request 39199: HIVE-12084 : Hive queries with ORDER BY and large LIMIT fails with OutOfMemoryError Java heap space

2015-10-21 Thread Hari Sankar Sivarama Subramaniyan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39199/
---

(Updated Oct. 21, 2015, 8:17 p.m.)


Review request for hive, Ashutosh Chauhan and John Pullokkaran.


Repository: hive-git


Description
---

Please look at https://issues.apache.org/jira/browse/HIVE-12084


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 6c7adbd 
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFTopNHash.java f93b420 
  ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java 484006a 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 4fed49e 
  ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java cc8c9e8 
  ql/src/test/queries/clientpositive/topn.q PRE-CREATION 
  ql/src/test/results/clientpositive/topn.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/39199/diff/


Testing
---


Thanks,

Hari Sankar Sivarama Subramaniyan



Re: Review Request 39522: HIVE-12220 LLAP: Usability issues with hive.llap.io.cache.orc.size

2015-10-21 Thread Gopal V

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39522/#review103461
---



common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (line 2291)


Remove the .ll.?


- Gopal V


On Oct. 21, 2015, 8:04 p.m., Sergey Shelukhin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39522/
> ---
> 
> (Updated Oct. 21, 2015, 8:04 p.m.)
> 
> 
> Review request for hive and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> see jira
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 6c7adbd 
>   llap-server/src/java/org/apache/hadoop/hive/llap/cache/BuddyAllocator.java 
> ae64d20 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/cache/LowLevelCacheMemoryManager.java
>  4a256ee 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/cache/LowLevelLrfuCachePolicy.java
>  f551edb 
>   llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapServiceDriver.java 
> 05fecc7 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapDaemon.java 
> 6f75001 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
>  86a56ab 
>   
> llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestBuddyAllocator.java
>  d4d4bb2 
>   
> llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestLowLevelLrfuCachePolicy.java
>  a423eeb 
> 
> Diff: https://reviews.apache.org/r/39522/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Sergey Shelukhin
> 
>



[jira] [Created] (HIVE-12225) LineageCtx should release all resources at clear

2015-10-21 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HIVE-12225:
--

 Summary: LineageCtx should release all resources at clear
 Key: HIVE-12225
 URL: https://issues.apache.org/jira/browse/HIVE-12225
 Project: Hive
  Issue Type: Bug
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang


Some maps are not released in the clear() method.





[jira] [Created] (HIVE-12226) Support unicode for table names

2015-10-21 Thread Pengcheng Xiong (JIRA)
Pengcheng Xiong created HIVE-12226:
--

 Summary: Support unicode for table names
 Key: HIVE-12226
 URL: https://issues.apache.org/jira/browse/HIVE-12226
 Project: Hive
  Issue Type: Sub-task
Reporter: Pengcheng Xiong
Assignee: richard du








Review Request 39522: HIVE-12220 LLAP: Usability issues with hive.llap.io.cache.orc.size

2015-10-21 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39522/
---

Review request for hive and Gopal V.


Repository: hive-git


Description
---

see jira


Diffs
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 6c7adbd 
  llap-server/src/java/org/apache/hadoop/hive/llap/cache/BuddyAllocator.java 
ae64d20 
  
llap-server/src/java/org/apache/hadoop/hive/llap/cache/LowLevelCacheMemoryManager.java
 4a256ee 
  
llap-server/src/java/org/apache/hadoop/hive/llap/cache/LowLevelLrfuCachePolicy.java
 f551edb 
  llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapServiceDriver.java 
05fecc7 
  llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapDaemon.java 
6f75001 
  
llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
 86a56ab 
  
llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestBuddyAllocator.java 
d4d4bb2 
  
llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestLowLevelLrfuCachePolicy.java
 a423eeb 

Diff: https://reviews.apache.org/r/39522/diff/


Testing
---


Thanks,

Sergey Shelukhin



Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Zhuoluo (Clark) Yang
Congrats

Thanks,
Zhuoluo (Clark) Yang

On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:

> The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
> Hive Project.
>
> Please join me in congratulating Aihua!
>
> Thanks,
> Szehon
>


Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Chaoyu Tang
Congratulations to Aihua.

On Wed, Oct 21, 2015 at 5:12 PM, Zhuoluo (Clark) Yang  wrote:

> Congrats
>
> Thanks,
> Zhuoluo (Clark) Yang
>
> On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:
>
> > The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
> > Hive Project.
> >
> > Please join me in congratulating Aihua!
> >
> > Thanks,
> > Szehon
> >
>


[ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Szehon Ho
The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
Hive Project.

Please join me in congratulating Aihua!

Thanks,
Szehon


Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Chaoyu Tang
Congratulations to Siddharth!


On Wed, Oct 21, 2015 at 5:25 PM, Xuefu Zhang  wrote:

> Congratulations, Siddharth!
>
> On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin 
> wrote:
>
> > The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> > Apache Hive Project.
> >
> > Please join me in congratulating Sid!
> >
> > Thanks,
> > Sergey.
> >
> >
>


Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Sergey Shelukhin
Congrats!

On 15/10/21, 14:30, "Chaoyu Tang"  wrote:

>Congratulations to Aihua.
>
>On Wed, Oct 21, 2015 at 5:12 PM, Zhuoluo (Clark) Yang
>> wrote:
>
>> Congrats
>>
>> Thanks,
>> Zhuoluo (Clark) Yang
>>
>> On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:
>>
>> > The Apache Hive PMC has voted to make Aihua Xu a committer on the
>>Apache
>> > Hive Project.
>> >
>> > Please join me in congratulating Aihua!
>> >
>> > Thanks,
>> > Szehon
>> >
>>



Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Jimmy Xiang
Congrats!!

On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:

> The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
> Hive Project.
>
> Please join me in congratulating Aihua!
>
> Thanks,
> Szehon
>


Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Xuefu Zhang
Congratulations, Siddharth!

On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin 
wrote:

> The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> Apache Hive Project.
>
> Please join me in congratulating Sid!
>
> Thanks,
> Sergey.
>
>


[jira] [Created] (HIVE-12227) LLAP: better column vector object pools

2015-10-21 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-12227:
---

 Summary: LLAP: better column vector object pools
 Key: HIVE-12227
 URL: https://issues.apache.org/jira/browse/HIVE-12227
 Project: Hive
  Issue Type: Bug
Reporter: Gopal V
Assignee: Sergey Shelukhin


Vector allocations become a problem in sub-second cases. The pool of 8 per 
request is too small. Needs to be bigger and potentially pre-populated if we 
can do it off the main path.





[ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Sergey Shelukhin
The Apache Hive PMC has voted to make Siddharth Seth a committer on the
Apache Hive Project.

Please join me in congratulating Sid!

Thanks,
Sergey.



Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Jimmy Xiang
Congrats!!

On Wed, Oct 21, 2015 at 2:29 PM, Chaoyu Tang  wrote:

> Congratulations to Siddharth!
>
>
> On Wed, Oct 21, 2015 at 5:25 PM, Xuefu Zhang  wrote:
>
> > Congratulations, Siddharth!
> >
> > On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin <
> ser...@hortonworks.com>
> > wrote:
> >
> > > The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> > > Apache Hive Project.
> > >
> > > Please join me in congratulating Sid!
> > >
> > > Thanks,
> > > Sergey.
> > >
> > >
> >
>


Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Jesus Camacho Rodriguez
Congrats Sid!

--
Jesús




On 10/21/15, 2:37 PM, "Jimmy Xiang"  wrote:

>Congrats!!
>
>On Wed, Oct 21, 2015 at 2:29 PM, Chaoyu Tang  wrote:
>
>> Congratulations to Siddharth!
>>
>>
>> On Wed, Oct 21, 2015 at 5:25 PM, Xuefu Zhang  wrote:
>>
>> > Congratulations, Siddharth!
>> >
>> > On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin <
>> ser...@hortonworks.com>
>> > wrote:
>> >
>> > > The Apache Hive PMC has voted to make Siddharth Seth a committer on the
>> > > Apache Hive Project.
>> > >
>> > > Please join me in congratulating Sid!
>> > >
>> > > Thanks,
>> > > Sergey.
>> > >
>> > >
>> >
>>


[jira] [Created] (HIVE-12228) Hive Error When query nested query with UDF returns Struct type

2015-10-21 Thread Wenlei Xie (JIRA)
Wenlei Xie created HIVE-12228:
-

 Summary: Hive Error When query nested query with UDF returns 
Struct type
 Key: HIVE-12228
 URL: https://issues.apache.org/jira/browse/HIVE-12228
 Project: Hive
  Issue Type: Bug
  Components: Hive, Query Planning, UDF
Affects Versions: 0.13.1
Reporter: Wenlei Xie


The following simple nested query with a UDF that returns a Struct fails on Hive 
0.13.1. The UDF Java code is attached.

{noformat}
ADD JAR simplestruct.jar;
CREATE TEMPORARY FUNCTION simplestruct AS 'test.SimpleStruct';

SELECT *
  FROM (
SELECT *
from mytest
 ) subquery
WHERE simplestruct(subquery.testStr).first
{noformat}

The error message is 
{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error 
while processing row {"testint":1,"testname":"haha","teststr":"hehe"}
at 
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:549)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: java.lang.RuntimeException: cannot find field teststr from [0:_col0, 
1:_col1, 2:_col2]
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
at 
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:150)
..
{noformat}

The query works fine if we replace the UDF with one that returns a Boolean. By 
comparing the query plans, we note that when using the {{SimpleStruct}} UDF, the 
query plan is 
{noformat}
  TableScan
Select Operator
  Filter Operator
Select Operator
{noformat}
The first Select Operator renames the columns to {{col_k}}, which causes this 
trouble. If we use a UDF that returns a Boolean, the query plan becomes 
{noformat}
  TableScan
Filter Operator
  Select Operator
{noformat}

It looks like the Query Planner fails to push down the Filter Operator when the 
predicate is based on a UDF that returns a Struct. 

This bug was fixed in Hive 1.2.1, but we cannot find the ticket that fixed it.


Appendix: 
The table {{mytest}} is created in the following way
{noformat}
CREATE TABLE mytest(testInt INT, testName STRING, testStr STRING) ROW FORMAT 
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE mytest;
{noformat}
The file {{test.txt}} is a simple CSV file.
{noformat}
1,haha,hehe
2,my,test
{noformat}
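The stack trace above boils down to a field-name lookup against a renamed schema: the inner Select renames the columns to {{_col0}}..{{_col2}}, so a downstream operator resolving the original name {{teststr}} cannot find it. A toy plain-Java sketch of that failure mode (FieldRefSketch is invented for illustration; this is not Hive's ObjectInspectorUtils):

```java
import java.util.Arrays;
import java.util.List;

// Resolve a field name against a schema (list of column names), failing
// the same way getStandardStructFieldRef does when the name is absent.
class FieldRefSketch {
    static int fieldIndex(List<String> schema, String name) {
        int i = schema.indexOf(name);
        if (i < 0) {
            throw new RuntimeException(
                "cannot find field " + name + " from " + schema);
        }
        return i;
    }
}
```

Against the original schema the lookup succeeds; against the renamed schema it throws, mirroring the "cannot find field teststr from [0:_col0, 1:_col1, 2:_col2]" error.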





Re: Review Request 38862: HIVE-11985 handle long typenames from Avro schema in metastore

2015-10-21 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38862/
---

(Updated Oct. 21, 2015, 10:54 p.m.)


Review request for hive, Ashutosh Chauhan and Xuefu Zhang.


Repository: hive-git


Description
---

see jira


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 0fcd39b 
  metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 
12f3f16 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 076791f 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 4058606 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 9f9b5bc 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 3d1ca93 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 3262887 
  serde/src/java/org/apache/hadoop/hive/serde2/AbstractSerDe.java c5e78c5 
  serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerDe.java 0e4e4c6 

Diff: https://reviews.apache.org/r/38862/diff/


Testing
---


Thanks,

Sergey Shelukhin



Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Prasanth Jayachandran
Congratulations Sid!

- Prasanth
> On Oct 21, 2015, at 3:51 PM, Jesus Camacho Rodriguez 
>  wrote:
> 
> Congrats Sid!
> 
> --
> Jesús
> 
> 
> 
> 
> On 10/21/15, 2:37 PM, "Jimmy Xiang"  wrote:
> 
>> Congrats!!
>> 
>> On Wed, Oct 21, 2015 at 2:29 PM, Chaoyu Tang  wrote:
>> 
>>> Congratulations to Siddharth!
>>> 
>>> 
>>> On Wed, Oct 21, 2015 at 5:25 PM, Xuefu Zhang  wrote:
>>> 
 Congratulations, Siddharth!
 
 On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin <
>>> ser...@hortonworks.com>
 wrote:
 
> The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> Apache Hive Project.
> 
> Please join me in congratulating Sid!
> 
> Thanks,
> Sergey.
> 
> 
 
>>> 



Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Prasanth Jayachandran
Congratulations!

 - Prasanth

> On Oct 21, 2015, at 2:45 PM, Sergey Shelukhin  wrote:
> 
> Congrats!
> 
> On 15/10/21, 14:30, "Chaoyu Tang"  wrote:
> 
>> Congratulations to Aihua.
>> 
>> On Wed, Oct 21, 2015 at 5:12 PM, Zhuoluo (Clark) Yang
>> >> wrote:
>> 
>>> Congrats
>>> 
>>> Thanks,
>>> Zhuoluo (Clark) Yang
>>> 
>>> On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:
>>> 
 The Apache Hive PMC has voted to make Aihua Xu a committer on the
>>> Apache
 Hive Project.
 
 Please join me in congratulating Aihua!
 
 Thanks,
 Szehon
 
>>> 
> 



Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark

2015-10-21 Thread Xuefu Zhang
Sounds good. Thanks.

On Wed, Oct 21, 2015 at 10:31 AM, Andrew Lee  wrote:

> Hi Xuefu,
>
> https://issues.apache.org/jira/browse/HIVE-1
>
> created. Please advise if the subject and the fields are appropriate and
> feel free to update them to make it more standard for the community. I'll
> follow up in that JIRA ticket for discussion, thanks.
>
> 
> From: Andrew Lee
> Sent: Wednesday, October 21, 2015 10:25 AM
> To: dev@hive.apache.org
> Subject: Re: Hard Coded 0 to assign RPC Server port number when
> hive.execution.engine=spark
>
> Hi Xuefu,
>
> Thanks, I'll create a JIRA, by the way, since HiveCLI will be replaced by
> beeline or other design later,
> I'm hoping the same philosophy can be considered if other CLI is using
> RPCServer as well or sharing the same source code at some point.
>
> Shall the Issue Type of the JIRA ticket be "Improvement" or "New Feature" ?
>
> 
> From: Xuefu Zhang 
> Sent: Tuesday, October 20, 2015 6:39 PM
> To: dev@hive.apache.org
> Subject: Re: Hard Coded 0 to assign RPC Server port number when
> hive.execution.engine=spark
>
> Thanks, Andrew! You have a point. However, we're trying to sunset Hive CLI.
> In the meantime, I guess it doesn't hurt to give admin more control over
> the ports to be used. Please put your proposal in a JIRA and we can go from
> there.
>
> --Xuefu
>
> On Tue, Oct 20, 2015 at 7:54 AM, Andrew Lee  wrote:
>
> > Hi Xuefu,
> >
> > 2 Main reasons.
> >
> > - Most users (what I see and encounter) use HiveCLI as a command line
> > tool, and in order to use that, they need to login to the edge node (via
> > SSH). Now, here comes the interesting part.
> > Could be true or not, but this is what I observe and encounter from time
> > to time. Most users will abuse the resource on that edge node (increasing
> > HADOOP_HEAPSIZE, dumping output to local disk, running huge python
> > workflow, etc), this may cause the HS2 process to run into OOME, choke
> and
> > die, etc. various resource issues including others like login, etc.
> >
> > - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly
> > available. This makes sense to run it on the gateway node or a service
> node
> > and separated from the HiveCLI.
> > The logs are located in different location, monitoring and auditing is
> > easier to run HS2 with a daemon user account, etc. so we don't want users
> > to run HiveCLI where HS2 is running.
> > It's better to isolate the resource this way to avoid any memory, file
> > handlers, disk space, issues.
> >
> > From a security standpoint,
> >
> > - Since users can login to edge node (via SSH), the security on the edge
> > node needs to be fortified and enhanced. Therefore, all the FW comes in
> and
> > auditing.
> >
> > - Regulation/compliance for auditing is another requirement to monitor
> all
> > traffic, specifying ports and locking down the ports makes it easier
> since
> > we can focus
> > on a range to monitor and audit.
> >
> > Hope this explains the reason why we are asking for this feature.
> >
> >
> > 
> > From: Xuefu Zhang 
> > Sent: Monday, October 19, 2015 9:37 PM
> > To: dev@hive.apache.org
> > Subject: Re: Hard Coded 0 to assign RPC Server port number when
> > hive.execution.engine=spark
> >
> > Hi Andrew,
> >
> > I understand your policy on edge node. However, I'm wondering why you
> > cannot require that Hive CLI run only on gateway nodes, similar to HS2?
> In
> > essence, Hive CLI is a client with embedded hive server, so it seems
> > reasonable to have a similar requirement as it for HS2.
> >
> > I'm not defending against your request. Rather, I'm interested in the
> > rationale behind your policy.
> >
> > Thanks,
> > Xuefu
> >
> > On Mon, Oct 19, 2015 at 9:12 PM, Andrew Lee  wrote:
> >
> > > Hi Xuefu,
> > >
> > > I agree for HS2 since HS2 usually runs on a gateway or service node
> > inside
> > > the cluster environment.
> > > In my case, it is actually additional security.
> > > A separate edge node (not running HS2, HS2 runs on another box) is used
> > > for HiveCLI.
> > > We don't allow data/worker nodes to talk to the edge node on random
> > ports.
> > > All ports must be registered or explicitly specified and monitored.
> > > That's why I am asking for this feature. Otherwise, opening up
> 1024-65535
> > > from data/worker node to edge node is actually
> > > a bad idea and bad practice for network security.  :(
> > >
> > >
> > >
> > > 
> > > From: Xuefu Zhang 
> > > Sent: Monday, October 19, 2015 1:12 PM
> > > To: dev@hive.apache.org
> > > Subject: Re: Hard Coded 0 to assign RPC Server port number when
> > > hive.execution.engine=spark
> > >
> > > Hi Andrew,
> > >
> > > RpcServer is an instance launched for each user session. In case of
> 
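The configurable-port-range behavior requested in this thread can be sketched as follows (a hypothetical plain-Java illustration — PortRangeBinder is invented and this is not Hive's actual RpcServer code): instead of binding to port 0 and receiving an arbitrary ephemeral port, try each port in a configured range and bind to the first free one.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Bind to the first free port in [low, high], so firewall rules and
// audit policies can be scoped to a known, registered port range.
class PortRangeBinder {
    static ServerSocket bindInRange(int low, int high) throws IOException {
        for (int port = low; port <= high; port++) {
            try {
                return new ServerSocket(port); // succeeds on a free port
            } catch (IOException e) {
                // port already in use; try the next one in the range
            }
        }
        throw new IOException("No free port in range " + low + "-" + high);
    }
}
```

A production version would also need to handle the race where a port is taken between the check and the bind by the eventual user of the socket, but the range-scan shape is the core of the request.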

Re: Review Request 39199: HIVE-12084 : Hive queries with ORDER BY and large LIMIT fails with OutOfMemoryError Java heap space

2015-10-21 Thread Hari Sankar Sivarama Subramaniyan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39199/
---

(Updated Oct. 22, 2015, 12:41 a.m.)


Review request for hive, Ashutosh Chauhan and John Pullokkaran.


Changes
---

After discussing with Ashutosh and John, I have made the change in 
TopNHash.initialize() to compute the threshold better.


Repository: hive-git


Description
---

Please look at https://issues.apache.org/jira/browse/HIVE-12084


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java 484006a 
  ql/src/test/queries/clientpositive/topn.q PRE-CREATION 
  ql/src/test/results/clientpositive/topn.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/39199/diff/


Testing
---


Thanks,

Hari Sankar Sivarama Subramaniyan
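The threshold idea behind the change can be sketched in plain Java (illustrative only — TopNSketch and its numbers are invented, not the actual TopNHash.initialize() logic): keep an in-memory top-N structure only when N rows are estimated to fit the memory budget, and evict the current worst row whenever the structure exceeds the limit, so a large LIMIT never grows the heap unboundedly.

```java
import java.util.PriorityQueue;

// Sketch of bounding top-N by a memory budget: fitsInMemory() is the
// threshold check (disable top-N hashing when the limit is too large),
// and topN() keeps only the `limit` smallest values seen so far.
class TopNSketch {
    static boolean fitsInMemory(long limit, long rowSize, long memBudget) {
        return limit * rowSize <= memBudget; // estimated footprint check
    }

    static PriorityQueue<Integer> topN(int[] values, int limit) {
        // Max-heap over the retained values: the root is the current worst.
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> b - a);
        for (int v : values) {
            heap.add(v);
            if (heap.size() > limit) {
                heap.poll(); // evict the largest, keeping the smallest N
            }
        }
        return heap;
    }
}
```

With a huge LIMIT the threshold check fails and the operator would fall back to a full sort instead of buffering, which is the OOM scenario the JIRA describes.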



Re: [ANNOUNCE] New Hive Committer- Aihua Xu

2015-10-21 Thread Pengcheng Xiong
Congrats Aihua!

On Wed, Oct 21, 2015 at 2:09 PM, Szehon Ho  wrote:

> The Apache Hive PMC has voted to make Aihua Xu a committer on the Apache
> Hive Project.
>
> Please join me in congratulating Aihua!
>
> Thanks,
> Szehon
>


Re: Review Request 39426: CBO: Calcite Operator To Hive Operator (Calcite Return Path): fix udaf_percentile_approx_23.q

2015-10-21 Thread pengcheng xiong

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39426/
---

(Updated Oct. 21, 2015, 11:56 p.m.)


Review request for hive and Ashutosh Chauhan.


Repository: hive-git


Description
---

Due to a type conversion problem.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java 
0200506 
  ql/src/test/queries/clientpositive/cbo_rp_udaf_percentile_approx_23.q 
PRE-CREATION 
  ql/src/test/results/clientpositive/cbo_rp_udaf_percentile_approx_23.q.out 
PRE-CREATION 

Diff: https://reviews.apache.org/r/39426/diff/


Testing
---


Thanks,

pengcheng xiong



Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Pengcheng Xiong
Congrats Sid!

On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin 
wrote:

> The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> Apache Hive Project.
>
> Please join me in congratulating Sid!
>
> Thanks,
> Sergey.
>
>


[jira] [Created] (HIVE-12229) Custom script in query cannot be executed in yarn-cluster mode [Spark Branch].

2015-10-21 Thread Lifeng Wang (JIRA)
Lifeng Wang created HIVE-12229:
--

 Summary: Custom script in query cannot be executed in yarn-cluster 
mode [Spark Branch].
 Key: HIVE-12229
 URL: https://issues.apache.org/jira/browse/HIVE-12229
 Project: Hive
  Issue Type: Bug
  Components: Spark
Affects Versions: 1.1.0
Reporter: Lifeng Wang


Added one Python script to the query; the script cannot be found during 
execution in yarn-cluster mode.
{noformat}
15/10/21 21:10:55 INFO exec.ScriptOperator: Executing [/usr/bin/python, 
q2-sessionize.py, 3600]
15/10/21 21:10:55 INFO exec.ScriptOperator: tablename=null
15/10/21 21:10:55 INFO exec.ScriptOperator: partname=null
15/10/21 21:10:55 INFO exec.ScriptOperator: alias=null
15/10/21 21:10:55 INFO spark.SparkRecordHandler: processing 10 rows: used 
memory = 324896224
15/10/21 21:10:55 INFO exec.ScriptOperator: ErrorStreamProcessor calling 
reporter.progress()
/usr/bin/python: can't open file 'q2-sessionize.py': [Errno 2] No such file or 
directory
15/10/21 21:10:55 INFO exec.ScriptOperator: StreamThread OutputProcessor done
15/10/21 21:10:55 INFO exec.ScriptOperator: StreamThread ErrorProcessor done
15/10/21 21:10:55 INFO spark.SparkRecordHandler: processing 100 rows: used 
memory = 325619920
15/10/21 21:10:55 ERROR exec.ScriptOperator: Error in writing to script: Stream 
closed
15/10/21 21:10:55 INFO exec.ScriptOperator: The script did not consume all 
input data. This is considered as an error.
15/10/21 21:10:55 INFO exec.ScriptOperator: set 
hive.exec.script.allow.partial.consumption=true; to ignore it.
15/10/21 21:10:55 ERROR spark.SparkReduceRecordHandler: Fatal error: 
org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing row 
(tag=0) 
{"key":{"reducesinkkey0":2,"reducesinkkey1":3316240655},"value":{"_col0":5529}}
org.apache.hadoop.hive.ql.metadata.HiveException: Error while processing row 
(tag=0) 
{"key":{"reducesinkkey0":2,"reducesinkkey1":3316240655},"value":{"_col0":5529}}
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:340)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:289)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An 
error occurred while reading or writing to your custom script. It may have 
crashed with an error.
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:453)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:331)
... 14 more
{noformat}






Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Chetna C
Congratulations !!
On Oct 22, 2015 5:13 AM, "Pengcheng Xiong"  wrote:

> Congrats Sid!
>
> On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin 
> wrote:
>
> > The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> > Apache Hive Project.
> >
> > Please join me in congratulating Sid!
> >
> > Thanks,
> > Sergey.
> >
> >
>


Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

2015-10-21 Thread Matthew McCline

Congratulations!


From: Chetna C 
Sent: Wednesday, October 21, 2015 8:27 PM
To: dev@hive.apache.org
Cc: Siddharth Seth
Subject: Re: [ANNOUNCE] New Hive Committer - Siddharth Seth

Congratulations !!
On Oct 22, 2015 5:13 AM, "Pengcheng Xiong"  wrote:

> Congrats Sid!
>
> On Wed, Oct 21, 2015 at 2:14 PM, Sergey Shelukhin 
> wrote:
>
> > The Apache Hive PMC has voted to make Siddharth Seth a committer on the
> > Apache Hive Project.
> >
> > Please join me in congratulating Sid!
> >
> > Thanks,
> > Sergey.
> >
> >
>


Temporary table problem

2015-10-21 Thread 上海_技术部_架构部_基础架构_邓林钢
Hi,all
   I created 3 tables (shown in color in the original message), a_110, b_110, and 
c_110; the script then becomes

SELECT count(*)
FROM testtmp.a_110 a
LEFT OUTER JOIN testtmp.b_110 b ON a.cms_id = b.cms_id AND a.pltfm_id = 
b.pltfm_id
LEFT OUTER JOIN testtmp.c_110 c ON b.cms_id = c.cms_id AND b.categ_lvl2_id = 
c.categ_lvl2_id AND b.pltfm_id = c.pltfm_id
LEFT OUTER JOIN dw.dim_cms dim ON a.cms_id= dim.cms_id
and GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-10-18')
where GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
  AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-09-01') ;

--11524


But a problem arises: the two scripts' results are different. Can anyone tell me 
the reason? Thanks.


SELECT count(*)
FROM
  (SELECT nav_page_value AS cms_id,
  pltfm_id,
  COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN t.sessn_id ELSE 
NULL END)) AS cms_vstrs,
  COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0
 AND t.nav_next_tracker_id > 0 THEN t.sessn_id ELSE 
NULL END)) AS cms_click_vstrs,
  COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN t.nav_tracker_id 
ELSE NULL END)) AS cms_pv,
  COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0
 AND t.sessn_pv > 1 THEN t.sessn_id ELSE NULL END)) AS 
cms_sec_vstrs,
  COUNT(DISTINCT(CASE WHEN (t.detl_tracker_id > 0
AND (length(t.detl_button_position) = 0
 OR t.detl_button_position IS NULL
 OR t.detl_button_position = 'null'))
 OR (t.cart_tracker_id > 0
 AND length(t.detl_tracker_id) = 0) THEN t.sessn_id 
ELSE NULL END)) AS cms_detl_vstrs,
  COUNT(DISTINCT(CASE WHEN t.nav_tracker_id > 0 THEN t.ordr_code ELSE 
NULL END)) AS cms_ordr_num
   FROM dw.fct_traffic_cms_detl t
   WHERE ds= '2015-10-18'
 AND t.nav_page_value IS NOT NULL
   GROUP BY nav_page_value,
pltfm_id) a
LEFT OUTER JOIN
  (SELECT nvl(t3.mg_brand_id,-99) AS mg_brand_id,
  nvl(t1.nav_page_value,-1) AS cms_id,
  hc.categ_lvl2_id ,
  t1.pltfm_id,
  nvl(COUNT(DISTINCT(CASE WHEN t1.detl_pv > 0
 OR dirct_cart_pv>0 THEN t1.sessn_id ELSE NULL 
END)), 0) AS detl_vstrs,
  COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN parnt_ordr_id 
ELSE NULL END)) AS ordr_num,
  COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN 
t1.end_user_id ELSE NULL END)) AS cust_num
   FROM dw.fct_traffic_prdt_cart_path t1 LEFT
   OUTER JOIN dw.dim_prod t2 ON t1.prod_id = t2.prod_id
   AND t2.cur_flag = 1 LEFT
   OUTER JOIN dw.hier_categ hc ON t2.categ_lvl_id = hc.categ_lvl_id INNER
   JOIN dw.dim_mrchnt b ON t1.mrchnt_id = b.mrchnt_id
   AND b.cur_flag = 1 LEFT
   OUTER JOIN dw.dim_brand t3 ON t2.brand_id = t3.brand_id
   AND t3.cur_flag = 1
   WHERE t1.ds = '2015-10-18'
 AND b.biz_unit = 1
 AND t1.nav_page_categ_id = 1
 AND t1.nav_page_value>0
   GROUP BY nvl(t3.mg_brand_id,-99),
nvl(t1.nav_page_value,-1) ,
hc.categ_lvl2_id ,
t1.pltfm_id) b ON a.cms_id = b.cms_id
AND a.pltfm_id = b.pltfm_id
LEFT OUTER JOIN
  (SELECT nvl(t1.nav_page_value,-1) AS cms_id,
  hc.categ_lvl2_id ,
  t1.pltfm_id,
  nvl(COUNT(DISTINCT(CASE WHEN t1.detl_pv > 0
 OR dirct_cart_pv>0 THEN t1.sessn_id ELSE NULL 
END)), 0) AS categ_lvl2_cms_detl_vstrs,
  COUNT(DISTINCT(CASE WHEN t1.ordr_tranx_activ_flg=1 THEN parnt_ordr_id 
ELSE NULL END)) AS categ_lvl2_cms_ordr_num
   FROM dw.fct_traffic_prdt_cart_path t1 LEFT
   OUTER JOIN dw.dim_prod t2 ON t1.prod_id = t2.prod_id
   AND t2.cur_flag = 1 LEFT
   OUTER JOIN dw.hier_categ hc ON t2.categ_lvl_id = hc.categ_lvl_id INNER
   JOIN dw.dim_mrchnt b ON t1.mrchnt_id = b.mrchnt_id
   AND b.cur_flag = 1 LEFT
   OUTER JOIN dw.dim_brand t3 ON t2.brand_id = t3.brand_id
   AND t3.cur_flag = 1
   WHERE t1.ds = '2015-10-18'
 AND b.biz_unit = 1
 AND t1.nav_page_categ_id = 1
 AND t1.nav_page_value>0
   GROUP BY nvl(t1.nav_page_value,-1) ,
hc.categ_lvl2_id ,
t1.pltfm_id) c ON b.cms_id = c.cms_id
AND b.categ_lvl2_id = c.categ_lvl2_id
AND b.pltfm_id = c.pltfm_id
LEFT OUTER JOIN dw.dim_cms dim ON a.cms_id= dim.cms_id
and GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-10-18')
where GetTimestampFmt(dim.CMS_START_TIME) <= GetTimestampFmt('2015-10-18')
  AND GetTimestampFmt(dim.CMS_END_TIME) >= GetTimestampFmt('2015-09-01') ;



--10723