is there a way to read Hive configurations from the REST APIs?

2015-03-12 Thread Xiaoyong Zhu
Hi experts

Does anyone know whether there is a way to read the Hive configurations from the
REST APIs? The reason is that we are trying to read the Hive configuration in
order to adjust some behaviour accordingly...

Xiaoyong



RE: is there a way to read Hive configurations from the REST APIs?

2015-03-12 Thread Mich Talebzadeh
You can use the *set* command in Hive to list the current configuration values. You
can also do the same through Beeline.
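For example (the variable names are only illustrative; any Hive configuration
variable can be queried the same way):

-- list every configuration variable and its current value
set;
-- print a single variable
set hive.execution.engine;
-- override it for the current session
set hive.execution.engine=mr;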

 

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com




Bucket pruning

2015-03-12 Thread Daniel Haviv
Hi,
We created a bucketed table and when we select in the following way:
select *
from testtble
where bucket_col ='X';

We observe that all of the table is being read and not just the
specific bucket.

Does Hive support such a feature?


Thanks,
Daniel


Re: is there a way to read Hive configurations from the REST APIs?

2015-03-12 Thread Lefty Leverenz
See these wikidocs for the *set* command: Commands & Beeline Hive Commands.

-- Lefty



Hive map-join

2015-03-12 Thread Guillermo Ortiz
Hello,

I'm executing a join of two tables:
- table1 is about 130 GB
- table2 is about 1.5 GB

In HDFS, table1 is a single text file and table2 consists of ten files. I'd
like to execute a map join and load table2 in memory.

use esp;
set hive.auto.convert.join=true;
-- set hive.auto.convert.join.noconditionaltask=true;
-- I tried this one to force the map join, but I don't think I know how to use it.
-- set hive.auto.convert.join.noconditionaltask.size=100;

-- Although the MAPJOIN hint shouldn't be necessary, I have tried with and without it.
SELECT /*+ MAPJOIN(table2) */ DISTINCT
t1.c1,
t1.c2,
t2.c3,
t2.c4
FROM table2 t1
RIGHT JOIN table1 t2
ON (t1.c1 = t2.c3)
AND (t1.c5 = t2.c5)
WHERE t2.xx = 'XX'
LIMIT 10;

This query creates 11 map tasks. Ten of them take about 15 seconds and one of
them takes 2 hours, so I guess that one map loads the 130 GB file to make the
join. Why doesn't Hive split that file? What am I doing wrong with this query?
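For reference, a rough sketch of the settings that usually matter here; the
values are illustrative only, not tuned for these table sizes:

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- small-table threshold in bytes; it must exceed the size of table2 (~1.5 GB)
-- for the automatic map join to be chosen
set hive.auto.convert.join.noconditionaltask.size=2000000000;
-- if one large file ends up in a single map task, lowering the maximum split
-- size can increase parallelism (older Hadoop versions use mapred.max.split.size)
set mapreduce.input.fileinputformat.split.maxsize=268435456;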


insert overwrite local question

2015-03-12 Thread Garry Chen
Hi,
I am using hive-1.1.0.  Is there a way to have the insert
overwrite local statement write to a given file name like syz.txt instead of
the default name 0_0?  Thank you very much for your input.

Garry
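For reference, a minimal sketch of the statement in question ('/tmp/export' and
some_table are placeholders). As far as I know, Hive picks the output file names
itself, so the usual workaround is to rename the exported file afterwards:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM some_table;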


Re: Any getting-started with UDAF development

2015-03-12 Thread Jason Dere
I think the Java code is likely still fine for the examples (though someone who 
actually knows about UDAFs might want to correct me here).
If anything is out of date, it would be the build/test commands, which have 
switched from using ant to maven.

Jason

On Mar 11, 2015, at 4:01 AM, shahab <shahab.mok...@gmail.com> wrote:

Hi,

I would appreciate it if anyone could point me to a getting-started tutorial on
developing a custom UDAF.

I found this one 
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy

But it does not seem to be up to date for Hive 0.12.0/0.13.0.

Any help is appreciated.

thanks,
/Shahab



filter on bucketed column

2015-03-12 Thread cobby cohen
Bucketed columns seem great, but I don't understand why they are used only for
optimizing joins and not for where clauses (filters). I have a huge table
(billions of records) which includes a field with medium cardinality
(~100,000). Users usually filter on that field (at least). Using partitions, or
a full table scan, are both inefficient. Hash partitioning, or bucketing, seems
to be the way to go. I saw HIVE-5831, but it seems the solution is not going
into trunk for some reason. Any comments?

Thanks.

Re: Bucket pruning

2015-03-12 Thread Gopal Vijayaraghavan
Hi,

No, and it's a shame because we're stuck on some compatibility details with
this.

The primary issue is the fact that the InputFormat is very generic and
offers no way to communicate StorageDescriptor or bucketing.

The split generation for something like SequenceFileInputFormat lives inside
MapReduce, where it has no idea about bucketing.

So InputFormat.getSplits(conf) returns something relatively arbitrary,
which contains a mixture of files when CombineInputFormat is turned on.

I have implemented this twice so far for ORC (for custom Tez jobs, with
huge wins) by using an MRv2 PathFilter over the regular OrcNewInputFormat
implementation, by turning off combine input and using Tez grouping
instead.

But that has proved to be very fragile for a trunk feature, since with
schema evolution of partitioned tables older partitions may be bucketed
with a different count from a newer partition - so the StorageDescriptor
for each partition has to be fetched across before we can generate a valid
PathFilter.

The SARGs are probably a better way to do this eventually as they can
implement IN_BUCKET(1,2) to indicate 1 of 2 instead of the "0_1"
PathFilter, which is fragile.


Right now, the most fool-proof solution we've hit upon was to apply the
ORC bloom filter to the bucket columns, which is far safer as it does not
care about the DDL - but does a membership check on the actual metadata &
prunes deeper at the stripe-level if it is sorted as well.

That is somewhat neat since this doesn't need any new options for querying
- it automatically(*) kicks in for your query pattern.
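A rough sketch of what that looks like at the DDL level; the bucket count and
fpp value are placeholders, and the bloom-filter table properties need a
sufficiently recent Hive/ORC build:

CREATE TABLE testtble (
  bucket_col STRING,
  other_col  STRING
)
CLUSTERED BY (bucket_col) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'bucket_col',
  'orc.bloom.filter.fpp'     = '0.05'
);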

Cheers,
Gopal
(*) - conditions apply - there's a threshold for file-size for these
filters to be evaluated during planning (to prevent HS2 from burning CPU).






Error creating a partitioned view

2015-03-12 Thread Buntu Dev
I've got a 'log' table which is currently partitioned by year, month and day.
I'm looking to create a partitioned view on top of the 'log' table but am
running into this error:



hive> CREATE VIEW log_view PARTITIONED ON (pagename,year,month,day) AS
SELECT pagename year,month,day,uid,properties FROM log;

FAILED: SemanticException [Error 10093]: Rightmost columns in view output
do not match PARTITIONED ON clause



How do I go about creating this view?


Thanks!
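For what it's worth, a hedged sketch of the usual fix, taking the error message
at face value: the PARTITIONED ON columns have to be the rightmost columns of
the view's select list, in the same order:

CREATE VIEW log_view PARTITIONED ON (pagename, year, month, day) AS
SELECT uid, properties, pagename, year, month, day FROM log;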


sqoop import to hive being killed by resource manager

2015-03-12 Thread Steve Howard
Hi All,

We have not been able to get what is in the subject line to run.  This is
on Hive 0.14.  While pulling a billion-row table from Oracle using 12
splits on the primary key, each map task continually runs out of memory, as
shown below...

15/03/13 00:22:23 INFO mapreduce.Job: Task Id :
attempt_1426097251374_0011_m_11_0, Status : FAILED
Container [pid=27919,containerID=container_1426097251374_0011_01_13] is
running beyond physical memory limits. Current usage: 513.5 MB of 512 MB
physical memory used; 879.3 MB of 1.0 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1426097251374_0011_01_13 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 28078 27919 27919 27919 (java) 63513 834 912551936 131129
/usr/jdk64/jdk1.7.0_45/bin/java -server -XX:NewRatio=8
-Djava.net.preferIPv4Stack=true -Dhdp.version=2.2.0.0-2041 -Xmx410m
-Djava.io.tmpdir=/mnt/hdfs/hadoop/yarn/local/usercache/hdfs/appcache/application_1426097251374_0011/container_1426097251374_0011_01_13/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/mnt/hdfs/hadoop/yarn/log/application_1426097251374_0011/container_1426097251374_0011_01_13
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
org.apache.hadoop.mapred.YarnChild 172.27.2.57 52335
attempt_1426097251374_0011_m_11_0 13
|- 27919 27917 27919 27919 (bash) 1 2 9424896 317 /bin/bash -c
/usr/jdk64/jdk1.7.0_45/bin/java -server -XX:NewRatio=8
-Djava.net.preferIPv4Stack=true -Dhdp.version=2.2.0.0-2041 -Xmx410m
-Djava.io.tmpdir=/mnt/hdfs/hadoop/yarn/local/usercache/hdfs/appcache/application_1426097251374_0011/container_1426097251374_0011_01_13/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/mnt/hdfs/hadoop/yarn/log/application_1426097251374_0011/container_1426097251374_0011_01_13
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
org.apache.hadoop.mapred.YarnChild 172.27.2.57 52335
attempt_1426097251374_0011_m_11_0 13
1>/mnt/hdfs/hadoop/yarn/log/application_1426097251374_0011/container_1426097251374_0011_01_13/stdout
2>/mnt/hdfs/hadoop/yarn/log/application_1426097251374_0011/container_1426097251374_0011_01_13/stderr

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

We have tried several different sizes for various switches, but the job
always fails.

Is this simply a function of the data, or is there another issue?

Thanks,

Steve


Re: when start hive could not generate log file

2015-03-12 Thread Jianfeng (Jeff) Zhang

By default, hive.log is located in /tmp/${user}/hive.log


Best Regard,
Jeff Zhang


From: zhangjp <smart...@hotmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, March 11, 2015 at 7:12 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: when start hive could not generate log file

When I run the command "hive", the messages are as follows:
[@xxx/]# hive
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.hive.common.LogUtils).
log4j:WARN Please initialize the log4j system properly.
Logging initialized using configuration in 
file:/search/apache-hive-0.13.1-bin/conf/hive-log4j.properties

My hive-log4j.properties uses the default template, but when I run "find -name
hive.log" I couldn't find any file.


Question on hive query correlation optimization

2015-03-12 Thread canan chen
I ran the following SQL with the MR engine and found that it invokes 3 MR
jobs. But to my understanding, the join and group-by operators could be done
in the same MR job since they use the same key. So I am not sure why there are
still 3 MR jobs; does anyone know? Thanks



select s2.name,count(1) as cnt from student_s2 s2 join student_s3 s3 on
s2.name=s3.name group by s2.name order by cnt limit 3;
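One hedged note: the correlation optimizer (HIVE-2206) is disabled by default
in many releases, so Hive will not try to merge the join and the group-by into
one MR job unless it is switched on:

set hive.optimize.correlation=true;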


inserting dynamic partitions - need more reducers

2015-03-12 Thread Alex Bohr
I'm inserting from an unpartitioned table with 6 hours of data into a
table partitioned by hour.

The source table is 400M rows and 500 GB, so it needs a lot of reducers
working on the data - Hive chose 544, which sounds good.

But 538 reducers did nothing and the other 6 are working for over an hour
with all the data.

I see from running explain on the query:
Map-reduce partition columns: _col54 (type: int), _col55 (type: int),
_col56 (type: int), _col57 (type: int)

which are the partition columns of the destination table (year, month, day,
hour).
That's an unnecessary centralization of work; I don't need each partition
to be written by only one reducer.  Each destination partition should
instead include a bunch of output files from various reducers.  If I wrote
my own M/R job I would use MultipleOutputs and partition on epoch or
something.

So I hacked it and added another column to the destination partition after
the hour column (a random number up to 200).  Now all the reducers are
sharing the work.

*Is there any other way I can get Hive to distribute the work to all
reducers without hacking the table DDL with random columns?*

I'm on Hive 0.13 with Beeline and HiveServer2 and start the query off with
the settings:
set  hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Thanks


Re: inserting dynamic partitions - need more reducers

2015-03-12 Thread Prasanth Jayachandran
Hi

Can you try with hive.optimize.sort.dynamic.partition set to false?
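For example, combined with the dynamic-partition settings already quoted in the
original mail:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- stop routing all rows of a partition to a single reducer
set hive.optimize.sort.dynamic.partition=false;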

Thanks
Prasanth






Re: inserting dynamic partitions - need more reducers

2015-03-12 Thread Alex Bohr
YES!
That did it. I will be adding that one to our global config.  Good to see
they defaulted it to false in 0.14.
Thanks Prasanth
