Re: merge small orc files

2015-04-20 Thread Gopal Vijayaraghavan
Hi,

How do I set the configuration in hive-site.xml to automatically merge small
ORC files (output from a mapreduce job) in hive 0.14?

Hive cannot add work-stages to a map-reduce job it did not plan itself.

Hive applies hive.merge.mapfiles=true when it generates a plan, by adding
the merge work to that plan as a conditional task.

-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
/apps/hive/warehouse/coordinate/zone=2/part-r-0

This looks like it was written by an MRv2 reducer and not by the Hive
FileSinkOperator, i.e. it was handled by the MR OutputCommitter instead of
the Hive MoveTask.

But 0.14 has an option which helps: "hive.merge.orcfile.stripe.level". If
that is true (as in your setting), then run

"ALTER TABLE <table> CONCATENATE"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining the metadata linkage of start/end offsets in the footer.
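
For a partitioned table like the one in your listing, the same thing can be
done per partition. A sketch, assuming the coordinate table above is
partitioned by zone:

-- Merges the ORC files inside that one partition at stripe level
ALTER TABLE coordinate PARTITION (zone=2) CONCATENATE;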

Cheers,
Gopal




Re: Table Lock Manager: ZooKeeper cluster

2015-04-20 Thread Xuefu Zhang
I'm not a ZooKeeper expert, but ZooKeeper is designed to be light-weight,
high-performance, and fast to respond. Unless your ZooKeeper is already
overloaded, I don't see why you would need a separate ZooKeeper cluster just
for Hive.

There are a few ZooKeeper usages in Hive; the additional stress on ZooKeeper
is determined by the load on your HS2. As user sessions spend most of their
time waiting on query execution, I don't expect the additional stress on your
ZooKeeper to be significant.

As a general practice, though, you do need to test it out before putting it in
production.
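
For reference, the Table Lock Manager side of this is just a few hive-site.xml
properties pointed at whichever quorum you choose. A minimal sketch, with
placeholder host names for your environment:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <!-- Point Hive at the existing (e.g. HBase) ZooKeeper ensemble; hosts are placeholders -->
  <name>hive.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
</property>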

On Fri, Apr 17, 2015 at 1:56 PM, Eduardo Ferreira eafon...@gmail.com
wrote:

 Hi there,

 I read on the Hive installation documentation that we need to have a
 ZooKeeper cluster setup to support Table Lock Manager (Cloudera docs link
 below).

 As we have HBase with a ZooKeeper cluster already, my question is if we
 can use the same ZK cluster for Hive.
 Is that recommended?
 What kind of load and constraints would this put on the HBase ZK cluster?


 http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hiveserver2_configure.html

 Thanks in advance.
 Eduardo.



Re: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred when node tried to create too many dynamic partitions on small dataset with dynamic partitions

2015-04-20 Thread Daniel Harper
In our case we’ve chosen 128 buckets, but that’s just an arbitrary figure we
picked to get a good, even distribution.

To fix the issue we were having with the small file we just updated the setting
hive.exec.max.dynamic.partitions.pernode to 1, so that if we do run a tiny
file (very rarely) which only allocates one reducer, we can be sure we don’t
run into this issue again.
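
As a sketch, the session settings involved look roughly like this; the limit
below is an illustrative value, not necessarily the one to use:

-- Enable dynamic partitioning and raise the per-node dynamic partition limit
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;  -- illustrative value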

With thanks,

Daniel Harper
Software Engineer, OTG ANT
BC5 A5

From: Mich Talebzadeh m...@peridale.co.uk
Reply-To: user@hive.apache.org
Date: Friday, 17 April 2015 10:18
To: user@hive.apache.org
Subject: RE: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred 
when node tried to create too many dynamic partitions on small dataset with 
dynamic partitions

Hi Lefty,

I took a look at the documentation link and I noticed that it can be improved. 
For example the paragraph below:


“How does Hive distribute the rows across the buckets? In general, the bucket 
number is determined by the expression hash_function(bucketing_column) mod 
num_buckets. (There's a '0x7FFF in there too, but that's not that 
important). The hash_function depends on the type of the bucketing column. For 
an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and 
there were 10 buckets, we would expect all user_id's that end in 0 to be in 
bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other 
datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the 
same as the BIGINT. And the hash of a string or a complex datatype will be some 
number that's derived from the value, but not anything humanly-recognizable. 
For example, if user_id were a STRING, then the user_id's in bucket 1 would 
probably not end in 0. In general, distributing rows based on the hash will 
give you a even distribution in the buckets.
So, what can go wrong? As long as you set hive.enforce.bucketing = true, and 
use the syntax above, the tables should be populated properly. Things can go 
wrong if the bucketing column type is different during the insert and on read, 
or if you manually cluster by a value that's different from the table 
definition.”
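
As a rough illustration of that formula (ignoring the mask mentioned in the
quote), the bucket a given value would land in can be previewed with something
like the following, where the table name is hypothetical:

-- bucket = hash(bucketing_column) mod num_buckets; 10 matches the 10-bucket example
SELECT user_id, pmod(hash(user_id), 10) AS bucket FROM some_table;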

So in a nutshell, num_buckets determines the granularity of hashing and the
number of files. Eventually the table will have, in total, number_partitions x
num_buckets files. The example (not shown above) mentions 256 buckets, but that
is just a number.

It also states “For example, …and there were 10 buckets”. Ten is not a standard
choice. In a nutshell, bucketing is a method to get data “evenly distributed”
over many files; thus one should define num_buckets as a power of two (2^n,
e.g. 2, 4, 8, 16) to achieve the best results and the best clustering.
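
For instance, a bucketed table DDL along those lines (table and column names
are illustrative) might be:

SET hive.enforce.bucketing=true;
-- 16 buckets: a power of two, per the advice above
CREATE TABLE web_users (user_id BIGINT, name STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;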

I will try to see the upper limits on the number of buckets within a partition 
and will get back on that.

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com


From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: 17 April 2015 00:06
To: user@hive.apache.org
Subject: Re: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred 
when node tried to create too many dynamic partitions on small dataset with 
dynamic partitions

If the number of buckets in a partitioned table has a limit, we need to
document it in the wiki.  Currently the example
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables)
shows 256 buckets.

-- Lefty

On Thu, Apr 16, 2015 at 4:35 AM, Daniel Harper daniel.har...@bbc.co.uk wrote:
As in you can only have 32 buckets (rather than 128 in our case?)
With thanks,

Daniel Harper
Software Engineer, OTG ANT
BC5 A5

From: Mich Talebzadeh m...@peridale.co.uk
Reply-To: user@hive.apache.org
Date: 

Re: UDF cannot be found when the query is submitted via templeton

2015-04-20 Thread Eugene Koifman
can you give the complete REST call you are making to submit the query?

From: Xiaoyong Zhu xiaoy...@microsoft.com
Reply-To: user@hive.apache.org
Date: Sunday, April 19, 2015 at 8:23 PM
To: user@hive.apache.org
Subject: RE: UDF cannot be found when the query is submitted via templeton

No, it doesn't...

What surprises me is that for HDInsight (Hadoop on Azure) which is using Azure 
BLOB storage, using

ADD JAR wasb:///test/HiveUDF.jar;
CREATE TEMPORARY FUNCTION FindPat as 'HiveUDF.FindPattern'
select count(FindPat(columnname)) from table1;

would work. However, for my own cluster,

ADD JAR hdfs:///test/HiveUDF.jar;
CREATE TEMPORARY FUNCTION FindPat as 'HiveUDF.FindPattern'
select count(FindPat(columnname)) from table1;

does not work...

Xiaoyong

From: Jason Dere [mailto:jd...@hortonworks.com]
Sent: Saturday, April 18, 2015 1:37 AM
To: user@hive.apache.org
Subject: Re: UDF cannot be found when the query is submitted via templeton

Does fully qualifying the function name (HiveUDF.FindPattern()) in the query 
help here?
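
For example, if the function was created in the default database, the fully
qualified call would look something like:

select count(default.FindPattern(s_sitename)) AS testcol from weblogs;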

On Apr 17, 2015, at 6:44 AM, Xiaoyong Zhu xiaoy...@microsoft.com wrote:


Hi experts

I am trying to use a UDF (I have already registered it in the metastore using
CREATE FUNCTION) as follows.

select count(FindPattern(s_sitename)) AS testcol from weblogs;

However, when I tried to use the UDF from WebHCat (i.e. submit the above 
command via WebHCat), the job always fails saying

Added [/tmp/2cb22c27-72d3-4b41-aea0-655df1192872_resources/HiveUDF.jar] to 
class path
Added resources: [hdfs://PATHTOFOLDER/Portal-Queries/HiveUDF.jar]
FAILED: SemanticException [Error 10011]: Line 1:13 Invalid function FindPattern

If I execute this command through the Hive CLI (via hive -f file or in the
interactive shell), the statement above works. From the log I can see the jar
file is added, but it seems the function cannot be found. Can someone help
share some thoughts on this issue?

Btw, the create function statement is as follows (changing the hdfs URI to a
full path does not work either):

CREATE FUNCTION FindPattern AS 'HiveUDF.FindPattern' USING
JAR 'hdfs:///UDFFolder/HiveUDF.jar';

Thanks in advance!

Xiaoyong




Using Hive as a file comparison and grep-ping tool

2015-04-20 Thread Sanjay Subramanian
hey guys
As data wranglers and programmers we often need quick tools. One such tool I
need almost every day is one that greps a file based on the contents of another
file. One could write this in perl or python, but since I am already using the
hadoop ecosystem extensively, I said why not do this in Hive?
Perhaps you guys already know this and have better solutions... nevertheless
:-) here goes...

Best regards
sanjay(Hive super-fan)
I just posted this on my blog:
https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic:
Using Hive as a file comparison and grep-ping tool
==================================================

1. Log on to the linux terminal where you run Hive queries from.

2. Create a database called myutils in Hive, and create two Hive tables
   myutils.file1 and myutils.file2:
   - each of these tables will have a partition column called fn (fn is short for "filename")
   - each of these tables will have just one data column called ln (ln is short for "line")
   An easy script to do that would be as follows:

   for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';" ; done

3. Create a permanent base location folder in HDFS:

   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1 :
===========
Search whether a bunch of IP addresses exists in another file containing a (larger) bunch of IPs.

[1] The two input files:

    registeredIPs.txt
    10.456.34.90
    123.675.654.1
    21.87.657.456
    234.109.34.234

    visitorIPs.txt
    10.456.34.90
    12.367.54.23
    218.7.657.456
    23.4.109.3

[2] Goal: output which IPs in file1 are present in file2.

[3] Put each file in a separate HDFS location:

    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt
    hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt

[4] Add the partitions to myutils.file1 and myutils.file2. For simplicity keep
    the partition names identical to the file names themselves:

    hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"

[5] Check that the partitions can be accessed by Hive:

    # This should give you the same answer as
    # wc -l registeredIPs.txt
    hive -e "select count(*) from myutils.file1 where fn='registeredIPs.txt'"

    # This should give you the same answer as
    # wc -l visitorIPs.txt
    hive -e "select count(*) from myutils.file2 where fn='visitorIPs.txt'"

[6] Count the IPs in registeredIPs.txt that are also in visitorIPs.txt.

    # This dumps to the local file system
    hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt

    # This dumps to a new internally-managed-by-hive table.
    # Make sure you don't already have some valuable hive table called myutils.registered_in_visitors_list -
    # else this will overwrite that hive table with the results of this hive query.
    hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"

    # This dumps to a directory on HDFS.
    # Make sure you don't already have some valuable directory called registered_in_visitors_list -
    # else this will overwrite that directory and all its contents with the results of this hive query.
    hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"
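
An alternative sketch for step [6], swapping the plain JOIN for a LEFT SEMI
JOIN so that each registered IP is emitted at most once even if it appears
several times in the visitor list:

    # Alternative: LEFT SEMI JOIN keeps only registered IPs that also appear in the visitor list
    hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 LEFT SEMI JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON (trim(f1.ln) = trim(f2.ln))" > ./registered_in_visitors_list_semi.txt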


 
  

merge small orc files

2015-04-20 Thread patcharee

Hi,

How do I set the configuration in hive-site.xml to automatically merge small
ORC files (output from a mapreduce job) in hive 0.14?


This is my current configuration

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.orcfile.stripe.level</name>
  <value>true</value>
</property>

However the output from a mapreduce job, which is stored into an orc 
file, was not merged. This is the output


-rwxr-xr-x   1 root hdfs  0 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/_SUCCESS
-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-0
-rwxr-xr-x   1 root hdfs  29049 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-1
-rwxr-xr-x   1 root hdfs  29075 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-2


Any ideas?

BR,
Patcharee


Re: merge small orc files

2015-04-20 Thread Xuefu Zhang
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
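
For reference, those live in hive-site.xml; a sketch with illustrative values
(the extra merge stage fires when the average output file size is below the
avgsize threshold, producing merged files of up to size.per.task bytes):

<!-- Illustrative sizes, not recommendations: merge when average output file
     size is under ~16 MB, targeting merged files of ~256 MB -->
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>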

On Mon, Apr 20, 2015 at 8:29 AM, patcharee patcharee.thong...@uni.no
wrote:

 Hi,

 How do I set the configuration in hive-site.xml to automatically merge small
 ORC files (output from a mapreduce job) in hive 0.14?

 This is my current configuration

 <property>
   <name>hive.merge.mapfiles</name>
   <value>true</value>
 </property>

 <property>
   <name>hive.merge.mapredfiles</name>
   <value>true</value>
 </property>

 <property>
   <name>hive.merge.orcfile.stripe.level</name>
   <value>true</value>
 </property>

 However the output from a mapreduce job, which is stored into an orc file,
 was not merged. This is the output

 -rwxr-xr-x   1 root hdfs  0 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
 -rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-0
 -rwxr-xr-x   1 root hdfs  29049 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-1
 -rwxr-xr-x   1 root hdfs  29075 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-2

 Any ideas?

 BR,
 Patcharee



Clear up Hive scratch directory

2015-04-20 Thread Martin Benson
Hi,

One of my users tried to run a HUGE join, which failed due to a lack of space
in HDFS. This has resulted in a large amount of data remaining in the Hive
scratch directory which I need to clear down. I've tried setting
hive.start.cleanup.scratchdir to true and restarting Hive, but it didn't tidy
it up. So I'm wondering whether it is safe to just delete the contents of the
directory in HDFS (while Hive is stopped). Could anyone advise please?

Many thanks,

Martin.





Re: Orc file and Hive Optimiser

2015-04-20 Thread Alan Gates




Mich Talebzadeh m...@peridale.co.uk
April 19, 2015 at 12:32

Finally, this is more of a speculative question. If we have ORC files
that provide good functionality, is there any reason why one should
deploy a columnar database such as HBase or Cassandra if Hive can do
the job as well?


Yes, there is.  Hive is designed around the assumption that you will be
doing scans of significant amounts of data, as are most data warehousing
type solutions.  It doesn't have the right tools to handle efficient
lookup of single rows or small ranges of rows.  That's what HBase is
good at.  I don't know Cassandra as well as HBase, but my impression is
that efficient single-row/small-range lookup is its sweet spot as well;
it just makes a different consistency/partitioning trade-off than HBase
does.


This means that Hive with ORC is still a bad fit for transactional or 
front end serving applications.


Alan.


Thanks,

Mich Talebzadeh

http://talebzadehmich.wordpress.com
