Re: merge small orc files
Hi, How do I set hive-site.xml to automatically merge small ORC files (output from a MapReduce job) in Hive 0.14? Hive cannot add work-stages to a plain MapReduce job. Hive honours hive.merge.mapfiles=true only when Hive itself generates the plan, by adding the merge work to the plan as a conditional task. -rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0 This looks like it was written by an MRv2 Reducer rather than by the Hive FileSinkOperator, and was handled by the MR OutputCommitter instead of the Hive MoveTask. But 0.14 has an option which helps: "hive.merge.orcfile.stripe.level". If that is true (as in your setting), then run "alter table <table> concatenate", which effectively concatenates ORC blocks (without decompressing them) while maintaining the metadata linkage of start/end offsets in the footer. Cheers, Gopal
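For reference, a minimal sketch of the concatenation Gopal describes; the table name "coordinate" and the partition spec are only inferred from the directory listing above and may need adjusting to your schema:

  -- merges the small ORC files in that partition at the stripe level,
  -- without decompressing and rewriting the data
  ALTER TABLE coordinate PARTITION (zone=2) CONCATENATE;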
Re: Table Lock Manager: ZooKeeper cluster
I'm not a ZooKeeper expert, but ZooKeeper is designed to be light-weight, high-performance, and fast to respond. Unless your ZooKeeper is already overloaded, I don't see why you would need a separate ZooKeeper cluster just for Hive. There are only a few ZooKeeper usages in Hive, and the additional stress on ZooKeeper is determined by the load on your HS2. As most of the time user sessions are waiting on query execution, I don't expect the additional stress on your ZooKeeper to be significant. As a general practice, you do need to test it out before putting it in production. On Fri, Apr 17, 2015 at 1:56 PM, Eduardo Ferreira eafon...@gmail.com wrote: Hi there, I read in the Hive installation documentation that we need a ZooKeeper cluster set up to support the Table Lock Manager (Cloudera docs link below). As we already have HBase with a ZooKeeper cluster, my question is whether we can use the same ZK cluster for Hive. Is that recommended? What kind of load and constraints would this put on the HBase ZK cluster? http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hiveserver2_configure.html Thanks in advance. Eduardo.
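For context, the Table Lock Manager is switched on through a handful of hive-site.xml properties; a minimal sketch, assuming you point it at the same ZooKeeper ensemble HBase uses (the hostnames below are placeholders):

  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.lock.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>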
Re: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred when node tried to create too many dynamic partitions on small dataset with dynamic partitions
In our case we've chosen 128 buckets, but that's just an arbitrary figure we chose to get a good, even distribution. To fix the issue we were having with the small file we just updated the setting hive.exec.max.dynamic.partitions.pernode to 1, so that if we do run a tiny file (very rarely) which only allocates one reducer, we can be sure we don't run into this issue again. With thanks, Daniel Harper Software Engineer, OTG ANT BC5 A5 From: Mich Talebzadeh m...@peridale.co.uk Reply-To: user@hive.apache.org Date: Friday, 17 April 2015 10:18 To: user@hive.apache.org Subject: RE: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred when node tried to create too many dynamic partitions on small dataset with dynamic partitions Hi Lefty, I took a look at the documentation link and I noticed that it can be improved. For example the paragraph below: "How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFF' in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets. So, what can go wrong? As long as you set hive.enforce.bucketing = true, and use the syntax above, the tables should be populated properly. Things can go wrong if the bucketing column type is different during the insert and on read, or if you manually cluster by a value that's different from the table definition." So in a nutshell num_buckets determines the granularity of hashing and the number of files; the table will eventually have number_of_partitions x num_buckets files in total. The example mentions (not shown above) 256 buckets, but that is just a number. It also states "For example, …and there were 10 buckets". This is not standard. In short, bucketing is a method to get data "evenly distributed" over many files. Thus, one should define num_buckets as a power of two (2^n, like 2, 4, 8, 16, etc.) to achieve the best results and the best clustering. I will try to see the upper limits on the number of buckets within a partition and will get back on that. HTH Mich Talebzadeh http://talebzadehmich.wordpress.com Author of the books A Practitioner's Guide to Upgrading to Sybase ASE 15, ISBN 978-0-9563693-0-7. Co-author Sybase Transact SQL Guidelines Best Practices, ISBN 978-0-9759693-0-4. Publications due shortly: Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache; Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly. NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Lefty Leverenz [mailto:leftylever...@gmail.com] Sent: 17 April 2015 00:06 To: user@hive.apache.org Subject: Re: [Hive 0.13.1] - Explanation/confusion over Fatal error occurred when node tried to create too many dynamic partitions on small dataset with dynamic partitions If the number of buckets in a partitioned table has a limit, we need to document it in the wiki. Currently the example https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables shows 256 buckets. -- Lefty On Thu, Apr 16, 2015 at 4:35 AM, Daniel Harper daniel.har...@bbc.co.uk wrote: As in you can only have 32 buckets (rather than 128 in our case)? With thanks, Daniel Harper Software Engineer, OTG ANT BC5 A5 From: Mich Talebzadeh m...@peridale.co.uk Reply-To: user@hive.apache.org Date:
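To make the settings under discussion concrete, here is a minimal sketch of a bucketed table and a dynamically partitioned insert; the table and column names are hypothetical, and the limit shown is simply the knob mentioned in this thread (the per-node cap whose breach raises the fatal error), not a recommended value:

  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;
  SET hive.exec.max.dynamic.partitions.pernode = 1000;  -- per-node cap that triggers the fatal error when exceeded
  SET hive.enforce.bucketing = true;

  CREATE TABLE page_views (user_id BIGINT, url STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 128 BUCKETS
  STORED AS ORC;

  -- dt is resolved at run time, so each reducer may create many partitions
  INSERT OVERWRITE TABLE page_views PARTITION (dt)
  SELECT user_id, url, dt FROM raw_page_views;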
Re: UDF cannot be found when the query is submitted via templeton
can you give the complete REST call you are making to submit the query? From: Xiaoyong Zhu xiaoy...@microsoft.com Reply-To: user@hive.apache.org Date: Sunday, April 19, 2015 at 8:23 PM To: user@hive.apache.org Subject: RE: UDF cannot be found when the query is submitted via templeton No, it doesn't... What surprises me is that for HDInsight (Hadoop on Azure), which uses Azure Blob storage, ADD JAR wasb:///test/HiveUDF.jar; CREATE TEMPORARY FUNCTION FindPat AS 'HiveUDF.FindPattern'; SELECT count(FindPat(columnname)) FROM table1; would work. However, for my own cluster, ADD JAR hdfs:///test/HiveUDF.jar; CREATE TEMPORARY FUNCTION FindPat AS 'HiveUDF.FindPattern'; SELECT count(FindPat(columnname)) FROM table1; does not work... Xiaoyong From: Jason Dere [mailto:jd...@hortonworks.com] Sent: Saturday, April 18, 2015 1:37 AM To: user@hive.apache.org Subject: Re: UDF cannot be found when the query is submitted via templeton Does fully qualifying the function name (HiveUDF.FindPattern()) in the query help here? On Apr 17, 2015, at 6:44 AM, Xiaoyong Zhu xiaoy...@microsoft.com wrote: Hi experts, I am trying to use a UDF (which I have already registered in the metastore using CREATE FUNCTION) as follows: SELECT count(FindPattern(s_sitename)) AS testcol FROM weblogs; However, when I try to use the UDF from WebHCat (i.e. submit the above command via WebHCat), the job always fails saying: Added [/tmp/2cb22c27-72d3-4b41-aea0-655df1192872_resources/HiveUDF.jar] to class path Added resources: [hdfs://PATHTOFOLDER/Portal-Queries/HiveUDF.jar] FAILED: SemanticException [Error 10011]: Line 1:13 Invalid function FindPattern If I execute this command through the Hive CLI (through hive -f file or in the interactive shell), the statement above works. From the log I can see the jar file is added, but it seems the function cannot be found. Can someone share some thoughts on this issue? Btw, the CREATE FUNCTION statement is as follows (changing the HDFS URI to a full path does not work either): CREATE FUNCTION FindPattern AS 'HiveUDF.FindPattern' USING JAR 'hdfs:///UDFFolder/HiveUDF.jar'; Thanks in advance! Xiaoyong
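As an illustration of the kind of REST call being asked about, a minimal sketch of a WebHCat (Templeton) Hive submission; the host, port, user name, paths and status directory are placeholders, not values taken from this thread:

  curl -s -X POST \
       -d user.name=hadoopuser \
       -d execute="ADD JAR hdfs:///test/HiveUDF.jar; CREATE TEMPORARY FUNCTION FindPat AS 'HiveUDF.FindPattern'; SELECT count(FindPat(columnname)) FROM table1;" \
       -d statusdir=/tmp/webhcat.output \
       'http://webhcat-host:50111/templeton/v1/hive'

The statusdir is only there so stderr/stdout of the launcher job can be inspected afterwards in HDFS.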
Using Hive as a file comparison and grep-ping tool
hey guys,

As data wranglers and programmers we often need quick tools. One such tool I need almost every day is one that greps a file based on the contents of another file. One can write this in Perl or Python, but since I am already using the Hadoop ecosystem extensively, I said why not do this in Hive? Perhaps you guys already know this and have better solutions; nevertheless :-) here goes...

Best regards
sanjay (Hive super-fan)

I just posted this on my blog https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic.

Using Hive as a file comparison and grep-ping tool
==================================================

1. Log on to the Linux terminal where you run Hive queries from.

2. Create a database called myutils in Hive, plus two Hive tables myutils.file1 and myutils.file2:
- each of these tables will have a partition column called fn (fn is short for filename)
- each of these tables will have just one data column called ln (ln is short for line)

An easy script to do that would be as follows:

for r in 1 2 ; do
  hive -e "CREATE DATABASE IF NOT EXISTS myutils;
           USE myutils;
           DROP TABLE IF EXISTS file${r};
           CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING)
           PARTITIONED BY (fn STRING)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
done

3. Create a permanent base location folder in HDFS:

hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1:
==========
Search whether a bunch of IP addresses exist in another file containing a (larger) bunch of IPs.

[1] The two input files:

registeredIPs.txt
10.456.34.90
123.675.654.1
21.87.657.456
234.109.34.234

visitorIPs.txt
10.456.34.90
12.367.54.23
218.7.657.456
23.4.109.3

[2] Output which IPs in file1 are present in file2.

[3] Put each file in a separate HDFS location:

hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt
hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt

[4] Add a partition to each table. For simplicity keep the partition names identical to the file names themselves:

hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt';"
hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt';"

[5] Check that the partitions can be accessed by Hive:

# This should give you the same answer as: wc -l registeredIPs.txt
hive -e "SELECT count(*) FROM myutils.file1 WHERE fn='registeredIPs.txt';"

# This should give you the same answer as: wc -l visitorIPs.txt
hive -e "SELECT count(*) FROM myutils.file2 WHERE fn='visitorIPs.txt';"

[6] Find the IPs in registeredIPs.txt that are also in visitorIPs.txt.

# This dumps to the local file system
hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln);" > ./registered_in_visitors_list.txt

# This dumps to a new internally-managed-by-Hive table.
# Make sure you don't already have some valuable Hive table called myutils.registered_in_visitors_list - else this will overwrite that table with the results of this query.
hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln);"

# This dumps to a directory on HDFS.
# Make sure you don't already have some valuable directory called registered_in_visitors_list - else this will overwrite that directory and all its contents with the results of this query.
hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln);"
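A small extension that may be useful: the same two tables can also answer the "grep -v" style question (registered IPs that never appear in the visitor list) with a left outer join. This is just a sketch assuming the myutils.file1/file2 tables and partitions created above:

# IPs present in registeredIPs.txt but absent from visitorIPs.txt
hive -e "SELECT f1.ln
         FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1
         LEFT OUTER JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2
         ON trim(f1.ln) = trim(f2.ln)
         WHERE f2.ln IS NULL;" > ./registered_not_in_visitors_list.txt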
merge small orc files
Hi, How do I set hive-site.xml to automatically merge small ORC files (output from a MapReduce job) in Hive 0.14? This is my current configuration:

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.orcfile.stripe.level</name>
  <value>true</value>
</property>

However the output from a MapReduce job, which is stored into an ORC file, was not merged. This is the output:

-rwxr-xr-x 1 root hdfs 0 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
-rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0
-rwxr-xr-x 1 root hdfs 29049 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-1
-rwxr-xr-x 1 root hdfs 29075 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-2

Any ideas? BR, Patcharee
Re: merge small orc files
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize. On Mon, Apr 20, 2015 at 8:29 AM, patcharee patcharee.thong...@uni.no wrote: Hi, How do I set hive-site.xml to automatically merge small ORC files (output from a MapReduce job) in Hive 0.14? This is my current configuration:

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>
<property>
  <name>hive.merge.orcfile.stripe.level</name>
  <value>true</value>
</property>

However the output from a MapReduce job, which is stored into an ORC file, was not merged. This is the output:

-rwxr-xr-x 1 root hdfs 0 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
-rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0
-rwxr-xr-x 1 root hdfs 29049 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-1
-rwxr-xr-x 1 root hdfs 29075 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-2

Any ideas? BR, Patcharee
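For reference, those two thresholds also live in hive-site.xml; a minimal sketch using the commonly documented defaults (merging kicks in when the average file size in the job's output falls below hive.merge.smallfiles.avgsize, and the merge task aims for files of roughly hive.merge.size.per.task):

<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>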
Clear up Hive scratch directory
Hi, One of my users tried to run a huge join, which failed due to a lack of space in HDFS. This has resulted in a large amount of data remaining in the Hive scratch directory, which I need to clear down. I've tried setting hive.start.cleanup.scratchdir to true and restarting Hive, but it didn't tidy it up. So I'm wondering if it is safe to just delete the contents of the directory in HDFS (while Hive is stopped). Could anyone advise please? Many thanks, Martin. Registered in England and Wales at Players House, 300 Attercliffe Common, Sheffield, S9 2AG. Company number 05935923. This email and its attachments are confidential and are intended solely for the use of the addressed recipient. Any views or opinions expressed are those of the author and do not necessarily represent Jaywing. If you are not the intended recipient, you must not forward or show this to anyone or take any action based upon it. Please contact the sender if you received this in error.
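For what it's worth, a minimal sketch of the manual cleanup being considered; it assumes the scratch directory is at the stock hive.exec.scratchdir location of /tmp/hive (check your hive-site.xml, as it is often customised) and that no Hive CLI or HiveServer2 sessions are running:

# see what is there and how much space it holds
hdfs dfs -du -s -h /tmp/hive

# remove the leftover scratch data, bypassing the trash so the HDFS space is freed immediately
hdfs dfs -rm -r -skipTrash '/tmp/hive/*'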
Re: Orc file and Hive Optimiser
Mich Talebzadeh m...@peridale.co.uk April 19, 2015 at 12:32 Finally, this is more of a speculative question. If we have ORC files that provide good functionality, is there any reason why one should deploy a columnar database such as HBase or Cassandra if Hive can do the job as well? Yes, there is. Hive is designed around the assumption that you will be doing scans of significant amounts of data, as are most data warehousing type solutions. It doesn't have the right tools to handle efficient lookup of single rows or small ranges of rows. That's what HBase is good at. I don't know Cassandra as well as HBase, but my impression is that efficient single-row/small-range lookup is its sweet spot as well; it just makes a different consistency/partitioning trade-off than HBase does. This means that Hive with ORC is still a bad fit for transactional or front-end serving applications. Alan. Thanks, Mich Talebzadeh http://talebzadehmich.wordpress.com Author of the books A Practitioner's Guide to Upgrading to Sybase ASE 15, ISBN 978-0-9563693-0-7. Co-author Sybase Transact SQL Guidelines Best Practices, ISBN 978-0-9759693-0-4. Publications due shortly: Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache; Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly. NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.