Using SPLIT with DOT (.) delimiter demonstrates funny behavior within a VIEW
Hi guys

I am using Hive version 0.13.1-cdh5.3.3

HIVE TABLE = qnap_resume_file_location
--------------------------------------
DROP TABLE IF EXISTS qnap_resume_file_location;
CREATE EXTERNAL TABLE qnap_resume_file_location (
  floc STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/myfirm/file_location';

SAMPLE DATA (select * from resume.qnap_resume_file_location limit 10)
---------------------------------------------------------------------
/mnt/myfirm/200901/14008_1_1004942032_0_1005347483.html
/mnt/myfirm/200901/14008_1_1004944724_0_1005334758.html
/mnt/myfirm/200901/14000__900719077_10_901309048.doc
/mnt/myfirm/200901/14012_6_1004858088_0_1005227990.doc
/mnt/myfirm/200901/14008_1_1004811416_0_1005258541.html
/mnt/myfirm/200901/14008_1_501597002_6_1005224039.html
/mnt/myfirm/200901/14009_5_1004872908_0_1005244348.doc
/mnt/myfirm/200901/14001_7_14165_1_66775.docx
/mnt/myfirm/200901/14009_5_1004935267_0_1005322091.txt
/mnt/myfirm/200901/14012_6_1004904422_0_1005283729.doc

HIVE VIEW = qnap_resume_resid_file_loc_map_vw
---------------------------------------------
DROP VIEW IF EXISTS qnap_resume_resid_file_loc_map_vw;
CREATE VIEW IF NOT EXISTS qnap_resume_resid_file_loc_map_vw (resid, mm, floc) AS
SELECT split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid,
       reverse(split(reverse(floc),'/')[1]) as mm,
       floc
FROM resume.qnap_resume_file_location;

QUERY ON VIEW
-------------
select * from qnap_resume_resid_file_loc_map_vw limit 10

EXPECTED RESULTS
----------------
14012_6_1006686583_0_1102955123  201205  /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209  201205  /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594  201205  /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525  201205  /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973  201205  /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546  201205  /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682  201205  /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042  201205  /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450  201205  /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740  201205  /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx

ACTUAL RESULTS (see the first column - the DOT and extension are still there)
-----------------------------------------------------------------------------
14012_6_1006686583_0_1102955123.docx  201205  /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209.doc  201205  /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594.doc  201205  /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525.doc  201205  /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973.doc  201205  /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546.txt  201205  /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682.docx  201205  /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042.docx  201205  /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450.doc  201205  /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740.docx  201205  /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx

EXECUTING THE SELECT QUERY INSIDE THE VIEW INDEPENDENTLY
--------------------------------------------------------
(gives u the correct results - but when placed inside a view it does not seem to work)

SELECT split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid,
       reverse(split(reverse(floc),'/')[1]) as mm,
       floc
FROM resume.qnap_resume_file_location

14012_6_1006686583_0_1102955123  201205  /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209  201205  /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594  201205  /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525  201205  /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973  201205  /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546  201205  /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682  201205  /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042  201205  /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450  201205  /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740  201205  /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx

Thoughts ? Ideas ?

thanks
sanjay
Re: Using SPLIT with DOT (.) delimiter demonstrates funny behavior within a VIEW
Hi Vivek

Thanks. A few more experiments: the following all work correctly

split(reverse(split(reverse(floc),'/')[0]),'[\\.]')[0]   (needs the square brackets)
split(reverse(split(reverse(floc),'/')[0]),'\\\.')[0]    (does not need square brackets)
split(reverse(split(reverse(floc),'/')[0]),'[.]')[0]     (needs the square brackets)

From: Vivek Veeramani vivek.veeraman...@gmail.com
To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com
Sent: Monday, August 24, 2015 1:57 PM
Subject: Re: Using SPLIT with DOT(.) delimiter demonstrate funny behavior within a VIEW

Hi Sanjay,

Try replacing the 4 backslashes with just 2. It usually works with 2 backslashes. Replace this

split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid

with

split(reverse(split(reverse(floc),'/')[0]),'\\.')[0] as resid

Please have a look and let us know if that helps.

Best,
Vivek Veeramani
cell : +1-415 996 7853
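A note on why the escaped and bracketed forms work: Hive's split() treats its delimiter as a Java regular expression, and an unescaped dot is a regex metacharacter that matches any single character. The same behavior can be sketched in Python's re module (an analogy, not Hive itself):

```python
import re

fname = "14012_6_1006686583_0_1102955123.docx"

# '.' matches ANY character, so every character is treated as a delimiter
# and all the resulting fields are empty strings.
print(re.split(r'.', fname)[0])    # '' (empty)

# Escaping the dot, or putting it in a character class, makes it a literal dot.
print(re.split(r'\.', fname)[0])   # 14012_6_1006686583_0_1102955123
print(re.split(r'[.]', fname)[0])  # 14012_6_1006686583_0_1102955123
```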
Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS
hey guys

On the Hive/Hadoop ecosystem we are using the Cloudera distribution CDH 5.2.x; there are about 300+ Hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL to point to the Hive metadata and be able to define JOINs etc. using a programming structure like this:

import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val schemaRdd = sqlContext.sql("some complex SQL")

Is that the way to go ? Some guidance will be great.

thanks
sanjay
Using Hive as a file comparison and grep-ping tool
hey guys

As data wranglers and programmers we often need quick tools. One such tool I need almost every day is one that greps a file based on the contents of another file. One can write this in Perl or Python, but since I am already using the Hadoop ecosystem extensively, I said why not do this in Hive ? Perhaps you guys already know this and have better solutions; nevertheless :-) here goes...

Best regards
sanjay (Hive super-fan)

I just posted this on my blog:
https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/

In case the blog URL does not work for any reason, here is the logic.

Using Hive as a file comparison and grep-ping tool
==================================================
1. Log on to the Linux terminal u run Hive queries from.

2. Create a database called myutils in Hive. Create two Hive tables, myutils.file1 and myutils.file2:
   - each table has a partition column called fn (short for filename)
   - each table has just one data column called ln (short for line)

An easy script to do that:

for r in 1 2 ; do
  hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r} (ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"
done

3. Create a permanent base location folder in HDFS:

hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1 : Search whether a bunch of IP addresses exists in another file containing a (larger) bunch of IPs
===========================================================================================================

[1] The two files:

registeredIPs.txt
10.456.34.90
123.675.654.1
21.87.657.456
234.109.34.234

visitorIPs.txt
10.456.34.90
12.367.54.23
218.7.657.456
23.4.109.3

[2] Output which IPs in file1 are present in file2.

[3] Put each file in a separate HDFS location:

hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt
hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt

[4] Add the partitions. For simplicity keep the partition names identical to the file names themselves:

hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION (fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION (fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"

[5] Check that the partitions can be accessed by Hive:

# This should give u the same answer as: wc -l registeredIPs.txt
hive -e "select count(*) from myutils.file1 where fn='registeredIPs.txt'"

# This should give u the same answer as: wc -l visitorIPs.txt
hive -e "select count(*) from myutils.file2 where fn='visitorIPs.txt'"

[6] Count the IPs in registeredIPs.txt that are also in visitorIPs.txt.

# This dumps to the local file system
hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt

# This dumps to a new internally-managed-by-Hive table.
# Make sure u don't already have a valuable Hive table called myutils.registered_in_visitors_list,
# else this will overwrite it with the results of this query.
hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"

# This dumps to a directory on HDFS.
# Make sure u don't already have a valuable directory called registered_in_visitors_list,
# else this will overwrite that directory and all its contents with the results of this query.
hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"
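As a quick local cross-check of what step [6] computes: the Hive join is effectively a set intersection of trimmed lines. A small Python sketch using the sample IPs above:

```python
# The same "grep file A against file B" logic, locally: which registered IPs
# appear in the visitor list?
registered = ["10.456.34.90", "123.675.654.1", "21.87.657.456", "234.109.34.234"]
visitors = ["10.456.34.90", "12.367.54.23", "218.7.657.456", "23.4.109.3"]

# trim(f1.ln) = trim(f2.ln) in the Hive join corresponds to stripping
# whitespace before comparing.
matches = sorted(set(ip.strip() for ip in registered) &
                 set(ip.strip() for ip in visitors))
print(matches)  # ['10.456.34.90']
```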
Re: [ANN] Hivemall v0.3 is now available
awesome, thank u. really value your ML contributions.

regards
sanjay

From: Makoto Yui yuin...@gmail.com
To: user@hive.apache.org
Sent: Friday, February 6, 2015 3:31 AM
Subject: [ANN] Hivemall v0.3 is now available

Hello all,

We are excited to announce that a new stable version of Hivemall (v0.3.0) is now available.
https://github.com/myui/hivemall/releases/tag/v0.3.0

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDAFs/UDTFs. The main enhancement in v0.3.0 is the support for matrix factorization.

Hope you enjoy the release! Feedback and pull requests are welcome.

Thanks,
Makoto
Hive JSON Serde question
hey guys

This is the Hive table definition I have created based on the JSON. I am using this version of the Hive JSON serde: https://github.com/rcongiu/Hive-JSON-Serde

ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
DROP TABLE IF EXISTS datafeed_json;
CREATE EXTERNAL TABLE IF NOT EXISTS datafeed_json (
  object STRING,
  entry array<struct<
    id:STRING,
    time:BIGINT,
    changes:array<struct<
      field:STRING,
      value:struct<item:STRING, verb:STRING, parent_id:STRING, sender_id:BIGINT, created_time:BIGINT>
    >>
  >>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/sanjay/datafeed';

QUERY 1
=======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT object,
       entry[0].id,
       entry[0].time,
       entry[0].changes[0].field,
       entry[0].changes[0].value.item,
       entry[0].changes[0].value.verb,
       entry[0].changes[0].value.parent_id,
       entry[0].changes[0].value.sender_id,
       entry[0].changes[0].value.created_time
FROM datafeed_json;

RESULT 1
========
foo123 113621765320467 1418608223 leads song1 rock 113621765320467_1107142375968396 14748082019 1418608223

QUERY 2
=======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT object, entry.id, entry.time, ntry
FROM datafeed_json
LATERAL VIEW EXPLODE (datafeed_json.entry.changes) oc1 AS ntry;

RESULT 2
========
This gives 4 rows, but I was not able to iteratively do the LATERAL VIEW EXPLODE. I tried various combinations of LATERAL VIEW, LATERAL VIEW EXPLODE and json_tuple to extract all fields in an exploded view from the JSON in tab-separated format, but no luck. Any thoughts ?

Thanks
sanjay
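For reference, here is a small Python sketch of what a doubly nested explode computes on a record shaped like this table; the sample values are invented, loosely following RESULT 1:

```python
# One output row per element of entry.changes requires exploding twice:
# once over entry, once over each entry's changes array.
record = {
    "object": "foo123",
    "entry": [{
        "id": "113621765320467",
        "time": 1418608223,
        "changes": [
            {"field": "leads", "value": {"item": "song1", "verb": "rock"}},
            {"field": "leads", "value": {"item": "song2", "verb": "jazz"}},
        ],
    }],
}

rows = []
for e in record["entry"]:        # first explode: entry
    for c in e["changes"]:       # second explode: entry.changes
        rows.append((record["object"], e["id"], e["time"],
                     c["field"], c["value"]["item"], c["value"]["verb"]))

for r in rows:
    print("\t".join(map(str, r)))
```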
Re: Hive JSON Serde question
Thanks Ed. Let me try a few more iterations. Somehow I am not doing this correctly :-)

regards
sanjay

From: Edward Capriolo edlinuxg...@gmail.com
To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com
Sent: Sunday, January 25, 2015 8:11 AM
Subject: Re: Hive JSON Serde question

Nested lists require nested lateral views.
Re: Hive JSON Serde question
sure, will try get_json_object. thank u

regards
sanjay

From: 丁桂涛(桂花) dinggui...@baixing.com
To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com
Sent: Sunday, January 25, 2015 4:45 PM
Subject: Re: Hive JSON Serde question

Try the get_json_object UDF. No iterations needed. :)
Writing Hive Query Output to local system
hey guys

I recall this did not happen in the days of the 0.9.x version, but I use 0.13.x now, and when I run a hive query

hive -e "select * from tablename" > ./myfile.txt

the first line in myfile.txt is as follows:

2015-01-16 10:48:13,091 WARN [main] conf.HiveConf (HiveConf.java:initialize(1491)) - DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

How do I avoid this ?

thanks
sanjay
Re: Writing Hive Query Output to local system
I am using Cloudera Manager and I know how to add a param thru the Safety Valve. I don't know how to ignore a param thru the Safety Valve :-) Actually I will put the question on the CDH groups as well... Any thoughts guys ?

thanks
sanjay

From: grimaldi.vince...@gmail.com
To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com
Sent: Friday, January 16, 2015 11:27 AM
Subject: Re: Writing Hive Query Output to local system

In your hive-site.xml, remove the property block corresponding to the parameter hive.metastore.local. That's it. It is a deprecated parameter and is not necessary anymore.

Regards,
Vincenzo Grimaldi
Senior Business Intelligence Consultant
p: 00 39 380 52 22 218 | m: 00 353 851 69 84 58 | e: grimaldi.vince...@gmail.com
Re: How to convert RDBMS DDL to Hive DDL ?
@Krish
Are u looking for an automated tool that takes RDBMS DDL as input and outputs Hive DDL ? I exported the DDLs of all tables with column sequence numbers, and I wrote code that converted all the DB2 tables we have to Hive. Not sure if there is a standard tool.

regards
sanjay

From: Lefty Leverenz leftylever...@gmail.com
To: user@hive.apache.org
Sent: Thursday, January 8, 2015 1:34 PM
Subject: Re: How to convert RDBMS DDL to Hive DDL ?

Have you looked at the documentation? If not, start here:
- Creating, Showing, Altering, and Dropping Tables
- Create Table

-- Lefty

On Thu, Jan 8, 2015 at 12:28 PM, Krish Donald gotomyp...@gmail.com wrote:
Hi, I am new to Hive and I have imported data into HDFS using DB2. Now I want to create a Hive table, and I am not sure how to convert RDBMS DDL to Hive DDL. Can somebody please guide ?
Thanks
Krish
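For anyone looking to script this, here is a minimal sketch of the kind of conversion code described above. The type map, the sample column definitions, and the helper name are all hypothetical; a real converter needs many more cases (nullability, keys, partitioning, etc.):

```python
import re

# Hypothetical DB2 -> Hive type mapping (illustrative, not exhaustive).
TYPE_MAP = {
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "INTEGER": "INT",
    "DECIMAL": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP",
}

def db2_col_to_hive(col_def: str) -> str:
    """Convert one 'name TYPE(len)' column definition to 'name HIVETYPE'."""
    name, db2_type = col_def.split()[:2]
    base = re.sub(r"\(.*\)", "", db2_type).upper()  # strip length, e.g. VARCHAR(50)
    return f"{name} {TYPE_MAP.get(base, 'STRING')}"

cols = ["emp_id INTEGER", "emp_name VARCHAR(50)", "salary DECIMAL(10,2)"]
hive_cols = ",\n  ".join(db2_col_to_hive(c) for c in cols)
print(f"CREATE TABLE emp (\n  {hive_cols}\n);")
```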
Using IF in the JOIN clause
hey guys

This is a portion of a long query we wrote. Can u advise if the bold portion will work ?

and if(f.fr_name is not null, f.fr_name, e.fr_name) = d.fr_name
and if(f.pos_bin is not null, f.pos_bin, e.pos_bin) = d.pos_bin

thanks
regards
sanjay

PART OF A LARGER QUERY
======================
from
(
  select z.* from
  (
    select city,
           fr_name,
           (floor((none_position_scaled-1)/10)+1) * 10 as pos_bin,
           sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
           sum(job_count) as job_sum,
           sum(pay_count) as pay_sum,
           if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
           count(*) as counter
    from oletas02.scorek_x_done
    group by city, fr_name, (floor((none_position_scaled-1)/10)+1) * 10
    order by city, fr_name, pos_bin
  ) z
  where pos_bin <= 200 and NOT pos_bin is null
) f
full outer join
(
  select z.* from
  (
    select city,
           fr_name,
           (floor((none_position-1)/10)+1) * 10 as pos_bin,
           sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
           sum(job_count) as job_sum,
           sum(pay_count) as pay_sum,
           if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
           count(*) as counter
    from oletas02.scorek_x_done
    group by city, fr_name, (floor((none_position-1)/10)+1) * 10
    order by city, fr_name, pos_bin
  ) z
  where pos_bin <= 200 and NOT pos_bin is null
) e
on f.city=e.city and f.fr_name=e.fr_name and f.pos_bin=e.pos_bin
full outer join
(
  select z.* from
  (
    select city,
           fr_name,
           (floor((either_position_scaled-1)/10)+1) * 10 as pos_bin,
           sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
           sum(job_count) as job_sum,
           sum(pay_count) as pay_sum,
           if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
           count(*) as counter
    from oletas02.scorek_x_done
    group by city, fr_name, (floor((either_position_scaled-1)/10)+1) * 10
    order by city, fr_name, pos_bin
  ) z
  where pos_bin <= 200 and NOT pos_bin is null
) d
on if(f.city is not null, f.city, e.city)=d.city
and if(f.fr_name is not null, f.fr_name, e.fr_name)=d.fr_name
and if(f.pos_bin is not null, f.pos_bin, e.pos_bin)=d.pos_bin
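For what it's worth, if(x is not null, x, y) is the COALESCE(x, y) idiom commonly used to chain FULL OUTER JOINs on the surviving key. A minimal Python sketch of that key logic (None standing in for SQL NULL; the city values are made up):

```python
# COALESCE semantics: take the left value unless it is NULL, else the right.
def coalesce(x, y):
    return x if x is not None else y

# f matched, e did not: the f-side key survives.
assert coalesce("Boston", None) == "Boston"
# f did not match, e did: the e-side key survives.
assert coalesce(None, "Boston") == "Boston"
# neither side matched: the key stays NULL and will not equi-join to d.
assert coalesce(None, None) is None
print("coalesce key logic checks out")
```

Whether a given Hive version accepts such expressions in the ON clause is a separate question worth testing on a small sample; the logic itself is the standard one.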
Re: It's extremely slow when hive reads compression files
It could be the serde that is slow and not the compression ? If your input XML is in multiline records then u may wanna write a bit of RecordReader code to process the multiline XML yourself, just to see if it makes any change to the processing speed:
https://github.com/sanjaysubramanian/big_data_latte/tree/master/src/main/java/org/medicalsidefx/bdo/logparsers/multiline

From: Yan Fang yanfang...@gmail.com
To: user@hive.apache.org
Sent: Wednesday, October 22, 2014 8:06 AM
Subject: It's extremely slow when hive reads compression files

Hi guys,

Not sure if you have run into this problem. We have a 6-node cluster with CDH5. It takes about 8 hours to process 80MB of compressed files (in .deflate format), while it is much faster (less than 1 hour) to process the uncompressed files. I think there must be something wrong with my settings. Any help ? Thank you.

We are using an XMLInputFormat as InputFormat and a customized SerDe to read the XML records.

Cheers,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
Re: Optimize hive external tables with serde
The non-gzipped version will be faster. One of the use cases for using a GZIP file is when u have input records in a multiline long format. Example: in the following case, u have to send all contents between the begin and end sections to the mapper. If u do NOT gzip this file (containing multiple records) then HDFS will split it into 128MB blocks (or whatever u have set the HDFS block size to be), and it may happen that part of the contents between the begin and end sections gets split, and u will get errors processing the records.

[begin_20100922_12345678_serverid_foofla01]
blah bhah
blah bhah
blah bhah
[end_20100922_12345678_serverid_foofla01]

So if your records are not multiline, don't use GZIP. I have used LZO, which is fast, but getting LZO to work correctly needs a little bit of work on your cluster. It's not difficult, just a few steps to do. U can refer to my documentation on the Hive site:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

I rewrote your query. (The GROUP BY is pushed to the outer query, and I created an initial subset jt1 to use later in the query.) I did not understand two things in your query:
- the join clause ON (1=1) ?
- the where clause 1=1 ?

a) Code heatmap - how many users used some code

WITH jt1 AS (
  SELECT * FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
)
SELECT A.code,
       A.NumberOfDistinctCustomers,
       B.NumberOfAllDistinctCustomers,
       A.NumberOfEntries,
       B.NumberOfAllEntries
FROM
  (SELECT jt1.code AS code,
          COUNT(DISTINCT jt1.customerId) AS NumberOfDistinctCustomers,
          COUNT(*) AS NumberOfEntries
   FROM jt1) A
JOIN
  (SELECT COUNT(DISTINCT jt1.customerId) AS NumberOfAllDistinctCustomers,
          COUNT(*) AS NumberOfAllEntries
   FROM jt1) B
ON (1=1)
GROUP BY A.code;

b) Select users which use some code, but have some special attributes in the second table.

NOTE: You are querying this with no partition, so if json_table has 500-1000+ partitions then this query might be really slow or may not run. One way to solve this is to write a bash script that loops through a set of partitions, store the results in an intermediate Hive-managed table, and then have another query aggregate that.

SELECT jt.customerId
FROM (SELECT * FROM json_table
      WHERE 1=1 AND code IN (1,3,54,300,222,111)) jt
JOIN (SELECT * FROM attribute_table
      WHERE attribute_X1='1' AND attribute_X2='1') at
  ON jt.customerId = at.customerId

From: ptrst ptrstp...@gmail.com
To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com
Sent: Wednesday, October 22, 2014 1:02 AM
Subject: Re: Optimize hive external tables with serde

ad 1) My files are not bigger than the block size. To be precise, all data from one day is up to 2GB gzipped; unzipped it is ~15GB. The data in one folder is split into files less than the block size (the block size in my case is 128MB, the files are ~100MB). I can transform them to another format if you think it will speed up my queries, but as I understand the documentation, such a structure should be the same as splittable files. If I add one more partition dimension with code, so that I partition by date and code, a single file is 3-10MB, because I have 3000 unique codes.

ad 2) I read your code and I use the same serde, only a newer one. Should I check something more?

ad 4) Most queries will be about the last 3-6 months.

ad 3) My example queries are:

a) Code heatmap - how many users used some code

SELECT A.code,
       A.NumberOfDistinctCustomers,
       B.NumberOfAllDistinctCustomers,
       A.NumberOfEntries,
       B.NumberOfAllEntries
FROM (
  SELECT code,
         COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
         COUNT(*) AS NumberOfEntries
  FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
  GROUP BY code
) A
JOIN (
  SELECT COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
         COUNT(*) AS NumberOfAllEntries
  FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
) B
ON (1=1);

b) Select users which use some code, but have some special attributes in a second table:

SELECT json_table.customerId
FROM json_table
JOIN attribute_table
  ON (json_table.customerId = attribute_table.customerId)
  AND attribute_table.attribute_X1='1'
  AND attribute_table.attribute_X2='1'
  AND json_table.code IN (1,3,54,300,222,111);

c) Mix of the above.
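As a side note, the aggregation the heatmap query performs (distinct customers and entry counts per code, alongside the overall totals) can be sketched locally like this; the (code, customerId) rows below are made up for illustration:

```python
from collections import defaultdict

# Sample (code, customerId) rows standing in for the filtered json_table subset.
rows = [(1, "c1"), (1, "c2"), (3, "c1"), (3, "c1"), (54, "c3")]

per_code = defaultdict(set)   # distinct customers per code
entries = defaultdict(int)    # total entries per code
for code, cust in rows:
    per_code[code].add(cust)
    entries[code] += 1

all_customers = {cust for _, cust in rows}  # overall distinct customers

# code, NumberOfDistinctCustomers, NumberOfAllDistinctCustomers,
# NumberOfEntries, NumberOfAllEntries
for code in sorted(per_code):
    print(code, len(per_code[code]), len(all_customers), entries[code], len(rows))
```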
Re: Optimize hive external tables with serde
1. The gzip files are not splittable, so gzip itself will make the queries slower. 2. As a reference for JSON serdes, here is an example from my blog http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/ 3. Need to see your query first to try and optimize it 4. Even if you have datewise partitions and u have 5 years of data, i.e. about 1825 partitions -- trying to do a select count(*) without a where clause might make hive crawl.

From: Ja Sam ptrstp...@gmail.com To: user@hive.apache.org Sent: Tuesday, October 21, 2014 10:37 AM Subject: Optimize hive external tables with serde

Part 1: my environment
I have the following files uploaded to Hadoop:
- They are plain text
- Each line contains JSON like: {code:[int], customerId:[string], data:{[something more here]}}
- code are numbers from 1 to 3000
- customerId total up to 4 million, daily up to 0.5 million
- All files are gzip
- In Hive I created an external table with a custom JSON serde (let's call it CUSTOMER_DATA)
- All files from each date are stored in a separate directory - and I use those as partitions in Hive tables

Most queries which I do are filtering by date, code and customerId. I also have a second file with the format (let's call it CUSTOMER_ATTRIBUTES):
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so up to 4 million rows.

I query and filter my data in the following way:
- Filtering by date - partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
- Filtering by code using a statement like, for example, WHERE code IN (1,4,5,33,6784)
- Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with a query like SELECT customerId FROM CUSTOMER_DATA JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]

Part 2: question
Is there any efficient way I can optimize my queries?
I read about indexes and buckets, but I don't know if I can use them with external tables and whether they will optimize my queries.
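A side note on the date-range predicates on partitionDate above: they rely on zero-padded yyyyMMdd strings comparing lexicographically in the same order as chronologically, which is also what lets Hive prune partitions before reading any data. A minimal sketch of that pruning logic (partition names here are invented):

```python
# Sketch of date-range partition pruning: with zero-padded yyyyMMdd strings,
# lexicographic order matches chronological order, so string comparison is safe.
def prune_partitions(partitions, low, high):
    """Keep only partitionDate values with low < d <= high (as in the query)."""
    return [d for d in sorted(partitions) if low < d <= high]

partitions = ["20140801", "20140922", "20141001", "20141022", "20141101"]
kept = prune_partitions(partitions, "20140922", "20141022")
print(kept)  # ['20141001', '20141022'] -- only these directories get scanned
```

The same reasoning explains why a query with no partitionDate predicate has to touch every directory, which is the slowness warned about above.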
Re: select * from table and select column from table in hive
One way to debug is to put bash in action. Say you have a data file in hdfs (/data/rockers/rockers.csv) that looks like

cust_num,cust_name,instrument
1,paul,bass
2,john,rhythm
3,ringo,drums
4,george,lead

To get the column cust_num of the data (in this case it's column 1):

hdfs dfs -cat /data/rockers/rockers.csv | cut -d, -f 1 > /path/to/local/file.txt

If the file is tab separated

cust_num	cust_name	instrument
1	paul	bass
2	john	rhythm
3	ringo	drums
4	george	lead

hdfs dfs -cat /data/rockers/rockers.csv | tr '\t' ',' | cut -d, -f 1 > /path/to/local/file.txt

From: Nitin Pawar nitinpawar...@gmail.com To: Raj Hadoop hadoop...@yahoo.com; user@hive.apache.org Sent: Monday, October 20, 2014 12:11 PM Subject: Re: select * from table and select column from table in hive

What's your table create DDL? Is the data in csv-like format?

On 21 Oct 2014 00:26, Raj Hadoop hadoop...@yahoo.com wrote: I am able to see the data in the table for all the columns when I issue the following -

SELECT * FROM t1 WHERE dt1='2013-11-20'

But I am unable to see the column data when I issue the following -

SELECT cust_num FROM t1 WHERE dt1='2013-11-20'

The above shows null values. How should I debug this?
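For the same kind of off-cluster sanity check, the cut pipeline above can be mimicked in a few lines of Python (the file contents and column index are just the example's):

```python
def extract_column(text, index, delimiter=","):
    """Return field `index` from every non-empty line, like `cut -d, -f index+1`."""
    return [line.split(delimiter)[index] for line in text.splitlines() if line.strip()]

csv_data = (
    "cust_num,cust_name,instrument\n"
    "1,paul,bass\n"
    "2,john,rhythm\n"
    "3,ringo,drums\n"
    "4,george,lead\n"
)
print(extract_column(csv_data, 0))               # ['cust_num', '1', '2', '3', '4']
# Tab-separated input: pass the tab delimiter directly instead of translating with tr.
print(extract_column("1\tpaul\tbass", 0, "\t"))  # ['1']
```

If the extracted column looks right here but Hive still returns NULLs, the table's declared delimiter is the usual suspect.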
Re: Migration of metastore tables from mysql to oracle.
First question: why are u migrating to Oracle? Since u never store actual data in the Hive Metastore, MYSQL is a great choice. I have done a MYSQL to MYSQL transfer. From the source DB mysql dump, it should be possible to mod any Oracle-required syntax, right?

From: hadoop hive hadooph...@gmail.com To: user@hive.apache.org Sent: Tuesday, October 21, 2014 2:36 AM Subject: Migration of metastore tables from mysql to oracle.

Hi folks, I have a requirement where I need to migrate metastore tables from mysql to oracle. Is there anyone who has already performed this kind of activity? Any help will be appreciated. Thanks Vikas Srivastava
Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]
hi I faced a similar situation in my dev cluster (CDH distribution 5.1.3). See the thread details with log files https://groups.google.com/a/cloudera.org/forum/#!mydiscussions/scm-users/MpcpHj5mWT8 thanks sanjay

From: John Omernik j...@omernik.com To: user@hive.apache.org Sent: Tuesday, September 9, 2014 12:10 PM Subject: Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

Well, here is me talking to myself: but in case someone else runs across this, I changed the hive metastore connect timeout to 600 seconds (per the JIRA below for Hive 0.14) and now my problem has gone away. It looks like the timeout was causing some craziness. https://issues.apache.org/jira/browse/HIVE-7140

On Tue, Sep 9, 2014 at 1:00 PM, John Omernik j...@omernik.com wrote: I ran with debug logging, and this is interesting: there was a loss of connection to the metastore client RIGHT before the partition mentioned above... as data was looking to be moved around... I wonder if the timing on that is bad?

14/09/09 12:47:37 [main]: INFO exec.MoveTask: Partition is: {day=null, source=null}
14/09/09 12:47:38 [main]: INFO metadata.Hive: Renaming src: maprfs:/user/hive/scratch/hive-mapr/hive_2014-09-09_12-38-30_860_3555291990145206535-1/-ext-1/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0; dest: maprfs:/user/hive/warehouse/intel_flow.db/pcaps/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0; Status:true
14/09/09 12:48:02 [main]: WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
  at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)

On Tue, Sep 9, 2014 at 11:02 AM, John Omernik j...@omernik.com wrote: I am doing a dynamic partition load in Hive 0.13 using ORC files. This has always worked in the past both with MapReduce V1 and YARN.
I am working with Mesos now, and trying to troubleshoot this weird error: Failed with exception AlreadyExistsException(message:Partition already exists. What's odd is my insert is an insert (without Overwrite), so it's like two different reducers have data to go into the same partition, but then there is a collision of some sort? Perhaps there is a situation where the partition doesn't exist prior to the run, but when two reducers have data, they both think they should be the one to create the partition? Shouldn't, if a partition already exists, the reducer just copy its file into the partition? I am struggling to see why this would be an issue with Mesos, but not on Yarn or MRv1. Any thoughts would be welcome. John
Re: problem with delimiters (control A)
Hi Jack Since u already have your data with columns separated by CtrlA, u need to define the HIVE table as follows (by default Hive will assume CtrlA as the column delimiter):

create table if not exists my_test (
  userid BIGINT,
  movieId BIGINT,
  comment STRING
);

regards sanjay

From: Petter von Dolwitz (Hem) petter.von.dolw...@gmail.com To: user@hive.apache.org Sent: Thursday, May 22, 2014 8:00 AM Subject: Re: problem with delimiters (control A)

Hi Jack, could it be that your source file does not use \001 as delimiter but the character sequence ^A? I get this feeling when looking at your cat printout and also when you say you can do a search and replace. If I do a print out of a file with the same delimiter I get the following

2014-04-097846478510
2014-04-107851178558

These are actually three columns but you cannot see the ^A because it is not a printable character. Check your file in a hex editor or something to verify that you have the correct delimiter. Br, Petter

2014-05-18 6:40 GMT+02:00 Jack Yang j...@uow.edu.au: Hi All, I have a local file called mytest.txt (stored in hdfs already). The content is like this:

$ cat -A HDFSLOAD_DIR/mytest.txt
49139801^A25752451^Aunknown$
49139801^A24751754^Aunknown$
49139801^A2161696^Anice$

To load this raw data above, I then defined the table like this in HQL:

create table my_test (
  userid BIGINT,
  movieId BIGINT,
  comment STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

My problem is that when I "SELECT * FROM my_test;", I got this:

NULL NULL NULL
NULL NULL NULL
NULL NULL NULL

I then replaced "^A" with "^" in mytest.txt, and also re-defined my table structure using: FIELDS TERMINATED BY '^'. When I select all, I got the correct results. Any thoughts??? Thanks in advance. Best regards, Jack
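Petter's point -- that a literal two-character "^A" sequence is not the single control character \001 -- is easy to demonstrate outside Hive. This sketch shows why every column comes back NULL when the declared delimiter does not match the file's bytes:

```python
real = "49139801\x0125752451\x01unknown"  # true Ctrl-A (\001) delimiters
fake = "49139801^A25752451^Aunknown"      # literal caret + 'A' characters

print(real.split("\x01"))  # ['49139801', '25752451', 'unknown'] -- 3 fields
print(fake.split("\x01"))  # ['49139801^A25752451^Aunknown'] -- 1 field;
# Hive would try to read 3 typed columns out of this single field,
# fail the BIGINT conversions, and emit NULL NULL NULL per row.
```

A hex dump (`xxd` or `cat -A`, as in the thread) is the authoritative way to tell the two cases apart.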
Re: Finding Max of a column without using any Aggregation functions
Thanks For the sake of this question I wanted to avoid all order by and limit syntax. It's more of a challenge question Regards Sanjay Sent from my iPhone

On Apr 23, 2014, at 2:51 AM, Furcy Pin furcy@flaminem.com wrote: Hi, note that if your table contains the max value several times, all the occurrences will be returned. Note also that if it contains a NULL it will be returned too. How about this?

SELECT score FROM student ORDER BY score DESC LIMIT 1;

Note that on this query Impala is inconsistent with Hive or MySQL, since Impala considers that NULL is greater than everything while Hive and MySQL consider NULL to be smaller than everything. And just for fun, you can also try in Hive:

FROM (
  SELECT score FROM student
  DISTRIBUTE BY '1' SORT BY score DESC
) M
SELECT score LIMIT 1;

FROM (
  SELECT score FROM student
  DISTRIBUTE BY '1'
) M
REDUCE score USING 'sort -rn | head -1' AS score;

The second is just to demonstrate the possibilities of custom reduce, but is greatly inefficient (in speed and memory). And be sure to use SET mapred.reduce.tasks=1; before, if you don't want idle reduce tasks... (I guess I have some spare time too!)

2014-04-23 3:27 GMT+02:00 Subramanian, Sanjay (HQP) sanjay.subraman...@roberthalf.com: Hey guys TABLE=STUDENT COLUMN=SCORE U want to find the max value in the column without using any aggregation functions. It's easy in an RDB context but I was trying to get a solution in Hive (clearly I have some spare time on my hands - LOL)

select nfr.score
from student nfr
left outer join (
  select a.score as fra, b.score as frb
  from (select '1' as dummy, score from student) a
  join (select '1' as dummy, score from student) b
    on a.dummy = b.dummy
  where a.score < b.score
) frab
on frab.fra = nfr.score
where frab.fra is null

Thanks Warm Regards Sanjay
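The self-join trick above (a value is the max iff no row is strictly greater than it) can be checked quickly outside Hive; the scores here are hypothetical:

```python
def max_without_aggregation(scores):
    """Max via self-join logic: keep scores for which no strictly greater score exists."""
    # 'lesser' plays the role of subquery frab: every score that has some greater score.
    lesser = {a for a in scores for b in scores if a < b}
    # The LEFT OUTER JOIN + IS NULL filter: scores NOT in 'lesser' are the max.
    return [s for s in scores if s not in lesser]

print(max_without_aggregation([72, 95, 61, 95, 88]))  # [95, 95]
# Duplicates of the max are all returned -- exactly the caveat Furcy points out
# about the Hive version of this query.
```

It also makes the cost obvious: the cross join compares every pair of rows, so this is O(n^2) where MAX() is O(n).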
Re: create table question
For example if ur name node was hadoop_name_nodeIP:8020 (verify this thru your browser http://hadoop_name_nodeIP:50070) Modified Create Table == CREATE EXTERNAL TABLE states(abbreviation string, full_name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 'hdfs://hp8300one:8020/tmp/states' ; From: Shengjun Xin s...@gopivotal.com To: user@hive.apache.org Sent: Tuesday, April 22, 2014 8:58 PM Subject: Re: create table question in the ql, you set relative path tmp/states, according to the error message, you need to set absolute path On Wed, Apr 23, 2014 at 11:23 AM, EdwardKing zhan...@neusoft.com wrote: I use hadoop 2.2.0 and hive 0.13.0, I want to create a table from an existing file, states.hql is follows: CREATE EXTERNAL TABLE states(abbreviation string, full_name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 'tmp/states' ; [hadoop@master ~]$ hadoop fs -ls 14/04/22 20:17:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 1 items drwxr-xr-x - hadoop supergroup 0 2014-04-22 20:02 tmp [hadoop@master ~]$ hadoop fs -put states.txt tmp/states [hadoop@master ~]$ hadoop fs -ls tmp/states 14/04/22 20:17:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Found 1 items -rw-r--r-- 2 hadoop supergroup 654 2014-04-22 20:02 tmp/states/states.txt Then I execute states.hql [hadoop@master ~]$ hive -f states.hql 14/04/22 20:11:47 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 14/04/22 20:11:47 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. 
Instead, use mapreduce.reduce.speculative
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/home/software/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://master:9000./tmp/states)

It raises the error above. Why? How do I correct it?
2014-04-22 20:12:03,907 INFO [main]: exec.DDLTask (DDLTask.java:createTable(4074)) - Default to LazySimpleSerDe for table states 2014-04-22 20:12:05,147 INFO [main]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(624)) - 0: create_table: Table(tableName:states, dbName:default, owner:hadoop, createTime:1398222724, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:abbreviation, type:string, comment:null), FieldSchema(name:full_name, type:string, comment:null)], location:tmp/states, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim= }), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) 2014-04-22 20:12:05,147 INFO [main]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(306)) - ugi=hadoop ip=unknown-ip-addr cmd=create_table: Table(tableName:states, dbName:default, owner:hadoop, createTime:1398222724, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:abbreviation, type:string, comment:null), FieldSchema(name:full_name, type:string, comment:null)], location:tmp/states, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
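As the replies in this thread note, the root cause is the relative LOCATION path ('tmp/states' instead of '/tmp/states' or a full hdfs:// URI). A tiny pre-flight check one could script before issuing the DDL (the helper name is invented):

```python
from urllib.parse import urlparse

def is_valid_location(location):
    """True if a LOCATION is absolute: either scheme://host/path or a rooted /path."""
    parsed = urlparse(location)
    if parsed.scheme:                     # e.g. hdfs://hp8300one:8020/tmp/states
        return parsed.path.startswith("/")
    return location.startswith("/")       # bare absolute path, e.g. /tmp/states

print(is_valid_location("tmp/states"))                        # False -- the failing DDL
print(is_valid_location("/tmp/states"))                       # True
print(is_valid_location("hdfs://hp8300one:8020/tmp/states"))  # True -- the fixed DDL
```

The MetaException above ("Relative path in absolute URI: hdfs://master:9000./tmp/states") is exactly what happens when Hive concatenates the default filesystem URI with a path that lacks the leading slash.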
Re: Query hangs at 99.97 % for one reducer in Hive
Even 500 reducers sounds like a high number, but I don't know the details of your cluster. Can u provide some details:
- How many nodes in the cluster
- Hive version
- Which distribution (Hortonworks, Apache, CDH, Amazon)
- Node specs
- Partitions in the table
- Number of records

Thanks Sanjay Sent from my iPhone

On Mar 2, 2014, at 3:09 PM, Siddharth Tiwari siddharth.tiw...@live.com wrote: Hi team, the following query hangs at 99.97% for one reducer, kindly help or point to what can be the cause:

drop table if exists sample.dpi_short_lt;
create table sample.dpi_short_lt as
select b.msisdn, a.area_erb, a.longitude, a.latitude,
  substring(b.msisdn,1,2) as country,
  substring(b.msisdn,3,2) as area_code,
  substring(b.start_time,1,4) as year,
  substring(b.start_time,6,2) as month,
  substring(b.start_time,9,2) as day,
  substring(b.start_time,12,2) as hour,
  cast(b.procedure_duration as double) as duracao_ms,
  cast(b.internet_latency as double) as int_internet_latency,
  cast(b.ran_latency as double) as int_ran_latency,
  cast(b.http_latency as double) as int_http_latency,
  (case when b.internet_latency='' then 1 else 0 end) as internet_latency_missing,
  (case when b.ran_latency='' then 1 else 0 end) as ran_latency_missing,
  (case when b.http_latency='' then 1 else 0 end) as http_latency_missing,
  (cast(b.mean_throughput_ul as int) * cast(procedure_duration as int) / 1000) as total_up_bytes,
  (cast(b.mean_throughput_dl as int) * cast(procedure_duration as int) / 1000) as total_dl_bytes,
  cast(b.missing_packets_ul as int) as int_missing_packets_ul,
  cast(b.missing_packets_dl as int) as int_missing_packets_dl
from sample.dpi_large b
left outer join sample.science_new a
  on b.cgi = regexp_replace(a.codigo_cgi_ecgi,'-','')
where msisdn!='';

Hive was heuristically selecting 1000 reducers and it was hanging at 99.97 percent on one reduce task. I then changed the above values to 3GB per reducer and 500 reducers and started hitting this error.
java.lang.RuntimeException: Hive Runtime Error while closing operators: Unable to rename output from: hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812_8390586541316719852-1/_task_tmp.-ext-10001/_tmp.03_0 to: hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812_8390586541316719852-1/_tmp.-ext-10001/03_0
  at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:313)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:516)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
  at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output from: hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812

I have a 22-node cluster running cdh 4.3. Please try to locate what the issue can be. ** Cheers !!! Siddharth Tiwari Have a refreshing day !!! "Every duty is holy, and devotion to duty is the highest form of worship of God." Maybe other people will try to limit me but I don't limit myself
Amazon EMR error
Sorry guys, not sure if I should request help with this error here because it's an error on Amazon EMR Hive. But you guys have been my Hive fraternity for about 2 years now and I thought it best to turn to u for help first

Amazon Hive version
===
AMI version: 2.4.2
Hadoop distribution: Amazon 1.0.3
Applications: Hive 0.11.0.1

HIVE TABLE
===
DROP TABLE solar_installs;
CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs(
  zipcode STRING,
  state STRING,
  sizekw DOUBLE,
  cost DOUBLE,
  date_ STRING,
  lat DOUBLE,
  lon DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE solar_installs SET LOCATION 's3://solarinstallations/data';

DATA SET SAMPLE
==
19801,DE,230.6,,12/30/2014,39.7402405,-75.5521456
80005,CO,7.1,12000.0,01/28/2014,39.851157,-105.132533
19968,DE,5.2,,01/27/2014,38.7776124,-75.3099136
52314,IA,30.0,116900.0,01/17/2014,41.901451,-91.48409
33414,FL,9.28,34000.0,01/16/2014,26.667321,-80.215585
19808,DE,5.2,,01/16/2014,39.721215,-75.668342
19702,DE,7.28,,01/16/2014,39.649586,-75.6998009
19958,DE,5.5,,01/16/2014,38.7638163,-75.1778853
19958,DE,10.56,,01/15/2014,38.7638163,-75.1778853
19958,DE,4.5,,01/15/2014,38.708142,-75.214294
06410,CT,860.0,,01/14/2014,41.5589919,-72.912401
05701,VT,140.4,,01/14/2014,43.598277,-72.955117
19971,DE,7.8,,01/13/2014,38.734048,-75.126563
19971,DE,4.06,,01/13/2014,38.7020005,-75.0961549

QUERY
=
hive -e "select zipcode, count(*) from solar_installs group by zipcode"

ERROR
=
java.lang.RuntimeException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:167)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
  at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
  at org.apache.hadoop.fs.Path.init(Path.java:90)
  at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:588)
  at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1377)
  at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:611)
  at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:149)
  ... 8 more
Re: Amazon EMR error
ok so I spun up another cluster with a previous version and it worked successfully This Amazon Hive version WORKS SUCCESSFULLY = AMI version:2.4.1 Hadoop distribution:Amazon 1.0.3 Applications:Hive 0.11.0.1 From: Sanjay Subramanian sanjaysubraman...@yahoo.com To: user@hive.apache.org user@hive.apache.org; USers Hadoop u...@hadoop.apache.org Sent: Saturday, March 1, 2014 4:57 PM Subject: Amazon EMR error Sorry guys , not sure if I should request help with this error here because its an error on Amazon EMR Hive But you guys have been my Hive fraternity for about 2 years now and I thought it best to turn to u for help first Amazon Hive version === AMI version:2.4.2 Hadoop distribution:Amazon 1.0.3 Applications:Hive 0.11.0.1 HIVE TABLE === DROP TABLE solar_installs; CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs( zipcode STRING, state STRING, sizekw DOUBLE, cost DOUBLE, date_ STRING, lat DOUBLE, lon DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; ALTER TABLE solar_installs SET LOCATION 's3://solarinstallations/data' ; DATA SET SAMPLE == 19801,DE,230.6,,12/30/2014,39.7402405,-75.5521456 80005,CO,7.1,12000.0,01/28/2014,39.851157,-105.132533 19968,DE,5.2,,01/27/2014,38.7776124,-75.3099136 52314,IA,30.0,116900.0,01/17/2014,41.901451,-91.48409 33414,FL,9.28,34000.0,01/16/2014,26.667321,-80.215585 19808,DE,5.2,,01/16/2014,39.721215,-75.668342 19702,DE,7.28,,01/16/2014,39.649586,-75.6998009 19958,DE,5.5,,01/16/2014,38.7638163,-75.1778853 19958,DE,10.56,,01/15/2014,38.7638163,-75.1778853 19958,DE,4.5,,01/15/2014,38.708142,-75.214294 06410,CT,860.0,,01/14/2014,41.5589919,-72.912401 05701,VT,140.4,,01/14/2014,43.598277,-72.955117 19971,DE,7.8,,01/13/2014,38.734048,-75.126563 19971,DE,4.06,,01/13/2014,38.7020005,-75.0961549 QUERY = hive -e select zipcode, count(*) from solar_installs group by zipcode ERROR = java.lang.RuntimeException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at 
org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:167) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82) at org.apache.hadoop.fs.Path.init(Path.java:90) at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:588) at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1377) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:611) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:149) ... 8 more
Re: How to prevent user drop table in Hive metadata?
Cloudera Sentry is awesome and I have implemented this in Cloudera manager 4.7.2 CDH 4.4.0. Thanks again to shreepadma for all answers to my questions on the CDH users group. I can provide guidance on Sentry configs if needed. Sent from my iPhone On Nov 22, 2013, at 4:25 PM, Shreepadma Venugopalan shreepa...@cloudera.com wrote: Apache Sentry is already available and made its first incubating release a couple of months back. On Fri, Nov 22, 2013 at 3:06 PM, Echo Li echo...@gmail.com wrote: Thanks all, that's all very helpful information. Shreepadma, when will the Apache Sentry come GA? On Fri, Nov 22, 2013 at 2:36 PM, Shreepadma Venugopalan shreepa...@apache.org wrote: Apache Sentry (incubating) provides fine-grained role-based authorization for Hive among other components of the Hadoop ecosystem. It currently supports fully secure, fine-grained, role-based authorization for Hive and can be used to prevent the scenario described earlier i.e., prevent a user from dropping a table the user shouldn't be allowed to drop. Shreepadma On Fri, Nov 22, 2013 at 12:55 PM, simon.2.thomp...@bt.com wrote: Thanks Alan - I'll fwd the spec in the Jira to some of our security and integrity people for comment. Simon Dr. Simon Thompson From: Alan Gates [ga...@hortonworks.com] Sent: 22 November 2013 20:53 To: user@hive.apache.org Subject: Re: How to prevent user drop table in Hive metadata? See https://issues.apache.org/jira/browse/HIVE-5837 for a JIRA addressing this. Also, you can use the StorageBasedAuthorizationProvider in Hive, which bases metadata security on file security. So if the user doesn't have permissions to remove the directory that stores the table data, they won't have permissions to drop the table. This isn't perfect, but it's a start. Alan. On Nov 22, 2013, at 11:49 AM, simon.2.thomp...@bt.com simon.2.thomp...@bt.com wrote: Has no one raised a Jira ticket ? Dr. 
Simon Thompson

From: Biswajit Nayak [biswajit.na...@inmobi.com] Sent: 22 November 2013 19:45 To: user@hive.apache.org Subject: Re: How to prevent user drop table in Hive metadata?

Hi Echo, I don't think there is any way to prevent this. I had the same concern in hbase, but found out that it is assumed that users of the system are very much aware of it. I have been into hive for the last 3 months and was looking for some kind of way here, but no luck till now.. Thanks Biswa

On 23 Nov 2013 01:06, Echo Li echo...@gmail.com wrote: Good Friday! I was trying to apply a certain level of security in our hive data warehouse by modifying the access mode of directories and files on hdfs to 755, which I think is good enough to keep a new user from removing data; however, the user can still drop the table definition in the hive cli, and it seems the revoke doesn't help much. Is there any way to prevent this? Thanks, Echo

_ The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system. The firm is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt.

-- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law.
If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
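Alan's StorageBasedAuthorizationProvider suggestion -- deriving metadata permissions from file permissions, so a user who cannot delete the table's directory cannot drop the table -- can be sketched as a toy model. Table names, users, and the permission map here are all invented:

```python
# Toy model of storage-based authorization: DROP TABLE is allowed only if the
# user could also delete the table's backing directory on HDFS.
DIR_WRITERS = {  # directory -> users holding the write bit (invented data)
    "/warehouse/sales":   {"etl_user"},
    "/warehouse/reports": {"etl_user", "analyst"},
}
TABLE_DIRS = {"sales": "/warehouse/sales", "reports": "/warehouse/reports"}

def can_drop_table(user, table):
    """Mirror the StorageBasedAuthorizationProvider rule: drop needs dir write access."""
    return user in DIR_WRITERS[TABLE_DIRS[table]]

print(can_drop_table("analyst", "sales"))    # False -- no write bit on the directory
print(can_drop_table("analyst", "reports"))  # True
```

This is why setting the warehouse directories to 755, as Echo did, protects the data files but not the metadata on its own: the metastore has to consult those permissions too, which is what the provider (or Sentry's role-based model) adds.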
In Beeline what is the syntax for ALTER TABLE ?
Hi guys Using Hive 0.10.0+198 CDH4. Getting this error for the ALTER TABLE command:

jdbc:hive2://dev-thdp5.corp.nextag.com:100> ALTER TABLE outpdir_seller_hidden ADD IF NOT EXISTS
  PARTITION (header_date_partition='2013-10-17', header_servername_partition='lu3')
  LOCATION '/data/output/impressions/outpdir/2013-10-17/008-131018121904385-oozie-oozi-W/outpdir_seller_hidden/sellerhidden/2013-10-17/lu3';

Error: Error while processing statement: FAILED: IllegalArgumentException URI '' in invalid. Must start with file:// or hdfs:// (state=42000,code=4)

Thanks Regards sanjay

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
Re: JSON format files versus AVRO
Hi Thanks I still have to check out the JsonSerDe in hcatalog. U r right, and I did think about adding the unique key as an attribute inside the JSON. Instead of analyzing further I am going to try both methods out and see how my downstream processes will work. I have a 40 step Oozie workflow that needs to be successful after all this :-) Cool thanks Thanks Regards sanjay email : sanjay.subraman...@wizecommerce.com

From: Sushanth Sowmyan khorg...@gmail.com Reply-To: user@hive.apache.org Date: Tuesday, October 8, 2013 11:39 AM To: user@hive.apache.org Subject: Re: JSON format files versus AVRO

Have you had a look at the JsonSerDe in hcatalog to see if it suits your need? It does not support the format you are suggesting directly, but if you made the unique id part of the json object, so that each line was a json record, it would. It's made to be used in conjunction with text tables. Also, even if it proves to not be what you want directly, it already provides a serializer/deserializer

On Oct 7, 2013 4:41 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Sorry if the subject sounds really stupid !
Basically I am re-architecting our web log record format Currently we have Multiple lines = 1 Record format (I have Hadoop jobs that parse the files and create columnar output for Hive tables) [begin_unique_id] Pipe delimited Blah Pipe delimited Blah Pipe delimited Blah Pipe delimited Blah Pipe delimited Blah [end_unique_id] I have created JSON serializers that will log records in the following way going forward unique_id JSON-string This is the plan - I will store the records in a two column table in Hive - Write JSON deserializers in hive HDFs that will take these tables and create hive tables pertaining to specific requirements - Modify current aggregation scripts in Hive I was seeing AVRO format but I don't see the value of using AVO when I feel JSON gives me pretty much the same thing ? Please poke holes in my thinking ! Rip me apart ! Thanks Regards sanjay CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator. CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. 
If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
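The two-column layout described above (unique_id, a tab, then one JSON string per line) can be sketched in plain Python; the field names in the sample record are made up for illustration:

```python
import json

def to_log_line(unique_id, record):
    """Serialize one event as: unique_id <tab> compact-JSON, one record per line."""
    return unique_id + "\t" + json.dumps(record, separators=(",", ":"))

def from_log_line(line):
    """Parse a line back into (unique_id, dict), the inverse of to_log_line."""
    uid, payload = line.split("\t", 1)
    return uid, json.loads(payload)

line = to_log_line("abc123", {"url": "/search", "status": 200})
uid, rec = from_log_line(line)
```

Sushanth's suggestion amounts to the variant where the id is embedded as a field of the JSON object itself (e.g. record["id"] = uid), so each line is a standalone JSON record that the HCatalog JsonSerDe can read directly.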
Re: Execution failed with exit status: 3
Hi, are you running this through Beeswax in Hue? If I recall right, you might need to give the hue user access to submit and run MR jobs on the cluster. Also, are you using YARN or MR1? Thanks and regards, sanjay

From: Martin, Nick <nimar...@pssd.com>
Date: Tuesday, October 8, 2013 10:27 AM
To: user@hive.apache.org
Subject: RE: Execution failed with exit status: 3

Update on this... When I run this in Hive CLI it works perfectly. It only breaks in Hue. I'll send this thread over to hue_user@ and see what they say.

From: Martin, Nick [mailto:nimar...@pssd.com]
Sent: Tuesday, October 08, 2013 12:17 PM
To: user@hive.apache.org
Subject: RE: Execution failed with exit status: 3

Hi Sanjay, thanks for the suggestion. There are no partitions on either table.

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Monday, October 07, 2013 8:19 PM
To: user@hive.apache.org
Subject: Re: Execution failed with exit status: 3

Hi Nick, how many partitions are there in table t1 and table t2? If there are many partitions in either t1 or t2 or both, can you modify your query as follows and see if the error still comes up:

SELECT T1.somecolumn, T2.someothercolumn
FROM (SELECT * FROM t1 WHERE partition_column1='some_val') T1
JOIN (SELECT * FROM t2 WHERE partition_column2='some_val') T2
ON (T1.idfield=T2.idfield)

Thanks and regards, sanjay (email: sanjay.subraman...@wizecommerce.com, irc: sanjaysub (channel #noc), skype: sanjaysubramanian, mobile: (925) 399 2692)

From: Martin, Nick <nimar...@pssd.com>
Date: Monday, October 7, 2013 5:13 PM
To: user@hive.apache.org
Subject: Execution failed with exit status: 3

Hi all, I'm doing a very basic join in Hive and getting the error below. The HiveQL join syntax I'm using is:

SELECT T1.somecolumn, T2.someothercolumn FROM t1 JOIN t2 ON (t1.idfield=t2.idfield)

Driver returned: 3. Errors:
OK
Total MapReduce jobs = 1
setting HADOOP_USER_NAME someuser
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID: Stage-4
Logs:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

I searched through the logs and couldn't find anything terribly useful, although perhaps I'm missing something. Is this a common error I'm just now coming across? On Hive 0.11. Thanks! Nick
Re: Is there any API that tells me what files comprise a hive table?
Perhaps a good thing to have in your Hive cheat sheet :-) I use the following MySQL query against the metastore to find the locations of a Hive table:

echo "select t.TBL_NAME, p.PART_NAME, s.LOCATION from PARTITIONS p, SDS s, TBLS t where t.TBL_ID=p.TBL_ID and p.SD_ID=s.SD_ID" | mysql -uhive_user -ppassword hive_metastore_table -A | grep your_hive_table_name

Thanks and regards, sanjay (email: sanjay.subraman...@wizecommerce.com, irc: sanjaysub (channel #noc), skype: sanjaysubramanian, mobile: (925) 399 2692)

From: Edward Capriolo <edlinuxg...@gmail.com>
Date: Monday, October 7, 2013 11:52 AM
To: user@hive.apache.org
Subject: Re: Is there any API that tells me what files comprise a hive table?

Not a direct API. What I do is this, from Java/Thrift:

Table t = client.getTable(name_of_table);
Path p = new Path(t.getSd().getLocation());
FileSystem fs = FileSystem.get(conf);
List<FileStatus> f = fs.listFiles(p);
// your logic here

On Mon, Oct 7, 2013 at 2:01 PM, demian rosas <demia...@gmail.com> wrote:
Hi all, I want to track the changes made to the files of a Hive table. I wonder whether there is any API that I can use to find out the following:
1. What files in HDFS constitute a Hive table.
2. The size of each of these files.
3. The timestamp of the creation/last update of each of these files.
Also, in a wider view, is there any API that can do the above for HDFS files in general (not only Hive-specific)? Thanks a lot in advance. Cheers.
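Besides querying the metastore database directly, the table location can also be pulled out of DESCRIBE FORMATTED output from the Hive CLI. A small parser sketch; the sample text below is abbreviated and illustrative of the real output:

```python
def table_location(describe_formatted_output):
    """Extract the HDFS location from 'DESCRIBE FORMATTED <table>' output."""
    for line in describe_formatted_output.splitlines():
        if line.strip().startswith("Location:"):
            return line.split("Location:", 1)[1].strip()
    return None  # no Location line found

sample = """# Detailed Table Information
Database:           default
Location:           hdfs://nn:8020/user/hive/warehouse/my_table
Table Type:         MANAGED_TABLE"""
loc = table_location(sample)
```

With the location in hand, sizes and modification times of the underlying files come from the HDFS side (e.g. `hadoop fs -ls` on that path), which also answers the general non-Hive question in the thread.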
JSON format files versus AVRO
Sorry if the subject sounds really stupid! Basically I am re-architecting our web log record format. Currently we have a multiple-lines-per-record format (I have Hadoop jobs that parse the files and create columnar output for Hive tables):

[begin_unique_id]
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
[end_unique_id]

I have created JSON serializers that will log records in the following way going forward: unique_id <tab> JSON-string. This is the plan:
- Store the records in a two-column table in Hive
- Write JSON deserializers (Hive UDFs) that will take these tables and create Hive tables pertaining to specific requirements
- Modify the current aggregation scripts in Hive

I looked at the Avro format, but I don't see the value of using Avro when JSON gives me pretty much the same thing. Please poke holes in my thinking! Rip me apart! Thanks and regards, sanjay
Re: Execution failed with exit status: 3
Hi Nick, how many partitions are there in table t1 and table t2? If there are many partitions in either t1 or t2 or both, can you modify your query as follows and see if the error still comes up:

SELECT T1.somecolumn, T2.someothercolumn
FROM (SELECT * FROM t1 WHERE partition_column1='some_val') T1
JOIN (SELECT * FROM t2 WHERE partition_column2='some_val') T2
ON (T1.idfield=T2.idfield)

Thanks and regards, sanjay (email: sanjay.subraman...@wizecommerce.com, irc: sanjaysub (channel #noc), skype: sanjaysubramanian, mobile: (925) 399 2692)

From: Martin, Nick <nimar...@pssd.com>
Date: Monday, October 7, 2013 5:13 PM
To: user@hive.apache.org
Subject: Execution failed with exit status: 3

Hi all, I'm doing a very basic join in Hive and getting the error below. The HiveQL join syntax I'm using is:

SELECT T1.somecolumn, T2.someothercolumn FROM t1 JOIN t2 ON (t1.idfield=t2.idfield)

Driver returned: 3. Errors:
OK
Total MapReduce jobs = 1
setting HADOOP_USER_NAME someuser
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID: Stage-4
Logs:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.MapredLocalTask

I searched through the logs and couldn't find anything terribly useful, although perhaps I'm missing something. Is this a common error I'm just now coming across? On Hive 0.11. Thanks! Nick
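The reshaped query above filters each side down to a single partition before the join runs, instead of joining everything and filtering afterward. The effect can be mimicked with toy data in Python; the column names are the placeholders from the thread:

```python
def pruned_join(t1_rows, t2_rows, p1_val, p2_val):
    """Filter each side to one 'partition' first, then hash-join on idfield."""
    t1 = [r for r in t1_rows if r["partition_column1"] == p1_val]
    t2_by_id = {r["idfield"]: r for r in t2_rows
                if r["partition_column2"] == p2_val}
    # Only rows surviving both partition filters reach the join.
    return [(r["somecolumn"], t2_by_id[r["idfield"]]["someothercolumn"])
            for r in t1 if r["idfield"] in t2_by_id]

t1 = [{"idfield": 1, "somecolumn": "a", "partition_column1": "p"},
      {"idfield": 2, "somecolumn": "b", "partition_column1": "q"}]
t2 = [{"idfield": 1, "someothercolumn": "x", "partition_column2": "p"}]
out = pruned_join(t1, t2, "p", "p")
```

In Hive the same idea lets the planner read only the named partitions' directories, which shrinks the local-task side of a map join and can avoid exactly the kind of MapredLocalTask failure reported here.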
Hiveserver2 Authentication (openLDAP) and Authorization (using Sentry)
Hi guys, DISCLAIMER: I have no affiliation with Cloudera and I am writing this mail of my own free will, with the hope of helping fellow Hive users who will be implementing security around Hive. I have implemented Hiveserver2 authentication (OpenLDAP) and authorization (using Cloudera Sentry). I am super-excited because we can now open our Hive data platform in read-only mode to remote clients in the company and SAS clients! We are using the Cloudera Manager Standard distribution (still can't afford support costs!). Thanks, Sanjay

The stack I am planning to push to production after my Hive/YARN/Oozie tests succeed is as follows.

Group 1 (CDH4), hosts: dev-thdp5.corp.nextag.com

Component | Version | CDH Version
Impala | 1.1.1 | Not applicable
Yarn (CDH4 only) | 2.0.0+1475 | CDH4
HDFS (CDH4 only) | 2.0.0+1475 | CDH4
Hue Plugins | 2.5.0+139 | CDH4
Sqoop2 (CDH4 only) | 1.99.2+85 | CDH4
MapReduce 2 (CDH4 only) | 2.0.0+1475 | CDH4
HBase | 0.94.6+132 | CDH4
HCatalog (CDH4 only) | Unavailable | Not installed or path incorrect
Oozie | 3.3.2+92 | CDH4
Zookeeper | 3.4.5+23 | CDH4
Hue | 2.5.0+139 | CDH4
MapReduce 1 (CDH4 only) | 2.0.0+1475 | CDH4
Pig | 0.11.0+33 | CDH4
HttpFS (CDH4 only) | 2.0.0+1475 | CDH4
Hadoop | 2.0.0+1475 | CDH4
Hive | 0.10.0+198 | CDH4
Lily HBase Indexer (CDH4 only) | 1.2+2 | CDH4
Solr (CDH4 only) | 4.4.0+69 | CDH4
Flume NG | 1.4.0+23 | CDH4
Java | 1.6.0_31 (Java HotSpot 64-Bit Server VM, build 20.6-b01, mixed mode) | Not applicable
Cloudera Manager Agent | 4.7.0 | Not applicable
Re: Issue while quering Hive
With regards to splitting and compression there are really two options as of now: if you are using Sequence Files, then Snappy; if you are using text files then LZO is great (you have to jump through a few minor hoops to get LZO to work, and I can provide guidance on that). Please don't use GZ (not splittable), or worse BZ2 (too slow to compress/decompress). The only compelling reason I use GZIP in production is that my log records span MULTIPLE LINES, so if I used regular splittable text files a split could land in the middle of a record. sanjay

From: Nitin Pawar <nitinpawar...@gmail.com>
Date: Monday, September 16, 2013 5:07 AM
To: user@hive.apache.org
Subject: Re: Issue while quering Hive

As per my understanding, hadoop 1.x does not help you process compressed files in a parallel manner (at least this was the case a few months back). The bzip2 splitting etc. was added in hadoop 2.x, as per my understanding.

On Mon, Sep 16, 2013 at 5:18 PM, Garg, Rinku <rinku.g...@fisglobal.com> wrote:
Thanks Nitin, that way it worked. But in that case Hadoop will not be able to split my file into chunks/blocks and run multiple maps in parallel. This can cause under-utilization of my cluster's 'mapping' power. Is that true? Thanks and regards, Rinku Garg

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 16 September 2013 15:57
To: user@hive.apache.org
Subject: Re: Issue while quering Hive

Does your .gz file contain the data in SequenceFile format, or is it a plain CSV? Looking at the filename I think it is a plain CSV file, so I would recommend that you create a normal table with TextInputFormat (the default), load the data into the new table and give it a try.

On Mon, Sep 16, 2013 at 3:36 PM, Garg, Rinku <rinku.g...@fisglobal.com> wrote:
Hi Nitin, yes, I created the table with SEQUENCEFILE. Thanks and regards, Rinku Garg

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 16 September 2013 14:19
To: user@hive.apache.org
Subject: Re: Issue while quering Hive

Look at the error message: "Caused by: java.io.IOException: hdfs://localhost:54310/user/hive/warehouse/cpj_tbl/cpj.csv.gz not a SequenceFile". Did you create the table with SEQUENCEFILE?

On Mon, Sep 16, 2013 at 1:33 PM, Garg, Rinku <rinku.g...@fisglobal.com> wrote:
Hi All, I have set up Hadoop and Hive and am trying to load a gzip file into the Hadoop cluster. The files load successfully and can be viewed in the web UI. While executing a SELECT query I get the error below:

ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:nxtbig (auth:SIMPLE) cause:java.io.IOException: java.lang.reflect.InvocationTargetException
2013-09-16 09:11:18,971 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:369)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.init(HadoopShimsSecure.java:316)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:430)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:395)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
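The "gzip is not splittable" point above can be demonstrated directly: a gzip stream can only be decoded from byte 0, so a mapper handed a byte range from anywhere else in a .gz file has no valid starting point. A small self-contained illustration:

```python
import gzip

# Build a multi-record text payload and gzip it whole, the way a log
# shipper would before dropping a .gz file into HDFS.
payload = b"".join(b"record %05d\n" % i for i in range(1000))
blob = gzip.compress(payload)

# Reading from byte 0 works: the stream carries its header up front.
whole = gzip.decompress(blob)

# A "split" that starts anywhere past byte 0 cannot be decoded: there is
# no sync marker a reader could seek to, which is why MapReduce must feed
# the entire .gz file to a single mapper.
try:
    gzip.decompress(blob[1:])
    mid_ok = True
except OSError:
    mid_ok = False
```

SequenceFiles with Snappy (and indexed LZO text) avoid this because they embed sync markers, so a reader can resynchronize at a split boundary.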
Inner Map key and value separators
Hi guys, I have to load data into the following data type in Hive: map<string,map<string,string>>. Is there a way to define custom SEPARATORS (while creating the table) for:
- the inner map collection items
- the inner map keys

The delimiters for second-level maps are \004 and \005 per this thread: http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201009.mbox/%3c4f6b25afffcafe44b6259a412d5f9b101c07a...@exchmbx104.netflix.com%3E Thanks, sanjay
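Assuming the default delimiter layering referenced in that thread (\002 between top-level map entries, \003 between outer key and value, \004 between inner-map entries, \005 between inner key and value), a serialized map<string,map<string,string>> cell can be decoded like this. The sketch only illustrates the layering; Hive does this internally via LazySerDe:

```python
def parse_nested_map(cell):
    """Decode map<string, map<string,string>> using Hive's default delimiters:
    \x02 separates outer entries, \x03 splits outer key/value,
    \x04 separates inner entries, \x05 splits inner key/value."""
    result = {}
    for entry in cell.split("\x02"):
        key, inner = entry.split("\x03", 1)
        result[key] = dict(kv.split("\x05", 1) for kv in inner.split("\x04"))
    return result

# Two outer keys; "us" carries a two-entry inner map.
cell = "us\x03ca\x05sf\x04ny\x05nyc" + "\x02" + "eu\x03fr\x05paris"
m = parse_nested_map(cell)
```

Data files written with these byte values in the right positions load straight into the nested-map column without any custom SerDe.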
Re: question about partition table in hive
A couple of days back, Erik Sammer, at the Hadoop hands-on lab at the Cloudera Sessions, demonstrated how to achieve dynamic partitioning using Flume: it created the partitioned directories on HDFS, which are then readily usable by Hive. Understanding what I can from the two lines of your mail below, I would configure Flume to do dynamic partitioning (YEAR, MONTH, DAY, HOUR) and create those directories in HDFS, then create Hive tables with those partitions and run the queries. As Stephen said earlier, experiment like crazy, and please share: it will make all of us better as well! Thanks, sanjay

From: ch huang <justlo...@gmail.com>
Date: Thursday, September 12, 2013 6:55 PM
To: user@hive.apache.org
Subject: question about partition table in hive

Hi all: I use Flume to collect log data and put it in HDFS. I want to use Hive to do some calculation and query based on a time range, so I want to use a partitioned table. But the data file in HDFS is one big file; how can I put it into a partitioned table in Hive?
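The YEAR/MONTH/DAY/HOUR layout that Flume would write (and Hive would mount as partitions) can be sketched as a simple path builder; the base path and table name in the note below are made up for illustration:

```python
from datetime import datetime

def partition_path(base, ts):
    """Build a Hive-style partition directory for an event timestamp."""
    return "%s/year=%04d/month=%02d/day=%02d/hour=%02d" % (
        base, ts.year, ts.month, ts.day, ts.hour)

p = partition_path("/data/weblogs", datetime(2013, 9, 12, 18, 55))
```

Each directory written this way can then be attached to a table with ALTER TABLE ... ADD PARTITION (year=..., month=..., day=..., hour=...) LOCATION '...', so time-range queries touch only the matching directories.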
Re: Interesting claims that seem untrue
I have not read the full blogs, but in the year 2013, IMHO, LOC is a very old metric; it does not define good software any more.

From: Edward Capriolo <edlinuxg...@gmail.com>
Date: Thursday, September 12, 2013 7:19 AM
To: hive-u...@hadoop.apache.org, hive-...@hadoop.apache.org
Subject: Interesting claims that seem untrue

I was reading the Hortonworks blog and found an interesting article: http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/#comment-160753 There is a very interesting graphic which attempts to show lines of code in the 12 release: http://hortonworks.com/wp-content/uploads/2013/09/hive4.png I do not know how the numbers are calculated (they are probably counting code generated by test output), but besides that they are wrong. One claim is that Cloudera contributed 4,244 lines of code. To debunk that claim: in https://issues.apache.org/jira/browse/HIVE-4675 Brock Noland from Cloudera created the ptest2 testing framework. He did all the work for ptest2 in Hive 12, and it is clearly more than 4,244 lines. It consists of 84 Java files:

[edward@desksandra ptest2]$ find . -name "*.java" | wc -l
84

and by itself is 8001 lines of code:

[edward@desksandra ptest2]$ find . -name "*.java" | xargs cat | wc -l
8001
[edward@desksandra hive-trunk]$ wc -l HIVE-4675.patch
7902 HIVE-4675.patch

This is not the only feature from Cloudera in Hive 12. There is also a section of the article that talks of a ROAD MAP for Hive features. I did not know we (Hive) had a road map. I have advocated switching to feature-based releases and having a road map before, but it was suggested that might limit people from itch-scratching.
Re: Sentry Meetup
Ahh, why not SFO :-) I am struggling to implement Sentry and would love some input. Shreepadma, thanks for all your clarifications, but I am only partially done with my implementation. I have a workaround whereby I can blank out tables in Hive's default db from all hiveserver2 JDBC users; this is not elegant, but it will keep my production tables secure. sanjay

From: Shreepadma Venugopalan <shreepa...@cloudera.com>
Date: Wednesday, September 4, 2013 3:40 PM
To: user@hive.apache.org
Subject: Sentry Meetup

Sentry is a fine-grained role-based authorization system for Hive, among others. We're planning a Sentry meetup in NYC coinciding with Hadoop World and Strata 2013. Please RSVP here (http://www.meetup.com/Sentry-User-Meetup/events/138189232/) if you wish to attend the meetup and hear more about how Sentry can help secure your Hive deployment. Thanks. Shreepadma
Re: how to config the job id in oozie
Wow, that's a great idea Artem.

From: Artem Ervits <are9...@nyp.org>
Date: Wednesday, September 4, 2013 5:07 AM
To: user@hive.apache.org
Subject: Re: how to config the job id in oozie

You can also use a shell action to generate a timestamp in the format you want and pass it to the next action as a parameter. I do agree it should be easier. Artem Ervits, Data Analyst, New York Presbyterian Hospital

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Tuesday, September 03, 2013 02:34 PM
To: user@hive.apache.org
Subject: Re: how to config the job id in oozie

Hi, see here: http://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a4.2.1_Basic_EL_Constants

String timestamp() returns the UTC current date and time in W3C format down to the second (YYYY-MM-DDThh:mm:ss.sZ), i.e. 1997-07-16T19:20:30.45Z. I don't like the fact that there are no Oozie string-format EL functions where you can format the DATE as you want. See if this works for you. Otherwise you have to pass a param when you submit the Oozie job and append that param to the Oozie workflow ID inside the workflow.xml:

oozie job -oozie http://oozie_server:11000/oozie -config job.properties -D START_TIME=$(date +%Y_%m_%dT%H_%M_%S) -run

Inside your workflow.xml:

<property>
  <name>mapreduce.job.name</name>
  <value>MyJobName_${wf:id()}_${START_TIME}</value>
</property>

Thanks, sanjay

From: 李宏展 <lihongzhan...@gmail.com>
Date: Tuesday, September 3, 2013 7:43 AM
To: user@hive.apache.org
Subject: how to config the job id in oozie

When I submit a workflow, Oozie returns a job id like 002-130903214537362-oozie-*. But in this page, http://oozie.apache.org/docs/3.3.2/DG_Examples.html, the job id is 14-20090525161321-oozie-tucu. A job id with a date is really useful for me; how should I configure Oozie for this?

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.
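The shell substitution above, $(date +%Y_%m_%dT%H_%M_%S), just produces a filesystem-safe timestamp to append to ${wf:id()}. The same format in Python, useful for checking what the START_TIME parameter will look like:

```python
from datetime import datetime

def oozie_start_time(now):
    """Mirror the shell `date +%Y_%m_%dT%H_%M_%S` used for the START_TIME param."""
    return now.strftime("%Y_%m_%dT%H_%M_%S")

stamp = oozie_start_time(datetime(2013, 9, 3, 14, 34, 5))
```

Underscores instead of colons and slashes keep the value safe to embed in job names and HDFS paths.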
Re: Sentry Meetup
Thanks, I will look for it in the CDH users group. Regards, sanjay

From: Shreepadma Venugopalan <shreepa...@cloudera.com>
Date: Wednesday, September 4, 2013 5:33 PM
To: user@hive.apache.org
Subject: Re: Sentry Meetup

Hi Sanjay, the problems you are encountering with Sentry are due to misconfiguration. These are not bugs. We have responded to your questions on cdh-user@. Shreepadma

On Wed, Sep 4, 2013 at 3:49 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
Ahh, why not SFO :-) I am struggling to implement Sentry and would love some input. Shreepadma, thanks for all your clarifications, but I am only partially done with my implementation. I have a workaround whereby I can blank out tables in Hive's default db from all hiveserver2 JDBC users; this is not elegant, but it will keep my production tables secure. sanjay

From: Shreepadma Venugopalan <shreepa...@cloudera.com>
Date: Wednesday, September 4, 2013 3:40 PM
To: user@hive.apache.org
Subject: Sentry Meetup

Sentry is a fine-grained role-based authorization system for Hive, among others. We're planning a Sentry meetup in NYC coinciding with Hadoop World and Strata 2013. Please RSVP here (http://www.meetup.com/Sentry-User-Meetup/events/138189232/) if you wish to attend the meetup and hear more about how Sentry can help secure your Hive deployment. Thanks. Shreepadma
Re: Hive Statistics information
Thanks Ravi, let me give this a shot.
Regards, sanjay

From: Ravi Kiran maghamraviki...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 30, 2013 10:53 PM
To: user@hive.apache.org
Subject: Re: Hive Statistics information

Hi Sanjay,
What do the logs say when you fire the ANALYZE TABLE ... statement on a table? One minor correction to the dbconnectionstring: use &amp; (not a bare &) between the query parameters, since hive-site.xml is XML:
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1&amp;password=hive_user_vso1
I hope the database hive_vso1_tempstatsstore exists in your MySQL?
Regards, Ravi Magham

On Sat, Aug 31, 2013 at 6:15 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Hi guys
I have configured Hive to use MySQL for all statistics:
hive.stats.atomic=false
hive.stats.autogather=true
hive.stats.collect.rawdatasize=true
hive.stats.dbclass=jdbc:mysql
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1password=hive_user_vso1
hive.stats.jdbc.timeout=30
hive.stats.jdbcdriver=com.mysql.jdbc.Driver
hive.stats.retries.max=0
hive.stats.retries.wait=3000
However, the MySQL Hive statistics tables don't seem to have any data. Where does Hive store the statistics information?
sanjay

CONFIDENTIALITY NOTICE: This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
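The &amp; correction above matters because hive-site.xml is XML, where a bare & is illegal inside an element value. A minimal Python sketch (host and credentials are hypothetical) of escaping such a value before pasting it into the config:

```python
from xml.sax.saxutils import escape

# Hypothetical JDBC connection string with two query parameters.
raw = "jdbc:mysql://dbhost/statsdb?user=hive_user&password=hive_pw"

# Inside an XML <value> element the ampersand must be written as &amp;
xml_value = escape(raw)
print(xml_value)
# jdbc:mysql://dbhost/statsdb?user=hive_user&amp;password=hive_pw
```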
Re: Hive Query - Issue
Hi
When you do a SELECT *, the partition columns are returned as the last N columns (if you have N partition columns). In this case the 63rd column in SELECT * is the partition column. Instead of SELECT *, do a SELECT col1, col2, col3, ...
Not to show the candle to the sun if you are an AWK/SED ninja :-) but to get all the columns from Hive you can do this:
hive -e "describe ur_table_name" | awk '{print $1","}' | sed '1i SELECT' | less
Thanks, sanjay

From: Manickam P manicka...@outlook.com
Reply-To: user@hive.apache.org
Date: Monday, September 2, 2013 4:32 AM
To: user@hive.apache.org
Subject: Hive Query - Issue

Hello Experts,
When I try to execute the below query I get an error. Please help me correct this.
insert overwrite table table_baseline partition (sourcedate='base_2013_08')
select * from (select * from table_a where sourcedate='tablea_2013_08'
union all
select * from table_b where sourcedate='tableb_2013_08') final
My intention is to populate table_baseline with all records from table_a and table_b, with the partition. I get the below error:
Error in semantic analysis: Line 1:23 Cannot insert into target table because column number/types are different ''BASE_2013_08'': Table insclause-0 has 62 columns, but query has 63 columns.
I verified the column count and types; everything is the same, yet it reports a difference. The same query works fine without partitions in all three tables but fails with partitions. Please help.
Thanks, Manickam P
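The advice above (list the columns explicitly instead of SELECT *) can be sketched in Python: given a table's column list, drop the partition column before generating the SELECT. Column names here are hypothetical, mirroring the thread's example:

```python
# Sketch: build an explicit SELECT list that omits the partition column,
# since SELECT * from a partitioned table returns it as an extra column.
all_columns = ["col1", "col2", "col3", "sourcedate"]  # last one is the partition
partition_columns = {"sourcedate"}

select_list = ", ".join(c for c in all_columns if c not in partition_columns)
query = f"SELECT {select_list} FROM table_a WHERE sourcedate='tablea_2013_08'"
print(query)
# SELECT col1, col2, col3 FROM table_a WHERE sourcedate='tablea_2013_08'
```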
Hive Statistics information
Hi guys
I have configured Hive to use MySQL for all statistics:
hive.stats.atomic=false
hive.stats.autogather=true
hive.stats.collect.rawdatasize=true
hive.stats.dbclass=jdbc:mysql
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1password=hive_user_vso1
hive.stats.jdbc.timeout=30
hive.stats.jdbcdriver=com.mysql.jdbc.Driver
hive.stats.retries.max=0
hive.stats.retries.wait=3000
However, the MySQL Hive statistics tables don't seem to have any data. Where does Hive store the statistics information?
sanjay
SAS--Hive integration
Hi guys
Has anyone tried SAS--Hive integration successfully? I tried a simple query in SAS (select col1 from table1 limit 10) and it opened 3 connections to hive-server and killed it :-( I will set up a dev environment for SAS and Hive to test all this, but I was wondering if you guys had any clues. Any thoughts?
sanjay
Re: hiveserver2 with OpenLDAP ?
Hi Mikhail
Good news! I deleted the current user, which used dn: cn=, and re-created it using dn: uid=, and it looks like it works. So that's an option 3 as well. Nevertheless, for academic reasons I am going to give your option 2 a try and report back what I find.
Thanks again. Regards, sanjay

From: Mikhail Antonov olorinb...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 23, 2013 9:40 PM
To: user@hive.apache.org
Subject: Re: hiveserver2 with OpenLDAP ?

You don't need to rebuild Hive, no. I see 2 solutions for that:
1) Modify your LDAP data schema so it accepts the uid instead of the cn attribute (and actually I think it should be working, as after all there can't be such an obvious bug in Hive...), so probably whoever tested that provider was using a different LDAP schema than what we have.
2) Write your own class similar to the one mentioned, point Hive to it in hive-site.xml, set hive.server2.authentication=CUSTOM, and provide the full class name of your class (and make sure it's on the classpath, of course).
I prefer the second way. Hope it helps. Let me know if it worked for you.
*General question to folks* - am I missing something, or is there really a bug in the LDAP authenticator that doesn't allow precise configuration of the binding string?
Mikhail

2013/8/23 Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Thanks a lot Mikhail for getting back. Does that mean I cannot use this via beeline unless I change the code and build Hive again?
Thanks, sanjay

From: Mikhail Antonov olorinb...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 23, 2013 6:17 PM
To: user@hive.apache.org
Subject: Re: hiveserver2 with OpenLDAP ?

I see the same behavior, and here's the reason. LdapAuthenticationProviderImpl is the class responsible for LDAP authentication in Hive. Look at this class. It has this snippet (CDH 4.2.1, Hive 0.10):

// setup the security principal
String bindDN;
if (baseDN != null) {
  bindDN = "uid=" + user + "," + baseDN;
} else {
  bindDN = user;
}

And according to Cloudera documentation, you're supposed to set the baseDN param for OpenLDAP, but not for AD. So when this baseDN isn't present, Hive takes the username as-is (say user1) and tries to bind to the LDAP server, which works. When you set this baseDN, it constructs the bind string as uid=user1,dc=wizetest,dc=com. But most likely your OpenLDAP expects it to be cn=user1,dc=wizetest,dc=com; the uid attribute isn't being used. I think the way to go is to provide your own LDAP authenticator, which has more control over how the LDAP bind string is generated.
Mikhail

2013/8/23 Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Hi guys
I tested hiveserver2 with Active Directory - it works. With OpenLDAP it does not. Is there any specific syntax for specifying the LDAP url or baseDN?

<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://myserver.corp.nextag.com:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>dc=wizetest,dc=com</value>
</property>

Beeline keeps giving this error:
jdbc:hive2://dev-thdp5:1 !connect jdbc:hive2://dev-thdp5:1 hiveuser1 org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://dev-thdp5:1
Error: Could not establish connection to jdbc:hive2://dev-thdp5:1: Peer indicated failure: Error validating the login (state=08S01,code=0)
Any clues?
Thanks, sanjay

--
Thanks, Michael Antonov
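The bind-string logic Mikhail quotes can be sketched in Python. This is a simulation of the Java snippet's behavior, not Hive's actual code; the attr parameter is an addition to show what an OpenLDAP schema keyed on cn would need instead of the hardcoded uid:

```python
def bind_dn(user, base_dn=None, attr="uid"):
    """Mimic the bind-string construction in LdapAuthenticationProviderImpl.

    Hive hardcodes the uid attribute; attr="cn" illustrates the variant
    an OpenLDAP schema keyed on cn would require.
    """
    if base_dn is not None:
        return f"{attr}={user},{base_dn}"
    # No baseDN configured: Hive binds with the raw username.
    return user

print(bind_dn("user1"))                               # user1
print(bind_dn("user1", "dc=wizetest,dc=com"))         # uid=user1,dc=wizetest,dc=com
print(bind_dn("user1", "dc=wizetest,dc=com", "cn"))   # cn=user1,dc=wizetest,dc=com
```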
hiveserver2 with OpenLDAP ?
Hi guys
I tested hiveserver2 with Active Directory - it works. With OpenLDAP it does not. Is there any specific syntax for specifying the LDAP url or baseDN?

<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://myserver.corp.nextag.com:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>dc=wizetest,dc=com</value>
</property>

Beeline keeps giving this error:
jdbc:hive2://dev-thdp5:1 !connect jdbc:hive2://dev-thdp5:1 hiveuser1 org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://dev-thdp5:1
Error: Could not establish connection to jdbc:hive2://dev-thdp5:1: Peer indicated failure: Error validating the login (state=08S01,code=0)
Any clues?
Thanks, sanjay
Re: Alter or Query a table with field name 'date' always get error
Yes, date is a reserved word. My recommendations:

If your table is Hive managed (i.e. you created the table without using EXTERNAL):
- Copy this Hive table's data from its HDFS location to another location
- Drop the table
- CREATE an EXTERNAL TABLE, replacing the field name date with something like date_
- ALTER TABLE ... ADD IF NOT EXISTS PARTITION (date_='your_partition') LOCATION '/path/to/your/HDFS/where/you/copied/the/data'

If your table is EXTERNAL:
- Drop the table
- CREATE an EXTERNAL TABLE, replacing the field name date with something like date_
- ALTER TABLE ... ADD IF NOT EXISTS PARTITION (date_='your_partition') LOCATION '/path/to/your/HDFS/where/you/copied/the/data'

Hope this helps. Good luck, sanjay

From: Sonya Ling sonya_ling1...@yahoo.com
Reply-To: user@hive.apache.org
Date: Thursday, August 22, 2013 6:45 PM
To: user@hive.apache.org
Subject: Alter or Query a table with field name 'date' always get error

Hi:
I have a table transaction3 with a field named 'date'. That table is the target of a Sqoop import from MySQL. The table in MySQL has a field named 'date', and Sqoop does not allow column-name mapping, so the field name 'date' is kept in the transaction3 table in Hive. 'date' is the partition field of transaction3. I tried to change the column name (existing code uses a created_date field):
ALTER TABLE transaction3 CHANGE date created_date STRING;
I got an error:
FAILED: ParseException line 1:32 extraneous input 'date' expecting Identifier near 'EOF'
I cannot get around it. I then queried a count:
SELECT COUNT(*) FROM transaction3 WHERE date = '2013-08-15';
and got another error:
FAILED: ParseException line 1:40 cannot recognize input near 'date' '=' ''2013-08-15'' in expression specification
A query on another table with a field named 'created_date' works fine with no error. I'd like to know if date is a reserved word. Is there any way to get around it, either by changing the field name or in the query? Thanks.
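Depending on your Hive version, backtick-quoting a reserved word (e.g. `date`) may also let the parser accept it, avoiding the rename entirely; this is a hedged suggestion, not something the thread tested. A small Python sketch of generating HiveQL that quotes reserved identifiers (the reserved-word set here is a tiny illustrative subset, not Hive's full list):

```python
# Quote reserved words with backticks when generating HiveQL.
RESERVED = {"date", "timestamp", "user"}  # illustrative subset only

def quote(col):
    return f"`{col}`" if col.lower() in RESERVED else col

cols = ["id", "date", "amount"]
query = "SELECT {} FROM transaction3".format(", ".join(quote(c) for c in cols))
print(query)  # SELECT id, `date`, amount FROM transaction3
```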
Re: How to perform arithmetic operations in hive
Yes, this will work. Arithmetic operations also work in a WHERE clause. Example:
select channel_id from keyword_impressions_log where header_date_partition='2013-08-21' and channel_id*10=290640 limit 10

From: Justin Workman justinjwork...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, August 22, 2013 7:17 AM
To: user@hive.apache.org
Subject: Re: How to perform arithmetic operations in hive

Try: select emp_name, (emp_no * 10) from emp_table;
Sent from my iPhone

On Aug 22, 2013, at 8:14 AM, Sandeep Nemuri nhsande...@gmail.com wrote:
Hi all,
Can we perform arithmetic operators in a select command? I have a table emp_table with columns emp_name and emp_no, and I want to multiply the data in emp_no by 10. I tried:
select * from emp_table where emp_no=emp_no*10;
But it did not work. Help me out. Thanks in advance.
--Regards, Sandeep Nemuri
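The WHERE-clause arithmetic above can be simulated on in-memory rows; the data here is hypothetical and just mirrors the channel_id*10=290640 predicate:

```python
# Simulate: WHERE channel_id * 10 = 290640 on hypothetical rows.
rows = [{"channel_id": 29064}, {"channel_id": 12345}, {"channel_id": 29064}]

matches = [r for r in rows if r["channel_id"] * 10 == 290640]
print(len(matches))  # 2
```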
Re: using hive with multiple schemas
Some ideas to get you started:

CREATE EXTERNAL TABLE IF NOT EXISTS names(fullname STRING, address STRING, phone STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE EXTERNAL TABLE IF NOT EXISTS names_detail(id BIGINT, fullname STRING, address STRING, gender STRING, phone STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE names SET LOCATION 'hdfs://namenodeserver:port/path/to/dir/in/hdfs/Schema1.csv';
ALTER TABLE names_detail SET LOCATION 'hdfs://namenodeserver:port/path/to/dir/in/hdfs/Schema2.csv';

Checkpoint to see if the data is there:
# These won't use MapReduce on newer Hive; on versions older than 0.11 they will
hive -e "select * from names"
hive -e "select * from names_detail"

Join query (may not be your business case, just an illustration of an inner join):
SELECT nd.id, nd.gender, n.fullname, n.address, n.phone
FROM names n
JOIN names_detail nd ON n.fullname = nd.fullname AND n.phone = nd.phone;

This query is not tested, so please tweak as appropriate to make it work. Hope this helps. Good luck, sanjay

From: Chris Driscol cdris...@rallydev.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 7:24 AM
To: user@hive.apache.org
Subject: using hive with multiple schemas

Hi - I just started to get my feet wet with Hive and have a question I have not been able to find an answer to. Suppose I have 2 CSV files:

cat Schema1.csv
Name, Address, Phone
Chris, address1, 999-999-

cat Schema2.csv
Id, Name, Address, Gender, Phone
13, Tom, address2, male, 888-888-

I put these two files into Hadoop and want to be able to query these 2 different schemas via Hive. Do I need to create two tables in Hive to represent both schemas and use a join? Or is there a better way to handle the two different schemas? Please reply back with any other specific questions; I realize this is somewhat open-ended. Thanks!
--
-cd
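The join on (fullname, phone) suggested above can be simulated in Python; rows here are hypothetical stand-ins for the two CSV schemas:

```python
# Inner join of the two "schemas" on (fullname, phone), hypothetical rows.
names = [("Chris", "address1", "999-9999")]                       # Schema1
names_detail = [(13, "Tom", "address2", "male", "888-8888"),      # Schema2
                (14, "Chris", "address1", "female", "999-9999")]

# Index the smaller table by the join key, then probe with the other.
index = {(n, p): (n, a, p) for n, a, p in names}
joined = [(i, g, *index[(n, p)])
          for i, n, a, g, p in names_detail if (n, p) in index]
print(joined)  # [(14, 'female', 'Chris', 'address1', '999-9999')]
```

This hash-join shape (build an index on one side, probe with the other) is also roughly what a map-side join does.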
Re: only one mapper
Hi
Try this setting in your Hive query:
SET mapreduce.input.fileinputformat.split.maxsize=<some bytes>;
If you set this value low, the MR job will use this size to split the input LZO files and you will get multiple mappers (and make sure the input LZO files are indexed, i.e. .lzo.index files are created).
sanjay

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 10:43 AM
To: user@hive.apache.org
Subject: Re: only one mapper

LZO files are only splittable if you index them. Sequence files compressed with LZO are splittable without being indexed. Snappy + SequenceFile is a better option than LZO.

On Wed, Aug 21, 2013 at 1:39 PM, Igor Tatarinov i...@decide.com wrote:
LZO files are combinable, so check your max split setting.
http://mail-archives.apache.org/mod_mbox/hive-user/201107.mbox/%3c4e328964.7000...@gmail.com%3E
igor
decide.com

On Wed, Aug 21, 2013 at 2:17 AM, 闫昆 yankunhad...@gmail.com wrote:
hi all
When I use Hive, the job runs only one mapper, but my file actually splits into 18 blocks: my block size is 128MB and the data size is 2GB. I use LZO compression, created file.lzo, and made the index file.lzo.index. I use Hive 0.10.0.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 2304560827) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1377071515613_0003, Tracking URL = http://hydra0001:8088/proxy/application_1377071515613_0003/
Kill Command = /opt/module/hadoop-2.0.0-cdh4.3.0/bin/hadoop job -kill job_1377071515613_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-08-21 16:44:30,237 Stage-1 map = 0%, reduce = 0%
2013-08-21 16:44:40,495 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 6.81 sec
2013-08-21 16:44:44,117 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 9.95 sec
2013-08-21 16:44:46,530 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 13.0 sec
--
In the Hadoop world, I am just a novice exploring the entire Hadoop ecosystem. I hope one day I can contribute my own code.
YanBit yankunhad...@gmail.com
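A back-of-the-envelope check of the split math above: with the input size from the job log and a 128 MB max split size, you would expect 18 mapper splits. This sketch ignores block boundaries and the LZO index, so it's only a rough model, but it matches the 18 blocks reported:

```python
import math

# Rough split count: bytes from the job log, 128 MB max split size.
file_size = 2304560827           # from "Input Size (= 2304560827)"
max_split = 128 * 1024 * 1024    # 134217728 bytes

num_splits = math.ceil(file_size / max_split)
print(num_splits)  # 18
```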
Re: single output file per partition?
Hi
I tried file crusher with LZO but it does not work. I have LZO correctly configured in production and my jobs run daily using LZO compression. I like Crusher, so I will look into why it's not working. Thanks to Edward, the code is there to tweak :-) and test locally.
sanjay

From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 12:07 PM
To: user@hive.apache.org
Subject: Re: single output file per partition?

I see. I'll have to punt then. However, there is an after-the-fact file crusher Ed Capriolo wrote a while back here: https://github.com/edwardcapriolo/filecrush YMMV

On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov i...@decide.com wrote:
Using a single bucket per partition seems to create a single reducer, which is too slow. I've tried enforcing small-files merge but that didn't work; I still got multiple output files. Creating a temp table and then combining the multiple files into one using a simple select * is the only option that seems to work. It's odd that I have to create the temp table, but I don't see a workaround.

On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague sprag...@gmail.com wrote:
hi igor, lots of ideas there! I can't speak for them all, but let me confirm first: cluster by X into 1 bucket didn't work? I would have thought that would have done it.

On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov i...@decide.com wrote:
What's the best way to enforce a single output file per partition?
INSERT OVERWRITE TABLE table PARTITION (x,y,z) SELECT ... FROM ... WHERE ...
I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would force a single reducer per partition, but that didn't work; I still got multiple files per partition.
Do I have to use a single reduce task? With a few TB of data that's probably not a good idea. My current idea is to create a temp table with the same partitioning structure, insert into that table first, and then select * from that table into the output table. With combineinputformat=true that should work, right? Or should I make Hive merge the output files instead (using hive.merge.mapfiles)? Will that work with a partitioned table?
Thanks! igor
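The "after the fact" merge that filecrush performs can be illustrated with a toy Python stand-in: concatenate several small part files into one. This is only a sketch of the idea (local files, no compression, no HDFS), not filecrush itself:

```python
import os
import tempfile

# Create three small "part" files, as a reducer might leave behind.
workdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    path = os.path.join(workdir, f"part-{i:05d}")
    with open(path, "w") as f:
        f.write(f"row-{i}\n")
    parts.append(path)

# Merge them into a single output file, in sorted part order.
merged = os.path.join(workdir, "merged")
with open(merged, "w") as out:
    for path in sorted(parts):
        with open(path) as f:
            out.write(f.read())

print(open(merged).read())
```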
Last time request for cwiki update privileges
Hey guys
I can only think of two reasons why my request has not yet been accepted:
1. The admins don't want to give me access.
2. The admins have not seen my mail yet.
This is the fourth and the LAST time I am requesting permission to edit the wiki docs. Nobody likes being ignored, and that includes me. Meanwhile, to show my thankfulness to the Hive community I shall continue to answer questions; there will be no change in that behavior.
Regards, sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Wednesday, August 14, 2013 3:52 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Once again, I am down on my knees, humbly calling upon the Hive Jedi Masters to please provide this padawan with cwiki update privileges. May the Force be with you.
Thanks, sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Wednesday, July 31, 2013 9:38 AM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Hi guys
Any chance I could get cwiki update privileges today?
Thanks, sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Tuesday, July 30, 2013 4:26 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Review Request (wikidoc): LZO Compression in Hive

Hi
I met with Lefty this afternoon and she was kind enough to spend time adding my documentation to the site, since I still don't have editing privileges :-)
Please review the new wikidoc about LZO compression in the Hive language manual. If anything is unclear or needs more information, you can email suggestions to this list or edit the wiki yourself (if you have editing privileges). Here are the links:
1. Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual (new bullet under File Formats)
2. LZO Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
3. CREATE TABLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable (near the end of the section, pasted in here:)
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are 'INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO).
My cwiki id is https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It would be great if I could get edit privileges.
Thanks, sanjay
Re: Last time request for cwiki update privileges
Thanks Ashutosh

From: Ashutosh Chauhan hashut...@apache.org
Reply-To: user@hive.apache.org
Date: Tuesday, August 20, 2013 3:13 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Last time request for cwiki update privileges

Hi Sanjay,
Really sorry for that; I apologize for the delay. You are added now. Feel free to make changes to make Hive even better!
Thanks, Ashutosh

On Tue, Aug 20, 2013 at 2:39 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Hey guys
I can only think of two reasons why my request has not yet been accepted:
1. The admins don't want to give me access.
2. The admins have not seen my mail yet.
This is the fourth and the LAST time I am requesting permission to edit the wiki docs. Nobody likes being ignored, and that includes me. Meanwhile, to show my thankfulness to the Hive community I shall continue to answer questions; there will be no change in that behavior.
Regards, sanjay
Re: question about hive SQL
Here is my stab at it. I have not tested it, but this should get you started. The following points are important: 1. I added a WHERE clause in the sub-query to limit the data set by any partition you may have. 2. You have to write a collect UDF to use it; see Chapter 13 (Functions) of Wampler/Capriolo's book and refer to the class GenericUDAFCollect.

SELECT page_url, token, collect(concat_ws('|', pcw.original_category, CAST(pcw.weight AS STRING)))
FROM (
    SELECT page_url, token, original_category, weight
    FROM media_visit_info
    WHERE partition_column='partition_col_val'
) pcw
GROUP BY page_url, token
LIMIT 10;

From: ch huang <justlo...@gmail.com> Reply-To: user@hive.apache.org Date: Monday, August 19, 2013 2:04 AM To: user@hive.apache.org Subject: question about hive SQL
hi, all: I am not very familiar with HQL, and my problem is this: I have 2 queries. Q1: select page_url, original_category, token from media_visit_info group by page_url, original_category, token limit 10 Q2: select original_category as code, weight from media_visit_info where page_url='X' group by original_category, weight; The Q1 page_url value should be sent to the Q2 WHERE condition, and the two query results should be combined like { url:http\\:www.baidu.com, category:|CN10, token:20, categorys: [ {code:|CN10-1-1,weight:0.5}, {code:|CN11-2-2,weight:0.1}, {code:|CN10-1-3,weight:0.02} ] } I do not know if it can be written as one query (JOIN + SUBQUERY?). Can anyone help?
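A one-query sketch of what the reply above describes (untested; it assumes a collect() UDAF such as GenericUDAFCollect has been registered under the name `collect`, and the `q1`/`q2` aliases and the CAST of `weight` are my additions, not something from the thread):

```sql
-- Hypothetical sketch: feed Q1's page_url values into Q2 via a self-join,
-- then roll the (code, weight) pairs up into one array per (page_url, token).
SELECT q1.page_url,
       q1.original_category,
       q1.token,
       collect(concat_ws('|', q2.original_category, CAST(q2.weight AS STRING))) AS categorys
FROM (SELECT page_url, original_category, token
        FROM media_visit_info
       GROUP BY page_url, original_category, token) q1
JOIN (SELECT page_url, original_category, weight
        FROM media_visit_info
       GROUP BY page_url, original_category, weight) q2
  ON q1.page_url = q2.page_url
GROUP BY q1.page_url, q1.original_category, q1.token
LIMIT 10;
```

Assembling the nested JSON shown in the question would then be a post-processing step outside Hive, or a custom UDF.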
Hive Authorization (ROLES AND PRIVILEGES) does not work with hiveserver2 ?
0: jdbc:hive2://dev-thdp5:1> CREATE ROLE sas_role;
No rows affected (0.16 seconds)
0: jdbc:hive2://dev-thdp5:1> CREATE EXTERNAL TABLE IF NOT EXISTS keyword_impressions_log (date_ STRING, server STRING, impression_id STRING, search_session_id STRING, channel_id INT, visit_id BIGINT, visitor_id BIGINT, app_style STRING, publisher_id INT, ip STRING, keyword_id BIGINT, keyword STRING, node_constraint BIGINT, mfr_constraint BIGINT, seller_constraint BIGINT, other_constraint STRING, continued INT, offset INT, results INT, sort_ INT, ad_nodes STRING, view_ INT, spelling STRING, referrer STRING, internal_ip INT, cat_feat_mode STRING, rules STRING, rb_filter INT, shuffled INT, related_item_results INT, pixeled INT, sr_attr_imps STRING, unranked_ptitle_ids STRING, perpage INT) PARTITIONED BY (header_date_partition STRING) STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
No rows affected (0.468 seconds)
0: jdbc:hive2://dev-thdp5:1> alter table keyword_impressions_log add if not exists partition (header_date_partition='2013-08-11') location '/user/hive/warehouse/keyword_impressions_log/2013-08-11';
No rows affected (0.438 seconds)
0: jdbc:hive2://dev-thdp5:1> GRANT SELECT ON TABLE keyword_impressions_log TO ROLE sas_role;
No rows affected (0.403 seconds)
0: jdbc:hive2://dev-thdp5:1> GRANT ROLE sas_role TO USER hiveuser1;
No rows affected (0.168 seconds)
0: jdbc:hive2://dev-thdp5:1> show grant role sas_role on table keyword_impressions_log;
No rows affected (0.117 seconds)
The following CLI command shows the role information:
hive -e "show grant role sas_role on table keyword_impressions_log;"
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.3.0.jar!/hive-log4j.properties
Hive history file=/tmp/sasubramanian/hive_job_log_99244cad-b6e4-4c71-9fe1-d4b248b078c3_1882800333.txt
OK
database default
table keyword_impressions_log
principalName sas_role
principalType ROLE
privilege select
grantTime Mon Aug 19 12:24:08 PDT 2013
grantor hive
Time taken: 1.76 seconds
From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> Reply-To: user@hive.apache.org Date: Friday, August 16, 2013 7:24 PM To: user@hive.apache.org Subject: Hive Authorization clarification
Hi guys, I am not getting the expected result from my authorization settings. I am evaluating Hive 0.10.0+121.
mysql> select * from hive.ROLES;
+---------+-------------+------------+-----------+
| ROLE_ID | CREATE_TIME | OWNER_NAME | ROLE_NAME |
+---------+-------------+------------+-----------+
|       6 |  1376704610 | NULL       | sas_role  |
+---------+-------------+------------+-----------+
My settings are as follows:
<!-- added by sanjay -->
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://corp.nextag.com</value>
</property>
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>the privileges automatically granted to the owner whenever a table gets created. An example like "select,drop" will grant select and drop privilege to the owner of the table</description>
</property>
<property>
  <name>hive.security.authorization.createtable.role.grants</name>
  <value>sas_role:select</value>
  <description>The privileges automatically granted to some roles whenever a table gets created. An example like "roleX,roleY:select;roleZ:create" will grant select privilege to roleX and roleY, and grant create privilege to roleZ whenever a new table is created.</description>
</property>
<property>
  <name>hive.security.authorization.createtable.group.grants</name>
  <value>hiveuser1:select</value>
  <description>The privileges automatically granted to some groups whenever a table gets created. An example like "groupX,groupY:select;groupZ:create" will grant select privilege to groupX and groupY, and grant create privilege to groupZ whenever a new table is created.</description>
</property>
<property>
  <name>hive.security.authorization.createtable.user.grants</name>
  <value>hiveuser1:select</value>
  <description>The privileges automatically granted to some users whenever a table gets created. An example like "userX,userY:select;userZ:create" will grant select privilege to userX and userY, and grant create privilege to userZ whenever a new table is created.</description>
</property>
USECASE STEPS
1. Connect as sasubramanian and create table
beeline> !connect jdbc:hive2://dev-thdp5:1 sanjay.subraman...@wizecommerce.com
Re: Hive cli Vs beeline cli
Some notes from my experience:
* With Beeline you have the benefit of being able to use LDAP/Kerberos authentication.
* I am not sure how to use the -e and -f options with Beeline, which are very strong with the Hive CLI.
* Beeline at the present version may not fully integrate with Oozie, so if you are using Oozie for workflows then the hive-action may be an issue.
* You can solve this, though, by having both the hiveserver1 and hiveserver2 services running and blocking the hiveserver1 port from the outside world.
Thanks sanjay
From: pandees waran <pande...@gmail.com> Reply-To: user@hive.apache.org Date: Wednesday, August 14, 2013 10:41 PM To: user <user@hive.apache.org> Subject: Hive cli Vs beeline cli
Hi pros, Based on your experience with the Beeline CLI, could you please share your thoughts on the advantages of using the Beeline CLI over the default Hive CLI? Please share any useful links you find for this info. Thanks Pandeeswaran
Re: FAILED: Error in metadata: MetaException
Hi Ankit, Do you have a directory on HDFS at /user/hive/warehouse? Its permissions should be 1777. sanjay
From: Ankit Bhatnagar <ank...@yahoo-inc.com> Reply-To: user@hive.apache.org Date: Friday, August 16, 2013 12:02 PM To: user@hive.apache.org Subject: FAILED: Error in metadata: MetaException
Hi folks,
FAILED: Error in metadata: MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one)
drwxrwxrwx - hadoop supergroup 0 2013-08-16 18:31 /user/hive/warehouse/src
Is it some permission issue? Ankit
Re: SHOW ALL ROLES
Ok, never mind; this will work just fine for me:
mysql> select ROLE_NAME from ROLES;
From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> Reply-To: user@hive.apache.org Date: Friday, August 16, 2013 6:34 PM To: user@hive.apache.org Subject: SHOW ALL ROLES
Hi, How do I display all ROLES defined in Hive through the CLI? Thanks sanjay
Re: Hive and Lzo Compression
That is really interesting…let me try and think of a reason…meanwhile any other LZO Hive Samurais out there ? Please help with some guidance sanjay From: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org, w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Date: Wednesday, August 14, 2013 1:15 AM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Subject: Re: Hive and Lzo Compression Thanks for your reply. The interesting thing I experience is that the SELECT query still works - even when I do not specify the STORED AS clause... that puzzles me a bit. Von: Sanjay Subramanian sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com An: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org; w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Gesendet: 3:44 Mittwoch, 14.August 2013 Betreff: Re: Hive and Lzo Compression Hi I think the CREATE TABLE without the STORED AS clause will not give any errors while creating the table. However when you query that table and since that table contains .lzo files , you would get errors. With external tables , u r separating the table creation(definition) from the data. So only at the time of query of that table, hive might report errors. LZO compression rocks ! I am so glad I used it in our projects here. Regards sanjay From: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org, w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Date: Tuesday, August 13, 2013 12:13 AM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Subject: Re: Hive and Lzo Compression Thanks for your replies and the link. 
I could get it working, but wondered why the CREATE TABLE statement worked without the STORED AS Clause as well...that's what puzzles me a bit... But I will use the STORED AS Clause to be on the safe side. Von: Lefty Leverenz leftylever...@gmail.commailto:leftylever...@gmail.com An: user@hive.apache.orgmailto:user@hive.apache.org CC: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Gesendet: 19:06 Samstag, 10.August 2013 Betreff: Re: Hive and Lzo Compression I'm not seeing any documentation link in Sanjay's message, so here it is again (in the Hive wiki's language manual): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO. On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com wrote: Please refer this documentation here Let me know if u need more clarifications so that we can make this document better and complete Thanks sanjay From: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org, w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Date: Thursday, August 8, 2013 2:02 AM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Subject: Hive and Lzo Compression Hello, I am started to run Hive with Lzo compression on Hortonworks 1.2 I have managed to install/configure Lzo and hive -e set io.compression.codecs shows me the Lzo Codecs: io.compression.codecs= org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.BZip2Codec However, I have some questions where I would be happy if you could help me. 
(1) CREATE TABLE statement: I read in different postings that in the CREATE TABLE statement I have to use the following STORED AS clause: CREATE EXTERNAL TABLE txt_table_lzo ( txt_line STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '' STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/myuser/data/in/lzo_compressed'; It now works without any problems to execute SELECT statements on this table with LZO data. However, I also created a table on the same data without this STORED AS clause: CREATE EXTERNAL TABLE txt_table_lzo_tst ( txt_line STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '' LOCATION '/user/myuser/data/in/lzo_compressed'; The interesting thing is, it works as well when I execute a SELECT statement on this table. Can you help me understand why the second CREATE TABLE statement works as well? What should I use in DDLs? Is it best practice to use the STORED AS clause with DeprecatedLzoTextInputFormat? Or should I remove it? (2) Output and Intermediate Compression Settings: I want to use output compression. In Programming
Re: Strange error in Hive - Insert INTO
Another reason I can think of: possibly some STRING column in your table contains the DELIMITER character. Once in production I had tab characters inside a string field, and my table was also defined using TAB as the delimiter.
From: Stephen Sprague <sprag...@gmail.com> Reply-To: user@hive.apache.org Date: Wednesday, August 14, 2013 8:43 AM To: user@hive.apache.org Subject: Re: Strange error in Hive - Insert INTO
Hi Jerome, That's a grandiose sql statement you got there! :) I find that if you break up those nested queries into simple CTAS (Create Table AS) statements and create a cascading effect of referring to the table in the previous step, it makes debugging *so* much easier. In other SQL dialects like DB2 this is facilitated by the WITH keyword. Maybe the Hive gurus will implement that some day. But that's a topic for another day. So, all that said, I see that the columns in your create table statement don't match the columns in your outermost select statement. In particular, DT_JOUR is listed as the 6th column in your create table statement but it appears to be the 2nd column in your select statement. So something looks fishy there. My guess is that ultimately you're missing a comma somewhere in the select list, so Hive is eating a column as a column alias and all your data is skewed over by one column. This happens not so infrequently since it is valid SQL. Long-winded answer to a simple question. Apologies up front! On Wed, Aug 14, 2013 at 5:35 AM, Jérôme Verdier <verdier.jerom...@gmail.com> wrote: Hi everybody, I faced a strange error in Hive today. I launched a Hive script to do some calculations, joins, unions, etc., and then insert the results into another Hive table.
Everything is working fine (the .hql runs OK, data is imported), but one field (CO_RGRP_PRODUITS) is very strange: after the insert, CO_RGRP_PRODUITS looks like a TIMESTAMP (1970-01-01 01:00:00) instead of being a simple STRING. I should point out that the source field is a simple string, like 0101380, for example. What is going wrong here? You can find my script below (create table and .hql insert/calculations). Thanks for your help.
CREATE SCRIPT :
--THM_CA_RGRP_PRODUITS_JOUR
CREATE TABLE default.THM_CA_RGRP_PRODUITS_JOUR (
  CO_SOCIETE BIGINT,
  TYPE_ENTITE STRING,
  CODE_ENTITE STRING,
  TYPE_RGRP_PRODUITS STRING,
  CO_RGRP_PRODUITS STRING,
  DT_JOUR TIMESTAMP,
  MT_CA_NET_TTC FLOAT,
  MT_OBJ_CA_NET_TTC FLOAT,
  NB_CLIENTS FLOAT,
  MT_CA_NET_TTC_COMP FLOAT,
  MT_OBJ_CA_NET_TTC_COMP FLOAT,
  NB_CLIENTS_COMP FLOAT);
INSERT SCRIPT :
INSERT INTO TABLE THM_CA_RGRP_PRODUITS_JOUR
SELECT 1 as CO_SOCIETE, -- A modifier = variable
  '2013-01-02 00:00:00.0' as DT_JOUR, -- A modifier = variable
  'MAG' as TYPE_ENTITE,
  m.co_magasin as CODE_ENTITE,
  'FAM' as TYPE_RGRP_PRODUITS,
  sourceunion.CO_RGRP_PRODUITS as CO_RGRP_PRODUITS,
  SUM(MT_CA_NET_TTC) as MT_CA_NET_TTC,
  SUM(MT_OBJ_CA_NET_TTC) as MT_OBJ_CA_NET_TTC,
  SUM(NB_CLIENTS) as NB_CLIENTS,
  SUM(MT_CA_NET_TTC_COMP) as MT_CA_NET_TTC_COMP,
  SUM(MT_OBJ_CA_NET_TTC_COMP) as MT_OBJ_CA_NET_TTC_COMP,
  SUM(NB_CLIENTS_COMP) as NB_CLIENTS_COMP
FROM (
  SELECT mtransf.id_mag_transfere as ID_MAGASIN,
    v.co_famille as CO_RGRP_PRODUITS,
    sum(v.mt_ca_net_ttc) as MT_CA_NET_TTC,
    0 as MT_OBJ_CA_NET_TTC,
    0 as NB_CLIENTS,
    sum(v.mt_ca_net_ttc * (CASE WHEN mtransf.flag_mag_comp = 'NC' THEN 0 ELSE 1 END)) as MT_CA_NET_TTC_COMP,
    0 as MT_OBJ_CA_NET_TTC_COMP,
    0 as NB_CLIENTS_COMP
  FROM default.VENTES_FAM v
  JOIN default.kpi_magasin mtransf
    ON mtransf.co_societe = CASE WHEN v.co_societe = 1 THEN 1 ELSE 2 END
    AND mtransf.id_magasin = v.id_magasin
  WHERE mtransf.co_societe = 1 -- Modifier variable
    AND v.dt_jour = '2013-01-02
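Stephen's suggested CTAS decomposition might look like the sketch below (untested; `tmp_ventes_fam_agg` is a hypothetical name and the SELECT list is abbreviated, so this is an illustration of the debugging technique rather than a drop-in rewrite of the full script):

```sql
-- Step 1: materialize the innermost aggregation as its own table,
-- so each column can be inspected before it feeds the next step.
CREATE TABLE tmp_ventes_fam_agg AS
SELECT mtransf.id_mag_transfere AS id_magasin,
       v.co_famille             AS co_rgrp_produits,
       SUM(v.mt_ca_net_ttc)     AS mt_ca_net_ttc
FROM default.VENTES_FAM v
JOIN default.kpi_magasin mtransf
  ON mtransf.id_magasin = v.id_magasin
GROUP BY mtransf.id_mag_transfere, v.co_famille;

-- Step 2: eyeball the intermediate result; a column skewed over by one
-- position (the suspected missing comma) shows up immediately here.
SELECT * FROM tmp_ventes_fam_agg LIMIT 10;
```

Only after each cascading step looks right would the final SELECT be pointed at the last temporary table and INSERTed into THM_CA_RGRP_PRODUITS_JOUR.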
Re: Hive and Lzo Compression
I am not sure if in this case the data is loaded OR a partition is added with a location specified (pointing to some location in HDFS). Yes, you are stating the question correctly. sanjay
From: Nitin Pawar <nitinpawar...@gmail.com> Reply-To: user@hive.apache.org Date: Wednesday, August 14, 2013 10:54 AM To: user@hive.apache.org Subject: Re: Hive and Lzo Compression
Please correct me if I understood the question correctly: you created a table definition without mentioning a STORED AS clause, then you load data into the table from a compressed file, then run a select query and it still works. But how did it figure out which compression codec to use? Am I stating it correctly?
On Wed, Aug 14, 2013 at 11:11 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote: That is really interesting…let me try and think of a reason…meanwhile any other LZO Hive Samurais out there ? Please help with some guidance sanjay
Re: Review Request (wikidoc): LZO Compression in Hive
Once again, I am down on my knees humbling calling upon the Hive Jedi Masters to please provide this paadwaan with cwiki update privileges May the Force be with u Thanks sanjay From: Sanjay Subramanian sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Date: Wednesday, July 31, 2013 9:38 AM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Cc: d...@hive.apache.orgmailto:d...@hive.apache.org d...@hive.apache.orgmailto:d...@hive.apache.org Subject: Re: Review Request (wikidoc): LZO Compression in Hive Hi guys Any chance I could get cwiki update privileges today ? Thanks sanjay From: Sanjay Subramanian sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com Date: Tuesday, July 30, 2013 4:26 PM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Cc: d...@hive.apache.orgmailto:d...@hive.apache.org d...@hive.apache.orgmailto:d...@hive.apache.org Subject: Review Request (wikidoc): LZO Compression in Hive Hi Met with Lefty this afternoon and she was kind to spend time to add my documentation to the site - since I still don't have editing privileges :-) Please review the new wikidoc about LZO compression in the Hive language manual. If anything is unclear or needs more information, you can email suggestions to this list or edit the wiki yourself (if you have editing privileges). Here are the links: 1. Language Manualhttps://cwiki.apache.org/confluence/display/Hive/LanguageManual (new bullet under File Formats) 2. LZO Compressionhttps://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO 3. CREATE TABLEhttps://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable (near end of section, pasted in here:) Use STORED AS TEXTFILE if the data needs to be stored as plain text files. 
Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about CompressedStoragehttps://cwiki.apache.org/confluence/display/Hive/CompressedStorage if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are 'INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compressionhttps://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO). My cwiki id is https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com It will be great if I could get edit privileges Thanks sanjay CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator. CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. 
If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
Re: Hive and Lzo Compression
Hi I think the CREATE TABLE without the STORED AS clause will not give any errors while creating the table. However when you query that table and since that table contains .lzo files , you would get errors. With external tables , u r separating the table creation(definition) from the data. So only at the time of query of that table, hive might report errors. LZO compression rocks ! I am so glad I used it in our projects here. Regards sanjay From: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org, w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Date: Tuesday, August 13, 2013 12:13 AM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Subject: Re: Hive and Lzo Compression Thanks for your replies and the link. I could get it working, but wondered why the CREATE TABLE statement worked without the STORED AS Clause as well...that's what puzzles me a bit... But I will use the STORED AS Clause to be on the safe side. Von: Lefty Leverenz leftylever...@gmail.commailto:leftylever...@gmail.com An: user@hive.apache.orgmailto:user@hive.apache.org CC: w00t w00t w00...@yahoo.demailto:w00...@yahoo.de Gesendet: 19:06 Samstag, 10.August 2013 Betreff: Re: Hive and Lzo Compression I'm not seeing any documentation link in Sanjay's message, so here it is again (in the Hive wiki's language manual): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO. 
On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:

Please refer to the documentation here. Let me know if you need more clarifications so that we can make this document better and complete.

Thanks
sanjay

From: w00t w00t <w00...@yahoo.de>
Reply-To: user@hive.apache.org
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression

Hello,

I have started running Hive with LZO compression on Hortonworks 1.2. I have managed to install/configure LZO, and hive -e "set io.compression.codecs" shows me the LZO codecs:

io.compression.codecs=
  org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  com.hadoop.compression.lzo.LzoCodec,
  com.hadoop.compression.lzo.LzopCodec,
  org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings that in the CREATE TABLE statement I have to use the following STORED AS clause:

CREATE EXTERNAL TABLE txt_table_lzo (
  txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';

It now works without any problems to execute SELECT statements on this table with LZO data. However, I also created a table on the same data without this STORED AS clause:

CREATE EXTERNAL TABLE txt_table_lzo_tst (
  txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
LOCATION '/user/myuser/data/in/lzo_compressed';

The interesting thing is that it works as well when I execute a SELECT statement on this table. Can you help me understand why the second CREATE TABLE statement also works? What should I use in DDLs? Is it best practice to use the STORED AS clause with DeprecatedLzoTextInputFormat, or should I remove it?

(2) Output and intermediate compression settings

I want to use output compression. In "Programming Hive" by Capriolo, Wampler, and Rutherglen the following commands are recommended:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

However, in some other places in forums I found the following recommended settings:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

Am I right that the first settings are for Hadoop versions prior to 0.23? Or is there any other reason why the settings are different? I am using Hadoop 1.1.2 with Hive 0.10.0. Which settings would you recommend?

I also want to compress intermediate results. Again, in "Programming Hive" the following settings are recommended:

SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Is this the right setting? Or should I again use the settings (which look more valid for Hadoop 0.23 and greater)?:

SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Thanks
Re: LZO output compression
Check this class where these are defined:
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/src/mapred/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java

From: w00t w00t <w00...@yahoo.de>
Reply-To: user@hive.apache.org
Date: Tuesday, August 13, 2013 2:39 AM
To: user@hive.apache.org
Subject: Re: LZO output compression

Oh, I could get it working using these settings:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

But I have one question, where maybe one of you can help me with an explanation: as I am running Hadoop 1.1.*, why do I need the old command for Hadoop 0.20?

SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

I supposed the commands for the newer Hadoop versions are:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

From: w00t w00t <w00...@yahoo.de>
To: user@hive.apache.org
Sent: 11:26 Tuesday, 13 August 2013
Subject: LZO output compression

Hello,

I am running Hortonworks 1.2 using Hadoop 1.1.2.21 and Hive 0.10.0.21. I set up LZO compression and can read LZO compressed data without problems. My next try was to test output compression.
Therefore, I created the following small script:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

DROP TABLE IF EXISTS simple_lzo;
CREATE TABLE simple_lzo
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
AS SELECT count(*) FROM txt_table_lzo;

The output gets compressed, but with the default codec (deflate), not with LZO. Do you know what the problem could be here and how I could debug it? There are no error messages or anything. Additionally, I also tried the commands for Hadoop 0.20:

mapred.output.compress=true;
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

That didn't work either. In Pig or Java MR, I have no problems generating LZO compressed output.

Thanks
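One way to debug which codec actually wrote the job output is to inspect the files' leading bytes rather than trust the file extension. A hedged Python sketch: the lzop and gzip values below are the published file-format magic bytes, while DefaultCodec's raw deflate output carries no magic, so it is guessed here from the customary .deflate suffix.

```python
# Sketch: identify the compression codec of an output file by its leading bytes.
# LZOP_MAGIC is the lzop file-format magic (what LzopCodec writes);
# GZIP_MAGIC is the gzip magic; raw deflate has no magic at all.

LZOP_MAGIC = b"\x89LZO\x00\r\n\x1a\n"
GZIP_MAGIC = b"\x1f\x8b"

def sniff_codec(header: bytes, filename: str = "") -> str:
    if header.startswith(LZOP_MAGIC):
        return "lzo"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if filename.endswith(".deflate"):
        return "deflate"   # DefaultCodec output: no magic, guessed by suffix
    return "unknown"

# e.g.: with open(path, "rb") as f: print(sniff_codec(f.read(16), path))
print(sniff_codec(b"\x89LZO\x00\r\n\x1a\n" + b"payload"))  # prints: lzo
```

Running this over the table's files in the warehouse directory would confirm w00t's observation that deflate, not lzop, produced them.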
Hiveserver2 Beeline command clarification
Hi guys

I just hooked up hiveserver2 to LDAP. In beeline I realized you can log in like the following (you don't need to define org.apache.hive.jdbc.HiveDriver):

beeline
!connect jdbc:hive2://dev-thdp5:1 sanjay.subraman...@wizecommerce.com
scan complete in 2ms
Connecting to jdbc:hive2://dev-thdp5:1
Enter password for jdbc:hive2://dev-thdp5:1:
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://dev-thdp5:1> show tables;
+--------------------------+
|         tab_name         |
+--------------------------+
| keyword_impressions_log  |
+--------------------------+
1 row selected (1.574 seconds)

If this is also a correct way to use beeline, then I actually prefer it, since the password is not visible.

sanjay
hive -h hostname option bypasses ROLES and access permissions ?
Hi

Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>

Linux user = hiveuser1 (no hive permissions)

CASE 1:
hive -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Authorization failed: No privilege 'Select' found for inputs { database:default, table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use 'show grant' to get more details.

CASE 2 (use the -h option):
hive -h localhost -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Shows results!

Why does the -h option bypass authorization?

Thanks
sanjay
Does hiveserver2 support -e and -f options eom
Re: Hive and Lzo Compression
Thanks Lefty.

Sent from my iPhone

On Aug 10, 2013, at 10:08 AM, Lefty Leverenz <leftylever...@gmail.com> wrote:

I'm not seeing any documentation link in Sanjay's message, so here it is again (in the Hive wiki's language manual): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO.

-- Lefty
Re: Hive and Lzo Compression
Please refer to the documentation here. Let me know if you need more clarifications so that we can make this document better and complete.

Thanks
sanjay

From: w00t w00t <w00...@yahoo.de>
Reply-To: user@hive.apache.org
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression
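The recurring question in this thread, mapred.* versus mapreduce.* property names, comes down to the Hadoop 0.20-era names versus the post-0.23 renames. A small Python sketch of the rename table, using only the pairs quoted in these emails (verify them against your Hadoop version's deprecated-properties list; the translate helper is illustrative, not a Hadoop API):

```python
# Rename pairs as they appear in the thread: old 0.20-era key -> post-0.23 key.
# hive.exec.compress.output is a Hive-level switch and is unchanged.
DEPRECATED_TO_NEW = {
    "mapred.output.compression.codec":
        "mapreduce.output.fileoutputformat.compress.codec",
    "mapred.map.output.compression.codec":
        "mapreduce.map.output.compression.codec",
}

def translate(props):
    # Map each deprecated key to its new name; already-current keys pass through.
    return {DEPRECATED_TO_NEW.get(k, k): v for k, v in props.items()}

old_style = {
    "hive.exec.compress.output": "true",
    "mapred.output.compression.codec": "com.hadoop.compression.lzo.LzopCodec",
}
print(translate(old_style))
```

On Hadoop 1.x (as in the thread) the old names still work, which is why the "Programming Hive" settings succeed there; the mapreduce.* names are the forward-compatible spellings.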
Hive query hive_server_host A and write results to hadoop cluster B
Hi guys

Perhaps you know this already, but it is very useful. This directly creates a file on the name_node_host_2 HDFS cluster from the output of the query:

hive -h hive_server_host1 -e "hive_query_string" | hdfs dfs -put - hdfs://name_node_host_2:port/path/to/ur/dir/your_file_name

Regards
Sanjay
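The one-liner above works because the shell pipe streams the query's stdout straight into hdfs dfs -put - with no temporary file on the client. The same pattern, sketched in Python with echo and cat as stand-ins for the hive and hdfs commands (which are not assumed to be installed here):

```python
import subprocess

# Pipe pattern sketch: the producer's stdout feeds the consumer's stdin,
# so results stream to the destination without touching local disk.
# "echo" stands in for `hive -h host -e "query"`, and
# "cat" stands in for `hdfs dfs -put - hdfs://host:port/path/file`.

producer = subprocess.Popen(["echo", "row1\trow2"], stdout=subprocess.PIPE)
consumer = subprocess.run(
    ["cat"],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
    check=True,
)
producer.stdout.close()
producer.wait()
print(consumer.stdout.decode())
```

The key design point is the `-` argument to `hdfs dfs -put`, which tells it to read from stdin instead of a local path.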
Re: Hive Query Issue
Hi

Some quick checks. Please don't mind if my questions sound trivial.

1. Is your HDFS cluster or pseudo-distributed node up and running? Can you see HDFS at http://host:50070?
2. Is your MRv1 or YARN (MRv2) up and running?
   2a) MRv1: http://host:50030
   2b) YARN: http://host:8088
3. Is your hive-default.xml pointing to the job tracker?
4. Can you run an MR job on your cluster independent of Hive? You can try the WordCount that comes standard with the examples.

Thanks
sanjay

From: Manickam P <manicka...@outlook.com>
Reply-To: user@hive.apache.org
Date: Tuesday, August 6, 2013 4:10 AM
To: user@hive.apache.org
Subject: RE: Hive Query Issue

Hi,

Thanks for your reply. I have a small test table which has around 10 records. When I run "select * from table" it launched a job and that completed successfully. When I run the same query on my actual table, no job got initiated; it got stuck. I did not get any error in my task tracker either. All my data nodes are up and running. I don't have any clue here.

Thanks,
Manickam P

Date: Tue, 6 Aug 2013 16:24:38 +0530
Subject: Re: Hive Query Issue
From: nitinpawar...@gmail.com
To: user@hive.apache.org

When you run "select * from table" it does not launch a MapReduce job, whereas when you put some condition on it, it does need to process the data, so it launches a MapReduce job. Now when you start this query, go to your jobtracker page and see how many jobs are running. Is it able to start your job? If not, why? That will tell you why it's stuck.

On Tue, Aug 6, 2013 at 3:22 PM, Manickam P <manicka...@outlook.com> wrote:

Hi,

If I run a hive query like "select * from table_name limit 10" it works. If I run a query like "select column_name from table_name limit 10" it hangs:

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator

It just prints the above; after that nothing happens in the hive shell. Can you please help me with this?

Thanks,
Manickam P

--
Nitin Pawar
Re: Hive UDAF extending UDAF class: iterate or evaluate method
Please follow the guidance for UDFs provided in the Programming Hive book by Wampler/Capriolo; that will work for you. I can say with confidence that their book was mighty helpful to me in my project from start to production. And I would recommend picking one way, implementing it, and then fine-tuning; otherwise you will be in analysis-paralysis mode. We are all on a path of discovery here...

Regards
sanjay

From: Ritesh Agrawal <ragra...@netflix.com>
Reply-To: user@hive.apache.org
Date: Wednesday, August 7, 2013 5:57 PM
To: user@hive.apache.org
Subject: Re: Hive UDAF extending UDAF class: iterate or evaluate method

Hi Sanjay, Lefty,

Thanks for the help, but none of the above responses directly answers my question (probably I am not asking clearly enough :-( ). Below are two different structures of a UDAF (aggregation function). My question is which one is the preferred/right approach:

http://pastebin.com/QCgd4Hxc : This version is based on what I could understand from the API docs about the UDAF class.
http://pastebin.com/Uctamtek : This version is based on the book "Hadoop: The Definitive Guide". Notice the function names are different from the first one.

I hope this clarifies my question.

Thanks
Ritesh

On Wed, Aug 7, 2013 at 5:34 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:

Sounds like the wikidoc needs some work. I'm open to suggestions. If Sanjay's simple UDF helps, I could put it in the wiki along with any advice you think would help. Does anyone else have use cases to contribute?

-- Lefty

On Mon, Aug 5, 2013 at 2:45 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:

Hi Ritesh

To help you get started, I am writing a simple HelloWorld-ish UDF that might help. If it doesn't, please ask for more clarifications.

Good luck

Thanks
sanjay

ToUpperCase.java:

package com.sanjaysubramanian.utils.hive.udf;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class ToUpperCase extends UDF {
    protected final Log logger = LogFactory.getLog(ToUpperCase.class);

    public String evaluate(final String inputString) {
        if (inputString != null) {
            return inputString.toUpperCase();
        } else {
            return inputString;
        }
    }
}

Usage in a Hive script:

hive -e "create temporary function toupper as 'com.sanjaysubramanian.utils.hive.udf.ToUpperCase'; SELECT first_name, toupper(first_name) FROM company_names;"

From: Ritesh Agrawal <ragra...@netflix.com>
Reply-To: user@hive.apache.org
Date: Monday, August 5, 2013 9:41 AM
To: user@hive.apache.org
Subject: Re: Hive UDAF extending UDAF class: iterate or evaluate method

Hi Lefty,

I used the wiki you sent to write my first version of a UDAF. However, I found it utterly complex, especially for storing partial results, as I am not very familiar with the Hive API. Then I found another example of a UDAF in the "Hadoop: The Definitive Guide" book; it had much simpler code but used different methods. Instead of iterate it was using an evaluate method, and so I am getting confused.

Ritesh

On Sun, Aug 4, 2013 at 2:18 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:

You might find this wikidoc useful: GenericUDAFCaseStudy (https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy). The O'Reilly book Programming Hive also has a section called "User-Defined Aggregate Functions" in chapter 13 (Functions), pages 172 to 176.

-- Lefty

On Sun, Aug 4, 2013 at 7:12 AM, Ritesh Agrawal <ragra...@netflix.com> wrote:

Hi all,

I am trying to write a UDAF function. I found an example that shows how to implement a UDAF in the "Hadoop: The Definitive Guide" book. However, I am a little confused. In the book, the author extends the UDAF class and implements the init, iterate, terminatePartial, merge and terminate functions. However, looking at the Hive docs (http://hive.apache.org/docs/r0.11.0/api/org/apache/hadoop/hive/ql/exec/UDAF.html), it seems I need to implement the init, aggregate, evaluatePartial, aggregatePartial and evaluate functions. Please let me know which are the right functions to implement.

Ritesh
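For Ritesh's question, the init / iterate / terminatePartial / merge / terminate names from the book describe a partial-aggregation lifecycle: each mapper iterates over its rows, ships a partial result, and the partials are merged before the final value is produced. A conceptual Python simulation of that lifecycle for a mean() aggregate (illustrative only, not the Hive UDAF API; mean is the classic case because the partial state must carry a (sum, count) pair rather than a partial mean):

```python
# Simulation of the book-style UDAF lifecycle for mean():
#   init -> iterate (per row) -> terminate_partial (per mapper)
#   -> merge (combine partials) -> terminate (final value).

class MeanUDAF:
    def init(self):
        self.total, self.count = 0.0, 0

    def iterate(self, value):
        # Called once per input row on a mapper.
        self.total += value
        self.count += 1

    def terminate_partial(self):
        # Partial state shipped across the shuffle: (sum, count), NOT the mean.
        return (self.total, self.count)

    def merge(self, partial):
        # Combine another instance's partial state into this one.
        self.total += partial[0]
        self.count += partial[1]

    def terminate(self):
        return self.total / self.count if self.count else None

# Two "mappers" aggregate disjoint slices; a "reducer" merges their partials.
m1, m2, reducer = MeanUDAF(), MeanUDAF(), MeanUDAF()
for a in (m1, m2, reducer):
    a.init()
for v in (1, 2, 3):
    m1.iterate(v)
for v in (4, 5):
    m2.iterate(v)
reducer.merge(m1.terminate_partial())
reducer.merge(m2.terminate_partial())
print(reducer.terminate())   # 3.0, the mean of 1..5
```

Whatever the method names in a given Hive version's API, it is this shape, row-wise accumulation plus a mergeable partial, that a UDAF has to implement.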
Re: Hive Thrift Service - Not Running Continously
At least that would ensure asynchronous running, and it would not die if your session dies. Another way I would propose is to have a screen session dedicated to the hive server and start it in synchronous mode.

Create a screen session (this puts you into the screen session):
screen -S my_awesome_hive_server

Start the hive server:
hive --service hiveserver

Press Ctrl-A D to detach from the screen session. To go back to your screen session:
screen -x my_awesome_hive_server

From: Bhaskar, Snehalata <snehalata_bhas...@syntelinc.com>
Reply-To: user@hive.apache.org
Date: Monday, August 5, 2013 9:34 PM
To: Raj Hadoop <hadoop...@yahoo.com>
Cc: user@hive.apache.org
Subject: RE: Hive Thrift Service - Not Running Continously

Can you please try executing the "nohup hive --service hiveserver" command?

Regards,
Snehalata

From: Raj Hadoop [mailto:hadoop...@yahoo.com]
Sent: Monday, August 05, 2013 9:59 PM
To: Hive
Subject: Hive Thrift Service - Not Running Continously

Hi,

The hive thrift service is not running continuously. I had to execute the command (hive --service hiveserver) very frequently. Can anyone help me with this?

Thanks,
Raj
Re: Hive Thrift Service - Not Running Continously
Can you see in the logs why the service is dying? There will be some clues there. Sometimes the hive log directory can get full, and that might kill it; I have seen that happen in our install here.

From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: user@hive.apache.org
Date: Monday, August 5, 2013 9:29 AM
To: Hive <user@hive.apache.org>
Subject: Hive Thrift Service - Not Running Continously

Hi,

The hive thrift service is not running continuously. I had to execute the command (hive --service hiveserver) very frequently. Can anyone help me with this?

Thanks,
Raj
Is there a way to disable -h hostname option ?
Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar) Is there a way to disable -h hostname option ? This way , I can disable anyone on the network drop/alter tables. Like I said below , even if I create a ROLE with only SELECT permission , a user on the network can DROP tables with a -h hostname option Thanks Sanjay From: Sanjay Subramanian sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com Reply-To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Date: Thursday, August 1, 2013 6:37 PM To: user@hive.apache.orgmailto:user@hive.apache.org user@hive.apache.orgmailto:user@hive.apache.org Subject: Hive Authorization is bypassed with -h option Hi Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar) property namehive.security.authorization.enabled/name valuetrue/value descriptionenable or disable the hive client authorization/description /property Linux User = hiveuser1 (no hive permissions) CASE 1 hive -e select * from outpdir_ptitle_explanation_parsed limit 10 Authorization failed:No privilege 'Select' found for inputs { database:default, table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use show grant to get more details. CASE 2 (use the -h option) hive -h localhost -e select * from outpdir_ptitle_explanation_parsed limit 10 Shows results !!! Why does -h option bypass authorization Thanks sanjay CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. 
Nice hive notes and cheat sheets from Minwoo Kim
http://julingks.tistory.com/category/Hive Thanks sanjay
Hive Authorization is bypassed with -h option
Hi Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>
Linux User = hiveuser1 (no hive permissions)
CASE 1: hive -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Authorization failed: No privilege 'Select' found for inputs { database:default, table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use show grant to get more details.
CASE 2 (use the -h option): hive -h localhost -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Shows results !!! Why does the -h option bypass authorization? Thanks sanjay
Re: Review Request (wikidoc): LZO Compression in Hive
Hi guys Any chance I could get cwiki update privileges today? Thanks sanjay From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com Date: Tuesday, July 30, 2013 4:26 PM To: user@hive.apache.org Cc: d...@hive.apache.org Subject: Review Request (wikidoc): LZO Compression in Hive Hi Met with Lefty this afternoon and she was kind enough to spend time adding my documentation to the site - since I still don't have editing privileges :-) Please review the new wikidoc about LZO compression in the Hive language manual. If anything is unclear or needs more information, you can email suggestions to this list or edit the wiki yourself (if you have editing privileges). Here are the links: 1. Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) - new bullet under File Formats 2. LZO Compression (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO) 3. CREATE TABLE (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) - near end of section, pasted in here: Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
For LZO compression, the values to use are 'INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO). My cwiki id is https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com It will be great if I could get edit privileges Thanks sanjay
Review Request (wikidoc): LZO Compression in Hive
Hi Met with Lefty this afternoon and she was kind enough to spend time adding my documentation to the site - since I still don't have editing privileges :-) Please review the new wikidoc about LZO compression in the Hive language manual. If anything is unclear or needs more information, you can email suggestions to this list or edit the wiki yourself (if you have editing privileges). Here are the links: 1. Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) - new bullet under File Formats 2. LZO Compression (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO) 3. CREATE TABLE (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) - near end of section, pasted in here: Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) if you are planning to keep data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are 'INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO). My cwiki id is https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com It will be great if I could get edit privileges Thanks sanjay
Possible release date for Hive 0.12.0 ?
Hi guys When is stable Hive 0.12.0 expected? I have a use case that needs this fixed, and it looks like it is fixed in 0.12.0: https://issues.apache.org/jira/browse/HIVE-3603 Sanjay
Re: Need help in joining 2 tables
Hi Rams Please don't think I am sermonizing or preaching, and please don't mind what I am saying :-) This community is there to help you and there is no doubt about that. However, I am assuming you tried out a few options by yourself before you reached out to the community with your question. Since the whole community is open source and everyone is busy in projects, I think it will be a better utilization of everyone's time if you spend a few minutes first playing around with a few options, juxtapose syntax if needed, see what errors you get (if any), and then ask your question. Many times an error for a syntax might give new ideas to community developers. If this makes sense, I am happy….. if not, I don't mind :-) Good luck with Hive… it's a great technology and there are many people working very hard to make it better every day. Regards sanjay From: Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com Reply-To: user@hive.apache.org Date: Friday, July 26, 2013 1:17 AM To: user@hive.apache.org Subject: Need help in joining 2 tables Hi, Need to create a view in HIVE as per the below query (written for Oracle). Need help to provide the equivalent query in HIVE to create the view.
EMP_ADDRESS: Emp_Id, Address1, Address2, Address3, Address4, City, State, Pin, Country, Alternate_Address1, Alternate_Address2, Alternate_Address3, Alternate_Address4, Alternate_City, Alternate_State, Alternate_Pin, Alternate_Country, row_create_date
EMP_MASTER: Emp_Id, Name, DOB, Department, Sex, FathersName, row_create_date
View Query:
select MAST.Emp_Id, MAST.Name, MAST.DOB, ADDR.Address1, ADDR.Address2, ADDR.Address3, ADDR.Address4, ADDR.City, ADDR.State, ADDR.Pin, ADDR.Country
from EMP_MASTER MAST, EMP_ADDRESS ADDR
where MAST.row_create_date = (select max(row_create_date) from EMP_MASTER where Emp_Id = MAST.Emp_Id)
and ADDR.row_create_date = (select max(row_create_date) from EMP_ADDRESS where Emp_Id = ADDR.Emp_Id)
regards, Rams
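Hive of that era does not support the correlated subqueries the Oracle view uses, so the usual rewrite is to join each table against its own per-Emp_Id max(row_create_date) aggregate. As a sanity check of that "latest record per key" logic, here is a small plain-Python simulation; the column names follow the schemas above, but the sample rows and values are invented for illustration:

```python
# Simulate "pick the row with the latest row_create_date per Emp_Id" for both
# tables, then join the survivors on Emp_Id -- the join-based equivalent of
# the correlated-subquery view. Data rows below are made up.
emp_master = [
    {"Emp_Id": 1, "Name": "A",  "row_create_date": "2013-01-01"},
    {"Emp_Id": 1, "Name": "A2", "row_create_date": "2013-06-01"},
]
emp_address = [
    {"Emp_Id": 1, "City": "X", "row_create_date": "2013-02-01"},
    {"Emp_Id": 1, "City": "Y", "row_create_date": "2013-05-01"},
]

def latest_per_id(rows):
    """Keep only the row with the max row_create_date for each Emp_Id."""
    best = {}
    for r in rows:
        k = r["Emp_Id"]
        if k not in best or r["row_create_date"] > best[k]["row_create_date"]:
            best[k] = r
    return best

mast = latest_per_id(emp_master)
addr = latest_per_id(emp_address)
# Inner join on Emp_Id, projecting the address columns we care about.
view = [{**mast[k], "City": addr[k]["City"]} for k in mast if k in addr]
```

In HiveQL the same shape would be two subqueries (one per table) that GROUP BY Emp_Id to find max(row_create_date), each joined back to its base table, then joined to each other on Emp_Id.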
Re: Merging different HDFS file for HIVE
We have a similar situation in production… for your case I would propose the following steps:
1. Design a map reduce job (job output format - Text, Lzo, Snappy, your choice)
Inputs to Mapper: records from these three feeds
Outputs from Mapper:
-- Key = EMP1 Value = feed1~field1 field2 field6 field9
-- Key = EMP1 Value = feed2~field5 field7 field10
-- Key = EMP1 Value = feed3~field3 field4 field8
Reducer Output:
-- Key = EMP1 Value = field1 field2 field3 field4 field5 field6 field7 field8 field9 field10
2. (Optional) If you use LZO then you will need to run LzoIndexer
3. CREATE TABLE IF NOT EXISTS YOUR_HIVE_TABLE
4. ALTER TABLE ADD PARTITION (foo1 = , foo2 = ) LOCATION 'path/to/files'
From: Stephen Sprague sprag...@gmail.com Reply-To: user@hive.apache.org Date: Friday, July 26, 2013 4:37 PM To: user@hive.apache.org Subject: Re: Merging different HDFS file for HIVE i like #2. so you have three, say, external tables representing your three feed files. After the third and final file is loaded then join 'em all together - maybe make the table partitioned for one per day. for example:
alter table final add partition (datekey=MMDD);
insert overwrite table final partition (datekey=MMDD)
select EMP_ID, f1, ..., f10
from FF1 a
join FF2 b on (a.EMP_ID = b.EMP_ID)
join FF3 c on (b.EMP_ID = c.EMP_ID);
Or a variation on #3: make a view on the three tables which would look just like the select statement above. What do you want to optimize for?
On Fri, Jul 26, 2013 at 5:30 AM, Nitin Pawar nitinpawar...@gmail.com wrote: Option 1) Use pig or oozie, write a workflow and join the files to a single file. Option 2) Create a temp table for each of the different files, then join them into a single table and delete the temp tables. Option 3) Don't do anything; change your queries to look at the three different files when they query different fields. Wait for others to give better suggestions :) On Fri, Jul 26, 2013 at 4:22 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Please help in providing a solution for the below problem... this scenario is applicable in Banking at least... I have a HIVE table with the below structure... Hive Table: Field1 ... Field10 For the above table, I will get the values for each feed in a different file. You can imagine that these files belong to the same branch and will arrive at any time interval. I have to load into the table only if I get all 3 files for the same branch. (Assume that we have a common field in all the files to join on.) Feed file 1: EMP ID, Field 1, Field 2, Field 6, Field 9 Feed file 2: EMP ID, Field 5, Field 7, Field 10 Feed file 3: EMP ID, Field 3, Field 4, Field 8 Now the question is, what is the best way to merge all these files into a single file so that it can be placed under the HIVE structure. regards, Rams -- Nitin Pawar
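The reducer step proposed above (merge the three feeds on EMP ID, and only emit an employee once all three feeds have contributed) can be sketched in plain Python; the feed contents and field values here are invented for illustration:

```python
# Each feed maps EMP ID -> the disjoint subset of fields it carries,
# mirroring the three feed files in the question. Values are made up.
feed1 = {"EMP1": {"field1": "a", "field2": "b", "field6": "f", "field9": "i"}}
feed2 = {"EMP1": {"field5": "e", "field7": "g", "field10": "j"}}
feed3 = {"EMP1": {"field3": "c", "field4": "d", "field8": "h"}}

def merge(*feeds):
    """Union the per-employee field dicts, keeping only employees
    present in every feed ("load only if all 3 files arrived")."""
    out = {}
    for feed in feeds:
        for emp_id, fields in feed.items():
            out.setdefault(emp_id, {}).update(fields)
    return {k: v for k, v in out.items() if all(k in f for f in feeds)}

merged = merge(feed1, feed2, feed3)
```

This is exactly what the MapReduce version does at scale: the mapper keys every record by EMP ID, and the reducer sees all of one employee's records together and emits the ten-field row.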
Re: Merging different HDFS file for HIVE
Hi I am using Oozie Coordinators to schedule and run daily Oozie Workflows that contain 35-40 actions each (I use shell, java, hive and map reduce oozie actions). So if anyone needs help and has questions please fire away… sanjay
Re: Help in debugging Hive Query
The query is correct, but since you are creating a managed table, that is possibly creating some issue and not all the records are getting created. This is what I would propose.
CHECKPOINT 1: Is this query running at all?
Use the option below and run the QUERY ONLY (without any table creation) to log errors, and pipe to a log file using nohup or some other way that you prefer:
hive -hiveconf hive.root.logger=INFO,console -e "select a.evar23, sum(b.max_visit_page_num) from (select distinct visid_high, visid_low, evar23 from web.omniture_web_data) a JOIN (select visid_high, visid_low, max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high, visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23;"
CHECKPOINT 2: Run the query (using the CREATE TABLE option) with these additional options.
Required params:
SET mapreduce.job.maps=500;
SET mapreduce.job.reduces=8;
SET mapreduce.tasktracker.map.tasks.maximum=12;
SET mapreduce.tasktracker.reduce.tasks.maximum=8;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;
Optional params (if you are using compression in the output; you can change LzopCodec to whatever you are using for compression):
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET mapreduce.output.fileoutputformat.compress=true;
Thanks Sanjay From: Raj Hadoop hadoop...@yahoo.com Reply-To: user@hive.apache.org Date: Thursday, July 25, 2013 5:00 AM To: Hive user@hive.apache.org Subject: Help in debugging Hive Query All, I am trying to determine visits per customer from an omniture weblog file using Hive.
Table: omniture_web_data
Columns: visid_high, visid_low, evar23, visit_page_num
Sample Data (visid_high, visid_low, evar23, visit_page_num):
999,888,1003,10
999,888,1003,14
999,888,1003,6
999,777,1003,12
999,777,1003,20
I want to calculate total visits for each Customer Number (evar23 is the Customer Number). visid_high and visid_low determine a unique visit. For each distinct visit, calculate the sum of the maximum visit_page_num. In the above example 14 + 20 = 34 should be the total visits for customer 1003. I am trying to run the following queries. Method 1 is almost the same as Method 2, except that in Method 1 I only choose a particular customer number, 1003; in Method 2 I generalized to all customers. In Method 1 I am getting the accurate result; in Method 2 I am not getting the same result as Method 1. Any suggestions on how to troubleshoot? Also, any alternative approaches?
// Method 1
select a.evar23, sum(b.max_visit_page_num) from (select distinct visid_high, visid_low, evar23 from web.omniture_web_data where evar23='1003') a JOIN (select visid_high, visid_low, max(visit_page_num) as max_visit_page_num from omniture_web_data where evar23='1003' group by visid_high, visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23;
// Result of Method 1
1003  34
// Method 2
create table temp123 as select a.evar23, sum(b.max_visit_page_num) from (select distinct visid_high, visid_low, evar23 from web.omniture_web_data) a JOIN (select visid_high, visid_low, max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high, visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23;
select * from temp123 where evar23='1003';
// The Result of Method 2 is not the same as Method 1. It is showing a different number.
Thanks, Raj
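To pin down what the correct answer should be, the intended metric can be checked in plain Python against the sample rows above: take max(visit_page_num) per distinct visit (visid_high, visid_low), then sum those maxima per customer (evar23). It confirms the expected 34 for customer 1003.

```python
# Sample rows from the email: (visid_high, visid_low, evar23, visit_page_num)
rows = [
    (999, 888, "1003", 10),
    (999, 888, "1003", 14),
    (999, 888, "1003", 6),
    (999, 777, "1003", 12),
    (999, 777, "1003", 20),
]

# Step 1: max visit_page_num per distinct visit (evar23 rides along the key).
max_per_visit = {}
for hi, lo, cust, page_num in rows:
    key = (hi, lo, cust)
    max_per_visit[key] = max(max_per_visit.get(key, 0), page_num)

# Step 2: sum the per-visit maxima per customer.
totals = {}
for (hi, lo, cust), m in max_per_visit.items():
    totals[cust] = totals.get(cust, 0) + m

print(totals)  # {'1003': 34}
```

Any correct Hive query (Method 1 or Method 2) should reproduce this number, so a mismatch points at the query plan or the table creation, not the metric definition.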
Calling same UDF multiple times in a SELECT query
Hi We are using version hive-exec-0.9.0-cdh4.1.2 in production. I need to check and use the output from a UDF in a query to assign values to 2 columns in a SELECT query. Example:
SELECT a,
       IF(fooUdf(a) <> -1, -1, fooUdf(a)) as b,
       IF(fooUdf(a) <> -1, fooUdf(a), 0) as c
FROM my_hive_table
So will fooUdf be called 4 times? Or once? This is important because in our case this UDF calls a web service and I don't want so many calls to the service. Thanks sanjay
Re: Calling same UDF multiple times in a SELECT query
Thanks Jan I will mod my UDF and test it out. I want to make sure I understand your words here: "The obvious condition is that it must always return the identical result when called with same parameters." If I can make sure that a call to the web service is successful, it will always return the same output for a given set of inputs, i.e. F(x1,y1) will always equal z1. That's what you mean, right? sanjay From: Jan Dolinár dolik@gmail.com Reply-To: user@hive.apache.org Date: Tuesday, July 23, 2013 12:35 PM To: user user@hive.apache.org Subject: Re: Calling same UDF multiple times in a SELECT query Hi, If you use this annotation, Hive should be able to optimize it to a single call:
@UDFType(deterministic = true)
The obvious condition is that it must always return the identical result when called with same parameters. A little bit more on this can be found in Mark Grover's post at http://mark.thegrovers.ca/1/post/2012/06/how-to-write-a-hive-udf.html. Regards, Jan On Tue, Jul 23, 2013 at 9:25 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Function return values are not stored for repeat use of the same call (as per my understanding). I know you may have already thought about the other approach, such as:
select a, if (call <> -1, -1, call) as b from (select a, fooudf(a) as call from table) t
On Wed, Jul 24, 2013 at 12:42 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Hi We are using version hive-exec-0.9.0-cdh4.1.2 in production. I need to check and use the output from a UDF in a query to assign values to 2 columns in a SELECT query. Example:
SELECT a, IF(fooUdf(a) <> -1, -1, fooUdf(a)) as b, IF(fooUdf(a) <> -1, fooUdf(a), 0) as c FROM my_hive_table
So will fooUdf be called 4 times? Or once? This is important because in our case this UDF calls a web service and I don't want so many calls to the service.
Thanks sanjay
-- Nitin Pawar
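The effect of marking a UDF deterministic can be mimicked in plain Python with memoization. This is only a rough analogy of what Hive's optimizer can do for a deterministic UDF, not Hive's actual implementation, and foo_udf is a hypothetical stand-in for the web-service-backed UDF from the thread:

```python
from functools import lru_cache

calls = 0  # counts how often the "web service" is actually invoked

@lru_cache(maxsize=None)
def foo_udf(a):
    """Hypothetical stand-in for the UDF that calls a web service.
    Deterministic: same input always yields the same output."""
    global calls
    calls += 1
    return a * 2

a = 21
# The call sites from the SELECT, evaluated for one row:
#   b = IF(fooUdf(a) <> -1, -1, fooUdf(a))
#   c = IF(fooUdf(a) <> -1, fooUdf(a), 0)
b = -1 if foo_udf(a) != -1 else foo_udf(a)
c = foo_udf(a) if foo_udf(a) != -1 else 0

print(calls)  # 1 -- all call sites collapse to one underlying evaluation
```

Without the cache (the non-deterministic case), each textual occurrence of foo_udf(a) would trigger a separate "service" call, which is exactly the cost the original question worries about.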
Re: export csv, use ',' as split
Hive does not have an output delimiter specifier yet (not sure if 0.11.x may have it). But for now please try the following:
hive -e "myquery" | sed 's/\t/,/g' > result.csv
Good luck Sanjay From: kentkong_work kentkong_w...@163.com Reply-To: user@hive.apache.org Date: Tuesday, July 9, 2013 9:48 PM To: user user@hive.apache.org Subject: export csv, use ',' as split hi here, I create a table like this and put a lot of data into it. Then I export the query result into a csv file like this: hive -e "myquery" > result.csv but the csv uses tab as the delimiter. How do I make hive use ','? thanks!
CREATE TABLE if not exists upload_users(
  username string,
  mobile string,
  id_type string,
  id_no string,
  email string,
  address string,
  validate_time string
) partitioned by (fileid string)
row format delimited fields terminated by ',';
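The sed pipeline above just rewrites every tab to a comma. The same transform in plain Python (sample line invented) looks like this, with one caveat worth noting: blind substitution produces broken CSV if a field value itself contains a comma, so the csv module's quoting is safer for real exports.

```python
import csv
import io

# A tab-delimited line as 'hive -e' would emit it (sample data, made up).
hive_output = "alice\t13800000000\tpassport\tA123\n"

# sed-style: blind substitution, equivalent to sed 's/\t/,/g'
naive = hive_output.replace("\t", ",")

# Safer: let the csv module quote fields containing commas or quotes.
buf = io.StringIO()
writer = csv.writer(buf)
for line in hive_output.splitlines():
    writer.writerow(line.split("\t"))
```

One more caveat for the sed approach: embedded tabs inside field values (possible with free-text columns) also corrupt the output, since Hive's own column separator is the same character.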
Re: Hive CLI
Hi Rahul Is there a reason why you use the Hive CLI? I have aliases defined that I use, so I never had to use the Hive CLI again:
alias hivescript='hive -e '
alias hivescriptd='hive -hiveconf hive.root.logger=INFO,console -e '
So when I want to run hive commands from Linux I just type:
hivescript "select colA, colB, colC from TableN where partitionA='somval1' limit 100"
Good luck sanjay From: rahul kavale kavale.ra...@gmail.com Reply-To: user@hive.apache.org Date: Monday, July 8, 2013 11:26 PM To: user@hive.apache.org Subject: Re: Hive CLI Sorry my bad, Ctrl-a was a screen binding, which is why I was not able to use it for moving the cursor. This had nothing to do with Hive or its configuration. My bad. Thanks Regards, Rahul On 9 July 2013 11:00, rahul kavale kavale.ra...@gmail.com wrote: Hey there, I have been using HIVE (0.7) for a while now using the CLI and bash scripts. But it's a pain to move the cursor in the CLI, i.e. once you enter a very long query you can't go to the start of the query (like you do using Ctrl+A/Ctrl+E in the terminal). Does anyone know how to do it? Thanks Regards, Rahul
Re: integration issure about hive and hbase
I am attaching portions from a document I had written last year while investigating HBase and Hive. You may have already crossed that bridge…nevertheless… Please forgive me :-) if some steps seem hacky and not very well explained…I was on a solo mission to build a Hive data platform from scratch and QDBW (Quick and Dirty But Works) was my philosophy to go ahead!

Good luck
Sanjay

= Hive and Hbase integration on local Fedora desktop guide (https://wizecommerce.atlassian.net/wiki/display/traffic/Hive+and+Hbase+integration+on+local+Fedora+desktop+guide)

Pre-requisites
* Hadoop needs to be installed and HDFS needs to be running (Hadoop HDFS setup on local Fedora desktop guide: https://wizecommerce.atlassian.net/wiki/display/traffic/Hadoop+HDFS+setup+on+local+Fedora+desktop+guide)
* Hive needs to be installed (Hive setup on local Fedora desktop guide: https://wizecommerce.atlassian.net/wiki/display/traffic/Hive+setup+on+local+Fedora+desktop+guide)
* HBase needs to be installed and running (Hbase setup on local Fedora desktop guide: https://wizecommerce.atlassian.net/wiki/display/traffic/Hbase+setup+on+local+Fedora+desktop+guide)
* Make sure ZooKeeper is running on port 2181. If not, stop HBase, change $HBASE_HOME/conf/hbase-site.xml and restart HBase.

Copying JARs to HADOOP_CLASSPATH
Before you query tables, copy these jars from $HIVE_HOME/lib to $HADOOP_HOME/lib:
1. Make sure zookeeper-3.4.3.jar is not there: ls -latr $HADOOP_HOME/lib/zookeeper-3.4.3.jar
2. Copy zookeeper-3.4.3.jar: sudo cp -av $HIVE_HOME/lib/zookeeper-3.4.3.jar $HADOOP_HOME/lib
3. Make sure hive-common-0.9.0.jar is not there: ls -latr $HADOOP_HOME/lib/hive-common-0.9.0.jar
4. Copy hive-common-0.9.0.jar: sudo cp -av $HIVE_HOME/lib/hive-common-0.9.0.jar $HADOOP_HOME/lib
5. Make sure hive-hbase-handler-0.9.0.jar is not there: ls -latr $HADOOP_HOME/lib/hive-hbase-handler-0.9.0.jar
6. Copy hive-hbase-handler-0.9.0.jar: sudo cp -av $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar $HADOOP_HOME/lib
7. Exit from the Hive shell (type exit;)
8. Exit from the HBase shell
9. Stop HBase: $HBASE_HOME/bin/stop-hbase.sh
10. Stop Hadoop/HDFS: $HADOOP_HOME/bin/stop-all.sh
11. Check that NO java processes related to Hadoop/HDFS/HBase/Hive exist: ps auxw | grep java
12. Start Hadoop/HDFS: $HADOOP_HOME/bin/start-all.sh
13. Start HBase: $HBASE_HOME/bin/start-hbase.sh
14. Check that ALL java processes related to Hadoop/HDFS/HBase/Hive exist: ps auxw | grep java

Create tables in HBase
* Refer to the Hbase setup on local Fedora desktop guide (https://wizecommerce.atlassian.net/wiki/display/traffic/Hbase+setup+on+local+Fedora+desktop+guide) and create the tables mentioned there:
* hbase_2_hive_food
* hbase_2_hive_names

Create tables in HIVE
To run Hive, type $HIVE_HOME/bin/hive. This will take you to the Hive shell. In the shell, create these two tables:
* CREATE EXTERNAL TABLE hbase_hive_names(hbid INT, id INT, fn STRING, ln STRING, age INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:fn,name:ln,age:age") TBLPROPERTIES ("hbase.table.name" = "hbase_2_hive_names");
* This HIVE table will map to the HBase table hbase_2_hive_names
* CREATE EXTERNAL TABLE hbase_hive_food(hbid INT, id INT, name STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:name") TBLPROPERTIES ("hbase.table.name" = "hbase_2_hive_food");
* This HIVE table will map to the HBase table hbase_2_hive_food

Creating and loading tables in HBase through Hive
* Make sure there is no table in HBase called 'hive2hbase_names_table'
* In the Hive shell:
* CREATE TABLE hive2hbase_names_table (hb_id int, fn string, ln string, age_dnq INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,student:fn,student:ln,student:age") TBLPROPERTIES ("hbase.table.name" = "hive2hbase_names_table");
* Go to the HBase shell
* Check that table hive2hbase_names_table is created.
* In the Hive shell:
* Create a Hive table and populate it with data, which we will use to populate the Hive/HBase table:
* CREATE TABLE names_tab (hb_id int, fn string, ln string, age_dnq INT) PARTITIONED BY (age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
* LOAD DATA LOCAL INPATH '/data/mycode/impressions/inputfiles/names1.tsv.4fields' OVERWRITE INTO TABLE names_tab PARTITION (age=60);
* LOAD DATA LOCAL INPATH
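Note that in real HiveQL the SERDEPROPERTIES and TBLPROPERTIES keys and values must be quoted strings (the wiki-extraction above lost the quotes). A sketch that just generates one of the corrected DDL statements into a script file, to be run later with `hive -f`:

```shell
# Write the HBase-backed table DDL with properly quoted property keys/values;
# run it later with: hive -f create_hbase_hive_names.hql
cat > create_hbase_hive_names.hql <<'EOF'
CREATE EXTERNAL TABLE hbase_hive_names(hbid INT, id INT, fn STRING, ln STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:fn,name:ln,age:age")
TBLPROPERTIES ("hbase.table.name" = "hbase_2_hive_names");
EOF
```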
Re: Special characters in web log file causing issues
U may have to remove non-printable chars first, save an intermediate file and then load into Hive:

tr -cd '[:print:]\r\n\t'

Or, if u have the strings utility, that will only output printable chars.

From: Raj Hadoop hadoop...@yahoo.com
Reply-To: user@hive.apache.org
Date: Monday, July 8, 2013 1:52 PM
To: Hive user@hive.apache.org
Subject: Special characters in web log file causing issues

Hi, The log file that I am trying to load through Hive has some special characters. The field is shown below and the special characters ¿¿ are also shown:

Shockwave Flash;Chrome Remote Desktop Viewer;Native Client;Chrome PDF Viewer;Adobe Acrobat;Microsoft Office 2010;Motive Plug-in;Motive Management Plug-in;Google Update;Java(TM) Platform SE 7 U21;McAfee SiteAdvisor;McAfee Virtual Technician;Windows Live¿¿ Photo Gallery;McAfee SecurityCenter;Silverlig

The above is causing the record to be terminated and another line to be loaded. How can I avoid this type of issue and load the proper data? Any suggestions please. Thanks, Raj
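A minimal sketch of the tr pre-cleaning step on a line containing stray high bytes (file names are placeholders; LC_ALL=C is set so that `[:print:]` means printable ASCII regardless of locale):

```shell
# Strip everything that is not printable ASCII, CR, LF, or tab, writing an
# intermediate file that Hive can then load cleanly.
printf 'Windows Live\277\277 Photo Gallery\n' > weblog.raw
LC_ALL=C tr -cd '[:print:]\r\n\t' < weblog.raw > weblog.clean
cat weblog.clean   # Windows Live Photo Gallery
```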
Re: Loading a flat file + one additional field to a Hive table
Assuming that u can replicate this problem with, say, 100K records in the log file versus some records in Hive, the way I would start the debug is:

hive -e 'select * from my_hive_table' | sort > hive.out.check1.sorted

Your original log file is, say, log.original.txt:

sort log.original.txt > log.original.txt.sorted

Make sure the separators in both files hive.out.check1.sorted and log.original.txt.sorted are the same, then:

diff log.original.txt.sorted hive.out.check1.sorted

See what u get…and analyze why some extra records came in. My guess is: in your Hive Metastore, possibly there is another PARTITION that could be pointing to a data location containing data from some previous logs….
sanjay

From: Raj Hadoop hadoop...@yahoo.com
Reply-To: user@hive.apache.org
Date: Friday, July 5, 2013 3:27 PM
To: user@hive.apache.org
Subject: Re: Loading a flat file + one additional field to a Hive table

Thanks Sanjay. I will look into this. Also, one more question. When I am loading a log file into Hive and comparing the counts like this: select count(*) from Table versus wc -l File, I see a few hundred more records in the Table. How should I debug it? Any tips please.

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
To: user@hive.apache.org; Raj Hadoop hadoop...@yahoo.com
Sent: Saturday, July 6, 2013 4:32 AM
Subject: Re: Loading a flat file + one additional field to a Hive table

How about this ?
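The sort-and-diff comparison above can be sketched on two small sample files (names are placeholders; in a real run one side would come from `hive -e 'select * from my_hive_table'`). `comm -23` isolates lines present only in the Hive output, i.e. the suspect extra records:

```shell
# The Hive side has one extra record; sorting both sides makes the
# comparison meaningful.
printf 'a\nb\nc\nghost\n' > hive.out
printf 'c\na\nb\n'        > log.original.txt
sort hive.out         > hive.out.check1.sorted
sort log.original.txt > log.original.txt.sorted
# Lines only in the Hive output (candidates for a stray partition):
comm -23 hive.out.check1.sorted log.original.txt.sorted   # prints: ghost
```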
Assume you have a log file called oompaloompa.log:

TIMESTAMP=$(date +%Y_%m_%d_T%H_%M_%S)
mv oompaloompa.log oompaloompa.log.${TIMESTAMP}
cat oompaloompa.log.${TIMESTAMP} | hdfs dfs -put - /user/sasubramanian/oompaloompa.log.${TIMESTAMP}

This will directly put the file on HDFS, and u can put it at the LOCATION specified by your HIVE TABLE definition.
sanjay

From: manishbh...@rocketmail.com
Reply-To: user@hive.apache.org
Date: Friday, July 5, 2013 10:39 AM
To: Raj Hadoop hadoop...@yahoo.com, Hive user@hive.apache.org
Subject: Re: Loading a flat file + one additional field to a Hive table

Raj, You should dump the data into a temp table first and then move the data into the final table with a select query: Select date(), c1, c2, … from temp table. Reason: we should avoid custom operations in load unless necessary. Sent via Rocket from my HTC

- Reply message -
From: Raj Hadoop hadoop...@yahoo.com
To: Hive user@hive.apache.org
Subject: Loading a flat file + one additional field to a Hive table
Date: Fri, Jul 5, 2013 10:30 PM

Hi, Can anyone please suggest the best way to do the following in Hive? Load 'today's date stamp' + ALL FIELDS C1,C2,C3,C4 IN A FILE F1 to a Hive table T1 (D1,C1,C2,C3,C4). Can the following command be modified in some way to achieve the above?

hive> load data local inpath '/software/home/hadoop/dat_files/' into table T1;

My requirement is to append a date stamp to a web log file and then load it into a Hive table. Thanks, Raj
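Raj's underlying requirement — prepend today's date stamp to every web-log line before loading — can also be handled in the same shell step with awk (the file name is a placeholder):

```shell
# Add a date-stamp column in front of each tab-separated record, producing
# the (D1, C1..C4) layout expected by table T1.
printf 'c1\tc2\tc3\tc4\n' > weblog.txt
awk -v d="$(date +%Y-%m-%d)" -v OFS='\t' '{print d, $0}' weblog.txt > weblog.stamped
```

The stamped file can then be loaded with a plain LOAD DATA LOCAL INPATH, with no custom logic inside Hive.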
One query works, the other does not…any clues?
THIS FAILS =
INSERT OVERWRITE DIRECTORY '/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts'
select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 'all_servers', count(*)
from outpdir_impressions_header where header_date_partition='2013-07-01'

13/07/03 14:28:31 ERROR exec.Task: Failed with exception Unable to rename: hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1 to: /user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename: hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1 to: /user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
    at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:95)
    at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:159)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:695)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

THIS SUCCEEDS =
INSERT OVERWRITE DIRECTORY '/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/record_counts/2013-07-01'
select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 'all_servers', count(*)
from outpdir_impressions_header where header_date_partition='2013-07-01'

TABLE DEFINITION
CREATE EXTERNAL TABLE IF NOT EXISTS impressions_hive_stats(table_name STRING, aggregation_date STRING, metric_name STRING, metric_key STRING, metric_value BIGINT)
PARTITIONED BY (table_name_partition STRING, aggregation_date_partition STRING, metric_name_partition STRING, metric_key_partition STRING)
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' ;
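The failing case looks like the classic "parent directory of the rename target does not exist" pattern: the succeeding query writes under .../record_counts/ (which already exists), while the failing one writes under .../2013-07-01/ (which does not). The same behavior is easy to reproduce with a local-filesystem mv; a sketch only — HDFS rename semantics are analogous but not identical:

```shell
# mv into a path whose parent directory is missing fails, much like the
# MoveTask rename above; pre-creating the parent fixes it.
mkdir -p demo && cd demo
touch part-00000
mv part-00000 2013-07-01/record_counts 2>/dev/null || echo "rename failed"
mkdir -p 2013-07-01
mv part-00000 2013-07-01/record_counts && echo "rename ok"
```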
Re: One query works, the other does not…any clues?
For the time being I have added the create-HDFS-dir step to my hive script…got to keep moving on…can't wait for the ideal solution :-) but would love to know the ideal solution!

!hdfs dfs -mkdir /user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts ;
INSERT OVERWRITE DIRECTORY '/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts'
SELECT 'outpdir_impressions_header', '2013-07-01', 'record_counts', 'all_servers', count(*)
FROM outpdir_impressions_header
WHERE header_date_partition='2013-07-01' ;

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Wednesday, July 3, 2013 2:40 PM
To: user@hive.apache.org
Subject: One query works, the other does not…any clues?
Request perm to edit wiki
Hi, I'd like to add documentation on how to use LZO compression in Hive. May I get edit permissions? Thanks. Sanjay
Confluence username: sanjaysubraman...@yahoo.com
Apache JIRA username: sanjaysubramanian

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Tuesday, June 18, 2013 2:12 AM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

You can email hive-...@hadoop.apache.org asking for edit privileges on the Hive wiki. Here's an example from the archives: http://mail-archives.apache.org/mod_mbox/hive-dev/201304.mbox/%3ccakhgjs42yqqj54rc8yvzgxbubies7mpv9fdofq2z3onfo67...@mail.gmail.com%3e. Once you have permission, you can use the Edit and Add symbols in the upper right corner of each wiki page. For page location, I'd have to think it over and check the existing docs. Initial thoughts:
* The Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) has a placeholder for file formats which only contains ORC files now.
* Perhaps the DDL section is more appropriate. If so, add a sub-bullet on the Language Manual page.
* If you don't want it in the Language Manual, just put it on the home page (https://cwiki.apache.org/confluence/display/Hive/Home) and add a link under User Documentation.
* As long as it's somewhere in the Hive wiki tree, the apparent organization can be changed by adding/removing links.
* Many people don't pay attention to the page structure, they just google the topic they're looking for.
– Lefty

On Tue, Jun 18, 2013 at 2:56 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: :-) Not sure how to add a page…maybe the Admin needs to grant me permission.

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:50 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Sure…would love to add the LZO compression in Hive. Is there a specific page structure u want me to add to in Confluence? https://cwiki.apache.org/confluence
thanks
sanjay

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, and others could add their insights. Then the information would be available to all, immediately.
– Lefty

On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle ramki.pa...@gmail.com wrote: Hi Sanjay, Can you quickly give your insights on this topic, if possible? Regards, Ramki.
On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Hi Programming Hive book authors, Maybe a lot of u have already successfully implemented this, but only in these last two weeks did we implement our aggregations using LZO compression in Hive: MR jobs creating LZO files as input for Hive, thereafter Hive aggregations creating more LZO files as output. As usual nothing was straightforward :-) The other challenge was to neatly tie it all into actions in Oozie workflows…but after being underwater for weeks I think I am able to rise above water and breathe! In the next version of the book, if u guys r planning to add detailed sections on using LZO compression in Hive, let me know…my experiences might be useful :-) Thanks sanjay
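For readers landing here from search: the usual moving parts for LZO in Hive look roughly like the sketch below. It assumes the hadoop-lzo jar and native libraries are already installed and `io.compression.codecs` includes the Lzop codec; treat it as an outline, not a verified recipe:

```sql
-- Compress query output as .lzo files (session-level settings)
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- Read LZO input via the splittable input format (split support requires
-- .index files built with com.hadoop.compression.lzo.DistributedLzoIndexer)
CREATE EXTERNAL TABLE my_lzo_table (line STRING)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

The INPUTFORMAT/OUTPUTFORMAT pair here matches the table definition shown earlier in this digest; the table name `my_lzo_table` is a placeholder.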
Re: show table throwing strange error
Can u stop the following services:
hive-server
hive-metastore
hive-server2 (if u r running that)

Move the current hive.log someplace else, start the services again, and check hive.log?

Also can u paste the CREATE TABLE script verbatim here…I feel if u are using some custom INPUTFORMAT / OUTPUTFORMAT classes, those have to be specified in quotes…u may have to escape that. Plus try adding a semicolon to the end of the create table script.
sanjay

From: Mohammad Tariq donta...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 12:52 PM
To: user@hive.apache.org
Subject: Re: show table throwing strange error

Thank you for looking into it Sanjay. show tables is working fine from both the Ubuntu and Hive shells. But I'm getting the same error as yesterday when I'm running create table, which is:

line 1:30 character '' not supported here
line 1:31 character '' not supported here
... (repeats for every position through line 1:42) ...

Also, I have noticed one strange thing: hive.log is totally messed up. It looks like logs are getting written in some binary encoding. I have attached a snapshot of the same. Any idea?
Warm Regards, Tariq
cloudfront.blogspot.com

On Fri, Jun 21, 2013 at 1:03 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Can u try from your Ubuntu command prompt: $ hive -e "show tables"

From: Mohammad Tariq donta...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 4:28 AM
To: user@hive.apache.org
Subject: Re: show table throwing strange error

Thank you for the response ma'am. It didn't help either. Warm Regards, Tariq

On Thu, Jun 20, 2013 at 8:43 AM, Sunita Arvind sunitarv...@gmail.com wrote: Your issue seems familiar. Try logging out of the Hive session and logging back in. Sunita

On Wed, Jun 19, 2013 at 8:53 PM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I have a Hive (0.9.0) setup on my Ubuntu box running hadoop-1.0.4. Everything was going smoothly till now, but today when I issued "show tables" I got some strange error on the CLI.
Here is the error:

hive> show tables;
FAILED: Parse Error:
line 1:0 character '' not supported here
line 1:1 character '' not supported here
line 1:2 character '' not supported here
... (the same message repeats for every character position through line 1:46) ...
Re: show table throwing strange error
Can you try from your Ubuntu command prompt: $ hive -e "show tables"

From: Mohammad Tariq donta...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 4:28 AM
To: user@hive.apache.org
Subject: Re: show table throwing strange error

Thank you for the response ma'am. It didn't help either.
Warm Regards, Tariq
cloudfront.blogspot.com

On Thu, Jun 20, 2013 at 8:43 AM, Sunita Arvind sunitarv...@gmail.com wrote:
Your issue seems familiar. Try logging out of the hive session and logging back in. Sunita

On Wed, Jun 19, 2013 at 8:53 PM, Mohammad Tariq donta...@gmail.com wrote:
Hello list, I have a Hive (0.9.0) setup on my Ubuntu box running hadoop-1.0.4. Everything was going smoothly till now, but today when I issued "show tables" I got a strange error on the CLI. Here is the error:

hive> show tables;
FAILED: Parse Error: line 1:0 character '' not supported here
line 1:1 character '' not supported here
line 1:2 character '' not supported here
[the same message repeats for every character position up to]
line 1:381 character '' not supported here

Strangely, other queries like select foo from pokes where bar = 'tariq'; are working fine. I tried to search over the net but could not find anything useful. Need some help. Thank you so much for your time.
Warm Regards, Tariq
cloudfront.blogspot.com
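The empty quotes in `character '' not supported here` usually mean the statement that reached the parser contained non-printing bytes (for example, characters carried along by a copy-paste from a rich-text source), which would also explain why other freshly typed queries work fine. A minimal local way to check a statement for such bytes, using a hypothetical scratch file:

```shell
# Simulate a paste that carried a UTF-8 non-breaking space (bytes 0xC2 0xA0,
# octal 302 240) instead of a plain space.
printf 'show\302\240tables;\n' > /tmp/suspect.hql

# od -b dumps each byte in octal; anything in the 200-377 range lies
# outside printable ASCII and would not be accepted by the HiveQL lexer.
od -An -b /tmp/suspect.hql | tr -s ' ' '\n' | grep -E '^[23][0-7]{2}$' \
  && echo 'non-ASCII bytes present'
```

Retyping the statement by hand, or running it through `hive -f` from a file that passes this check, rules the hidden-character theory in or out.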
Re: LZO compression implementation in Hive
Sure…would love to add the LZO compression in Hive. Is there a specific page structure you want me to add to in Confluence? https://cwiki.apache.org/confluence
thanks sanjay

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, and others could add their insights. Then the information would be available to all, immediately. – Lefty

On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle ramki.pa...@gmail.com wrote:
Hi Sanjay, Can you quickly give your insights on this topic, if possible? Regards, Ramki.

On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Hi Programming Hive Book authors, Maybe a lot of you have already implemented this successfully, but only in these last two weeks did we implement our aggregations using LZO compression in Hive: MR jobs creating LZO files as input for Hive, and thereafter Hive aggregations creating more LZO files as output. As usual nothing was straightforward :-) The other challenge was to neatly tie it all into actions in Oozie workflows…but after being underwater for weeks I think I am able to rise above water and breathe! In the next version of the book, if you guys are planning to add detailed sections on using LZO compression in Hive, let me know…my experiences might be useful :-) Thanks sanjay

CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
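For reference alongside the discussion above, here is a minimal sketch of the write-side settings and read-side DDL typically used for lzop files with Hive. Class and property names come from the hadoop-lzo project and Hadoop 2; the table name and path are hypothetical:

```sql
-- Write side: make final job output lzop-compressed
-- (requires the hadoop-lzo jars and native libraries on the cluster).
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

-- Read side: declare the table over .lzo text files so that, once the
-- files are indexed (.lzo.index), splits can be computed per block.
CREATE EXTERNAL TABLE logs_lzo (line STRING)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/logs_lzo';
```

Without the index files, each .lzo file is read as a single unsplittable input, which loses most of the parallelism the format is chosen for.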
Re: hive to hbase mapping
How about you have two streams from your data generation source: one to HBase and one to Hive? Moving data out of HBase may not be trivial, especially if the data sizes are large….

From: Mario Casola mario.cas...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, June 14, 2013 9:54 AM
To: user@hive.apache.org
Subject: Re: hive to hbase mapping

Hi Sanjay, thanks for the response. I need HBase because it is perfect for aggregating data through the counters, and write performance is great. Now the problem is…which is the best way to load HBase data into a Hive table periodically (every hour, for example)? Mario

2013/6/14 Sanjay Subramanian sanjay.subraman...@wizecommerce.com
6 months back I was tasked with building a data platform for logs, and I benchmarked HBase + Hive (queries were 8X slower) against Hive only. So I decided on the Hive-only option and am deploying that solution to production. A couple of things you can think about in your design if you really want to go HBase + Hive (also look at http://hadoopstack.com/hive-on-hbase-part-1/):
- Query only today's data in a Hive + HBase architecture
- For data older than one day, query Hive only
Hope I am not diverting from your question and problem. sanjay

From: Mario Casola mario.cas...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, June 14, 2013 8:54 AM
To: user@hive.apache.org
Subject: hive to hbase mapping

Hi, I have a performance issue when I query HBase from Hive. My idea is to build the scenario below:
1. Collect data in HBase for aggregation purposes
2. Create an external table that maps Hive to HBase
3. Create a real Hive table
4. Periodically transfer data from HBase to Hive through INSERT INTO real hive table SELECT * FROM external table WHERE time = 201305212909

Currently I'm doing a test on an HBase table that has 70,000,000 rows, and I'm trying to query this table with a single column value filter, like the query above. If I try this type of query directly in HBase the response time is around 80 seconds. If I try the query in the Hive shell, after 30 minutes all the tasks (9 in my case) are 0.00% complete. What could be the problem? thanks Mario
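The mapping and periodic transfer described in steps 2-4 of the thread above can be sketched in HiveQL. All table and column names here are hypothetical; the Hive-specific pieces are the standard HBaseStorageHandler and the hbase.columns.mapping SerDe property:

```sql
-- Step 2: external table mapped onto an existing HBase table (names hypothetical).
CREATE EXTERNAL TABLE hbase_counters_ext (
  rowkey STRING,   -- maps to the HBase row key
  ts     BIGINT,   -- cf:ts column
  hits   BIGINT    -- cf:hits column
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:ts,cf:hits')
TBLPROPERTIES ('hbase.table.name' = 'counters');

-- Step 3: plain Hive table that holds the transferred copy.
CREATE TABLE counters_hive (rowkey STRING, ts BIGINT, hits BIGINT);

-- Step 4: the periodic transfer. Note that a predicate on a regular
-- column (like ts here) turns into a full HBase table scan through the
-- storage handler; only row-key predicates can become HBase range scans.
INSERT INTO TABLE counters_hive
SELECT rowkey, ts, hits
FROM hbase_counters_ext
WHERE ts >= 201305212000;
```

This is consistent with the slowness reported in the thread: if the hour is encoded in the row key (e.g., a yyyyMMddHH prefix), filtering on rowkey instead lets HBase do a range scan rather than reading all 70,000,000 rows.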
Re: LZO compression implementation in Hive
Thanks, I sent it.

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Tuesday, June 18, 2013 2:12 AM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

You can email hive-...@hadoop.apache.org asking for edit privileges on the Hive wiki. Here's an example from the archives: http://mail-archives.apache.org/mod_mbox/hive-dev/201304.mbox/%3ccakhgjs42yqqj54rc8yvzgxbubies7mpv9fdofq2z3onfo67...@mail.gmail.com%3e Once you have permission, you can use the Edit and Add symbols in the upper right corner of each wiki page.

For page location, I'd have to think it over and check the existing docs. Initial thoughts:
* The Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) has a placeholder for file formats which only contains ORC files now.
* Perhaps the DDL section is more appropriate. If so, add a subbullet on the Language Manual page.
* If you don't want it in the Language Manual, just put it on the home page (https://cwiki.apache.org/confluence/display/Hive/Home) and add a link under User Documentation.
* As long as it's somewhere in the Hive wiki tree, the apparent organization can be changed by adding/removing links.
* Many people don't pay attention to the page structure; they just google the topic they're looking for.
– Lefty

On Tue, Jun 18, 2013 at 2:56 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
:-) Not sure how to add a page…maybe the Admin needs to grant me permission

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:50 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Sure…would love to add the LZO compression in Hive. Is there a specific page structure you want me to add to in Confluence? https://cwiki.apache.org/confluence
thanks sanjay

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, and others could add their insights. Then the information would be available to all, immediately. – Lefty

On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle ramki.pa...@gmail.com wrote:
Hi Sanjay, Can you quickly give your insights on this topic, if possible? Regards, Ramki.

On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Hi Programming Hive Book authors, Maybe a lot of you have already implemented this successfully, but only in these last two weeks did we implement our aggregations using LZO compression in Hive: MR jobs creating LZO files as input for Hive, and thereafter Hive aggregations creating more LZO files as output. As usual nothing was straightforward :-) The other challenge was to neatly tie it all into actions in Oozie workflows…but after being underwater for weeks I think I am able to rise above water and breathe! In the next version of the book, if you guys are planning to add detailed sections on using LZO compression in Hive, let me know…my experiences might be useful :-) Thanks sanjay
Re: Errors in one Hive script using LZO compression
OK guys, I solved it in a not-so-elegant way, but I need to go forward and deploy this to production because of time constraints :-) I divided the script into two stages:
Stage 1: The Hive script creates TXT files and writes them to HDFS.
Stage 2: I wrote an LZO file creator and indexer that converts the TXT files on HDFS to .lzo and .lzo.index.
I still don't know what makes this specific Hive script throw this error…but I have got to keep going ahead…. If anyone can shed more light on this error in the future, I will STILL be interested in knowing the root cause. Thanks sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:59 PM
To: user@hive.apache.org
Subject: Errors in one Hive script using LZO compression

Hi, I am using LZO compression in our scripts, but one script is still producing errors.

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.EOFException: Premature EOF from inputStream
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:522)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:160)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:381)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException: Premature EOF from inputStream
    at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:75)
    at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:114)
    at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
    at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:83)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1871)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
    ... 9 more

SCRIPT ===
set hiveconf:mapred.output.compression.type=BLOCK;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress=true;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.intermediate=true;
set mapreduce.job.maps=500;
set mapreduce.job.reduces=8;
set mapreduce.tasktracker.map.tasks.maximum=12;
set mapreduce.tasktracker.reduce.tasks.maximum=8;

add jar /home/nextag/sasubramanian/mycode/impressions/jar/impressions-hiveudfs-1.0-20130615-155038.jar;

create temporary function collect as 'com.wizecommerce.utils.hive.udf.GenericUDAFCollect';
create temporary function isnextagip as 'com.wizecommerce.utils.hive.udf.IsNextagIP';
create temporary function isfrombot as 'com.wizecommerce.utils.hive.udf.IsFromBot';
create temporary function processblankkeyword as 'com.wizecommerce.utils.hive.udf.ProcessBlankKeyword';
create temporary function getValidHiddenSellers as 'com.wizecommerce.utils.hive.udf.GetValidHiddenSellers';

INSERT OVERWRITE DIRECTORY '/user/beeswax/warehouse/keyword_impressions_ptitles_log/2013-03-19'
SELECT hp.header_date, hp.impression_id, hp.header_searchsessionid, hp.cached_visit_id,
split(hp.header_servername,'[\.]')[0], hp.cached_ip, hp.header_adnode,
IF (concat_ws(',' , collect_set(concat_ws('|', cast(hp.seller_id as STRING), cast(IF(hp.seller_pricetier IS NULL, -1L, hp.seller_pricetier) as STRING), cast
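The "Premature EOF from inputStream" raised in LzopInputStream.readHeader means the codec hit end-of-file while still reading the lzop header, which typically points at an empty, truncated, or non-lzop file among the inputs (note the stack also shows SequenceFile$Reader handing bytes to LzopCodec, so a plain-text or stray file with an .lzo name would trigger the same failure). A local sanity check for the header, with hypothetical file names; the 9-byte lzop magic is 89 4c 5a 4f 00 0d 0a 1a 0a:

```shell
# Return success iff the file starts with the 9-byte lzop magic.
is_lzop() {
  head -c 9 "$1" | od -An -tx1 | tr -d ' \n' | grep -qx '894c5a4f000d0a1a0a'
}

# Simulate one file with a valid lzop header and one plain-text impostor.
printf '\211LZO\000\r\n\032\n' > /tmp/good.lzo
printf 'plain text'            > /tmp/bad.lzo

is_lzop /tmp/good.lzo && echo 'good.lzo: has lzop header'
is_lzop /tmp/bad.lzo  || echo 'bad.lzo: not an lzop file'
```

A header check does not catch a file truncated after its header, but running it over the job's input directory (copied locally or via hdfs dfs -cat piped to head) quickly isolates the empty or mislabeled files.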