Using SPLIT with DOT (.) delimiter demonstrates funny behavior within a VIEW

2015-08-24 Thread Sanjay Subramanian
Hi guys
I am using Hive version = 0.13.1-cdh5.3.3

HIVE TABLE = qnap_resume_file_location
---------------------------------------
DROP TABLE IF EXISTS qnap_resume_file_location;
CREATE EXTERNAL TABLE qnap_resume_file_location (
    floc STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/myfirm/file_location';

SAMPLE DATA = qnap_resume_file_location
---------------------------------------
select * from resume.qnap_resume_file_location limit 10;

/mnt/myfirm/200901/14008_1_1004942032_0_1005347483.html
/mnt/myfirm/200901/14008_1_1004944724_0_1005334758.html
/mnt/myfirm/200901/14000__900719077_10_901309048.doc
/mnt/myfirm/200901/14012_6_1004858088_0_1005227990.doc
/mnt/myfirm/200901/14008_1_1004811416_0_1005258541.html
/mnt/myfirm/200901/14008_1_501597002_6_1005224039.html
/mnt/myfirm/200901/14009_5_1004872908_0_1005244348.doc
/mnt/myfirm/200901/14001_7_14165_1_66775.docx
/mnt/myfirm/200901/14009_5_1004935267_0_1005322091.txt
/mnt/myfirm/200901/14012_6_1004904422_0_1005283729.doc

HIVE VIEW = qnap_resume_resid_file_loc_map_vw
---------------------------------------------
DROP VIEW IF EXISTS qnap_resume_resid_file_loc_map_vw;
CREATE VIEW IF NOT EXISTS qnap_resume_resid_file_loc_map_vw (resid, mm, floc) AS
SELECT
    split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid,
    reverse(split(reverse(floc),'/')[1]) as mm,
    floc
FROM
    resume.qnap_resume_file_location;

QUERY ON VIEW
-------------
select * from qnap_resume_resid_file_loc_map_vw limit 10;

EXPECTED RESULTS
14012_6_1006686583_0_1102955123 201205 /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209 201205 /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594 201205 /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525 201205 /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973 201205 /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546 201205 /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682 201205 /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042 201205 /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450 201205 /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740 201205 /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx
ACTUAL RESULTS (see the first column: the DOT and extension are still there)
14012_6_1006686583_0_1102955123.docx 201205 /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209.doc 201205 /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594.doc 201205 /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525.doc 201205 /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973.doc 201205 /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546.txt 201205 /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682.docx 201205 /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042.docx 201205 /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450.doc 201205 /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740.docx 201205 /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx
EXECUTING THE SELECT HIVE QUERY INSIDE THE VIEW INDEPENDENTLY (gives u correct results - but when placed inside a view it does not seem to work)
SELECT
    split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid,
    reverse(split(reverse(floc),'/')[1]) as mm,
    floc
FROM
    resume.qnap_resume_file_location;

14012_6_1006686583_0_1102955123 201205 /mnt/myfirm/201205/14012_6_1006686583_0_1102955123.docx
14009_5_1008440384_0_1102887209 201205 /mnt/myfirm/201205/14009_5_1008440384_0_1102887209.doc
14013_7_700200576_8_1102752594 201205 /mnt/myfirm/201205/14013_7_700200576_8_1102752594.doc
14015_8_1008470815_0_1102954525 201205 /mnt/myfirm/201205/14015_8_1008470815_0_1102954525.doc
14009_5_1008373102_0_1102709973 201205 /mnt/myfirm/201205/14009_5_1008373102_0_1102709973.doc
14011_1_1007981566_0_1102730546 201205 /mnt/myfirm/201205/14011_1_1007981566_0_1102730546.txt
14015_8_1008436709_0_1102867682 201205 /mnt/myfirm/201205/14015_8_1008436709_0_1102867682.docx
14001_7_1006347358_0_1007662042 201205 /mnt/myfirm/201205/14001_7_1006347358_0_1007662042.docx
14009_5_1008382590_0_1102732450 201205 /mnt/myfirm/201205/14009_5_1008382590_0_1102732450.doc
14000_6_1008189721_0_1102253740 201205 /mnt/myfirm/201205/14000_6_1008189721_0_1102253740.docx

Thoughts ? Ideas ?
thanks
sanjay


Re: Using SPLIT with DOT (.) delimiter demonstrates funny behavior within a VIEW

2015-08-24 Thread Sanjay Subramanian
Hi Vivek
Thanks. A few more experiments:
The following escape patterns all work correctly:
split(reverse(split(reverse(floc),'/')[0]),'[\\.]')[0]   (needs the square brackets)
split(reverse(split(reverse(floc),'/')[0]),'\\\.')[0]    (does not need square brackets)
split(reverse(split(reverse(floc),'/')[0]),'[.]')[0]     (needs the square brackets)
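
A minimal sketch of why the escaping matters (the filename literal is just an example value): Hive's split() takes a Java regular expression, so a bare dot is a wildcard that matches every character, while '\\.' and '[.]' match only the literal dot:

SELECT split('14012_6_1006686583_0_1102955123.docx', '.')[0];    -- bare dot: every character is a delimiter, no usable token comes back
SELECT split('14012_6_1006686583_0_1102955123.docx', '\\.')[0];  -- escaped dot: returns 14012_6_1006686583_0_1102955123
SELECT split('14012_6_1006686583_0_1102955123.docx', '[.]')[0];  -- character class: same result, and per the experiments above it survives the view's re-parsing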
    From: Vivek Veeramani vivek.veeraman...@gmail.com
 To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com 
 Sent: Monday, August 24, 2015 1:57 PM
 Subject: Re: Using SPLIT with DOT(.) delimiter demonstrate funny behavior 
within a VIEW
   
Hi Sanjay,
Try replacing the 4 backslashes with just 2. Usually works with 2 backslashes.
Replace this: split(reverse(split(reverse(floc),'/')[0]),'.')[0] as resid
with:         split(reverse(split(reverse(floc),'/')[0]),'\\.')[0] as resid

Please have a look and let us know if that helps.


Best ,
Vivek Veeramani

cell : + 1-415 996 7853



On Mon, Aug 24, 2015 at 12:15 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:


Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
hey guys
On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x; there are about 300+ Hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL to point to the Hive metadata and be able to define JOINs etc. using a programming structure like this:

import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val schemaRdd = sqlContext.sql("some complex SQL")

Is that the way to go ? Some guidance will be great.
thanks
sanjay




Using Hive as a file comparison and grep-ping tool

2015-04-20 Thread Sanjay Subramanian
hey guys
As data wranglers and programmers we often need quick tools. One such tool I need almost every day is one that greps a file based on the contents of another file. One can write this in Perl or Python, but since I am already using the Hadoop ecosystem extensively, I said why not do this in Hive?
Perhaps you guys already know this and have better solutions... nevertheless :-) here goes...

Best regards
sanjay(Hive super-fan)
I just posted this on my blog:
https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic
Using Hive as a file comparison and grep-ping tool
==================================================
1. Logon to your linux terminal where u run Hive queries from

2. Create a database called myutils in Hive
   Create two hive tables myutils.file1 and myutils.file2 in Hive
   - each of these tables will have a partition called fn (fn is short for filename)
   - each of these tables will have just one column called ln (ln is short for line)
   An easy script to help do that would be as follows:
   for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';" ; done
3. Create a permanent base location folder in HDFS
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1 :
===========
Search if a bunch of IP addresses exists in another file containing a (larger) bunch of IPs

[1] registeredIPs.txt
    10.456.34.90
    123.675.654.1
    21.87.657.456
    234.109.34.234

    visitorIPs.txt
    10.456.34.90
    12.367.54.23
    218.7.657.456
    23.4.109.3

[2] Output which IPs in File1 are present in File2

[3] Put each file in a separate HDFS location
    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt
    hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt
[4] Add partitions to myutils.file1 and myutils.file2
    For simplicity keep the partition names identical to the file names themselves
    hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"

[5] Check that the partitions can be accessed by Hive
    # This should give u the same answer as
    # wc -l registeredIPs.txt
    hive -e "select count(*) from myutils.file1 where fn='registeredIPs.txt'"
    # This should give u the same answer as
    # wc -l visitorIPs.txt
    hive -e "select count(*) from myutils.file2 where fn='visitorIPs.txt'"
[6] Count the number of IPs in registeredIPs.txt that are in visitorIPs.txt

# This dumps to the local file system
hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt

# This dumps to a new internally-managed-by-hive table
# Make sure u don't already have some valuable hive table called myutils.registered_in_visitors_list - else this will overwrite that hive table with the results of this hive query
hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"

# This dumps to a directory on HDFS
# Make sure u don't already have some valuable directory called registered_in_visitors_list - else this will overwrite that directory and all its contents with the results of this hive query
hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"
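
A related variant, a hedged sketch using the same two tables: a LEFT OUTER JOIN with an IS NULL filter lists the registered IPs that do NOT appear in the visitor list (the inverse of the query above):

# Registered IPs absent from the visitor list
hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 LEFT OUTER JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln) WHERE f2.ln IS NULL" > ./registered_not_in_visitors_list.txt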


 
  

Re: [ANN] Hivemall v0.3 is now available

2015-02-08 Thread Sanjay Subramanian
awesome, thank u. really value your ML contributions.
regards
sanjay
  From: Makoto Yui yuin...@gmail.com
 To: user@hive.apache.org 
 Sent: Friday, February 6, 2015 3:31 AM
 Subject: [ANN] Hivemall v0.3 is now available
   
Hello all,

We are excited to announce that a new stable version of Hivemall
(v0.3.0) is now available.

    https://github.com/myui/hivemall/releases/tag/v0.3.0

Hivemall provides a collection of machine learning algorithms as Hive
UDFs/UDAFs/UDTFs.

The main enhancement in v0.3.0 is the support for matrix factorization.

Hope you enjoy the release! Feedback and pull requests are welcome.

Thanks,
Makoto


  

Hive JSON Serde question

2015-01-25 Thread Sanjay Subramanian
hey guys 

This is the Hive table definition I have created based on the JSON. I am using this version of the Hive JSON serde: https://github.com/rcongiu/Hive-JSON-Serde

ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
DROP TABLE IF EXISTS datafeed_json;
CREATE EXTERNAL TABLE IF NOT EXISTS datafeed_json (
    object STRING,
    entry array<struct<
        id:STRING,
        time:BIGINT,
        changes:array<struct<
            field:STRING,
            value:struct<
                item:STRING,
                verb:STRING,
                parent_id:STRING,
                sender_id:BIGINT,
                created_time:BIGINT>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/sanjay/datafeed';

QUERY 1
=======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT
    object,
    entry[0].id,
    entry[0].time,
    entry[0].changes[0].field,
    entry[0].changes[0].value.item,
    entry[0].changes[0].value.verb,
    entry[0].changes[0].value.parent_id,
    entry[0].changes[0].value.sender_id,
    entry[0].changes[0].value.created_time
FROM
    datafeed_json;

RESULT 1
========
foo123  113621765320467 1418608223 leads song1 rock 113621765320467_1107142375968396 14748082019 1418608223

QUERY 2
=======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT
    object,
    entry.id,
    entry.time,
    ntry
FROM
    datafeed_json
LATERAL VIEW EXPLODE (datafeed_json.entry.changes) oc1 AS ntry;

RESULT 2
========
This gives 4 rows, but I was not able to iteratively do the LATERAL VIEW EXPLODE.

I tried various combinations of LATERAL VIEW, LATERAL VIEW EXPLODE and json_tuple to extract all fields in an exploded view from the JSON in tab-separated format, but no luck.
Any thoughts ?


Thanks
sanjay  

  

Re: Hive JSON Serde question

2015-01-25 Thread Sanjay Subramanian
Thanks Ed. Let me try a few more iterations. Somehow I am not doing this correctly :-)
regards
sanjay

From: Edward Capriolo edlinuxg...@gmail.com
 To: user@hive.apache.org user@hive.apache.org; Sanjay Subramanian 
sanjaysubraman...@yahoo.com 
 Sent: Sunday, January 25, 2015 8:11 AM
 Subject: Re: Hive JSON Serde question
   
Nested lists require nested lateral views.
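
A minimal sketch of what that looks like against the table above (the aliases d, t1, t2, e and c are mine, not from the thread):

SELECT
    d.object,
    e.id,
    e.time,
    c.field,
    c.value.item,
    c.value.verb,
    c.value.parent_id,
    c.value.sender_id,
    c.value.created_time
FROM datafeed_json d
LATERAL VIEW explode(d.entry) t1 AS e      -- first explode the outer array of entries
LATERAL VIEW explode(e.changes) t2 AS c;   -- then explode each entry's inner changes array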



On Sun, Jan 25, 2015 at 11:02 AM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:




  

Re: Hive JSON Serde question

2015-01-25 Thread Sanjay Subramanian
sure, will try get_json_object. thank u
regards
sanjay
  From: 丁桂涛(桂花) dinggui...@baixing.com
 To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com 
 Sent: Sunday, January 25, 2015 4:45 PM
 Subject: Re: Hive JSON Serde question
   
Try the get_json_object UDF. No iterations needed. :)
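
For reference (my sketch, not from the thread): get_json_object works on a raw JSON string rather than on the SerDe's struct columns, so this assumes the JSON line is exposed as a single STRING column, e.g. via a hypothetical one-column text table datafeed_raw(line STRING):

SELECT
    get_json_object(line, '$.object'),
    get_json_object(line, '$.entry[0].id'),
    get_json_object(line, '$.entry[0].changes[0].value.item')
FROM datafeed_raw;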





  

Writing Hive Query Output to local system

2015-01-16 Thread Sanjay Subramanian
hey guys
I recall this did not happen in the days of the 0.9.x version.
But I use 0.13.x now, and when I run a hive query
hive -e "select * from tablename" > ./myfile.txt
the first line in myfile.txt is as follows:
2015-01-16 10:48:13,091 WARN  [main] conf.HiveConf (HiveConf.java:initialize(1491)) - DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
How do I avoid this?
thanks
sanjay
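
(One workaround worth trying, an assumption on my part rather than something from this thread: the CLI's silent mode suppresses log noise on stdout, so the redirect captures only query output.)

hive -S -e "select * from tablename" > ./myfile.txt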

Re: Writing Hive Query Output to local system

2015-01-16 Thread Sanjay Subramanian
I am using Cloudera Manager and I know how to add a param thru Safety Valve. I don't know how to ignore a param thru Safety Valve :-)
Actually I will put the question on the CDH groups as well...
Any thoughts guys ?
thanks

sanjay
  From: grimaldi.vince...@gmail.com grimaldi.vince...@gmail.com
 To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com 
 Sent: Friday, January 16, 2015 11:27 AM
 Subject: Re: Writing Hive Query Output to local system
   
In your hive-site.xml remove the property block corresponding to the parameter

hive.metastore.local

That's it. It is a deprecated parameter and is not necessary anymore.

Regards,



  
Vincenzo Grimaldi



  

Re: How to convert RDBMS DDL to Hive DDL ?

2015-01-08 Thread Sanjay Subramanian
@Krish
Are u looking for an automated tool that takes RDBMS DDL as input and outputs Hive DDL? I exported the DDLs of all tables with col sequence numbers and wrote code that converted all the DB2 tables we have to Hive.
Not sure if there is a standard tool.
regards
sanjay
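
To illustrate the kind of mapping that conversion code did (a hedged sketch with a made-up table, not the actual converter): a DB2 definition such as CREATE TABLE sales (id INTEGER, amt DECIMAL(10,2), ts TIMESTAMP) translates roughly to

-- Hive-side equivalent of the hypothetical DB2 table above
CREATE TABLE sales (
    id  INT,
    amt DECIMAL(10,2),
    ts  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';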
  From: Lefty Leverenz leftylever...@gmail.com
 To: user@hive.apache.org 
 Sent: Thursday, January 8, 2015 1:34 PM
 Subject: Re: How to convert RDBMS DDL to Hive DDL ?
   
Have you looked at the documentation?  If not, start here:
   - Creating, Showing, Altering, and Dropping Tables
   - Create Table

-- Lefty


On Thu, Jan 8, 2015 at 12:28 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi,
I am new to hive and I have imported data into HDFS using DB2. Now I want to create a hive table. I am not sure how to convert RDBMS DDL to Hive DDL. Can somebody please guide?
Thanks
Krish



  

Using IF in the JOIN clause

2015-01-08 Thread Sanjay Subramanian
hey guys
This is a portion of a long query we wrote. Can u advise if the bold portion will work?

    and if(f.fr_name is not null, f.fr_name, e.fr_name)=d.fr_name
    and if(f.pos_bin is not null, f.pos_bin, e.pos_bin)=d.pos_bin

thanks
regards
sanjay

PART OF A LARGER QUERY
======================
from (
    select z.* from (
        select city, fr_name, (floor((none_position_scaled-1)/10)+1) * 10 as pos_bin,
            sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
            sum(job_count) as job_sum,
            sum(pay_count) as pay_sum,
            if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
            count(*) as counter
        from oletas02.scorek_x_done
        group by city, fr_name, (floor((none_position_scaled-1)/10)+1) * 10
        order by city, fr_name, pos_bin
    ) z
    where pos_bin <= 200 and NOT pos_bin is null) f
full outer join (
    select z.* from (
        select city, fr_name, (floor((none_position-1)/10)+1) * 10 as pos_bin,
            sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
            sum(job_count) as job_sum,
            sum(pay_count) as pay_sum,
            if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
            count(*) as counter
        from oletas02.scorek_x_done
        group by city, fr_name, (floor((none_position-1)/10)+1) * 10
        order by city, fr_name, pos_bin
    ) z
    where pos_bin <= 200 and NOT pos_bin is null) e
on f.city=e.city and f.fr_name=e.fr_name and f.pos_bin=e.pos_bin
full outer join (
    select z.* from (
        select city, fr_name, (floor((either_position_scaled-1)/10)+1) * 10 as pos_bin,
            sum(if(job_count>0,60,if(int_flag=1,6,if(mjp_flag=1,1,0)))) as one_six_sixty_sum,
            sum(job_count) as job_sum,
            sum(pay_count) as pay_sum,
            if(sum(hours_count) is null, 0, sum(hours_count)) as hours_sum,
            count(*) as counter
        from oletas02.scorek_x_done
        group by city, fr_name, (floor((either_position_scaled-1)/10)+1) * 10
        order by city, fr_name, pos_bin
    ) z
    where pos_bin <= 200 and NOT pos_bin is null) d
on if(f.city is not null, f.city, e.city)=d.city
    and if(f.fr_name is not null, f.fr_name, e.fr_name)=d.fr_name
    and if(f.pos_bin is not null, f.pos_bin, e.pos_bin)=d.pos_bin
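
(For what it's worth: IF() in a join predicate is an ordinary expression, so the bold portion should evaluate. COALESCE is an equivalent, more compact spelling of the same fallback -- my suggestion, not from the thread:)

    and COALESCE(f.fr_name, e.fr_name)=d.fr_name
    and COALESCE(f.pos_bin, e.pos_bin)=d.pos_bin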


Re: It's extremely slow when hive reads compression files

2014-10-22 Thread Sanjay Subramanian
It could be the serde that is slow and not the compression? If your input XML is in multiline records then u may wanna write a bit of RecordReader code to process the multiline XML yourself, just to see if it makes any change to the processing speed:
https://github.com/sanjaysubramanian/big_data_latte/tree/master/src/main/java/org/medicalsidefx/bdo/logparsers/multiline

  From: Yan Fang yanfang...@gmail.com
 To: user@hive.apache.org 
 Sent: Wednesday, October 22, 2014 8:06 AM
 Subject: It's extremely slow when hive reads compression files
   
Hi guys,
Not sure if you have run into this problem. We have a 6-node cluster with CDH5. It takes about 8 hours to process 80MB of compressed files (in .deflate format), while it is much faster (less than 1 hour) to process the uncompressed files. I think there must be something wrong with my settings. Any help? Thank you.
We are using an XMLInputFormat as InputFormat and a customized SerDe to read 
the XML records.
Cheers,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108

   

Re: Optimize hive external tables with serde

2014-10-22 Thread Sanjay Subramanian
The non-gzipped version will be faster. One of the usecases for using a GZIP file is when u have input records that are in multiline formats, for example:
In the following case, u have to send all contents between the begin and end sections to the mapper. If u do NOT gzip this file (containing multiple records) then HDFS will split it into 128MB blocks (whatever u have set the HDFS block size to be) and it may happen that part of the contents between the begin and end sections gets split, and u will get errors processing the records.

[begin_20100922_12345678_serverid_foofla01]
blah bhah blah bhah blah bhah blah bhah blah bhah blah bhah
[end_20100922_12345678_serverid_foofla01]

So if your record is not multiline then don't use GZIP. I have used LZO, which is fast, but getting LZO to work correctly needs a little bit of work on your cluster. It's not difficult, just a few steps to do. U can refer to my documentation on the Hive site:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

I rewrote your query. (The group by is pushed to the outer query, and I created an initial subset jt1 to use later in the query.) I did not understand two things in your query:
The join clause   ---   ON (1=1) ?
The where clause  ---   1=1 ?
a) Code heatmap - how many users used some code

WITH jt1 AS (
    SELECT *
    FROM json_table
    WHERE 1=1
    AND code in (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
)
SELECT
    A.code,
    A.NumberOfDistinctCustomers,
    B.NumberOfAllDistinctCustomers,
    A.NumberOfEntries,
    B.NumberOfAllEntries
FROM
    (SELECT
        jt1.code as code,
        COUNT(DISTINCT jt1.customerId) AS NumberOfDistinctCustomers,
        COUNT(*) AS NumberOfEntries
    FROM jt1
    ) A
JOIN
    (SELECT
        COUNT(DISTINCT jt1.customerId) AS NumberOfAllDistinctCustomers,
        COUNT(*) AS NumberOfAllEntries
    FROM jt1
    ) B
ON (1=1)
GROUP BY A.code;
 
b) Select users which use some code, but have some special attributes in a second table:
NOTE: You are querying this with no partition filter, so if json_table has 500-1000+ partitions then this query might be really slow or may not run. One way you can solve this is to write a bash script that loops through a set of partitions, store the results in an intermediate Hive managed table, and then have another query aggregate that.

SELECT
    jt.customerId
FROM
    (SELECT *
    FROM json_table
    WHERE 1=1
    AND code in (1,3,54,300,222,111)
    ) jt
JOIN
    (SELECT *
    FROM attribute_table
    WHERE attribute_X1='1'
    AND attribute_X2='1'
    ) at
ON jt.customerId = at.customerId





  From: ptrst ptrstp...@gmail.com
 To: user@hive.apache.org; Sanjay Subramanian sanjaysubraman...@yahoo.com 
 Sent: Wednesday, October 22, 2014 1:02 AM
 Subject: Re: Optimize hive external tables with serde
   
ad 1) My files are not bigger than the block size. To be precise, all data from one day is up to 2GB gzipped; unzipped it is ~15GB. It is split in one folder into files smaller than the block size (block size in my case is 128MB, files are ~100MB). I can transform them to another format if you think it will speed up my queries, but as I understand the documentation, such a structure should behave the same as split-able files. If I add one more partition dimension with code, so that I have partitions by date and code, a single file is 3-10MB, because I have 3000 unique codes.

ad 2) I read your code and I use the same serde, only a newer one. Should I check something more?

Ad 4) Most queries will be about the last 3-6 months
Ad 3) My example queries are:

a) Code heatmap - how many users used some code

SELECT
    A.code,
    A.NumberOfDistinctCustomers,
    B.NumberOfAllDistinctCustomers,
    A.NumberOfEntries,
    B.NumberOfAllEntries
FROM (
    SELECT
        code,
        COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
        COUNT(*) AS NumberOfEntries
    FROM json_table
    WHERE 1=1
    AND code in (1,3,54,300,222,111)
    AND partitionDate > '20140922' AND partitionDate <= '20141022'
    GROUP BY code
) A
JOIN (
    SELECT
        COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
        COUNT(*) AS NumberOfAllEntries
    FROM json_table
    WHERE 1=1
    AND code in (1,3,54,300,222,111)
    AND partitionDate > '20140922' AND partitionDate <= '20141022'
) B
ON (1=1);
b) Select users which use some code, but have some special attributes in a second table:

SELECT json_table.customerId
FROM json_table
JOIN attribute_table ON (json_table.customerId=attribute_table.customerId)
AND attribute_table.attribute_X1='1'
AND attribute_table.attribute_X2='1'
AND json_table.code in (1,3,54,300,222,111);
c) Mix of above.


On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:


Re: Optimize hive external tables with serde

2014-10-21 Thread Sanjay Subramanian
1. The gzip files are not splittable, so gzip itself will make the queries slower.
2. As a reference for JSON serdes, here is an example from my blog: http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/
3. Need to see your query first to try and optimize it.
4. Even if you have datewise partitions and u have 5 years of data, i.e. about 1825 partitions, trying to do a select count(*) without a where clause might make hive crawl.

  From: Ja Sam ptrstp...@gmail.com
 To: user@hive.apache.org 
 Sent: Tuesday, October 21, 2014 10:37 AM
 Subject: Optimize hive external tables with serde
   
Part 1: my environment
I have the following files uploaded to Hadoop:
   - They are plain text
   - Each line contains JSON like:
     {code:[int], customerId:[string], data:{[something more here]}}
   - code are numbers from 1 to 3000
   - customerId totals up to 4 million, daily up to 0.5 million
   - All files are gzip
   - In hive I created an external table with a custom JSON serde (let's call it CUSTOMER_DATA)
   - All files from each date are stored in a separate directory - and I use it as partitions in Hive tables

Most queries which I do are filtering by date, code and customerId. I also have a second file with the format (let's call it CUSTOMER_ATTRIBUTES):
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so rows are up to 4 million. I query and filter my data in the following way:
   - Filtering by date - partitions do the job here using WHERE partitionDate IN (20141020,20141020)
   - Filtering by code using a statement like, for example, WHERE code IN (1,4,5,33,6784)
   - Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with a condition query like SELECT customerId FROM CUSTOMER_DATA JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]

Part 2: question
Is there any efficient way I can optimize my queries? I read about indexes and buckets but I don't know if I can use them with external tables and whether they will optimize my queries.

  

Re: select * from table and select column from table in hive

2014-10-21 Thread Sanjay Subramanian
One way to debug is to put bash in action.
Say you have a data file in hdfs (/data/rockers/rockers.csv) that looks like

cust_num,cust_name,instrument
1,paul,bass
2,john,rhythm
3,ringo,drums
4,george,lead

To get the column cust_num of data (in this case it is column 1):

hdfs dfs -cat /data/rockers/rockers.csv | cut -d, -f 1 > /path/to/local/file.txt

If the file is tab separated

cust_num     cust_name     instrument
1     paul     bass
2     john     rhythm
3     ringo     drums
4     george     lead

hdfs dfs -cat /data/rockers/rockers.csv | tr '\t' ',' | cut -d, -f 1 > /path/to/local/file.txt

  
  From: Nitin Pawar nitinpawar...@gmail.com
 To: Raj Hadoop hadoop...@yahoo.com; user@hive.apache.org 
 Sent: Monday, October 20, 2014 12:11 PM
 Subject: Re: select * from table and select column from table in hive
   
What's your table create DDL? Is the data in csv-like format?

On 21 Oct 2014 00:26, Raj Hadoop hadoop...@yahoo.com wrote:

I am able to see the data in the table for all the columns when I issue the 
following -

SELECT * FROM t1 WHERE dt1='2013-11-20'


But I am unable to see the column data when i issue the following -

SELECT cust_num FROM t1 WHERE dt1='2013-11-20'

The above shows null values.

How should I debug this ?



  

Re: Migration of metastore tables from mysql to oracle.

2014-10-21 Thread Sanjay Subramanian
First question: why are u migrating to Oracle? Since u never store data in the Hive Metastore, MYSQL is a great choice.

I have done a MYSQL to MYSQL transfer. From the source DB mysql dump, it should be possible to mod any Oracle-required syntax, right?


  From: hadoop hive hadooph...@gmail.com
 To: user@hive.apache.org 
 Sent: Tuesday, October 21, 2014 2:36 AM
 Subject: Migration of metastore tables from mysql to oracle.
   
Hi folks,
I have a requirement where I need to migrate metastore tables from mysql to oracle. Is there anyone who has already performed this kind of activity? Any help will be appreciated.
Thanks
Vikas Srivastava

  

Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-10-07 Thread Sanjay Subramanian
hi
I faced a similar situation in my dev cluster (CDH distribution 5.1.3).
See the thread details with log files:
https://groups.google.com/a/cloudera.org/forum/#!mydiscussions/scm-users/MpcpHj5mWT8


thanks
sanjay  From: John Omernik j...@omernik.com
 To: user@hive.apache.org 
 Sent: Tuesday, September 9, 2014 12:10 PM
 Subject: Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]
   
Well, here is me talking to myself, but in case someone else runs across this: I changed the hive metastore connect timeout to 600 seconds (per the JIRA below for Hive 0.14) and now my problem has gone away. It looks like the timeout was causing some craziness.
https://issues.apache.org/jira/browse/HIVE-7140





On Tue, Sep 9, 2014 at 1:00 PM, John Omernik j...@omernik.com wrote:

I ran with debug logging, and this is interesting: there was a loss of connection to the metastore client RIGHT before the partition mention above... as data was looking to be moved around... I wonder if the timing on that is bad?

14/09/09 12:47:37 [main]: INFO exec.MoveTask: Partition is: {day=null, source=null}
14/09/09 12:47:38 [main]: INFO metadata.Hive: Renaming src: maprfs:/user/hive/scratch/hive-mapr/hive_2014-09-09_12-38-30_860_3555291990145206535-1/-ext-1/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0; dest: maprfs:/user/hive/warehouse/intel_flow.db/pcaps/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0; Status:true
14/09/09 12:48:02 [main]: WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)


On Tue, Sep 9, 2014 at 11:02 AM, John Omernik j...@omernik.com wrote:

I am doing a dynamic partition load in Hive 0.13 using ORC files. This has always worked in the past, both with MapReduce V1 and YARN. I am working with Mesos now, and trying to troubleshoot this weird error:

Failed with exception AlreadyExistsException(message:Partition already exists

What's odd is that my insert is an insert (without Overwrite), so it's like two different reducers have data to go into the same partition, but then there is a collision of some sort? Perhaps there is a situation where the partition doesn't exist prior to the run, but when two reducers have data, they both think they should be the one to create the partition? If a partition already exists, shouldn't the reducer just copy its file into the partition? I am struggling to see why this would be an issue with Mesos, but not on Yarn or MRv1.
Any thoughts would be welcome. 
John





  

Re: problem with delimiters (control A)

2014-05-29 Thread Sanjay Subramanian
Hi Jack 

Since u already have your data with columns separated by Ctrl-A, u need to define the HIVE table as follows
(by default Hive will assume Ctrl-A as the column delimiter):

create table if not exists my_test(
    userid  BIGINT,
    movieId BIGINT,
    comment STRING
);


regards

sanjay


From: Petter von Dolwitz (Hem) petter.von.dolw...@gmail.com
To: user@hive.apache.org user@hive.apache.org 
Sent: Thursday, May 22, 2014 8:00 AM
Subject: Re: problem with delimiters (control A)
 


Hi Jack,

could it be that your source file does not use \001 as the delimiter but the two-character sequence ^A? I get this feeling when looking at your cat printout and also when you say you can do a search and replace. If I do a printout of a file with the same delimiter I get the following:

2014-04-097846478510
2014-04-107851178558

These are actually three columns but you cannot see the ^A because it is not a 
printable character. Check your file in a hex editor or something to verify 
that you have the correct delimiter.

Br,
Petter













2014-05-18 6:40 GMT+02:00 Jack Yang j...@uow.edu.au:

Hi All, 
I have a local file called mytest.txt (restored in hdfs already). The content 
is like this:
$ cat -A  HDFSLOAD_DIR/mytest.txt
49139801^A25752451^Aunknown$
49139801^A24751754^Aunknown$
49139801^A2161696^Anice$
 
To load this raw data above, I then defined the table like this in HQL:
 
create table my_test(
userid  BIGINT,
movieId BIGINT,
comment STRING
) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
 
My problem is that when I “SELECT * FROM my_test;” , I got this:
 
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
 
I then replaced "^A" with "^" in mytest.txt, and also re-defined my table structure by using:
FIELDS TERMINATED BY '^'
 
So when I select all, I got the correct results.
 
Any thoughts??? Thanks in advance.
 
 
Best regards,
Jack 
 
 

Re: Finding Max of a column without using any Aggregation functions

2014-04-23 Thread Sanjay Subramanian
Thanks 
For the sake of this question I wanted to avoid all order by and limit syntax. It's more of a challenge question.
Regards
Sanjay

Sent from my iPhone

 On Apr 23, 2014, at 2:51 AM, Furcy Pin furcy@flaminem.com wrote:
 
 Hi,
 
 note that if your table contains the max value several time, all the 
 occurences will be returned. Note also that if it contains a null it will be 
 returned too.
 
 How about this? 
 
  SELECT score FROM student ORDER BY score DESC LIMIT 1 ;
 
 
 Note that on this query Impala is incoherent with Hive or mySQL, since Impala 
 considers 
 that NULL is greater than everything while Hive and mySQL consider NULL to be 
 smaller than everything
 
 And just for fun, you can also try in Hive:
 
 FROM 
 (
   SELECT score FROM student
   DISTRIBUTE BY '1'
   SORT BY score DESC
 ) M
 SELECT score
 LIMIT 1
 ;
 
 FROM 
 (
   SELECT score FROM student
   DISTRIBUTE BY '1'
 ) M
 REDUCE score USING 'sort -rn | head -1' AS score 
 ;
 
 The second is just to demonstrate the possibilities of custom reduce, but is 
 greatly inefficient (in speed and memory).
 And be sure to use SET mapred.reduce.tasks=1 ; before if you don't want idle 
 reduce tasks...
 
 (I guess I have some spare time too!)
 
 
 
 2014-04-23 3:27 GMT+02:00 Subramanian, Sanjay (HQP) 
 sanjay.subraman...@roberthalf.com:
 Hey guys
 
 TABLE=STUDENT
 COLUMN=SCORE
 
 U want to find the max value in the column without using any aggregation 
 functions.
 
 Its easy in a RDB context but I was trying to get a solution in Hive 
 (clearly I have some spare time on my hands - LOL) 
 
 select 
  nfr.score
 from
  student nfr
 left outer join
  (select 
   a.score as fra,
   b.score as frb
  from 
   (select
'1' as dummy,
score
   from
student
   ) a

  join
   (select
'1' as dummy,
score
   from
student
   ) b
  ON
   a.dummy = b.dummy
  where   
   a.score < b.score
  ) frab
 on
  frab.fra=nfr.score
 where 
  frab.fra is null
 
 Thanks
 Warm Regards
 
 Sanjay
 


Re: create table question

2014-04-22 Thread Sanjay Subramanian
For example, if ur name node was hadoop_name_nodeIP:8020
(verify this thru your browser: http://hadoop_name_nodeIP:50070)

Modified Create Table
=====================
CREATE EXTERNAL TABLE states(abbreviation string, full_name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://hp8300one:8020/tmp/states' ;




 From: Shengjun Xin s...@gopivotal.com
To: user@hive.apache.org 
Sent: Tuesday, April 22, 2014 8:58 PM
Subject: Re: create table question
 


in the ql, you set relative path tmp/states, according to the error message, 
you need to set absolute path




On Wed, Apr 23, 2014 at 11:23 AM, EdwardKing zhan...@neusoft.com wrote:

 
I use hadoop 2.2.0 and hive 0.13.0. I want to create a table from an existing file. states.hql is as follows:

CREATE EXTERNAL TABLE states(abbreviation string, full_name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 'tmp/states' ;
 
 
[hadoop@master ~]$ hadoop fs -ls
14/04/22 20:17:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2014-04-22 20:02 tmp

[hadoop@master ~]$ hadoop fs -put states.txt tmp/states
[hadoop@master ~]$ hadoop fs -ls tmp/states
14/04/22 20:17:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   2 hadoop supergroup        654 2014-04-22 20:02 tmp/states/states.txt
 
 
Then I execute states.hql:
[hadoop@master ~]$ hive -f states.hql
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/04/22 20:11:47 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/home/software/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://master:9000./tmp/states)
 
 
It raises the following error. Why? How do I correct it?

2014-04-22 20:12:03,907 INFO  [main]: exec.DDLTask (DDLTask.java:createTable(4074)) - Default to LazySimpleSerDe for table states
2014-04-22 20:12:05,147 INFO  [main]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(624)) - 0: create_table: Table(tableName:states, dbName:default, owner:hadoop, createTime:1398222724, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:abbreviation, type:string, comment:null), FieldSchema(name:full_name, type:string, comment:null)], location:tmp/states, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim= }), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
2014-04-22 20:12:05,147 INFO  [main]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(306)) - ugi=hadoop ip=unknown-ip-addr cmd=create_table: Table(tableName:states, dbName:default, owner:hadoop, createTime:1398222724, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:abbreviation, type:string, comment:null), FieldSchema(name:full_name, type:string, comment:null)], location:tmp/states, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,

Re: Query hangs at 99.97 % for one reducer in Hive

2014-03-02 Thread Sanjay Subramanian
Even 500 reducers sounds like a high number, but I don't know the details of your cluster. Can u provide some details:
How many nodes in the cluster
Hive version
Which distribution (Hortonworks, Apache, CDH, Amazon)
Node specs
Partitions in the table
Number of records
Thanks
Sanjay

Sent from my iPhone

 On Mar 2, 2014, at 3:09 PM, Siddharth Tiwari siddharth.tiw...@live.com 
 wrote:
 
 Hi team,
 
 following query hangs at 99.97% for one reducer, kindly help or point to what 
 can be cause
 
 drop table if exists sample.dpi_short_lt;
 create table sample.dpi_short_lt as
 select   b.msisdn,
 a.area_erb,
 a.longitude,
 a.latitude,
substring(b.msisdn,1,2) as country,
substring(b.msisdn,3,2) as area_code,
substring(b.start_time,1,4) as year,
substring(b.start_time,6,2) as month,
substring(b.start_time,9,2) as day,
substring(b.start_time,12,2) as hour,
cast(b.procedure_duration as double) as 
 duracao_ms,
cast(b.internet_latency as double) as 
 int_internet_latency,
cast(b.ran_latency as double) as 
 int_ran_latency,
cast(b.http_latency as double) as 
 int_http_latency,
(case when b.internet_latency='' then 1 else 0 
 end) as internet_latency_missing,
(case when b.ran_latency='' then 1 else 0 end) 
 as ran_latency_missing,
(case when b.http_latency='' then 1 else 0 
 end) as http_latency_missing,
(cast(b.mean_throughput_ul as int) * cast( 
 procedure_duration as int) / 1000) as total_up_bytes,
(cast(b.mean_throughput_dl as int) * 
 cast(procedure_duration as int)  / 1000) as total_dl_bytes,
cast(b.missing_packets_ul as int) as 
 int_missing_packets_ul,
cast(b.missing_packets_dl as int) as 
 int_missing_packets_dl
 from sample.dpi_large b
 left outer join sample.science_new a
 on b.cgi = regexp_replace(a.codigo_cgi_ecgi,'-','')
 where msisdn!='';
 
 Hive was heuristically selecting 1000 reducers and it was hanging at 99.97 
 percent on one reduce task. I then changed the above values to 3GB per 
 reducer and 500 reducers and started hitting this error.
 
 java.lang.RuntimeException: Hive Runtime Error while closing operators: 
 Unable to rename output from: 
 hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812_8390586541316719852-1/_task_tmp.-ext-10001/_tmp.03_0
  to: 
 hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812_8390586541316719852-1/_tmp.-ext-10001/03_0
   at 
 org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:313)
   at 
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:516)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
   at org.apache.hadoop.mapred.Child.main(Child.java:262)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename 
 output from: hdfs://tlvcluster/tmp/hive-hadoop/hive_2014-03-01_03-14-36_812
 
 
 I have a 22-node cluster running CDH 4.3. Please try to locate what the issue can be.
 
 
 **
 Cheers !!!
 Siddharth Tiwari
 Have a refreshing day !!!
 Every duty is holy, and devotion to duty is the highest form of worship of 
 God.” 
 Maybe other people will try to limit me but I don't limit myself


Amazon EMR error

2014-03-01 Thread Sanjay Subramanian
Sorry guys, not sure if I should request help with this error here, because it's an error on Amazon EMR Hive.
But you guys have been my Hive fraternity for about 2 years now and I thought it best to turn to u for help first.

Amazon Hive version 
===
AMI version:2.4.2
Hadoop distribution:Amazon 1.0.3
Applications:Hive 0.11.0.1

HIVE TABLE
===
DROP TABLE  solar_installs;
CREATE EXTERNAL TABLE  IF NOT EXISTS solar_installs(
   zipcode STRING,
   state STRING,
   sizekw DOUBLE,
   cost   DOUBLE,
   date_  STRING,
   lat     DOUBLE,
   lon     DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
ALTER TABLE solar_installs SET LOCATION 's3://solarinstallations/data'
;

DATA SET SAMPLE 
==
19801,DE,230.6,,12/30/2014,39.7402405,-75.5521456
80005,CO,7.1,12000.0,01/28/2014,39.851157,-105.132533
19968,DE,5.2,,01/27/2014,38.7776124,-75.3099136
52314,IA,30.0,116900.0,01/17/2014,41.901451,-91.48409
33414,FL,9.28,34000.0,01/16/2014,26.667321,-80.215585
19808,DE,5.2,,01/16/2014,39.721215,-75.668342
19702,DE,7.28,,01/16/2014,39.649586,-75.6998009
19958,DE,5.5,,01/16/2014,38.7638163,-75.1778853
19958,DE,10.56,,01/15/2014,38.7638163,-75.1778853
19958,DE,4.5,,01/15/2014,38.708142,-75.214294
06410,CT,860.0,,01/14/2014,41.5589919,-72.912401
05701,VT,140.4,,01/14/2014,43.598277,-72.955117
19971,DE,7.8,,01/13/2014,38.734048,-75.126563
19971,DE,4.06,,01/13/2014,38.7020005,-75.0961549

QUERY 
=
hive -e "select zipcode, count(*) from solar_installs group by zipcode"

ERROR
=
java.lang.RuntimeException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:167)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
    at org.apache.hadoop.fs.Path.<init>(Path.java:90)
    at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1377)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:611)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:149)
    ... 8 more

Re: Amazon EMR error

2014-03-01 Thread Sanjay Subramanian
OK, so I spun up another cluster with a previous AMI version and it worked
successfully.

This Amazon Hive version WORKS SUCCESSFULLY 
=
AMI version:2.4.1
Hadoop distribution:Amazon 1.0.3
Applications:Hive 0.11.0.1
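
For anyone hitting the same error on AMI 2.4.2, a possible workaround (untested here; table definition as in the thread) is to declare the S3 location directly in the CREATE statement instead of a separate ALTER TABLE ... SET LOCATION, so the table never exists without a location:

DROP TABLE IF EXISTS solar_installs;
CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs(
   zipcode STRING,
   state   STRING,
   sizekw  DOUBLE,
   cost    DOUBLE,
   date_   STRING,
   lat     DOUBLE,
   lon     DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://solarinstallations/data';   -- location set at create time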





Re: How to prevent user drop table in Hive metadata?

2013-11-30 Thread Sanjay Subramanian
Cloudera Sentry is awesome and I have implemented this in Cloudera Manager
4.7.2 / CDH 4.4.0. Thanks again to Shreepadma for all the answers to my questions on
the CDH users group. I can provide guidance on Sentry configs if needed.

Sent from my iPhone

 On Nov 22, 2013, at 4:25 PM, Shreepadma Venugopalan shreepa...@cloudera.com 
 wrote:
 
 Apache Sentry is already available and made its first incubating release a 
 couple of months back. 
 
 
 On Fri, Nov 22, 2013 at 3:06 PM, Echo Li echo...@gmail.com wrote:
 Thanks all, that's all very helpful information.
 
 Shreepadma, when will the Apache Sentry come GA?
 
 
 On Fri, Nov 22, 2013 at 2:36 PM, Shreepadma Venugopalan 
 shreepa...@apache.org wrote:
 Apache Sentry (incubating) provides fine-grained role-based authorization 
 for Hive among other components of the Hadoop ecosystem. It currently 
 supports fully secure, fine-grained, role-based authorization for Hive and 
 can be used to prevent the scenario described earlier i.e., prevent a user 
 from dropping a table the user shouldn't be allowed to drop.
 
 Shreepadma
 
 
 On Fri, Nov 22, 2013 at 12:55 PM, simon.2.thomp...@bt.com wrote:
 Thanks Alan - I'll fwd the spec in the Jira to some of our security and 
 integrity people for comment.
 
 Simon
 
 Dr. Simon Thompson
 
 
 From: Alan Gates [ga...@hortonworks.com]
 Sent: 22 November 2013 20:53
 To: user@hive.apache.org
 Subject: Re: How to prevent user drop table in Hive metadata?
 
 See https://issues.apache.org/jira/browse/HIVE-5837 for a JIRA addressing 
 this.
 
 Also, you can use the StorageBasedAuthorizationProvider in Hive, which 
 bases metadata security on file security.  So if the user doesn't have 
 permissions to remove the directory that stores the table data, they won't 
 have permissions to drop the table.  This isn't perfect, but it's a start.
 
 Alan.
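
For reference, a minimal sketch of wiring up the provider Alan mentions, written as Hive session settings; the property names here are the Hive 0.10-era ones, so verify them against your version's documentation before relying on this:

SET hive.security.authorization.enabled=true;
SET hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider;
-- with this in place, DROP TABLE should fail unless the user has write
-- permission on the table's directory in HDFS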
 
 On Nov 22, 2013, at 11:49 AM, simon.2.thomp...@bt.com wrote:
 
  Has no one raised a Jira ticket ?
 
  
  Dr. Simon Thompson
 
  
  From: Biswajit Nayak [biswajit.na...@inmobi.com]
  Sent: 22 November 2013 19:45
  To: user@hive.apache.org
  Subject: Re: How to prevent user drop table in Hive metadata?
 
  Hi Echo,
 
  I don't think there is any way to prevent this. I had the same concern in
  HBase, but found out that it is assumed that users of the system are
  very much aware of it. I have been into Hive for the last 3 months and was
  looking for some kind of way to do this here, but no luck till now..
 
  Thanks
  Biswa
 
  On 23 Nov 2013 01:06, Echo Li echo...@gmail.com wrote:
  Good Friday!
 
  I was trying to apply a certain level of security in our hive data
  warehouse by modifying the access mode of directories and files on hdfs to
  755. I think that is enough to keep a new user from removing data; however,
  the user can still drop the table definition in the hive cli, and REVOKE
  doesn't seem to help much. Is there any way to prevent this?
 
 
  Thanks,
  Echo
 


In Beeline what is the syntax for ALTER TABLE ?

2013-10-21 Thread Sanjay Subramanian
Hi guys

Using
Hive0.10.0+198  CDH4

Getting this error for ALTER table command

jdbc:hive2://dev-thdp5.corp.nextag.com:100 ALTER TABLE outpdir_seller_hidden 
ADD IF NOT EXISTS PARTITION (header_date_partition='2013-10-17', 
header_servername_partition='lu3') LOCATION 
'/data/output/impressions/outpdir/2013-10-17/008-131018121904385-oozie-oozi-W/outpdir_seller_hidden/sellerhidden/2013-10-17/lu3';
Error: Error while processing statement: FAILED: IllegalArgumentException URI
'' is invalid. Must start with file:// or hdfs:// (state=42000,code=4)

Thanks
Regards

sanjay
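
Given that the error text demands a URI starting with file:// or hdfs://, one thing worth trying is to fully qualify the partition location. A sketch, where the namenode host and port are placeholders for your cluster's:

ALTER TABLE outpdir_seller_hidden ADD IF NOT EXISTS
  PARTITION (header_date_partition='2013-10-17', header_servername_partition='lu3')
  LOCATION 'hdfs://your-namenode:8020/data/output/impressions/outpdir/2013-10-17/008-131018121904385-oozie-oozi-W/outpdir_seller_hidden/sellerhidden/2013-10-17/lu3';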





Re: JSON format files versus AVRO

2013-10-08 Thread Sanjay Subramanian
Hi
Thanks, I still have to check out the JsonSerDe in HCatalog.
You are right, and I did think about adding the unique key as an attribute inside the
JSON.
Instead of analyzing further I am going to try both methods out and see how my
downstream processes will work. I have a 40 step Oozie workflow that
needs to be successful after all this :-)
Cool thanks
Cool thanks

Thanks
Regards

sanjay

email : sanjay.subraman...@wizecommerce.com

From: Sushanth Sowmyan khorg...@gmail.com
Reply-To: user@hive.apache.org
Date: Tuesday, October 8, 2013 11:39 AM
To: user@hive.apache.org
Subject: Re: JSON format files versus AVRO


Have you had a look at the JsonSerDe in hcatalog to see if it suits your need?

It does not support the format you are suggesting directly, but if you made the
unique id part of the json object, so that each line was a json record, it
would. It's made to be used in conjunction with text tables.

Also, even if it proves to not be what you want directly, it already provides a 
serializer/deserializer



Re: Execution failed with exit status: 3

2013-10-08 Thread Sanjay Subramanian
Hi

Are you running this through Beeswax in Hue? If I recall right you might need to
give the hue user access to submit and run MR jobs on the cluster.
Also, are you using YARN or MR1?

Thanks
Regards

sanjay


From: Martin, Nick nimar...@pssd.com
Reply-To: user@hive.apache.org
Date: Tuesday, October 8, 2013 10:27 AM
To: user@hive.apache.org
Subject: RE: Execution failed with exit status: 3

Update on this...

When I run this in Hive CLI it works perfectly. Only has a bug in Hue. I'll 
send this thread over to hue_user@ and see what they say.

From: Martin, Nick [mailto:nimar...@pssd.com]
Sent: Tuesday, October 08, 2013 12:17 PM
To: user@hive.apache.org
Subject: RE: Execution failed with exit status: 3

Hi Sanjay, thanks for the suggestion.

There are no partitions on either table.



Re: Is there any API that tells me what files comprise a hive table?

2013-10-07 Thread Sanjay Subramanian
Perhaps a good thing to have in your Hive cheat sheet :-)
I use the following mySQL query to find out the locations of the Hive table:

echo "select t.TBL_NAME, p.PART_NAME, s.LOCATION from PARTITIONS p, SDS s, TBLS
t where t.TBL_ID=p.TBL_ID and p.SD_ID=s.SD_ID" | mysql -uhive_user
-ppassword hive_metastore_table -A | grep your_hive_table_name



Thanks
Regards

sanjay

 email : sanjay.subraman...@wizecommerce.com
   Irc : sanjaysub (channel #noc)
 skype : sanjaysubramanian
mobile : (925) 399 2692

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@hive.apache.org
Date: Monday, October 7, 2013 11:52 AM
To: user@hive.apache.org
Subject: Re: Is there any API that tells me what files comprise a hive table?

Not a direct API.

What I do is this. From java/thrift (a HiveMetaStoreClient; the db name "default" is assumed):
Table t = client.getTable("default", "name_of_table");  // thrift Table object
Path p = new Path(t.getSd().getLocation());             // table's storage location
FileSystem fs = p.getFileSystem(conf);
FileStatus[] files = fs.listStatus(p);                  // names, sizes, mod times
// your logic here, e.g. files[i].getLen(), files[i].getModificationTime()




On Mon, Oct 7, 2013 at 2:01 PM, demian rosas demia...@gmail.com wrote:
Hi all,

I want to track the changes made to the files of a Hive table.

I wonder whether there is any API that I can use to find out the following:

1. What files in hdfs constitute a hive table.
2. What is the size of each of these files.
3. The time stamp of the creation/last update to each of these files.


Also in a wider view, is there any API that can do the above mentioned for HDFS 
files in general (not only hive specific)?

Thanks a lot in advance.

Cheers.






JSON format files versus AVRO

2013-10-07 Thread Sanjay Subramanian
Sorry if the subject sounds really stupid !

Basically I am re-architecting our web log record format

Currently we have Multiple lines = 1 Record  format (I have Hadoop jobs that 
parse the files and create columnar output for Hive tables)

[begin_unique_id]
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
Pipe delimited Blah
[end_unique_id]


I have created JSON serializers that will log records in the following way 
going forward
unique_id JSON-string

This is the plan
- I will store the records in a two column table in Hive
- Write JSON deserializers in hive HDFs that will take these tables and  create 
hive tables pertaining to specific requirements
- Modify current aggregation scripts in Hive

I was looking at the AVRO format but I don't see the value of using AVRO when I feel
JSON gives me pretty much the same thing?

Please poke holes in my thinking ! Rip me apart !
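
A minimal sketch of the two-column layout described above (table name and JSON field names are hypothetical; get_json_object is a built-in Hive UDF):

CREATE EXTERNAL TABLE weblog_json (
  unique_id STRING,
  json_blob STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs/json';

-- derive a requirement-specific result set from the raw JSON column
SELECT unique_id,
       get_json_object(json_blob, '$.url')    AS url,
       get_json_object(json_blob, '$.status') AS status
FROM weblog_json;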


Thanks
Regards

sanjay





Re: Execution failed with exit status: 3

2013-10-07 Thread Sanjay Subramanian
Hi Nick

How many partitions are there in table t1 and table t2?
If there are many partitions in either t1 or t2 or both, can you modify your query as
follows and see if the error comes up:


SELECT

T1.somecolumn,

T2.someothercolumn

FROM

(SELECT * FROM t1 WHERE partition_column1='some_val') T1

JOIN

(SELECT * FROM t2 WHERE partition_column2='some_val') T2

ON (T1.idfield=T2.idfield)


Thanks
Regards

sanjay

 email : sanjay.subraman...@wizecommerce.com
   Irc : sanjaysub (channel #noc)
 skype : sanjaysubramanian
mobile : (925) 399 2692

From: Martin, Nick nimar...@pssd.com
Reply-To: user@hive.apache.org
Date: Monday, October 7, 2013 5:13 PM
To: user@hive.apache.org
Subject: Execution failed with exit status: 3

Hi all,

I'm doing a very basic join in Hive and getting the error below. The HiveQL 
join syntax I'm using is:


SELECT

T1.somecolumn,

T2.someothercolumn

FROM t1

JOIN t2

ON (t1.idfield=t2.idfield)



Driver returned: 3.  Errors: OK
Total MapReduce jobs = 1
setting HADOOP_USER_NAME  someuser
Execution failed with exit status: 3
Obtaining error information

Task failed!
Task ID:
  Stage-4

Logs:

FAILED: Execution Error, return code 3 from 
org.apache.hadoop.hive.ql.exec.MapredLocalTask
I searched through the logs and couldn't find anything terribly useful, 
although perhaps I'm missing something. Is this a common error I'm just now 
coming across?

On Hive 0.11

Thanks!
Nick



Hiveserver2 Authentication (openLDAP) and Authorization (using Sentry)

2013-09-17 Thread Sanjay Subramanian
Hi guys


DISCLAIMER
==
I have no affiliations to Cloudera and I am writing this mail of my own free 
will, with the hope to help fellow Hive users who will be implementing security 
around Hive


I have implemented Hiveserver2 Authentication (openLDAP)  and Authorization 
(using Cloudera Sentry).

I am super-excited because we now can open our Hive Data Platform in read-only
mode to remote clients in the company and SAS clients!

We are using Cloudera Manager Standard distribution (still can't afford support 
costs !)

Thanks

Sanjay

The stack I am planning to push to production after my hive/yarn/oozie tests
succeed is as follows

Group 1 (CDH4)
Hosts
dev-thdp5.corp.nextag.com
Component   Version CDH Version
Impala  1.1.1   Not applicable
Yarn (CDH4 only)2.0.0+1475  CDH4
HDFS (CDH4 only)2.0.0+1475  CDH4
Hue Plugins 2.5.0+139   CDH4
Sqoop2 (CDH4 only)  1.99.2+85   CDH4
MapReduce 2 (CDH4 only) 2.0.0+1475  CDH4
HBase   0.94.6+132  CDH4
HCatalog (CDH4 only)Unavailable Not installed or path incorrect
Oozie   3.3.2+92CDH4
Zookeeper   3.4.5+23CDH4
Hue 2.5.0+139   CDH4
MapReduce 1 (CDH4 only) 2.0.0+1475  CDH4
Pig 0.11.0+33   CDH4
HttpFS (CDH4 only)  2.0.0+1475  CDH4
Hadoop  2.0.0+1475  CDH4
Hive0.10.0+198  CDH4
Lily HBase Indexer (CDH4 only)  1.2+2   CDH4
Solr (CDH4 only)4.4.0+69CDH4
Flume NG1.4.0+23CDH4
Java    java version 1.6.0_31, Java(TM) SE Runtime Environment (build 1.6.0_31-b04), Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)    Not applicable
Cloudera Manager Agent  4.7.0   Not applicable




Re: Issue while querying Hive

2013-09-16 Thread Sanjay Subramanian
With regards to splitting and compression there are really 2 options as of now:

If you are using Sequence Files, then Snappy.
If you are using TXT files then LZO is great (you have to cross a few minor hoops to
get LZO to work and I can provide guidance on that).

Please don't use GZ (not splittable) / or worse BZ2 (too slow to
compress/decompress) - too slow for comfort.

The only compelling reason to use GZIP (as I am doing in production) is
that my log files are MULTIPLE LINES…so if I used regular TXT files then
splitting could happen between records.

sanjay
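
To illustrate the SequenceFile + Snappy route above, a sketch (table and column names hypothetical; the mapred.* property names are the MR1-era ones, so check your distribution):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE events_seq (ts STRING, msg STRING) STORED AS SEQUENCEFILE;

-- rewriting an existing text table yields splittable, Snappy-compressed files
INSERT OVERWRITE TABLE events_seq
SELECT ts, msg FROM events_text;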


From: Nitin Pawar nitinpawar...@gmail.com
Reply-To: user@hive.apache.org
Date: Monday, September 16, 2013 5:07 AM
To: user@hive.apache.org
Subject: Re: Issue while querying Hive

As per my understanding, hadoop 1.x does not give you any help with processing
compressed files in a parallel manner (at least this was the case a few months
back).

This bzip2 splitting etc. was added in hadoop 2.x, as per my understanding.




On Mon, Sep 16, 2013 at 5:18 PM, Garg, Rinku rinku.g...@fisglobal.com wrote:
Thanks Nitin,

That way it worked. But in that case Hadoop will not be able to split my file
into chunks/blocks and run multiple maps in parallel. This can cause
under-utilization of my cluster's 'mapping' power. Is that true??

Thanks  Regards,
Rinku Garg

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 16 September 2013 15:57

To: user@hive.apache.org
Subject: Re: Issue while querying Hive

Does your .gz file contain the data in sequencefile format? Or is it a plain csv?

I think looking at the filename its a plain csv file, so I would recommend that 
you create a normal table with TextInputFormat (the default) and load data in 
the new table and give it a try.


On Mon, Sep 16, 2013 at 3:36 PM, Garg, Rinku rinku.g...@fisglobal.com wrote:
Hi Nitin,

Yes, I created the table with sequencefile.

Thanks  Regards,
Rinku Garg



From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 16 September 2013 14:19
To: user@hive.apache.org
Subject: Re: Issue while querying Hive

Look at the error message

Caused by: java.io.IOException: 
hdfs://localhost:54310/user/hive/warehouse/cpj_tbl/cpj.csv.gz not a SequenceFile

Did you create table with sequencefile ?

On Mon, Sep 16, 2013 at 1:33 PM, Garg, Rinku rinku.g...@fisglobal.com wrote:
Hi All,

I have set up Hadoop and Hive and am trying to load a gzip file into the hadoop cluster.
Files are loaded successfully and can be viewed on the web UI. While executing a Select
query it gives me the below mentioned error.

ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:nxtbig (auth:SIMPLE) cause:java.io.IOException: java.lang.reflect.InvocationTargetException
2013-09-16 09:11:18,971 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:369)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:316)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:430)
  at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:395)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
  at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
  at

Inner Map key and value separators

2013-09-13 Thread Sanjay Subramanian
Hi guys

I have to load data into the following data type in hive

map<string, map<string,string>>

Is there a way to define custom SEPARATORS (while creating the table) for
- Inner map collection item
- Inner map key

delimiters for 2nd-level maps are \004 and \005 per this

http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201009.mbox/%3c4f6b25afffcafe44b6259a412d5f9b101c07a...@exchmbx104.netflix.com%3E

Thanks

sanjay
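
For context, a sketch of what can be set in the DDL (table and column names hypothetical): the outer collection and map-key delimiters are configurable, while per the mail linked above the 2nd-level map delimiters fall back to \004 and \005:

CREATE TABLE nested_map_demo (
  id    STRING,
  props MAP<STRING, MAP<STRING, STRING>>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '\002'  -- separates outer map entries
  MAP KEYS TERMINATED BY '\003';         -- separates outer keys from values
-- inner map entry and key delimiters default to \004 and \005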



Re: question about partition table in hive

2013-09-13 Thread Sanjay Subramanian
A couple of days back, Erik Sammer at the Hadoop Hands On Lab at the Cloudera 
Sessions demonstrated how to achieve dynamic partitioning using Flume and 
created those partitioned directories on HDFS which are then readily usable by 
Hive

Understanding what I can from the two lines of your mail below, I would 
configure Flume to do dynamic partitioning (YEAR, MONTH, DAY, HOUR) and create 
those directories in HDFS and then create Hive tables with  those partitions 
and run the queries
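
Concretely, the Hive side of that setup could look like this sketch (paths and partition values hypothetical), with Flume writing into the date-bucketed directories:

CREATE EXTERNAL TABLE flume_logs (line STRING)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
LOCATION '/flume/events';

ALTER TABLE flume_logs ADD IF NOT EXISTS
  PARTITION (year='2013', month='09', day='12', hour='23')
  LOCATION '/flume/events/2013/09/12/23';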

As Stephen said earlier , experiment like crazy - and share please - it will 
make all of us better as well !


Thanks

sanjay

From: ch huang justlo...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, September 12, 2013 6:55 PM
To: user@hive.apache.org
Subject: question about partition table in hive

hi all:
i use flume to collect log data and put it in hdfs. i want to use hive to
do some calculation and querying based on a time range, so i want to use a partitioned table,
but the data file in hdfs is one big file. how can i put it into a partitioned table
in hive?



Re: Interesting claims that seem untrue

2013-09-12 Thread Sanjay Subramanian
I have not read the full blogs but in the year 2013, IMHO, LOC is a very old
metric that doesn't define good software any more...

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, September 12, 2013 7:19 AM
To: hive-u...@hadoop.apache.org, hive-...@hadoop.apache.org
Subject: Interesting claims that seem untrue

I was reading the horton-works blog and found an interesting article.
http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/#comment-160753

There is a very interesting graphic which attempts to demonstrate lines of code
in the hive 12 release.
http://hortonworks.com/wp-content/uploads/2013/09/hive4.png

Although I do not know how they are calculated, they are probably counting
generated test output as code, but besides that they are wrong.

One claim is that Cloudera contributed 4,244 lines of code.

So to debunk that claim:

In https://issues.apache.org/jira/browse/HIVE-4675 Brock Noland from cloudera
created the ptest2 testing framework. He did all the work for ptest2 in hive
12, and it is clearly more than 4,244 lines.

This consists of 84 java files
[edward@desksandra ptest2]$ find . -name *.java | wc -l
84
and by itself is 8001 lines of code.
[edward@desksandra ptest2]$ find . -name *.java | xargs cat | wc -l
8001

[edward@desksandra hive-trunk]$ wc -l HIVE-4675.patch
7902 HIVE-4675.patch

This is not the only feature from cloudera in hive 12.

There is also a section of the article that talks of a ROAD MAP for hive 
features. I did not know we (hive) had a road map. I have advocated switching 
to feature based release and having a road map before, but it was suggested 
that might limit people from itch-scratching.







Re: Sentry Meetup

2013-09-04 Thread Sanjay Subramanian
Ahh…..why not SFO :-)

I am struggling to implement Sentry and would love some inputs
Shreepadma, thanks for all your clarifications but I am only partially done 
with my implementation

I have a workaround whereby I can blank out tables from default db of hive to 
all hiveserver2 JDBC users…this is not elegant but will keep my production 
tables secure

sanjay

From: Shreepadma Venugopalan shreepa...@cloudera.com
Reply-To: user@hive.apache.org
Date: Wednesday, September 4, 2013 3:40 PM
To: user@hive.apache.org
Subject: Sentry Meetup

Sentry is a fine-grained role-based authorization system for Hive among others.
We're planning to have a Sentry meetup in NYC coinciding with Hadoop World and
Strata 2013. Please RSVP here
(http://www.meetup.com/Sentry-User-Meetup/events/138189232/) if you wish to
attend the meetup and hear more about how Sentry can help secure your Hive
deployment.

Thanks.
Shreepadma



Re: how to config the job id in oozie

2013-09-04 Thread Sanjay Subramanian
Wow that’s a great idea Artem

From: Artem Ervits are9...@nyp.org
Reply-To: user@hive.apache.org
Date: Wednesday, September 4, 2013 5:07 AM
To: user@hive.apache.org
Subject: Re: how to config the job id in oozie

You can also use a shell action to generate a timestamp in the format you want and pass
it to the next action as a parameter. I do agree it should be easier.


Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Tuesday, September 03, 2013 02:34 PM
To: user@hive.apache.org
Subject: Re: how to config the job id in oozie

Hi

See here
http://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a4.2.1_Basic_EL_Constants


String timestamp()

It returns the UTC current date and time in W3C format down to the second
(YYYY-MM-DDThh:mm:ss.sZ). I.e.: 1997-07-16T19:20:30.45Z

I don't like the fact that there are no Oozie String Format EL functions
where you can format the DATE as you want.

See if this works for you. Else you have to pass a param when you submit the
Oozie job and append that param to the Oozie workflow ID inside the workflow.xml:

oozie job -oozie http://oozie_server:11000/oozie -config job.properties -D 
START_TIME=$(date +%Y_%m_%dT%H_%M_%S)  -run


Inside your workflow.xml
=

<property>
  <name>mapreduce.job.name</name>
  <value>MyJobName_${wf:id()}_${START_TIME}</value>
</property>

Thanks

sanjay

From: 李宏展 lihongzhan...@gmail.com
Reply-To: user@hive.apache.org
Date: Tuesday, September 3, 2013 7:43 AM
To: user@hive.apache.org
Subject: how to config the job id in oozie

When I submit a workflow, oozie returns a job id like
002-130903214537362-oozie-*.
But on this page http://oozie.apache.org/docs/3.3.2/DG_Examples.html the job
id is 14-20090525161321-oozie-tucu.

The job id with a date is really useful for me, but how should I configure oozie
for this?



Re: Sentry Meetup

2013-09-04 Thread Sanjay Subramanian
Thanks I will see it in the CDH users group
Regards
sanjay

From: Shreepadma Venugopalan shreepa...@cloudera.com
Reply-To: user@hive.apache.org
Date: Wednesday, September 4, 2013 5:33 PM
To: user@hive.apache.org
Subject: Re: Sentry Meetup

Hi Sanjay,

The problems you are encountering with Sentry are due to misconfiguration.
These are not bugs. We have responded to your questions on cdh-user@.

Shreepadma




Re: Hive Statistics information

2013-09-03 Thread Sanjay Subramanian
Thanks Ravi let me give this a shot
Regards
sanjay

From: Ravi Kiran maghamraviki...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 30, 2013 10:53 PM
To: user@hive.apache.org
Subject: Re: Hive Statistics information

Hi Sanjay,

   What do the logs say when you fire the ANALYZE TABLE ... statement on a
table?
   One minor correction to the db connection string would be to use "&"
(written as &amp; in the XML config) between the query parameters:
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1&password=hive_user_vso1

I hope the database hive_vso1_tempstatsstore exists in your MySQL?

Regards
Ravi Magham
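
For reference, the kind of statistics-gathering statement Ravi refers to, as a sketch with hypothetical table and partition names:

ANALYZE TABLE my_table PARTITION (dt='2013-08-30') COMPUTE STATISTICS;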




Re: Hive Query - Issue

2013-09-03 Thread Sanjay Subramanian
Hi

When you do a SELECT *, the partition columns are returned as the last N columns
(if you have N partition columns).

In this case the 63rd column in SELECT * is the partition column.

Instead of SELECT *
Do a
SELECT
 col1,
 col2,
 col3,
…..


Not to show a candle to the sun if you are an AWK/SED ninja :-) but to get all
columns from hive you can do this:

hive -e "describe ur_table_name" | awk '{print $1","}' | sed '1i SELECT' | less
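
Putting it together, a toy sketch of the fix (column names hypothetical, and only two data columns shown; in the real case all 62 data columns would be listed, omitting the partition column):

INSERT OVERWRITE TABLE table_baseline PARTITION (sourcedate='base_2013_08')
SELECT final.col1, final.col2          -- data columns only, no sourcedate
FROM (
  SELECT * FROM table_a WHERE sourcedate='tablea_2013_08'
  UNION ALL
  SELECT * FROM table_b WHERE sourcedate='tableb_2013_08'
) final;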

Thanks

sanjay
From: Manickam P manicka...@outlook.com
Reply-To: user@hive.apache.org
Date: Monday, September 2, 2013 4:32 AM
To: user@hive.apache.org
Subject: Hive Query - Issue

Hello Experts,

When I try to execute the below query I'm getting an error. Please help me to
correct this.

insert overwrite table table_baseline partition (sourcedate='base_2013_08') 
select * from (select * from table_a where sourcedate='tablea_2013_08' union 
all select * from table_b where sourcedate='tableb_2013_08') final

My intention here is to populate table_baseline using all the records from
table_a and table_b, with a partition. I am getting the below error.

Error in semantic analysis: Line 1:23 Cannot insert into target table because 
column number/types are different ''BASE_2013_08'': Table insclause-0 has 62 
columns, but query has 63 columns.

I verified the column count and types; everything is the same, but here it says
there is some difference. The same query works fine without any partitions in all
three tables, but it gives an error while executing with partitions.


please help.



Thanks
Manickam P



Hive Statistics information

2013-08-30 Thread Sanjay Subramanian
Hi guys

I have configured Hive to use MySQL for all statistics

hive.stats.atomic=false
hive.stats.autogather=true
hive.stats.collect.rawdatasize=true
hive.stats.dbclass=jdbc:mysql
hive.stats.dbconnectionstring=jdbc:mysql://v-so1.nextagqa.com/hive_vso1_tempstatsstore?user=hive_user_vso1password=hive_user_vso1
hive.stats.jdbc.timeout=30
hive.stats.jdbcdriver=com.mysql.jdbc.Driver
hive.stats.retries.max=0
hive.stats.retries.wait=3000

However in the MYSQL hive statistics tables , they don't seem to have any data ?

Where does Hive store the statistics information ?

sanjay



SAS--Hive integration

2013-08-28 Thread Sanjay Subramanian
Hi guys

Anyone tried SAS--Hive integration successfully ?

I tried a simple query in SAS (select col1 from table1 limit 10) and it opened 
3 connections to hive-server and killed it !!! :-(

I will setup a dev environment for SAS and Hive to test all this

But I was wondering if you guys had any clues ? Any thoughts ?

sanjay



Re: hiveserver2 with OpenLDAP ?

2013-08-24 Thread Sanjay Subramanian
Hi Mikhail

Good news !

I deleted the current user, which used dn: cn=, and re-created it using dn:
uid=, and it looks like it works.
So that’s an option 3 as well

Nevertheless for academic reasons I am going to give your option 2 a try and 
report back what I find

Thanks again
Regards

sanjay

From: Mikhail Antonov olorinb...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 23, 2013 9:40 PM
To: user@hive.apache.org
Subject: Re: hiveserver2 with OpenLDAP ?

You don't need to rebuild Hive, no.

I see 2 solutions for that -

1) modify your LDAP data schema so it accepts the uid instead of the cn attribute (and
actually I think it should be working, as after all there can't be such an obvious
bug in Hive...), so probably whoever tested that provider was using a different
LDAP schema than what we have

2) write your own class similar to the one mentioned, point Hive to it in the
hive-site.xml, set hive.server2.authentication=CUSTOM, and provide the full class
name of your class (and make sure it's on the classpath, of course).

I prefer second way. Hope it should help. Let me know it it worked for you.

*General question to folks* - am I missing something or there's really a bug in 
LDAP authenticator, which doesn't allow precise configuration of binding string?

Mikhail


2013/8/23 Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Thanks a lot Mikhail for getting back.

That means I cannot use this from beeline unless I change the code and build
hive again?

Thanks

sanjay

From: Mikhail Antonov olorinb...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, August 23, 2013 6:17 PM
To: user@hive.apache.org
Subject: Re: hiveserver2 with OpenLDAP ?

I see the same behavior and here's the reason.

LdapAuthenticationProviderImpl - that's the one responsible for LDAP
authentication in Hive. Look at this class. It has this snippet (CDH 4.2.1, hive
0.10):

// setup the security principal
String bindDN;
if (baseDN != null) {
  bindDN = "uid=" + user + "," + baseDN;
} else {
  bindDN = user;
}

And according to Cloudera documentation, you're supposed to set baseDN param 
for OpenLDAP, but not for AD. So when this baseDN isn't present, Hive takes 
username as it is (say user1) and tries to bind to the ldap server, which works.

When you set this baseDN, it constructs the bind string as
uid=user1,dc=wizetest,dc=com. But most likely your OpenLDAP expects it to be
cn=user1,dc=wizetest,dc=com instead; the uid attribute isn't being used.

I think the way to go is to provide your own LDAP authenticator, which has more
control over how to generate the LDAP bind string.

Mikhail







--
Thanks,
Michael Antonov


hiveserver2 with OpenLDAP ?

2013-08-23 Thread Sanjay Subramanian
Hi guys

I tested hiveserver2 with Active directory - It works
With Open LDAP it does not

Is there any specific syntax for specifying the LDAP url or baseDN ?

<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://myserver.corp.nextag.com:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>dc=wizetest,dc=com</value>
</property>

Beeline keeps giving error

jdbc:hive2://dev-thdp5:1 !connect jdbc:hive2://dev-thdp5:1 hiveuser1 
 org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://dev-thdp5:1
Error: Could not establish connection to jdbc:hive2://dev-thdp5:1: Peer 
indicated failure: Error validating the login (state=08S01,code=0)

Any clues ?

Thanks

sanjay



Re: Alter or Query a table with field name 'date' always get error

2013-08-22 Thread Sanjay Subramanian
Yes date is a reserved word.

My recommendations

If your table is Hive managed (i.e. u created the table without using EXTERNAL)
===
- Then copy the data for this hive table that is on HDFS to another location
- Drop the table
- CREATE an EXTERNAL TABLE with the same schema, replacing date with some field name 
like date_
- ALTER TABLE ADD IF NOT EXISTS PARTITION (date_='your_partition') LOCATION 
'/path/to/your/HDFS/where/you/copied/the/data'

If your table is EXTERNAL
===
- Drop the table
- CREATE an EXTERNAL TABLE with the same schema, replacing date with some field name 
like date_
- ALTER TABLE ADD IF NOT EXISTS PARTITION (date_='your_partition') LOCATION 
'/path/to/your/HDFS/where/you/copied/the/data'
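
A minimal sketch of those steps in HiveQL (the table name, columns and partition 
value are illustrative, not from the thread):

-- hypothetical schema: replace the columns and locations with your own
CREATE EXTERNAL TABLE transaction3_new (txn_id BIGINT, amount DOUBLE)
PARTITIONED BY (date_ STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path/to/your/HDFS/where/you/copied/the/data';

ALTER TABLE transaction3_new ADD IF NOT EXISTS
PARTITION (date_='2013-08-15')
LOCATION '/path/to/your/HDFS/where/you/copied/the/data/2013-08-15';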

Hope this helps

Good luck

sanjay


From: Sonya Ling sonya_ling1...@yahoo.com
Reply-To: user@hive.apache.org, Sonya Ling sonya_ling1...@yahoo.com
Date: Thursday, August 22, 2013 6:45 PM
To: user@hive.apache.org
Subject: Alter or Query a table with field name 'date' always get error

Hi:

I have a table transaction3 with a field name 'date'.  That table is the target 
of importing a table from MySQL using Sqoop.  The table in MySQL has a field with 
the field name 'date' and Sqoop does not allow column name mapping.  Therefore, 
the field name 'date' is kept in the transaction3 table in Hive.  'date' is 
the partition field name of transaction3.

I tried to change the column name (existing code is using the created_date field):

ALTER TABLE transaction3 CHANGE date created_date STRING;
I got an error:
FAILED: ParseException line 1:32 extraneous input 'date' expecting Identifier near 'EOF'

I cannot get around it.

I queried the count:
SELECT COUNT(*) FROM transaction3 WHERE date = '2013-08-15';
I got another error:
FAILED: ParseException line 1:40 cannot recognize input near 'date' '=' ''2013-08-15'' in expression specification

I queried another table with the field name 'created_date' instead, and it works 
fine; I didn't get any error.

I'd like to know if 'date' is a reserved word.  Is there any way to get around 
this, either by changing the field name or the query?
Thanks.



Re: How to perform arithmetic operations in hive

2013-08-22 Thread Sanjay Subramanian
Yes this will work
Also arithmetic operations will  work in a WHERE clause
Example

select channel_id from keyword_impressions_log where 
header_date_partition='2013-08-21' and channel_id*10=290640 limit 10

From: Justin Workman justinjwork...@gmail.com
Reply-To: user@hive.apache.org
Date: Thursday, August 22, 2013 7:17 AM
To: user@hive.apache.org
Subject: Re: How to perform arithmetic operations in hive

Try

select emp_name, (emp_no * 10) from emp_table;

Sent from my iPhone

On Aug 22, 2013, at 8:14 AM, Sandeep Nemuri nhsande...@gmail.com wrote:

Hi all,
Can we perform arithmetic operations in a select command?
I have a table emp_table with columns emp_name and emp_no.
I want to multiply the data in emp_no by 10.

I tried: select * from emp_table where emp_no=emp_no*10;
But it did not work.
Help me out

Thanks in advance .


--
Regards,
  Sandeep Nemuri



Re: using hive with multiple schemas

2013-08-21 Thread Sanjay Subramanian
Some ideas to get u started


CREATE EXTERNAL TABLE  IF NOT EXISTS names(fullname STRING,address
STRING,phone STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

CREATE EXTERNAL TABLE  IF NOT EXISTS names_detail(id BIGINT, fullname
STRING,address STRING,gender STRING, phone STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','

ALTER TABLE names SET LOCATION
'hdfs://namenodeserver:port/path/to/dir/in/hdfs/schema1_dir'

ALTER TABLE names_detail SET LOCATION
'hdfs://namenodeserver:port/path/to/dir/in/hdfs/schema2_dir'

(Note: SET LOCATION must point to the HDFS directory that holds Schema1.csv / 
Schema2.csv respectively, not to the CSV files themselves.)




CHECKPOINT to see if DATA is there
==
# Won't use Map Reduce (will use Map Reduce if the Hive version is older than 0.11)
hive -e "select * from names"
hive -e "select * from names_detail"


JOIN QUERY (may not be the business case, but just as an illustration of an
INNER JOIN)
===
SELECT
  nd.id,
  nd.gender,
  n.fullname,
  n.address,
  n.phone
FROM
  names n
JOIN
  names_detail nd
ON
  n.fullname = nd.fullname
AND
  n.phone = nd.phone


This query is not tested, so please tweak as appropriate to make it work.

Hope this helps

Good luck


sanjay



From: Chris Driscol cdris...@rallydev.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 7:24 AM
To: user@hive.apache.org
Subject: using hive with multiple schemas


Hi - I just started to get my feet wet with Hive and have a question that I have 
not been able to find an answer to.

Suppose I have 2 CSV files:
cat Schema1.csv
Name, Address, Phone
Chris, address1, 999-999-

and
cat Schema2.csv
Id, Name, Address, Gender, Phone
13, Tom, address2, male, 888-888-

I put these two files into Hadoop and want to be able to query these 2
different schemas via Hive.

Do I need to create two tables in Hive to represent both schemas and use a
join?  Or is there a better way that can handle these two different
schemas?

Please reply back with any other specific questions, I realize this is
somewhat open-ended..  thanks!

--
-cd





Re: only one mapper

2013-08-21 Thread Sanjay Subramanian
Hi

Try this setting in your hive query

SET mapreduce.input.fileinputformat.split.maxsize=<bytes>;

If u set this value low, then the MR job will use this size to split the input 
LZO files and u will get multiple mappers (and make sure the input LZO files 
are indexed, i.e. .lzo.index files are created)
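
For example (134217728 bytes = 128 MB is just an illustrative value; tune it to 
your files, and the table name is hypothetical):

SET mapreduce.input.fileinputformat.split.maxsize=134217728;
SELECT COUNT(*) FROM your_lzo_table;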

sanjay


From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 10:43 AM
To: user@hive.apache.org
Subject: Re: only one mapper

LZO files are only splittable if you index them. Sequence files compressed with 
LZO are splittable without being indexed.

Snappy + SequenceFile is a better option than LZO.


On Wed, Aug 21, 2013 at 1:39 PM, Igor Tatarinov i...@decide.com wrote:
LZO files are combinable so check your max split setting.
http://mail-archives.apache.org/mod_mbox/hive-user/201107.mbox/%3c4e328964.7000...@gmail.com%3E

igor
decide.com



On Wed, Aug 21, 2013 at 2:17 AM, 闫昆 yankunhad...@gmail.com wrote:
Hi all, when I use Hive, the job makes only one mapper, though my file actually 
splits into 18 blocks; my block size is 128MB and the data size is 2GB.
I use LZO compression and created file.lzo and the index file.lzo.index.
I use Hive 0.10.0.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 2304560827) is larger than 
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1377071515613_0003, Tracking URL = 
http://hydra0001:8088/proxy/application_1377071515613_0003/
Kill Command = /opt/module/hadoop-2.0.0-cdh4.3.0/bin/hadoop job  -kill 
job_1377071515613_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-08-21 16:44:30,237 Stage-1 map = 0%,  reduce = 0%
2013-08-21 16:44:40,495 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 6.81 sec
2013-08-21 16:44:41,710 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 6.81 sec
2013-08-21 16:44:42,919 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 6.81 sec
2013-08-21 16:44:44,117 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 9.95 sec
2013-08-21 16:44:45,333 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 9.95 sec
2013-08-21 16:44:46,530 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 13.0 sec

--

In the Hadoop world I am just a novice, exploring the entire Hadoop ecosystem. I 
hope one day I can contribute my own code.

YanBit
yankunhad...@gmail.com






Re: single output file per partition?

2013-08-21 Thread Sanjay Subramanian
Hi

I tried file crusher with LZO but it does not work… I have LZO correctly 
configured in production and my jobs are running daily using LZO compression.

I like Crusher, so I will see why it's not working… Thanks to Edward, the code is 
there to tweak :-) and test locally


sanjay


From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 21, 2013 12:07 PM
To: user@hive.apache.org
Subject: Re: single output file per partition?

I see.  I'll have to punt then.  However, there is an after-the-fact file 
crusher Ed Capriolo wrote a while back, here: 
https://github.com/edwardcapriolo/filecrush YMMV


On Wed, Aug 21, 2013 at 11:12 AM, Igor Tatarinov i...@decide.com wrote:
Using a single bucket per partition seems to create a single reducer which is 
too slow.
I've tried enforcing small files merge but that didn't work. I still got 
multiple output files.

Creating a temp table and then combining the multiple files into one using a 
simple select * is the only option that seems to work. It's odd that I have to 
create the temp table but I don't see a workaround.
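
A rough sketch of that temp-table workaround (table and column names are 
hypothetical; assumes both tables share the same (x,y,z) partition columns):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.hadoop.supports.splittable.combineinputformat=true;

CREATE TABLE tmp_out LIKE final_out;  -- same columns and partitioning

INSERT OVERWRITE TABLE tmp_out PARTITION (x, y, z)
SELECT col1, col2, x, y, z FROM source_table WHERE col1 IS NOT NULL;

-- second pass: a map-only SELECT * re-reads each partition through combined
-- splits, so the many small files come out merged per partition
INSERT OVERWRITE TABLE final_out PARTITION (x, y, z)
SELECT * FROM tmp_out;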


On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague sprag...@gmail.com wrote:
hi igor,
lots of ideas there!  I can't speak for them all, but let me confirm first: 
cluster by X into 1 bucket didn't work?  I would have thought that would have 
done it.




On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov i...@decide.com wrote:
What's the best way to enforce a single output file per partition?

INSERT OVERWRITE TABLE table
PARTITION (x,y,z)
SELECT ...
FROM ...
WHERE ...

I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would force a 
single reducer per partition, but that didn't work. I still got multiple files 
per partition.

Do I have to use a single reduce task? With a few TB of data that's probably 
not a good idea.

My current idea is to create a temp table with the same partitioning structure. 
Insert into that table first and then select * from that table into the output 
table. With combineinputformat=true that should work right?

Or should I make Hive merge output files instead? (using hive.merge.mapfiles) 
Will that work with a partitioned table?

Thanks!
igor






Last time request for cwiki update privileges

2013-08-20 Thread Sanjay Subramanian
Hey guys

I can only think of two reasons why my request has not yet been accepted:

1. The admins don't want to give me access

2. The admins have not seen my mail yet.

This is the fourth and the LAST time I am requesting permission to edit the wiki 
docs… Nobody likes being ignored, and that includes me.

Meanwhile, to show my thankfulness to the Hive community, I shall continue to 
answer questions. There will be no change in that behavior.

Regards

sanjay




From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Wednesday, August 14, 2013 3:52 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Once again, I am down on my knees humbly calling upon the Hive Jedi Masters 
to please provide this padawan with cwiki update privileges

May the Force be with u

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Wednesday, July 31, 2013 9:38 AM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Hi guys

Any chance I could get cwiki update privileges today ?

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Tuesday, July 30, 2013 4:26 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Review Request (wikidoc): LZO Compression in Hive

Hi

Met with Lefty this afternoon and she was kind to spend time to add my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) - new bullet under File Formats
  2.  LZO Compression (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO)
  3.  CREATE TABLE (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) - near end of section, pasted in here:
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about 
CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) 
if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression above).

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It will be great if I could get edit privileges

Thanks
sanjay


Re: Last time request for cwiki update privileges

2013-08-20 Thread Sanjay Subramanian
Thanks Ashutosh

From: Ashutosh Chauhan hashut...@apache.org
Reply-To: user@hive.apache.org
Date: Tuesday, August 20, 2013 3:13 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Last time request for cwiki update privileges

Hi Sanjay,

Really sorry for that. I apologize for the delay. You are added now. Feel free 
to make changes to make Hive even better!

Thanks,
Ashutosh


On Tue, Aug 20, 2013 at 2:39 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Hey guys

I can only think of two reasons why my request has not yet been accepted:

1. The admins don't want to give me access

2. The admins have not seen my mail yet.

This is the fourth and the LAST time I am requesting permission to edit the wiki 
docs… Nobody likes being ignored, and that includes me.

Meanwhile, to show my thankfulness to the Hive community, I shall continue to 
answer questions. There will be no change in that behavior.

Regards

sanjay




From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Wednesday, August 14, 2013 3:52 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Once again, I am down on my knees humbly calling upon the Hive Jedi Masters 
to please provide this padawan with cwiki update privileges

May the Force be with u

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Wednesday, July 31, 2013 9:38 AM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Hi guys

Any chance I could get cwiki update privileges today ?

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Tuesday, July 30, 2013 4:26 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Review Request (wikidoc): LZO Compression in Hive

Hi

Met with Lefty this afternoon and she was kind to spend time to add my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) - new bullet under File Formats
  2.  LZO Compression (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO)
  3.  CREATE TABLE (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) - near end of section, pasted in here:
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about 
CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) 
if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression above).

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It will be great if I could get edit privileges

Thanks
sanjay


Re: question about hive SQL

2013-08-19 Thread Sanjay Subramanian
Here is my stab at it. I have not tested it, but this should get you started.

The following points are important:

1. I added a WHERE clause in the sub-query to limit the data set by any 
partition u may have
2. You have to write a collect UDF to use it. See Chapter 13 (Functions) in 
Wampler/Capriolo's book - refer to the class GenericUDAFCollect

SELECT
  pcw.page_url,
  pcw.token,
  collect(concat_ws('|', pcw.original_category, cast(pcw.weight AS STRING)))
FROM
  (SELECT
    page_url,
    token,
    original_category,
    weight
  FROM
    media_visit_info
  WHERE
    partition_column='partition_col_val'
  ) pcw
GROUP BY
  pcw.page_url,
  pcw.token
LIMIT 10
;
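
Before running it, the custom UDAF has to be registered; the jar path and class 
package here are hypothetical:

ADD JAR /path/to/your-udfs.jar;
CREATE TEMPORARY FUNCTION collect AS 'com.example.udf.GenericUDAFCollect';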

From: ch huang justlo...@gmail.com
Reply-To: user@hive.apache.org
Date: Monday, August 19, 2013 2:04 AM
To: user@hive.apache.org
Subject: question about hive SQL

hi, all:
   I am not very familiar with HQL, and my problem is: now I have 2 queries

Q1: select page_url, original_category, token from media_visit_info group by 
page_url, original_category, token limit 10
Q2: select original_category as code, weight from media_visit_info where 
page_url='X' group by original_category, weight;

Q1's page_url value should be sent into Q2's where condition, and the two query 
results should be combined like

{
  "url": "http://www.baidu.com",
  "category": "|CN10",
  "token": 20,
  "categorys": [
    {"code": "|CN10-1-1", "weight": 0.5},
    {"code": "|CN11-2-2", "weight": 0.1},
    {"code": "|CN10-1-3", "weight": 0.02}
  ]
}

I do not know if it can be written as one query (JOIN + SUBQUERY??), can anyone help?



Hive Authorization (ROLES AND PRIVILEGES) does not work with hiveserver2 ?

2013-08-19 Thread Sanjay Subramanian
0: jdbc:hive2://dev-thdp5:1 CREATE ROLE sas_role;
No rows affected (0.16 seconds)

0: jdbc:hive2://dev-thdp5:1 CREATE EXTERNAL TABLE  IF NOT EXISTS 
keyword_impressions_log (date_ STRING,server STRING,impression_id 
STRING,search_session_id STRING,channel_id INT,visit_id BIGINT,visitor_id 
BIGINT,app_style STRING,publisher_id INT,ip STRING,keyword_id BIGINT,keyword 
STRING,node_constraint BIGINT,mfr_constraint BIGINT,seller_constraint 
BIGINT,other_constraint STRING,continued INT,offset INT,results INT,sort_ 
INT,ad_nodes STRING,view_ INT,spelling STRING,referrer STRING,internal_ip 
INT,cat_feat_mode STRING,rules STRING,rb_filter INT,shuffled 
INT,related_item_results INT,pixeled INT,sr_attr_imps 
STRING,unranked_ptitle_ids STRING,perpage INT) PARTITIONED BY
(header_date_partition STRING) STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
No rows affected (0.468 seconds)

0: jdbc:hive2://dev-thdp5:1 alter table keyword_impressions_log add if not 
exists partition (header_date_partition='2013-08-11') location 
'/user/hive/warehouse/keyword_impressions_log/2013-08-11';
No rows affected (0.438 seconds)

0: jdbc:hive2://dev-thdp5:1 GRANT SELECT ON TABLE keyword_impressions_log 
TO ROLE sas_role;
No rows affected (0.403 seconds)

0: jdbc:hive2://dev-thdp5:1 GRANT ROLE sas_role TO USER hiveuser1;
No rows affected (0.168 seconds)

0: jdbc:hive2://dev-thdp5:1 show grant role sas_role on table 
keyword_impressions_log;
No rows affected (0.117 seconds)

The following CLI command shows the role information

hive -e "show grant role sas_role on table keyword_impressions_log"

Logging initialized using configuration in 
jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.3.0.jar!/hive-log4j.properties
Hive history 
file=/tmp/sasubramanian/hive_job_log_99244cad-b6e4-4c71-9fe1-d4b248b078c3_1882800333.txt
OK
database default
table keyword_impressions_log
principalName sas_role
principalType ROLE
privilege select
grantTime Mon Aug 19 12:24:08 PDT 2013
grantor hive
Time taken: 1.76 seconds


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Friday, August 16, 2013 7:24 PM
To: user@hive.apache.org
Subject: Hive Authorization clarification

Hi guys

I am not getting the expected result from my authorization settings

I am evaluating Hive 0.10.0+121

mysql> select * from hive.ROLES;
+---------+-------------+------------+-----------+
| ROLE_ID | CREATE_TIME | OWNER_NAME | ROLE_NAME |
+---------+-------------+------------+-----------+
|       6 |  1376704610 | NULL       | sas_role  |
+---------+-------------+------------+-----------+



My settings are as follows.


<!-- added by sanjay -->
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://corp.nextag.com</value>
</property>

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>

<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>the privileges automatically granted to the owner whenever a
  table gets created. An example like "select,drop" will grant select and drop
  privilege to the owner of the table</description>
</property>

<property>
  <name>hive.security.authorization.createtable.role.grants</name>
  <value>sas_role:select</value>
  <description>The privileges automatically granted to some roles whenever a
  table gets created. An example like "roleX,roleY:select;roleZ:create" will
  grant select privilege to roleX and roleY, and grant create privilege to roleZ
  whenever a new table is created.</description>
</property>

<property>
  <name>hive.security.authorization.createtable.group.grants</name>
  <value>hiveuser1:select</value>
  <description>The privileges automatically granted to some groups whenever a
  table gets created. An example like "groupX,groupY:select;groupZ:create" will
  grant select privilege to groupX and groupY, and grant create privilege to
  groupZ whenever a new table is created.</description>
</property>

<property>
  <name>hive.security.authorization.createtable.user.grants</name>
  <value>hiveuser1:select</value>
  <description>The privileges automatically granted to some users whenever a
  table gets created. An example like "userX,userY:select;userZ:create" will
  grant select privilege to userX and userY, and grant create privilege to userZ
  whenever a new table is created.</description>
</property>



USECASE STEPS

1.   Connect as sasubramanian and create table
beeline !connect jdbc:hive2://dev-thdp5:1 sanjay.subraman...@wizecommerce.com

Re: Hive cli Vs beeline cli

2013-08-16 Thread Sanjay Subramanian
Some notes from my experience:

  *   With Beeline u have the benefit of being able to use LDAP/Kerberos 
authentication
  *   I am not sure how to use the -e and -f options with Beeline, which are very 
strong with the hive CLI (see the note after this list)
  *   Beeline at the present version may not fully integrate with Oozie, so if 
u are using Oozie for workflows then hive-action may be an issue
     *   U can solve this though by having both services hiveserver1 and 
hiveserver2 running and blocking the hiveserver1 port from the outside world.
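
(Note on -e/-f: newer Beeline releases do accept them along with a JDBC URL; a 
hypothetical invocation, assuming HiveServer2 listens on port 10000:

beeline -u jdbc:hive2://dev-thdp5:10000 -n hiveuser1 -p secret -e "select 1"

Whether your Beeline build supports -e/-f depends on the Hive version, so treat 
this as a sketch.)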

Thanks

sanjay

From: pandees waran pande...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 14, 2013 10:41 PM
To: user@hive.apache.org
Subject: Hive cli Vs beeline cli


Hi pros,
Based on your experience with the beeline cli, could you please share your 
thoughts on the advantages of using the beeline cli over the default hive cli?
Please share if you find any useful link for this info.

Thanks
Pandeeswaran



Re: FAILED: Error in metadata: MetaException

2013-08-16 Thread Sanjay Subramanian
Hi Ankit

Do u have a directory on HDFS at /user/hive/warehouse?
Its permission should be 1777.
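
A quick way to create it with those permissions, assuming the warehouse really 
lives on HDFS (the sticky bit in 1777 keeps users from deleting each other's files):

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod 1777 /user/hive/warehouse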

sanjay

From: Ankit Bhatnagar ank...@yahoo-inc.com
Reply-To: user@hive.apache.org
Date: Friday, August 16, 2013 12:02 PM
To: user@hive.apache.org
Subject: FAILED: Error in metadata: MetaException

Hi folks,


FAILED: Error in metadata: MetaException(message:file:/user/hive/warehouse/src 
is not a directory or unable to create one)


drwxrwxrwx   - hadoop supergroup  0 2013-08-16 18:31 
/user/hive/warehouse/src


Is it some permission issue?

Ankit



Re: SHOW ALL ROLES

2013-08-16 Thread Sanjay Subramanian
Ok never mind, this will work just fine for me:

mysql> select ROLE_NAME from ROLES;

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Friday, August 16, 2013 6:34 PM
To: user@hive.apache.org
Subject: SHOW ALL ROLES

Hi
How do I display all ROLES defined in hive thru CLI ?
Thanks
sanjay



Re: Hive and Lzo Compression

2013-08-14 Thread Sanjay Subramanian
That is really interesting… let me try and think of a reason… meanwhile, any other 
LZO Hive Samurais out there? Please help with some guidance.

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Wednesday, August 14, 2013 1:15 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression


Thanks for your reply.

The interesting thing I experience is that the SELECT query still works - even 
when I do not specify the STORED AS clause... that puzzles me a bit.


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
To: user@hive.apache.org; w00t w00t w00...@yahoo.de
Sent: Wednesday, 14 August 2013, 3:44
Subject: Re: Hive and Lzo Compression

Hi

I think the CREATE TABLE without the STORED AS clause will not give any errors 
while creating the table.
However, when you query that table, since it contains .lzo files, you would get 
errors. With external tables, u r separating the table creation (definition) from 
the data. So only at the time of querying that table might Hive report errors.

LZO compression rocks! I am so glad I used it in our projects here.

Regards

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Tuesday, August 13, 2013 12:13 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression

Thanks for your replies and the link.

I could get it working, but wondered why the CREATE TABLE statement worked 
without the STORED AS Clause as well...that's what puzzles me a bit...

But I will use the STORED AS Clause to be on the safe side.



From: Lefty Leverenz leftylever...@gmail.com
To: user@hive.apache.org
CC: w00t w00t w00...@yahoo.de
Sent: Saturday, 10 August 2013, 19:06
Subject: Re: Hive and Lzo Compression

I'm not seeing any documentation link in Sanjay's message, so here it is again 
(in the Hive wiki's language manual):  
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO.


On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Please refer to the documentation here.
Let me know if u need more clarifications so that we can make this document 
better and more complete.

Thanks

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression


Hello,

I have started to run Hive with LZO compression on Hortonworks 1.2

I have managed to install/configure LZO, and hive -e "set io.compression.codecs" 
shows me the LZO codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings, that in the CREATE TABLE statement, I have to use 
the following STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';

It now works without any problems to execute SELECT statements on this table 
with LZO data.

However I also created a table on the same data without this STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo_tst (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
LOCATION '/user/myuser/data/in/lzo_compressed';

The interesting thing is, it works as well when I execute a SELECT statement 
on this table.

Can you help explain why the second CREATE TABLE statement works as well?
What should I use in DDLs?
Is it best practice to use the STORED AS clause with 
DeprecatedLzoTextInputFormat? Or should I remove it?


(2) Output and Intermediate Compression Settings

I want to use output compression.

In Programming

Re: Strange error in Hive - Insert INTO

2013-08-14 Thread Sanjay Subramanian
Another reason I can think of: possibly some STRING column in your table has a 
DELIMITER character in it… Once in production I had tab spaces inside the string 
while my table was also defined using TAB as the delimiter.

From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 14, 2013 8:43 AM
To: user@hive.apache.org
Subject: Re: Strange error in Hive - Insert INTO

Hi Jerome,
That's a grandiose sql statement you got there! :)  I find that if you break 
up those nested queries into simple CTAS (CREATE TABLE AS) statements and 
create a cascading effect of referring to the table in the previous step, it 
makes debugging *so* much easier.  In other SQL dialects like DB2 this is 
facilitated by the WITH keyword. Maybe the Hive gurus will implement that some 
day.  But that's a topic for another day.
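
A minimal sketch of that CTAS decomposition, using a simplified version of the 
first inner query from the script below (the table name step1 is hypothetical 
and the join condition is trimmed for illustration):

CREATE TABLE step1 AS
SELECT
  mtransf.id_mag_transfere AS id_magasin,
  v.co_famille             AS co_rgrp_produits,
  sum(v.mt_ca_net_ttc)     AS mt_ca_net_ttc
FROM default.VENTES_FAM v
JOIN default.kpi_magasin mtransf
  ON mtransf.id_magasin = v.id_magasin
GROUP BY mtransf.id_mag_transfere, v.co_famille;

-- the next statement then selects FROM step1 instead of repeating the sub-query,
-- which makes it easy to eyeball step1's columns for the skew described below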

So, all that said, I see that the columns in your CREATE TABLE statement don't 
match the columns in your outermost SELECT statement. In particular, DT_JOUR 
is listed as the 6th column in your CREATE TABLE statement but it appears to be 
the 2nd column in your SELECT statement. So something looks fishy there.

My guess is that ultimately you're missing a comma somewhere in the select list, 
so Hive is eating a column as a column alias and all your data is skewed over by 
one column. This happens not so infrequently, since it is valid SQL.

Long winded answer to a simple question. Apologies up front!


On Wed, Aug 14, 2013 at 5:35 AM, Jérôme Verdier verdier.jerom...@gmail.com wrote:
Hi everybody,

I faced a strange error in Hive today.

I have launch a hive script to make some calculations, joins, union, etc... and 
then insert these results in over hive table.

Everything is working fine (.hql is working, full ok, data are imported), but 
one field (CO_RGRP_PRODUITS) is very strange.

after the insert, CO_RGRP_PRODUITS is looking like a TIMESTAMP (1970-01-01 
01:00:00) instead of being a simple STRING.

I precise that source field are simple string like this  : 0101380,  for example

What is going wrong here?

You can find my script below (create table and .hql insert/calculations)

Thanks for your help.


CREATE SCRIPT:
--THM_CA_RGRP_PRODUITS_JOUR
CREATE TABLE default.THM_CA_RGRP_PRODUITS_JOUR (
CO_SOCIETE BIGINT,
TYPE_ENTITE STRING,
CODE_ENTITE STRING,
TYPE_RGRP_PRODUITS STRING,
CO_RGRP_PRODUITS STRING,
DT_JOUR TIMESTAMP,
MT_CA_NET_TTC FLOAT,
MT_OBJ_CA_NET_TTC FLOAT,
NB_CLIENTS FLOAT,
MT_CA_NET_TTC_COMP FLOAT,
MT_OBJ_CA_NET_TTC_COMP FLOAT,
NB_CLIENTS_COMP FLOAT);

INSERT SCRIPT :

INSERT INTO TABLE THM_CA_RGRP_PRODUITS_JOUR

  SELECT
      1                            as CO_SOCIETE,    -- to modify => variable
      '2013-01-02 00:00:00.0'      as DT_JOUR,       -- to modify => variable
      'MAG'                        as TYPE_ENTITE,
      m.co_magasin                 as CODE_ENTITE,
      'FAM'                        as TYPE_RGRP_PRODUITS,
      sourceunion.CO_RGRP_PRODUITS as CO_RGRP_PRODUITS,
      SUM(MT_CA_NET_TTC)           as MT_CA_NET_TTC,
      SUM(MT_OBJ_CA_NET_TTC)       as MT_OBJ_CA_NET_TTC,
      SUM(NB_CLIENTS)              as NB_CLIENTS,
      SUM(MT_CA_NET_TTC_COMP)      as MT_CA_NET_TTC_COMP,
      SUM(MT_OBJ_CA_NET_TTC_COMP)  as MT_OBJ_CA_NET_TTC_COMP,
      SUM(NB_CLIENTS_COMP)         as NB_CLIENTS_COMP

    FROM (
      SELECT
        mtransf.id_mag_transfere as ID_MAGASIN,
        v.co_famille             as CO_RGRP_PRODUITS,
        sum(v.mt_ca_net_ttc)     as MT_CA_NET_TTC,
        0                        as MT_OBJ_CA_NET_TTC,
        0                        as NB_CLIENTS,
        sum(v.mt_ca_net_ttc * (CASE WHEN mtransf.flag_mag_comp = 'NC' THEN 0 ELSE 1 END))
                                 as MT_CA_NET_TTC_COMP,
        0                        as MT_OBJ_CA_NET_TTC_COMP,
        0                        as NB_CLIENTS_COMP
      FROM default.VENTES_FAM v
      JOIN default.kpi_magasin mtransf
      ON  mtransf.co_societe = CASE WHEN v.co_societe = 1 THEN 1 ELSE 2 END
      AND mtransf.id_magasin = v.id_magasin
      WHERE
          mtransf.co_societe = 1    -- modify => variable
      AND v.dt_jour = '2013-01-02

Re: Hive and Lzo Compression

2013-08-14 Thread Sanjay Subramanian
I am not sure if in this case the data is loaded
OR a partition is added with a location specified (to some location in HDFS)

Yes u r stating the question correctly

sanjay

From: Nitin Pawar nitinpawar...@gmail.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 14, 2013 10:54 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression

Please correct me if I have understood the question correctly:

You created a table def without mentioning a STORED AS clause,
then you load data into the table from a compressed file,
then do a select query and it still works -
but how did it figure out which compression codec to use?

Am I stating it correctly?



On Wed, Aug 14, 2013 at 11:11 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
That is really interesting… let me try and think of a reason… meanwhile, any other 
LZO Hive Samurais out there? Please help with some guidance.

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Wednesday, August 14, 2013 1:15 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression


Thanks for your reply.

The interesting thing I experience is that the SELECT query still works - even 
when I do not specify the STORED AS clause... that puzzles me a bit.


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
To: user@hive.apache.org; w00t w00t w00...@yahoo.de
Sent: Wednesday, 14 August 2013, 3:44
Subject: Re: Hive and Lzo Compression

Hi

I think the CREATE TABLE without the STORED AS clause will not give any errors 
while creating the table.
However, when you query that table, since it contains .lzo files, you would get 
errors. With external tables, u r separating the table creation (definition) from 
the data. So only at the time of querying that table might Hive report errors.

LZO compression rocks! I am so glad I used it in our projects here.

Regards

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Tuesday, August 13, 2013 12:13 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression

Thanks for your replies and the link.

I could get it working, but wondered why the CREATE TABLE statement worked 
without the STORED AS Clause as well...that's what puzzles me a bit...

But I will use the STORED AS Clause to be on the safe side.



From: Lefty Leverenz leftylever...@gmail.com
To: user@hive.apache.org
CC: w00t w00t w00...@yahoo.de
Sent: Saturday, 10 August 2013, 19:06
Subject: Re: Hive and Lzo Compression

I'm not seeing any documentation link in Sanjay's message, so here it is again 
(in the Hive wiki's language manual):  
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO.


On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:
Please refer to the documentation here.
Let me know if u need more clarifications so that we can make this document 
better and more complete.

Thanks

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression


Hello,

I have started to run Hive with LZO compression on Hortonworks 1.2

I have managed to install/configure LZO, and hive -e "set io.compression.codecs" 
shows me the LZO codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings, that in the CREATE TABLE statement, I have to use 
the following STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW

Re: Review Request (wikidoc): LZO Compression in Hive

2013-08-14 Thread Sanjay Subramanian
Once again, I am down on my knees humbly calling upon the Hive Jedi Masters 
to please provide this padawan with cwiki update privileges

May the Force be with u

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Wednesday, July 31, 2013 9:38 AM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Re: Review Request (wikidoc): LZO Compression in Hive

Hi guys

Any chance I could get cwiki update privileges today ?

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Tuesday, July 30, 2013 4:26 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Review Request (wikidoc): LZO Compression in Hive

Hi

Met with Lefty this afternoon and she was kind to spend time to add my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) - new bullet under File Formats
  2.  LZO Compression (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO)
  3.  CREATE TABLE (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable) - near end of section, pasted in here:
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more about 
CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) 
if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO Compression above).

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It will be great if I could get edit privileges

Thanks
sanjay



Re: Hive and Lzo Compression

2013-08-13 Thread Sanjay Subramanian
Hi

I think the CREATE TABLE without the STORED AS clause will not give any errors 
while creating the table.
However, when you query that table, since it contains .lzo files, you would get 
errors.
With external tables, you are separating the table creation (definition) from 
the data. So only at the time of querying that table might Hive report errors.

LZO compression rocks! I am so glad I used it in our projects here.

Regards

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Tuesday, August 13, 2013 12:13 AM
To: user@hive.apache.org
Subject: Re: Hive and Lzo Compression

Thanks for your replies and the link.

I could get it working, but wondered why the CREATE TABLE statement worked 
without the STORED AS clause as well... that's what puzzles me a bit...

But I will use the STORED AS clause to be on the safe side.



From: Lefty Leverenz leftylever...@gmail.com
To: user@hive.apache.org
CC: w00t w00t w00...@yahoo.de
Sent: 19:06 Saturday, 10 August 2013
Subject: Re: Hive and Lzo Compression

I'm not seeing any documentation link in Sanjay's message, so here it is again 
(in the Hive wiki's language manual):  
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO.


On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:
Please refer to this documentation here.
Let me know if you need more clarifications so that we can make this document 
better and complete.

Thanks

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression


Hello,

I have started to run Hive with LZO compression on Hortonworks 1.2.

I have managed to install/configure LZO, and hive -e "set 
io.compression.codecs" shows me the LZO codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings that, in the CREATE TABLE statement, I have to use 
the following STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';

It now works without any problems to execute SELECT statements on this table 
with LZO data.

However I also created a table on the same data without this STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo_tst (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
LOCATION '/user/myuser/data/in/lzo_compressed';

The interesting thing is, it works as well when I execute a SELECT statement 
on this table.

Can you explain why the second CREATE TABLE statement works as well?
What should I use in DDLs?
Is it best practice to use the STORED AS clause with 
DeprecatedLzoTextInputFormat? Or should I remove it?


(2) Output and Intermediate Compression Settings

I want to use output compression.

In Programming Hive by Capriolo, Wampler, and Rutherglen the following commands 
are recommended:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  However, in some other places in forums, I found the following 
recommended settings:
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec

Am I right that the first settings are for Hadoop versions prior to 0.23?
Or is there any other reason why the settings are different?

I am using Hadoop 1.1.2 with Hive 0.10.0.
Which settings would you recommend to use?

--
  I also want to compress intermediate results.

 Again, in  Programming Hive the following settings are recommended:
 SET hive.exec.compress.intermediate=true;
 SET 
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  Is this the right setting?

  Or should I again use the settings (which look more valid for Hadoop 
0.23 and greater)?:
  SET hive.exec.compress.intermediate=true;
  SET 
mapreduce.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Re: LZO output compression

2013-08-13 Thread Sanjay Subramanian
Check this class where these are defined
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/src/mapred/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Tuesday, August 13, 2013 2:39 AM
To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Subject: Re: LZO output compression

Oh, I could get it working using these settings:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

But I have one question where maybe one of you can help me with an explanation:
As I am running Hadoop 1.1.*, why do I need the old command for Hadoop 0.20?:
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

I assumed the commands for the newer Hadoop versions are:
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
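One hedged way to check which property name your build actually honors is to 
echo the values back from the Hive CLI (set with no value prints the current 
setting):

hive -e "set mapred.output.compression.codec; set mapreduce.output.fileoutputformat.compress.codec;"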



From: w00t w00t w00...@yahoo.de
To: user@hive.apache.org
Sent: 11:26 Tuesday, 13 August 2013
Subject: LZO output compression

Hello,

I am running Hortonworks 1.2 using Hadoop 1.1.2.21 and Hive 0.10.0.21.

I set up LZO compression and can read LZO compressed data without problems.

My next try was to test output compression.
Therefore, I created the following small script:
--
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

DROP TABLE IF EXISTS simple_lzo;

CREATE TABLE simple_lzo
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS
SELECT count(*)
FROM txt_table_lzo;

The output gets compressed, but with the default deflate codec - not with LZO.

Do you know what the problem could be here and how I could debug it?
There are no error messages or anything.

Additionally, I also tried the commands for Hadoop 0.20:
mapred.output.compress=true;
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec

That didn't work either.

In Pig or Java MR, I have no problems generating LZO compressed output.
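One quick, hedged way to see which codec actually took effect is to list the 
table's files and check the extension (.lzo vs .deflate); the warehouse path 
below is hypothetical:

hdfs dfs -ls /user/hive/warehouse/simple_lzo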

Thanks





Hiveserver2 Beeline command clarification

2013-08-13 Thread Sanjay Subramanian
Hi guys

I just hooked up HiveServer2 to LDAP. In Beeline I realized you can log in like 
the following (you don't need to specify org.apache.hive.jdbc.HiveDriver):

beeline !connect jdbc:hive2://dev-thdp5:1 
sanjay.subraman...@wizecommerce.com
scan complete in 2ms
Connecting to jdbc:hive2://dev-thdp5:1
Enter password for jdbc:hive2://dev-thdp5:1: 
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://dev-thdp5:1 show tables;
+--+
| tab_name |
+--+
| keyword_impressions_log  |
+--+
1 row selected (1.574 seconds)
0: jdbc:hive2://dev-thdp5:1

If this is also a correct way to use Beeline, then I actually prefer this, 
since the password is not visible.

sanjay






hive -h hostname option bypasses ROLES and access permissions ?

2013-08-12 Thread Sanjay Subramanian
Hi

Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)


<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>



Linux User = hiveuser1 (no hive permissions)

CASE 1
hive -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Authorization failed:No privilege 'Select' found for inputs { database:default, 
table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use 
show grant to get more details.
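As a hedged illustration of the check that error message suggests (user and 
table names taken from the case above), Hive's legacy authorization supports:

SHOW GRANT USER hiveuser1 ON TABLE outpdir_ptitle_explanation_parsed;
GRANT SELECT ON TABLE outpdir_ptitle_explanation_parsed TO USER hiveuser1;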

CASE 2 (use the -h option)
hive -h localhost -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Shows results !!!

Why does the -h option bypass authorization?

Thanks

sanjay




Does hiveserver2 support -e and -f options eom

2013-08-12 Thread Sanjay Subramanian






Re: Hive and Lzo Compression

2013-08-10 Thread Sanjay Subramanian
Thanks Lefty.

Sent from my iPhone

On Aug 10, 2013, at 10:08 AM, Lefty Leverenz 
leftylever...@gmail.com wrote:

I'm not seeing any documentation link in Sanjay's message, so here it is again 
(in the Hive wiki's language manual):  
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO.


On Thu, Aug 8, 2013 at 3:30 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:
Please refer to this documentation here.
Let me know if you need more clarifications so that we can make this document 
better and complete.

Thanks

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression


Hello,

I have started to run Hive with LZO compression on Hortonworks 1.2.

I have managed to install/configure LZO, and hive -e "set 
io.compression.codecs" shows me the LZO codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings that, in the CREATE TABLE statement, I have to use 
the following STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';

It now works without any problems to execute SELECT statements on this table 
with LZO data.

However I also created a table on the same data without this STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo_tst (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
LOCATION '/user/myuser/data/in/lzo_compressed';

The interesting thing is, it works as well when I execute a SELECT statement 
on this table.

Can you explain why the second CREATE TABLE statement works as well?
What should I use in DDLs?
Is it best practice to use the STORED AS clause with 
DeprecatedLzoTextInputFormat? Or should I remove it?


(2) Output and Intermediate Compression Settings

I want to use output compression.

In Programming Hive by Capriolo, Wampler, and Rutherglen the following commands 
are recommended:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  However, in some other places in forums, I found the following 
recommended settings:
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec

Am I right that the first settings are for Hadoop versions prior to 0.23?
Or is there any other reason why the settings are different?

I am using Hadoop 1.1.2 with Hive 0.10.0.
Which settings would you recommend to use?

--
  I also want to compress intermediate results.

 Again, in  Programming Hive the following settings are recommended:
 SET hive.exec.compress.intermediate=true;
 SET 
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  Is this the right setting?

  Or should I again use the settings (which look more valid for Hadoop 
0.23 and greater)?:
  SET hive.exec.compress.intermediate=true;
  SET 
mapreduce.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Thanks







-- Lefty


Re: Hive and Lzo Compression

2013-08-08 Thread Sanjay Subramanian
Please refer to this documentation here.
Let me know if you need more clarifications so that we can make this document 
better and complete.

Thanks

sanjay

From: w00t w00t w00...@yahoo.de
Reply-To: user@hive.apache.org, w00t w00t w00...@yahoo.de
Date: Thursday, August 8, 2013 2:02 AM
To: user@hive.apache.org
Subject: Hive and Lzo Compression


Hello,

I have started to run Hive with LZO compression on Hortonworks 1.2.

I have managed to install/configure LZO, and hive -e "set 
io.compression.codecs" shows me the LZO codecs:
io.compression.codecs=
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec

However, I have some questions where I would be happy if you could help me.

(1) CREATE TABLE statement

I read in different postings that, in the CREATE TABLE statement, I have to use 
the following STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/myuser/data/in/lzo_compressed';

It now works without any problems to execute SELECT statements on this table 
with LZO data.

However I also created a table on the same data without this STORAGE clause:

CREATE EXTERNAL TABLE txt_table_lzo_tst (
   txt_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ''
LOCATION '/user/myuser/data/in/lzo_compressed';

The interesting thing is, it works as well when I execute a SELECT statement 
on this table.

Can you explain why the second CREATE TABLE statement works as well?
What should I use in DDLs?
Is it best practice to use the STORED AS clause with 
DeprecatedLzoTextInputFormat? Or should I remove it?


(2) Output and Intermediate Compression Settings

I want to use output compression.

In Programming Hive by Capriolo, Wampler, and Rutherglen the following commands 
are recommended:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  However, in some other places in forums, I found the following 
recommended settings:
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec

Am I right that the first settings are for Hadoop versions prior to 0.23?
Or is there any other reason why the settings are different?

I am using Hadoop 1.1.2 with Hive 0.10.0.
Which settings would you recommend to use?

--
  I also want to compress intermediate results.

 Again, in  Programming Hive the following settings are recommended:
 SET hive.exec.compress.intermediate=true;
 SET 
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

  Is this the right setting?

  Or should I again use the settings (which look more valid for Hadoop 
0.23 and greater)?:
  SET hive.exec.compress.intermediate=true;
  SET 
mapreduce.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Thanks






Hive query hive_server_host A and write results to hadoop cluster B

2013-08-08 Thread Sanjay Subramanian
Hi guys

Perhaps you know this already, but it is very useful. This directly creates a 
file, based on the output of this query, on the name_node_host_2 HDFS cluster.

Regards

Sanjay

hive -h hive_server_host1 -e "hive_query_string" | hdfs dfs -put - 
hdfs://name_node_host_2:port/path/to/ur/dir/your_file_name
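For example (a hedged sketch - the table name, port, and output path are 
hypothetical; hdfs dfs -put - reads from stdin):

hive -h hive_server_host1 -e "select * from my_table" | hdfs dfs -put - hdfs://name_node_host_2:8020/data/out/my_table.tsv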



Re: Hive Query Issue

2013-08-07 Thread Sanjay Subramanian
Hi

Some quick checks

Please don't mind if my questions sound trivial

1. Is your hdfs cluster or pseudo-distributed node up and running?
 Can you see the HDFS at http://host:50070 ?

2. Is your mrV1 or Yarn (mrV2) up and running?
2a) mrV1 http://host:50030
2b) YARN http://host:8088

3. Is your hive-default.xml pointing to the job tracker?

4. Can you run a MR job on your cluster independent of Hive?
 You can try the WordCount that comes standard with the examples (see the 
sketch below).
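A hedged sketch of such a standalone check (the examples jar name and path vary 
by distribution, and the input/output paths are hypothetical):

hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wc_in /tmp/wc_out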

Thanks

sanjay

From: Manickam P manicka...@outlook.com
Reply-To: user@hive.apache.org
Date: Tuesday, August 6, 2013 4:10 AM
To: user@hive.apache.org
Subject: RE: Hive Query Issue

Hi,

Thanks for your reply. I have a small test table which has around 10 records. 
When I run select * from table.. it launched a job and that got completed 
successfully. When I run the same query on my actual table, no job got 
initiated; it got stuck. I did not get any error in my task tracker either. All 
my data nodes are up and running.

I don't have any clue here.


Thanks,
Manickam P


Date: Tue, 6 Aug 2013 16:24:38 +0530
Subject: Re: Hive Query Issue
From: nitinpawar...@gmail.com
To: user@hive.apache.org

when you run select * from table .. it does not launch a mapreduce job,

whereas when you put some condition, it does need to process the data, so it 
launches a mapreduce job

now when you start this query, go to your jobtracker page and see how many jobs 
are running. Is it able to start your job? If not, why?
that will tell you why it's stuck


On Tue, Aug 6, 2013 at 3:22 PM, Manickam P 
manicka...@outlook.com wrote:
Hi,

If I run a hive query like select * from table_name limit 10 it is working.
If I run a query like select column_name from table_name limit 10 it hangs.
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
It just prints the above, and after that nothing happens in the hive shell.

Can you please help me on this.


Thanks,
Manickam P



--
Nitin Pawar



Re: Hive UDAF extending UDAF class: iterate or evaluate method

2013-08-07 Thread Sanjay Subramanian
Please follow the guidance for UDFs provided in the Programming Hive book by 
Wampler/Capriolo. That will work for you.

I can say with confidence that their book was mighty helpful to me in my 
project from start to production...

And I would recommend going ahead with one approach, implementing it, and then 
fine-tuning; otherwise you will be in analysis-paralysis mode…

We are all on a  path of discovery here ...

Regards

sanjay

From: Ritesh Agrawal ragra...@netflix.com
Reply-To: user@hive.apache.org
Date: Wednesday, August 7, 2013 5:57 PM
To: user@hive.apache.org
Subject: Re: Hive UDAF extending UDAF class: iterate or evaluate method

Hi Sanjay, Lefty

Thanks for the help, but none of the above responses directly answers my 
question (probably I am not asking clearly enough :-(  ).

Below I have two different structures of a UDAF (aggregation function). My 
question is: which one is the preferred/right approach?

http://pastebin.com/QCgd4Hxc  : This version is based on what I could 
understand from the API docs about the UDAF class.

http://pastebin.com/Uctamtek : This version is based on the book Hadoop: The 
Definitive Guide. Notice the function names are different from the first one.

I hope this clarifies my question.

Thanks
Ritesh





On Wed, Aug 7, 2013 at 5:34 PM, Lefty Leverenz 
leftylever...@gmail.com wrote:
Sounds like the wikidoc needs some work.  I'm open to suggestions.  If Sanjay's 
simple UDF helps, I could put it in the wiki along with any advice you think 
would help.

Does anyone else have use cases to contribute?

-- Lefty


On Mon, Aug 5, 2013 at 2:45 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:
Hi Ritesh

To help you get started, I am writing a simple HelloWorld-ish UDF that might 
help… If it doesn't, please ask for more clarifications...

Good Luck
Thanks

sanjay


ToUpperCase.java


package com.sanjaysubramanian.utils.hive.udf;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class ToUpperCase extends UDF {

    // Simple UDF: upper-cases its input, passing nulls through unchanged
    protected final Log logger = LogFactory.getLog(ToUpperCase.class);

    public String evaluate(final String inputString) {
        if (inputString != null) {
            return inputString.toUpperCase();
        } else {
            return inputString;
        }
    }
}



Usage in a Hive script

hive -e "
create temporary function toupper as 
'com.sanjaysubramanian.utils.hive.udf.ToUpperCase';

SELECT
  first_name,
  toupper(first_name)
FROM
  company_names;
"
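One step the script above assumes (a hedged note): the JAR containing the 
compiled UDF class typically has to be added to the session first, e.g. with 
ADD JAR (the path below is hypothetical):

hive -e "
ADD JAR /path/to/my-hive-udfs.jar;
create temporary function toupper as 'com.sanjaysubramanian.utils.hive.udf.ToUpperCase';
SELECT first_name, toupper(first_name) FROM company_names;
"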


***

From: Ritesh Agrawal ragra...@netflix.com
Reply-To: user@hive.apache.org
Date: Monday, August 5, 2013 9:41 AM
To: user@hive.apache.org
Subject: Re: Hive UDAF extending UDAF class: iterate or evaluate method

Hi Lefty,

I used the wiki you sent to write my first version of the UDAF. However, I 
found it to be utterly complex, especially for storing partial results, as I am 
not very familiar with the Hive API. Then I found another example of a UDAF in 
the Hadoop: The Definitive Guide book, and it had much simpler code but using 
different methods. Instead of using the iterate method it was using the 
evaluate method, and so I am getting confused.

Ritesh


On Sun, Aug 4, 2013 at 2:18 PM, Lefty Leverenz 
leftylever...@gmail.com wrote:
You might find this wikidoc useful:  GenericUDAFCaseStudy: 
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy.

The O'Reilly book Programming Hive also has a section called User-Defined 
Aggregate Functions in chapter 13 (Functions), pages 172 to 176.

-- Lefty


On Sun, Aug 4, 2013 at 7:12 AM, Ritesh Agrawal 
ragra...@netflix.com wrote:
Hi all,

I am trying to write a UDAF function. I found an example that shows how to 
implement a UDAF in the Hadoop: The Definitive Guide book. However, I am a 
little confused. In the book, the author extends the UDAF class and implements 
the init, iterate, terminatePartial, merge and terminate functions. However, 
looking at the Hive docs 
(http://hive.apache.org/docs/r0.11.0/api/org/apache/hadoop/hive/ql/exec/UDAF.html),
it seems I need to implement the init, aggregate, evaluatePartial, 
aggregatePartial and evaluate functions. Please let me know which are the right 
functions to implement.

Ritesh



--
Lefty



Re: Hive Thrift Service - Not Running Continuously

2013-08-06 Thread Sanjay Subramanian
At least that would ensure asynchronous running, and it would not die if your 
session dies.

Another way I would propose is to have a screen session dedicated to the hive 
server and start it in synchronous mode.
Create a screen session
==
screen -S  my_awesome_hive_server
## gets you into the screen session
## Start the hive server
hive --service hiveserver
## Press Ctrl A D to detach from screen session

To go back to your screen session

screen -x  my_awesome_hive_server
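A hedged sketch of the nohup alternative suggested below (the log path is 
hypothetical):

nohup hive --service hiveserver > /var/log/hive/hiveserver.out 2>&1 &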

From: Bhaskar, Snehalata snehalata_bhas...@syntelinc.com
Reply-To: user@hive.apache.org
Date: Monday, August 5, 2013 9:34 PM
To: Raj Hadoop hadoop...@yahoo.com
Cc: user@hive.apache.org
Subject: RE: Hive Thrift Service - Not Running Continuously

Can you please try executing the “nohup hive --service hiveserver” command?

Regards,
Snehalata

From: Raj Hadoop [mailto:hadoop...@yahoo.com]
Sent: Monday, August 05, 2013 9:59 PM
To: Hive
Subject: Hive Thrift Service - Not Running Continuously

Hi,


The hive thrift service is not running continuously. I had to execute the 
command (hive --service hiveserver) very frequently. Can anyone help me on 
this?

Thanks,
Raj



Re: Hive Thrift Service - Not Running Continuously

2013-08-05 Thread Sanjay Subramanian
Can you see in the logs why the service is dying? They will have some clues 
there… sometimes the hive log directory can get full and that might kill it. I 
have seen that happen in our install here.

From: Raj Hadoop hadoop...@yahoo.com
Reply-To: user@hive.apache.org, Raj Hadoop hadoop...@yahoo.com
Date: Monday, August 5, 2013 9:29 AM
To: Hive user@hive.apache.org
Subject: Hive Thrift Service - Not Running Continuously

Hi,


The hive thrift service is not running continuously. I had to execute the 
command (hive --service hiveserver) very frequently. Can anyone help me on 
this?

Thanks,
Raj



Is there a way to disable -h hostname option ?

2013-08-02 Thread Sanjay Subramanian
Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)

Is there a way to disable the -h hostname option?

That way, I can prevent anyone on the network from dropping/altering tables.

Like I said below, even if I create a ROLE with only SELECT permission, a user 
on the network can DROP tables with the -h hostname option.

Thanks

Sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Thursday, August 1, 2013 6:37 PM
To: user@hive.apache.org
Subject: Hive Authorization is bypassed with -h option

Hi

Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)


<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>



Linux User = hiveuser1 (no hive permissions)

CASE 1
hive -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Authorization failed:No privilege 'Select' found for inputs { database:default, 
table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use 
show grant to get more details.

CASE 2 (use the -h option)
hive -h localhost -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Shows results !!!

Why does the -h option bypass authorization?

Thanks

sanjay




Nice hive notes and cheat sheets from Minwoo Kim

2013-08-01 Thread Sanjay Subramanian
http://julingks.tistory.com/category/Hive

Thanks

sanjay



Hive Authorization is bypassed with -h option

2013-08-01 Thread Sanjay Subramanian
Hi

Hive version 0.9.0 (hive-common-0.9.0-cdh4.1.2.jar)


<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
</property>



Linux User = hiveuser1 (no hive permissions)

CASE 1
hive -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Authorization failed:No privilege 'Select' found for inputs { database:default, 
table:outpdir_ptitle_explanation_parsed, columnName:header_servername}. Use 
show grant to get more details.

CASE 2 (use the -h option)
hive -h localhost -e "select * from outpdir_ptitle_explanation_parsed limit 10"
Shows results !!!

Why does the -h option bypass authorization?

Thanks

sanjay




Re: Review Request (wikidoc): LZO Compression in Hive

2013-07-31 Thread Sanjay Subramanian
Hi guys

Any chance I could get cwiki update privileges today ?

Thanks

sanjay

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Tuesday, July 30, 2013 4:26 PM
To: user@hive.apache.org
Cc: d...@hive.apache.org
Subject: Review Request (wikidoc): LZO Compression in Hive

Hi

Met with Lefty this afternoon and she was kind enough to spend time adding my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual (new bullet under File Formats)
  2.  LZO Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
  3.  CREATE TABLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable (near end of section, pasted in here:)
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more 
about CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) 
if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO 
Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO).

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It would be great if I could get edit privileges

Thanks
sanjay



Review Request (wikidoc): LZO Compression in Hive

2013-07-30 Thread Sanjay Subramanian
Hi

Met with Lefty this afternoon and she was kind enough to spend time adding my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual (new bullet under File Formats)
  2.  LZO Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
  3.  CREATE TABLE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable (near end of section, pasted in here:)
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more 
about CompressedStorage (https://cwiki.apache.org/confluence/display/Hive/CompressedStorage) 
if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
com.hadoop.mapred.DeprecatedLzoTextInputFormat OUTPUTFORMAT 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' (see LZO 
Compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO).

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It would be great if I could get edit privileges

Thanks
sanjay



Possible release date for Hive 0.12.0 ?

2013-07-29 Thread Sanjay Subramanian
Hi guys

When is a stable Hive 0.12.0 expected?

I have a use case that needs this fixed, and it looks like it's fixed in 0.12.0:

https://issues.apache.org/jira/browse/HIVE-3603

Sanjay






Re: Need help in joining 2 tables

2013-07-26 Thread Sanjay Subramanian
Hi Rams

Please don't think I am sermonizing or preaching, and please don't mind what I 
am saying :-)

This community is there to help you, and there is no doubt about that.

However I am assuming you tried out a few options by yourself before you 
reached out to the community with your question.

Since the whole community is open source and everyone is busy with projects, I 
think it will be a better utilization of everyone's time if you spend a few 
minutes first playing around with a few options, juxtaposing syntax if needed, 
and seeing what errors you get (if any), and then ask your question.
Many times an error for a given syntax might give new ideas to community 
developers.

If this makes sense, I am happy…..if not, I don't mind :-)

Good luck with Hive… it's a great technology and there are many people working 
very hard to make it better every day.

Regards

sanjay


From: Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, July 26, 2013 1:17 AM
To: user@hive.apache.org
Subject: Need help in joining 2 tables

Hi,

Need to create a view in HIVE as per the below query (written in Oracle). Need 
help to provide the equivalent query in HIVE to create the view.


EMP_ADDRESS

Emp_Id
Address1
Address2
Address3
Address4
City
State
Pin
Country
Alternate_Address1
Alternate_Address2
Alternate_Address3
Alternate_Address4
Alternate_City
Alternate_State
Alternate_Pin
Alternate_Country
row_create_date


EMP_MASTER

Emp_Id
Name
DOB
Department
Sex
FathersName
row_create_date


View Query

select
MAST.Emp_Id,
MAST.Name,
MAST.DOB,
ADDR.Address1,
ADDR.Address2,
ADDR.Address3,
ADDR.Address4,
ADDR.City,
ADDR.State,
ADDR.Pin,
ADDR.Country
from EMP_MASTER MAST, EMP_ADDRESS ADDR
where
MAST.row_create_date = (select max(row_create_date) from EMP_MASTER where 
Emp_Id = MAST.Emp_Id)
and ADDR.row_create_date = (select max(row_create_date) from EMP_ADDRESS where 
Emp_Id = ADDR.Emp_Id)
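A hedged sketch of one way to express this in HiveQL (which, in these versions, 
does not support correlated subqueries in the WHERE clause), using derived 
tables for the max dates; the view name is hypothetical and the column list is 
abbreviated:

CREATE VIEW emp_latest_address_vw AS
SELECT mast.Emp_Id, mast.Name, mast.DOB,
       addr.Address1, addr.Address2, addr.Address3, addr.Address4,
       addr.City, addr.State, addr.Pin, addr.Country
FROM EMP_MASTER mast
JOIN (SELECT Emp_Id, max(row_create_date) AS max_date
      FROM EMP_MASTER GROUP BY Emp_Id) m2
  ON mast.Emp_Id = m2.Emp_Id AND mast.row_create_date = m2.max_date
JOIN EMP_ADDRESS addr
  ON addr.Emp_Id = mast.Emp_Id
JOIN (SELECT Emp_Id, max(row_create_date) AS max_date
      FROM EMP_ADDRESS GROUP BY Emp_Id) a2
  ON addr.Emp_Id = a2.Emp_Id AND addr.row_create_date = a2.max_date;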


regards,
Rams



Re: Merging different HDFS file for HIVE

2013-07-26 Thread Sanjay Subramanian
We have a similar situation in production… for your case I would propose the 
following steps:

1. Design a map reduce job (Job Output format - Text, Lzo, Snappy, your choice)
 Inputs to Mapper
 -- records from these three feeds
Outputs from Mapper
 -- Key = EMP1   Value = feed1~field1  field2  field6  field9
 -- Key = EMP1   Value = feed2~field5  field7  field10
 -- Key = EMP1   Value = feed3~field3  field4  field8

   Reducer Output
 -- Key = EMP1   Value = field1  field2  field3  field4  field5  field6  
field7  field8  field9  field10

2. (Optional) If you use LZO then you will need to run the LzoIndexer

3. CREATE TABLE IF NOT EXISTS YOUR_HIVE_TABLE

4. ALTER TABLE ADD PARTITION (foo1 = , foo2 = ) LOCATION 'path/to/files' (see 
the sketch below)
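A hedged DDL sketch of steps 3-4 (the table name, columns, partition key, and 
paths are hypothetical):

CREATE EXTERNAL TABLE IF NOT EXISTS your_hive_table (
  emp_id STRING,
  field1 STRING,
  field2 STRING
)
PARTITIONED BY (datekey STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ALTER TABLE your_hive_table ADD PARTITION (datekey='20130726')
LOCATION '/path/to/files';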


From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, July 26, 2013 4:37 PM
To: user@hive.apache.org
Subject: Re: Merging different HDFS file for HIVE

i like #2.

so you have three, say, external tables representing your three feed files. 
After the third and final file is loaded then join 'em all together - maybe 
make the table partitioned for one per day.

for example:

alter table final add partition (datekey=MMDD);
insert overwrite table final partition (datekey=MMDD)  select 
EMP_ID,f1,...,f10 from FF1 a join FF2 b on (a.EMP_ID=b.EMP_ID) join FF3 c on 
(b.EMP_ID=c.EMP_ID);


Or a variation on #3: make a view on the three tables, which would look just 
like the select statement above.


What do you want to optimize for?


On Fri, Jul 26, 2013 at 5:30 AM, Nitin Pawar 
nitinpawar...@gmail.com wrote:
Option 1) Use pig or oozie, write a workflow and join the files into a single 
file
Option 2) Create a temp table for each of the different files, then join them 
into a single table and delete the temp tables
Option 3) Don't do anything; change your queries to look at the three different 
files when they query about the different files

Wait for others to give better suggestions :)


On Fri, Jul 26, 2013 at 4:22 PM, Ramasubramanian Narayanan 
ramasubramanian.naraya...@gmail.com wrote:
Hi,

Please help in providing a solution for the below problem... this scenario is 
applicable in banking at least...

I have a HIVE table with the below structure...

Hive Table:
Field1
...
Field 10


For the above table, I will get the values for each feed in a different file. 
You can imagine that these files belong to the same branch and will arrive at 
any time interval. I have to load into the table only if I get all 3 files for 
the same branch. (Assume that we have a common field in all the files to join.)

Feed file 1 :
EMP ID
Field 1
Field 2
Field 6
Field 9

Feed File2 :
EMP ID
Field 5
Field 7
Field 10

Feed File3 :
EMP ID
Field 3
Field 4
Field 8

Now the question is,
what is the best way to merge all these files into a single file so that it can 
be placed under the HIVE structure?

regards,
Rams



--
Nitin Pawar




Re: Merging different HDFS file for HIVE

2013-07-26 Thread Sanjay Subramanian
Hi

I am using Oozie Coordinators to schedule and run daily Oozie Workflows that 
contain 35-40 actions each (I use shell, java, hive and map reduce oozie 
actions).

So if anyone needs help and has questions, please fire away…

sanjay


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Friday, July 26, 2013 6:23 PM
To: user@hive.apache.org
Subject: Re: Merging different HDFS file for HIVE

We have a similar situation in production… for your case I would propose the 
following steps:

1. Design a map reduce job (Job Output format - Text, Lzo, Snappy, your choice)
 Inputs to Mapper
 -- records from these three feeds
Outputs from Mapper
 -- Key = EMP1   Value = feed1~field1  field2  field6  field9
 -- Key = EMP1   Value = feed2~field5  field7  field10
 -- Key = EMP1   Value = feed3~field3  field4  field8

   Reducer Output
 -- Key = EMP1   Value = field1  field2  field3  field4  field5  field6  
field7  field8  field9  field10

2. (Optional) If you use LZO then you will need to run the LzoIndexer

3. CREATE TABLE IF NOT EXISTS YOUR_HIVE_TABLE

4. ALTER TABLE ADD PARTITION (foo1 = , foo2 = ) LOCATION 'path/to/files'


From: Stephen Sprague sprag...@gmail.com
Reply-To: user@hive.apache.org
Date: Friday, July 26, 2013 4:37 PM
To: user@hive.apache.org
Subject: Re: Merging different HDFS file for HIVE

i like #2.

so you have three, say, external tables representing your three feed files. 
After the third and final file is loaded then join 'em all together - maybe 
make the table partitioned for one per day.

for example:

alter table final add partition (datekey=MMDD);
insert overwrite table final partition (datekey=MMDD)  select 
EMP_ID,f1,...,f10 from FF1 a join FF2 b on (a.EMP_ID=b.EMP_ID) join FF3 c on 
(b.EMP_ID=c.EMP_ID);


Or a variation on #3: make a view on the three tables, which would look just 
like the select statement above.


What do you want to optimize for?


On Fri, Jul 26, 2013 at 5:30 AM, Nitin Pawar 
nitinpawar...@gmail.com wrote:
Option 1) Use pig or oozie, write a workflow and join the files into a single 
file
Option 2) Create a temp table for each of the different files, then join them 
into a single table and delete the temp tables
Option 3) Don't do anything; change your queries to look at the three different 
files when they query about the different files

Wait for others to give better suggestions :)


On Fri, Jul 26, 2013 at 4:22 PM, Ramasubramanian Narayanan 
ramasubramanian.naraya...@gmail.com wrote:
Hi,

Please help in providing a solution for the below problem... this scenario is 
applicable in banking at least...

I have a HIVE table with the below structure...

Hive Table:
Field1
...
Field 10


For the above table, I will get the values for each feed in a different file. 
You can imagine that these files belong to the same branch and will arrive at 
any time interval. I have to load into the table only if I get all 3 files for 
the same branch. (Assume that we have a common field in all the files to join.)

Feed file 1 :
EMP ID
Field 1
Field 2
Field 6
Field 9

Feed File2 :
EMP ID
Field 5
Field 7
Field 10

Feed File3 :
EMP ID
Field 3
Field 4
Field 8

Now the question is,
what is the best way to merge all these files into a single file so that it can 
be placed under the HIVE structure?

regards,
Rams



--
Nitin Pawar



Re: Help in debugging Hive Query

2013-07-25 Thread Sanjay Subramanian
The query is correct, but since u r creating a managed table, that is possibly 
creating some issue and the records are not all getting created.

This is what I would propose

CHECKPOINT  1 : Is this query running at all ?
===
Use the option below and run the QUERY ONLY (without any table creation) to 
log errors and pipe to a log file by using nohup or some other way that u prefer:
hive -hiveconf hive.root.logger=INFO,console -e

select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;
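
For example, one way to capture that console log is a sketch like this (the 
log file name is illustrative):

nohup hive -hiveconf hive.root.logger=INFO,console -e "select ... ;" > checkpoint1.log 2>&1 &

Check checkpoint1.log afterwards for map/reduce progress and any stack traces.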


CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these 
additional options
===
Required params:

SET mapreduce.job.maps=500;
SET mapreduce.job.reduces=8;
SET mapreduce.tasktracker.map.tasks.maximum=12;
SET mapreduce.tasktracker.reduce.tasks.maximum=8;
SET 
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;


Optional params:
---
If u r using compression in output, use the following; u can change the 
LzoCodec to whatever u r using for compression
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET mapreduce.output.fileoutputformat.compress=true;
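
For example, a sketch assuming the SET statements above and the CREATE TABLE 
... AS SELECT query are saved together in one script file, checkpoint2.hql 
(the file names are illustrative):

hive -hiveconf hive.root.logger=INFO,console -f checkpoint2.hql 2>&1 | tee checkpoint2.log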


Thanks

Sanjay


From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: user@hive.apache.org, Raj Hadoop <hadoop...@yahoo.com>
Date: Thursday, July 25, 2013 5:00 AM
To: Hive <user@hive.apache.org>
Subject: Help in debugging Hive Query

All,

I am trying to determine visits for customer from omniture weblog file using 
Hive.

Table: omniture_web_data
Columns: visid_high,visid_low,evar23,visit_page_num

Sample Data:
visid_high,visid_low,evar23,visit_page_num
999,888,1003,10
999,888,1003,14
999,888,1003,6
999,777,1003,12
999,777,1003,20

I want to calculate, for each Customer Number (evar23 is the Customer Number), 
total visits. visid_high and visid_low determine a unique visit.
For each distinct visitor, calculate the sum of the maximum visit_page_num. In 
the above example:

14 + 20 = 34 should be the total visits for the customer 1003.

I am trying to run the following queries - Method 1 is almost the same as 
Method 2, except in Method 1 I only choose a particular customer number, 1003. 
In Method 2, I generalized to all customers.

In Method 1, I am getting the accurate result. In Method 2, I am not getting 
the same result as Method 1.

Any suggestions on how to troubleshoot? Also, any alternative approaches?

// Method 1
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data where 
evar23='1003') a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data where evar23='1003' group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

// Result of Method 1

1003    34

// Method 2

create table temp123 as
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

select * from temp123 where evar23='1003';

// The Result of Method 2 is not the same as Method 1. It is showing a 
different number.



Thanks,
Raj





Calling same UDF multiple times in a SELECT query

2013-07-23 Thread Sanjay Subramanian
Hi

We are using version hive-exec-0.9.0-cdh4.1.2 in production

I need to check and use the output from a UDF in a query to assign values to 2 
columns in a SELECT query

Example

SELECT
 a,
 IF(fooUdf(a) < -1, -1, fooUdf(a)) as b,
 IF(fooUdf(a) > -1, fooUdf(a), 0) as c
FROM
 my_hive_table


So will fooUdf be called 4 times ? Or once ?

Why this is important is because in our case this UDF calls a web service and I 
don't want so many calls to the service.

Thanks

sanjay





Re: Calling same UDF multiple times in a SELECT query

2013-07-23 Thread Sanjay Subramanian
Thanks Jan

I will mod my UDF and test it out

I want to make sure I understand your words here
The obvious condition is that it must always return the identical result when 
called with same parameters.

If I can make sure that a call to the web service is successful it will always 
return same output for a given set of input

F(x1,y1) will always equal z1

that’s what u mean right ?

sanjay

From: Jan Dolinár <dolik@gmail.com>
Reply-To: user@hive.apache.org
Date: Tuesday, July 23, 2013 12:35 PM
To: user <user@hive.apache.org>
Subject: Re: Calling same UDF multiple times in a SELECT query

Hi,

If you use this annotation, Hive should be able to optimize it to a single call:

 @UDFType(deterministic = true)

The obvious condition is that it must always return the identical result when 
called with same parameters.

A little bit more on this can be found in Mark Grover's post at 
http://mark.thegrovers.ca/1/post/2012/06/how-to-write-a-hive-udf.html
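
One way to check whether the optimizer really collapses the duplicate calls 
is to compare the plans with EXPLAIN; a sketch, reusing fooUdf and 
my_hive_table from the question:

EXPLAIN SELECT
 a,
 IF(fooUdf(a) < -1, -1, fooUdf(a)) as b,
 IF(fooUdf(a) > -1, fooUdf(a), 0) as c
FROM my_hive_table;

If the optimization applies, the plan's expressions should show the UDF being 
evaluated once rather than four times.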

Regards,
Jan


On Tue, Jul 23, 2013 at 9:25 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
Function return values are not stored for repeated use of the same call (as 
per my understanding)

I know you may have already thought about another approach, such as:

select a, if (call < -1, -1, call) as b from (select a, fooudf(a) as call from 
table) t




On Wed, Jul 24, 2013 at 12:42 AM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
Hi

We are using version hive-exec-0.9.0-cdh4.1.2 in production

I need to check and use the output from a UDF in a query to assign values to 2 
columns in a SELECT query

Example

SELECT
 a,
 IF(fooUdf(a) < -1, -1, fooUdf(a)) as b,
 IF(fooUdf(a) > -1, fooUdf(a), 0) as c
FROM
 my_hive_table


So will fooUdf be called 4 times ? Or once ?

Why this is important is because in our case this UDF calls a web service and I 
don't want so many calls to the service.

Thanks

sanjay






--
Nitin Pawar




Re: export csv, use ',' as split

2013-07-10 Thread Sanjay Subramanian
Hive does not have an output delimiter specifier yet (not sure if 0.11.x may 
have it)

But for now please try the following

hive -e "myquery" | sed 's/\t/,/g' > result.csv
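
For what it's worth, Hive 0.11 did add a row format clause on directory 
exports, so on newer versions a sketch like this should avoid the sed step 
(the output path is illustrative):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/upload_users_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM upload_users;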

Good luck

Sanjay

From: kentkong_work <kentkong_w...@163.com>
Reply-To: user@hive.apache.org
Date: Tuesday, July 9, 2013 9:48 PM
To: user <user@hive.apache.org>
Subject: export csv, use ',' as split

hi here,
   I create a table like this and put a lot of data into it.
   then I export the query result into a csv file like this:
hive -e "myquery" > result.csv

   but the csv uses tab as the delimiter.
   how do I make hive use ','? thanks!

CREATE TABLE if not exists upload_users(
  username string,
  mobile string,
  id_type string,
  id_no string,
  email string,
  address string,
  validate_time string
) partitioned by (fileid string)
row format delimited fields terminated by ',';





Re: Hive CLI

2013-07-09 Thread Sanjay Subramanian
Hi Rahul

Is there a reason why u use the Hive CLI ?

I have aliases defined that I use, so I never had to use Hive CLI again

alias hivescript='hive -e '
alias hivescriptd='hive -hiveconf hive.root.logger=INFO,console -e '

So when I want to run hive commands from Linux  I just type

hivescript "select colA, colB, colC from TableN where partitionA='somval1' 
limit 100"

Good luck

sanjay




From: rahul kavale <kavale.ra...@gmail.com>
Reply-To: user@hive.apache.org
Date: Monday, July 8, 2013 11:26 PM
To: user@hive.apache.org
Subject: Re: Hive CLI

Sorry, my bad - Ctrl-a was a screen binding, which is why I was not able to 
use it for moving the cursor. This had nothing to do with Hive or its 
configuration.

Thanks & Regards,
Rahul


On 9 July 2013 11:00, rahul kavale <kavale.ra...@gmail.com> wrote:
Hey there,
I have been using HIVE(0.7) for a while now using CLI and bash scripts.
But its a pain to move cursor in the CLI i.e. once you enter a very long query 
then you cant go to start of the query (like you do using Ctrl+A/Ctrl+E in 
terminal). Does anyone know how to do it?

Thanks & Regards,
Rahul




Re: integration issure about hive and hbase

2013-07-09 Thread Sanjay Subramanian
I am attaching portions from a document I had written last year while 
investigating Hbase and Hive. You may have already crossed that 
bridge….nevertheless…

Please forgive me :-) if some steps seem hacky and are not very well 
explained….I was on a solo mission to build a Hive Data platform from scratch 
and QDBW (Quick and Dirty But Works) was my philosophy to go ahead !!!

Good luck

Sanjay


=

Hive and Hbase integration on local Fedora desktop guide:
https://wizecommerce.atlassian.net/wiki/display/traffic/Hive+and+Hbase+integration+on+local+Fedora+desktop+guide

Pre-requisites

  *   Hadoop needs to be installed and HDFS needs to be running (Hadoop HDFS 
setup on local Fedora desktop guide: 
https://wizecommerce.atlassian.net/wiki/display/traffic/Hadoop+HDFS+setup+on+local+Fedora+desktop+guide)
  *   Hive needs to be installed (Hive setup on local Fedora desktop guide: 
https://wizecommerce.atlassian.net/wiki/display/traffic/Hive+setup+on+local+Fedora+desktop+guide)
  *   HBase needs to be installed and running (Hbase setup on local Fedora 
desktop guide: 
https://wizecommerce.atlassian.net/wiki/display/traffic/Hbase+setup+on+local+Fedora+desktop+guide)
     *   Make sure ZooKeeper is running on port 2181. If not, stop Hbase, 
change $HBASE_HOME/conf/hbase-site.xml and restart HBase



Copying JARS to HADOOP_CLASSPATH

Before you query tables, copy these jars from $HIVE_HOME/lib to 
$HADOOP_HOME/lib:

  1.  Make sure zookeeper-3.4.3.jar is not there
 *   ls -latr   $HADOOP_HOME/lib/zookeeper-3.4.3.jar
  2.  Copy zookeeper-3.4.3.jar
 *   sudo cp -av $HIVE_HOME/zookeeper-3.4.3.jar $HADOOP_HOME/lib
  3.  Make sure hive-common-0.9.0.jar is not there
 *   ls -latr   $HADOOP_HOME/lib/hive-common-0.9.0.jar
  4.  Copy hive-common-0.9.0.jar
 *   sudo cp -av $HIVE_HOME/hive-common-0.9.0.jar $HADOOP_HOME/lib
  5.  Make sure hive-hbase-handler-0.9.0.jar is not there
 *   ls -latr   $HADOOP_HOME/lib/hive-hbase-handler-0.9.0.jar
  6.  Copy hive-hbase-handler-0.9.0.jar
 *   sudo cp -av $HIVE_HOME/hive-hbase-handler-0.9.0.jar $HADOOP_HOME/lib
  7.  Exit from Hive Shell (type exit;)

  8.  Exit from HBase shell

  9.  Stop Hbase
 *   $HBASE_HOME/bin/stop-hbase.sh
  10. Stop Hadoop/HDFS
 *   $HADOOP_HOME/bin/stop-all.sh
  11. Check if NO java processes related to Hadoop/HDFS/Hbase/Hive exist
 *   ps auxw | grep java
  12. Start Hadoop/HDFS
 *   $HADOOP_HOME/bin/start-all.sh
  13. Start Hbase
 *   $HBASE_HOME/bin/start-hbase.sh
  14. Check ALL  java processes related to Hadoop/HDFS/Hbase/Hive exist
 *   ps auxw | grep java



Create tables in HBase

  *   Refer to the Hbase setup on local Fedora desktop guide 
(https://wizecommerce.atlassian.net/wiki/display/traffic/Hbase+setup+on+local+Fedora+desktop+guide)
 and create the tables mentioned there
 *   hbase_2_hive_food
 *   hbase_2_hive_names

Create tables in HIVE

To run Hive type

$HIVE_HOME/bin/hive

This will take you to Hive shell. In the shell, create these two tables

  *   CREATE EXTERNAL TABLE hbase_hive_names(hbid INT, id INT, fn STRING, ln 
STRING, age INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
":key,id:id,name:fn,name:ln,age:age") TBLPROPERTIES ("hbase.table.name" = 
"hbase_2_hive_names");
 *   This HIVE table will map to Hbase table hbase_2_hive_names
  *   CREATE EXTERNAL TABLE hbase_hive_food(hbid INT, id INT, name STRING) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH 
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:name") 
TBLPROPERTIES ("hbase.table.name" = "hbase_2_hive_food");
 *   This HIVE table will map to Hbase table hbase_2_hive_food


Creating & Loading tables in HBase through Hive


  *   Make sure there is no table in Hbase called 'hive2hbase_names_table'
  *   In Hive shell
     *   CREATE TABLE hive2hbase_names_table (hb_id int, fn string, ln string, 
age_dnq INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH 
SERDEPROPERTIES ("hbase.columns.mapping" = 
":key,student:fn,student:ln,student:age") TBLPROPERTIES ("hbase.table.name" = 
"hive2hbase_names_table");
  *   Go to HBase shell
 *   check that table hive2hbase_names_table is created.
  *   In Hive Shell
     *   create a Hive table and populate it with data, which we will use to 
populate the Hive-HBase table
 *   CREATE TABLE names_tab (hb_id int, fn string, ln string, age_dnq INT) 
PARTITIONED BY (age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
 *   LOAD DATA LOCAL INPATH 
'/data/mycode/impressions/inputfiles/names1.tsv.4fields' OVERWRITE INTO TABLE 
names_tab PARTITION (age=60);
 *   LOAD DATA LOCAL INPATH 

Re: Special characters in web log file causing issues

2013-07-08 Thread Sanjay Subramanian
U may have to remove non-printable chars first, save an intermediate file and 
then load into Hive:

tr -cd '[:print:]\r\n\t'

Or, if u have the strings utility, that will only output printable chars
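
Putting that together, a sketch of the clean-then-load step (the file and 
table names here are illustrative):

tr -cd '[:print:]\r\n\t' < weblog.raw > weblog.clean
hive -e "load data local inpath 'weblog.clean' into table my_web_table"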


From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: user@hive.apache.org, Raj Hadoop <hadoop...@yahoo.com>
Date: Monday, July 8, 2013 1:52 PM
To: Hive <user@hive.apache.org>
Subject: Special characters in web log file causing issues


Hi ,

The log file that I am trying to load through Hive has some special characters

The field is shown below and the special characters ¿¿ are also shown.

Shockwave Flash;Chrome Remote Desktop Viewer;Native Client;Chrome PDF 
Viewer;Adobe Acrobat;Microsoft Office 2010;Motive Plug-
in;Motive Management Plug-in;Google Update;Java(TM) Platform SE 7 
U21;McAfee SiteAdvisor;McAfee Virtual Technician;Windows Live¿¿ Photo 
Gallery;McAfee SecurityCenter;Silverlig


The above is causing the record to be terminated and another line to be 
loaded. How can I avoid this type of issue and how do I load the proper data? 
Any suggestions please.

Thanks,
Raj



Re: Loading a flat file + one additional field to a Hive table

2013-07-06 Thread Sanjay Subramanian
Assuming that u can replicate this problem with say 100K records in the log 
file versus the records in Hive, the way I would start the debug is:

hive -e "select * from my_hive_table" | sort > hive.out.check1.sorted

Your original log file is say log.original.txt:

sort log.original.txt > log.original.txt.sorted

Make sure the separators in both files hive.out.check1.sorted and 
log.original.txt.sorted are the same, then:

diff log.original.txt.sorted hive.out.check1.sorted

See what u get…and analyze why some extra records came in

My guess is
=
In your Hive Meta Store, possibly there is another PARTITION that could be 
pointing to a data location containing data from some previous logs….

sanjay

From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: user@hive.apache.org, Raj Hadoop <hadoop...@yahoo.com>
Date: Friday, July 5, 2013 3:27 PM
To: user@hive.apache.org
Subject: Re: Loading a flat file + one additional field to a Hive table

Thanks Sanjay. I will look into this.

Also - one more question.

When I am trying to load log file to Hive and comparing the counts like this

select count(*) from Table

Versus

wc -l File

I see a few hundred more records in the Table. How should I debug it? Any 
tips please.


From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
To: user@hive.apache.org; Raj Hadoop <hadoop...@yahoo.com>
Sent: Saturday, July 6, 2013 4:32 AM
Sent: Saturday, July 6, 2013 4:32 AM
Subject: Re: Loading a flat file + one additional field to a Hive table

How about this ?

Assume you have a log file called
oompaloompa.log

TIMESTAMP=$(date +%Y_%m_%d_T%H_%M_%S); mv oompaloompa.log 
oompaloompa.log.${TIMESTAMP}; cat oompaloompa.log.${TIMESTAMP} | hdfs dfs -put - 
/user/sasubramanian/oompaloompa.log.${TIMESTAMP}

This will directly put the file on HDFS and u can put it to the LOCATION 
specified by your HIVE TABLE definition

sanjay


From: <manishbh...@rocketmail.com>
Reply-To: user@hive.apache.org
Date: Friday, July 5, 2013 10:39 AM
To: Raj Hadoop <hadoop...@yahoo.com>, Hive <user@hive.apache.org>
Subject: Re: Loading a flat file + one additional field to a Hive table

Raj,

You should dump the data into a temp table first and then move the data into 
the final table with a select query:
SELECT date(), c1, c2, ... FROM temp_table
Reason: we should avoid custom operations in load unless necessary.
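
A sketch of that pattern, with illustrative table and column names, using 
from_unixtime(unix_timestamp(), ...) as one way to get today's date stamp in 
the select:

CREATE TABLE t1_tmp (c1 STRING, c2 STRING, c3 STRING, c4 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/path/to/feed' INTO TABLE t1_tmp;

INSERT OVERWRITE TABLE t1
SELECT from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), c1, c2, c3, c4 FROM t1_tmp;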


Sent via Rocket from my HTC

- Reply message -
From: Raj Hadoop <hadoop...@yahoo.com>
To: Hive <user@hive.apache.org>
Subject: Loading a flat file + one additional field to a Hive table
Date: Fri, Jul 5, 2013 10:30 PM


Hi,

Can any one please suggest the best way to do the following in Hive?

Load 'today's date stamp' + ALL FIELDS C1,C2,C3,C4 IN A FILE F1 to a Hive 
table T1 (D1,C1,C2,C3,C4)

Can the following command be modified in some way to achieve the above?
hive> load data local inpath '/software/home/hadoop/dat_files/' into table 
T1;

My requirement is to append a date stamp to a Web log file and then load it to 
Hive table.

Thanks,
Raj



Re: Loading a flat file + one additional field to a Hive table

2013-07-05 Thread Sanjay Subramanian
How about this ?

Assume you have a log file called
oompaloompa.log

TIMESTAMP=$(date +%Y_%m_%d_T%H_%M_%S); mv oompaloompa.log 
oompaloompa.log.${TIMESTAMP}; cat oompaloompa.log.${TIMESTAMP} | hdfs dfs -put - 
/user/sasubramanian/oompaloompa.log.${TIMESTAMP}

This will directly put the file on HDFS and u can put it to the LOCATION 
specified by your HIVE TABLE definition
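
If the date stamp actually needs to be the first column of every row rather 
than part of the file name, a variation on the same streaming trick is this 
sketch (assuming tab-delimited fields; the target path is illustrative):

TIMESTAMP=$(date +%Y-%m-%d)
awk -v d="$TIMESTAMP" '{print d "\t" $0}' oompaloompa.log | hdfs dfs -put - /user/sasubramanian/oompaloompa.log.stamped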

sanjay


From: <manishbh...@rocketmail.com>
Reply-To: user@hive.apache.org
Date: Friday, July 5, 2013 10:39 AM
To: Raj Hadoop <hadoop...@yahoo.com>, Hive <user@hive.apache.org>

Raj,

You should dump the data into a temp table first and then move the data into 
the final table with a select query:
SELECT date(), c1, c2, ... FROM temp_table
Reason: we should avoid custom operations in load unless necessary.


Sent via Rocket from my HTC

- Reply message -
From: Raj Hadoop <hadoop...@yahoo.com>
To: Hive <user@hive.apache.org>
Subject: Loading a flat file + one additional field to a Hive table
Date: Fri, Jul 5, 2013 10:30 PM


Hi,

Can any one please suggest the best way to do the following in Hive?

Load 'today's date stamp' + ALL FIELDS C1,C2,C3,C4 IN A FILE F1 to a Hive 
table T1 (D1,C1,C2,C3,C4)

Can the following command be modified in some way to achieve the above?
hive> load data local inpath '/software/home/hadoop/dat_files/' into table 
T1;

My requirement is to append a date stamp to a Web log file and then load it to 
Hive table.

Thanks,
Raj



One query works, the other does not…any clues ?

2013-07-03 Thread Sanjay Subramanian
THIS FAILS
=
INSERT OVERWRITE DIRECTORY 
'/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts'
 select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 
'all_servers', count(*) from outpdir_impressions_header where 
header_date_partition='2013-07-01'

13/07/03 14:28:31 ERROR exec.Task: Failed with exception Unable to rename: 
hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1
 to: 
/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename: 
hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1
 to: 
/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:95)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:159)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:695)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask



THIS SUCCEEDS
=
INSERT OVERWRITE DIRECTORY 
'/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/record_counts/2013-07-01'
 select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 
'all_servers', count(*) from outpdir_impressions_header where 
header_date_partition='2013-07-01'

TABLE DEFINITION

CREATE EXTERNAL TABLE IF NOT EXISTS impressions_hive_stats(table_name STRING, 
aggregation_date STRING, metric_name STRING, metric_key STRING, metric_value 
BIGINT) PARTITIONED BY (table_name_partition STRING, aggregation_date_partition 
STRING, metric_name_partition STRING, metric_key_partition STRING) STORED AS 
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';



Re: One query works, the other does not…any clues ?

2013-07-03 Thread Sanjay Subramanian
For the time being I have added the create-HDFS-dir step in my hive script…got 
to keep moving on….can't wait for the ideal solution :-) but would love to 
know the ideal solution !

!hdfs dfs -mkdir 
/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
;
INSERT OVERWRITE DIRECTORY 
'/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts'
SELECT
 'outpdir_impressions_header',
 '2013-07-01',
 'record_counts',
 'all_servers',
 count(*)
FROM
 outpdir_impressions_header
WHERE
 header_date_partition='2013-07-01'
;

From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Reply-To: user@hive.apache.org
Date: Wednesday, July 3, 2013 2:40 PM
To: user@hive.apache.org
Subject: One query works, the other does not…any clues ?

THIS FAILS
=
INSERT OVERWRITE DIRECTORY 
'/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts'
 select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 
'all_servers', count(*) from outpdir_impressions_header where 
header_date_partition='2013-07-01'

13/07/03 14:28:31 ERROR exec.Task: Failed with exception Unable to rename: 
hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1
 to: 
/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename: 
hdfs://thv-nn1.pv.sv.nextag.com:8020/tmp/hive-nextag/hive_2013-07-03_14-25-39_324_3631745157890476605/-ext-1
 to: 
/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/2013-07-01/record_counts
at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:95)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:159)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:695)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask



THIS SUCCEEDS
=
INSERT OVERWRITE DIRECTORY 
'/user/beeswax/warehouse/impressions_hive_stats/outpdir_impressions_header/record_counts/2013-07-01'
 select 'outpdir_impressions_header', '2013-07-01', 'record_counts', 
'all_servers', count(*) from outpdir_impressions_header where 
header_date_partition='2013-07-01'

TABLE DEFINITION

CREATE EXTERNAL TABLE IF NOT EXISTS impressions_hive_stats(table_name STRING, 
aggregation_date STRING, metric_name STRING, metric_key STRING, metric_value 
BIGINT) PARTITIONED BY (table_name_partition STRING, aggregation_date_partition 
STRING, metric_name_partition STRING, metric_key_partition STRING) STORED AS 
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';


Request perm to edit wiki

2013-07-01 Thread Sanjay Subramanian
Hi

I'd like to add documentation on how to use LZO compression in Hive

May I get edit permissions?

Thanks.

Sanjay



Confluence username: sanjaysubraman...@yahoo.com

Apache JIRA username - sanjaysubramanian

From: Lefty Leverenz <le...@hortonworks.com>
Reply-To: user@hive.apache.org
Date: Tuesday, June 18, 2013 2:12 AM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

You can email hive-...@hadoop.apache.org asking for edit privileges on the 
Hive wiki.
Here's an example from the archives:
http://mail-archives.apache.org/mod_mbox/hive-dev/201304.mbox/%3ccakhgjs42yqqj54rc8yvzgxbubies7mpv9fdofq2z3onfo67...@mail.gmail.com%3e

Once you have permission, you can use the Edit and Add symbols in the upper 
right corner of each wiki page.

For page location, I'd have to think it over and check the existing docs.  
Initial thoughts:

  *   The Language Manual 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual) has a 
placeholder for file formats which only contains ORC files now.
  *   Perhaps the DDL section is more appropriate. If so, add a subbullet on 
the Language Manual page.
  *   If you don't want it in the Language Manual, just put it on the home 
page (https://cwiki.apache.org/confluence/display/Hive/Home) and add a link 
under User Documentation.
  *   As long as it's somewhere in the Hive wiki tree, the apparent 
organization can be changed by adding/removing links.
  *   Many people don't pay attention to the page structure, they just google 
the topic they're looking for.

– Lefty



On Tue, Jun 18, 2013 at 2:56 AM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
:-) Not sure how to add a page…maybe the Admin needs to grant me permission

From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:50 PM

To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Sure…would love to add the LZO compression in Hive
Is there a specific page structure u want me to add to in confluence?

https://cwiki.apache.org/confluence

thanks
sanjay


From: Lefty Leverenz <le...@hortonworks.com>
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, and others could 
add their insights.  Then the information would be available to all, 
immediately.

– Lefty



On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
Hi Sanjay,

Can you quickly give your insights on this topic, if possible?

Regards,
Ramki.


On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
Hi Programming Hive Book authors

Maybe a lot of u have already successfully implemented this, but only in these 
last two weeks did we implement our aggregations using LZO compression in Hive 
- MR jobs creating LZO files as input for Hive --- thereafter Hive 
aggregations creating more LZO files as output.
As usual nothing was straightforward :-) Also, the other challenge was to 
neatly tie it all into actions in Oozie workflows….but after being underwater 
for weeks I think I am able to rise above water and breathe !

In the next version of the book, if u guys r planning to add detailed sections 
on using LZO compression in Hive, let me know…my experiences might be useful 
:-)

Thanks

sanjay




Re: show table throwing strange error

2013-06-21 Thread Sanjay Subramanian
Can u stop the following services:
hive-server
hive-meta-store
Hive-server2 (if u r running that)

Move current hive.log some place else

Start the following services:
hive-server
hive-meta-store
Hive-server2 (if u r running that)


And check hive.log ?
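
On a CDH-style install that stop/move/start cycle might look like this sketch 
- the service names and log location vary by distribution, so treat them as 
assumptions:

sudo service hive-server stop
sudo service hive-metastore stop
mv /var/log/hive/hive.log /var/log/hive/hive.log.old
sudo service hive-metastore start
sudo service hive-server start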

Also can u paste the CREATE TABLE script verbatim here…I feel if u are using 
some custom INPUTFORMAT / OUTPUTFORMAT class that has to be specified in 
quotes…u may have to escape that

Plus try and add a semicolon to the end of the create table script ...

sanjay

From: Mohammad Tariq <donta...@gmail.com>
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 12:52 PM
To: user <user@hive.apache.org>
Subject: Re: show table throwing strange error

Thank you for looking into it Sanjay. show tables is working fine from both 
the Ubuntu and Hive shells. But I'm getting the same error as yesterday when 
I'm running create table, which is:

line 1:30 character '' not supported here
line 1:31 character '' not supported here
line 1:32 character '' not supported here
line 1:33 character '' not supported here
line 1:34 character '' not supported here
line 1:35 character '' not supported here
line 1:36 character '' not supported here
line 1:37 character '' not supported here
line 1:38 character '' not supported here
line 1:39 character '' not supported here
line 1:40 character '' not supported here
line 1:41 character '' not supported here
line 1:42 character '' not supported here
.
.
.
.

Also, I have noticed one strange thing. hive.log is totally messed up. Looks 
like the logs are getting written in some binary encoding. I have attached a 
snapshot of the same. Any idea?

Warm Regards,
Tariq
cloudfront.blogspot.com


On Fri, Jun 21, 2013 at 1:03 AM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
Can u try from your ubuntu command prompt
$ hive -e "show tables"

From: Mohammad Tariq <donta...@gmail.com>
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 4:28 AM
To: user <user@hive.apache.org>

Subject: Re: show table throwing strange error

Thank you for the response ma'am. It didn't help either.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Jun 20, 2013 at 8:43 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
Your issue seems familiar. Try logging out of hive session and re-login.

Sunita


On Wed, Jun 19, 2013 at 8:53 PM, Mohammad Tariq <donta...@gmail.com> wrote:
Hello list,

 I have a hive (0.9.0) setup on my Ubuntu box running hadoop-1.0.4. 
Everything was going smoothly till now. But today when I issued show tables I 
got some strange error on the CLI. Here is the error:

hive> show tables;
FAILED: Parse Error: line 1:0 character '' not supported here
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
line 1:4 character '' not supported here
line 1:5 character '' not supported here
line 1:6 character '' not supported here
line 1:7 character '' not supported here
line 1:8 character '' not supported here
line 1:9 character '' not supported here
line 1:10 character '' not supported here
line 1:11 character '' not supported here
line 1:12 character '' not supported here
line 1:13 character '' not supported here
line 1:14 character '' not supported here
line 1:15 character '' not supported here
line 1:16 character '' not supported here
line 1:17 character '' not supported here
line 1:18 character '' not supported here
line 1:19 character '' not supported here
line 1:20 character '' not supported here
line 1:21 character '' not supported here
line 1:22 character '' not supported here
line 1:23 character '' not supported here
line 1:24 character '' not supported here
line 1:25 character '' not supported here
line 1:26 character '' not supported here
line 1:27 character '' not supported here
line 1:28 character '' not supported here
line 1:29 character '' not supported here
line 1:30 character '' not supported here
line 1:31 character '' not supported here
line 1:32 character '' not supported here
line 1:33 character '' not supported here
line 1:34 character '' not supported here
line 1:35 character '' not supported here
line 1:36 character '' not supported here
line 1:37 character '' not supported here
line 1:38 character '' not supported here
line 1:39 character '' not supported here
line 1:40 character '' not supported here
line 1:41 character '' not supported here
line 1:42 character '' not supported here
line 1:43 character '' not supported here
line 1:44 character '' not supported here
line 1:45 character '' not supported here
line 1:46 character '' not supported

Re: show table throwing strange error

2013-06-20 Thread Sanjay Subramanian
Can u try from your ubuntu command prompt
$ hive -e "show tables"

From: Mohammad Tariq <donta...@gmail.com>
Reply-To: user@hive.apache.org
Date: Thursday, June 20, 2013 4:28 AM
To: user <user@hive.apache.org>
Subject: Re: show table throwing strange error

Thank you for the response ma'am. It didn't help either.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Jun 20, 2013 at 8:43 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
Your issue seems familiar. Try logging out of hive session and re-login.

Sunita


On Wed, Jun 19, 2013 at 8:53 PM, Mohammad Tariq <donta...@gmail.com> wrote:
Hello list,

 I have a hive (0.9.0) setup on my Ubuntu box running hadoop-1.0.4. 
Everything was going smoothly till now. But today when I issued show tables I 
got some strange error on the CLI. Here is the error:

hive> show tables;
FAILED: Parse Error: line 1:0 character '' not supported here
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
line 1:4 character '' not supported here
line 1:5 character '' not supported here
line 1:6 character '' not supported here
line 1:7 character '' not supported here
line 1:8 character '' not supported here
line 1:9 character '' not supported here
line 1:10 character '' not supported here
line 1:11 character '' not supported here
line 1:12 character '' not supported here
line 1:13 character '' not supported here
line 1:14 character '' not supported here
line 1:15 character '' not supported here
line 1:16 character '' not supported here
line 1:17 character '' not supported here
line 1:18 character '' not supported here
line 1:19 character '' not supported here
line 1:20 character '' not supported here
line 1:21 character '' not supported here
line 1:22 character '' not supported here
line 1:23 character '' not supported here
line 1:24 character '' not supported here
line 1:25 character '' not supported here
line 1:26 character '' not supported here
line 1:27 character '' not supported here
line 1:28 character '' not supported here
line 1:29 character '' not supported here
line 1:30 character '' not supported here
line 1:31 character '' not supported here
line 1:32 character '' not supported here
line 1:33 character '' not supported here
line 1:34 character '' not supported here
line 1:35 character '' not supported here
line 1:36 character '' not supported here
line 1:37 character '' not supported here
line 1:38 character '' not supported here
line 1:39 character '' not supported here
line 1:40 character '' not supported here
line 1:41 character '' not supported here
line 1:42 character '' not supported here
line 1:43 character '' not supported here
line 1:44 character '' not supported here
line 1:45 character '' not supported here
line 1:46 character '' not supported here
line 1:47 character '' not supported here
line 1:48 character '' not supported here
line 1:49 character '' not supported here
line 1:50 character '' not supported here
line 1:51 character '' not supported here
line 1:52 character '' not supported here
line 1:53 character '' not supported here
line 1:54 character '' not supported here
line 1:55 character '' not supported here
line 1:56 character '' not supported here
line 1:57 character '' not supported here
line 1:58 character '' not supported here
line 1:59 character '' not supported here
line 1:60 character '' not supported here
line 1:61 character '' not supported here
line 1:62 character '' not supported here
line 1:63 character '' not supported here
line 1:64 character '' not supported here
line 1:65 character '' not supported here
line 1:66 character '' not supported here
line 1:67 character '' not supported here
line 1:68 character '' not supported here
line 1:69 character '' not supported here
line 1:70 character '' not supported here
line 1:71 character '' not supported here
line 1:72 character '' not supported here
line 1:73 character '' not supported here
line 1:74 character '' not supported here
line 1:75 character '' not supported here
line 1:76 character '' not supported here
line 1:77 character '' not supported here
line 1:78 character '' not supported here
line 1:79 character '' not supported here
.
.
.
.
.
.
line 1:378 character '' not supported here
line 1:379 character '' not supported here
line 1:380 character '' not supported here
line 1:381 character '' not supported here

Strangely, other queries like select foo from pokes where bar = 'tariq'; are 
working fine. Tried to search over the net but could not find anything useful. 
Need some help.

Thank you so much for your time.

Warm Regards,
Tariq
cloudfront.blogspot.com




Re: LZO compression implementation in Hive

2013-06-18 Thread Sanjay Subramanian
Sure…would love to add the LZO compression in Hive
Is there a specific page structure u want me to add to in confluence?

https://cwiki.apache.org/confluence

thanks
sanjay


From: Lefty Leverenz <le...@hortonworks.com>
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, and others could 
add their insights.  Then the information would be available to all, 
immediately.

– Lefty



On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
Hi Sanjay,

Can you quickly give your insights on this topic, if possible?

Regards,
Ramki.


On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
Hi Programming Hive Book authors

Maybe a lot of u have already successfully implemented this, but only in these 
last two weeks did we implement our aggregations using LZO compression in Hive 
- MR jobs creating LZO files as input for Hive --- thereafter Hive 
aggregations creating more LZO files as output.
As usual nothing was straightforward :-) Also, the other challenge was to 
neatly tie it all into actions in Oozie workflows….but after being underwater 
for weeks I think I am able to rise above water and breathe !

In the next version of the book, if u guys r planning to add detailed sections 
on using LZO compression in Hive, let me know…my experiences might be useful 
:-)

Thanks

sanjay





Re: hive to hbase mapping

2013-06-18 Thread Sanjay Subramanian
How about you have two streams - one to hbase and one to Hive from your data 
generation source ?

Moving data out of Hbase may not be trivial, especially if the data sizes are 
large….


From: Mario Casola <mario.cas...@gmail.com>
Reply-To: user@hive.apache.org
Date: Friday, June 14, 2013 9:54 AM
To: user@hive.apache.org
Subject: Re: hive to hbase mapping

Hi Sanjay,

thanks for the response.

I need Hbase because it is perfect for aggregating data through the counters, 
and write performance is great.
Now the problem is... what is the best way to periodically (every hour, for 
example) load Hbase data into a Hive table?

Mario



2013/6/14 Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
6 months back I was tasked with building a Data platform for logs and I 
benchmarked:
Hbase + Hive (queries were 8X slower)
Hive only

So I decided on the Hive-only option and am deploying that solution to 
production.

Couple of things u can think about while u design, if u really want to go 
HBase+Hive (also look at this http://hadoopstack.com/hive-on-hbase-part-1/):
- Query only today's data in a Hive+Hbase architecture
- For data older than one day, query Hive only

Hope I am not diverting from your question and problem

sanjay

From: Mario Casola <mario.cas...@gmail.com>
Reply-To: user@hive.apache.org
Date: Friday, June 14, 2013 8:54 AM
To: user@hive.apache.org
Subject: hive to hbase mapping

Hi,

I have a performance issue when I query HBase from Hive.
My idea is to build the scenario below:
1. Collect data in hbase for aggregation purposes
2. Create an external table that maps Hive to Hbase
3. Create a real Hive table
4. Periodically transfer data from hbase to Hive through INSERT INTO real 
hive table SELECT * FROM external table WHERE time >= 201305212909 (see the 
sketch below)
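
A sketch of steps 2 and 4, reusing the HBaseStorageHandler DDL pattern shown 
earlier in this archive; the table names, column mapping and time filter are 
illustrative, and the real Hive table events_hive from step 3 is assumed to 
exist with the same schema:

CREATE EXTERNAL TABLE hbase_events_ext (rowkey STRING, time BIGINT, val BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:time,d:val")
TBLPROPERTIES ("hbase.table.name" = "events");

INSERT INTO TABLE events_hive
SELECT rowkey, time, val FROM hbase_events_ext WHERE time >= 201305212909;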

Currently I'm doing a test on an Hbase table that has 70,000,000 rows and I'm 
trying to query this table with a single column value filter, like the query 
above.
If I try this type of query directly in Hbase the response time is around 80 
seconds.
If I try the query in the Hive shell, after 30 minutes, all the tasks (9 in my 
case) are 0.00% complete.

Which could be the problem?

thanks
Mario

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.




Re: LZO compression implementation in Hive

2013-06-18 Thread Sanjay Subramanian
Thanks I sent it

From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Tuesday, June 18, 2013 2:12 AM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

You can email hive-...@hadoop.apache.org asking for edit privileges on the Hive wiki.
Here's an example from the archives: 
http://mail-archives.apache.org/mod_mbox/hive-dev/201304.mbox/%3ccakhgjs42yqqj54rc8yvzgxbubies7mpv9fdofq2z3onfo67...@mail.gmail.com%3e

Once you have permission, you can use the Edit and Add symbols in the upper 
right corner of each wiki page.

For page location, I'd have to think it over and check the existing docs.  
Initial thoughts:

  *   The Language Manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual) 
has a placeholder for file formats, which only contains ORC files now.
  *   Perhaps the DDL section is more appropriate. If so, add a subbullet on 
the Language Manual page.
  *   If you don't want it in the Language Manual, just put it on the home 
page (https://cwiki.apache.org/confluence/display/Hive/Home) and add a link 
under User Documentation.
  *   As long as it's somewhere in the Hive wiki tree, the apparent 
organization can be changed by adding/removing links.
  *   Many people don't pay attention to the page structure; they just google 
the topic they're looking for.

– Lefty



On Tue, Jun 18, 2013 at 2:56 AM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:
:-) Not sure how to add a page… maybe the Admin needs to grant me permission

From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:50 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Sure… would love to add the LZO compression docs for Hive.
Is there a specific page structure you want me to add to in Confluence?

https://cwiki.apache.org/confluence

thanks
sanjay


From: Lefty Leverenz le...@hortonworks.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 7:08 PM
To: user@hive.apache.org
Subject: Re: LZO compression implementation in Hive

Perhaps you'd like to write up your insights in the Hive wiki, where others could 
add theirs. Then the information would be available to everyone, immediately.

– Lefty



On Mon, Jun 17, 2013 at 4:39 PM, Ramki Palle ramki.pa...@gmail.com wrote:
Hi Sanjay,

Can you quickly give your insights on this topic, if possible?

Regards,
Ramki.


On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:
Hi Programming Hive Book authors

Maybe a lot of you have already successfully implemented this, but only in these 
last two weeks did we implement our aggregations using LZO compression in Hive - 
MR jobs creating LZO files as input for Hive, and thereafter Hive aggregations 
creating more LZO files as output (a sketch of the table setup follows below).
As usual nothing was straightforward :-) The other challenge was to neatly tie it 
all into actions in Oozie workflows… but after being underwater for weeks I think 
I am able to rise above water and breathe!
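
For reference, the usual shape of an LZO-backed Hive setup, roughly as hadoop-lzo 
documents it (the table name and location here are made up): the read side needs 
hadoop-lzo's input format so the .lzo files and their .lzo.index files split 
correctly, and the write side needs the Lzop codec:

-- read side: an external table over .lzo files (plus .lzo.index for splits)
CREATE EXTERNAL TABLE logs_lzo (line STRING)
STORED AS
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/logs_lzo';

-- write side: make Hive aggregations emit .lzo output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;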

If you guys are planning to add detailed sections on using LZO compression in 
Hive in the next version of the book, let me know… my experiences might be useful 
:-)

Thanks

sanjay







Re: Errors in one Hive script using LZO compression

2013-06-18 Thread Sanjay Subramanian
OK guys, I solved it in a not-so-elegant way, but because of time constraints I 
need to move forward and deploy this to production :-)

I divided the scripts into two stages (a sketch follows below):
Stage 1: The Hive script creates TXT files and writes them to HDFS
Stage 2: I wrote an LZO file creator and indexer that converts the TXT files on 
HDFS to .lzo and .lzo.index
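
A rough sketch of the two stages, assuming stage 2 ends with hadoop-lzo's 
distributed indexer (the paths and source table below are made up):

-- Stage 1: write plain TXT files to HDFS, with output compression off
set hive.exec.compress.output=false;
INSERT OVERWRITE DIRECTORY '/data/stage1/txt/2013-06-18'
SELECT * FROM keyword_impressions_log;

-- Stage 2 runs outside Hive: compress each TXT file to .lzo (e.g. with lzop or
-- a small MR job), then build the .lzo.index files so the .lzo files are
-- splittable, e.g.:
--   hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
--     com.hadoop.compression.lzo.DistributedLzoIndexer /data/stage1/lzo/2013-06-18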

I still don't know what makes this specific Hive script throw this error… but I 
have to keep going ahead…

If anyone can shed more light on this error in the future, I will STILL be 
interested in knowing the root cause.

Thanks

sanjay


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.org
Date: Monday, June 17, 2013 11:59 PM
To: user@hive.apache.org
Subject: Errors in one Hive script using LZO compression

Hi

I am using LZO compression in our scripts, but one script is still producing 
errors:

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.EOFException: Premature EOF from inputStream
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:522)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.init(MapTask.java:160)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:381)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.EOFException: Premature EOF from inputStream
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:75)
at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:114)
at com.hadoop.compression.lzo.LzopInputStream.init(LzopInputStream.java:54)
at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:83)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1871)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1714)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1728)
at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
... 9 more


SCRIPT
===
set mapred.output.compression.type=BLOCK;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress=true;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.intermediate=true;
set mapreduce.job.maps=500;
set mapreduce.job.reduces=8;
set mapreduce.tasktracker.map.tasks.maximum=12;
set mapreduce.tasktracker.reduce.tasks.maximum=8;
add jar /home/nextag/sasubramanian/mycode/impressions/jar/impressions-hiveudfs-1.0-20130615-155038.jar;
create temporary function collect as 'com.wizecommerce.utils.hive.udf.GenericUDAFCollect';
create temporary function isnextagip as 'com.wizecommerce.utils.hive.udf.IsNextagIP';
create temporary function isfrombot as 'com.wizecommerce.utils.hive.udf.IsFromBot';
create temporary function processblankkeyword as 'com.wizecommerce.utils.hive.udf.ProcessBlankKeyword';
create temporary function getValidHiddenSellers as 'com.wizecommerce.utils.hive.udf.GetValidHiddenSellers';
INSERT OVERWRITE DIRECTORY '/user/beeswax/warehouse/keyword_impressions_ptitles_log/2013-03-19'
SELECT
 hp.header_date,
 hp.impression_id,
 hp.header_searchsessionid,
 hp.cached_visit_id,
 split(hp.header_servername,'[\.]')[0],
 hp.cached_ip,
 hp.header_adnode,
 IF (concat_ws(',' , collect_set(concat_ws('|', cast(hp.seller_id as STRING), cast(IF(hp.seller_pricetier IS NULL, -1L, hp.seller_pricetier) as STRING), cast
