Hi Ravi,
Another way of doing it, apart from dynamic partitioning, is to lay out your
directories like below (either manually or via whatever ETL process produces
the table data); then it is pretty easy:
s3://ravi/logs/adv_id=123/date=2012-01-01/log.gz
s3://ravi/logs/adv_id=456/date=2012-01-02/log.gz
s3://ravi/logs/adv_id=123/date=2012-01-03/log.gz
and so on..
and then when you define your table
CREATE EXTERNAL TABLE results
(field1 STRING,
field2 STRING)
PARTITIONED BY (adv_id STRING, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://ravi/logs/';
Once the table is created, do a
ALTER TABLE results RECOVER PARTITIONS; -- this loads all the partitions into the results table
SELECT * FROM results; -- should return all the rows
For queries by partition you could do something like
SELECT * FROM results WHERE adv_id=123 AND date='2012-01-01'; -- this query
will read only that specific directory/partition
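If you only ever need to register one or two directories rather than scanning the whole tree, you can also add partitions one at a time. This is a sketch (the path is just an example following the layout above; RECOVER PARTITIONS is the EMR bulk equivalent, and stock Hive's MSCK REPAIR TABLE does the same job):

```sql
-- Register a single partition explicitly, pointing at its S3 directory.
-- Substitute your real adv_id/date values and path.
ALTER TABLE results ADD IF NOT EXISTS
  PARTITION (adv_id='123', date='2012-01-01')
  LOCATION 's3://ravi/logs/adv_id=123/date=2012-01-01/';
```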
Here is a link to an AWS forum thread describing this in more detail:
https://forums.aws.amazon.com/thread.jspa?threadID=92862
Thanks,
Richin
From: ext Ravi Shetye [mailto:[email protected]]
Sent: Friday, August 24, 2012 9:09 AM
To: [email protected]
Subject: Re: Hive on EMR on S3 : Beginner
thanks for the reply
Let's concentrate on the second case:
CREATE EXTERNAL TABLE results (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';
You can then
INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
Is this all I need to do to load the data?
How will the system know what data goes into which partition?
As I understand it, the partition columns should be pseudo-columns and not part
of the actual data.
Also, if I have to load just 2 of the files, say
s3://logs/ad1date1.log.gz and
s3://logs/ad2date4.log.gz, how do I specify that?
On 08/24/2012 06:16 PM, Pedro Figueiredo wrote:
Hi,
On 24 Aug 2012, at 13:26, Ravi Shetye wrote:
I have the data in s3 bucket in the following manner
s3://logs/ad1date1.log.gz
s3://logs/ad1date2.log.gz
s3://logs/ad1date3.log.gz
s3://logs/ad1date4.log.gz
s3://logs/ad2date1.log.gz
s3://logs/ad2date2.log.gz
s3://logs/ad2date3.log.gz
s3://logs/ad2date4.log.gz
If you do
CREATE EXTERNAL TABLE analyze_files_tab (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
LOCATION 's3n://logs/';
you'll have all of it in one table. If you then want the results partitioned,
you can do
CREATE EXTERNAL TABLE results (cookie STRING,
d2 STRING,
url STRING,
d4 STRING,
d5 STRING,
d6 STRING,
adv_id_dummy STRING,
timestp STRING,
ip STRING,
userAgent STRING,
stage STRING,
d12 STRING,
d13 STRING)
PARTITIONED BY (adv_id STRING, date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://somewhere-outside-the-logs-tree';
You can then
INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
<your query>
Note that to use dynamic partitions you have to first run
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
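For concreteness, the full dynamic-partition load might look roughly like this. It's a sketch, assuming the unpartitioned staging table analyze_files_tab from above, and assuming timestp parses with to_date() (i.e. starts with yyyy-MM-dd); Hive takes the partition values from the last columns of the SELECT list, in the order declared in PARTITION:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must come last in the SELECT, in PARTITIONED BY order;
-- Hive routes each row to the partition named by those trailing values.
INSERT OVERWRITE TABLE results PARTITION (adv_id, date)
SELECT cookie, d2, url, d4, d5, d6, adv_id_dummy, timestp,
       ip, userAgent, stage, d12, d13,
       adv_id_dummy AS adv_id,      -- value for the adv_id partition
       to_date(timestp) AS date     -- value for the date partition (assumes parseable timestamps)
FROM analyze_files_tab;
```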
Cheers,
Pedro
Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting