Why are Move Operations after MapReduce sequential?

2012-03-07 Thread Lu, Wei
Hi,

For the query below, I find that the five Move Operations (after the MapReduce job) are not executed in parallel.

from impressions2
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis1' select * where impressionid < '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis2' select * where impressionid < '123959278' AND impressionid >= '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis3' select * where impressionid < '1239648597000' AND impressionid >= '123959278'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis4' select * where impressionid < '1239714028000' AND impressionid >= '1239648597000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis5' select * where impressionid >= '1239714028000';

--
Ended Job = job_201203060735_0008
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis5
Copying data to local directory /disk2/iis5
--


I thought the Move Operations could be done in parallel, and that performance would improve if the MapReduce temp results were pretty large.


Regards,
Wei


Re: Hive table creation over sequence file

2012-03-07 Thread Bejoy Ks
Hi Chung
      What is the OutputFormat of the map reduce job that writes data on to 
HDFS?

Regards
Bejoy.K.S



 From: Weishung Chung weish...@gmail.com
To: user@hive.apache.org 
Sent: Wednesday, March 7, 2012 10:34 AM
Subject: Re: Hive table creation over sequence file
 

Fellow users,

I created the table as follows, using the MapReduce output file:

CREATE EXTERNAL TABLE mytable (
word string, count int  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://mydata/';

This is what I have in my reduce method (key is of type Text):
 output.collect(key, new IntWritable(sum));
The exception returned by Hive is:
Failed with exception java.io.IOException:java.io.IOException: 
s3://mydata/part-0 not a SequenceFile
Thank you so much :)
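
A note on the mismatch the OutputFormat question above is probing: STORED AS in the DDL has to match the format the MapReduce job actually wrote. If the job used the default TextOutputFormat rather than SequenceFileOutputFormat, a sketch of the matching table definition would be (reusing the table name and S3 location from the message above purely for illustration):

-- sketch only: assumes the reducer output was written by TextOutputFormat,
-- i.e. plain tab-delimited text rather than a SequenceFile
CREATE EXTERNAL TABLE mytable (
word string, count int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mydata/';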

On Tue, Mar 6, 2012 at 4:47 PM, Wei Shung Chung weish...@gmail.com wrote:

Hi users,

I have a sequence file produced by MapReduce with a Text, IntWritable key-value pair... I tried to create an external Hive table using the file, but Hive can't read it.

Thank you

Sent from my iPhone

RE: How to get a flat file out of a table in Hive

2012-03-07 Thread hezhiqiang (Ransom)
You can create a table with CREATE TABLE ... AS SELECT first, using comma-separated fields, and then export it (a sketch follows below).
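
A minimal sketch of that approach; the table names here are hypothetical:

-- stage the rows into a comma-delimited text table
CREATE TABLE my_table_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM my_table;

-- the resulting files live under the new table's warehouse directory and
-- can then be copied out, e.g. with hadoop fs -getmerge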


Best regards
Ransom.

From: Omer, Farah [mailto:fo...@microstrategy.com]
Sent: Wednesday, March 07, 2012 12:32 AM
To: user@hive.apache.org
Subject: How to get a flat file out of a table in Hive

What's the easiest way to get a flat file out of a table in Hive?

I have a table in Hive that has millions of rows. I want to get a dump of this table in flat file format, and it should be comma-separated.

Does anyone know the syntax to do it?

Thanks for the help!

Farah Omer

Senior DB Engineer, MicroStrategy, Inc.
T: 703 2702230
E: fo...@microstrategy.com
http://www.microstrategy.com





Need a smart way to delete the first row of my data

2012-03-07 Thread Dan Y
Hello,

I have huge gzipped files that I need to drop the header row from before
loading to a hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process.  Ideally, it would
involve loading the data to Hive as a first step and then deleting the
first row, to avoid the unzip/rezip steps.

Any ideas would be appreciated!

-Dan


RE: Need a smart way to delete the first row of my data

2012-03-07 Thread Raghunath, Ranjith
Given a key column that is unique within your dataset, I think this could work.


1.   Load the file as is (still gzipped) into a Hive staging table.

2.   Determine the total number of rows (total_size).

3.   Perform an INSERT INTO TABLE <new_table> SELECT * FROM <staging_table> ORDER BY col_name DESC LIMIT total_size - 1 (a sketch follows below).
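
A rough HiveQL sketch of those steps; every table, column, and path name below is made up for illustration, and since LIMIT only takes a literal, the count from step 2 has to be substituted in by hand (or by a small wrapper script):

-- 1. load the gzipped file as-is into a one-column staging table
CREATE TABLE staging_logs (log_line STRING);
CREATE TABLE clean_logs (log_line STRING);
LOAD DATA LOCAL INPATH '/data/huge_file.gz' INTO TABLE staging_logs;

-- 2. determine the total row count (suppose it returns 5000001)
SELECT COUNT(*) FROM staging_logs;

-- 3. per the suggestion above: order by the key column descending
--    and keep total_size - 1 rows
INSERT INTO TABLE clean_logs
SELECT log_line FROM staging_logs
ORDER BY log_line DESC
LIMIT 5000000;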



From: Dan Y [mailto:dan.m.ye...@gmail.com]
Sent: Wednesday, March 07, 2012 10:01 AM
To: user@hive.apache.org
Subject: Need a smart way to delete the first row of my data

Hello,

I have huge gzipped files that I need to drop the header row from before 
loading to a hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process.  Ideally, it would involve 
loading the data to Hive as a first step and then deleting the first row, to 
avoid the unzip/rezip steps.

Any ideas would be appreciated!

-Dan



Re: Why are Move Operations after MapReduce sequential?

2012-03-07 Thread Bejoy Ks
Hi Wei
     If you look at your query, it is a multi-table insert, and it is treated as a single operation in Hive. In a multi-table insert the table is scanned just once instead of being scanned again and again (6 times in your case). From the scanned data, the various filters are applied together with the help of map reduce jobs (you have 6 filters and hence 6 MR jobs). That is step 1; step 2 is copying the output of all these map reduce jobs from HDFS to the local file system.

It would have worked in a sequential way if it were not a multi-table insert.
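
For anyone who wants to see that stage breakdown, running EXPLAIN on the multi-insert prints the plan with its stage dependencies; a sketch against the query above, showing only two of the five inserts for brevity:

EXPLAIN
FROM impressions2
INSERT OVERWRITE LOCAL DIRECTORY '/disk2/iis1' SELECT * WHERE impressionid < '1239572996000'
INSERT OVERWRITE LOCAL DIRECTORY '/disk2/iis5' SELECT * WHERE impressionid >= '1239714028000';
-- the output lists the map-reduce stage(s) followed by a move stage
-- for each target directory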

Regards
Bejoy.K.S



 From: Lu, Wei w...@microstrategy.com
To: user@hive.apache.org; Bejoy Ks 
bejoy...@yahoo.com 
Sent: Wednesday, March 7, 2012 10:31 PM
Subject: Re: Why are Move Operations after MapReduce sequential?
 

 
Hi Bejoy.K.S,

  Yes, there are two steps; for my query there will be 6 steps in all, with one MapReduce job and 5 move operations. My question is why the 5 move operations are executed sequentially rather than in parallel after step 1?

Regards,
Wei
 


 
From: Bejoy Ks [bejoy...@yahoo.com]
Sent: March 7, 2012 7:36
To: user@hive.apache.org
Subject: Re: Why are Move Operations after MapReduce sequential?


Hi Wei
     Here there are two operations that take place for your query
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis1' select * where impressionid < '1239572996000'

1 - A map reduce job that performs the operation select * where impressionid < '1239572996000'
2 - A file system operation that copies the output of Step 1 from HDFS to the local file system (hadoop fs -copyToLocal). Step 2 would be executed only after completion of Step 1.


Regards
Bejoy.K.S



 From: Lu, Wei w...@microstrategy.com
To: user@hive.apache.org 
Sent: Wednesday, March 7, 2012 5:12 PM
Subject: Why are Move Operations after MapReduce sequential?


 
Hi, 
 
For the query below, I find that the five Move Operations (after the MapReduce job) are not executed in parallel.
 
from impressions2 
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis1' select * where impressionid < '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis2' select * where impressionid < '123959278' AND impressionid >= '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis3' select * where impressionid < '1239648597000' AND impressionid >= '123959278'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis4' select * where impressionid < '1239714028000' AND impressionid >= '1239648597000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis5' select * where impressionid >= '1239714028000';
 
--
Ended Job = job_201203060735_0008
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis5
Copying data to local directory /disk2/iis5
--
 
 
I thought the Move Operations could be done in parallel, and that performance would improve if the MapReduce temp results were pretty large.
 
 
Regards,
Wei

Re: HIVE and S3 folders

2012-03-07 Thread Balaji Rao
Hi Mark,
   I could understand it if EMR were the only thing that could recognize these files, but s3cmd (a utility used to copy files to S3) also recognizes the files created by EMR, and files it creates can in turn be read by EMR. When I look at the debug information, Hive seems to be sending an extra / when creating a table.

Here is a debug message; if you look at the path, there is a / and a %2F. Probably a bug in the code?

hive> create external table wc(site string, cnt int) location
's3://masked/wcoverlay/';

   <StringToSign>GET
   Wed, 07 Mar 2012 18:26:03 GMT
   /masked/%2Fwcoverlay/</StringToSign><AWSAccessKeyId>.
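
One thing that may be worth ruling out here (an assumption on the editor's part, not something confirmed in this thread): on a stock Apache or CDH install the s3:// scheme is Hadoop's S3 block filesystem, while plain objects written by tools like s3cmd are normally read through the s3n:// (native) scheme, so outside EMR the location would look more like this sketch:

create external table wc(site string, cnt int)
row format delimited fields terminated by '\t'
stored as textfile
location 's3n://masked/wcoverlay/';
-- s3n:// rather than s3:// outside EMR; credentials go in
-- fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey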


On Wed, Mar 7, 2012 at 12:56 PM, Mark Grover mgro...@oanda.com wrote:
 Hi Balaji,
 The Hive/Hadoop installation that comes with EMR is Amazon-specific and has some additional patches that make S3 paths as recognizable as HDFS paths.

 However, if you are using EC2, you most likely have an Apache or Cloudera installation, which doesn't recognize S3 paths.

 Mark

 Mark Grover, Business Intelligence Analyst
 OANDA Corporation

 www: oanda.com www: fxtrade.com

 Best Trading Platform - World Finance's Forex Awards 2009.
 The One to Watch - Treasury Today's Adam Smith Awards 2009.


 - Original Message -
 From: Balaji Rao sbalaji...@gmail.com
 To: user@hive.apache.org
 Sent: Wednesday, March 7, 2012 12:48:31 PM
 Subject: HIVE and S3 folders

 I'm having problems with HIVE-EC2 reading files on S3.

 I have a lot of files and folders on S3 created by s3cmd and utilized by Elastic Map Reduce (HIVE), and they work interchangeably: files created by HIVE-EMR can be read by s3cmd and vice versa. However, I'm having problems with HIVE/Hadoop running on EC2. Both Hive 0.7 and 0.8 seem to create an additional folder / on S3.

 For example, if I have a file s3://bucket/path/0 created by s3cmd or HIVE-EMR and I try to create an external table on HIVE-EC2:

 create external table wc(site string, cnt int) row format delimited
 fields terminated by '\t' stored as textfile location
 's3://bucket/path'

 This does not recognize the EMR-created S3 folders; instead I see a new folder /:

 bucket / / / path

 Am I missing something here ?


 Balaji


Accessing XML files in HDFS from Hive

2012-03-07 Thread Keaton Adams
So is there documentation or something that you can point me to that gives a good example of how to access XML files stored in HDFS through (possibly) an external table definition in Hive? I am attempting to figure out how to define src to run statements such as this:

SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a') FROM src LIMIT 1;

which should return bbcc.

Thanks.
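
One common pattern for this (a sketch only, under the assumption that each XML document sits on a single line; the table name and HDFS path are hypothetical) is to expose the raw XML as a single string column in an external table and apply the xpath UDFs to it:

-- each row holds one complete XML document as plain text
CREATE EXTERNAL TABLE xml_docs (xml STRING)
LOCATION '/user/hive/xml_docs';

SELECT xpath_string(xml, 'a/b') FROM xml_docs LIMIT 1;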

Hive Sessionization

2012-03-07 Thread Praveen Kumar
Is there a better way to use Hive to sessionize my log data? I'm not
sure that I'm doing it in the optimal way below:

The log data is stored in sequence files; a single log entry is a JSON
string; eg:

{"source": {"api_key": "app_key_1", "user_id": "user0"}, "events":
[{"timestamp": 1330988326, "event_type": "high_score", "event_params":
{"score": "1123", "level": "9"}}, {"timestamp": 1330987183,
"event_type": "some_event_0", "event_params": {"some_param_00": "val",
"some_param_01": 100}}, {"timestamp": 1330987775, "event_type":
"some_event_1", "event_params": {"some_param_11": 100,
"some_param_10": "val"}}]}

Formatted, this looks like:

{'source': {'api_key': 'app_key_1', 'user_id': 'user0'},
 'events': [{'event_params': {'level': '9', 'score': '1123'},
             'event_type': 'high_score',
             'timestamp': 1330988326},
            {'event_params': {'some_param_00': 'val', 'some_param_01': 100},
             'event_type': 'some_event_0',
             'timestamp': 1330987183},
            {'event_params': {'some_param_10': 'val', 'some_param_11': 100},
             'event_type': 'some_event_1',
             'timestamp': 1330987775}]
}

'source' contains some info ( user_id and api_key ) about the source
of the events contained in 'events'; 'events' contains a list of
events generated by the source; each event has 'event_params',
'event_type', and 'timestamp' ( timestamp is a Unix timestamp in GMT
). Note that timestamps within a single log entry, and across log
entries may be out of order.

Note that I'm constrained such that I cannot change the log format,
cannot initially log the data into separate files that are partitioned
( though I could use Hive to do this after the data is logged ), etc.

In the end, I'd like a table of sessions, where a session is
associated with an app ( api_k ) and user, and has a start time and
session length ( or end time ); sessions are split where, for a given
app and user, a gap of 30 or more minutes occurs between events.
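
Expressed as a table definition, that goal is roughly the following (a sketch only; the column names here are illustrative and are not the session_info_* tables used in the script below):

-- one row per (app, user) session
CREATE TABLE sessions (
    api_k string,
    user_id string,
    session_start_timestamp bigint,
    session_length_secs int
);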

My solution does the following ( Hive script and python transform
script are below; doesn't seem like it would be useful to show the
SerDe source, but let me know if it would be ):

[1] load the data into log_entry_tmp, in a denormalized format

[2] explode the data into log_entry, so that, eg, the above single
entry would now have multiple entries:

{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"high_score","event_params":{"score":"1123","level":"9"},"event_timestamp":1330988326}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_0","event_params":{"some_param_00":"val","some_param_01":100},"event_timestamp":1330987183}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_1","event_params":{"some_param_11":100,"some_param_10":"val"},"event_timestamp":1330987775}

[3] transform and write data into session_info_0, where each entry
contains events' app_id, user_id, and timestamp

[4] transform and write data into session_info_1, where entries are
ordered by app_id, user_id, event_timestamp; each entry contains
a session_id; the python transform script finds the splits and groups
the data into sessions

[5] transform and write final session data to session_info_2 ; the
sessions' app + user, start time, and length in seconds

-

[Hive script]

drop table if exists app_info;
create external table app_info ( app_id int, app_name string, api_k string )
location '${WORK}/hive_tables/app_info';

add jar ../build/our-serdes.jar;

-- [1] load the data into log_entry_tmp, in a denormalized format

drop table if exists log_entry_tmp;
create external table log_entry_tmp
row format serde 'com.company.TestLogSerde'
location '${WORK}/hive_tables/test_logs';

drop table if exists log_entry;
create table log_entry (
    entry struct<source_api_key:string,
                 source_user_id:string,
                 event_type:string,
                 event_params:map<string,string>,
                 event_timestamp:bigint>);

-- [2] explode the data into log_entry

insert overwrite table log_entry
select explode (trans0_list) t
from log_entry_tmp;

drop table if exists session_info_0;
create table session_info_0 (
    app_id string,
    user_id string,
    event_timestamp bigint
);

-- [3] transform and write data into session_info_0, where each entry
-- contains events' app_id, user_id, and timestamp

insert overwrite table session_info_0
select ai.app_id, le.entry.source_user_id, le.entry.event_timestamp
from log_entry le
join app_info ai on (le.entry.source_api_key = ai.api_k);

add file ./TestLogTrans.py;

drop table if exists session_info_1;
create table session_info_1 (
    session_id string,
    app_id string,
    user_id string,
    event_timestamp bigint,
    session_start_datetime string,
    session_start_timestamp bigint,
    gap_secs int
);

-- [4] transform and write data into session_info_1, where entries are
-- ordered by app_id, user_id, event_timestamp; each entry contains
-- a session_id; the python transform script finds the splits and groups
-- the data into sessions

insert overwrite 

Re: Hadoop User Group Cologne

2012-03-07 Thread alo alt
HI,

we set up a German UG a few days ago:
http://mapredit.blogspot.com/2012/03/hadoop-ug-germany.html

German:
We have founded a user group; for now there are groups on XING / LinkedIn and a website, which is admittedly still quite new :) If you want to take part, get in touch!

Thanks and see you soon,
 - Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Mar 8, 2012, at 7:48 AM, Christian Bitter wrote:

 Dear all,
 
 I would like to know whether there already is, or whether there is interest in 
 establishing, some form of user group for Hadoop in Cologne / Germany.
 
 Cheers,
 
 Christian