Why are Move Operations after MapReduce sequential?
Hi,

For the query below, I find that the five Move Operations (after the MapReduce job) are not executed in parallel.

from impressions2
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis1' select * where impressionid < '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis2' select * where impressionid < '123959278' AND impressionid >= '1239572996000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis3' select * where impressionid < '1239648597000' AND impressionid >= '123959278'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis4' select * where impressionid < '1239714028000' AND impressionid >= '1239648597000'
insert OVERWRITE LOCAL DIRECTORY '/disk2/iis5' select * where impressionid >= '1239714028000';

--
Ended Job = job_201203060735_0008
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis1
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis2
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis3
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis4
Copying data to local directory /disk2/iis5
Copying data to local directory /disk2/iis5
--

I thought the Move Operations could be done in parallel, which would improve performance when the MapReduce temp result is pretty large.

Regards,
Wei
Re: Hive table creation over sequence file
Hi Chung,

What is the OutputFormat of the map reduce job that writes data on to HDFS?

Regards
Bejoy.K.S

From: Weishung Chung weish...@gmail.com
To: user@hive.apache.org
Sent: Wednesday, March 7, 2012 10:34 AM
Subject: Re: Hive table creation over sequence file

Fellow users,

I created the table as follows using the mapreduce output file:

CREATE EXTERNAL TABLE mytable ( word string, count int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://mydata/';

This is what I have in my reduce method; key is of type Text:

output.collect(key, new IntWritable(sum));

The exception returned by Hive:

Failed with exception java.io.IOException:java.io.IOException: s3://mydata/part-0 not a SequenceFile

Thank you so much :)

On Tue, Mar 6, 2012 at 4:47 PM, Wei Shung Chung weish...@gmail.com wrote:
Hi users, I have a sequence file produced by mapreduce with TEXT, INTWRITABLE key value pair... I tried to create an external hive table using the file but hive can't read it. Thank you
Sent from my iPhone
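If the MapReduce job was in fact using the default TextOutputFormat rather than SequenceFileOutputFormat (an assumption the thread does not confirm), the files on S3 are plain tab-delimited text and the table just needs to be declared accordingly. A minimal sketch:

-- Hedged sketch: only applies if the MapReduce job wrote with the default
-- TextOutputFormat, which renders each record as "key<TAB>value" plain text.
-- In that case the table should be declared as TEXTFILE, not SEQUENCEFILE.
CREATE EXTERNAL TABLE mytable (
  word  string,
  count int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE          -- was SEQUENCEFILE in the original definition
LOCATION 's3://mydata/';

If the job really does write with SequenceFileOutputFormat, this does not apply and the original STORED AS SEQUENCEFILE definition should work.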
RE: How to get a flat file out of a table in Hive
You can do a "create table as select" first, with the fields comma separated, and then export it.

Best regards
Ransom.

From: Omer, Farah [mailto:fo...@microstrategy.com]
Sent: Wednesday, March 07, 2012 12:32 AM
To: user@hive.apache.org
Subject: How to get a flat file out of a table in Hive

What's the easiest way to get a flat file out from a table in Hive?

I have a table in Hive that has millions of rows. I want to get a dump of this table out in flat file format, and it should be comma separated. Does anyone know the syntax to do it?

Thanks for the help!

Farah Omer
Senior DB Engineer, MicroStrategy, Inc.
T: 703 2702230
E: fo...@microstrategy.com
http://www.microstrategy.com
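A minimal sketch of that suggestion; the table names csv_dump and my_table and the warehouse path in the comment are assumptions for illustration, not from the thread:

-- Hedged sketch: stage the data into a comma-delimited text table first.
-- csv_dump and my_table are hypothetical names.
CREATE TABLE csv_dump
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS
SELECT * FROM my_table;

-- The table's files under the warehouse directory are now plain
-- comma-separated text and can be copied out, e.g. with
--   hadoop fs -get /user/hive/warehouse/csv_dump /tmp/csv_dump
-- (the warehouse path is an assumption; check hive.metastore.warehouse.dir).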
Need a smart way to delete the first row of my data
Hello,

I have huge gzipped files that I need to drop the header row from before loading to a hive table. Right now, my process is:

1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process. Ideally, it would involve loading the data to Hive as a first step and then deleting the first row, to avoid the unzip/rezip steps.

Any ideas would be appreciated!
-Dan
RE: Need a smart way to delete the first row of my data
Given a key column that is unique within your dataset, I think this could work:

1. Load the file as is, still gzipped, into a Hive table.
2. Determine the total number of rows (total_size).
3. Perform an INSERT INTO TABLE <target table> SELECT * FROM <staging table> ORDER BY col_name DESC LIMIT total_size - 1.

From: Dan Y [mailto:dan.m.ye...@gmail.com]
Sent: Wednesday, March 07, 2012 10:01 AM
To: user@hive.apache.org
Subject: Need a smart way to delete the first row of my data
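A hedged HiveQL sketch of those three steps; the table names, column names, file path, and literal row count are assumptions for illustration only:

-- Step 1 (assumption: Hive reads gzipped text files transparently, so the
-- file can be loaded as-is into a staging table without gunzipping):
CREATE TABLE raw_logs (col_name string, other_col string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/data/huge_file.gz' INTO TABLE raw_logs;

-- Step 2: determine the total row count.
SELECT count(*) FROM raw_logs;

-- Step 3: copy everything except the header into the final table.
-- ORDER BY ... DESC is assumed to push the header row to the very end,
-- so LIMIT (total row count - 1) drops it.
CREATE TABLE clean_logs LIKE raw_logs;

INSERT OVERWRITE TABLE clean_logs
SELECT * FROM raw_logs
ORDER BY col_name DESC
LIMIT 999999;   -- stand-in for "total row count minus 1" from step 2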
Re: Why are Move Operations after MapReduce sequential?
Hi Wei

If you look at your query, it is a multi table insert, and it is treated as a single operation in Hive. In a multi table insert the table is scanned just once instead of being scanned again and again (6 times in your case). From the scanned data, the various filters are applied together with the help of map reduce jobs (you have 6 filters and hence 6 MR jobs). This is step 1, and then step 2 is copying the output of all these map reduce jobs from HDFS to the local file system. It would have worked the sequential way if it was not a multi table insert.

Regards
Bejoy.K.S

From: Lu, Wei w...@microstrategy.com
To: user@hive.apache.org user@hive.apache.org; Bejoy Ks bejoy...@yahoo.com
Sent: Wednesday, March 7, 2012 10:31 PM
Subject: re: Why are Move Operations after MapReduce sequential?

Hi Bejoy.K.S,

Yes, there are two steps, and as for my query, there will be 6 steps with one mapreduce and 5 move operations. My question is why the 5 move operations are executed sequentially rather than in parallel after step 1?

Regards,
Wei

From: Bejoy Ks [bejoy...@yahoo.com]
Sent: March 7, 2012 7:36
To: user@hive.apache.org
Subject: Re: Why are Move Operations after MapReduce sequential?

Hi Wei

Here there are two operations that take place for your query

insert OVERWRITE LOCAL DIRECTORY '/disk2/iis1' select * where impressionid < '1239572996000'

1 - A map reduce job that performs the operation: select * where impressionid < '1239572996000'
2 - A file system operation that copies the output of Step 1 from HDFS to the local file system (hadoop fs -copyToLocal).

Step 2 would be executed only after completion of Step 1.

Regards
Bejoy.K.S
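As an aside not taken from this thread: Hive's hive.exec.parallel setting allows independent stages of a single query to run concurrently. Whether it helps with the final move/copy stages of a multi table insert depends on the Hive version and the generated plan, so the following is only something to experiment with:

-- Hedged sketch: allow independent stages of one query to run in parallel.
-- Whether the local-directory move tasks benefit is version/plan dependent.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;  -- 8 is the usual default; shown for clarity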
Re: HIVE and S3 folders
Hi Mark,

I could understand it if EMR was the only thing that could recognize it. But it appears that s3cmd (a utility used to copy files to S3) also recognizes the files created by EMR, or can create files and have them read by EMR.

When I look at the debug information, Hive seems to be sending an extra / when creating a table. Here is a debug message, and if you look at the path, there is a / and a %2f. Probably a bug in the code?

hive> create external table wc(site string, cnt int) location 's3://masked/wcoverlay/';

<StringToSign>GET

Wed, 07 Mar 2012 18:26:03 GMT
/masked/%2fwcoverlay/</StringToSign><AWSAccessKeyId>...

On Wed, Mar 7, 2012 at 12:56 PM, Mark Grover mgro...@oanda.com wrote:

Hi Balaji,

The Hive/Hadoop installation that comes with EMR is Amazon specific and has some additional patches that make S3 paths as recognizable as HDFS paths. However, if you are using EC2, you most likely have an Apache or Cloudera installation, which doesn't recognize S3 paths.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.

- Original Message -
From: Balaji Rao sbalaji...@gmail.com
To: user@hive.apache.org
Sent: Wednesday, March 7, 2012 12:48:31 PM
Subject: HIVE and S3 folders

I'm having problems with Hive on EC2 reading files on S3.

I have a lot of files and folders on S3 created by s3cmd and utilized by Elastic Map Reduce (Hive), and they work interchangeably: files created by Hive-EMR can be read by s3cmd and vice versa.

However, I'm having problems with Hive/Hadoop running on EC2. Both Hive 0.7 and 0.8 seem to create an additional folder "/" on S3. For example, if I have a file s3://bucket/path/0 created by s3cmd or Hive-EMR and I try to create an external table on Hive-EC2:

create external table wc(site string, cnt int)
row format delimited fields terminated by '\t'
stored as textfile
location 's3://bucket/path'

This does not recognize the EMR-created S3 folders; instead I see a new folder structure bucket / "/" / path.

Am I missing something here?
Balaji
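A hedged sketch for a plain Apache/Cloudera cluster, swapping in the s3n:// filesystem (which the thread itself does not mention); the credential property names are the standard Hadoop s3n settings, the bucket/path and key values are placeholders, and it is not confirmed that this avoids the extra "/" described above:

-- Hedged sketch: make S3 credentials available to the session, then point
-- the external table at the fully qualified s3n:// path.
SET fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY;       -- placeholder value
SET fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY;   -- placeholder value

create external table wc (site string, cnt int)
row format delimited fields terminated by '\t'
stored as textfile
location 's3n://bucket/path';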
Accessing XML files in HDFS from Hive
So is there documentation or something that you can point me to that gives a good example of how to access XML files stored in HDFS through (possibly) an external table definition in Hive? I am attempting to figure out how to define src to run statements such as this:

SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a') FROM src LIMIT 1 ;
-- returns: bbcc

Thanks.
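One hedged way to make such statements runnable, assuming each XML document sits on a single line of the HDFS files; the table name, column name, location, and xpath expression below are illustrative assumptions, not from the thread:

-- Hedged sketch: expose each line of the HDFS files as one XML document in a
-- single string column, then query it with the xpath UDFs. This only works
-- if every document fits on one line.
CREATE EXTERNAL TABLE xml_raw (xml_line string)
STORED AS TEXTFILE
LOCATION '/data/xml_files';

-- Pull a value out of each document; the path 'a/b' is just an example.
SELECT xpath_string(xml_line, 'a/b')
FROM xml_raw
LIMIT 10;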
Hive Sessionization
Is there a better way to use Hive to sessionize my log data? I'm not sure that I'm doing so, below, in the optimal way.

The log data is stored in sequence files; a single log entry is a JSON string; eg:

{"source": {"api_key": "app_key_1", "user_id": "user0"}, "events": [{"timestamp": 1330988326, "event_type": "high_score", "event_params": {"score": "1123", "level": "9"}}, {"timestamp": 1330987183, "event_type": "some_event_0", "event_params": {"some_param_00": "val", "some_param_01": 100}}, {"timestamp": 1330987775, "event_type": "some_event_1", "event_params": {"some_param_11": 100, "some_param_10": "val"}}]}

Formatted, this looks like:

{'source': {'api_key': 'app_key_1', 'user_id': 'user0'},
 'events': [{'event_params': {'level': '9', 'score': '1123'},
             'event_type': 'high_score',
             'timestamp': 1330988326},
            {'event_params': {'some_param_00': 'val', 'some_param_01': 100},
             'event_type': 'some_event_0',
             'timestamp': 1330987183},
            {'event_params': {'some_param_10': 'val', 'some_param_11': 100},
             'event_type': 'some_event_1',
             'timestamp': 1330987775}]}

'source' contains some info (user_id and api_key) about the source of the events contained in 'events'; 'events' contains a list of events generated by the source; each event has 'event_params', 'event_type', and 'timestamp' (timestamp is a Unix timestamp in GMT). Note that timestamps within a single log entry, and across log entries, may be out of order.

Note that I'm constrained such that I cannot change the log format, cannot initially log the data into separate files that are partitioned (though I could use Hive to do this after the data is logged), etc.

In the end, I'd like a table of sessions, where a session is associated with an app (api_k) and user, and has a start time and session length (or end time); sessions are split where, for a given app and user, a gap of 30 or more minutes occurs between events.
My solution does the following (the Hive script and python transform script are below; it doesn't seem like it would be useful to show the SerDe source, but let me know if it would be):

[1] load the data into log_entry_tmp, in a denormalized format
[2] explode the data into log_entry, so that, eg, the above single entry would now have multiple entries:

{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"high_score","event_params":{"score":"1123","level":"9"},"event_timestamp":1330988326}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_0","event_params":{"some_param_00":"val","some_param_01":"100"},"event_timestamp":1330987183}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_1","event_params":{"some_param_11":"100","some_param_10":"val"},"event_timestamp":1330987775}

[3] transform and write data into session_info_0, where each entry contains events' app_id, user_id, and timestamp
[4] transform and write data into session_info_1, where entries are ordered by app_id, user_id, event_timestamp; and each entry contains a session_id; the python transform script finds the splits, and groups the data into sessions
[5] transform and write final session data to session_info_2; the sessions' app + user, start time, and length in seconds

[Hive script]

drop table if exists app_info;
create external table app_info ( app_id int, app_name string, api_k string )
location '${WORK}/hive_tables/app_info';

add jar ../build/our-serdes.jar;

-- [1] load the data into log_entry_tmp, in a denormalized format
drop table if exists log_entry_tmp;
create external table log_entry_tmp
row format serde 'com.company.TestLogSerde'
location '${WORK}/hive_tables/test_logs';

drop table if exists log_entry;
create table log_entry (
  entry struct<source_api_key:string,
               source_user_id:string,
               event_type:string,
               event_params:map<string,string>,
               event_timestamp:bigint>);

-- [2] explode the data into log_entry
insert overwrite table log_entry
select explode (trans0_list) t
from log_entry_tmp;

drop table if exists session_info_0;
create table session_info_0 (
  app_id string,
  user_id string,
  event_timestamp bigint
);

-- [3] transform and write data into session_info_0, where each entry contains events' app_id, user_id, and timestamp
insert overwrite table session_info_0
select ai.app_id, le.entry.source_user_id, le.entry.event_timestamp
from log_entry le
join app_info ai on (le.entry.source_api_key = ai.api_k);

add file ./TestLogTrans.py;

drop table if exists session_info_1;
create table session_info_1 (
  session_id string,
  app_id string,
  user_id string,
  event_timestamp bigint,
  session_start_datetime string,
  session_start_timestamp bigint,
  gap_secs int
);

-- [4] transform and write data into session_info_1, where entries are ordered by app_id, user_id, event_timestamp; and each entry contains a session_id; the python transform script finds the splits, and groups the data into sessions
insert overwrite
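For step [5], a hedged sketch of what the final aggregation might look like; the session_info_2 schema and the min/max grouping below are assumptions based on the description above, not the author's actual script:

-- Hedged sketch of step [5]: collapse per-event rows into one row per session.
-- Schema and aggregation are assumptions; session_info_1 columns are taken
-- from the table definition in the script above.
drop table if exists session_info_2;
create table session_info_2 (
  app_id string,
  user_id string,
  session_start_timestamp bigint,
  session_length_secs bigint
);

insert overwrite table session_info_2
select app_id,
       user_id,
       min(event_timestamp),
       max(event_timestamp) - min(event_timestamp)
from session_info_1
group by app_id, user_id, session_id;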
Re: Hadoop User Group Cologne
Hi,

we set up a German UG a few days ago: http://mapredit.blogspot.com/2012/03/hadoop-ug-germany.html

Deutsch / German: We have founded a UG; for now there are groups on XING / LinkedIn and a website, which admittedly is still quite new :) If you want to take part, get in touch! Thanks and see you soon,

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Mar 8, 2012, at 7:48 AM, Christian Bitter wrote:

Dear all,

I would like to know whether there is already, or whether there is interest in establishing, some form of user group for Hadoop in Cologne / Germany.

Cheers,
Christian