Re: dfs storage full on all slave machines of 6 machine hive cluster

2013-03-19 Thread Chunky Gupta
Thanks Alok, I deleted the mapred.local.dir folders.
I have 2 more questions:

1. I have around 30 databases and each one contains many tables. So, is
there any way to find out what the size of each database is, or how much
storage a particular table in a database is occupying?

2. We have 5 slave nodes. How do I find out which table's data is stored on
which slave node?
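
(I was hoping for something like a du on the warehouse directory, e.g. the
following, assuming the default warehouse path and made-up database/table
names, though I am not sure this is the right approach:

hadoop fs -du /user/hive/warehouse                   # size of each database directory
hadoop fs -dus /user/hive/warehouse/mydb.db/mytable  # total size of one table
hadoop fsck /user/hive/warehouse/mydb.db/mytable -files -blocks -locations

The fsck output should list which datanodes hold each block of a table.)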

Thanks,
Chunky.


On Mon, Mar 18, 2013 at 10:16 PM, Alok Kumar  wrote:

> Look into your hdfs-site.xml & mapred-site.xml conf files.
>
> The *dfs.data.dir* property contains your actual HDFS data path; better avoid
> deleting anything from these directories.
>
> *mapred.local.dir* contains temporary map-reduce job data, you can clean
> this one.
>
> "/mnt/hadoop-fs/dfs/data/current/" looks like your hdfs data path, this
> mean your hive tables have grown to ~95% of your disk size. try deleting
> hive tables or add more disk ( dropping a EXTERNAL hive table doesn't clear
> the data from HDFS)
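>
> A quick way to confirm where the space is going (a sketch; adjust the
> paths to your setup):
>
> hadoop dfsadmin -report          # DFS used/remaining per datanode
> du -sh /mnt/hadoop-fs/dfs/data   # local disk usage on one slave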
>
> Thanks,
>
>
> On Mon, Mar 18, 2013 at 9:28 PM, Chunky Gupta wrote:
>
>> Hi Zhiwen,
>>
>> /mnt/hadoop-fs/mapred/local/taskTracker/
>>
>> Inside this folder there are folders with different user names; can I
>> delete these?
>>
>> I do not understand the {*nouserdir*} you were talking about; can
>> you please explain?
>>
>> Thanks,
>> Chunky.
>>
>>
>>
>> On Mon, Mar 18, 2013 at 8:40 PM, Zhiwen Sun  wrote:
>>
>>> The folder "/mnt/hadoop-fs/dfs/data/current/" is the main data folder of
>>> the datanode in Hadoop.
>>>
>>> You can use *hadoop dfs -rmr {nouserdir} *to get more free space in
>>> HDFS.
>>>
>>> *Don't delete files directly in the OS file system.*
>>>
>>> Zhiwen Sun
>>>
>>>
>>>
>>> On Mon, Mar 18, 2013 at 6:48 PM, Manish Bhoge <
>>> manishbh...@rocketmail.com> wrote:
>>>
>>>> I think these directories belong to the task tracker's temporary storage, but
>>>> I am not confident enough to say go ahead with your cleanup. So, wait for
>>>> an expert's response.
>>>>
>>>> Sent from HTC via Rocket! excuse typo.
>>>>
>>>>  --
>>>> * From: * Chunky Gupta ;
>>>> * To: * ;
>>>> * Subject: * dfs storage full on all slave machines of 6 machine hive
>>>> cluster
>>>> * Sent: * Mon, Mar 18, 2013 10:37:39 AM
>>>>
>>>>   Hi,
>>>>
>>>> We have a 6 machine hive cluster. We are getting errors while a query
>>>> is running, and it fails. I found that storage on all 5 slaves is nearly
>>>> full (96%, 98%, 100%, 97%, 98% used).
>>>>
>>>> On my slave machines, the folder "/mnt/hadoop-fs/dfs/data/current/"
>>>> accounts for 95% of the storage used. It contains folders named "subdir0",
>>>> "subdir1", etc., and under them there are many files with names like
>>>> "blk_-4071357924681234567" and "blk_-4071357924681234567_246813.meta".
>>>>
>>>> I want to delete these subdir folders, but I am not sure whether it
>>>> will affect the tables which I have created.
>>>>
>>>> Can anyone tell me what these folders are used for?
>>>>
>>>> Thanks,
>>>> Chunky.
>>>>
>>>
>>>
>>
>
>
> --
> Alok Kumar
>


Re: dfs storage full on all slave machines of 6 machine hive cluster

2013-03-18 Thread Chunky Gupta
Hi Zhiwen,

/mnt/hadoop-fs/mapred/local/taskTracker/

Inside this folder there are folders with different user names; can I delete
these?

I do not understand the {*nouserdir*} you were talking about; can you
please explain?

Thanks,
Chunky.


On Mon, Mar 18, 2013 at 8:40 PM, Zhiwen Sun  wrote:

> The folder "/mnt/hadoop-fs/dfs/data/current/" is the main data folder of
> the datanode in Hadoop.
>
> You can use *hadoop dfs -rmr {nouserdir} *to get more free space in HDFS.
>
> *Don't delete files directly in the OS file system.*
>
> Zhiwen Sun
>
>
>
> On Mon, Mar 18, 2013 at 6:48 PM, Manish Bhoge 
> wrote:
>
>> I think these directories belong to the task tracker's temporary storage, but
>> I am not confident enough to say go ahead with your cleanup. So, wait for an
>> expert's response.
>>
>> Sent from HTC via Rocket! excuse typo.
>>
>>  --
>> * From: * Chunky Gupta ;
>> * To: * ;
>> * Subject: * dfs storage full on all slave machines of 6 machine hive
>> cluster
>> * Sent: * Mon, Mar 18, 2013 10:37:39 AM
>>
>>   Hi,
>>
>> We have a 6 machine hive cluster. We are getting errors while a query is
>> running, and it fails. I found that storage on all 5 slaves is nearly full
>> (96%, 98%, 100%, 97%, 98% used).
>>
>> On my slave machines, the folder "/mnt/hadoop-fs/dfs/data/current/"
>> accounts for 95% of the storage used. It contains folders named "subdir0",
>> "subdir1", etc., and under them there are many files with names like
>> "blk_-4071357924681234567" and "blk_-4071357924681234567_246813.meta".
>>
>> I want to delete these subdir folders, but I am not sure whether it will
>> affect the tables which I have created.
>>
>> Can anyone tell me what these folders are used for?
>>
>> Thanks,
>> Chunky.
>>
>
>


dfs storage full on all slave machines of 6 machine hive cluster

2013-03-18 Thread Chunky Gupta
Hi,

We have a 6 machine hive cluster. We are getting errors while a query is
running, and it fails. I found that storage on all 5 slaves is nearly full
(96%, 98%, 100%, 97%, 98% used).

On my slave machines, the folder "/mnt/hadoop-fs/dfs/data/current/" accounts
for 95% of the storage used. It contains folders named "subdir0", "subdir1",
etc., and under them there are many files with names like
"blk_-4071357924681234567" and "blk_-4071357924681234567_246813.meta".

I want to delete these subdir folders, but I am not sure whether it will
affect the tables which I have created.

Can anyone tell me what these folders are used for?

Thanks,
Chunky.


Re: Adding comment to a table for columns

2013-02-21 Thread Chunky Gupta
Hi Bejoy,

I checked and didn't find EXTENDED and FORMATTED used together anywhere to
describe a table. The syntax is like :-

DESCRIBE [EXTENDED|FORMATTED] table_name[DOT col_name ( [DOT field_name] |
[DOT '$elem$'] | [DOT '$key$'] | [DOT '$value$'] )* ]

Everywhere it says I can use only one at a time.

I tried using both as you suggested and got this error :-
FAILED: Parse Error: line 1:19 mismatched input 'extended' expecting
Identifier near 'formatted' in specifying table types

If you can, please try to remember the exact operation, or if there is any
other way of doing it, then please let me know.

Thanks,
Chunky.


On Thu, Feb 21, 2013 at 5:22 PM,  wrote:
>
> Hi Gupta
>
> Try out
>
> DESCRIBE EXTENDED FORMATTED <table_name>
>
> I vaguely recall an operation like this.
> Please check hive wiki for the exact syntax.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> 
> From: Chunky Gupta 
> Date: Thu, 21 Feb 2013 17:15:37 +0530
> To: ; ; <
snehalata_bhas...@syntelinc.com>
> ReplyTo: user@hive.apache.org
> Subject: Re: Adding comment to a table for columns
>
> Hi Bejoy, Bhaskar
>
> I tried using FORMATTED, but it does not give me the comments which I put
> in while creating the table. Its output is like :-
>
> col_name    data_type    comment
> c           string       from deserializer
> time        string       from deserializer
>
> Thanks,
> Chunky.
>
> On Thu, Feb 21, 2013 at 4:50 PM,  wrote:
>>
>> Hi Gupta
>>
> You can get the describe output in a formatted way using
>>
> DESCRIBE FORMATTED <table_name>;
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> 
>> From: Chunky Gupta 
>> Date: Thu, 21 Feb 2013 16:46:30 +0530
>> To: 
>> ReplyTo: user@hive.apache.org
>> Subject: Adding comment to a table for columns
>>
>> Hi,
>>
>> I am using this syntax to add comments for all columns :-
>>
>> CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time
STRING COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED
BY (dt STRING ) LOCATION 's3://BucketName/'
>>
>> The output of DESCRIBE EXTENDED on the table is like :- (the output is
>> just an example copied from the internet)
>>
>> hive> DESCRIBE EXTENDED table_name;
>>
>> Detailed Table Information Table(tableName:table_name,
dbName:benchmarking, owner:root, createTime:1309480053, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:session_key,
type:string, comment:null), FieldSchema(name:remote_address, type:string,
comment:null), FieldSchema(name:canister_lssn, type:string, comment:null),
FieldSchema(name:canister_session_id, type:bigint, comment:null),
FieldSchema(name:tltsid, type:string, comment:null),
FieldSchema(name:tltuid, type:string, comment:null),
FieldSchema(name:tltvid, type:string, comment:null),
FieldSchema(name:canister_server, type:string, comment:null),
FieldSchema(name:session_timestamp, type:string, comment:null),
FieldSchema(name:session_duration, type:string, comment:null),
FieldSchema(name:hit_count, type:bigint, comment:null),
FieldSchema(name:http_user_agent, type:string, comment:null),
FieldSchema(name:extractid, type:bigint, comment:null),
FieldSchema(name:site_link, type:string, comment:null),
FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
type:int, comment:null)],
location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
>>
>> Is there any way of getting these detailed comments and column names in a
>> readable format, just like the output of "DESCRIBE table_name"?
>>
>>
>> Thanks,
>>
>> Chunky.
>
>


Re: Adding comment to a table for columns

2013-02-21 Thread Chunky Gupta
Hi Bejoy, Bhaskar

I tried using FORMATTED, but it does not give me the comments which I put in
while creating the table. Its output is like :-

col_name    data_type    comment
c           string       from deserializer
time        string       from deserializer

Thanks,
Chunky.

On Thu, Feb 21, 2013 at 4:50 PM,  wrote:

> Hi Gupta
>
> You can get the describe output in a formatted way using
>
> DESCRIBE FORMATTED <table_name>;
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------
> *From: * Chunky Gupta 
> *Date: *Thu, 21 Feb 2013 16:46:30 +0530
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Adding comment to a table for columns
>
> Hi,
>
> I am using this syntax to add comments for all columns :-
>
> CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time STRING
> COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED BY (dt
> STRING ) LOCATION 's3://BucketName/'
>
> The output of DESCRIBE EXTENDED on the table is like :- (the output is just
> an example copied from the internet)
>
> hive> DESCRIBE EXTENDED table_name;
>
> Detailed Table Information Table(tableName:table_name,
> dbName:benchmarking, owner:root, createTime:1309480053, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:session_key,
> type:string, comment:null), FieldSchema(name:remote_address, type:string,
> comment:null), FieldSchema(name:canister_lssn, type:string, comment:null),
> FieldSchema(name:canister_session_id, type:bigint, comment:null),
> FieldSchema(name:tltsid, type:string, comment:null),
> FieldSchema(name:tltuid, type:string, comment:null),
> FieldSchema(name:tltvid, type:string, comment:null),
> FieldSchema(name:canister_server, type:string, comment:null),
> FieldSchema(name:session_timestamp, type:string, comment:null),
> FieldSchema(name:session_duration, type:string, comment:null),
> FieldSchema(name:hit_count, type:bigint, comment:null),
> FieldSchema(name:http_user_agent, type:string, comment:null),
> FieldSchema(name:extractid, type:bigint, comment:null),
> FieldSchema(name:site_link, type:string, comment:null),
> FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
> type:int, comment:null)],
> location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
>
> Is there any way of getting these detailed comments and column names in a
> readable format, just like the output of "DESCRIBE table_name"?
>
>
> Thanks,
>
> Chunky.
>


Adding comment to a table for columns

2013-02-21 Thread Chunky Gupta
Hi,

I am using this syntax to add comments for all columns :-

CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time STRING
COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED BY (dt
STRING ) LOCATION 's3://BucketName/'

The output of DESCRIBE EXTENDED on the table is like :- (the output is just
an example copied from the internet)

hive> DESCRIBE EXTENDED table_name;

Detailed Table Information Table(tableName:table_name, dbName:benchmarking,
owner:root, createTime:1309480053, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:session_key, type:string,
comment:null), FieldSchema(name:remote_address, type:string, comment:null),
FieldSchema(name:canister_lssn, type:string, comment:null),
FieldSchema(name:canister_session_id, type:bigint, comment:null),
FieldSchema(name:tltsid, type:string, comment:null),
FieldSchema(name:tltuid, type:string, comment:null),
FieldSchema(name:tltvid, type:string, comment:null),
FieldSchema(name:canister_server, type:string, comment:null),
FieldSchema(name:session_timestamp, type:string, comment:null),
FieldSchema(name:session_duration, type:string, comment:null),
FieldSchema(name:hit_count, type:bigint, comment:null),
FieldSchema(name:http_user_agent, type:string, comment:null),
FieldSchema(name:extractid, type:bigint, comment:null),
FieldSchema(name:site_link, type:string, comment:null),
FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
type:int, comment:null)],
location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)

Is there any way of getting these detailed comments and column names in a
readable format, just like the output of "DESCRIBE table_name"?


Thanks,

Chunky.


Re: Need tab separated output file and put limit on number of lines in a output file

2013-02-20 Thread Chunky Gupta
Hi Mark,

We mostly do insert overwrite into a local directory; at that location
multiple files with the output of that query are created, and we use these
files for our analysis. So, we want these files to be tab-separated.

Limiting the number of records means limiting the length of a file, not
limiting the overall output. For example, suppose my query's output has
10000 lines and I want to limit the length of a file to 1000 lines; hive
should then give me 10 files of 1000 lines each. Is there any configuration
for this?
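
(I know I could post-process the files outside Hive with something like
"split -l 1000 output_file part_" on the command line, but I would prefer
Hive to produce the files directly.)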

Thanks,
Chunky.

On Wed, Feb 20, 2013 at 10:50 PM, Mark Grover
wrote:

> Chunky,
> There may be another way to do this but to get tab separated output, I
> usually create an external table that's tab separated and insert
> overwrite into that table.
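>
> Something like this, roughly (a sketch; the table name, columns and
> location are made up):
>
> CREATE EXTERNAL TABLE results_tsv (c STRING, cnt INT)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/user/hadoop/results_tsv';
>
> INSERT OVERWRITE TABLE results_tsv
> SELECT c, count(*) FROM some_table GROUP BY c;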
>
> For limiting the number of records in the output, you can use the
> limit clause in your query.
>
> Mark
>
> On Tue, Feb 19, 2013 at 10:53 PM, Chunky Gupta 
> wrote:
> > Hi,
> >
> > Currently the output file columns of my query are separated by "^A"; I
> > need my output to be separated by tabs. Can anybody help me in setting
> > this up?
> >
> > One more doubt: I want to limit the number of lines in the output files.
> > For example, I do not want any of my output files to be more than 1000
> > lines; can I set this in configuration?
> >
> > Thanks,
> > Chunky.
>


Need tab separated output file and put limit on number of lines in a output file

2013-02-19 Thread Chunky Gupta
Hi,

Currently the output file columns of my query are separated by "^A"; I need
my output to be separated by tabs. Can anybody help me in setting this up?

One more doubt: I want to limit the number of lines in the output files. For
example, I do not want any of my output files to be more than 1000 lines;
can I set this in configuration?

Thanks,
Chunky.


Re: Loading json files into hive table is giving NULL as output(data is in s3 bucket)

2013-02-18 Thread Chunky Gupta
Hi Dean,

I was using *hive-json-serde-0.2.jar* earlier. Now I tried
*hive-json-serde-0.3.jar* as you suggested and it is working fine; I am
getting the output as expected.

Can you please tell me what code change from 0.2 to 0.3 could have
solved this problem?


Thanks,
Chunky.

On Mon, Feb 18, 2013 at 8:47 PM, Chunky Gupta wrote:

> Hi Dean,
>
> I tried removing the underscore too, and got the same output, which means
> the problem is not with the underscore. Yes, it was an example.
>
> Actual json file is like :-
>
>
> {"colnamec":"ColNametest","colnamets":"2013-01-14","colnameip":"10.10.10.10","colnameid":"10","colnameid2":"100","colnamep":0,"colnamecp":0,"colnamep":1,"colnameed":"31509","colnamesw":0,"colnamesu2":3,"colnameqq":"0","colnameppaa":0,"colnameqwe1":0,"colnamerty2":0,"colnameiop":"1000","colnamebnm":"23425253RFDSE","colnamefgh":2,"colnameagl":"","colnameyhgb":["1234","12345","2345","56789"],"colnamepoix":["12","4567","123","5678"],"colnamedswer":["100","567","123","678"],"colnamewerui":["10","10","10","10"]}
>
> I tried extracting one column only as I mentioned in last mail.
>
> There are values not in double quotes, some are null, and some keys have
> multiple values.
> Dean, is this json file in a form HIVE can handle?
>
> Thanks,
> Chunky.
>
>
>
>
>
> On Mon, Feb 18, 2013 at 6:23 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> The "uname="$._u" is the correct form. We also hacked on this SerDe at
>> Think Big Analytics. I don't know if you'll see an improvement though.
>>
>> https://github.com/thinkbiganalytics/hive-json-serde
>>
>> I wonder if there's a problem handling the leading underscore?
>>
>> Also, I know it's just an example, but in case it was taken from a real
>> situation, the dates in your example are for January.
>>
>> dean
>>
>> On Mon, Feb 18, 2013 at 6:43 AM, Chunky Gupta wrote:
>>
>>> Hi,
>>>
>>> I have data in an s3 bucket, which is in json format and is a zip file. I
>>> have added this jar file in the hive console :-
>>>
>>> http://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar&can=2&q=
>>>
>>> I tried the following steps to create table and load data :-
>>>
>>> 1. CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>>> WITH SERDEPROPERTIES ( "uname"="$._u" ) LOCATION
>>> 's3://BUCKET_NAME/test_data/'
>>>
>>>I tried this also :-
>>>
>>> CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>>> WITH SERDEPROPERTIES ( "uname"="_u" ) LOCATION
>>> 's3://BUCKET_NAME/test_data/'
>>>
>>>
>>>
>>> 2. alter table table_test add partition (dt='13Feb2012') location
>>> 's3n://BUCKET_NAME/test_data/13Feb2012';
>>>
>>> and json file is like this :-
>>> -
>>> {"_u":"test_name1","_ts":"2012-01-13","_ip":"IP1"}
>>> {"_u":"test_name2","_ts":"2012-01-13","_ip":"IP2"}
>>> {"_u":"test_name3","_ts":"2012-01-13","_ip":"IP3"}
>>>
>>>
>>> When I query :-
>>> select uname from table_test;
>>>
>>> Output :-
>>> NULL 13Feb2012
>>> NULL 13Feb2012
>>> NULL 13Feb2012
>>>
>>>
>>> Please help me and let me know how to load json data into a table.
>>>
>>> Thanks,
>>> Chunky.
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


Re: Loading json files into hive table is giving NULL as output(data is in s3 bucket)

2013-02-18 Thread Chunky Gupta
Hi Dean,

I tried removing the underscore too, and got the same output, which means
the problem is not with the underscore. Yes, it was an example.

Actual json file is like :-

{"colnamec":"ColNametest","colnamets":"2013-01-14","colnameip":"10.10.10.10","colnameid":"10","colnameid2":"100","colnamep":0,"colnamecp":0,"colnamep":1,"colnameed":"31509","colnamesw":0,"colnamesu2":3,"colnameqq":"0","colnameppaa":0,"colnameqwe1":0,"colnamerty2":0,"colnameiop":"1000","colnamebnm":"23425253RFDSE","colnamefgh":2,"colnameagl":"","colnameyhgb":["1234","12345","2345","56789"],"colnamepoix":["12","4567","123","5678"],"colnamedswer":["100","567","123","678"],"colnamewerui":["10","10","10","10"]}

I tried extracting one column only as I mentioned in last mail.

There are values not in double quotes, some are null and some keys are
having multiple values.
Dean, is this json file correct for HIVE to handle it ?

Thanks,
Chunky.




On Mon, Feb 18, 2013 at 6:23 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> The "uname="$._u" is the correct form. We also hacked on this SerDe at
> Think Big Analytics. I don't know if you'll see an improvement though.
>
> https://github.com/thinkbiganalytics/hive-json-serde
>
> I wonder if there's a problem handling the leading underscore?
>
> Also, I know it's just an example, but in case it was taken from a real
> situation, the dates in your example are for January.
>
> dean
>
> On Mon, Feb 18, 2013 at 6:43 AM, Chunky Gupta wrote:
>
>> Hi,
>>
>> I have data in an s3 bucket, which is in json format and is a zip file. I
>> have added this jar file in the hive console :-
>>
>> http://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar&can=2&q=
>>
>> I tried the following steps to create table and load data :-
>>
>> 1. CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>> WITH SERDEPROPERTIES ( "uname"="$._u" ) LOCATION
>> 's3://BUCKET_NAME/test_data/'
>>
>>I tried this also :-
>>
>> CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>> WITH SERDEPROPERTIES ( "uname"="_u" ) LOCATION
>> 's3://BUCKET_NAME/test_data/'
>>
>>
>>
>> 2. alter table table_test add partition (dt='13Feb2012') location
>> 's3n://BUCKET_NAME/test_data/13Feb2012';
>>
>> and json file is like this :-
>> -
>> {"_u":"test_name1","_ts":"2012-01-13","_ip":"IP1"}
>> {"_u":"test_name2","_ts":"2012-01-13","_ip":"IP2"}
>> {"_u":"test_name3","_ts":"2012-01-13","_ip":"IP3"}
>>
>>
>> When I query :-
>> select uname from table_test;
>>
>> Output :-
>> NULL 13Feb2012
>> NULL 13Feb2012
>> NULL 13Feb2012
>>
>>
>> Please help me and let me know how to load json data into a table.
>>
>> Thanks,
>> Chunky.
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>


Loading json files into hive table is giving NULL as output(data is in s3 bucket)

2013-02-18 Thread Chunky Gupta
Hi,

I have data in an s3 bucket, which is in json format and is a zip file. I have
added this jar file in the hive console :-
http://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar&can=2&q=

I tried the following steps to create table and load data :-

1. CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES ( "uname"="$._u" ) LOCATION
's3://BUCKET_NAME/test_data/'

   I tried this also :-

CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt STRING
) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde" WITH
SERDEPROPERTIES ( "uname"="_u" ) LOCATION 's3://BUCKET_NAME/test_data/'



2. alter table table_test add partition (dt='13Feb2012') location
's3n://BUCKET_NAME/test_data/13Feb2012';

and json file is like this :-
-
{"_u":"test_name1","_ts":"2012-01-13","_ip":"IP1"}
{"_u":"test_name2","_ts":"2012-01-13","_ip":"IP2"}
{"_u":"test_name3","_ts":"2012-01-13","_ip":"IP3"}


When I query :-
select uname from table_test;

Output :-
NULL 13Feb2012
NULL 13Feb2012
NULL 13Feb2012


Please help me and let me know how to load json data into a table.

Thanks,
Chunky.


Change timestamp format in hive

2013-02-13 Thread Chunky Gupta
Hi,

I have a log file which has timestamps in the format "YYYY-MM-DD-HH:MM:SS",
but the timestamp datatype format in hive is "YYYY-MM-DD HH:MM:SS".
I created a table with the datatype of that column as TIMESTAMP, but when I
load the data it throws an error. I think it is because of the difference in
format.

Is there any way to set the timestamp format while creating the table? Or
is there some other solution for this issue?
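
The only workaround I can think of is to keep the column as a STRING and
convert it at query time, e.g. (a sketch, not tested; log_ts and my_table
are made-up names):

SELECT from_unixtime(unix_timestamp(log_ts, 'yyyy-MM-dd-HH:mm:ss'))
FROM my_table;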

Thanks,
Chunky.


Re: Getting Error while executing "show partitions TABLE_NAME"

2013-02-07 Thread Chunky Gupta
Hi Venkatesh,

I checked and found that /tmp had very little space left.
I moved my db to another location with space, and it is working fine now.

Thanks,
Chunky.

On Thu, Feb 7, 2013 at 12:41 AM, Venkatesh Kavuluri
wrote:

> Looks like it's a memory/disk space issue with your database server used to
> store Hive metadata. Can you check the disk usage of the /tmp directory (the
> data directory of the DB server)?
>
> --
> Date: Wed, 6 Feb 2013 18:34:31 +0530
> Subject: Getting Error while executing "show partitions TABLE_NAME"
> From: chunky.gu...@vizury.com
> To: user@hive.apache.org
>
>
> Hi All,
>
> I ran this :-
> hive> show partitions tab_name;
>
> and got this error :-
>
> FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error
> executing JDOQL query "SELECT `THIS`.`PART_NAME` AS NUCORDER0 FROM
> `PARTITIONS` `THIS` LEFT OUTER JOIN `TBLS` `THIS_TABLE_DATABASE` ON
> `THIS`.`TBL_ID` = `THIS_TABLE_DATABASE`.`TBL_ID` LEFT OUTER JOIN `DBS`
> `THIS_TABLE_DATABASE_DATABASE_NAME` ON `THIS_TABLE_DATABASE`.`DB_ID` =
> `THIS_TABLE_DATABASE_DATABASE_NAME`.`DB_ID` LEFT OUTER JOIN `TBLS`
> `THIS_TABLE_TABLE_NAME` ON `THIS`.`TBL_ID` =
> `THIS_TABLE_TABLE_NAME`.`TBL_ID` WHERE
> `THIS_TABLE_DATABASE_DATABASE_NAME`.`NAME` = ? AND
> `THIS_TABLE_TABLE_NAME`.`TBL_NAME` = ? ORDER BY NUCORDER0 " : Error writing
> file '/tmp/MY0TOZFT' (Errcode: 28).
> NestedThrowables:
> java.sql.SQLException: Error writing file '/tmp/MY0TOZFT' (Errcode: 28)
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask
>
>
> Actually, yesterday we had fewer partitions, and today I added around
> 3000 more partitions for my data, which is stored in s3 for Hive. I think
> this caused the above error, but I don't know how to solve it.
>
> Please help me in this.
>
> Thanks,
> Chunky.
>


Getting Error while executing "show partitions TABLE_NAME"

2013-02-06 Thread Chunky Gupta
Hi All,

I ran this :-
hive> show partitions tab_name;

and got this error :-

FAILED: Error in metadata: javax.jdo.JDODataStoreException: Error executing
JDOQL query "SELECT `THIS`.`PART_NAME` AS NUCORDER0 FROM `PARTITIONS`
`THIS` LEFT OUTER JOIN `TBLS` `THIS_TABLE_DATABASE` ON `THIS`.`TBL_ID` =
`THIS_TABLE_DATABASE`.`TBL_ID` LEFT OUTER JOIN `DBS`
`THIS_TABLE_DATABASE_DATABASE_NAME` ON `THIS_TABLE_DATABASE`.`DB_ID` =
`THIS_TABLE_DATABASE_DATABASE_NAME`.`DB_ID` LEFT OUTER JOIN `TBLS`
`THIS_TABLE_TABLE_NAME` ON `THIS`.`TBL_ID` =
`THIS_TABLE_TABLE_NAME`.`TBL_ID` WHERE
`THIS_TABLE_DATABASE_DATABASE_NAME`.`NAME` = ? AND
`THIS_TABLE_TABLE_NAME`.`TBL_NAME` = ? ORDER BY NUCORDER0 " : Error writing
file '/tmp/MY0TOZFT' (Errcode: 28).
NestedThrowables:
java.sql.SQLException: Error writing file '/tmp/MY0TOZFT' (Errcode: 28)
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask


Actually, yesterday we had fewer partitions, and today I added around
3000 more partitions for my data, which is stored in s3 for Hive. I think
this caused the above error, but I don't know how to solve it.

Please help me in this.

Thanks,
Chunky.


Does Hue (Hadoop User Experience) works with Apache HIVE/HADOOP

2012-12-28 Thread Chunky Gupta
Hi,

I have Apache Hive and Apache Hadoop on Amazon EC2 machines. Can anyone tell
me whether HUE can be used with this setup instead of a CDH Hadoop cluster?
If not, is there any alternative UI similar to HUE?

Please help.
Thanks,
Chunky.


Re: Alter table is giving error

2012-11-27 Thread Chunky Gupta
Hi,

Now when I try to load a csv file into any table I created, it's not
working.

I created a table :-
CREATE EXTERNAL TABLE someidtable (
someid STRING
)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://location/';

Then

LOAD DATA INPATH 's3://location/someidexcel.csv' INTO TABLE someidtable;

It gives this error:-
"Error in semantic analysis: Line 1:17 Invalid path
''s3n://location/someidexcel.csv'': only "file" or "hdfs" file systems
accepted"

Please help me in resolving this issue.
Thanks,
Chunky.

On Wed, Nov 7, 2012 at 6:43 PM, Chunky Gupta wrote:

> Okay Mark, I will be looking into this JIRA regularly.
> Thanks again for helping.
> Chunky.
>
>
> On Wed, Nov 7, 2012 at 12:22 PM, Mark Grover 
> wrote:
>
>> Chunky,
>> I just tried it myself. It turns out that the directory you are adding as
>> partition has to be empty for msck repair to work. This is obviously
>> sub-optimal and there is a JIRA in place (
>> https://issues.apache.org/jira/browse/HIVE-3231) to fix it.
>>
>> So, I'd suggest you keep an eye out for the next version for that fix to
>> come in. In the meanwhile, run msck after you create your partition
>> directory but before you populate your directory with data.
>>
>> Mark
>>
>>
>> On Tue, Nov 6, 2012 at 10:33 PM, Chunky Gupta wrote:
>>
>>> Hi Mark,
>>> Sorry, I forgot to mention. I have also tried
>>> msck repair table <table_name>;
>>> and got the same output which I got from msck alone.
>>> Do I need to change any other settings for this to work? I have
>>> prepared the Hadoop and Hive setup from scratch on EC2.
>>>
>>> Thanks,
>>> Chunky.
>>>
>>>
>>>
>>> On Wed, Nov 7, 2012 at 11:58 AM, Mark Grover <
>>> grover.markgro...@gmail.com> wrote:
>>>
>>>> Chunky,
>>>> You should have run:
>>>> msck repair table <table_name>;
>>>>
>>>> Sorry, I should have made it clear in my last reply. I have added an
>>>> entry to Hive wiki for benefit of others:
>>>>
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Tue, Nov 6, 2012 at 9:55 PM, Chunky Gupta 
>>>> wrote:
>>>>
>>>>> Hi Mark,
>>>>> I didn't get any error.
>>>>> I ran this on hive console:-
>>>>>  "msck table Table_Name;"
>>>>> It said OK and showed the execution time as 1.050 sec.
>>>>> But when I checked the partitions for the table using
>>>>>   "show partitions Table_Name;"
>>>>> it didn't show me any partitions.
>>>>>
>>>>> Thanks,
>>>>> Chunky.
>>>>>
>>>>>
>>>>> On Tue, Nov 6, 2012 at 10:38 PM, Mark Grover <
>>>>> grover.markgro...@gmail.com> wrote:
>>>>>
>>>>>> Glad to hear, Chunky.
>>>>>>
>>>>>> Out of curiosity, what errors did you get when using msck?
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 6, 2012 at 5:14 AM, Chunky Gupta >>>>> > wrote:
>>>>>>
>>>>>>> Hi Mark,
>>>>>>> I tried msck, but it is not working for me. I have written a python
>>>>>>> script to partition the data individually.
>>>>>>>
>>>>>>> Thank you Edward, Mark and Dean.
>>>>>>> Chunky.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 5, 2012 at 11:08 PM, Mark Grover <
>>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Chunky,
>>>>>>>> I have used "recover partitions" command on EMR, and that worked
>>>>>>>> fine.
>>>>>>>>
>>>>>>>> However, take a look at
>>>>>>>> https://issues.apache.org/jira/browse/HIVE-874. Seems like msck
>>>>>>>> command in Apache Hive does the same thing. Try it out and let us know
>>>>>>>> how it goes.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>

Re: hive query not running in cron job

2012-11-23 Thread Chunky Gupta
Thanks, it's working after adding this line :)
Chunky.

On Fri, Nov 23, 2012 at 11:24 AM, wd  wrote:

> Add the following line before your crontab config
>
> source ~/.bashrc
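>
> e.g. applied to the job line itself (one way to do it; cron runs /bin/sh,
> so use "." rather than "source" there):
>
> 10 4 * * * . ~/.bashrc && /mnt/user/cron_script.py > /mnt/user/tmp/log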
>
>
>
> On Thu, Nov 22, 2012 at 5:59 PM, Chunky Gupta wrote:
>
>> Hi,
>> I have a python script :-
>>
>> ---cron_script.py---
>>
>> import os
>> import sys
>> from subprocess import call
>> print 'starting'
>> call(['hive', '-f',
>> '/mnt/user/test_query'],stderr=open('/mnt/user/tmp/error','w'),
>> stdout=open('/mnt/user/tmp/output','w'))
>>
>> ---cron_script.py---
>> --test_query-
>>
>> create table test (testcookie STRING, testdate STRING) ROW FORMAT
>> DELIMITED FIELDS TERMINATED BY '\t';
>>
>> --test_query-
>>
>> under crontab -e, I have added this line:-
>>
>> 10 4 * * * sudo /mnt/user/cron_script.py > /mnt/user/tmp/log
>>
>> This cron job executes, and the "/mnt/user/tmp/log" file is created
>> containing the string "starting".
>> The "/mnt/user/tmp/error" and "/mnt/user/tmp/output" files are also
>> created but are empty. Also, no table is created.
>>
>> If I run this script normally without cron job, it is working fine.
>>
>> Please help me in setting up this cron job.
>>
>> Thanks,
>> Chunky.
>>
>
>


hive query not running in cron job

2012-11-22 Thread Chunky Gupta
Hi,
I have a python script :-

---cron_script.py---

import os
import sys
from subprocess import call
print 'starting'
call(['hive', '-f',
'/mnt/user/test_query'],stderr=open('/mnt/user/tmp/error','w'),
stdout=open('/mnt/user/tmp/output','w'))

---cron_script.py---
--test_query-

create table test (testcookie STRING, testdate STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

--test_query-

under crontab -e, I have added this line:-

10 4 * * * sudo /mnt/user/cron_script.py > /mnt/user/tmp/log

This cron job executes, and the "/mnt/user/tmp/log" file is created
containing the string "starting".
The "/mnt/user/tmp/error" and "/mnt/user/tmp/output" files are also
created but are empty. Also, no table is created.

If I run this script normally without cron job, it is working fine.

Please help me in setting up this cron job.

Thanks,
Chunky.


Re: Alter table is giving error

2012-11-07 Thread Chunky Gupta
Okay Mark, I will be looking into this JIRA regularly.
Thanks again for helping.
Chunky.

On Wed, Nov 7, 2012 at 12:22 PM, Mark Grover wrote:

> Chunky,
> I just tried it myself. It turns out that the directory you are adding as
> partition has to be empty for msck repair to work. This is obviously
> sub-optimal and there is a JIRA in place (
> https://issues.apache.org/jira/browse/HIVE-3231) to fix it.
>
> So, I'd suggest you keep an eye out for the next version for that fix to
> come in. In the meanwhile, run msck after you create your partition
> directory but before you populate your directory with data.
>
> Mark
>
>
> On Tue, Nov 6, 2012 at 10:33 PM, Chunky Gupta wrote:
>
>> Hi Mark,
>> Sorry, I forgot to mention. I have also tried
>> msck repair table <table_name>;
>> and got the same output which I got from msck alone.
>> Do I need to change any other settings for this to work? I have
>> prepared the Hadoop and Hive setup from scratch on EC2.
>>
>> Thanks,
>> Chunky.
>>
>>
>>
>> On Wed, Nov 7, 2012 at 11:58 AM, Mark Grover > > wrote:
>>
>>> Chunky,
>>> You should have run:
>>> msck repair table <table_name>;
>>>
>>> Sorry, I should have made it clear in my last reply. I have added an
>>> entry to Hive wiki for benefit of others:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>>>
>>> Mark
>>>
>>>
>>> On Tue, Nov 6, 2012 at 9:55 PM, Chunky Gupta wrote:
>>>
>>>> Hi Mark,
>>>> I didn't get any error.
>>>> I ran this on hive console:-
>>>>  "msck table Table_Name;"
>>>> It said OK and showed the execution time as 1.050 sec.
>>>> But when I checked the partitions for the table using
>>>>       "show partitions Table_Name;"
>>>> it didn't show me any partitions.
>>>>
>>>> Thanks,
>>>> Chunky.
>>>>
>>>>
>>>> On Tue, Nov 6, 2012 at 10:38 PM, Mark Grover <
>>>> grover.markgro...@gmail.com> wrote:
>>>>
>>>>> Glad to hear, Chunky.
>>>>>
>>>>> Out of curiosity, what errors did you get when using msck?
>>>>>
>>>>>
>>>>> On Tue, Nov 6, 2012 at 5:14 AM, Chunky Gupta 
>>>>> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>> I tried msck, but it is not working for me. I have written a python
>>>>>> script to partition the data individually.
>>>>>>
>>>>>> Thank you Edward, Mark and Dean.
>>>>>> Chunky.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 5, 2012 at 11:08 PM, Mark Grover <
>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>
>>>>>>> Chunky,
>>>>>>> I have used "recover partitions" command on EMR, and that worked
>>>>>>> fine.
>>>>>>>
>>>>>>> However, take a look at
>>>>>>> https://issues.apache.org/jira/browse/HIVE-874. Seems like msck
>>>>>>> command in Apache Hive does the same thing. Try it out and let us know
>>>>>>> how it goes.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Mon, Nov 5, 2012 at 7:56 AM, Edward Capriolo <
>>>>>>> edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Recover partitions should work the same way for different file
>>>>>>>> systems.
>>>>>>>>
>>>>>>>> Edward
>>>>>>>>
>>>>>>>> On Mon, Nov 5, 2012 at 9:33 AM, Dean Wampler
>>>>>>>>  wrote:
>>>>>>>> > Writing a script to add the external partitions individually is
>>>>>>>> the only way
>>>>>>>> > I know of.
>>>>>>>> >
>>>>>>>> > Sent from my rotary phone.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Nov 5, 2012, at 8:19 AM, Chunky Gupta 
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi Dean,
>>>>>>>> >
>>>>>

Re: Alter table is giving error

2012-11-06 Thread Chunky Gupta
Hi Mark,
Sorry, I forgot to mention. I have also tried
msck repair table <table_name>;
and got the same output which I got from msck alone.
Do I need to change any other settings for this to work? I have
prepared the Hadoop and Hive setup from scratch on EC2.

Thanks,
Chunky.



On Wed, Nov 7, 2012 at 11:58 AM, Mark Grover wrote:

> Chunky,
> You should have run:
> msck repair table <table_name>;
>
> Sorry, I should have made it clear in my last reply. I have added an entry
> to Hive wiki for benefit of others:
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>
> Mark
>
>
> On Tue, Nov 6, 2012 at 9:55 PM, Chunky Gupta wrote:
>
>> Hi Mark,
>> I didn't get any error.
>> I ran this on hive console:-
>>  "msck table Table_Name;"
>> It said OK and showed the execution time as 1.050 sec.
>> But when I checked the partitions for the table using
>>   "show partitions Table_Name;"
>> it didn't show me any partitions.
>>
>> Thanks,
>> Chunky.
>>
>>
>> On Tue, Nov 6, 2012 at 10:38 PM, Mark Grover > > wrote:
>>
>>> Glad to hear, Chunky.
>>>
>>> Out of curiosity, what errors did you get when using msck?
>>>
>>>
>>> On Tue, Nov 6, 2012 at 5:14 AM, Chunky Gupta wrote:
>>>
>>>> Hi Mark,
>>>> I tried msck, but it is not working for me. I have written a python
>>>> script to partition the data individually.
>>>>
>>>> Thank you Edward, Mark and Dean.
>>>> Chunky.
>>>>
>>>>
>>>> On Mon, Nov 5, 2012 at 11:08 PM, Mark Grover <
>>>> grover.markgro...@gmail.com> wrote:
>>>>
>>>>> Chunky,
>>>>> I have used "recover partitions" command on EMR, and that worked fine.
>>>>>
>>>>> However, take a look at https://issues.apache.org/jira/browse/HIVE-874. 
>>>>> Seems
>>>>> like msck command in Apache Hive does the same thing. Try it out and let
>>>>> us know how it goes.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Mon, Nov 5, 2012 at 7:56 AM, Edward Capriolo >>>> > wrote:
>>>>>
>>>>>> Recover partitions should work the same way for different file
>>>>>> systems.
>>>>>>
>>>>>> Edward
>>>>>>
>>>>>> On Mon, Nov 5, 2012 at 9:33 AM, Dean Wampler
>>>>>>  wrote:
>>>>>> > Writing a script to add the external partitions individually is the
>>>>>> only way
>>>>>> > I know of.
>>>>>> >
>>>>>> > Sent from my rotary phone.
>>>>>> >
>>>>>> >
>>>>>> > On Nov 5, 2012, at 8:19 AM, Chunky Gupta 
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Dean,
>>>>>> >
>>>>>> > Actually I was having Hadoop and Hive cluster on EMR and I have S3
>>>>>> storage
>>>>>> > containing logs which updates daily and having partition with
>>>>>> date(dt). And
>>>>>> > I was using this recover partition.
>>>>>> > Now I wanted to shift to EC2 and have my own Hadoop and Hive
>>>>>> cluster. So,
>>>>>> > what is the alternate of using recover partition in this case, if
>>>>>> you have
>>>>>> > any idea ?
>>>>>> > I found one way of individually partitioning all dates, so I have
>>>>>> to write
>>>>>> > script for that to do so for all dates. Is there any easiest way
>>>>>> other than
>>>>>> > this ?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Chunky
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Mon, Nov 5, 2012 at 6:28 PM, Dean Wampler
>>>>>> >  wrote:
>>>>>> >>
>>>>>> >> The RECOVER PARTITIONS is an enhancement added by Amazon to their
>>>>>> version
>>>>>> >> of Hive.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-

Re: Alter table is giving error

2012-11-06 Thread Chunky Gupta
Hi Mark,
I didn't get any error.
I ran this on hive console:-
 "msck table Table_Name;"
It said OK and showed the execution time as 1.050 sec.
But when I checked the partitions for the table using
  "show partitions Table_Name;"
it didn't show me any partitions.

Thanks,
Chunky.

On Tue, Nov 6, 2012 at 10:38 PM, Mark Grover wrote:

> Glad to hear, Chunky.
>
> Out of curiosity, what errors did you get when using msck?
>
>
> On Tue, Nov 6, 2012 at 5:14 AM, Chunky Gupta wrote:
>
>> Hi Mark,
>> I tried msck, but it is not working for me. I have written a python
>> script to partition the data individually.
>>
>> Thank you Edward, Mark and Dean.
>> Chunky.
>>
>>
>> On Mon, Nov 5, 2012 at 11:08 PM, Mark Grover > > wrote:
>>
>>> Chunky,
>>> I have used "recover partitions" command on EMR, and that worked fine.
>>>
>>> However, take a look at https://issues.apache.org/jira/browse/HIVE-874. 
>>> Seems
>>> like msck command in Apache Hive does the same thing. Try it out and let us
>>> know how it goes.
>>>
>>> Mark
>>>
>>> On Mon, Nov 5, 2012 at 7:56 AM, Edward Capriolo 
>>> wrote:
>>>
>>>> Recover partitions should work the same way for different file systems.
>>>>
>>>> Edward
>>>>
>>>> On Mon, Nov 5, 2012 at 9:33 AM, Dean Wampler
>>>>  wrote:
>>>> > Writing a script to add the external partitions individually is the
>>>> only way
>>>> > I know of.
>>>> >
>>>> > Sent from my rotary phone.
>>>> >
>>>> >
>>>> > On Nov 5, 2012, at 8:19 AM, Chunky Gupta 
>>>> wrote:
>>>> >
>>>> > Hi Dean,
>>>> >
>>>> > Actually I was having Hadoop and Hive cluster on EMR and I have S3
>>>> storage
>>>> > containing logs which updates daily and having partition with
>>>> date(dt). And
>>>> > I was using this recover partition.
>>>> > Now I wanted to shift to EC2 and have my own Hadoop and Hive cluster.
>>>> So,
>>>> > what is the alternate of using recover partition in this case, if you
>>>> have
>>>> > any idea ?
>>>> > I found one way of individually partitioning all dates, so I have to
>>>> write
>>>> > script for that to do so for all dates. Is there any easiest way
>>>> other than
>>>> > this ?
>>>> >
>>>> > Thanks,
>>>> > Chunky
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Nov 5, 2012 at 6:28 PM, Dean Wampler
>>>> >  wrote:
>>>> >>
>>>> >> The RECOVER PARTITIONS is an enhancement added by Amazon to their
>>>> version
>>>> >> of Hive.
>>>> >>
>>>> >>
>>>> >>
>>>> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html
>>>> >>
>>>> >> 
>>>> >>   Chapter 21 of Programming Hive discusses this feature and other
>>>> aspects
>>>> >> of using Hive in EMR.
>>>> >> 
>>>> >>
>>>> >> dean
>>>> >>
>>>> >>
>>>> >> On Mon, Nov 5, 2012 at 5:34 AM, Chunky Gupta <
>>>> chunky.gu...@vizury.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I am having a cluster setup on EC2 with Hadoop version 0.20.2 and
>>>> Hive
>>>> >>> version 0.8.1 (I configured everything) . I have created a table
>>>> using :-
>>>> >>>
>>>> >>> CREATE EXTERNAL TABLE XXX ( YYY )PARTITIONED BY ( ZZZ )ROW FORMAT
>>>> >>> DELIMITED FIELDS TERMINATED BY 'WWW' LOCATION
>>>> 's3://my-location/data/';
>>>> >>>
>>>> >>> Now I am trying to recover partition using :-
>>>> >>>
>>>> >>> ALTER TABLE XXX RECOVER PARTITIONS;
>>>> >>>
>>>> >>> but I am getting this error :- "FAILED: Parse Error: line 1:12
>>>> cannot
>>>> >>> recognize input near 'XXX' 'RECOVER' 'PARTITIONS' in alter table
>>>> statement"
>>>> >>>
>>>> >>> Doing same steps on a cluster setup on EMR with Hadoop version
>>>> 1.0.3 and
>>>> >>> Hive version 0.8.1 (Configured by EMR), works fine.
>>>> >>>
>>>> >>> So is this a version issue or am I missing some configuration
>>>> changes in
>>>> >>> EC2 setup ?
>>>> >>> I am not able to find exact solution for this problem on internet.
>>>> Please
>>>> >>> help me.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Chunky.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Dean Wampler, Ph.D.
>>>> >> thinkbiganalytics.com
>>>> >> +1-312-339-1330
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>


Re: Alter table is giving error

2012-11-06 Thread Chunky Gupta
Hi Mark,
I tried msck, but it is not working for me. I have written a python script
to partition the data individually.
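
(The script basically emits one ALTER TABLE statement per date, e.g.
something like this, with a made-up table name:

ALTER TABLE my_logs ADD PARTITION (dt='2012-11-05')
LOCATION 's3://my-location/data/2012-11-05';)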

Thank you Edward, Mark and Dean.
Chunky.

On Mon, Nov 5, 2012 at 11:08 PM, Mark Grover wrote:

> Chunky,
> I have used "recover partitions" command on EMR, and that worked fine.
>
> However, take a look at https://issues.apache.org/jira/browse/HIVE-874. Seems
> like msck command in Apache Hive does the same thing. Try it out and let us
> know how it goes.
>
> Mark
>
> On Mon, Nov 5, 2012 at 7:56 AM, Edward Capriolo wrote:
>
>> Recover partitions should work the same way for different file systems.
>>
>> Edward
>>
>> On Mon, Nov 5, 2012 at 9:33 AM, Dean Wampler
>>  wrote:
>> > Writing a script to add the external partitions individually is the
>> only way
>> > I know of.
>> >
>> > Sent from my rotary phone.
>> >
>> >
>> > On Nov 5, 2012, at 8:19 AM, Chunky Gupta 
>> wrote:
>> >
>> > Hi Dean,
>> >
>> > Actually I was having Hadoop and Hive cluster on EMR and I have S3
>> storage
>> > containing logs which updates daily and having partition with date(dt).
>> And
>> > I was using this recover partition.
>> > Now I wanted to shift to EC2 and have my own Hadoop and Hive cluster.
>> So,
>> > what is the alternate of using recover partition in this case, if you
>> have
>> > any idea ?
>> > I found one way of individually partitioning all dates, so I have to
>> write
>> > script for that to do so for all dates. Is there any easiest way other
>> than
>> > this ?
>> >
>> > Thanks,
>> > Chunky
>> >
>> >
>> >
>> > On Mon, Nov 5, 2012 at 6:28 PM, Dean Wampler
>> >  wrote:
>> >>
>> >> The RECOVER PARTITIONS is an enhancement added by Amazon to their
>> version
>> >> of Hive.
>> >>
>> >>
>> >>
>> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html
>> >>
>> >> 
>> >>   Chapter 21 of Programming Hive discusses this feature and other
>> aspects
>> >> of using Hive in EMR.
>> >> 
>> >>
>> >> dean
>> >>
>> >>
>> >> On Mon, Nov 5, 2012 at 5:34 AM, Chunky Gupta 
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I am having a cluster setup on EC2 with Hadoop version 0.20.2 and Hive
>> >>> version 0.8.1 (I configured everything) . I have created a table
>> using :-
>> >>>
>> >>> CREATE EXTERNAL TABLE XXX ( YYY )PARTITIONED BY ( ZZZ )ROW FORMAT
>> >>> DELIMITED FIELDS TERMINATED BY 'WWW' LOCATION
>> 's3://my-location/data/';
>> >>>
>> >>> Now I am trying to recover partition using :-
>> >>>
>> >>> ALTER TABLE XXX RECOVER PARTITIONS;
>> >>>
>> >>> but I am getting this error :- "FAILED: Parse Error: line 1:12 cannot
>> >>> recognize input near 'XXX' 'RECOVER' 'PARTITIONS' in alter table
>> statement"
>> >>>
>> >>> Doing same steps on a cluster setup on EMR with Hadoop version 1.0.3
>> and
>> >>> Hive version 0.8.1 (Configured by EMR), works fine.
>> >>>
>> >>> So is this a version issue or am I missing some configuration changes
>> in
>> >>> EC2 setup ?
>> >>> I am not able to find exact solution for this problem on internet.
>> Please
>> >>> help me.
>> >>>
>> >>> Thanks,
>> >>> Chunky.
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Dean Wampler, Ph.D.
>> >> thinkbiganalytics.com
>> >> +1-312-339-1330
>> >>
>> >>
>> >
>>
>
>


Re: Alter table is giving error

2012-11-05 Thread Chunky Gupta
Hi Dean,

Actually, I had a Hadoop and Hive cluster on EMR, with S3 storage containing
logs which update daily and are partitioned by date (dt). And I was using
this recover partitions feature.
Now I want to shift to EC2 and have my own Hadoop and Hive cluster. So what
is the alternative to using recover partitions in this case, if you have any
idea?
I found one way of partitioning all dates individually, but I would have to
write a script to do that for all dates. Is there any easier way than this?

Thanks,
Chunky



On Mon, Nov 5, 2012 at 6:28 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> The RECOVER PARTITIONS is an enhancement added by Amazon to their version
> of Hive.
>
>
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html
>
> 
>   Chapter 21 of Programming Hive discusses this feature and other aspects
> of using Hive in EMR.
> 
>
> dean
>
>
> On Mon, Nov 5, 2012 at 5:34 AM, Chunky Gupta wrote:
>
>> Hi,
>>
>> I have a cluster set up on EC2 with Hadoop version 0.20.2 and Hive
>> version 0.8.1 (I configured everything). I have created a table using :-
>>
>> CREATE EXTERNAL TABLE XXX ( YYY ) PARTITIONED BY ( ZZZ ) ROW FORMAT
>> DELIMITED FIELDS TERMINATED BY 'WWW' LOCATION 's3://my-location/data/';
>>
>> Now I am trying to recover partition using :-
>>
>> ALTER TABLE XXX RECOVER PARTITIONS;
>>
>> but I am getting this error :- "FAILED: Parse Error: line 1:12 cannot
>> recognize input near 'XXX' 'RECOVER' 'PARTITIONS' in alter table statement"
>>
>> Doing the same steps on a cluster set up on EMR with Hadoop version 1.0.3
>> and Hive version 0.8.1 (configured by EMR) works fine.
>>
>> So is this a version issue, or am I missing some configuration changes in
>> the EC2 setup?
>> I am not able to find an exact solution for this problem on the internet.
>> Please help me.
>>
>> Thanks,
>> Chunky.
>>
>>
>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>


Alter table is giving error

2012-11-05 Thread Chunky Gupta
Hi,

I have a cluster set up on EC2 with Hadoop version 0.20.2 and Hive
version 0.8.1 (I configured everything). I have created a table using :-

CREATE EXTERNAL TABLE XXX ( YYY ) PARTITIONED BY ( ZZZ ) ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'WWW' LOCATION 's3://my-location/data/';

Now I am trying to recover partition using :-

ALTER TABLE XXX RECOVER PARTITIONS;

but I am getting this error :- "FAILED: Parse Error: line 1:12 cannot
recognize input near 'XXX' 'RECOVER' 'PARTITIONS' in alter table statement"

Doing the same steps on a cluster set up on EMR with Hadoop version 1.0.3 and
Hive version 0.8.1 (configured by EMR) works fine.

So is this a version issue, or am I missing some configuration changes in
the EC2 setup?
I am not able to find an exact solution for this problem on the internet.
Please help me.

Thanks,
Chunky.


Re: Enabling fair scheduler using Bootstrap is failing

2012-10-29 Thread Chunky Gupta
Hi,

Today, I enabled logging while creating a new job. The errors which I see
in the log files are :

ERROR org.apache.hadoop.security.UserGroupInformation (IPC Server handler
12 on 9000): PriviledgedActionException as:hadoop
cause:java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/
jobtracker.info could only be replicated to 0 nodes, instead of 1

and

2012-10-30 06:14:26,527 WARN org.apache.hadoop.hdfs.DFSClient (Thread-18):
Error Recovery for block null bad datanode[0] nodes == null
2012-10-30 06:14:26,527 WARN org.apache.hadoop.hdfs.DFSClient (Thread-18):
Could not get block locations. Source file
"/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info" - Aborting...
2012-10-30 06:14:26,527 WARN org.apache.hadoop.mapred.JobTracker (main):
Writing to file hdfs://
10.92.235.20:9000/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info failed!
2012-10-30 06:14:26,528 WARN org.apache.hadoop.mapred.JobTracker (main):
FileSystem is not ready yet!
2012-10-30 06:14:26,534 WARN org.apache.hadoop.mapred.JobTracker (main):
Failed to initialize recovery manager.

Also, from the default mapred-site.xml file I am uploading at bootstrap, I
am removing this configuration:

<property>
  <name>mapred.job.tracker</name>
  <value>ip-10-116-159-127.ec2.internal:9001</value>
</property>

Please suggest any solution for this.

Thanks,
Chunky.


On Mon, Oct 29, 2012 at 7:10 PM, Chunky Gupta wrote:

> Hi,
>
> I tried this also in optional arguments "--site-config-file
> s3://viz-emr-hive/config/mapred-site.xml -m
> mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler"
>
> This time it went to the "Bootstrapping" state and then failed.
>
> Let me know what changes I can do to make it work.
>
> Thanks,
> Chunky.
>
>
> On Mon, Oct 29, 2012 at 6:37 PM, Chunky Gupta wrote:
>
>> Hi,
>>
>> I am trying to enable fair scheduler on my emr cluster at bootstrap. The
>> steps I am doing are :
>>
>> 1. Creating Job instance from AWS console as "Create New Job Flow" with
>> Job Type as Hive program.
>> 2. Selecting "Start an Interactive Hive Session".
>> 3. Selecting Master and core instance group and Amazon EC2 Key Pair .
>> 4. Selecting "Configure your Bootstrap Actions" and action type as
>> "Configure Hadoop".
>> 5. Uploaded a mapred-site.xml in s3 with setting parameters for enabling
>> fair scheduler as :
>>   <property>
>>     <name>mapred.fairscheduler.allocation.file</name>
>>     <value>conf/pools.xml</value>
>>   </property>
>>   <property>
>>     <name>mapred.jobtracker.taskScheduler</name>
>>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>>   </property>
>>   <property>
>>     <name>mapred.fairscheduler.assignmultiple</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>mapred.fairscheduler.eventlog.enabled</name>
>>     <value>false</value>
>>   </property>
>>
>> 6. In optional arguments I tried "--site-mapred-site,s3://XXX(where I
>> uploaded)/mapred-site.xml" to upload this xml file for my cluster.
>>
>> Finally the creation of machine is failing with error "On the master
>> instance (xxx), bootstrap action 1 returned a non-zero return code".
>>
>> I think in optional arguments I am giving something wrong. Please help me
>> in this.
>>
>> Thanks,
>> Chunky.
>>
>
>


Re: Enabling fair scheduler using Bootstrap is failing

2012-10-29 Thread Chunky Gupta
Hi,

I tried this also in optional arguments "--site-config-file
s3://viz-emr-hive/config/mapred-site.xml -m
mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler"

This time it went to the "Bootstrapping" state and then failed.

Let me know what changes I can do to make it work.

Thanks,
Chunky.

On Mon, Oct 29, 2012 at 6:37 PM, Chunky Gupta wrote:

> Hi,
>
> I am trying to enable fair scheduler on my emr cluster at bootstrap. The
> steps I am doing are :
>
> 1. Creating Job instance from AWS console as "Create New Job Flow" with
> Job Type as Hive program.
> 2. Selecting "Start an Interactive Hive Session".
> 3. Selecting Master and core instance group and Amazon EC2 Key Pair .
> 4. Selecting "Configure your Bootstrap Actions" and action type as
> "Configure Hadoop".
> 5. Uploaded a mapred-site.xml in s3 with setting parameters for enabling
> fair scheduler as :
>   <property>
>     <name>mapred.fairscheduler.allocation.file</name>
>     <value>conf/pools.xml</value>
>   </property>
>   <property>
>     <name>mapred.jobtracker.taskScheduler</name>
>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>   </property>
>   <property>
>     <name>mapred.fairscheduler.assignmultiple</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.fairscheduler.eventlog.enabled</name>
>     <value>false</value>
>   </property>
>
> 6. In optional arguments I tried "--site-mapred-site,s3://XXX(where I
> uploaded)/mapred-site.xml" to upload this xml file for my cluster.
>
> Finally the creation of machine is failing with error "On the master
> instance (xxx), bootstrap action 1 returned a non-zero return code".
>
> I think in optional arguments I am giving something wrong. Please help me
> in this.
>
> Thanks,
> Chunky.
>


Enabling fair scheduler using Bootstrap is failing

2012-10-29 Thread Chunky Gupta
Hi,

I am trying to enable the fair scheduler on my emr cluster at bootstrap.
The steps I am following are :

1. Creating Job instance from AWS console as "Create New Job Flow" with Job
Type as Hive program.
2. Selecting "Start an Interactive Hive Session".
3. Selecting Master and core instance group and Amazon EC2 Key Pair .
4. Selecting "Configure your Bootstrap Actions" and action type as
"Configure Hadoop".
5. Uploading a mapred-site.xml to s3, with the parameters for enabling the
fair scheduler set as :
 
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>conf/pools.xml</value>
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.fairscheduler.eventlog.enabled</name>
    <value>false</value>
  </property>

6. In optional arguments I tried "--site-mapred-site,s3://XXX(where I
uploaded)/mapred-site.xml" to upload this xml file for my cluster.

Finally, the creation of the machine fails with the error "On the master
instance (xxx), bootstrap action 1 returned a non-zero return code".

I think I am giving something wrong in the optional arguments. Please help
me with this.

Thanks,
Chunky.


Executing queries after setting hive.exec.parallel in hive-site.xml

2012-10-25 Thread Chunky Gupta
Hi,

I have 2 questions regarding the parameter 'hive.exec.parallel' in
hive-site.xml in ~/.versions/hive-0.8.1/conf/

1. How do I verify, from the query log files or in any other way, that a
particular query is executing with this parameter set to true?
2. Is it advisable to set this parameter for running queries?
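
The only check I know of is looking at the value from the CLI, e.g.:

set hive.exec.parallel;         (prints the current value)
set hive.exec.parallel=true;    (sets it for the current session only)

but that does not tell me whether a given query actually ran its stages in
parallel.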

Thanks,
Chunky.


Re: How to run multiple Hive queries in parallel

2012-10-22 Thread Chunky Gupta
Hi Bejoy and Bertrand

Thanks for the quick reply.

I think task slots are not available in my cluster because I have only 4
slave machines.
Actually, I am a beginner with HIVE. So could you let me know how I can
check whether task slots are available or not?

I have different user credentials to log in to my name node machine, but
I don't know much about the fair scheduler.

In case task slots are not available and get exhausted, could you please
point me to some publicly available fair scheduler which I can integrate
with HIVE to solve my problem?

Thank You,
Chunky.

On Mon, Oct 22, 2012 at 5:52 PM, Bertrand Dechoux wrote:

> Bejoy is right. I just want to say explicitly that the scheduler
> configuration is something which is orthogonal to the use of Hive (i.e. the
> same problem applies to Pig or standard MapReduce jobs).
>
> Regards
>
> Bertrand
>
> PS : There is also the capacity scheduler.
>
>
> On Mon, Oct 22, 2012 at 2:18 PM, Bejoy KS  wrote:
>
>> Hi
>>
>> Are your hive queries in waiting mode even though there are task slots
>> available on your cluster?
>>
>> If task slots are getting exhausted and you need parallelism here, then
>> you may need to look at some approaches of using fair scheduler and
>> different user accounts for each user so that each user gets his fair share
>> of task slots.
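>>
>> (A quick way to check: the JobTracker web UI, by default on port 50030 of
>> the master node, shows occupied vs. total map and reduce slots in its
>> cluster summary, and "hadoop job -list" shows the currently running jobs.)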
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> --
>> *From: * Chunky Gupta 
>> *Date: *Mon, 22 Oct 2012 17:27:45 +0530
>> *To: *
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *How to run multiple Hive queries in parallel
>>
>> Hi,
>>
>> I have one name node machine, under which there are 4 slave machines
>> to run the job.
>>
>> The way users run queries is
>> - They ssh into the name node machine
>> - They initiate hive and submit their queries
>>
>> Currently multiple users log in with the same credentials and submit
>> queries
>>
>> Whenever 2 or more users try to run queries at the same time from different
>> hive consoles, it runs only one query at a time; only when that query is
>> finished does the next query start executing, and so on.
>>
>> In this scenario, if a large query is submitted earlier, then all the
>> other queries have to wait for that query to complete.
>>
>> I want to run multiple queries at the same time. Is there any way or any
>> configuration parameter to do the same?
>>
>> PS: The data is in S3 and running HIVE on AWS EMR infrastructure in
>> interactive mode.
>>
>> Thank You,
>> Chunky.
>>
>>
>
>
> --
> Bertrand Dechoux
>


How to run multiple Hive queries in parallel

2012-10-22 Thread Chunky Gupta
Hi,

I have one name node machine, under which there are 4 slave machines to
run the job.

The way users run queries is
- They ssh into the name node machine
- They initiate hive and submit their queries

Currently multiple users log in with the same credentials and submit queries

Whenever 2 or more users try to run queries at the same time from different
hive consoles, it runs only one query at a time; only when that query is
finished does the next query start executing, and so on.

In this scenario, if a large query is submitted earlier, then all the other
queries have to wait for that query to complete.

I want to run multiple queries at the same time. Is there any way or any
configuration parameter to do the same?

PS: The data is in S3 and running HIVE on AWS EMR infrastructure in
interactive mode.

Thank You,
Chunky.