Re: Hive performance vs. SQL?

2012-03-20 Thread Keith Wiley
Thanks for the response.

Cheers!

On Mar 19, 2012, at 16:42 , Maxime Brugidou wrote:

 From my experience, if you can fit your data in a SQL database without sharding or
 anything, don't ever think twice. Hive is not even comparable.



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




Re: How to get job names and stages of a query?

2012-03-20 Thread Manish Bhoge
Whenever you submit a SQL query, a job ID gets generated. You can open the job tracker at
localhost:50030/jobtracker.jsp
It shows the jobs that are running and the rest of the details.
Thanks,
Manish
Sent from my BlackBerry, pls excuse typo

-Original Message-
From: Felix.徐 ygnhz...@gmail.com
Date: Tue, 20 Mar 2012 12:58:53 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: How to get job names and stages of a query?

Hi all,
I want to track the progress of a query. How can I get the job names, including
the stages, of a query?
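A related note: if only the stage breakdown is needed, EXPLAIN prints it up front without running the query. A minimal sketch (the table and query here are hypothetical):

hive> EXPLAIN SELECT user_id, count(*) FROM logs GROUP BY user_id;

The output lists STAGE DEPENDENCIES and STAGE PLANS, with one entry per stage (Stage-1, Stage-2, ...).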




Fail to create temporary directory when execute bucket map join

2012-03-20 Thread binhnt22
Hello there,

 

I have 2 tables

CREATE TABLE data(calling STRING COMMENT 'Calling number', 
volumn_download BIGINT COMMENT 'Volume download',
volumn_upload BIGINT COMMENT 'Volume upload')
PARTITIONED BY(ds STRING)
CLUSTERED BY (calling) INTO 100 BUCKETS;

CREATE TABLE sub(isdn STRING, sub_id STRING)
CLUSTERED BY (isdn) INTO 100 BUCKETS;

The DATA table has 15M records while the SUB table has only 600K records.

The following SQL script executed successfully:
select /*+ MAPJOIN(b) */ a.calling, b.sub_id from data a join sub b on
a.calling=b.isdn;

But when I enabled bucket map join by setting hive.optimize.bucketmapjoin = true,
the same SQL script failed:
select /*+ MAPJOIN(b) */ a.calling, b.sub_id from data a join sub b on
a.calling=b.isdn;

hive> set hive.optimize.bucketmapjoin = true;
hive> select /*+ MAPJOIN(b) */ a.calling, b.sub_id from ggsn_bucket a join
sub_bucket b on a.calling=b.isdn;
Total MapReduce jobs = 1
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
Execution log at:
/tmp/hduser/hduser_20120320080909_8e6a3419-4d2c-4148-a0c9-166d051c8274.log
2012-03-20 08:09:34 Starting to launch local task to process map join;
maximum memory = 932118528
2012-03-20 08:09:34 End of local task; Time Taken: 0.072 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
org.apache.hadoop.util.Shell$ExitCodeException: bash: line 0: cd:
/u01/app/hduser/hadoop-0.20.203.0/tempdir/hduser/hive_2012-03-20_08-09-27_810_1393729636696443501/-local-10002/HashTable-Stage-1: No such file or directory
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.

	at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
	at org.apache.hadoop.util.Shell.run(Shell.java:182)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
	at org.apache.hadoop.hive.common.FileUtils.tar(FileUtils.java:260)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:407)
	at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception
'org.apache.hadoop.util.Shell$ExitCodeException(bash: line 0: cd:
/u01/app/hduser/hadoop-0.20.203.0/tempdir/hduser/hive_2012-03-20_08-09-27_810_1393729636696443501/-local-10002/HashTable-Stage-1: No such file or directory
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.
)'
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
	at org.apache.hadoop.fs.Path.<init>(Path.java:90)
	at org.apache.hadoop.hive.ql.exec.Utilities.getHiveJobID(Utilities.java:379)
	at org.apache.hadoop.hive.ql.exec.Utilities.clearMapRedWork(Utilities.java:192)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:476)
	at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
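One thing worth ruling out (an assumption on my part, not confirmed in this thread): LOAD DATA does not bucket data by itself, so a table declared with 100 buckets but populated without hive.enforce.bucketing can be missing bucket files, and the bucket map join's local HashTable directory then comes up empty. A hedged sketch of repopulating through INSERT so the bucket files actually get written:

set hive.enforce.bucketing = true;
-- writing through INSERT (rather than LOAD DATA) makes Hive create one file per bucket
-- data_staging is a hypothetical unbucketed staging table holding the raw rows
INSERT OVERWRITE TABLE data PARTITION (ds='2012-03-19')
SELECT calling, volumn_download, volumn_upload FROM data_staging;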

Re: How to get job names and stages of a query?

2012-03-20 Thread Felix.徐
I actually want to get the job names of the stages through an API.

On Mar 20, 2012 at 2:23 PM, Manish Bhoge manishbh...@rocketmail.com wrote:

 Whenever you submit a SQL query, a job ID gets generated. You can open the job
 tracker at localhost:50030/jobtracker.jsp
 It shows the jobs that are running and the rest of the details.
 Thanks,
 Manish
 Sent from my BlackBerry, pls excuse typo
 --
 From: Felix.徐 ygnhz...@gmail.com
 Date: Tue, 20 Mar 2012 12:58:53 +0800
 To: user@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: How to get job names and stages of a query?

 Hi all,
 I want to track the progress of a query. How can I get the job names,
 including the stages, of a query?



Re: LOAD DATA problem

2012-03-20 Thread hadoop hive
hey Sean,

it's because you are loading a file into the same partition with the same
name (which is not possible); you must change the file name before loading it
into the same partition.

AFAIK there is no other way around this: you can change either the partition
name or the file name.

Thanks
Vikas Srivastava


On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
sean.mcnam...@webtrends.comwrote:

 Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
 to logs that already exist in a partition?  If the log is already in
 hdfs/hive I'd rather it fail and give me a return code or output saying
 that the log already exists.

 For example, if I run these queries:
 /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
 /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
 /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
 /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"

  I end up with:
 test_a.bz2
 test_b.bz2
 test_b_copy_1.bz2
 test_b_copy_2.bz2

 However, if I use OVERWRITE it will nuke all the data in the partition
 (including test_a.bz2) and I end up with just:
 test_b.bz2

  I recall that older versions of hive would not do this.  How do I handle
 this case?  Is there a safe atomic way to do this?

  Sean
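A front-end guard is one workaround. A sketch, assuming the default warehouse layout under /user/hive/warehouse (the path is an assumption; adjust it to your configuration):

F=test_b.bz2
# warehouse path is an assumption; adjust for your metastore layout
DIR=/user/hive/warehouse/logs/ds=2012-03-19/hr=23
# hadoop fs -test -e exits 0 when the path exists
if hadoop fs -test -e "$DIR/$F"; then
  echo "refusing to load: $F already exists in $DIR" >&2
  exit 1
fi
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '$F' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"

This is not atomic (a race is possible between the check and the load), which is why something built into Hive, as discussed later in the thread, would be preferable.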










Re: LOAD DATA problem

2012-03-20 Thread Gabi D
Hi Vikas,
we are facing the same problem that Sean reported, and we have also noticed
that this behavior changed with a newer version of Hive. Previously, when
you inserted a file with the same name into a partition/table, Hive would
fail the request (with yet another of its cryptic messages, an issue in
itself), while now it loads the file and adds the _copy_N suffix.
I have to say that, normally, we do not check for the existence of a file with
the same name in our HDFS directories. Our files arrive with unique names,
and if we try to insert the same file again it is because of some failure
in one of the steps in our flow (e.g., files that were handled and loaded
into Hive were not removed from our work directory for some reason,
so the next run of our load process reloaded them). We do not
want to add a step that checks whether a file with the same name already
exists in HDFS; this is costly and most of the time (hopefully all of it)
unnecessary. What we would like is to get some 'duplicate file' error and to
be able to disregard it, knowing that the file is already safely in its
place.
Note that having duplicate files causes us to double-count rows, which is
unacceptable for many applications.
Moreover, we use gz files, and since this behavior changes the suffix of the
file (from gz to gz_copy_N), when this happens we seem to get all
sorts of strange data, since Hadoop can't recognize that this is a compressed
file and does not decompress it before reading it.
Any help or suggestions on this issue would be much appreciated; we have
been unable to find any so far.


Re: LOAD DATA problem

2012-03-20 Thread Sean McNamara
Gabi-

Glad to know I'm not the only one scratching my head on this one!  The changed 
behavior caught us off guard.

I haven't found a solution in my sleuthing tonight.  Indeed, any help would be 
greatly appreciated on this!

Sean


Re: how is number of mappers determined in mapside join?

2012-03-20 Thread Bruce Bian
Thanks Bejoy! That helps.

On Tue, Mar 20, 2012 at 12:10 AM, Bejoy Ks bejoy...@yahoo.com wrote:

 Hi Bruce
   From my understanding, that formula is not for
 CombineFileInputFormat but for the other basic input formats.

 I'd just brief you on CombineFileInputFormat to make things clearer.
   In the default TextInputFormat, every HDFS block is processed by a
 mapper. But if the files are small, say 5 MB, spawning that many mappers
 would be overkill for the job. So here we use CombineFileInputFormat, where
 one mapper processes more than one small file; the minimum data size a mapper
 should process is defined by the min split size, and the maximum data that a
 mapper can process is defined by the max split size. That is, the data
 processed by a mapper is guaranteed to be not less than the min split size
 and not more than the max split size specified.

 As you asked, if you are looking for more mappers with
 CombineFileInputFormat, then reduce the max split size. Bump it
 down to 32 MB (your block size) and just try it out. Or, if you are looking
 for num mappers = num blocks, just change the input format in Hive.

 By the way, 32 MB is too small for an HDFS block size; you may hit NameNode
 memory issues pretty soon. Consider increasing it to at least 64 MB, though
 most larger clusters use either 128 or 256 MB blocks.

 Hope it helps!..

 Regards
 Bejoy

   --
 From: Bruce Bian weidong@gmail.com
 To: user@hive.apache.org; Bejoy Ks bejoy...@yahoo.com
 Sent: Monday, March 19, 2012 7:48 PM
 Subject: Re: how is number of mappers determined in mapside join?

 Hi Bejoy,
 Thanks for your reply.
 The formula is from the book Hadoop: The Definitive Guide, 2nd edition. On
 page 203 there is:
 The split size is calculated by the formula (see the computeSplitSize()
 method in FileInputFormat): max(minimumSize, min(maximumSize, blockSize))
 by default: minimumSize < blockSize < maximumSize, so the split size is
 blockSize.

 And I've actually used the HDFS block size to control the number of
 mappers launched before.
 As to your response, do you mean that any data size between 1 B and 256 MB
 is OK for a mapper to process?
 Then the only way I can think of to increase the number of mappers is to
 reduce the max split size.

 Regards,
 Bruce

 On Mon, Mar 19, 2012 at 8:48 PM, Bejoy Ks bejoy...@yahoo.com wrote:

 Hi Bruce
   In a map-side join the smaller table is loaded into memory, and hence the
 number of mappers depends only on the data in the larger table. Say
 CombineHiveInputFormat is used and we have our HDFS block size as 32 MB,
 min split size as 1 B, and max split size as 256 MB. This means one mapper
 would be processing data chunks not less than 1 B and not more than 256 MB.
 Mappers would be triggered on that basis, so one possibility in your case is:
 mapper 1 - 200 MB
 mapper 2 - 120 MB
 mapper 3 - 140 MB
 Every mapper processes data whose size is between 1 B and 256 MB, for a
 total of 460 MB, your table size.

 I'm not sure of the formula you posted here. Can you point me to the
 document from which you got it?

 Regards
 Bejoy

   --
 From: Bruce Bian weidong@gmail.com
 To: user@hive.apache.org
 Sent: Monday, March 19, 2012 2:42 PM
 Subject: how is number of mappers determined in mapside join?

 Hi there,
 when I'm executing the following queries in hive

 set hive.auto.convert.join = true;
 CREATE TABLE IDAP_ROOT as
 SELECT a.*,b.acnt_no
 FROM idap_pi_root a LEFT OUTER JOIN idap_pi_root_acnt b ON
 a.acnt_id=b.acnt_id

 the number of mappers run in the map-side join is 3. How is that
 determined? When launching a job in Hadoop MapReduce, I know it's
 determined by the formula
 max(min split size, min(max split size, HDFS block size)), which in my
 configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables are
 460MB and 1.5MB respectively.
 Thus I expected around 15 mappers to launch, which is not the
 case.

 Thanks
 Bruce
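A sketch of the settings Bejoy describes, as they would look in a Hive session (property names follow Hadoop 0.20-era naming; the values are illustrative):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=1;
-- 32 MB max split (the block size): roughly 460 MB / 32 MB ≈ 15 mappers
set mapred.max.split.size=33554432;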








HIVE mappers eat a lot of RAM

2012-03-20 Thread Alexander Ershov
Hiya,

I'm using Hive 0.7.1 with
1) a moderate 50GB table, let's call it `temp_view`
2) the query: select max(length(get_json_object(json, '$.user_id'))) from
temp_view. From my point of view this query is a total joke, nothing
serious.

Query runs just fine, everyone's happy.

But I have massive memory consumption at the map phase: 7 active mappers
eating 500 Mb of RAM each.

This is really bad stuff: it means real mappers on real queries will
throw OutOfMemory exceptions (and they actually do throw them).

Does anyone have any ideas about what I'm doing wrong? Because I have zero.


Re: HIVE mappers eat a lot of RAM

2012-03-20 Thread Bejoy Ks
Hi Alex
      In good clusters the child task JVM size is 1.5 or 2 GB (or at
least 1 GB). IMHO, 500 MB for a task is pretty normal memory consumption.
Now, for 50 GB of data you have just 7 mappers; you need to increase the number
of mappers for better parallelism.

Regards
Bejoy



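A sketch of the knobs involved, using Hadoop 0.20-era property names (values are illustrative; tune them to your cluster):

-- heap for each task JVM
set mapred.child.java.opts=-Xmx1024m;
-- with CombineHiveInputFormat, a smaller max split means more mappers
set mapred.max.split.size=134217728;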

Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
By now you have all realized that the load-file semantics have
changed. I cannot find the exact issue, but here is a related change:


   * [HIVE-306] - Support INSERT [INTO] destination

I do not see a way out of this without code. Maybe you could code up a
Hive query hook for this.

It definitely makes a good point that appending _copy_N after the gz
is bad, since that will confuse the text input format, which relies on the
extension to choose the decompressor. I will open an issue on that.
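The extension dependence is easy to see from the shell, since hadoop fs -text picks its decompressor in the same extension-driven way (a sketch; the paths are hypothetical):

hadoop fs -text /logs/part-00000.gz         # suffix recognized: decompressed text comes back
hadoop fs -text /logs/part-00000.gz_copy_1  # suffix not recognized: raw gzip bytes come back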

Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
The copy_n should have been fixed in 0.8.0

https://issues.apache.org/jira/browse/HIVE-2296

Optimization on bucketized/sorted tables

2012-03-20 Thread mdefoinplatel.ext
Hi folks,
I have several questions about optimization in Hive, mainly related to
bucketized/sorted tables.

Let's say I have a table T bucketized on user_id and sorted by user_id, time.

CREATE TABLE T
( user_id BIGINT,
  time INT
)
CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS;


In general, I wonder which of the following operations will benefit from
the fact that T is bucketized and sorted.


1)  Group by
SELECT user_id, count(time) FROM T GROUP BY user_id;


2)  Distribute by
SELECT user_id, time FROM T DISTRIBUTE BY user_id;


3)  Distribute by, Sort by
SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time;



4)  Insert into a bucketized/sorted table

CREATE TABLE T2
( user_id BIGINT,
  time INT
)
CLUSTERED BY(user_id) SORTED BY(user_id, time) INTO 64 BUCKETS;

set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE T2 SELECT T.user_id, T.time FROM T;


Finally, on a slightly more specific topic...

Let's say I want to perform 'sessionization' on the table T, and I am
planning to call a Python script to do that job.
To get a valid answer I must ensure that the data are sorted by user_id, time
and that all the data for a given user_id are processed by a single call to my
script.

I am planning to run the following query:
FROM (SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time) s
SELECT TRANSFORM (s.user_id, s.time) USING 'python session.py' AS user_id,
avg_session, nb_session;

So I wonder, first, if this is the correct approach, and second, whether the
'DISTRIBUTE BY user_id SORT BY user_id, time' clauses are required given that
T is already bucketized and sorted on the right columns.

Many thanks in advance for your help,
Michael
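One detail worth spelling out for the sessionization part: TRANSFORM feeds the script tab-delimited rows on stdin, and under DISTRIBUTE BY user_id SORT BY user_id, time all rows for a given user_id arrive contiguously and time-ordered, so session.py only has to detect the user_id changing between lines. Note also that CLUSTER BY user_id is shorthand for DISTRIBUTE BY user_id SORT BY user_id only; the secondary sort on time still needs the explicit form:

-- shorthand: distributes and sorts by user_id only
SELECT user_id, time FROM T CLUSTER BY user_id;
-- explicit form, needed here for the secondary sort on time
SELECT user_id, time FROM T DISTRIBUTE BY user_id SORT BY user_id, time;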


_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.



Re: LOAD DATA problem

2012-03-20 Thread Gabi D
Hi Edward,
thanks for looking into this.
what fix HIVE-2296 does is not so good: it messes with my filename, so it
would be better to concatenate it as filename.copy_n.gz (rather than
filename_copy_n.gz); but that request might be considered petty...
Still, what I think Sean is asking for, as am I, is the option to
tell Hive to reject duplicate files altogether (preferably returning an error
code). It could be an addition to the syntax or a Hive setup
parameter; it doesn't really matter.
I will also look into Hive query hooks as you suggested.

On Tue, Mar 20, 2012 at 3:05 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 The copy_n should have been fixed in 0.8.0

 https://issues.apache.org/jira/browse/HIVE-2296

Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
The syntax would be 'LOAD DATA [IF NOT EXISTS] INPATH'. That is a good suggestion.

In hindsight, it would have been better to add new syntax for the file-renaming
feature rather than changing the current behaviour. Although the
change of behaviour sucks for you (and I am sorry about that), I
believe the new behaviour is the better default.

Either you need a 'hack' on the front end before you load the file, or
a 'hack' on the back end to catch the exception after the conflict, or
you have to expand Hive's syntax to support both (also unattractive
for a couple of reasons).

Our Hive 'workflows' are laced with a good amount of Groovy. We have
contemplated just going crazy and writing some domain-specific
language and teaching it to Hive, but we just hacked up some Groovy and
went on with our stuff.
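For the front-end route inside Hive itself, a pre-execution hook can be wired in through configuration. A sketch (the class name is hypothetical; it would implement Hive's pre-execute hook interface and fail the query when the target file already exists):

-- the class below is hypothetical
set hive.exec.pre.hooks=com.example.hooks.RejectDuplicateLoadHook;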


On Tue, Mar 20, 2012 at 1:04 PM, Sean McNamara
sean.mcnam...@webtrends.com wrote:
 Still, what I think Sean is asking for, as am I, is the option to
 tell Hive to reject duplicate files altogether

 Exactly this.


 I would expect the default behavior of LOAD DATA LOCAL INPATH to either:

 Throw an error if the file already exists in hive/hdfs and return an exit
 code (what it used to do)
 Re-copy over the existing file (less preferable, but it would be nice if
 there were a flag to do this)


 For now, as a hack, I first check whether the file already exists in HDFS
 before I load the data. Something built-in and atomic would be ideal.

 Sean


RE: How to get job names and stages of a query?

2012-03-20 Thread Steven Wong
The Hive history file contains the job ID and other job run-time info. I'm not
sure whether there's an API on top of it or not.

From: Felix.徐 [mailto:ygnhz...@gmail.com]
Sent: Tuesday, March 20, 2012 12:14 AM
To: user@hive.apache.org; manishbh...@rocketmail.com
Subject: Re: How to get job names and stages of a query?

I actually want to get the job names of the stages through an API.
On Mar 20, 2012 at 2:23 PM, Manish Bhoge manishbh...@rocketmail.com wrote:
Whenever you submit a SQL query, a job ID gets generated. You can open the job tracker at
localhost:50030/jobtracker.jsp
It shows the jobs that are running and the rest of the details.
Thanks,
Manish
Sent from my BlackBerry, pls excuse typo

From: Felix.徐 ygnhz...@gmail.com
Date: Tue, 20 Mar 2012 12:58:53 +0800
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: How to get job names and stages of a query?

Hi all,
I want to track the progress of a query. How can I get the job names, including
the stages, of a query?
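Absent a documented API, one option is to pull the job IDs straight from that history file. A sketch (the location and line format here are assumptions based on 0.7/0.8-era defaults; the directory is controlled by hive.querylog.location):

# one history file per CLI session, under /tmp/<user> by default
ls /tmp/$USER/hive_job_log_*.txt
# task/stage lines carry the MapReduce job id
grep -o 'TASK_HADOOP_ID=[a-z0-9_]*' /tmp/$USER/hive_job_log_*.txt | sort -u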