RE: Load data throws exception, can't figure it out

2012-07-02 Thread Ruben de Vries
Figured it out myself by running with '-hiveconf 
hive.root.logger=INFO,console'; it turned out to be the binlog.
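
For anyone hitting the same thing, the debug invocation is just the logger flag 
added to the original command:

$ hive -hiveconf hive.root.logger=INFO,console -e 'LOAD DATA INPATH 
"/tmp/members_map_2012-06-30.map" OVERWRITE INTO TABLE members_map_full;'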

---
Hey Guys,

I've been playing with Hive for a while now, but somehow I've run into this 
error all of a sudden while setting up my production cluster.

$ hive  -e 'LOAD DATA INPATH "/tmp/members_map_2012-06-30.map" OVERWRITE INTO 
TABLE members_map_full;'
Loading data to table hyves_goldmine.members_map_full
Moved to trash: 
hdfs://hadoop-nn.internal:9000/user/hive/warehouse/hyves_goldmine.db/members_map_full
Failed with exception Clear request failed : DELETE FROM `TABLE_PARAMS` WHERE 
`TBL_ID` = ?
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask

It seems to be some problem with the MySQL metastore, but the privileges seem 
to be fine (I can run that query manually, no problem).

Does anyone have an idea what could be wrong, or what steps I could take to 
find out what is wrong?


Load data throws exception, can't figure it out

2012-07-02 Thread Ruben de Vries
Hey Guys,

I've been playing with Hive for a while now, but somehow I've run into this 
error all of a sudden while setting up my production cluster.

$ hive  -e 'LOAD DATA INPATH "/tmp/members_map_2012-06-30.map" OVERWRITE INTO 
TABLE members_map_full;'
Loading data to table hyves_goldmine.members_map_full
Moved to trash: 
hdfs://hadoop-nn.internal:9000/user/hive/warehouse/hyves_goldmine.db/members_map_full
Failed with exception Clear request failed : DELETE FROM `TABLE_PARAMS` WHERE 
`TBL_ID` = ?
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask

It seems to be some problem with the MySQL metastore, but the privileges seem 
to be fine (I can run that query manually, no problem).

Does anyone have an idea what could be wrong, or what steps I could take to 
find out what is wrong?


RE: loading data in an array within a map

2012-06-28 Thread Ruben de Vries
Sorry, can't help you with your specific problem, but in case you're really 
stuck:
I used the JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde, this one is 
better than the default one) and it converts nested arrays into maps perfectly. 
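
A rough sketch of what that looks like for the table below (the jar path is an 
assumption; the SerDe class name is the one that library ships with):

ADD JAR /path/to/json-serde-1.1-jar-with-dependencies.jar;

CREATE TABLE user_profiles_json (
  userid STRING,
  friends ARRAY<STRING>,
  properties MAP<STRING, ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- each input line is then one JSON object, e.g.:
-- {"userid":"u1","friends":["f1","f2"],"properties":{"p1":["a","b"]}}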


From: Bhaskar, Snehalata [mailto:snehalata_bhas...@syntelinc.com] 
Sent: Thursday, June 28, 2012 1:50 PM
To: user@hive.apache.org
Subject: loading data in an array within a map

Hi all,

I have created a table which has an array within a map. I am using the following query.

hive> create table user_profiles
    > (
    > userid string,
    > friends array<string>,
    > properties map<string, array<string>>
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS 
TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
    > LOCATION '/user/sb25634/user_profiles_table';
OK
Time taken: 0.794 seconds

But I am not able to load data properly into the array inside the map. I am 
facing problems with the delimiters used in the file for data loading. Is there 
any way I can indicate the delimiters for the array elements?
Please help me.


Thanks and regards,
Snehalata Deorukhkar
Nortel No:0229-5814



RE: Hive-0.8.1 PHP Thrift client broken?

2012-06-18 Thread Ruben de Vries
Going to bump this one since I hope to be able to contribute some (worth a bump 
:P)

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl] 
Sent: Friday, June 15, 2012 11:59 AM
To: user@hive.apache.org
Subject: Hive-0.8.1 PHP Thrift client broken?

Hey Guys,

I've been slamming my head into a wall before on this issue, but now that I'm a 
bit more familiar with Hive and Thrift (I got the python version working) I 
figured I should try fixing the problem or find out more about it to contribute 
some to the project too :)

The PHP Thrift client lib which comes with the hive-0.8.1 package is completely 
f*cked.
It has a couple of issues so far:
 - the php files in the packages directory are 1 level too deep (e.g., 
/packages/metastore/metastore/*.php).
 - there's a constant called 'GLOBAL', but that's a reserved keyword ...
 - it crashes with an error (and patch) described here: 
https://issues.apache.org/jira/browse/THRIFT-347

Now I figured I could just check out all the code and see how the PHP code is 
generated from the Thrift code ...
But the 'share/fb303/if/fb303.thrift' file is missing!
I found some information on this by googling, but everything was so outdated 
that it didn't work!

Could someone start by helping me with this missing dependency so I can 
generate the thrift libs, and then maybe I can see if I can fix some of the 
other issues myself :-) ?






Hive-0.8.1 PHP Thrift client broken?

2012-06-15 Thread Ruben de Vries
Hey Guys,

I've been slamming my head into a wall before on this issue, but now that I'm a 
bit more familiar with Hive and Thrift (I got the python version working) I 
figured I should try fixing the problem or find out more about it to contribute 
some to the project too :)

The PHP Thrift client lib which comes with the hive-0.8.1 package is completely 
f*cked.
It has a couple of issues so far:
 - the php files in the packages directory are 1 level too deep (e.g., 
/packages/metastore/metastore/*.php).
 - there's a constant called 'GLOBAL', but that's a reserved keyword ...
 - it crashes with an error (and patch) described here: 
https://issues.apache.org/jira/browse/THRIFT-347

Now I figured I could just check out all the code and see how the PHP code is 
generated from the Thrift code ...
But the 'share/fb303/if/fb303.thrift' file is missing!
I found some information on this by googling, but everything was so outdated 
that it didn't work!

Could someone start by helping me with this missing dependency so I can 
generate the thrift libs, and then maybe I can see if I can fix some of the 
other issues myself :-) ?






RE: Hive scratch dir not cleaning up

2012-06-01 Thread Ruben de Vries
So I should write a job which cleans up results that are 1 month old, or 
something like that?

From: Vinod Singh [mailto:vi...@vinodsingh.com] 
Sent: Friday, June 01, 2012 10:35 AM
To: user@hive.apache.org
Subject: Re: Hive scratch dir not cleaning up

Hive deletes job contents from the scratch directory on completion of the job. 
However, failed / killed jobs leave data there, which needs to be removed 
manually.

Thanks,
Vinod

http://blog.vinodsingh.com/
On Fri, Jun 1, 2012 at 1:58 PM, Ruben de Vries  wrote:
Hey Hivers,
 
I'm almost ready to replace our old hadoop implementation with an 
implementation using Hive.

Now I've run into (hopefully) my last problem; my /tmp/hive-hduser dir is 
getting kinda big!
It doesn't seem to clean up these tmp files. Googling for it, I ran into some 
tickets about a cleanup setting; should I enable it with the setting below?
Why doesn't it do that by default? Am I the only one somehow racking up a lot 
of space with tmp files?

<property>
  <name>hive.start.cleanup.scratchdir</name>
  <value>true</value>
</property>



Hive scratch dir not cleaning up

2012-06-01 Thread Ruben de Vries
Hey Hivers,

I'm almost ready to replace our old hadoop implementation with an 
implementation using Hive.

Now I've run into (hopefully) my last problem; my /tmp/hive-hduser dir is 
getting kinda big!
It doesn't seem to clean up these tmp files. Googling for it, I ran into some 
tickets about a cleanup setting; should I enable it with the setting below?
Why doesn't it do that by default? Am I the only one somehow racking up a lot 
of space with tmp files?

<property>
  <name>hive.start.cleanup.scratchdir</name>
  <value>true</value>
</property>


RE: table design and performance questions

2012-05-29 Thread Ruben de Vries
Partitioning can greatly increase performance for WHERE clauses, since Hive can 
skip parsing the data in the partitions which do not meet the requirement.
For example, if you partition by date (I do it by an INT dateint, which I set 
to YYYYMMDD) and you do WHERE dateint >= 20120101, then it won't even have to 
touch any of the data from before 2012-01-01; in my case that means I don't 
parse the last 2 years of data, reducing the time the query takes by about 
70% :-)
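
A minimal sketch of that setup (table and column names are made up for 
illustration):

CREATE TABLE visit_stats_daily (
  memberId INT,
  visits INT
)
PARTITIONED BY (dateint INT);  -- one partition per day, dateint as YYYYMMDD

-- only partitions with dateint >= 20120101 get scanned, the rest are pruned
SELECT dateint, COUNT(DISTINCT memberId)
FROM visit_stats_daily
WHERE dateint >= 20120101
GROUP BY dateint;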

Buckets are the second awesome way of getting a big optimization in, 
specifically for joins! If you have 2 tables you're joining onto each other 
and they're both bucketed on their join column, it will also greatly increase 
speed.
Another good join optimization is MAPJOIN: if one of the tables you're joining 
is rather small (below 30mb), then you can force it to MAPJOIN or you can 
enable automatic mapjoin. I personally prefer explicit behavior instead of 
automagic, so I use a hint:
SELECT /*+ MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, 
etc.
SORTED BY is for sorting within buckets; it's only relevant if you're doing a 
lot of ordering, I think.

I'm assuming sequencefiles are faster, but I wouldn't really know :( we need 
someone else to tell us more about that ;)


-Original Message-
From: Avdeev V. M. [mailto:ls...@list.ru] 
Sent: Monday, May 28, 2012 7:17 AM
To: user@hive.apache.org
Subject: table design and performance questions

Question from novice.

Where can I read about table design best practices? I have a measure table with 
millions of rows and many dimension tables with fewer than 1000 rows each. I 
can't find the way to get an optimal design for both kinds of tables. Are there 
performance tuning guides or a performance FAQ?

Specifically
1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases does 
using these statements make sense?
2) DDL language manual says 'This can improve performance on certain kinds of 
queries.' about CLUSTERED BY statement. What kind of queries can be improved?
3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
performance? What aspects should be taken into account when choosing a file 
format?
4) The Compressed Storage article says 'Keeping data compressed in Hive tables 
has, in some cases, known to give better performance than uncompressed 
storage;' and again - what are these cases? 

Thanks!
Vyacheslav


RE: Job Scheduling in Hadoop-Hive

2012-05-29 Thread Ruben de Vries
Hey, 

We use hadoop/hive for processing our access logs and we run a daily cronjob 
(python script) which does the parsing jobs and some partitioning etc.
The results from those jobs are then queried on by other jobs which generate 
the results the management team wants to see :-)


From: Ronak Bhatt [mailto:ronakb...@gmail.com] 
Sent: Saturday, May 26, 2012 4:48 PM
To: hive-u...@hadoop.apache.org
Subject: Job Scheduling in Hadoop-Hive

Hello - 

For those users whose setup is somewhat production, what do you use for job 
scheduling and dependency management? 

thanks, ronak





JOIN + LATERAL VIEW works, but + MAPJOIN and no longer get any results

2012-05-22 Thread Ruben de Vries
Okay, first off: I know JOIN + LATERAL VIEW together isn't working, so I moved 
my JOIN into a subquery and that makes the query work properly.

However, when I added a MAPJOIN hint for the JOIN in the subquery it also 
stopped doing the reducer for the main query!
This only happens when there's a LATERAL VIEW in there though; if I remove the 
LATERAL VIEW then the main query still gets a reducer to do the grouping.
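
A minimal sketch of the query shape I mean (column and table names follow my 
other mails; treat it as an illustration rather than the exact query from the 
gist):

SELECT t.date_int, t.gender, part_key, COUNT(t.memberId)
FROM (
    SELECT /*+ MAPJOIN(members_map) */  -- adding this hint is what kills the main query's reducer
        v.date_int, v.memberId, v.parts, m.gender
    FROM visit_stats v
    JOIN members_map m ON (m.member_id = v.memberId)
) t
LATERAL VIEW explode(t.parts) p AS part_key, part_value
GROUP BY t.date_int, t.gender, part_key;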

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP 
script which you can run to execute the test case I'm using, which does:
* set up a database called hive_mapjoin
* set up tables
* load some test data
* do the selects

You'll need the json-serde-1.1-jar-with-dependencies.jar from 
https://github.com/rcongiu/Hive-JSON-Serde/downloads with it though, and you'll 
need to change the path.
I guess looking at the queries you guys can probably figure out a better 
testcase, but maybe it's helpful.

Not sure if this is a bug or me doing something that just isn't supposed to be 
working, but I can't seem to find any pointers that this wouldn't be 
supported...

Here's another gist with the plan.xml: https://gist.github.com/2499658

I've also created a ticket in JIRA but it doesn't seem to get any attention at 
all: https://issues.apache.org/jira/browse/HIVE-2992


Greetz, Ruben de Vries



RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!

2012-05-01 Thread Ruben de Vries
I really do feel like this isn't as intended, should I make a ticket in JIRA?

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl] 
Sent: Thursday, April 26, 2012 3:37 PM
To: user@hive.apache.org
Subject: RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!

https://gist.github.com/2499658

And this is the plan.xml it's using.

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl] 
Sent: Thursday, April 26, 2012 3:17 PM
To: user@hive.apache.org
Subject: JOIN + LATERAL VIEW + MAPJOIN = no output?!

Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my 
JOIN into a subquery and that makes the query work properly.

However, when I added a MAPJOIN hint for the JOIN in the subquery it also 
stopped doing the reducer for the main query!
This only happens when there's a LATERAL VIEW in there though; if I remove the 
LATERAL VIEW then the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP 
script which you can run to execute the test case:
 - set up a database called hive_mapjoin
 - set up tables
 - load some test data
 - do the selects
You'll need the https://github.com/rcongiu/Hive-JSON-Serde/downloads 
json-serde-1.1-jar-with-dependencies.jar with it though and change the path ;)

Not sure if this is a bug or me doing something that just isn’t supposed to be 
working,

Thanks Ruben


RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!

2012-04-26 Thread Ruben de Vries
https://gist.github.com/2499658

And this is the plan.xml it's using.

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl] 
Sent: Thursday, April 26, 2012 3:17 PM
To: user@hive.apache.org
Subject: JOIN + LATERAL VIEW + MAPJOIN = no output?!

Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my 
JOIN into a subquery and that makes the query work properly.

However, when I added a MAPJOIN hint for the JOIN in the subquery it also 
stopped doing the reducer for the main query!
This only happens when there's a LATERAL VIEW in there though; if I remove the 
LATERAL VIEW then the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP 
script which you can run to execute the test case:
 - set up a database called hive_mapjoin
 - set up tables
 - load some test data
 - do the selects
You'll need the https://github.com/rcongiu/Hive-JSON-Serde/downloads 
json-serde-1.1-jar-with-dependencies.jar with it though and change the path ;)

Not sure if this is a bug or me doing something that just isn’t supposed to be 
working,

Thanks Ruben


JOIN + LATERAL VIEW + MAPJOIN = no output?!

2012-04-26 Thread Ruben de Vries
Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my 
JOIN into a subquery and that makes the query work properly.

However, when I added a MAPJOIN hint for the JOIN in the subquery it also 
stopped doing the reducer for the main query!
This only happens when there's a LATERAL VIEW in there though; if I remove the 
LATERAL VIEW then the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436
containing the queries and a PHP script which you can run to execute the test 
case:
 - set up a database called hive_mapjoin
 - set up tables
 - load some test data
 - do the selects
You'll need the https://github.com/rcongiu/Hive-JSON-Serde/downloads 
json-serde-1.1-jar-with-dependencies.jar with it though and change the path ;)

Not sure if this is a bug or me doing something that just isn’t supposed to be 
working,

Thanks Ruben


RE: When/how to use partitions and buckets usefully?

2012-04-26 Thread Ruben de Vries
We're atm sampling our access logs 1:250 by doing a modulo of 250 on the 
memberID.
I applied the same logic to the members_map, reducing its raw size to just 
0.9mb.
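
A sketch of that sampling step (hypothetical table layout; the real job may 
differ):

-- keep roughly 1 in 250 members, consistently across tables
-- (assumes a members_map_sampled table with the same schema already exists)
INSERT OVERWRITE TABLE members_map_sampled
SELECT member_id, gender, birthday
FROM members_map
WHERE member_id % 250 = 0;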

Now when I run the query with the MAPJOIN hint it goes:
2012-04-26 10:51:32 Starting to launch local task to process map join;  
maximum memory = 1864171520
2012-04-26 10:51:33 Processing rows: 46490   Hashtable size: 46490   Memory usage: 11142520   rate: 0.006
- done -

So it uses 10mb of memory to load the 1mb raw-size members_map; this matches 
the ratio I previously had when loading the full table and crashing 
(250mb required 2.6gb).
I'm not sure why it requires so much more memory than the raw filesize, but 
it's definitely the main problem here.
Now it's just the question: is this a bug? Or is it explainable and acceptable?

FYI my most complex query dropped from 350sec to 110sec when being able to 
MAPJOIN(); gotta love that speed if it works!


-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Thursday, April 26, 2012 9:16 AM
To: user@hive.apache.org; gemini5201...@gmail.com; mgro...@oanda.com
Subject: RE: When/how to use partitions and buckets usefully?

The table isn't compressed; the uncompressed size really is 300ish mb.
Also, I haven't set mapred.child.java.opts, but I think that if it crashes at 
1.5gb the default value seems high enough?

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Thursday, April 26, 2012 5:37 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?
On 26 April 2012 at 08:59, Mark Grover wrote:
I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into mapper's memory, the data is loaded in 
its uncompressed form, right?

If so, is it possible at all in this case that the compressed size of smaller 
table is less than the memory available but the uncompressed size isn't?

Mark

Mark Grover, Business Intelligence Analyst OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: "Ruben de Vries" 
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400mb, but it didn't work; that 
setting is only used when it's trying to automatically figure out if it should 
be doing a mapjoin, while I'm forcing it to do a mapjoin with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

there should be documented in wiki on LanguageManual+Joins .
On 25 April 2012 at 15:36, gemini alex wrote:
It seems you use the default hive configuration; the default map join will 
have only 25M for the small table. Copy your hive-default.xml to hive-site.xml 
and set hive.mapjoin.smalltable.filesize=3
On 25 April 2012 at 00:09, Ruben de Vries wrote:

I got the (rather big) log here in a github gist: 
https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist.

When loading the members_map (11mil records, 320mb, 30b per record), it seems 
to take about 198b per record in the members_map, resulting in crashing around 
7mil records with 1.4gb loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) 
where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2 or 3
 - member_id is an int, with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string 
birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the 
same effect. Some of the blogs I read talk about a 25mb small table, not 300mb 
like mine ...

Can anyone make anything out of this?
I'd rather go with this if at all possible; otherwise I have to go the hard way 
and migrate all the visit_stats into buckets so they can match the members_map 
on that.

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
 The operation you are seeing in your log is the preparation of the hash 
table of the smaller table. This hash table file is compressed and loaded into 
the Distributed Cache, and from there it is used for map side joins. From your 
console log the hash table size/data size has gone to nearly 1.5 GB; the data 
is too large to be loaded into the memory of the hive client.

2012-04-24 10:31:02 Processing rows:  

RE: When/how to use partitions and buckets usefully?

2012-04-26 Thread Ruben de Vries
The table isn't compressed; the uncompressed size really is 300ish mb.
Also, I haven't set mapred.child.java.opts, but I think that if it crashes at 
1.5gb the default value seems high enough?

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Thursday, April 26, 2012 5:37 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?
On 26 April 2012 at 08:59, Mark Grover wrote:
I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into mapper's memory, the data is loaded in 
its uncompressed form, right?

If so, is it possible at all in this case that the compressed size of smaller 
table is less than the memory available but the uncompressed size isn't?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: "Ruben de Vries" 
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400mb, but it didn't work; that 
setting is only used when it's trying to automatically figure out if it should 
be doing a mapjoin, while I'm forcing it to do a mapjoin with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

there should be documented in wiki on LanguageManual+Joins .
On 25 April 2012 at 15:36, gemini alex wrote:
It seems you use the default hive configuration; the default map join will 
have only 25M for the small table. Copy your hive-default.xml to hive-site.xml 
and set hive.mapjoin.smalltable.filesize=3
On 25 April 2012 at 00:09, Ruben de Vries wrote:

I got the (rather big) log here in a github gist: 
https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist.

When loading the members_map (11mil records, 320mb, 30b per record), it seems 
to take about 198b per record in the members_map, resulting in crashing around 
7mil records with 1.4gb loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) 
where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2 or 3
 - member_id is an int, with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string 
birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the 
same effect.
Some of the blogs I read talk about a 25mb small table, not 300mb like mine ...

Can anyone make anything out of this?
I'd rather go with this if at all possible;
otherwise I have to go the hard way and migrate all the visit_stats into 
buckets so they can match the members_map on that.

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
 The operation you are seeing in your log is the preparation of the hash 
table of the smaller table. This hash table file is compressed and loaded into 
the Distributed Cache, and from there it is used for map side joins. From your 
console log the hash table size/data size has gone to nearly 1.5 GB; the data 
is too large to be loaded into the memory of the hive client.

2012-04-24 10:31:02 Processing rows: 700   Hashtable size: 699   Memory usage: 1,468,378,760   rate: 0.788


Can you enable debug logging and post the console output, to get a better 
picture of why it consumes this much memory?
Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console;


Regards
Bejoy KS




From: Ruben de Vries 
To: "user@hive.apache.org" 
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
   1    1      247231757 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
442  441 1091837835 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error 
(my JSONSerde).


From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: 

RE: When/how to use partitions and buckets usefully?

2012-04-25 Thread Ruben de Vries
I already tried running with that set to 400mb, but it didn't work; that 
setting is only used when it's trying to automatically figure out if it should 
be doing a mapjoin, while I'm forcing it to do a mapjoin with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

there should be documented in wiki on LanguageManual+Joins .
On 25 April 2012 at 15:36, gemini alex wrote:
It seems you use the default hive configuration; the default map join will 
have only 25M for the small table. Copy your hive-default.xml to hive-site.xml 
and set hive.mapjoin.smalltable.filesize=3
On 25 April 2012 at 00:09, Ruben de Vries wrote:

I got the (rather big) log here in a github gist: 
https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist.

When loading the members_map (11mil records, 320mb, 30b per record), it seems 
to take about 198b per record in the members_map, resulting in crashing around 
7mil records with 1.4gb loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) 
where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2 or 3
 - member_id is an int, with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string 
birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the 
same effect.
Some of the blogs I read talk about a 25mb small table, not 300mb like mine ...

Can anyone make anything out of this?
I'd rather go with this if at all possible;
otherwise I have to go the hard way and migrate all the visit_stats into 
buckets so they can match the members_map on that.

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
 The operation you are seeing in your log is the preparation of the hash 
table of the smaller table. This hash table file is compressed and loaded into 
the Distributed Cache, and from there it is used for map side joins. From your 
console log the hash table size/data size has gone to nearly 1.5 GB; the data 
is too large to be loaded into the memory of the hive client.

2012-04-24 10:31:02 Processing rows: 700   Hashtable size: 699   Memory usage: 1,468,378,760   rate: 0.788


Can you enable debug logging and post the console output, to get a better 
picture of why it consumes this much memory?
Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console;


Regards
Bejoy KS



________
From: Ruben de Vries 
To: "user@hive.apache.org" 
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
   1    1      247231757 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
 442  441 1091837835 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error 
(my JSONSerde).


From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the hive client itself before starting a map, 
so splitting to mappers is out of the question.

Can you do a dfs count for the members_map table hdfs location and tell us the 
result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries  wrote:
Hmm I must be doing something wrong,  the members_map table is 300ish MB.
When I execute the following query:

SELECT
   /*+ MAPJOIN(members_map) */
   date_int,
   members_map.gender AS gender,
   'generic',
   COUNT( memberId ) AS unique,
   SUM( `generic`['count'] ) AS count,
   SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59 Starting to launch local task to process map join;  maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20    Hashtable size: 19    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54 Processing rows: 690   Hashtable size: 689   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02 Processing rows: 700   Hashtable size: 699   Memory usage: 1468378760   rate:   

RE: subquery + lateral view fails without count

2012-04-25 Thread Ruben de Vries
Awh poor pastie :(

Here's a gist: https://gist.github.com/2487666 
with the working and crashing query, create statements and the error attached 
too

-Original Message-
From: Mark Grover [mailto:mgro...@oanda.com] 
Sent: Wednesday, April 25, 2012 1:40 AM
To: user@hive.apache.org
Subject: Re: subquery + lateral view fails without count

Hi Ruben,
Looks like pastie is down (http://pastie.org/) because of recent DDOS attacks. 
Can you please post your queries elsewhere?

Mark

Mark Grover, Business Intelligence Analyst OANDA Corporation 

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


- Original Message -
From: "Ruben de Vries" 
To: user@hive.apache.org
Sent: Monday, April 23, 2012 9:08:16 AM
Subject: subquery + lateral view fails without count




It's a bit of a weird case, but I thought I might share it and hopefully find 
someone who can confirm this to be a bug or tell me I should do things 
differently!

Here you can find a pastie with the full create and select queries: 
http://pastie.org/3838924

I've got two tables:
`visit_stats` with cols date_int, memberId and parts
`member_map` with cols member_id, gender

I use LATERAL VIEW explode(parts) to explode the parts MAP and group on those, 
and I want to JOIN the member_map onto the visit_stats to group by gender.
Since LATERAL VIEW and JOIN together aren't supported, I'm doing the join in a 
subquery and then the LATERAL VIEW onto that.

It's working as long as my query contains a COUNT( memberId ), but if I remove 
that from the result it will crash with the error below:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:431)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:157)
    ... 14 more
Caused by: java.lang.RuntimeException: cannot find field _col1 from [0:_col4, 1:_col6, 2:_col10]
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:345)
    at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:896)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:922)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.initializeOp(JoinOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:150)
    ... 14 more
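
Since the pastie is down, here is a hedged reconstruction of the query shape 
from the description above (column and table names are taken from the prose; 
the real queries may differ):

-- works as long as the COUNT(memberId) aggregate is present;
-- removing it triggered the "cannot find field _col1" error above
SELECT t.date_int, t.gender, part_key, COUNT(t.memberId)
FROM (
    SELECT v.date_int, v.memberId, v.parts, m.gender
    FROM visit_stats v
    JOIN member_map m ON (m.member_id = v.memberId)
) t
LATERAL VIEW explode(t.parts) p AS part_key, part_value
GROUP BY t.date_int, t.gender, part_key;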




RE: When/how to use partitions and buckets usefully?

2012-04-24 Thread Ruben de Vries
I got the (rather big) log here in a github gist: 
https://gist.github.com/2480893 
And I also attached the plan.xml it was using to the gist. 

When loading the members_map (11mil records, 320mb, 30b per record), it seems 
to take about 198b per record in the members_map, resulting in crashing around 
7mil records with 1.4gb loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) 
where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2 or 3
 - member_id is an int, with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string 
birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the 
same effect.
Some of the blogs I read talk about a 25mb small table, not 300mb like mine ... 

Can anyone make anything out of this?
I'd rather go with this if at all possible;
otherwise I have to go the hard way and migrate all the visit_stats into 
buckets so they can match the members_map on that.

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com] 
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
 The operation you are seeing in your log is the preparation of the hash 
table of the smaller table. This hash table file is compressed and loaded into 
the Distributed Cache, and from there it is used for map side joins. From your 
console log the hash table size/data size has gone to nearly 1.5 GB; the data 
is too large to be loaded into the memory of the hive client.

2012-04-24 10:31:02 Processing rows: 700   Hashtable size: 699   Memory usage: 1,468,378,760   rate: 0.788


Can you enable debug logging and post the console output, to get a better 
picture of why it consumes this much memory?
Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console;


Regards
Bejoy KS



____
From: Ruben de Vries 
To: "user@hive.apache.org"  
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
   1    1      247231757 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
 442  441 1091837835 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error 
(my JSONSerde).


From: Nitin Pawar [mailto:nitinpawar...@gmail.com] 
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the hive client itself before starting a map, 
so splitting to mappers is out of the question.

Can you do a dfs count for the members_map table hdfs location and tell us the 
result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries  wrote:
Hmm I must be doing something wrong,  the members_map table is 300ish MB.
When I execute the following query:

SELECT
   /*+ MAPJOIN(members_map) */
   date_int,
   members_map.gender AS gender,
   'generic',
   COUNT( memberId ) AS unique,
   SUM( `generic`['count'] ) AS count,
   SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59     Starting to launch local task to process map join;      maximum memory = 1864171520
2012-04-24 10:26:00     Processing rows: 20    Hashtable size: 19    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54     Processing rows: 690   Hashtable size: 689   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02     Processing rows: 700   Hashtable size: 699   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space


I'm running it on my local, single-node dev env; could that be a problem since 
it won't split over multiple mappers in this case?


-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
The map join hint is provided to hive using the "MAPJOIN" keyword, as:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key

To use map side joins, some hive configuration properties need to be enabled.

For plain map side joins
hive>SET hive.auto.convert.join=true;
Latest version

FW: When/how to use partitions and buckets usefully?

2012-04-24 Thread Ruben de Vries
Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
   1    1      247231757 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
 442  441 1091837835 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error 
(my JSONSerde).


From: Nitin Pawar [mailto:nitinpawar...@gmail.com] 
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the hive client itself before starting a map, 
so splitting to mappers is out of the question.

Can you do a dfs count for the members_map table hdfs location and tell us the 
result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries  wrote:
Hmm I must be doing something wrong,  the members_map table is 300ish MB.
When I execute the following query:

SELECT
   /*+ MAPJOIN(members_map) */
   date_int,
   members_map.gender AS gender,
   'generic',
   COUNT( memberId ) AS unique,
   SUM( `generic`['count'] ) AS count,
   SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59     Starting to launch local task to process map join;      maximum memory = 1864171520
2012-04-24 10:26:00     Processing rows: 20    Hashtable size: 19    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54     Processing rows: 690   Hashtable size: 689   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02     Processing rows: 700   Hashtable size: 699   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space


I'm running it on my local, single-node dev env; could that be a problem since 
it won't split over multiple mappers in this case?


-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
The map join hint is provided to hive using the "MAPJOIN" keyword, as:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key

To use map side joins, some hive configuration properties need to be enabled.

For plain map side joins
hive>SET hive.auto.convert.join=true;
Latest versions of hive do a map join on the smaller table even if no map 
join hint is provided.

For bucketed map joins
hive>SET hive.optimize.bucketmapjoin=true

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins


Regards
Bejoy



From: Nitin Pawar 
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?

If you are doing a map side join make sure the table members_map is small 
enough to hold in memory

On 4/24/12, Ruben de Vries  wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -Original Message-
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group by's and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to entirely load into
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". Higher x means you are
> being more conservative by having larger number of buckets (and
> bearing the increased overhead), lower x means the reverse. What x to
> use would depend on your use case. This is because the number of bu

RE: When/how to use partitions and buckets usefully?

2012-04-24 Thread Ruben de Vries
Hmm I must be doing something wrong,  the members_map table is 300ish MB.
When I execute the following query:

SELECT
/*+ MAPJOIN(members_map) */
date_int,
members_map.gender AS gender,
'generic',
COUNT( memberId ) AS unique,
SUM( `generic`['count'] ) AS count,
SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59 Starting to launch local task to process map join;  maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20    Hashtable size: 19    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54 Processing rows: 690   Hashtable size: 689   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02 Processing rows: 700   Hashtable size: 699   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space


I'm running it on my local, single-node dev env; could that be a problem since 
it won't split over multiple mappers in this case?


-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com] 
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
The map join hint is provided to hive using the "MAPJOIN" keyword, as:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key

To use map side joins, some hive configuration properties need to be enabled.

For plain map side joins
hive>SET hive.auto.convert.join=true;
Latest versions of hive do a map join on the smaller table even if no map 
join hint is provided.

For bucketed map joins
hive>SET hive.optimize.bucketmapjoin=true

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins


Regards
Bejoy



From: Nitin Pawar 
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?

If you are doing a map side join make sure the table members_map is small 
enough to hold in memory

On 4/24/12, Ruben de Vries  wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -Original Message-
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory, 
> so your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats. 
> Bucketing can optimize joins, group by's and potentially other queries 
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your 
> tables so they can leveraged in multi-table queries. Most people use 
> some power of 2 as their number of buckets. To make the best use of 
> the buckets, each of your buckets should be able to entirely load into 
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) / 
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". Higher x means you are 
> being more conservative by having larger number of buckets (and 
> bearing the increased overhead), lower x means the reverse. What x to 
> use would depend on your use case. This is because the number of buckets in a 
> table is fixed.
> If you have a large partition, it would distribute its data into 
> bulkier buckets and you would want to make sure these bulkier buckets 
> can still fit in memory. Moreover, buckets are generated using a 
> hashing function, if you have a strong bias towards a particular value 
> of bucketing column in your data, some buckets might be bulkier than 
> others. In that case, you'd want to make sure that those bulkier buckets can 
> still fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). A larger standard 
> deviation means you should be more conservative in your calculation and 
> vice-versa.
> * Distribution of the data in the bucketing columns. "Wider" 
> distribution means you should be more conservative and vice-versa.
>
> Long story short, I would say, x of 2 to 4 should suffice in most 
> cases but feel free to verify t

RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
Wow thanks everyone for the nice feedback!

I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?


Cheers, 

Ruben de Vries

-Original Message-
From: Mark Grover [mailto:mgro...@oanda.com] 
Sent: Tuesday, April 24, 2012 3:17 AM
To: user@hive.apache.org; bejoy ks
Cc: Ruben de Vries
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben,
Like Bejoy pointed out, members_map is small enough to fit in memory, so your 
joins with visit_stats would be much faster with map-side join.

However, there is still some virtue in bucketing visit_stats. Bucketing can 
optimize joins, group by's and potentially other queries in certain 
circumstances.
You probably want to keep consistent bucketing columns across all your tables 
so they can leveraged in multi-table queries. Most people use some power of 2 
as their number of buckets. To make the best use of the buckets, each of your 
buckets should be able to entirely load into memory on the node.

I use something close to the formula below to calculate the number of buckets:

#buckets = (x * Average_partition_size) / 
JVM_memory_available_to_your_Hadoop_tasknode

I call x (>1) the "factor of conservatism". Higher x means you are being more 
conservative by having larger number of buckets (and bearing the increased 
overhead), lower x means the reverse. What x to use would depend on your use 
case. This is because the number of buckets in a table is fixed. If you have a 
large partition, it would distribute its data into bulkier buckets and you 
would want to make sure these bulkier buckets can still fit in memory. 
Moreover, buckets are generated using a hashing function, if you have a strong 
bias towards a particular value of bucketing column in your data, some buckets 
might be bulkier than others. In that case, you'd want to make sure that those 
bulkier buckets can still fit in memory.

To summarize, it depends on:
* How the actual partition sizes vary from the average partition size (i.e. the 
standard deviation of your partition size). A larger standard deviation means 
you should be more conservative in your calculation and vice-versa.
* Distribution of the data in the bucketing columns. "Wider" distribution means 
you should be more conservative and vice-versa.

Long story short, I would say, x of 2 to 4 should suffice in most cases but 
feel free to verify that in your case:-) I would love to hear what factors 
others have been using when calculating their number of buckets, BTW!
Whatever answer you get for #buckets from above formula, use the closest power 
of 2 as the number of buckets in your table (I am not sure if this is a must, 
though).
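
As a purely illustrative worked example of that formula: with an average 
partition size of 4 GB, 1 GB of JVM memory available to a task node, and x = 2, 
you get #buckets = (2 * 4096 MB) / 1024 MB = 8, which is conveniently already a 
power of 2.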

Good luck!

Mark

Mark Grover, Business Intelligence Analyst OANDA Corporation 

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


- Original Message -
From: "Bejoy KS" 
To: "Ruben de Vries" , user@hive.apache.org
Sent: Monday, April 23, 2012 12:39:17 PM
Subject: Re: When/how to use partitions and buckets usefully?

If data is in hdfs, then you can bucket it only after loading it into a 
temp/staging table and then into the final bucketed table. Bucketing needs a 
MapReduce job. 


Regards
Bejoy KS 

Sent from handheld, please excuse typos. 

From: Ruben de Vries 
Date: Mon, 23 Apr 2012 18:13:20 +0200
To: user@hive.apache.org; 
bejoy...@yahoo.com
Subject: RE: When/how to use partitions and buckets usefully? 




Thanks for the help so far guys, 



I bucketed the members_map, it’s 330mb in size (11 mil records). 



Can you manually bucket stuff? 

Since my initial mapreduce job is still outside of Hive, I'm doing a LOAD DATA 
to import stuff into the visit_stats tables; replacing that with INSERT 
OVERWRITE SELECT slows it down a lot 
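
For reference, a minimal sketch of bucketing via a staging table (the table 
layout and bucket count here are assumptions; hive.enforce.bucketing makes Hive 
produce one file per bucket on insert):

CREATE TABLE members_map_staging (member_id INT, gender INT, birthday STRING);
LOAD DATA INPATH '/tmp/members_map.map' INTO TABLE members_map_staging;

CREATE TABLE members_map_bucketed (member_id INT, gender INT, birthday STRING)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- without this the insert won't produce properly bucketed output
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE members_map_bucketed
SELECT member_id, gender, birthday FROM members_map_staging;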





From: Bejoy KS [mailto:bejoy...@yahoo.com]
Sent: Monday, April 23, 2012 6:06 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully? 



For a bucketed map join, both tables should be bucketed and the number of 
buckets of one should be a multiple of the other. 


Regards
Bejoy KS 

Sent from handheld, please excuse typos. 




From: "Bejoy KS" < bejoy...@yahoo.com > 


Date: Mon, 23 Apr 2012 16:03:34 + 


To: < user@hive.apache.org > 


ReplyTo: bejoy...@yahoo.com 


Subject: Re: When/how to use partitions and buckets usefully? 




Bucketed map join would be good I guess. What is the total size of the smaller 
table and what is its expected size in the next few years? 

The size should be small enough to be put in the Distributed Cache; then map 
side joins would offer you a big performance improvement. 


Regards
Bejoy KS 

Sent from handheld, please excuse typos. 




From: Ruben de Vries < ruben.devr...@hyves.nl > 


Date: Mon, 23 Apr 2012 17:38:20 +0200 


To: u

RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
Thanks for the help so far guys,

I bucketed the members_map, it's 330mb in size (11 mil records).

Can you manually bucket stuff?
Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA 
to import stuff into the visit_stats tables, replacing that with INSERT 
OVERWRITE SELECT slows it down a lot

From: Bejoy KS [mailto:bejoy...@yahoo.com]
Sent: Monday, April 23, 2012 6:06 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

For a bucketed map join, both tables should be bucketed and the number of 
buckets of one should be a multiple of the other.
Regards
Bejoy KS

Sent from handheld, please excuse typos.

From: "Bejoy KS" <bejoy...@yahoo.com>
Date: Mon, 23 Apr 2012 16:03:34 +
To: <user@hive.apache.org>
ReplyTo: bejoy...@yahoo.com
Subject: Re: When/how to use partitions and buckets usefully?

Bucketed map join would be good I guess. What is the total size of the smaller 
table and what is its expected size in the next few years?

The size should be small enough to be put in the Distributed Cache; then map 
side joins would offer you a big performance improvement.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________
From: Ruben de Vries <ruben.devr...@hyves.nl>
Date: Mon, 23 Apr 2012 17:38:20 +0200
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

Ok, very clear on the partitions, try to make them match the WHERE clauses, not 
so much about group clauses then ;)

The member_map contains 11.636.619 records atm, I think bucketing those would 
be good?
What's a good number to bucket them by then?

And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:matt.tuc...@disney.com]
Sent: Monday, April 23, 2012 5:30 PM
To: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a 
date-based partition scheme will be helpful, as it will trim partitions that 
aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of 
map-side joins, as it will reduce the number of processing stages.  If 
member_map is large, bucketing on member_id will reduce the number of rows from 
visit_stats compared against each member_id during joins.

Matt Tucker
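
A hedged sketch of such a date-partitioned layout (column names echo the 
thread; the types and the partition column are assumptions):

CREATE TABLE visit_stats (
  member_id INT,
  stats     MAP<STRING, INT>
)
PARTITIONED BY (date_int INT);

-- Partition pruning: only partitions matching the predicate are scanned.
SELECT date_int, COUNT(1)
FROM visit_stats
WHERE date_int BETWEEN 20120401 AND 20120423
GROUP BY date_int;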

From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Monday, April 23, 2012 11:19 AM
To: user@hive.apache.org
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to set up and use 
partitions and buckets.
But I'm more interested in how to figure out when, and on which columns, you 
should partition and bucket to increase performance?!

In my case I've got 2 tables: visit_stats (member_id, date and some MAP cols 
which give me info about the visits) and member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that 
partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a 
lot slower; can that be fixed by bucketing both tables, or just one of them?

Maybe some people have written good blogs on this subject, but I can't really 
seem to find them!?

Any help would be appreciated, thanks in advance :)


RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
Ok, very clear on the partitions: try to make them match the WHERE clauses, and 
worry less about the GROUP BY clauses then ;)

The member_map contains 11,636,619 records at the moment; I think bucketing 
those would be good?
What's a good number of buckets to use then?

And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:matt.tuc...@disney.com]
Sent: Monday, April 23, 2012 5:30 PM
To: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a 
date-based partition scheme will be helpful, as it will trim partitions that 
aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of 
map-side joins, as they reduce the number of processing stages.  If member_map 
is large, bucketing on member_id will cut down the number of visit_stats rows 
that have to be compared against each member_id during joins.

Matt Tucker

From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Monday, April 23, 2012 11:19 AM
To: user@hive.apache.org
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to set up and use 
partitions and buckets.
But I'm more interested in how to figure out when, and on which columns, you 
should partition and bucket to increase performance?!

In my case I've got 2 tables: visit_stats (member_id, date and some MAP cols 
which give me info about the visits) and member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that 
partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a 
lot slower; can that be fixed by bucketing both tables, or just one of them?

Maybe some people have written good blogs on this subject, but I can't really 
seem to find them!?

Any help would be appreciated, thanks in advance :)


When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
It seems there's enough information to be found on how to set up and use 
partitions and buckets.
But I'm more interested in how to figure out when, and on which columns, you 
should partition and bucket to increase performance?!

In my case I've got 2 tables: visit_stats (member_id, date and some MAP cols 
which give me info about the visits) and member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that 
partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a 
lot slower; can that be fixed by bucketing both tables, or just one of them?

Maybe some people have written good blogs on this subject, but I can't really 
seem to find them!?

Any help would be appreciated, thanks in advance :)


subquery + lateral view fails without count

2012-04-23 Thread Ruben de Vries
It's a bit of a weird case but I thought I might share it and hopefully find 
someone who can confirm this to be a bug or tell me I should do things 
differently!

Here you can find a pastie with the full create and select queries: 
http://pastie.org/3838924

I've got two tables:
`visit_stats` with cols date_int, memberId and parts
`member_map` with cols member_id, gender

I use LATERAL VIEW explode(parts) to explode the parts MAP and group on those, 
and I want to JOIN the member_map onto the visit_stats to group by gender.
Since LATERAL VIEW and JOIN together aren't supported, I'm doing the join in a 
subquery and then the LATERAL VIEW on top of that.

It works as long as my query contains a COUNT(memberId), but if I remove that 
from the result it crashes with the error below:
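
For reference, a hedged sketch of the query shape described above (column names 
come from the thread; the aliases and exact select list are reconstructions, 
the actual queries are in the pastie):

SELECT sub.gender, part_key, COUNT(sub.memberId)
FROM (
  SELECT v.memberId, v.parts, m.gender
  FROM visit_stats v
  JOIN member_map m ON (v.memberId = m.member_id)
) sub
LATERAL VIEW explode(sub.parts) exploded AS part_key, part_value
GROUP BY sub.gender, part_key;

-- Dropping COUNT(sub.memberId) from the select list is what triggers the
-- "cannot find field _col1" failure shown below.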

java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at 
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:431)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 9 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
at 
org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:157)
... 14 more
Caused by: java.lang.RuntimeException: cannot find field _col1 from [0:_col4, 
1:_col6, 2:_col10]
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:345)
at 
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
at 
org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:896)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:922)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at 
org.apache.hadoop.hive.ql.exec.JoinOperator.initializeOp(JoinOperator.java:60)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at 
org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:150)
... 14 more




RE: using the key from a SequenceFile

2012-04-19 Thread Ruben de Vries
You're a lifesaver!

From: Dilip Joseph [mailto:dilip.antony.jos...@gmail.com]
Sent: Thursday, April 19, 2012 5:47 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

An example input format for using SequenceFile keys in Hive is at 
https://gist.github.com/2421795 .  The code just reverses how the key and value 
are accessed in the standard SequenceFileRecordReader and 
SequenceFileInputFormat that come with Hadoop.

You can use this custom input format by specifying the following when you 
create the table:

STORED AS
INPUTFORMAT 'com.mycompany.SequenceFileKeyInputFormat'
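
For completeness, a hedged sketch of a full definition using it (the columns 
are hypothetical and must line up with whatever the custom record reader 
exposes as the value; Hive's syntax requires an OUTPUTFORMAT alongside the 
INPUTFORMAT):

CREATE TABLE seq_with_keys (
  k STRING,
  v STRING
)
STORED AS
INPUTFORMAT 'com.mycompany.SequenceFileKeyInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';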

Dilip

On Thu, Apr 19, 2012 at 6:09 AM, Owen O'Malley <omal...@apache.org> wrote:
On Thu, Apr 19, 2012 at 3:07 AM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
> I'm trying to migrate a part of our current Hadoop jobs from normal
> MapReduce jobs to Hive.
>
> Previously the data was stored in sequence files with the keys containing
> valuable data!
I think you'll want to define your table using a custom InputFormat
that creates a virtual row based on both the key and value and then
use the 'STORED AS INPUTFORMAT ...'

-- Owen



--
_
Dilip Antony Joseph
http://csgrad.blogspot.com
http://www.marydilip.info


RE: using the key from a SequenceFile

2012-04-19 Thread Ruben de Vries
Hive can handle a sequence file just like a text file, except that it omits the 
key completely and only uses the value part; other than that, you won't notice 
the difference between a sequence file and a plain text file.
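
To illustrate the default behavior (a minimal sketch; the single string column 
is an assumption):

CREATE TABLE visit_stats_raw (
  line STRING
)
STORED AS SEQUENCEFILE;

-- Each row is built only from the value of a key/value pair in the file;
-- the key is silently dropped.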

From: David Kulp [mailto:dk...@fiksu.com]
Sent: Thursday, April 19, 2012 2:13 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

I'm trying to achieve something very similar.  I want to write an MR program 
that writes results in a record-based sequence file that would be directly 
readable from Hive as though it were created using "STORED AS SEQUENCEFILE" 
with, say, BinarySortableSerDe.

From this discussion it seems that Hive does not / cannot take advantage of 
the key/values in a sequence file, but rather it requires a value that is 
serialized using a SerDe.  Is that right?

If so, does that mean that the right approach is to use the BinarySortableSerDe 
to pass the collector a row's worth of data as the Writable value? And would 
Hive "just work" on such data?

If SequenceFileOutputFormat is used, will it automatically place sync markers 
in the file to allow for file splitting?

Thanks!


(P.S. As an aside, Avro would be better.  Wouldn't it be a huge win for 
MapReduce to have an AvroOutputFileFormat and for Hive to have a SerDe that 
reads such files?  It seems like there's a natural correspondence between the 
richer data representations of an SQL schema and an Avro schema, and there's 
already code for working with Avro in MR as input.)



On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:


A SerDe will allow you to create custom data from your sequence file: 
https://cwiki.apache.org/confluence/display/Hive/SerDe
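
For instance, a custom SerDe is attached at table-creation time roughly like 
this (the SerDe class name and the columns are purely hypothetical):

CREATE TABLE custom_events (
  member_id INT,
  payload   MAP<STRING, STRING>
)
ROW FORMAT SERDE 'com.example.MySequenceValueSerDe'
STORED AS SEQUENCEFILE;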
On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
I'm trying to migrate a part of our current Hadoop jobs from normal MapReduce 
jobs to Hive.
Previously the data was stored in sequence files with the keys containing 
valuable data!
However, if I load the data into a table I lose that key data (or at least I 
can't access it with Hive); I want to somehow use the key from the sequence 
file in Hive.

I know this has come up before, since I can find some hints of people needing 
it, but I can't seem to find a working solution, and since I'm not very good 
with Java I really can't get it done myself :(.
Does anyone have a snippet of something like this working?

I get errors like:
../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {


Hope someone has a snippet or can help me out; I would really love to be able 
to switch part of our jobs to Hive.


Ruben de Vries



--
https://github.com/zinnia-phatak-dev/Nectar



RE: using the key from a SequenceFile

2012-04-19 Thread Ruben de Vries
AFAIK the SerDe only serializes / deserializes the value part of the sequence 
file :( ?

From: madhu phatak [mailto:phatak@gmail.com]
Sent: Thursday, April 19, 2012 12:16 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

A SerDe will allow you to create custom data from your sequence file: 
https://cwiki.apache.org/confluence/display/Hive/SerDe
On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
I'm trying to migrate a part of our current Hadoop jobs from normal MapReduce 
jobs to Hive.
Previously the data was stored in sequence files with the keys containing 
valuable data!
However, if I load the data into a table I lose that key data (or at least I 
can't access it with Hive); I want to somehow use the key from the sequence 
file in Hive.

I know this has come up before, since I can find some hints of people needing 
it, but I can't seem to find a working solution, and since I'm not very good 
with Java I really can't get it done myself :(.
Does anyone have a snippet of something like this working?

I get errors like:
../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {


Hope someone has a snippet or can help me out; I would really love to be able 
to switch part of our jobs to Hive.


Ruben de Vries



--
https://github.com/zinnia-phatak-dev/Nectar


using the key from a SequenceFile

2012-04-19 Thread Ruben de Vries
I'm trying to migrate a part of our current Hadoop jobs from normal MapReduce 
jobs to Hive.
Previously the data was stored in sequence files with the keys containing 
valuable data!
However, if I load the data into a table I lose that key data (or at least I 
can't access it with Hive); I want to somehow use the key from the sequence 
file in Hive.

I know this has come up before, since I can find some hints of people needing 
it, but I can't seem to find a working solution, and since I'm not very good 
with Java I really can't get it done myself :(.
Does anyone have a snippet of something like this working?

I get errors like:
../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {


Hope someone has a snippet or can help me out; I would really love to be able 
to switch part of our jobs to Hive.


Ruben de Vries