The table isn't compressed; the uncompressed size really is around 300 MB.
Also, I haven't set mapred.child.java.opts, but if it crashes at 1.5 GB, doesn't 
that suggest the default value is already high enough?
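For reference, the per-task heap is set in mapred-site.xml; a minimal sketch (the -Xmx value shown is illustrative; the stock Hadoop default is -Xmx200m):

```xml
<!-- hadoop/conf/mapred-site.xml: heap for each map/reduce child JVM.
     The -Xmx value here is illustrative; the stock default is -Xmx200m. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note that, as pointed out later in this thread, the OOM here occurs in the Hive client itself before any map task starts, so this setting alone may not help.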

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Thursday, April 26, 2012 5:37 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?
On April 26, 2012, at 8:59 AM, Mark Grover <mgro...@oanda.com> wrote:
I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into the mapper's memory, the data is loaded 
in its uncompressed form, right?

If so, is it possible in this case that the compressed size of the smaller 
table is less than the available memory, but the uncompressed size isn't?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.


----- Original Message -----
From: "Ruben de Vries" <ruben.devr...@hyves.nl>
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work. That 
setting is only used when Hive is automatically deciding whether it should 
do a map join, while I'm forcing a map join with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki under LanguageManual Joins.
On April 25, 2012, at 3:36 PM, gemini alex <gemini5201...@gmail.com> wrote:
It seems you are using the default Hive configuration; by default, a map join 
allows only 25 MB for the small table. Copy your hive-default.xml to 
hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000
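As a sketch, that setting can be made persistent in hive-site.xml (the value shown matches the 300 MB table discussed in this thread):

```xml
<!-- hive-site.xml: raise the file-size threshold under which Hive
     considers a table "small" enough for an automatic map join.
     The stock default is 25000000 (~25 MB). -->
<property>
  <name>hive.mapjoin.smalltable.filesize</name>
  <value>300000000</value>
</property>
```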
On April 25, 2012, at 12:09 AM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:

I put the (rather big) log in a GitHub gist: 
https://gist.github.com/2480893
I also attached the plan.xml it was using to the gist.

When loading the members_map (11 million records, 320 MB, ~30 bytes per record), 
it seems to take about 198 bytes per record in the hash table, crashing around 
7 million records with 1.4 GB loaded.
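As a rough sanity check (plain Python, a back-of-the-envelope using only the figures quoted in this thread), the full table can never fit at that per-record rate:

```python
# Figures taken from the console log and table stats in this thread.
records = 11_636_619          # rows in members_map
bytes_per_record = 198        # observed: ~1,468,378,760 B / 7,000,000 rows
max_heap = 1_864_171_520      # "maximum memory" reported by the local task

projected = records * bytes_per_record
print(f"projected hash table size: {projected / 2**30:.2f} GiB")  # ~2.15 GiB
print("fits in heap:", projected < max_heap)                      # False
```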

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) 
where:
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint: 1, 2, or 3
 - member_id is an int, with the highest member_id being 14343249 (14 million)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string 
birthdate}"

I also tried doing the same thing with an empty visit_stats table, with the 
same effect.
Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ...

Can anyone make anything out of this?
I'd rather go with this approach if at all possible;
otherwise I'll have to go the hard way and migrate all the visit_stats into 
buckets so they can match the members_map.

-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
     The operation you are seeing in your log is the preparation of the hash 
table of the smaller table. This hash table file is compressed and loaded into 
the Distributed Cache, and from there it is used for map-side joins. From your 
console log, the hash table size/data size has grown to nearly 1.5 GB; the data 
is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02     Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1,468,378,760   rate: 0.788


Can you enable debug logging and post the console output, to get a better 
picture of why it consumes this much memory?
Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console


Regards
Bejoy KS



________________________________
From: Ruben de Vries <ruben.devr...@hyves.nl>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
          1            1          247231757 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
        442          441         1091837835 
hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats), but it fails with a non-existent class (my 
JSONSerDe).


From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the Hive client itself, before starting a 
map, so splitting across mappers is out of the question.

Can you do a dfs count on the members_map table's HDFS location and tell us the 
result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
Hmm, I must be doing something wrong; the members_map table is around 300 MB.
When I execute the following query:

SELECT
  /*+ MAPJOIN(members_map) */
  date_int,
  members_map.gender AS gender,
  'generic',
  COUNT( memberId ) AS unique,
  SUM( `generic`['count'] ) AS count,
  SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59     Starting to launch local task to process map join;  maximum memory = 1864171520
2012-04-24 10:26:00     Processing rows: 200000    Hashtable size: 199999    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54     Processing rows: 6900000   Hashtable size: 6899999   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02     Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space


I'm running it on my local, single-node dev env; could that be a problem, 
since it won't split over multiple mappers in this case?


-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use map-side joins, some Hive configuration properties need to be enabled.

For plain map-side joins:
hive> SET hive.auto.convert.join=true;
Recent versions of Hive do a map join on the smaller table even if no map 
join hint is provided.

For bucketed map joins:
hive> SET hive.optimize.bucketmapjoin=true;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins


Regards
Bejoy


________________________________
From: Nitin Pawar <nitinpawar...@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?

If you are doing a map-side join, make sure the table members_map is small 
enough to hold in memory.

On 4/24/12, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with a map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group by's and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to entirely load into
> memory on the node.
>
> I use something close the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". A higher x means you are
> being more conservative by having a larger number of buckets (and
> bearing the increased overhead); a lower x means the reverse. What x to
> use depends on your use case. This matters because the number of buckets
> in a table is fixed.
> If you have a large partition, it would distribute its data into
> bulkier buckets and you would want to make sure these bulkier buckets
> can still fit in memory. Moreover, buckets are generated using a
> hashing function, if you have a strong bias towards a particular value
> of bucketing column in your data, some buckets might be bulkier than
> others. In that case, you'd want to make sure that those bulkier buckets can 
> still fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). A larger standard
> deviation means you should be more conservative in your calculation, and 
> vice-versa.
> * Distribution of the data in the bucketing columns. "Wider"
> distribution means you should be more conservative and vice-versa.
>
> Long story short, I would say, x of 2 to 4 should suffice in most
> cases but feel free to verify that in your case:-) I would love to
> hear what factors others have been using when calculating their number of 
> buckets, BTW!
> Whatever answer you get for #buckets from the formula above, use the
> closest power of 2 as the number of buckets in your table (I am not
> sure this is a must, though).
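The rule of thumb above can be sketched in a few lines (a hypothetical helper, not anything in Hive; it rounds up to the next power of 2, which is one reading of "closest power of 2"):

```python
def num_buckets(avg_partition_bytes, task_heap_bytes, x=2):
    """x * average partition size / task JVM heap, rounded up to a
    power of 2. x > 1 is the 'factor of conservatism' described above."""
    raw = x * avg_partition_bytes / task_heap_bytes
    buckets = 1
    while buckets < raw:
        buckets *= 2
    return buckets

# e.g. 4 GiB average partition, 1 GiB task heap, x = 2:
print(num_buckets(4 * 2**30, 1 * 2**30))  # -> 8
```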
>
> Good luck!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Bejoy KS" <bejoy...@yahoo.com>
> To: "Ruben de Vries" <ruben.devr...@hyves.nl>, user@hive.apache.org
> Sent: Monday, April 23, 2012 12:39:17 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If the data is in HDFS, then you can bucket it only by loading it into a
> temp/staging table and then into the final bucketed table. Bucketing
> needs a MapReduce job.
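A minimal sketch of that staging flow, assuming the column layout described elsewhere in this thread (table names, path, and bucket count are illustrative):

```sql
-- 1. Plain staging table; LOAD DATA is just a file move, no bucketing.
CREATE TABLE members_map_staging (member_id INT, gender INT, birthday STRING);
LOAD DATA INPATH '/tmp/members_map' INTO TABLE members_map_staging;

-- 2. Bucketed target table.
CREATE TABLE members_map_bucketed (member_id INT, gender INT, birthday STRING)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- 3. The INSERT ... SELECT is the MapReduce job that actually
--    distributes rows into buckets.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE members_map_bucketed
SELECT member_id, gender, birthday FROM members_map_staging;
```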
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries <ruben.devr...@hyves.nl>
> Date: Mon, 23 Apr 2012 18:13:20 +0200
> To: user@hive.apache.org<user@hive.apache.org>;
> bejoy...@yahoo.com<bejoy...@yahoo.com>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Thanks for the help so far guys,
>
>
>
> I bucketed the members_map, it's 330mb in size (11 mil records).
>
>
>
> Can you manually bucket stuff?
>
> Since my initial MapReduce job is still outside of Hive, I'm doing a
> LOAD DATA to import data into the visit_stats tables; replacing that
> with INSERT OVERWRITE ... SELECT slows it down a lot.
>
>
>
>
>
> From: Bejoy KS [mailto:bejoy...@yahoo.com]
> Sent: Monday, April 23, 2012 6:06 PM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
> For a bucketed map join, both tables should be bucketed, and the number
> of buckets of one should be a multiple of the other.
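A sketch of what that looks like (table and column names follow this thread; the bucket counts are illustrative, one a multiple of the other):

```sql
CREATE TABLE members_map (member_id INT, gender INT, birthday STRING)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

CREATE TABLE visit_stats (memberId INT, date_int INT)
CLUSTERED BY (memberId) INTO 64 BUCKETS;

-- Enable the bucketed map join at query time:
SET hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender
FROM visit_stats
JOIN members_map ON members_map.member_id = visit_stats.memberId;
```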
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: "Bejoy KS" <bejoy...@yahoo.com>
> Date: Mon, 23 Apr 2012 16:03:34 +0000
> To: <user@hive.apache.org>
> Reply-To: bejoy...@yahoo.com
> Subject: Re: When/how to use partitions and buckets usefully?
>
>
>
>
> A bucketed map join would be good, I guess. What is the total size of the
> smaller table, and what is its expected size in the next few years?
>
> If the size is small enough to be put in the Distributed Cache, then
> map-side joins would offer you a big performance improvement.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
>
>
>
> From: Ruben de Vries <ruben.devr...@hyves.nl>
> Date: Mon, 23 Apr 2012 17:38:20 +0200
> To: user@hive.apache.org <user@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Ok, very clear on the partitions: try to make them match the WHERE
> clauses, not so much the GROUP BY clauses then ;)
>
>
>
> The member_map contains 11,636,619 records at the moment; I think bucketing
> those would be good?
>
> What's a good number to bucket them by then?
>
>
>
> And is there any point in bucketing the visit_stats?
>
>
>
>
>
> From: Tucker, Matt [mailto:matt.tuc...@disney.com]
> Sent: Monday, April 23, 2012 5:30 PM
> To: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
> If you're only interested in a certain window of dates for analysis, a
> date-based partition scheme will be helpful, as it will trim
> partitions that aren't needed by the query before execution.
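As a sketch of date-based partition pruning (names are illustrative; the visit_stats columns follow this thread):

```sql
CREATE TABLE visit_stats (memberId INT, generic MAP<STRING, BIGINT>)
PARTITIONED BY (date_int INT);

-- Only the matching partition's files are read; the other partitions
-- are pruned before the job starts.
SELECT COUNT(*) FROM visit_stats WHERE date_int = 20120423;
```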
>
>
>
> If the member_map table is small, you might consider testing the
> feasibility of map-side joins, as they reduce the number of
> processing stages. If member_map is large, bucketing on member_id will
> reduce the number of visit_stats rows compared against each member_id
> during joins.
>
>
>
>
> Matt Tucker
>
>
>
>
>
> From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
> Sent: Monday, April 23, 2012 11:19 AM
> To: user@hive.apache.org
> Subject: When/how to use partitions and buckets usefully?
>
>
>
> It seems there's enough information to be found on how to set up and
> use partitions and buckets.
>
> But I'm more interested in how to figure out when, and on which columns,
> you should be partitioning and bucketing to increase performance.
>
>
>
> In my case I've got 2 tables: visit_stats (member_id, date and some MAP
> cols which give me info about the visits) and member_map (member_id,
> gender, age).
>
>
>
> Usually I group by date and then by one of the other cols, so I assume
> that partitioning on date is a good start?!
>
>
>
> It seems the join of the member_map onto the visit_stats makes the
> queries a lot slower; can that be fixed by bucketing both tables, or
> just one of them?
>
>
>
>
> Maybe some people have written good blogs on this subject but I can't
> really seem to find them!?
>
>
>
> Any help would be appreciated, thanks in advance :)
>


--
Nitin Pawar




--
Nitin Pawar

