Do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?

On April 26, 2012 at 8:59 AM, Mark Grover <mgro...@oanda.com> wrote:
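For reference, the setting gemini is asking about lives in mapred-site.xml. A minimal sketch, with an illustrative heap size (note this controls the child task JVMs, not the local Hive client that builds the map-join hash table):

```xml
<!-- hadoop/conf/mapred-site.xml: heap options for child task JVMs.
     The -Xmx value below is illustrative, not a recommendation. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```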
> I am hoping that other people who have used Map Join can pitch in here...
>
> When the smaller table gets loaded into the mapper's memory, the data is
> loaded in its uncompressed form, right?
>
> If so, is it possible at all in this case that the compressed size of the
> smaller table is less than the memory available but the uncompressed size
> isn't?
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
> ----- Original Message -----
> From: "Ruben de Vries" <ruben.devr...@hyves.nl>
> To: user@hive.apache.org
> Sent: Wednesday, April 25, 2012 3:48:46 AM
> Subject: RE: When/how to use partitions and buckets usefully?
>
> I already tried running with that set to 400 MB, but it didn't work, and
> that setting is only used when Hive is trying to automatically figure out
> whether it should do a map join, while I'm forcing it to do a map join
> with a hint.
>
> From: gemini alex [mailto:gemini5201...@gmail.com]
> Sent: Wednesday, April 25, 2012 9:40 AM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
> This should be documented in the wiki under LanguageManual+Joins.
>
> On April 25, 2012 at 3:36 PM, gemini alex <gemini5201...@gmail.com> wrote:
>
> It seems you are using the default Hive configuration; the default map
> join allows only 25 MB for the small table. Copy your hive-default.xml to
> hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000.
>
> On April 25, 2012 at 12:09 AM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
>
> I put the (rather big) log in a GitHub gist:
> https://gist.github.com/2480893
> And I also attached the plan.xml it was using to the gist.
>
> When loading the members_map (11 mil records, 320 MB, ~30 bytes per
> record), it seems to take about 198 bytes per record in the members_map,
> resulting in a crash around 7 mil records with 1.4 GB loaded.
>
> The members_map is a TEXTFILE with (member_id INT, gender INT, birthday
> STRING) where
> - birthday is a string containing YYYY-MM-DD
> - gender is a tinyint, 1, 2 or 3
> - member_id is an int, with the highest member_id being 14343249 (14 mil)
>
> The log says:
> "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender,
> string birthdate}"
>
> I also tried doing the same thing but with an empty visit_stats table,
> with the same effect.
> Some of the blogs I read talk about a 25 MB small table, not 300 MB like
> mine ...
>
> Can anyone make anything out of this?
> I'd rather go with this if at all possible;
> otherwise I have to go the hard way and migrate all the visit_stats into
> buckets so they can match the members_map on that?
>
> -----Original Message-----
> From: Bejoy Ks [mailto:bejoy...@yahoo.com]
> Sent: Tuesday, April 24, 2012 3:58 PM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben
> The operation you are seeing in your log is the preparation of the hash
> table of the smaller table. This hash table file is compressed and loaded
> into the Distributed Cache, and from there it is used for map-side joins.
> From your console log, the hash table size/data size has grown to nearly
> 1.5 GB; the data is too large to be loaded into the memory of the Hive
> client.
>
> 2012-04-24 10:31:02   Processing rows: 7000000   Hashtable size: 6999999
> Memory usage: 1,468,378,760   rate: 0.788
>
> Can you enable debug logging and post the console output, to get a better
> picture of why it consumes this much memory?
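The per-record overhead Ruben describes can be checked against the quoted console log. A rough sketch, using only the numbers that appear in this thread:

```python
# Figures quoted in this thread's console log.
heap_limit = 1_864_171_520          # "maximum memory" reported by the local task
rows_loaded = 7_000_000             # rows processed when the OOM hit
bytes_used = 1_468_378_760          # "Memory usage" at that point
total_rows = 11_636_619             # full members_map row count

per_row = bytes_used / rows_loaded  # roughly 210 bytes per row in the hash table
projected = total_rows * per_row    # roughly 2.4 GB for the full table

# The full hash table would not fit in the ~1.86 GB heap,
# which matches the observed OutOfMemoryError.
print(round(per_row), projected > heap_limit)
```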
> Start your Hive shell as
> hive -hiveconf hive.root.logger=ALL,console;
>
> Regards
> Bejoy KS
>
> ________________________________
> From: Ruben de Vries <ruben.devr...@hyves.nl>
> To: "user@hive.apache.org" <user@hive.apache.org>
> Sent: Tuesday, April 24, 2012 4:58 PM
> Subject: FW: When/how to use partitions and buckets usefully?
>
> Here are both tables:
>
> $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
>   1   1   247231757
> hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
>
> $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
>   442   441   1091837835
> hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats
>
> The 'work' I'm seeing on the console is the loading of the table into
> memory?
>
> It seems like it's loading the visit_stats table instead?!
> I tried doing MAPJOIN(visit_stats) but it fails on a non-existing class
> (my JSONSerDe).
>
> From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
> Sent: Tuesday, April 24, 2012 11:46 AM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
> This operation is erroring out on the Hive client itself before starting a
> map, so splitting across mappers is out of the question.
>
> Can you do a dfs count for the members_map table HDFS location and tell us
> the result?
>
> On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <ruben.devr...@hyves.nl>
> wrote:
> Hmm, I must be doing something wrong, the members_map table is 300ish MB.
> When I execute the following query:
>
> SELECT
>     /*+ MAPJOIN(members_map) */
>     date_int,
>     members_map.gender AS gender,
>     'generic',
>     COUNT( memberId ) AS unique,
>     SUM( `generic`['count'] ) AS count,
>     SUM( `generic`['seconds'] ) AS seconds
> FROM visit_stats
> JOIN members_map ON (members_map.member_id = visit_stats.memberId)
> GROUP BY date_int, members_map.gender
>
> It results in:
> 2012-04-24 10:25:59   Starting to launch local task to process map join;
> maximum memory = 1864171520
> 2012-04-24 10:26:00   Processing rows: 200000    Hashtable size: 199999
> Memory usage: 43501848      rate: 0.023
> 2012-04-24 10:30:54   Processing rows: 6900000   Hashtable size: 6899999
> Memory usage: 1449867552    rate: 0.778
> 2012-04-24 10:31:02   Processing rows: 7000000   Hashtable size: 6999999
> Memory usage: 1468378760    rate: 0.788
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
> I'm running it only on my local, single-node dev env; could that be a
> problem, since it won't split over multiple mappers in this case?
>
> -----Original Message-----
> From: Bejoy Ks [mailto:bejoy...@yahoo.com]
> Sent: Tuesday, April 24, 2012 9:47 AM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben
> The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:
> SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key
>
> To use a map-side join, some Hive configuration properties need to be
> enabled.
>
> For plain map-side joins:
> hive> SET hive.auto.convert.join=true;
> Latest versions of Hive do a map join on the smaller table even if no map
> join hint is provided.
>
> For bucketed map joins:
> hive> SET hive.optimize.bucketmapjoin=true;
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>
> Regards
> Bejoy
>
> ________________________________
> From: Nitin Pawar <nitinpawar...@gmail.com>
> To: user@hive.apache.org
> Sent: Tuesday, April 24, 2012 12:46 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If you are doing a map-side join, make sure the table members_map is small
> enough to hold in memory.
>
> On 4/24/12, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
> > Wow, thanks everyone for the nice feedback!
> >
> > I can force a map-side join by doing /*+ STREAMTABLE(members_map) */,
> > right?
> >
> > Cheers,
> >
> > Ruben de Vries
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:mgro...@oanda.com]
> > Sent: Tuesday, April 24, 2012 3:17 AM
> > To: user@hive.apache.org; bejoy ks
> > Cc: Ruben de Vries
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > Hi Ruben,
> > Like Bejoy pointed out, members_map is small enough to fit in memory,
> > so your joins with visit_stats would be much faster with a map-side
> > join.
> >
> > However, there is still some virtue in bucketing visit_stats.
> > Bucketing can optimize joins, group bys and potentially other queries
> > in certain circumstances.
> > You probably want to keep consistent bucketing columns across all your
> > tables so they can be leveraged in multi-table queries. Most people use
> > some power of 2 as their number of buckets. To make the best use of
> > the buckets, each of your buckets should be able to load entirely into
> > memory on the node.
> >
> > I use something close to the formula below to calculate the number of
> > buckets:
> >
> > #buckets = (x * Average_partition_size) /
> > JVM_memory_available_to_your_Hadoop_tasknode
> >
> > I call x (>1) the "factor of conservatism".
> > Higher x means you are being more conservative by having a larger
> > number of buckets (and bearing the increased overhead); lower x means
> > the reverse. What x to use would depend on your use case. This is
> > because the number of buckets in a table is fixed.
> > If you have a large partition, it would distribute its data into
> > bulkier buckets, and you would want to make sure these bulkier buckets
> > can still fit in memory. Moreover, buckets are generated using a
> > hashing function; if you have a strong bias towards a particular value
> > of the bucketing column in your data, some buckets might be bulkier
> > than others. In that case, you'd want to make sure that those bulkier
> > buckets can still fit in memory.
> >
> > To summarize, it depends on:
> > * How the actual partition sizes vary from the average partition size
> > (i.e. the standard deviation of your partition size). A larger standard
> > deviation means you should be more conservative in your calculation,
> > and vice versa.
> > * Distribution of the data in the bucketing columns. A "wider"
> > distribution means you should be more conservative, and vice versa.
> >
> > Long story short, I would say an x of 2 to 4 should suffice in most
> > cases, but feel free to verify that in your case :-) I would love to
> > hear what factors others have been using when calculating their number
> > of buckets, BTW!
> > Whatever answer you get for #buckets from the above formula, use the
> > closest power of 2 as the number of buckets in your table (I am not
> > sure if this is a must, though).
> >
> > Good luck!
> >
> > Mark
> >
> > Mark Grover, Business Intelligence Analyst
> > OANDA Corporation
> >
> > www: oanda.com www: fxtrade.com
> > e: mgro...@oanda.com
> >
> > "Best Trading Platform" - World Finance's Forex Awards 2009.
> > "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
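Mark's rule of thumb can be sketched as a small helper (names are illustrative; the round-up-to-a-power-of-two step is his "not sure if this is a must" suggestion):

```python
import math

def num_buckets(avg_partition_bytes, task_heap_bytes, x=2):
    """Mark's formula: #buckets = (x * average_partition_size) / task_heap,
    with x > 1 as the 'factor of conservatism', rounded up to a power of 2."""
    raw = (x * avg_partition_bytes) / task_heap_bytes
    if raw <= 1:
        return 1
    return 2 ** math.ceil(math.log2(raw))

# e.g. a 4 GB average partition with a 1 GB task heap and x = 2
print(num_buckets(4 * 2**30, 2**30, x=2))
```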
> >
> > ----- Original Message -----
> > From: "Bejoy KS" <bejoy...@yahoo.com>
> > To: "Ruben de Vries" <ruben.devr...@hyves.nl>, user@hive.apache.org
> > Sent: Monday, April 23, 2012 12:39:17 PM
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > If the data is in HDFS, then you can bucket it only after loading it
> > into a temp/staging table and then into the final bucketed table.
> > Bucketing needs a MapReduce job.
> >
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> >
> > From: Ruben de Vries <ruben.devr...@hyves.nl>
> > Date: Mon, 23 Apr 2012 18:13:20 +0200
> > To: user@hive.apache.org; bejoy...@yahoo.com
> > Subject: RE: When/how to use partitions and buckets usefully?
> >
> > Thanks for the help so far guys,
> >
> > I bucketed the members_map; it's 330 MB in size (11 mil records).
> >
> > Can you manually bucket stuff?
> > Since my initial MapReduce job is still outside of Hive, I'm doing a
> > LOAD DATA to import stuff into the visit_stats tables; replacing that
> > with INSERT OVERWRITE SELECT slows it down a lot.
> >
> > From: Bejoy KS [mailto:bejoy...@yahoo.com]
> > Sent: Monday, April 23, 2012 6:06 PM
> > To: user@hive.apache.org
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > For a bucketed map join, both tables should be bucketed and the number
> > of buckets of one should be a multiple of the other.
> >
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> >
> > From: "Bejoy KS" <bejoy...@yahoo.com>
> > Date: Mon, 23 Apr 2012 16:03:34 +0000
> > To: <user@hive.apache.org>
> > ReplyTo: bejoy...@yahoo.com
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > Bucketed map join would be good, I guess.
> > What is the total size of the smaller table and what is its expected
> > size in the next few years?
> > The size should be small enough for it to be put in the Distributed
> > Cache; then map-side joins would offer you a big performance
> > improvement.
> >
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> >
> > From: Ruben de Vries <ruben.devr...@hyves.nl>
> > Date: Mon, 23 Apr 2012 17:38:20 +0200
> > To: user@hive.apache.org
> > ReplyTo: user@hive.apache.org
> > Subject: RE: When/how to use partitions and buckets usefully?
> >
> > Ok, very clear on the partitions: try to make them match the WHERE
> > clauses, not so much the GROUP BY clauses then ;)
> >
> > The member_map contains 11,636,619 records atm; I think bucketing
> > those would be good?
> > What's a good number to bucket them by then?
> >
> > And is there any point in bucketing the visit_stats?
> >
> > From: Tucker, Matt [mailto:matt.tuc...@disney.com]
> > Sent: Monday, April 23, 2012 5:30 PM
> > To: user@hive.apache.org
> > Subject: RE: When/how to use partitions and buckets usefully?
> >
> > If you're only interested in a certain window of dates for analysis, a
> > date-based partition scheme will be helpful, as it will trim
> > partitions that aren't needed by the query before execution.
> >
> > If the member_map table is small, you might consider testing the
> > feasibility of map-side joins, as it will reduce the number of
> > processing stages. If member_map is large, bucketing on member_id will
> > avoid having as many rows from visit_stats compared to each member_id
> > for joins.
> >
> > Matt Tucker
> >
> > From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
> > Sent: Monday, April 23, 2012 11:19 AM
> > To: user@hive.apache.org
> > Subject: When/how to use partitions and buckets usefully?
> >
> > It seems there's enough information to be found on how to set up and
> > use partitions and buckets.
> > But I'm more interested in how to figure out when and on what columns
> > you should be partitioning and bucketing to increase performance?!
> >
> > In my case I've got 2 tables: 1 visit_stats (member_id, date and some
> > MAP cols which give me info about the visits) and 1 member_map
> > (member_id, gender, age).
> >
> > Usually I group by date and then one of the other cols, so I assume
> > that partitioning on date is a good start?!
> >
> > It seems the join of the member_map onto the visit_stats makes the
> > queries a lot slower; can that be fixed by bucketing both tables? Or
> > just one of them?
> >
> > Maybe some ppl have written good blogs on this subject but I can't
> > really seem to find them!?
> >
> > Any help would be appreciated, thanks in advance :)
>
> --
> Nitin Pawar
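Pulling the thread's advice together, a sketch of the table layouts being discussed. Column types are guessed from the queries quoted above, the bucket counts are illustrative (for a bucketed map join they must be equal or multiples of each other), and members_map_staging is a hypothetical staging table:

```sql
-- visit_stats: partition on the date column used in WHERE clauses,
-- bucket on the join key so a bucketed map join becomes possible.
CREATE TABLE visit_stats (
  memberId INT,
  generic  MAP<STRING, BIGINT>
)
PARTITIONED BY (date_int INT)
CLUSTERED BY (memberId) INTO 32 BUCKETS;

-- members_map: bucket on the same join key.
CREATE TABLE members_map (
  member_id INT,
  gender    INT,
  birthday  STRING
)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- Bucketing needs a MapReduce job, so populate via INSERT ... SELECT
-- from a staging table (LOAD DATA will not bucket the data).
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE members_map
SELECT member_id, gender, birthday FROM members_map_staging;
```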