RE: Load data throws exception, can't figure it out
Figured it out myself by running with '-hiveconf hive.root.logger=INFO,console'; it turned out to be the binlog.

--- Hey guys, I've been playing with Hive for a while now, but I ran into this error all of a sudden while setting up my production cluster:

$ hive -e 'LOAD DATA INPATH "/tmp/members_map_2012-06-30.map" OVERWRITE INTO TABLE members_map_full;'
Loading data to table hyves_goldmine.members_map_full
Moved to trash: hdfs://hadoop-nn.internal:9000/user/hive/warehouse/hyves_goldmine.db/members_map_full
Failed with exception Clear request failed : DELETE FROM `TABLE_PARAMS` WHERE `TBL_ID` = ?
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

It seems to be some problem with the MySQL metastore, but the privileges seem to be fine (I can run that query manually, no problem). Does anyone have an idea what could be wrong, or what steps I could take to find out?
Load data throws exception, can't figure it out
Hey guys, I've been playing with Hive for a while now, but I ran into this error all of a sudden while setting up my production cluster:

$ hive -e 'LOAD DATA INPATH "/tmp/members_map_2012-06-30.map" OVERWRITE INTO TABLE members_map_full;'
Loading data to table hyves_goldmine.members_map_full
Moved to trash: hdfs://hadoop-nn.internal:9000/user/hive/warehouse/hyves_goldmine.db/members_map_full
Failed with exception Clear request failed : DELETE FROM `TABLE_PARAMS` WHERE `TBL_ID` = ?
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

It seems to be some problem with the MySQL metastore, but the privileges seem to be fine (I can run that query manually, no problem). Does anyone have an idea what could be wrong, or what steps I could take to find out?
RE: loading data in an array within a map
Sorry, can't help you with your specific problem, but in case you're really stuck: I used the JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde, this one is better than the default one) and it handles nested arrays and maps perfectly.

From: Bhaskar, Snehalata [mailto:snehalata_bhas...@syntelinc.com]
Sent: Thursday, June 28, 2012 1:50 PM
To: user@hive.apache.org
Subject: loading data in an array within a map

Hi all, I have created a table which has an array within a map, using the following query:

hive> create table user_profiles
    > (
    > userid string,
    > friends array<string>,
    > properties map<string, array<string>>
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
    > LOCATION '/user/sb25634/user_profiles_table';
OK
Time taken: 0.794 seconds

But I am not able to load data properly into the array inside the map; I am facing problems with the delimiters used in the data file. Is there any way I can indicate the delimiters for the array elements? Please help me.

Thanks and regards,
Snehalata Deorukhkar
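For what it's worth, a sketch of the SerDe-based approach suggested above (the table mirrors the one in the question; the jar path and the assumption that each input line is one JSON document are illustrative, not from the thread):

```sql
-- Register the third-party JSON SerDe jar (path is illustrative):
ADD JAR /path/to/json-serde-1.1-jar-with-dependencies.jar;

-- Each line of the input file is a JSON document, so nested arrays and
-- maps need no delimiter configuration at all:
CREATE TABLE user_profiles_json (
  userid     STRING,
  friends    ARRAY<STRING>,
  properties MAP<STRING, ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/sb25634/user_profiles_table';
```

A row would then be stored as e.g. {"userid":"u1","friends":["u2","u3"],"properties":{"likes":["a","b"]}} instead of juggling three levels of delimiters.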
RE: Hive-0.8.1 PHP Thrift client broken?
Going to bump this one since I hope to be able to contribute some (worth a bump :P)

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Friday, June 15, 2012 11:59 AM
To: user@hive.apache.org
Subject: Hive-0.8.1 PHP Thrift client broken?

Hey guys, I've been slamming my head into a wall on this issue before, but now that I'm a bit more familiar with Hive and Thrift (I got the Python version working) I figured I should try fixing the problem, or at least find out more about it, so I can contribute some to the project too :)

The PHP Thrift client lib which comes with the hive-0.8.1 package is completely broken. It has a couple of issues so far:
- the PHP files in the packages directory are one level too deep (e.g. /packages/metastore/metastore/*.php);
- there's a constant called 'GLOBAL', but that's a reserved keyword;
- it crashes with an error (and patch) described here: https://issues.apache.org/jira/browse/THRIFT-347

Now I figured I could just check out all the code and see how the PHP code is generated from the Thrift code ... but 'share/fb303/if/fb303.thrift' is missing! I found some information on this by googling, but everything was too outdated to work.

Could someone start by helping me with this missing dependency so I can generate the Thrift libs? Then maybe I can see if I can fix some of the other issues myself :-)
Hive-0.8.1 PHP Thrift client broken?
Hey guys, I've been slamming my head into a wall on this issue before, but now that I'm a bit more familiar with Hive and Thrift (I got the Python version working) I figured I should try fixing the problem, or at least find out more about it, so I can contribute some to the project too :)

The PHP Thrift client lib which comes with the hive-0.8.1 package is completely broken. It has a couple of issues so far:
- the PHP files in the packages directory are one level too deep (e.g. /packages/metastore/metastore/*.php);
- there's a constant called 'GLOBAL', but that's a reserved keyword;
- it crashes with an error (and patch) described here: https://issues.apache.org/jira/browse/THRIFT-347

Now I figured I could just check out all the code and see how the PHP code is generated from the Thrift code ... but 'share/fb303/if/fb303.thrift' is missing! I found some information on this by googling, but everything was too outdated to work.

Could someone start by helping me with this missing dependency so I can generate the Thrift libs? Then maybe I can see if I can fix some of the other issues myself :-)
RE: Hive scratch dir not cleaning up
So I should write a job which cleans up month-old results, or something like that?

From: Vinod Singh [mailto:vi...@vinodsingh.com]
Sent: Friday, June 01, 2012 10:35 AM
To: user@hive.apache.org
Subject: Re: Hive scratch dir not cleaning up

Hive deletes job contents from the scratch directory on completion of the job. Failed / killed jobs leave data there, though, which needs to be removed manually.

Thanks,
Vinod
http://blog.vinodsingh.com/

On Fri, Jun 1, 2012 at 1:58 PM, Ruben de Vries wrote:
Hey Hivers, I'm almost ready to replace our old Hadoop implementation with an implementation using Hive. Now I've run into (hopefully) my last problem: my /tmp/hive-hduser dir is getting kinda big! Hive doesn't seem to clean up these tmp files. Googling for it, I ran into some tickets about a cleanup setting; should I enable it with the setting below? Why doesn't Hive do that by default? Am I the only one somehow racking up a lot of space with tmp files?

hive.start.cleanup.scratchdir = true
Hive scratch dir not cleaning up
Hey Hivers, I'm almost ready to replace our old Hadoop implementation with an implementation using Hive. Now I've run into (hopefully) my last problem: my /tmp/hive-hduser dir is getting kinda big! Hive doesn't seem to clean up these tmp files. Googling for it, I ran into some tickets about a cleanup setting; should I enable it with the setting below? Why doesn't Hive do that by default? Am I the only one somehow racking up a lot of space with tmp files?

hive.start.cleanup.scratchdir = true
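Since hive.start.cleanup.scratchdir (as I understand it) only clears the scratch dir when Hive starts up, a cron'd cleanup job along these lines may be the more practical answer. This is just a sketch, assuming the scratch dir sits on the local filesystem; the helper name is made up, and for an HDFS scratch dir you'd use the hadoop fs equivalents instead:

```shell
# Hypothetical helper (not part of Hive): remove scratch sub-directories
# that have not been modified for more than a given number of days.
clean_scratch_dir() {
    scratch_dir="$1"    # e.g. /tmp/hive-hduser
    max_age_days="$2"   # e.g. 30
    # -mindepth/-maxdepth keep the scratch dir itself out of the results;
    # -mtime +N matches entries last modified more than N days ago.
    find "$scratch_dir" -mindepth 1 -maxdepth 1 -mtime +"$max_age_days" \
        -exec rm -rf {} +
}
```

A daily crontab entry could then run e.g. `clean_scratch_dir /tmp/hive-hduser 30`, leaving recent (possibly still-running) jobs alone.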
RE: table design and performance questions
Partitioning can greatly increase performance for WHERE clauses, since Hive can skip reading the partitions which do not meet the requirement. For example, if you partition by date (I do it with an INT dateint, in which case I set dateint to yyyyMMdd) and you do WHERE dateint >= 20120101, Hive won't even have to touch any of the data from before 2012-01-01. In my case that means I don't parse the last 2 years of data, reducing the time the query takes by about 70% :-)

Buckets are the second awesome way of getting a big optimization in, specifically for joins! If the 2 tables you're joining are both bucketed on their join column, that will also greatly increase speed.

Another good join optimization is MAPJOIN: if one of the tables you're joining is rather small (below 30 MB) then you can force a map join, or you can enable automatic map joins. I personally prefer explicit behavior over automagic, so I use a hint: SELECT /*+ MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, etc.

SORTED BY is for sorting within buckets, only relevant if you're doing a lot of ordering, I think. I'm assuming sequence files are faster, but I wouldn't really know :( need someone else to tell us more about that ;)

-Original Message-
From: Avdeev V. M. [mailto:ls...@list.ru]
Sent: Monday, May 28, 2012 7:17 AM
To: user@hive.apache.org
Subject: table design and performance questions

Question from a novice. Where can I read about table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't figure out the optimal design for both kinds of tables. Are there performance tuning guides or a performance FAQ? Specifically:
1) The PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases does using these statements make sense?
2) The DDL language manual says 'This can improve performance on certain kinds of queries.' about the CLUSTERED BY statement. What kinds of queries can be improved?
3) What is preferable in terms of performance: SEQUENCEFILE, RCFILE or TEXTFILE? What aspects should be taken into account when choosing a file format?
4) The Compressed Storage article says 'Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage;' and again, what are these cases?

Thanks!
Vyacheslav
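The dateint partitioning described above looks roughly like this (a sketch; the table and column names are illustrative, not from the thread):

```sql
-- Partitioned table: each dateint value becomes its own directory in HDFS.
CREATE TABLE visit_stats_by_day (
  member_id INT,
  seconds   BIGINT
)
PARTITIONED BY (dateint INT);  -- yyyyMMdd, e.g. 20120628

-- Partition pruning: only the directories with dateint >= 20120101 are
-- read, so older partitions are never scanned at all.
SELECT dateint, COUNT(*)
FROM visit_stats_by_day
WHERE dateint >= 20120101
GROUP BY dateint;
```

Because the partition column is part of the directory layout rather than the row data, the WHERE clause on dateint is resolved against the metastore before any file is opened.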
RE: Job Scheduling in Hadoop-Hive
Hey, we use Hadoop/Hive for processing our access logs, and we run a daily cronjob (a Python script) which does the parsing jobs and some partitioning etc. The results from those jobs are then queried by other jobs, which generate the results the management team wants to see :-)

From: Ronak Bhatt [mailto:ronakb...@gmail.com]
Sent: Saturday, May 26, 2012 4:48 PM
To: hive-u...@hadoop.apache.org
Subject: Job Scheduling in Hadoop-Hive

Hello - for those of you whose setup is somewhat production, what do you use for job scheduling and dependency management?

thanks, ronak
JOIN + LATERAL VIEW works, but + MAPJOIN and no longer get any results
Okay, first off: I know JOIN + LATERAL VIEW together isn't working, so I moved my JOIN into a subquery and that makes the query work properly. However, when I added a MAPJOIN hint for the JOIN in the subquery, it also stopped running the reducer for the main query! This only happens when there's a LATERAL VIEW in there though; if I remove the LATERAL VIEW, the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP script which you can run to execute the test case I'm using. It does the following:
- set up a database called hive_mapjoin
- set up tables
- load some test data
- do the selects

You'll need the json-serde-1.1-jar-with-dependencies.jar from https://github.com/rcongiu/Hive-JSON-Serde/downloads with it though, and you'll have to change the path. I guess looking at the queries you guys can probably figure out a better test case, but maybe it's helpful.

Not sure if this is a bug or me doing something that just isn't supposed to work, but I can't seem to find any pointers that this wouldn't be supported... Here's another gist with the plan.xml: https://gist.github.com/2499658

I've also created a ticket in JIRA, but it doesn't seem to be getting any attention at all: https://issues.apache.org/jira/browse/HIVE-2992

Greetz, Ruben de Vries
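For context, the subquery workaround described above has roughly this shape (a sketch with illustrative column names based on the tables mentioned elsewhere in the thread; the exact queries are in the gist):

```sql
-- JOIN pushed down into a subquery, LATERAL VIEW applied to its result,
-- since JOIN and LATERAL VIEW at the same query level are not supported:
SELECT j.date_int,
       j.gender,
       p.part_name,
       COUNT(j.member_id) AS members
FROM (
  SELECT /*+ MAPJOIN(members_map) */
         v.date_int,
         m.gender,
         v.memberId AS member_id,
         v.parts
  FROM visit_stats v
  JOIN members_map m ON (m.member_id = v.memberId)
) j
LATERAL VIEW explode(j.parts) p AS part_name, part_value
GROUP BY j.date_int, j.gender, p.part_name;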
RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!
I really do feel like this isn't as intended; should I make a ticket in JIRA?

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Thursday, April 26, 2012 3:37 PM
To: user@hive.apache.org
Subject: RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!

https://gist.github.com/2499658 and this is the plan.xml it's using

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Thursday, April 26, 2012 3:17 PM
To: user@hive.apache.org
Subject: JOIN + LATERAL VIEW + MAPJOIN = no output?!

Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my JOIN into a subquery and that makes the query work properly. However, when I added a MAPJOIN hint for the JOIN in the subquery, it also stopped running the reducer for the main query! This only happens when there's a LATERAL VIEW in there though; if I remove the LATERAL VIEW, the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP script which you can run to execute the test case:
- set up a database called hive_mapjoin
- set up tables
- load some test data
- do the selects

You'll need the json-serde-1.1-jar-with-dependencies.jar from https://github.com/rcongiu/Hive-JSON-Serde/downloads with it though, and you'll have to change the path ;)

Not sure if this is a bug or me doing something that just isn't supposed to work.

Thanks, Ruben
RE: JOIN + LATERAL VIEW + MAPJOIN = no output?!
https://gist.github.com/2499658 and this is the plan.xml it's using

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Thursday, April 26, 2012 3:17 PM
To: user@hive.apache.org
Subject: JOIN + LATERAL VIEW + MAPJOIN = no output?!

Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my JOIN into a subquery and that makes the query work properly. However, when I added a MAPJOIN hint for the JOIN in the subquery, it also stopped running the reducer for the main query! This only happens when there's a LATERAL VIEW in there though; if I remove the LATERAL VIEW, the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP script which you can run to execute the test case:
- set up a database called hive_mapjoin
- set up tables
- load some test data
- do the selects

You'll need the json-serde-1.1-jar-with-dependencies.jar from https://github.com/rcongiu/Hive-JSON-Serde/downloads with it though, and you'll have to change the path ;)

Not sure if this is a bug or me doing something that just isn't supposed to work.

Thanks, Ruben
JOIN + LATERAL VIEW + MAPJOIN = no output?!
Okay, first off: JOIN + LATERAL VIEW together isn't working, so I moved my JOIN into a subquery and that makes the query work properly. However, when I added a MAPJOIN hint for the JOIN in the subquery, it also stopped running the reducer for the main query! This only happens when there's a LATERAL VIEW in there though; if I remove the LATERAL VIEW, the main query still gets a reducer to do the grouping.

Here's a gist: https://gist.github.com/2499436 containing the queries and a PHP script which you can run to execute the test case:
- set up a database called hive_mapjoin
- set up tables
- load some test data
- do the selects

You'll need the json-serde-1.1-jar-with-dependencies.jar from https://github.com/rcongiu/Hive-JSON-Serde/downloads with it though, and you'll have to change the path ;)

Not sure if this is a bug or me doing something that just isn't supposed to work.

Thanks, Ruben
RE: When/how to use partitions and buckets usefully?
We're currently sampling our access logs 1:250 by doing a modulo of 250 on the memberID. I applied the same logic to the members_map, reducing its raw size to just 0.9 MB. Now when I run the query with the MAPJOIN hint it goes:

2012-04-26 10:51:32 Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-26 10:51:33 Processing rows: 46490 Hashtable size: 46490 Memory usage: 11142520 rate: 0.006
- done -

So it uses 10 MB of memory to load the 1 MB raw-size members_map, which matches the ratio I previously had when loading the full table and crashing (250 MB required 2.6 GB). I'm not sure why it requires so much more memory than the raw file size, but it's definitely the main problem here. Now it's just the question of: is this a bug, or is it explainable and acceptable?

FYI, my most complex query dropped from 350 sec to 110 sec when it was able to MAPJOIN(); gotta love that speed if it works!

-Original Message-
From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Thursday, April 26, 2012 9:16 AM
To: user@hive.apache.org; gemini5201...@gmail.com; mgro...@oanda.com
Subject: RE: When/how to use partitions and buckets usefully?

The table isn't compressed; the uncompressed size really is 300-ish MB. Also, I haven't set mapred.child.java.opts, but I think that if it crashes at 1.5 GB the default value seems high enough?

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Thursday, April 26, 2012 5:37 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?

On 26 April 2012 at 8:59 AM, Mark Grover wrote:
I am hoping that other people who have used map join can pitch in here... When the smaller table gets loaded into the mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible at all in this case that the compressed size of the smaller table is less than the memory available but the uncompressed size isn't?

Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

- Original Message -
From: "Ruben de Vries"
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work; that setting is only used when Hive is trying to automatically figure out if it should do a map join, while I'm forcing a map join with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki under LanguageManual+Joins.

On 25 April 2012 at 3:36 PM, gemini alex wrote:
It seems you're using the default Hive configuration; the default map join only allows 25M for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=3

On 25 April 2012 at 12:09 AM, Ruben de Vries wrote:
I got the (rather big) log here in a GitHub gist: https://gist.github.com/2480893 and I also attached the plan.xml it was using to the gist.

When loading the members_map (11 mil records, 320 MB, ~30 bytes per record), it seems to take about 198 bytes per record in the members_map, resulting in crashing around 7 mil records with 1.4 GB loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where:
- birthday is a string containing yyyy-MM-dd
- gender is a tinyint, 1, 2 or 3
- member_id is an int, with the highest member_id being 14343249 (14 mil)

The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the same effect. Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ... Can anyone make anything out of this? I'd rather go with this if at all possible; otherwise I have to do it the hard way and migrate all the visit_stats into buckets so they can match the members_map on that?

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben,
The operation you are seeing in your log is the preparation of the hash table of the smaller table. This hash table file is compressed and loaded into the Distributed Cache, and from there it is used for map-side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB; that data is too large to be loaded into the memory of the Hive client.
2012-04-24 10:31:02 Processing rows:
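As a back-of-the-envelope check that the figures quoted in this thread are internally consistent (11 mil records at roughly 198 bytes of in-memory hash-table footprint per ~30-byte raw record):

```shell
rows=11000000                 # records in members_map (figure from the thread)
bytes_per_row_in_memory=198   # observed per-record footprint (from the thread)
est_mb=$(( rows * bytes_per_row_in_memory / 1024 / 1024 ))
echo "estimated hash table size: ${est_mb} MB"
```

That comes out a little over 2 GB for the full table, which lines up with the reported crash around 7 mil records at 1.4 GB loaded.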
RE: When/how to use partitions and buckets usefully?
The table isn't compressed; the uncompressed size really is 300-ish MB. Also, I haven't set mapred.child.java.opts, but I think that if it crashes at 1.5 GB the default value seems high enough?

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Thursday, April 26, 2012 5:37 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?

On 26 April 2012 at 8:59 AM, Mark Grover wrote:
I am hoping that other people who have used map join can pitch in here... When the smaller table gets loaded into the mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible at all in this case that the compressed size of the smaller table is less than the memory available but the uncompressed size isn't?

Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

- Original Message -
From: "Ruben de Vries"
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work; that setting is only used when Hive is trying to automatically figure out if it should do a map join, while I'm forcing a map join with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki under LanguageManual+Joins.

On 25 April 2012 at 3:36 PM, gemini alex wrote:
It seems you're using the default Hive configuration; the default map join only allows 25M for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=3

On 25 April 2012 at 12:09 AM, Ruben de Vries wrote:
I got the (rather big) log here in a GitHub gist: https://gist.github.com/2480893 and I also attached the plan.xml it was using to the gist.

When loading the members_map (11 mil records, 320 MB, ~30 bytes per record), it seems to take about 198 bytes per record in the members_map, resulting in crashing around 7 mil records with 1.4 GB loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where:
- birthday is a string containing yyyy-MM-dd
- gender is a tinyint, 1, 2 or 3
- member_id is an int, with the highest member_id being 14343249 (14 mil)

The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the same effect. Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ... Can anyone make anything out of this? I'd rather go with this if at all possible; otherwise I have to do it the hard way and migrate all the visit_stats into buckets so they can match the members_map on that?

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben,
The operation you are seeing in your log is the preparation of the hash table of the smaller table. This hash table file is compressed and loaded into the Distributed Cache, and from there it is used for map-side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB; that data is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1,468,378,760 rate: 0.788

Can you enable debug logging and post the console output to get a better picture of why it consumes this much memory? Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console;

Regards,
Bejoy KS

From: Ruben de Vries
To: "user@hive.apache.org"
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory? It seems like it's loading the visit_stats table instead?! I tried doing MAPJOIN(visit_stats), but it fails with a non-existing class (my JSONSerDe).

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject:
RE: When/how to use partitions and buckets usefully?
I already tried running with that set to 400 MB, but it didn't work; that setting is only used when Hive is trying to automatically figure out if it should do a map join, while I'm forcing a map join with a hint.

From: gemini alex [mailto:gemini5201...@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki under LanguageManual+Joins.

On 25 April 2012 at 3:36 PM, gemini alex wrote:
It seems you're using the default Hive configuration; the default map join only allows 25M for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=3

On 25 April 2012 at 12:09 AM, Ruben de Vries wrote:
I got the (rather big) log here in a GitHub gist: https://gist.github.com/2480893 and I also attached the plan.xml it was using to the gist.

When loading the members_map (11 mil records, 320 MB, ~30 bytes per record), it seems to take about 198 bytes per record in the members_map, resulting in crashing around 7 mil records with 1.4 GB loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where:
- birthday is a string containing yyyy-MM-dd
- gender is a tinyint, 1, 2 or 3
- member_id is an int, with the highest member_id being 14343249 (14 mil)

The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the same effect. Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ... Can anyone make anything out of this? I'd rather go with this if at all possible; otherwise I have to do it the hard way and migrate all the visit_stats into buckets so they can match the members_map on that?

-Original Message-
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben,
The operation you are seeing in your log is the preparation of the hash table of the smaller table. This hash table file is compressed and loaded into the Distributed Cache, and from there it is used for map-side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB; that data is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1,468,378,760 rate: 0.788

Can you enable debug logging and post the console output to get a better picture of why it consumes this much memory? Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console;

Regards,
Bejoy KS

From: Ruben de Vries
To: "user@hive.apache.org"
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory? It seems like it's loading the visit_stats table instead?! I tried doing MAPJOIN(visit_stats), but it fails with a non-existing class (my JSONSerDe).

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the Hive client itself before starting a map, so splitting to mappers is out of the question. Can you do a dfs count for the members_map table's HDFS location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries wrote:
Hmm, I must be doing something wrong; the members_map table is 300-ish MB. When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
  date_int,
  members_map.gender AS gender,
  'generic',
  COUNT( memberId ) AS unique,
  SUM( `generic`['count'] ) AS count,
  SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:

2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20 Hashtable size: 19 Memory usage: 43501848 rate: 0.023
2012-04-24 10:30:54 Processing rows: 690 Hashtable size: 689 Memory usage: 1449867552 rate: 0.778
2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1468378760 rate:
RE: subquery + lateral view fails without count
Awh, poor pastie :( Here's a gist instead: https://gist.github.com/2487666 with the working and crashing queries, the create statements and the error attached too.

-Original Message-
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Wednesday, April 25, 2012 1:40 AM
To: user@hive.apache.org
Subject: Re: subquery + lateral view fails without count

Hi Ruben, looks like pastie (http://pastie.org/) is down because of recent DDoS attacks. Can you please post your queries elsewhere?

Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

- Original Message -
From: "Ruben de Vries"
To: user@hive.apache.org
Sent: Monday, April 23, 2012 9:08:16 AM
Subject: subquery + lateral view fails without count

It's a bit of a weird case, but I thought I'd share it and hopefully find someone who can confirm this to be a bug, or tell me I should do things differently! Here you can find a pastie with the full create and select queries: http://pastie.org/3838924

I've got two tables:
- `visit_stats` with cols date_int, memberId and parts
- `member_map` with cols member_id, gender

I use LATERAL VIEW explode(parts) to explode the parts MAP and group on those, and I want to JOIN the member_map onto the visit_stats to group by gender. Since LATERAL VIEW and JOIN together aren't supported, I'm doing the join in a subquery and then applying the LATERAL VIEW onto that.
It's working as long as my query contains a COUNT( memberId ), but if I remove that from the result it will crash with the error below:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:431)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:157)
    ... 14 more
Caused by: java.lang.RuntimeException: cannot find field _col1 from [0:_col4, 1:_col6, 2:_col10]
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:345)
    at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:896)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:922)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.initializeOp(JoinOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:150)
    ... 14 more
RE: When/how to use partitions and buckets usefully?
I got the (rather big) log here in a github gist: https://gist.github.com/2480893 and I also attached the plan.xml it was using to the gist.

When loading the members_map (11 mil records, 320mb, 30b per record), it seems to take about 198b per record in the hash table, resulting in crashing around 7 mil records with 1.4gb loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where:
- birthday is a string containing YYYY-MM-DD
- gender is a tinyint, 1, 2 or 3
- member_id is an int, with the highest member_id being 14343249 (14 mil)

The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing but with an empty visit_stats table, with the same effect. Some of the blogs I read talk about a 25mb small table, not 300mb like mine ... Can anyone make anything out of this? I'd rather go with this if at all possible; otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The operation you are seeing in your log is the preparation of the hash table of the smaller table. This hash table file is compressed and loaded into the Distributed Cache, and from there it is used for map-side joins. From your console log the hash table size/data size has grown to nearly 1.5 GB; the data is too large to be loaded into the memory of the hive client.

2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1,468,378,760 rate: 0.788

Can you enable debug logging and post the console output to get a better picture of why it consumes this much memory?
Start your hive shell as: hive -hiveconf hive.root.logger=ALL,console

Regards
Bejoy KS

____
From: Ruben de Vries
To: "user@hive.apache.org"
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
1    1    247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
442    441    1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory? It seems like it's loading the visit_stats table instead?! I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error (my JSONSerde).

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the hive client itself before starting a map, so splitting to mappers is out of the question. Can you do a dfs count for the members_map table hdfs location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries wrote:
Hmm I must be doing something wrong, the members_map table is 300ish MB.
When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
    date_int,
    members_map.gender AS gender,
    'generic',
    COUNT( memberId ) AS unique,
    SUM( `generic`['count'] ) AS count,
    SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:

2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20 Hashtable size: 19 Memory usage: 43501848 rate: 0.023
2012-04-24 10:30:54 Processing rows: 690 Hashtable size: 689 Memory usage: 1449867552 rate: 0.778
2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1468378760 rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

I'm running it on my local, single-node dev env; could that be a problem since it won't split over multiple mappers in this case?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use a map-side join, some Hive configuration properties need to be enabled. For plain map-side joins:

hive> SET hive.auto.convert.join=true;

Latest version
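For reference, the point at which Hive decides a table is "small" enough for this local hash-table step is governed by a few settings. The property names below exist in the Hive configuration of this era; the values shown are the defaults, not a recommendation — the ~25 MB default matches the "25mb small table" figure the blogs mention, which members_map at ~300 MB is well past:

```
-- Let Hive convert joins to map joins automatically
SET hive.auto.convert.join=true;
-- Tables at or below this size (bytes) count as 'small' (default ~25 MB)
SET hive.mapjoin.smalltable.filesize=25000000;
-- Fraction of the local task's heap the hash table may use before aborting
SET hive.mapjoin.localtask.max.memory.usage=0.90;
```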
FW: When/how to use partitions and buckets usefully?
Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
1    1    247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
442    441    1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory? It seems like it's loading the visit_stats table instead?! I tried doing MAPJOIN(visit_stats) but it fails with a non-existing class error (my JSONSerde).

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the hive client itself before starting a map, so splitting to mappers is out of the question. Can you do a dfs count for the members_map table hdfs location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries wrote:
Hmm I must be doing something wrong, the members_map table is 300ish MB.
When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
    date_int,
    members_map.gender AS gender,
    'generic',
    COUNT( memberId ) AS unique,
    SUM( `generic`['count'] ) AS count,
    SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:

2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20 Hashtable size: 19 Memory usage: 43501848 rate: 0.023
2012-04-24 10:30:54 Processing rows: 690 Hashtable size: 689 Memory usage: 1449867552 rate: 0.778
2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1468378760 rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

I'm running it on my local, single-node dev env; could that be a problem since it won't split over multiple mappers in this case?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use a map-side join, some Hive configuration properties need to be enabled. For plain map-side joins:

hive> SET hive.auto.convert.join=true;

Latest versions of Hive do a map join on the smaller table even if no map join hint is provided. For bucketed map joins:

hive> SET hive.optimize.bucketmapjoin=true

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Regards
Bejoy

From: Nitin Pawar
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?

If you are doing a map side join make sure the table members_map is small enough to hold in memory.

On 4/24/12, Ruben de Vries wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group by's and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to entirely load into
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". Higher x means you are
> being more conservative by having a larger number of buckets (and
> bearing the increased overhead), lower x means the reverse. What x to
> use would depend on your use case. This is because the number of bu
RE: When/how to use partitions and buckets usefully?
Hmm I must be doing something wrong, the members_map table is 300ish MB.

When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
    date_int,
    members_map.gender AS gender,
    'generic',
    COUNT( memberId ) AS unique,
    SUM( `generic`['count'] ) AS count,
    SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:

2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-24 10:26:00 Processing rows: 20 Hashtable size: 19 Memory usage: 43501848 rate: 0.023
2012-04-24 10:30:54 Processing rows: 690 Hashtable size: 689 Memory usage: 1449867552 rate: 0.778
2012-04-24 10:31:02 Processing rows: 700 Hashtable size: 699 Memory usage: 1468378760 rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

I'm running it on my local, single-node dev env; could that be a problem since it won't split over multiple mappers in this case?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The map join hint is provided to Hive using the "MAPJOIN" keyword, as in:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use a map-side join, some Hive configuration properties need to be enabled. For plain map-side joins:

hive> SET hive.auto.convert.join=true;

Latest versions of Hive do a map join on the smaller table even if no map join hint is provided. For bucketed map joins:

hive> SET hive.optimize.bucketmapjoin=true

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Regards
Bejoy

From: Nitin Pawar
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?
If you are doing a map side join make sure the table members_map is small enough to hold in memory.

On 4/24/12, Ruben de Vries wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group by's and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to entirely load into
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". Higher x means you are
> being more conservative by having a larger number of buckets (and
> bearing the increased overhead), lower x means the reverse. What x to
> use would depend on your use case. This is because the number of buckets in a
> table is fixed.
> If you have a large partition, it would distribute its data into
> bulkier buckets and you would want to make sure these bulkier buckets
> can still fit in memory. Moreover, buckets are generated using a
> hashing function; if you have a strong bias towards a particular value
> of the bucketing column in your data, some buckets might be bulkier than
> others. In that case, you'd want to make sure that those bulkier buckets can
> still fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). More standard
> deviation means you should be more conservative in your calculation and
> vice-versa.
> * Distribution of the data in the bucketing columns. "Wider"
> distribution means you should be more conservative and vice-versa.
>
> Long story short, I would say, x of 2 to 4 should suffice in most
> cases but feel free to verify t
RE: When/how to use partitions and buckets usefully?
Wow thanks everyone for the nice feedback!

I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?

Cheers,

Ruben de Vries

-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Tuesday, April 24, 2012 3:17 AM
To: user@hive.apache.org; bejoy ks
Cc: Ruben de Vries
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben,

Like Bejoy pointed out, members_map is small enough to fit in memory, so your joins with visit_stats would be much faster with a map-side join.

However, there is still some virtue in bucketing visit_stats. Bucketing can optimize joins, group by's and potentially other queries in certain circumstances. You probably want to keep consistent bucketing columns across all your tables so they can be leveraged in multi-table queries. Most people use some power of 2 as their number of buckets. To make the best use of the buckets, each of your buckets should be able to load entirely into memory on the node.

I use something close to the formula below to calculate the number of buckets:

#buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode

I call x (>1) the "factor of conservatism". Higher x means you are being more conservative by having a larger number of buckets (and bearing the increased overhead), lower x means the reverse. What x to use would depend on your use case. This is because the number of buckets in a table is fixed. If you have a large partition, it would distribute its data into bulkier buckets and you would want to make sure these bulkier buckets can still fit in memory. Moreover, buckets are generated using a hashing function; if you have a strong bias towards a particular value of the bucketing column in your data, some buckets might be bulkier than others. In that case, you'd want to make sure that those bulkier buckets can still fit in memory.

To summarize, it depends on:
* How the actual partition sizes vary from the average partition size (i.e. the standard deviation of your partition size). More standard deviation means you should be more conservative in your calculation and vice-versa.
* Distribution of the data in the bucketing columns. "Wider" distribution means you should be more conservative and vice-versa.

Long story short, I would say an x of 2 to 4 should suffice in most cases, but feel free to verify that in your case :-) I would love to hear what factors others have been using when calculating their number of buckets, BTW! Whatever answer you get for #buckets from the above formula, use the closest power of 2 as the number of buckets in your table (I am not sure if this is a must, though).

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Bejoy KS"
To: "Ruben de Vries" , user@hive.apache.org
Sent: Monday, April 23, 2012 12:39:17 PM
Subject: Re: When/how to use partitions and buckets usefully?

If data is in hdfs, then you can bucket it only after loading into a temp/staging table and then to the final bucketed table. Bucketing needs a Map reduce job.

Regards
Bejoy KS
Sent from handheld, please excuse typos.

From: Ruben de Vries
Date: Mon, 23 Apr 2012 18:13:20 +0200
To: user@hive.apache.org; bejoy...@yahoo.com
Subject: RE: When/how to use partitions and buckets usefully?

Thanks for the help so far guys, I bucketed the members_map, it's 330mb in size (11 mil records). Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables; replacing that with INSERT OVERWRITE SELECT slows it down a lot.

From: Bejoy KS [mailto:bejoy...@yahoo.com]
Sent: Monday, April 23, 2012 6:06 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?
For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other. Regards Bejoy KS Sent from handheld, please excuse typos. From: "Bejoy KS" < bejoy...@yahoo.com > Date: Mon, 23 Apr 2012 16:03:34 + To: < user@hive.apache.org > ReplyTo: bejoy...@yahoo.com Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. From: Ruben de Vries < ruben.devr...@hyves.nl > Date: Mon, 23 Apr 2012 17:38:20 +0200 To: u
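Mark's bucket-count formula quoted in this thread can be turned into a quick back-of-the-envelope calculation, rounding up to a power of two as he suggests. The numbers below are purely illustrative:

```python
import math

def num_buckets(avg_partition_bytes, jvm_memory_bytes, x=2):
    """#buckets = (x * Average_partition_size) / JVM_memory, rounded up
    to a power of 2. x > 1 is Mark's "factor of conservatism"."""
    raw = (x * avg_partition_bytes) / jvm_memory_bytes
    if raw <= 1:
        return 1
    return 2 ** math.ceil(math.log2(raw))

# e.g. 4 GB average partition, 1 GB task JVM, x = 2 -> 8 buckets
print(num_buckets(4 * 2**30, 2**30))
```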
RE: When/how to use partitions and buckets usefully?
Thanks for the help so far guys, I bucketed the members_map, it's 330mb in size (11 mil records).

Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables; replacing that with INSERT OVERWRITE SELECT slows it down a lot.

From: Bejoy KS [mailto:bejoy...@yahoo.com]
Sent: Monday, April 23, 2012 6:06 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

For Bucketed map join, both tables should be bucketed and the number of buckets of one should be a multiple of the other.

Regards
Bejoy KS
Sent from handheld, please excuse typos.

From: "Bejoy KS" <bejoy...@yahoo.com>
Date: Mon, 23 Apr 2012 16:03:34 +
To: user@hive.apache.org
ReplyTo: bejoy...@yahoo.com
Subject: Re: When/how to use partitions and buckets usefully?

Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in the Distributed Cache; then map-side joins would offer you much performance improvement.

Regards
Bejoy KS
Sent from handheld, please excuse typos.

________
From: Ruben de Vries <ruben.devr...@hyves.nl>
Date: Mon, 23 Apr 2012 17:38:20 +0200
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

Ok, very clear on the partitions: try to make them match the WHERE clauses, not so much the group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats?
From: Tucker, Matt [mailto:matt.tuc...@disney.com]
Sent: Monday, April 23, 2012 5:30 PM
To: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins.

Matt Tucker

From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Monday, April 23, 2012 11:19 AM
To: user@hive.apache.org
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to set up and use partitions and buckets. But I'm more interested in how to figure out when and which columns you should be partitioning and bucketing to increase performance?!

In my case I got 2 tables: visit_stats (member_id, date and some MAP cols which give me info about the visits) and member_map (member_id, gender, age). Usually I group by date and then one of the other cols, so I assume that partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower; can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
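Putting Matt's and Bejoy's advice together (partition the fact table on the date column used in WHERE clauses, bucket both tables on the join key, and remember that bucketing needs a MapReduce job so plain LOAD DATA will not bucket), the DDL might look like the sketch below. Column types and the bucket count are hypothetical, inferred from this thread, not taken from the actual schemas:

```
SET hive.enforce.bucketing = true;

-- Small dimension table, bucketed on the join key:
CREATE TABLE member_map_b (
  member_id INT,
  gender    TINYINT,
  age       INT
)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- Fact table, partitioned on the column used in WHERE clauses and
-- bucketed on the same join key (for a bucketed map join the bucket
-- counts must be multiples of each other):
CREATE TABLE visit_stats_p (
  memberId INT,
  parts    MAP<STRING, BIGINT>
)
PARTITIONED BY (date_int INT)
CLUSTERED BY (memberId) INTO 32 BUCKETS;

-- Populate via INSERT ... SELECT from a staging table; LOAD DATA only
-- moves files and would leave the data unbucketed:
INSERT OVERWRITE TABLE member_map_b
SELECT member_id, gender, age FROM member_map;
```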
RE: When/how to use partitions and buckets usefully?
Ok, very clear on the partitions: try to make them match the WHERE clauses, not so much the group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:matt.tuc...@disney.com]
Sent: Monday, April 23, 2012 5:30 PM
To: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins.

Matt Tucker

From: Ruben de Vries [mailto:ruben.devr...@hyves.nl]
Sent: Monday, April 23, 2012 11:19 AM
To: user@hive.apache.org
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
When/how to use partitions and buckets usefully?
It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?!

In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
subquery + lateral view fails without count
It's a bit of a weird case but I thought I might share it and hopefully find someone who can confirm this to be a bug or tell me I should do things differently! Here you can find a pastie with the full create and select queries: http://pastie.org/3838924

I've got two tables:
`visit_stats` with cols date_int, memberId and parts
`member_map` with cols member_id, gender

I use LATERAL VIEW explode(parts) to explode the parts MAP and group on those, and I want to JOIN the member_map onto the visit_stats to group by gender. Since LATERAL VIEW and JOIN together aren't supported, I'm doing the join in a subquery and then LATERAL VIEW onto that.

It's working as long as my query contains a COUNT( memberId ), but if I remove that from the result it will crash with the error below:

java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:431)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
	... 9 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
	at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:157)
	... 14 more
Caused by: java.lang.RuntimeException: cannot find field _col1 from [0:_col4, 1:_col6, 2:_col10]
	at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:345)
	at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
	at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
	at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:896)
	at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:922)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
	at org.apache.hadoop.hive.ql.exec.JoinOperator.initializeOp(JoinOperator.java:60)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
	at org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:150)
	... 14 more
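A "cannot find field _colN" failure in this pattern typically means an internal column alias got lost between the subquery and the outer operators, and explicitly aliasing every subquery column (and referencing only those aliases outside) often works around it. The actual queries are in the pastie linked above; the sketch below is a hypothetical reconstruction from the column names mentioned in this thread, not the original query:

```
-- Join first (aliasing every column), then LATERAL VIEW over the subquery
SELECT sub.date_int,
       sub.gender,
       part_name
FROM (
  SELECT vs.date_int AS date_int,
         vs.memberId AS memberId,
         vs.parts    AS parts,
         mm.gender   AS gender
  FROM visit_stats vs
  JOIN member_map mm ON (mm.member_id = vs.memberId)
) sub
LATERAL VIEW explode(sub.parts) parts_tbl AS part_name, part_info
GROUP BY sub.date_int, sub.gender, part_name;
```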
RE: using the key from a SequenceFile
You're a lifesaver!

From: Dilip Joseph [mailto:dilip.antony.jos...@gmail.com]
Sent: Thursday, April 19, 2012 5:47 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

An example input format for using SequenceFile keys in hive is at https://gist.github.com/2421795 . The code just reverses how the key and value are accessed in the standard SequenceFileRecordReader and SequenceFileInputFormat that come with hadoop. You can use this custom input format by specifying the following when you create the table:

STORED AS INPUTFORMAT 'com.mycompany.SequenceFileKeyInputFormat'

Dilip

On Thu, Apr 19, 2012 at 6:09 AM, Owen O'Malley <omal...@apache.org> wrote:
On Thu, Apr 19, 2012 at 3:07 AM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:
> I'm trying to migrate a part of our current hadoop jobs from normal
> mapreduce jobs to hive,
>
> Previously the data was stored in sequencefiles with the keys containing
> valueable data!

I think you'll want to define your table using a custom InputFormat that creates a virtual row based on both the key and value and then use the 'STORED AS INPUTFORMAT ...'

-- Owen

--
Dilip Antony Joseph
http://csgrad.blogspot.com
http://www.marydilip.info
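Dilip's gist is linked above; purely as an illustrative, untested sketch of the idea (class name and generics invented here, old `org.apache.hadoop.mapred` API), the key/value reversal amounts to a RecordReader that hands Hive the file's key where Hive expects the row value:

```
// Hypothetical sketch -- not the gist's actual code. Delegates to the
// standard SequenceFileRecordReader but swaps key and value, so the
// SequenceFile key becomes visible to Hive as table data.
public class KeyAsValueRecordReader<K extends Writable, V extends Writable>
    implements RecordReader<V, K> {

  private final SequenceFileRecordReader<K, V> inner;

  public KeyAsValueRecordReader(Configuration conf, FileSplit split)
      throws IOException {
    // Note: SequenceFileRecordReader has no no-arg constructor -- it must
    // be built with (Configuration, FileSplit), which is exactly the
    // compile error quoted later in this thread.
    inner = new SequenceFileRecordReader<K, V>(conf, split);
  }

  public V createKey()   { return inner.createValue(); } // positions swapped
  public K createValue() { return inner.createKey(); }   // Hive's row = file key

  public boolean next(V key, K value) throws IOException {
    // The file's key is written into 'value', which Hive deserializes
    // as the row; the file's value lands in 'key' and is ignored.
    return inner.next(value, key);
  }

  public long getPos() throws IOException { return inner.getPos(); }
  public void close() throws IOException { inner.close(); }
  public float getProgress() throws IOException { return inner.getProgress(); }
}
```

A matching FileInputFormat subclass would return this reader from getRecordReader, and the table would then reference it via STORED AS INPUTFORMAT as Dilip describes.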
RE: using the key from a SequenceFile
Hive can handle a sequence file just like a text file, only it omits the key completely and only uses the value part of it; other than that you won't notice the difference between a sequence or plain text file.

From: David Kulp [mailto:dk...@fiksu.com]
Sent: Thursday, April 19, 2012 2:13 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

I'm trying to achieve something very similar. I want to write an MR program that writes results in a record-based sequencefile that would be directly readable from hive as though it were created using "STORED AS SEQUENCEFILE" with, say, BinarySortableSerDe.

From this discussion it seems that Hive does not / cannot take advantage of the key/values in a sequencefile, but rather it requires a value that is serialized using a SerDe. Is that right?

If so, does that mean that the right approach is to use the BinarySortableSerDe to pass the collector a row's worth of data as the Writable value? And would Hive "just work" on such data? If SequenceFileOutputFormat is used, will it automatically place sync markers in the file to allow for file splitting?

Thanks!

(ps. As an aside, Avro would be better. Wouldn't it be a huge win for MapReduce to have an AvroOutputFileFormat and for Hive to have a serde that read such files? It seems like there's a natural correspondence between the richer data representations of an SQL schema and an Avro schema, and there's already code for working with Avro in MR as input.)

On Apr 19, 2012, at 6:15 AM, madhu phatak wrote:

Serde will allow you to create custom data from your sequence file: https://cwiki.apache.org/confluence/display/Hive/SerDe

On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:

I'm trying to migrate a part of our current hadoop jobs from normal mapreduce jobs to hive. Previously the data was stored in sequencefiles with the keys containing valuable data!

However if I load the data into a table I lose that key data (or at least I can't access it with hive); I want to somehow use the key from the sequence file in hive. I know this has come up before since I can find some hints of people needing it, but I can't seem to find a working solution, and since I'm not very good with Java I really can't get it done myself :(. Does anyone have a snippet of something like this working?

I get errors like:

../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {

Hope someone has a snippet or can help me out; would really love to be able to switch part of our jobs to hive.

Ruben de Vries

--
https://github.com/zinnia-phatak-dev/Nectar
RE: using the key from a SequenceFile
Afaik a SerDe only serializes / deserializes the value part of the sequence file :( ?

From: madhu phatak [mailto:phatak@gmail.com]
Sent: Thursday, April 19, 2012 12:16 PM
To: user@hive.apache.org
Subject: Re: using the key from a SequenceFile

A SerDe will allow you to create custom data from your sequence file: https://cwiki.apache.org/confluence/display/Hive/SerDe

On Thu, Apr 19, 2012 at 3:37 PM, Ruben de Vries <ruben.devr...@hyves.nl> wrote:

I'm trying to migrate a part of our current hadoop jobs from normal mapreduce jobs to hive. Previously the data was stored in sequence files with the keys containing valuable data! However if I load the data into a table I lose that key data (or at least I can't access it with Hive); I want to somehow use the key from the sequence file in Hive. I know this has come up before, since I can find some hints of people needing it, but I can't seem to find a working solution, and since I'm not very good with Java I really can't get it done myself :(. Does anyone have a snippet of something like this working? I get errors like:

../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {

Hope someone has a snippet or can help me out; I would really love to be able to switch part of our jobs to Hive.

Ruben de Vries

--
https://github.com/zinnia-phatak-dev/Nectar
using the key from a SequenceFile
I'm trying to migrate a part of our current hadoop jobs from normal mapreduce jobs to hive. Previously the data was stored in sequence files with the keys containing valuable data!

However if I load the data into a table I lose that key data (or at least I can't access it with Hive); I want to somehow use the key from the sequence file in Hive. I know this has come up before, since I can find some hints of people needing it, but I can't seem to find a working solution, and since I'm not very good with Java I really can't get it done myself :(. Does anyone have a snippet of something like this working? I get errors like:

../hive/mapred/CustomSeqRecordReader.java:14: cannot find symbol
[javac] symbol  : constructor SequenceFileRecordReader()
[javac] location: class org.apache.hadoop.mapred.SequenceFileRecordReader
[javac] public class CustomSeqRecordReader extends SequenceFileRecordReader implements RecordReader {

Hope someone has a snippet or can help me out; I would really love to be able to switch part of our jobs to Hive.

Ruben de Vries
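The javac error quoted above is a plain Java issue rather than a Hive one: in the old mapred API, SequenceFileRecordReader defines only an argument-taking constructor (it accepts a Configuration and a FileSplit), so Java generates no default constructor and a subclass must forward the arguments via super(...). A minimal stand-alone reproduction, with hypothetical stand-in classes instead of the Hadoop ones:

```java
// Stand-in for org.apache.hadoop.mapred.SequenceFileRecordReader: because it
// declares an argument-taking constructor, Java does NOT generate a no-arg one.
class BaseReader {
    final String split;
    BaseReader(String split) { this.split = split; }
}

// Stand-in for CustomSeqRecordReader. This compiles only because it explicitly
// calls super(split); omitting this constructor (or writing plain super())
// reproduces "cannot find symbol: constructor BaseReader()".
class CustomSeqReader extends BaseReader {
    CustomSeqReader(String split) { super(split); }
}

public class ConstructorDemo {
    public static void main(String[] args) {
        System.out.println(new CustomSeqReader("part-00000").split);
    }
}
```

Applied to the real class, the fix is to give CustomSeqRecordReader a constructor taking a Configuration and a FileSplit and call super(conf, split) from it.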