Re: Merging small files
Changed it to sort by.

On Sat, Oct 17, 2015 at 6:05 PM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
> Thanks for the tip Gopal.
> I tried what you suggested (on Tez) but I'm getting a middle stage with 1
> reducer (which is awful for performance).
>
> This is my query:
> insert into upstreamparam_org partition(day_ts, cmtsid) select * from
> upstreamparam_20151013 order by datats,macaddress;
>
> I've attached the query plan in case it might help understand why.
>
> Thank you.
> Daniel.
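For readers hitting the same 1-reducer stage: the difference Gopal is pointing at is that Hive's ORDER BY imposes a total order, forcing all rows through a single reducer, while SORT BY only sorts within each reducer, so parallelism is preserved. A rough Python sketch of the two semantics (the reducer count and hash partitioning here are illustrative, not Hive's actual shuffle):

```python
import random

def order_by(rows, key):
    # Hive ORDER BY: one global sort -> effectively a single reducer,
    # producing a single sorted output.
    return [sorted(rows, key=key)]

def sort_by(rows, key, reducers=4):
    # Hive SORT BY: rows are distributed across reducers, each of which
    # sorts only its own share; output is sorted per-reducer, not globally.
    buckets = [[] for _ in range(reducers)]
    for row in rows:
        buckets[hash(row) % reducers].append(row)
    return [sorted(b, key=key) for b in buckets]

rows = [random.randrange(1000) for _ in range(10_000)]
total = order_by(rows, key=lambda r: r)
partial = sort_by(rows, key=lambda r: r)

assert len(total) == 1                       # ORDER BY: one output
assert len(partial) == 4                     # SORT BY: one output per reducer
assert all(b == sorted(b) for b in partial)  # each reducer's output is sorted
```

Each SORT BY bucket holds a sorted subset of the data, which is why Gopal's rewrite gets as many (sorted, well-compressing) files as there are reducers instead of one bottlenecked file.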
Re: Merging small files
Thanks for the tip Gopal.
I tried what you suggested (on Tez) but I'm getting a middle stage with 1 reducer (which is awful for performance).

This is my query:

insert into upstreamparam_org partition(day_ts, cmtsid)
select * from upstreamparam_20151013 order by datats, macaddress;

I've attached the query plan in case it might help understand why.

Thank you.
Daniel.

Plan not optimized by CBO.
Vertex dependency in root stage
  Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-3
  Stats-Aggr Operator
    Stage-0
      Move Operator
        partition:{}
        table:{"name:":"default.upstreamparam_org","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
        Stage-2
          Dependency Collection{}
            Stage-1
              Reducer 2
                File Output Operator [FS_5]
                  compressed:false
                  Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                  table:{"name:":"default.upstreamparam_org","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
                  Select Operator [SEL_3]
                  |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                  |  Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                  |<-Map 1 [SIMPLE_EDGE]
                      Reduce Output Operator [RS_7]
                        key expressions:_col1 (type: bigint), _col0 (type: bigint)
                        sort order:++
                        Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                        value expressions:_col2 (type: bigint), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: float), _col7 (type: float), _col8 (type: float), _col9 (type: float), _col10 (type: float), _col11 (type: float), _col12 (type: float), _col13 (type: float), _col14 (type: float), _col15 (type: float), _col16 (type: bigint), _col17 (type: bigint), _col18 (type: bigint), _col19 (type: bigint), _col20 (type: string)
                        Select Operator [OP_6]
                          outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                          Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                          TableScan [TS_0]
                            alias:upstreamparam_20151013
                            Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
Re: Merging small files
> Is there a more efficient way to have Hive merge small files on the
> files without running with two passes?

Not entirely an efficient way, but adding a shuffle stage usually works much better, as it gives you the ability to lay out the files for better vectorization.

Like for TPC-H, doing ETL with

create table lineitem as select * from lineitem sort by l_shipdate, l_suppkey;

will produce fewer files (exactly as many as your reducer #) & compresses harder due to the natural order of transactions (saves ~20Gb or so at 1000 scale).

Caveat: that is not more efficient in MRv2, only in Tez/Spark which can run MRR pipelines as-is.

Cheers,
Gopal
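A quick way to see why the "natural order of transactions" compresses harder: sorting clusters similar records together, which gives the codec long matches to exploit. A toy illustration with zlib (the record layout below is made up for the demo, not TPC-H data):

```python
import random
import zlib

# 100k fake "transaction" records with a date-like column and a small id
# domain, serialized as text lines.
rows = [f"2015-10-{random.randint(1, 31):02d},{random.randrange(500)}"
        for _ in range(100_000)]

unsorted_size = len(zlib.compress("\n".join(rows).encode()))
sorted_size = len(zlib.compress("\n".join(sorted(rows)).encode()))

# Sorting puts identical/similar lines next to each other, so the sorted
# stream compresses to a (much) smaller size.
assert sorted_size < unsorted_size
```

The exact ratio depends on the data, but the direction matches Gopal's observation: a sorted layout compresses harder, on top of producing fewer files.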
Merging small files
Hi,

We are using Hive to merge small files by setting hive.merge.smallfiles.avgsize to 12000 and doing an insert as select to a table. The problem is that this takes two passes over the data: first to insert the data, and then to merge it.

Is there a more efficient way to have Hive merge the small files without running two passes?

Thank you.
Daniel
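For context, hive.merge.smallfiles.avgsize is a threshold on the average size of a job's output files: when the average falls below it, Hive schedules the extra merge pass Daniel is describing. A rough sketch of that decision (the real logic lives in Hive's conditional merge task; this is only an approximation):

```python
def needs_merge(file_sizes, avgsize_threshold=12_000):
    """Approximate hive.merge.smallfiles.avgsize: if the average output
    file size is below the threshold, an extra merge job is scheduled."""
    if not file_sizes:
        return False
    return sum(file_sizes) / len(file_sizes) < avgsize_threshold

# Many tiny files -> average below threshold -> merge pass triggered.
assert needs_merge([4_000, 8_000, 11_000])
# Already-large files -> no second pass needed.
assert not needs_merge([20_000, 30_000])
```

This is why the setting alone cannot avoid the second pass: it only controls whether the follow-up merge job runs, not how the first job writes its files.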
Re: Merging small files in partitions
Hi Edward,

Can we do the same/similar thing for Parquet files? Any pointer?

Regards,
Mohammad

On Tuesday, June 16, 2015 2:35 PM, Edward Capriolo wrote:

https://github.com/edwardcapriolo/filecrush
Re: Merging small files in partitions
https://github.com/edwardcapriolo/filecrush
Merging small files in partitions
Hello,

I am looking for an optimized way to merge small files in hive partitions into one big file.

I came across Alter Table/Partition Concatenate (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate). The doc says this only works for RCFiles. I wish there were something similar for the TEXTFILE format.

Any suggestions?

Thanks in advance
Prasanth

This e-mail and files transmitted with it are confidential, and are intended solely for the use of the individual or entity to whom this e-mail is addressed. If you are not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you are not one of the named recipient(s) or otherwise have reason to believe that you received this message in error, please immediately notify sender by e-mail, and destroy the original message. Thank You.
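For TEXTFILE data, plain byte-level concatenation is safe (each file is just newline-delimited rows), so the effect of CONCATENATE can be approximated outside Hive. A minimal sketch — the part-file naming and directory layout are assumptions for the demo, not Hive specifics:

```python
import glob
import os
import tempfile

def crush_text_files(src_dir, dest_path):
    """Concatenate every part file in src_dir into one big file --
    the TEXTFILE analogue of ALTER TABLE ... CONCATENATE for RCFile."""
    with open(dest_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(src_dir, "part-*"))):
            with open(path, "rb") as f:
                out.write(f.read())

# Demo on a scratch directory with three tiny "part" files.
d = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(d, f"part-{i:05d}"), "w") as f:
        f.write(f"row {i}\n")

big = os.path.join(d, "merged.txt")
crush_text_files(d, big)
with open(big) as f:
    data = f.read()
assert data == "row 0\nrow 1\nrow 2\n"
```

A real job would do this per partition on HDFS (and swap the merged file in atomically), which is essentially what tools like filecrush automate.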
Re: Merging small files with dynamic partitions
I copied Hadoop19Shims' implementation of getCombineFileInputFormat (HIVE-1121) into Hadoop18Shims and it worked, if anyone is interested. And hopefully we can upgrade our Hadoop version soon :)

--
Dave Brondsema
Software Engineer
Geeknet
www.geek.net
Re: Merging small files with dynamic partitions
It seems that I can't use this with Hadoop 0.18, since Hadoop18Shims.getCombineFileInputFormat returns null, and SemanticAnalyzer.java sets HIVEMERGEMAPREDFILES to false if CombineFileInputFormat is not supported. Is that right? Maybe I can copy the Hadoop19Shims implementation of getCombineFileInputFormat into Hadoop18Shims?

--
Dave Brondsema
Software Engineer
Geeknet
www.geek.net
Re: Merging small files with dynamic partitions
I think the problem was solved in hive trunk. You can just try hive trunk.
Re: Merging small files with dynamic partitions
Hi, has there been any resolution to this? I'm having the same trouble. With Hive 0.6 and Hadoop 0.18 and a dynamic partition insert, hive.merge.mapredfiles doesn't work. It works fine for a static partition insert. What I'm seeing is that even when I set hive.merge.mapredfiles=true, the jobconf has it as false for the dynamic partition insert.

I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it looks like maybe Hadoop 0.20 is required for this?

Thanks,

--
Dave Brondsema
Software Engineer
Geeknet
www.geek.net
Re: Merging small files with dynamic partitions
Hi guys,

Thanks for the response. I tried running without hive.mergejob.maponly with the same result. I've attached the explain extended output. I am running this query on EC2 boxes; however, it's not running on EMR. Hive is running on top of a Hadoop 0.20.2 setup.

Thanks,
Sammy

--
Chief Architect, BrightEdge
email: s...@brightedge.com | mobile: 650.539.4867 | fax: 650.521.9678 | address: 1850 Gateway Dr Suite 400, San Mateo, CA 94404

Attachment: explain.log
Re: Merging small files with dynamic partitions
The output file shows it only has 2 jobs (the mapreduce job and the move task). This indicates that the plan does not have merge enabled. Merge should consist of a ConditionalTask and 2 sub-tasks (an MR task and a move task). Can you send the plan of the query?

One thing I noticed is that you are using Amazon EMR. I'm not sure if this is enabled, since SET hive.mergejob.maponly=true requires CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported some distribution of Hadoop doesn't support that). So an additional thing you can try is to remove this setting.

On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:

> Hi,
> I have a dynamic partition query which generates quite a few small
> files which I would like to merge:
>
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.compress.output=true;
> SET io.seqfile.compression.type=BLOCK;
> SET hive.merge.size.per.task=25600;
> SET hive.merge.smallfiles.avgsize=160;
> SET hive.merge.mapfiles=true;
> SET hive.merge.mapredfiles=true;
> SET hive.mergejob.maponly=true;
> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> PARTITION(org_id, day)
> SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
> referral_type, search_engine, us_search_engine,
> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> pages_viewed,
> entry_page, page_types,
> org_id, day
> FROM daily_conversions_without_rank_table;
>
> I am running the latest version from trunk with HIVE-1622, but it
> seems like I just can't get the post-merge process to happen. I have
> raised hive.merge.smallfiles.avgsize. I'm wondering if the filtering
> at runtime is causing the merge process to be skipped. Attached are
> the hive output and log files.
>
> Thanks,
> Sammy
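Ning's description of the ConditionalTask can be sketched roughly: after a dynamic-partition insert, each partition's output files are inspected, and a merge MR job is launched only for partitions whose average file size is under hive.merge.smallfiles.avgsize. An illustrative approximation (not Hive's actual code; partition keys and sizes are invented for the demo):

```python
def partitions_to_merge(partition_files, avgsize=160):
    """Pick the dynamic partitions whose output files average below the
    hive.merge.smallfiles.avgsize threshold -- roughly the check the
    ConditionalTask's merge sub-task would act on."""
    return [part for part, sizes in partition_files.items()
            if sizes and sum(sizes) / len(sizes) < avgsize]

out = {
    ("org1", "2010-10-15"): [40, 90, 120],   # avg ~83  -> merge
    ("org2", "2010-10-15"): [500, 700],      # avg 600  -> leave alone
}
assert partitions_to_merge(out) == [("org1", "2010-10-15")]
```

If the plan only shows the MR job and the move task, as in Sammy's output, this conditional step was never added, which matches Ning's diagnosis that merge was not enabled in the plan at all.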
Re: Merging small files with dynamic partitions
Sammy,

This is not the exact remedy you were looking for, but my company open-sourced our file crusher utility:

http://www.jointhegrid.com/hadoop_filecrush/index.jsp

We use it to good effect to turn many small files into one. It works with text and sequence files, and custom writables.

Edward