Re: Merging small files

2015-10-17 Thread Daniel Haviv
Changed it to SORT BY.
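
(Context: ORDER BY imposes a single global ordering, which forces everything
through one reducer; SORT BY only orders rows within each reducer. A minimal
sketch of the rewritten insert follows; the DISTRIBUTE BY clause is an
assumption for illustration, not part of the original query.)

-- SORT BY orders rows per reducer, so many reducers can run in parallel.
-- DISTRIBUTE BY (assumed, optional) routes rows with the same dynamic
-- partition key to the same reducer, keeping each partition's files together.
insert into upstreamparam_org partition(day_ts, cmtsid)
select * from upstreamparam_20151013
distribute by cmtsid
sort by datats, macaddress;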


On Sat, Oct 17, 2015 at 6:05 PM, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:

> Thanks for the tip Gopal.
> I tried what you suggested (on Tez) but I'm getting a middle stage with 1
> reducer (which is awful for performance).
>
> This is my query:
> insert into upstreamparam_org partition(day_ts, cmtsid) select * from
> upstreamparam_20151013 order by datats,macaddress;
>
> I've attached the query plan in case it might help to understand why.
>
> Thank you.
> Daniel.
>
>
>
>
> On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan wrote:
>
>>
>> > Is there a more efficient way to have Hive merge the small files
>> > without making two passes?
>>
>> Not an entirely efficient way, but adding a shuffle stage usually works
>> much better, as it gives you the ability to lay out the files for better
>> vectorization.
>>
>> Like for TPC-H, doing ETL with
>>
>> create table lineitem as select * from lineitem sort by l_shipdate,
>> l_suppkey;
>>
>> will produce fewer files (exactly as many as your reducer count) and
>> compress better due to the natural order of transactions (saves ~20 GB or
>> so at scale factor 1000).
>>
>> Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can
>> run MRR pipelines as-is.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: Merging small files

2015-10-17 Thread Daniel Haviv
Thanks for the tip Gopal.
I tried what you suggested (on Tez) but I'm getting a middle stage with 1
reducer (which is awful for performance).

This is my query:
insert into upstreamparam_org partition(day_ts, cmtsid) select * from
upstreamparam_20151013 order by datats,macaddress;

I've attached the query plan in case it might help to understand why.

Thank you.
Daniel.




On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan wrote:

>
> > Is there a more efficient way to have Hive merge the small files
> > without making two passes?
>
> Not an entirely efficient way, but adding a shuffle stage usually works
> much better, as it gives you the ability to lay out the files for better
> vectorization.
>
> Like for TPC-H, doing ETL with
>
> create table lineitem as select * from lineitem sort by l_shipdate,
> l_suppkey;
>
> will produce fewer files (exactly as many as your reducer count) and
> compress better due to the natural order of transactions (saves ~20 GB or
> so at scale factor 1000).
>
> Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can
> run MRR pipelines as-is.
>
> Cheers,
> Gopal
>
>
>
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-3
   Stats-Aggr Operator
      Stage-0
         Move Operator
            partition:{}
            table:{"name:":"default.upstreamparam_org","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
            Stage-2
               Dependency Collection{}
               Stage-1
                  Reducer 2
                     File Output Operator [FS_5]
                        compressed:false
                        Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                        table:{"name:":"default.upstreamparam_org","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde"}
                        Select Operator [SEL_3]
                        |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                        |  Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                        |<-Map 1 [SIMPLE_EDGE]
                           Reduce Output Operator [RS_7]
                              key expressions:_col1 (type: bigint), _col0 (type: bigint)
                              sort order:++
                              Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                              value expressions:_col2 (type: bigint), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: float), _col7 (type: float), _col8 (type: float), _col9 (type: float), _col10 (type: float), _col11 (type: float), _col12 (type: float), _col13 (type: float), _col14 (type: float), _col15 (type: float), _col16 (type: bigint), _col17 (type: bigint), _col18 (type: bigint), _col19 (type: bigint), _col20 (type: string)
                              Select Operator [OP_6]
                                 outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17","_col18","_col19","_col20"]
                                 Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE
                                 TableScan [TS_0]
                                    alias:upstreamparam_20151013
                                    Statistics:Num rows: 8707462208 Data size: 1767614828224 Basic stats: COMPLETE Column stats: NONE



Re: Merging small files

2015-10-16 Thread Gopal Vijayaraghavan

> Is there a more efficient way to have Hive merge the small files
> without making two passes?

Not an entirely efficient way, but adding a shuffle stage usually works
much better, as it gives you the ability to lay out the files for better
vectorization.

Like for TPC-H, doing ETL with

create table lineitem as select * from lineitem sort by l_shipdate,
l_suppkey;

will produce fewer files (exactly as many as your reducer count) and
compress better due to the natural order of transactions (saves ~20 GB or
so at scale factor 1000).
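
(Aside: the reducer count, and hence the output file count, can be tuned with
standard Hive knobs before the CTAS; the values below are illustrative, not
from this thread, and lineitem_sorted is a stand-in name.)

-- Either size it: aim for roughly 256 MB of input per reducer...
set hive.exec.reducers.bytes.per.reducer=268435456;
-- ...or pin the count (and therefore the number of output files) directly:
set mapred.reduce.tasks=64;
create table lineitem_sorted as
select * from lineitem sort by l_shipdate, l_suppkey;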

Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can
run MRR pipelines as-is.

Cheers,
Gopal




Merging small files

2015-10-16 Thread Daniel Haviv
Hi,
We are using Hive to merge small files by setting
hive.merge.smallfiles.avgsize to 12000 and doing an insert-as-select into
a table.
The problem is that this takes two passes over the data: first to insert the
data and then to merge it.

Is there a more efficient way to have Hive merge the small files without
making two passes?
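
(For context, the two-pass setup described above boils down to something like
the following; the table names and values are illustrative, not the thread's
actual settings.)

-- Pass 1 writes the data; if the resulting files' average size falls below
-- the avgsize threshold, Hive schedules a second, merge-only job (pass 2).
set hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize=128000000;  -- illustrative threshold
set hive.merge.size.per.task=256000000;       -- illustrative merged-file target
insert into target_table select * from source_table;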


Thank you.
Daniel


Re: Merging small files in partitions

2015-06-17 Thread Mohammad Islam
Hi Edward,
Can we do the same/similar thing for Parquet files? Any pointer?
Regards,
Mohammad


On Tuesday, June 16, 2015 2:35 PM, Edward Capriolo wrote:

 https://github.com/edwardcapriolo/filecrush

On Tue, Jun 16, 2015 at 5:05 PM, Chagarlamudi, Prasanth wrote:

Hello,
I am looking for an optimized way to merge small files in Hive partitions
into one big file.
I came across Alter Table/Partition Concatenate
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate.
The doc says this only works for RCFiles. I wish there were something
similar for the TEXTFILE format.
Any suggestions?
Thanks in advance
Prasanth


Re: Merging small files in partitions

2015-06-16 Thread Edward Capriolo
https://github.com/edwardcapriolo/filecrush

On Tue, Jun 16, 2015 at 5:05 PM, Chagarlamudi, Prasanth <prasanth.chagarlam...@epsilon.com> wrote:

>  Hello,
>
> I am looking for an optimized way to merge small files in hive partitions
> into one big file.
>
> I came across Alter Table/Partition Concatenate
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate.
> The doc says this only works for RCFiles. I wish there were something
> similar for the TEXTFILE format.
>
> Any suggestions?
>
>
>
> Thanks in advance
>
> Prasanth


Merging small files in partitions

2015-06-16 Thread Chagarlamudi, Prasanth
Hello,
I am looking for an optimized way to merge small files in Hive partitions into
one big file.
I came across Alter Table/Partition Concatenate
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate.
The doc says this only works for RCFiles. I wish there were something similar
for the TEXTFILE format.
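
(For reference, the command that wiki page documents looks like the following;
the table name and partition spec here are hypothetical.)

-- Rewrites a partition's many small files as fewer large ones, in place.
-- Per the doc above, this works only for RCFile-backed tables.
alter table my_table partition (dt='2015-06-16') concatenate;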
Any suggestions?

Thanks in advance
Prasanth


Re: Merging small files with dynamic partitions

2010-11-12 Thread Dave Brondsema
I copied Hadoop19Shims' implementation of getCombineFileInputFormat
(HIVE-1121) into Hadoop18Shims and it worked, if anyone is interested.

And hopefully we can upgrade our Hadoop version soon :)

On Fri, Nov 12, 2010 at 12:44 PM, Dave Brondsema wrote:

> It seems that I can't use this with Hadoop 0.18 since the
> Hadoop18Shims.getCombineFileInputFormat returns null, and
> SemanticAnalyzer.java sets HIVEMERGEMAPREDFILES to false if
> CombineFileInputFormat is not supported.  Is that right?  Maybe I can copy
> the Hadoop19Shims implementation of getCombineFileInputFormat into
> Hadoop18Shims?
>
>
> On Wed, Nov 10, 2010 at 4:31 PM, yongqiang he wrote:
>
>> I think the problem was solved in Hive trunk. You can just try Hive trunk.
>>
>> > On Wed, Nov 10, 2010 at 10:05 AM, Dave Brondsema wrote:
>> > Hi, has there been any resolution to this?  I'm having the same trouble.
>> >  With Hive 0.6 and Hadoop 0.18 and a dynamic partition
>> > insert, hive.merge.mapredfiles doesn't work.  It works fine for a static
>> > partition insert.  What I'm seeing is that even when I
>> > set hive.merge.mapredfiles=true, the jobconf has it as false for the
>> dynamic
>> > partition insert.
>> > I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it
>> looks
>> > like maybe Hadoop 0.20 is required for this?
>> > Thanks,
>> >
>> > On Sat, Oct 16, 2010 at 1:50 AM, Sammy Yu  wrote:
>> >>
>> >> Hi guys,
>> >>   Thanks for the response.   I tried running without
>> >> hive.mergejob.maponly with the same result.  I've attached the explain
>> >> extended output.  I am running this query on EC2 boxes; however, it's
>> >> not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.
>> >>
>> >> Thanks,
>> >> Sammy
>> >>
>> >> On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang wrote:
>> >> > The output file shows it only has 2 jobs (the mapreduce job and the
>> >> > move task). This indicates that the plan does not have merge enabled.
>> >> > Merge should consist of a ConditionalTask and two subtasks (an MR task
>> >> > and a move task). Can you send the plan of the query?
>> >> >
>> >> > One thing I noticed is that you are using Amazon EMR. I'm not sure if
>> >> > this is enabled, since SET hive.mergejob.maponly=true requires
>> >> > CombineHiveInputFormat (only available in Hadoop 0.20, and someone
>> >> > reported that some distributions of Hadoop don't support it). So an
>> >> > additional thing you can try is to remove this setting.
>> >> >
>> >> > On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
>> >> >
>> >> >> Hi,
>> >> >>  I have a dynamic partition query which generates quite a few small
>> >> >> files which I would like to merge:
>> >> >>
>> >> >> SET hive.exec.dynamic.partition.mode=nonstrict;
>> >> >> SET hive.exec.dynamic.partition=true;
>> >> >> SET hive.exec.compress.output=true;
>> >> >> SET io.seqfile.compression.type=BLOCK;
>> >> >> SET hive.merge.size.per.task=25600;
>> >> >> SET hive.merge.smallfiles.avgsize=160;
>> >> >> SET hive.merge.mapfiles=true;
>> >> >> SET hive.merge.mapredfiles=true;
>> >> >> SET hive.mergejob.maponly=true;
>> >> >> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
>> >> >> PARTITION(org_id, day)
>> >> >> SELECT session_id, permanent_id, first_date, last_date, week, month,
>> >> >> quarter,
>> >> >> referral_type, search_engine, us_search_engine,
>> >> >> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
>> >> >> pages_viewed,
>> >> >> entry_page, page_types,
>> >> >> org_id, day
>> >> >> FROM daily_conversions_without_rank_table;
>> >> >>
>> >> >> I am running the latest version from trunk with HIVE-1622, but it
>> >> >> seems like I just can't get the post merge process to happen. I have
>> >> >> raised hive.merge.smallfiles.avgsize.  I'm wondering if the
>> filtering
>> >> >> at runtime is causing the merge process to be skipped.  Attached are
>> >> >> the hive output and log files.
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> Sammy
>> >> >> 
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Chief Architect, BrightEdge
>> >> email: s...@brightedge.com   |   mobile: 650.539.4867  |   fax:
>> >> 650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
>> >> 94404
>> >
>> >
>> >
>> > --
>> > Dave Brondsema
>> > Software Engineer
>> > Geeknet
>> >
>> > www.geek.net
>> >
>>
>
>
>
> --
> Dave Brondsema
> Software Engineer
> Geeknet
>
> www.geek.net
>



-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net


Re: Merging small files with dynamic partitions

2010-11-12 Thread Dave Brondsema
It seems that I can't use this with Hadoop 0.18 since the
Hadoop18Shims.getCombineFileInputFormat returns null, and
SemanticAnalyzer.java sets HIVEMERGEMAPREDFILES to false if
CombineFileInputFormat is not supported.  Is that right?  Maybe I can copy
the Hadoop19Shims implementation of getCombineFileInputFormat into
Hadoop18Shims?

On Wed, Nov 10, 2010 at 4:31 PM, yongqiang he wrote:

> I think the problem was solved in Hive trunk. You can just try Hive trunk.
>
> > On Wed, Nov 10, 2010 at 10:05 AM, Dave Brondsema wrote:
> > Hi, has there been any resolution to this?  I'm having the same trouble.
> >  With Hive 0.6 and Hadoop 0.18 and a dynamic partition
> > insert, hive.merge.mapredfiles doesn't work.  It works fine for a static
> > partition insert.  What I'm seeing is that even when I
> > set hive.merge.mapredfiles=true, the jobconf has it as false for the
> dynamic
> > partition insert.
> > I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it
> looks
> > like maybe Hadoop 0.20 is required for this?
> > Thanks,
> >
> > On Sat, Oct 16, 2010 at 1:50 AM, Sammy Yu  wrote:
> >>
> >> Hi guys,
> >>   Thanks for the response.   I tried running without
> >> hive.mergejob.maponly with the same result.  I've attached the explain
> >> extended output.  I am running this query on EC2 boxes; however, it's
> >> not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.
> >>
> >> Thanks,
> >> Sammy
> >>
> >> On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang wrote:
> >> > The output file shows it only has 2 jobs (the mapreduce job and the
> >> > move task). This indicates that the plan does not have merge enabled.
> >> > Merge should consist of a ConditionalTask and two subtasks (an MR task
> >> > and a move task). Can you send the plan of the query?
> >> >
> >> > One thing I noticed is that you are using Amazon EMR. I'm not sure if
> >> > this is enabled, since SET hive.mergejob.maponly=true requires
> >> > CombineHiveInputFormat (only available in Hadoop 0.20, and someone
> >> > reported that some distributions of Hadoop don't support it). So an
> >> > additional thing you can try is to remove this setting.
> >> >
> >> > On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
> >> >
> >> >> Hi,
> >> >>  I have a dynamic partition query which generates quite a few small
> >> >> files which I would like to merge:
> >> >>
> >> >> SET hive.exec.dynamic.partition.mode=nonstrict;
> >> >> SET hive.exec.dynamic.partition=true;
> >> >> SET hive.exec.compress.output=true;
> >> >> SET io.seqfile.compression.type=BLOCK;
> >> >> SET hive.merge.size.per.task=25600;
> >> >> SET hive.merge.smallfiles.avgsize=160;
> >> >> SET hive.merge.mapfiles=true;
> >> >> SET hive.merge.mapredfiles=true;
> >> >> SET hive.mergejob.maponly=true;
> >> >> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> >> >> PARTITION(org_id, day)
> >> >> SELECT session_id, permanent_id, first_date, last_date, week, month,
> >> >> quarter,
> >> >> referral_type, search_engine, us_search_engine,
> >> >> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> >> >> pages_viewed,
> >> >> entry_page, page_types,
> >> >> org_id, day
> >> >> FROM daily_conversions_without_rank_table;
> >> >>
> >> >> I am running the latest version from trunk with HIVE-1622, but it
> >> >> seems like I just can't get the post merge process to happen. I have
> >> >> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
> >> >> at runtime is causing the merge process to be skipped.  Attached are
> >> >> the hive output and log files.
> >> >>
> >> >>
> >> >> Thanks,
> >> >> Sammy
> >> >> 
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Chief Architect, BrightEdge
> >> email: s...@brightedge.com   |   mobile: 650.539.4867  |   fax:
> >> 650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
> >> 94404
> >
> >
> >
> > --
> > Dave Brondsema
> > Software Engineer
> > Geeknet
> >
> > www.geek.net
> >
>



-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net


Re: Merging small files with dynamic partitions

2010-11-10 Thread yongqiang he
I think the problem was solved in Hive trunk. You can just try Hive trunk.

On Wed, Nov 10, 2010 at 10:05 AM, Dave Brondsema  wrote:
> Hi, has there been any resolution to this?  I'm having the same trouble.
>  With Hive 0.6 and Hadoop 0.18 and a dynamic partition
> insert, hive.merge.mapredfiles doesn't work.  It works fine for a static
> partition insert.  What I'm seeing is that even when I
> set hive.merge.mapredfiles=true, the jobconf has it as false for the dynamic
> partition insert.
> I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it looks
> like maybe Hadoop 0.20 is required for this?
> Thanks,
>
> On Sat, Oct 16, 2010 at 1:50 AM, Sammy Yu  wrote:
>>
>> Hi guys,
>>   Thanks for the response.   I tried running without
>> hive.mergejob.maponly with the same result.  I've attached the explain
>> extended output.  I am running this query on EC2 boxes; however, it's
>> not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.
>>
>> Thanks,
>> Sammy
>>
>> On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang  wrote:
>> > The output file shows it only has 2 jobs (the mapreduce job and the
>> > move task). This indicates that the plan does not have merge enabled. Merge
>> > should consist of a ConditionalTask and two subtasks (an MR task and a move
>> > task). Can you send the plan of the query?
>> >
>> > One thing I noticed is that you are using Amazon EMR. I'm not sure if
>> > this is enabled, since SET hive.mergejob.maponly=true requires
>> > CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported
>> > that some distributions of Hadoop don't support it). So an additional thing
>> > you can try is to remove this setting.
>> >
>> > On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
>> >
>> >> Hi,
>> >>  I have a dynamic partition query which generates quite a few small
>> >> files which I would like to merge:
>> >>
>> >> SET hive.exec.dynamic.partition.mode=nonstrict;
>> >> SET hive.exec.dynamic.partition=true;
>> >> SET hive.exec.compress.output=true;
>> >> SET io.seqfile.compression.type=BLOCK;
>> >> SET hive.merge.size.per.task=25600;
>> >> SET hive.merge.smallfiles.avgsize=160;
>> >> SET hive.merge.mapfiles=true;
>> >> SET hive.merge.mapredfiles=true;
>> >> SET hive.mergejob.maponly=true;
>> >> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
>> >> PARTITION(org_id, day)
>> >> SELECT session_id, permanent_id, first_date, last_date, week, month,
>> >> quarter,
>> >> referral_type, search_engine, us_search_engine,
>> >> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
>> >> pages_viewed,
>> >> entry_page, page_types,
>> >> org_id, day
>> >> FROM daily_conversions_without_rank_table;
>> >>
>> >> I am running the latest version from trunk with HIVE-1622, but it
>> >> seems like I just can't get the post merge process to happen. I have
>> >> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
>> >> at runtime is causing the merge process to be skipped.  Attached are
>> >> the hive output and log files.
>> >>
>> >>
>> >> Thanks,
>> >> Sammy
>> >> 
>> >
>> >
>>
>>
>>
>> --
>> Chief Architect, BrightEdge
>> email: s...@brightedge.com   |   mobile: 650.539.4867  |   fax:
>> 650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
>> 94404
>
>
>
> --
> Dave Brondsema
> Software Engineer
> Geeknet
>
> www.geek.net
>


Re: Merging small files with dynamic partitions

2010-11-10 Thread Dave Brondsema
Hi, has there been any resolution to this?  I'm having the same trouble.
 With Hive 0.6 and Hadoop 0.18 and a dynamic partition
insert, hive.merge.mapredfiles doesn't work.  It works fine for a static
partition insert.  What I'm seeing is that even when I
set hive.merge.mapredfiles=true, the jobconf has it as false for the dynamic
partition insert.
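
(To make the static-versus-dynamic distinction concrete, here is a
hypothetical pair of inserts; the table and column names are invented.)

-- Static partition insert: the merge settings were honored.
insert overwrite table t partition (day='2010-11-10')
select col1, col2 from src where day='2010-11-10';

-- Dynamic partition insert: the jobconf showed hive.merge.mapredfiles=false.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table t partition (day)
select col1, col2, day from src;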

I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it looks
like maybe Hadoop 0.20 is required for this?

Thanks,

On Sat, Oct 16, 2010 at 1:50 AM, Sammy Yu  wrote:

> Hi guys,
>   Thanks for the response.   I tried running without
> hive.mergejob.maponly with the same result.  I've attached the explain
> extended output.  I am running this query on EC2 boxes; however, it's
> not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.
>
> Thanks,
> Sammy
>
> On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang  wrote:
> > The output file shows it only has 2 jobs (the mapreduce job and the move
> > task). This indicates that the plan does not have merge enabled. Merge
> > should consist of a ConditionalTask and two subtasks (an MR task and a
> > move task). Can you send the plan of the query?
> >
> > One thing I noticed is that you are using Amazon EMR. I'm not sure if
> > this is enabled, since SET hive.mergejob.maponly=true requires
> > CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported
> > that some distributions of Hadoop don't support it). So an additional thing
> > you can try is to remove this setting.
> >
> > On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
> >
> >> Hi,
> >>  I have a dynamic partition query which generates quite a few small
> >> files which I would like to merge:
> >>
> >> SET hive.exec.dynamic.partition.mode=nonstrict;
> >> SET hive.exec.dynamic.partition=true;
> >> SET hive.exec.compress.output=true;
> >> SET io.seqfile.compression.type=BLOCK;
> >> SET hive.merge.size.per.task=25600;
> >> SET hive.merge.smallfiles.avgsize=160;
> >> SET hive.merge.mapfiles=true;
> >> SET hive.merge.mapredfiles=true;
> >> SET hive.mergejob.maponly=true;
> >> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> >> PARTITION(org_id, day)
> >> SELECT session_id, permanent_id, first_date, last_date, week, month,
> quarter,
> >> referral_type, search_engine, us_search_engine,
> >> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> >> pages_viewed,
> >> entry_page, page_types,
> >> org_id, day
> >> FROM daily_conversions_without_rank_table;
> >>
> >> I am running the latest version from trunk with HIVE-1622, but it
> >> seems like I just can't get the post merge process to happen. I have
> >> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
> >> at runtime is causing the merge process to be skipped.  Attached are
> >> the hive output and log files.
> >>
> >>
> >> Thanks,
> >> Sammy
> >> 
> >
> >
>
>
>
> --
> Chief Architect, BrightEdge
> email: s...@brightedge.com   |   mobile: 650.539.4867  |   fax:
> 650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
> 94404
>



-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net


Re: Merging small files with dynamic partitions

2010-10-15 Thread Sammy Yu
Hi guys,
   Thanks for the response.  I tried running without
hive.mergejob.maponly with the same result.  I've attached the explain
extended output.  I am running this query on EC2 boxes; however, it's
not running on EMR.  Hive is running on top of a Hadoop 0.20.2 setup.

Thanks,
Sammy

On Fri, Oct 15, 2010 at 5:58 PM, Ning Zhang  wrote:
> The output file shows it only has 2 jobs (the mapreduce job and the move
> task). This indicates that the plan does not have merge enabled. Merge should
> consist of a ConditionalTask and two subtasks (an MR task and a move task).
> Can you send the plan of the query?
>
> One thing I noticed is that you are using Amazon EMR. I'm not sure if this
> is enabled, since SET hive.mergejob.maponly=true requires
> CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported
> that some distributions of Hadoop don't support it). So an additional thing
> you can try is to remove this setting.
>
> On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:
>
>> Hi,
>>  I have a dynamic partition query which generates quite a few small
>> files which I would like to merge:
>>
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>> SET hive.exec.dynamic.partition=true;
>> SET hive.exec.compress.output=true;
>> SET io.seqfile.compression.type=BLOCK;
>> SET hive.merge.size.per.task=25600;
>> SET hive.merge.smallfiles.avgsize=160;
>> SET hive.merge.mapfiles=true;
>> SET hive.merge.mapredfiles=true;
>> SET hive.mergejob.maponly=true;
>> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
>> PARTITION(org_id, day)
>> SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
>> referral_type, search_engine, us_search_engine,
>> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
>> pages_viewed,
>> entry_page, page_types,
>> org_id, day
>> FROM daily_conversions_without_rank_table;
>>
>> I am running the latest version from trunk with HIVE-1622, but it
>> seems like I just can't get the post merge process to happen. I have
>> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
>> at runtime is causing the merge process to be skipped.  Attached are
>> the hive output and log files.
>>
>>
>> Thanks,
>> Sammy
>> 
>
>



-- 
Chief Architect, BrightEdge
email: s...@brightedge.com   |   mobile: 650.539.4867  |   fax:
650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
94404


explain.log
Description: Binary data


Re: Merging small files with dynamic partitions

2010-10-15 Thread Ning Zhang
The output file shows it only has 2 jobs (the mapreduce job and the move
task). This indicates that the plan does not have merge enabled. Merge should
consist of a ConditionalTask and two subtasks (an MR task and a move task).
Can you send the plan of the query?
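
(A quick way to check is to look at the plan directly; standard EXPLAIN
syntax, with the query abbreviated from the one quoted below.)

-- When merging is enabled, the plan should show a ConditionalTask wrapping an
-- extra map-only merge stage alongside the move task.
explain extended
insert overwrite table daily_conversions_without_rank_all_table
partition (org_id, day)
select * from daily_conversions_without_rank_table;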

One thing I noticed is that you are using Amazon EMR. I'm not sure if this is
enabled, since SET hive.mergejob.maponly=true requires CombineHiveInputFormat
(only available in Hadoop 0.20, and someone reported that some distributions
of Hadoop don't support it). So an additional thing you can try is to remove
this setting.

On Oct 15, 2010, at 1:43 PM, Sammy Yu wrote:

> Hi,
>  I have a dynamic partition query which generates quite a few small
> files which I would like to merge:
> 
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.compress.output=true;
> SET io.seqfile.compression.type=BLOCK;
> SET hive.merge.size.per.task=25600;
> SET hive.merge.smallfiles.avgsize=160;
> SET hive.merge.mapfiles=true;
> SET hive.merge.mapredfiles=true;
> SET hive.mergejob.maponly=true;
> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> PARTITION(org_id, day)
> SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
> referral_type, search_engine, us_search_engine,
> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> pages_viewed,
> entry_page, page_types,
> org_id, day
> FROM daily_conversions_without_rank_table;
> 
> I am running the latest version from trunk with HIVE-1622, but it
> seems like I just can't get the post merge process to happen. I have
> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
> at runtime is causing the merge process to be skipped.  Attached are
> the hive output and log files.
> 
> 
> Thanks,
> Sammy
> 



Re: Merging small files with dynamic partitions

2010-10-15 Thread Edward Capriolo
Sammy,

This is not the exact remedy you were looking for, but my company has
open-sourced our file crusher utility.

http://www.jointhegrid.com/hadoop_filecrush/index.jsp

We use it to good effect to turn many small files into one. It works with
text and sequence files, and custom writables.

Edward
On Friday, October 15, 2010, Sammy Yu  wrote:
> Hi,
>   I have a dynamic partition query which generates quite a few small
> files which I would like to merge:
>
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.compress.output=true;
> SET io.seqfile.compression.type=BLOCK;
> SET hive.merge.size.per.task=25600;
> SET hive.merge.smallfiles.avgsize=160;
> SET hive.merge.mapfiles=true;
> SET hive.merge.mapredfiles=true;
> SET hive.mergejob.maponly=true;
> INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
> PARTITION(org_id, day)
> SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
> referral_type, search_engine, us_search_engine,
> keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
> pages_viewed,
> entry_page, page_types,
> org_id, day
> FROM daily_conversions_without_rank_table;
>
> I am running the latest version from trunk with HIVE-1622, but it
> seems like I just can't get the post merge process to happen. I have
> raised hive.merge.smallfiles.avgsize.  I'm wondering if the filtering
> at runtime is causing the merge process to be skipped.  Attached are
> the hive output and log files.
>
>
> Thanks,
> Sammy
>