Hi Arpan,
Include the partition column in the DISTRIBUTE BY clause of the DML statement; 
it will then generate only one file per day. I hope this resolves the issue. 

> "insert into 'target_table' select a,b,c from x where ... distribute by 
> (date)"
> 
PS: Backdated processing will generate additional file(s), one file per load. 
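
To spell the suggestion out a little, here is a minimal sketch of the nightly 
load as a dynamic-partition insert. It assumes the partition column is called 
load_date (a hypothetical name; the query above only calls it "date"), that it 
is the last column in the SELECT list, and that the date literal stands in for 
the elided WHERE filter. DISTRIBUTE BY routes all rows with the same key to the 
same reducer, so each partition value should be written by a single reducer.

```sql
-- Sketch only: target_table, x, a, b, c come from the original query;
-- load_date and the date literal are placeholders, not the real schema.

-- Session-level settings, so nothing changes cluster-wide.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_table PARTITION (load_date)
SELECT a, b, c, load_date      -- partition column goes last
FROM x
WHERE load_date = '2017-06-22' -- stand-in for the nightly filter
DISTRIBUTE BY load_date;       -- one reducer per partition value
```

Because the SET statements apply only to the current session, other jobs on 
the cluster keep their existing reducer limits.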

Thanks,
Saurabh


> On 22-Jun-2017, at 11:30 AM, Arpan Rajani <arpan.raj...@whishworks.com> wrote:
> 
> Hello everyone,
> 
> 
> 
> I am sure many of you have faced a similar issue.
> 
> We do "insert into 'target_table' select a,b,c from x where .." kind of 
> queries for a nightly load. This insert goes in a new partition of the 
> target_table. 
> 
> Now the concern is: this insert loads hardly any data (I would say less 
> than 128 MB per day), but the data is fragmented into 1200 files, each only 
> a few kilobytes. This is slowing down performance. How can we make sure 
> this load does not generate lots of small files?
> 
> I have already set hive.merge.mapfiles and hive.merge.mapredfiles to true 
> in custom/advanced hive-site.xml, but the load job still writes 1200 
> small files. 
> 
> I know where 1200 comes from: it is the maximum number of 
> reducers/containers set in one of the hive-sites. (I do not think it is a 
> good idea to change this number cluster-wide, as that could 
> affect other jobs that use the cluster when it has free containers.) 
> 
> What other approach or settings would keep the Hive insert from taking 1200 
> slots and generating lots of small files?
> 
> I also have another question, which partly runs counter to the above (this 
> is relatively less important):
> 
> When I reload this table by creating a new table from a SELECT on the target 
> table, the newly created table does not contain many small files. Its 
> number of files drops from 1200 to roughly 50. What could 
> be the reason?
> 
> PS: I did go through 
> http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
> 
> 
> 
> Regards,
> Arpan
> 
