RE: Managing input split sizes in Hive running the tez engine

Bikas Saha Fri, 22 Apr 2016 11:34:28 -0700

It's not a property. There is a TezMapReduceSplitsGrouper or 
TezMapredSplitsGrouper library class that is used for this functionality.

From: Nitin Kumar [mailto:[email protected]] 
Sent: Thursday, April 21, 2016 10:00 PM
To: [email protected]
Subject: Re: Managing input split sizes in Hive running the tez engine

Thanks Bikas for your inputs!

Could you tell me the property that needs to be set in order to enable tez 
grouping?

Thanks and regards,

Nitin

On Thu, Apr 21, 2016 at 11:26 PM, Bikas Saha <[email protected]> wrote:

Tez grouping (if enabled) is explained here. 
https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

For the rest of the questions, the hive user mailing list would be a better 
avenue for answers.

Bikas

From: Nitin Kumar [mailto:[email protected]] 
Sent: Wednesday, April 20, 2016 10:54 PM
To: [email protected]
Subject: Managing input split sizes in Hive running the tez engine

Hi,

I want to gain a better understanding of how the input splits are calculated 
in the tez engine.

I am aware that the hive.input.format property can be set to either 
HiveInputFormat (default) or CombineHiveInputFormat (generally preferred for 
a large number of files whose sizes are << the HDFS block size). 

I was hoping someone could walk me through the differences in how 
HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file 
sizes vary from small (smaller than a block) to large (spanning multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning a 
table. For the MR engine this can be controlled by setting the 
mapred.min.split.size and mapred.max.split.size properties. I need to know if 
there are similar configurations for the tez engine.
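
For reference, the MR-side behaviour I am relying on can be sketched roughly 
as below. This is only a simplified model: it assumes the split size is chosen 
as max(min size, min(max size, block size)), as in Hadoop's new-API 
FileInputFormat, and it ignores the slop factor and any file combining done by 
CombineHiveInputFormat. The helper names are hypothetical.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    # Assumed rule: clamp the block size between the min and max split sizes.
    return max(min_size, min(max_size, block_size))


def estimate_mappers(file_sizes, block_size, min_size, max_size):
    # Hypothetical helper: one mapper per split, splits never span files.
    split_size = compute_split_size(block_size, min_size, max_size)
    return sum(math.ceil(size / split_size) for size in file_sizes if size > 0)


# A single 1024 MB file with 128 MB blocks.
print(estimate_mappers([1024 * MB], 128 * MB, 1, 256 * MB))
# Raising the min split size above the block size yields fewer, larger splits.
print(estimate_mappers([1024 * MB], 128 * MB, 256 * MB, 256 * MB))
```

Under this model, raising the minimum split size reduces the mapper count and 
lowering the maximum split size increases it.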

Also, the properties tez.grouping.max-size, tez.grouping.min-size and 
tez.grouping.split-waves have been set to 1GB, 16MB and 1.7 
respectively. However, I observed that the created input splits do not adhere to 
these properties. 

I had two files of size 3MB each for a table. According to the set properties, 
only 1 mapper task should have spawned but 2 mapper tasks spawned instead.
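
The arithmetic behind that expectation can be sketched as below. This is a 
simplified model and not the actual Tez grouping code: it assumes the desired 
number of groups is the available task slots multiplied by split-waves, and 
that the resulting bytes per group are then clamped between min-size and 
max-size. It ignores locality and any Hive-side restrictions on which splits 
may be grouped together. The function name and the available_slots parameter 
are hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def estimate_group_count(split_sizes, min_size, max_size, available_slots, waves):
    # Hypothetical helper: estimate how many grouped splits (mappers) result.
    total = sum(split_sizes)
    if total == 0:
        return 0
    # Assumed starting point: desired groups = available slots * split-waves.
    desired = max(1, int(available_slots * waves))
    per_group = total / desired
    # Clamp bytes per group into [min_size, max_size].
    per_group = min(max(per_group, min_size), max_size)
    groups = math.ceil(total / per_group)
    # Grouping cannot produce more groups than there are original splits.
    return max(1, min(groups, len(split_sizes)))


# Two 3 MB files, min-size 16 MB, max-size 1 GB, split-waves 1.7,
# and an assumed 10 available task slots.
print(estimate_group_count([3 * MB, 3 * MB], 16 * MB, 1 * GB, 10, 1.7))
```

With 6MB of total input and a 16MB minimum, this model yields a single group 
regardless of the number of slots assumed, which is why I expected 1 mapper.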

Are there other properties in hive/tez that need to be set to enable input 
split grouping?

I would highly appreciate your inputs.

Thanks and regards,

Nitin
