It's not a property. There is a TezMapReduceSplitsGrouper or TezMapredSplitsGrouper library class that is used for this functionality.

From: Nitin Kumar [mailto:[email protected]]
Sent: Thursday, April 21, 2016 10:00 PM
To: [email protected]
Subject: Re: Managing input split sizes in Hive running the tez engine

Thanks Bikas for your inputs!

Could you tell me the property that needs to be set in order to enable tez grouping?

Thanks and regards,
Nitin

On Thu, Apr 21, 2016 at 11:26 PM, Bikas Saha <[email protected]> wrote:

Tez grouping (if enabled) is explained here.

https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

For the rest of the questions, the hive user mailing list would be a better avenue for answers.

Bikas

From: Nitin Kumar [mailto:[email protected]]
Sent: Wednesday, April 20, 2016 10:54 PM
To: [email protected]
Subject: Managing input split sizes in Hive running the tez engine

Hi,

I want to gain a better understanding of how the input splits are calculated in the tez engine.

I am aware that the hive.input.format property can be set to either HiveInputFormat (default) or to CombineHiveInputFormat (generally accepted for a large number of files having sizes << hdfs block size). I was hoping someone could walk me through the differences in how HiveInputFormat and CombineHiveInputFormat calculate split sizes as data file sizes vary from small (smaller than a block) to large (spanning multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning a table. For the MR engine this can be controlled by setting the mapred.min.split.size and mapred.max.split.size properties. I need to know if there are similar configurations for the tez engine.

Also, the properties tez.grouping.max-size, tez.grouping.min-size and tez.grouping.split-waves have been set to the values of 1GB, 16MB and 1.7 respectively. However, I observed that the created input splits do not adhere to these properties. I had two files of size 3MB each for a table. According to the set properties, only 1 mapper task should have spawned, but 2 mapper tasks spawned instead.

Are there other properties in hive/tez that need to be set to enable input split grouping?

I would highly appreciate your inputs.

Thanks and regards,
Nitin
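
For reference, the expectation in the original question (two 3MB files, a 16MB minimum, hence one task) can be checked with a rough model of size-based grouping. This is only a sketch in Python, assuming the grouper aims for `waves * available_slots` groups and then clamps the average group size between tez.grouping.min-size and tez.grouping.max-size. The function name `estimate_group_count` and the `available_slots` parameter are hypothetical, and the real grouper classes also weigh data locality, so actual task counts may differ, particularly when grouping is not applied at all.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def estimate_group_count(split_sizes, available_slots, waves=1.7,
                         min_size=16 * MB, max_size=1 * GB):
    """Rough estimate of how many grouped splits a size-based grouper yields.

    Simplified model, not the Tez implementation: locality and
    per-input-format behaviour are ignored.
    """
    if not split_sizes:
        return 0

    total = sum(split_sizes)

    # Start from the number of groups needed to fill the cluster for the
    # requested number of waves (available_slots is an assumed input).
    desired = max(1, int(available_slots * waves))
    per_group = total / desired

    if per_group > max_size:
        # Groups would be too large: use more, smaller groups.
        desired = math.ceil(total / max_size)
    elif per_group < min_size:
        # Groups would be too small: use fewer, larger groups.
        desired = math.ceil(total / min_size)

    # Grouping can merge splits but never creates more groups than splits.
    return max(1, min(desired, len(split_sizes)))


if __name__ == "__main__":
    # The case from the question: two 3MB files, 10 assumed free slots.
    print(estimate_group_count([3 * MB, 3 * MB], available_slots=10))
```

Under this model the two 3MB files collapse into a single group, which matches the expectation stated in the question. Seeing two tasks instead would be consistent with grouping not having been applied to those splits.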
