[ https://issues.apache.org/jira/browse/TEZ-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176856#comment-17176856 ]
Rajesh Balamohan commented on TEZ-4207:
---------------------------------------

Thanks for the review [~ashutoshc]. Committed to master.

> Provide approximate number of input records to be processed in
> UnorderedKVInput
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-4207
>                 URL: https://issues.apache.org/jira/browse/TEZ-4207
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>         Attachments: TEZ-4207.1.patch, TEZ-4207.wip.patch
>
>
> There are cases where broadcast data is loaded into a hashtable in upstream
> applications (e.g. Hive). Apps tend to predict the number of entries in the
> hashtable diligently, but there are cases where computing these estimates at
> compile time can be very complicated.
>
> Tez can help in such cases by providing an "approximate number of input
> records" counter for the data to be processed in UnorderedKVInput. This is to
> avoid expensive rehashing when hashtable sizes are not estimated correctly.
> It would be good to start with the broadcast case first and then move on to
> the unordered partitioned case later.
>
> This would help in predicting the number of entries at runtime and in getting
> better estimates for hashtable sizing.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
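
For illustration only, here is a minimal sketch of how a consumer might use an approximate record count to pre-size a hashtable and avoid rehashing. It is not taken from the attached patches. The class HashTableSizing, its methods, and the approxInputRecords parameter are hypothetical names, and how the value would be read from Tez is deliberately not shown. The sketch assumes a plain java.util.HashMap with the default 0.75 load factor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tez or Hive.
final class HashTableSizing {
    private static final float LOAD_FACTOR = 0.75f;
    private static final int DEFAULT_CAPACITY = 16;
    private static final int MAX_CAPACITY = 1 << 30;

    private HashTableSizing() {}

    // Capacity large enough to hold expectedEntries without crossing
    // the resize threshold (capacity * loadFactor).
    static int initialCapacity(long expectedEntries, float loadFactor) {
        if (expectedEntries <= 0) {
            // No usable estimate: fall back to a small default.
            return DEFAULT_CAPACITY;
        }
        long needed = (long) Math.ceil(expectedEntries / (double) loadFactor);
        return (int) Math.min(needed, MAX_CAPACITY);
    }

    // approxInputRecords is assumed to come from a runtime estimate
    // such as the counter proposed in this issue.
    static <K, V> Map<K, V> newPresizedMap(long approxInputRecords) {
        return new HashMap<>(initialCapacity(approxInputRecords, LOAD_FACTOR), LOAD_FACTOR);
    }
}
```

Because the count is only approximate, the map can still grow if the estimate is too low; the pre-sizing just makes that less likely.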