[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935044#action_12935044 ]
He Yongqiang commented on HIVE-1797:
------------------------------------

will take a look

> Compressed the hashtable dump file before put into distributed cache
> --------------------------------------------------------------------
>
>                 Key: HIVE-1797
>                 URL: https://issues.apache.org/jira/browse/HIVE-1797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: hive-1797.patch, hive-1797_3.patch
>
>
> Clearly, the size of the small table is the performance bottleneck for map join,
> because it affects both the memory usage and the size of the dumped hashtable file.
> That means there are two boundaries on map-join performance:
> 1) the memory usage of the local task and the mapred task;
> 2) the dumped hashtable file size for the distributed cache.
> The test case in the last email spends most of its execution time on
> initialization because it hits the second boundary.
> Since we have already bounded the memory usage, one thing we can do is make sure
> the job never hits the second boundary before it hits the first.
> Assuming the heap size is 1.6G and the small table file is 15M compressed
> (75M uncompressed), the local task can hold roughly 1.5M unique rows in memory.
> The dumped file will then be roughly 150M, which is too large to put into the
> distributed cache.
>
> From experiments, we can basically conclude that when the dumped file is
> smaller than 30M, the distributed cache works well and all the mappers are
> initialized in a short time (less than 30 secs).
> One easy implementation is to compress the hashtable file.
> Using gzip, the hashtable file is compressed from 100M to 13M.
> After several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper:
> mappers on the same machine will do duplicate decompression work.
> Maybe in the future, we can let the distributed cache support this natively.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
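The compression step described above can be sketched in plain Java. This is not the HIVE-1797 patch itself, just a minimal illustration of wrapping a dump in `GZIPOutputStream` before shipping it and paying the `GZIPInputStream` cost once per mapper; the class and method names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDumpSketch {

    // Hypothetical helper: compress raw dump bytes before they are
    // put into the distributed cache.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Hypothetical helper: the decompression each mapper would pay
    // once when loading the cached hashtable file.
    static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive rows stand in for serialized hashtable entries,
        // which is why gzip shrinks the dump so dramatically.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("key").append(i % 100).append("\tvalue\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw=" + raw.length + " gzip=" + packed.length);
        System.out.println("roundtrip="
            + java.util.Arrays.equals(raw, gunzip(packed)));
    }
}
```

Because every mapper on a machine decompresses its own copy, the duplicated work noted above remains; sharing one decompressed copy per node would need support from the distributed cache itself.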