[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1797:
    Resolution: Fixed
    Status: Resolved  (was: Patch Available)

Committed! Thanks Liyin!

Compressed the hashtable dump file before put into distributed cache
    Key: HIVE-1797
    URL: https://issues.apache.org/jira/browse/HIVE-1797
    Project: Hive
    Issue Type: Improvement
    Components: Query Processor
    Affects Versions: 0.7.0
    Reporter: Liyin Tang
    Assignee: Liyin Tang
    Attachments: hive-1797.patch, hive-1797_3.patch

Clearly, the size of the small table is the performance bottleneck for map join, because it determines both the memory usage and the size of the dumped hashtable file. That means there are two bounds on map-join performance:
1) the memory usage of the local task and the mapred task, and
2) the size of the dumped hashtable file placed into the distributed cache.

The test case in the last email spends most of its execution time on initialization because it hits the second bound. Since we have already bounded the memory usage, one thing we can do is ensure performance never hits the second bound before it hits the first. Assuming the heap size is 1.6 G and the small table file is 15M compressed (75M uncompressed), the local task can hold roughly 1.5M unique rows in memory, and the dumped file will be roughly 150M, which is too large to put into the distributed cache. From experiments, we can basically conclude that when the dumped file is smaller than 30M, the distributed cache works well and all the mappers are initialized in a short time (less than 30 secs).

One easy implementation is to compress the hashtable file. Using gzip, the hashtable file shrinks from 100M to 13M. After several tests, all the mappers are initialized in less than 23 secs. But this solution adds some decompression overhead to each mapper.
Mappers on the same machine will do duplicate decompression work. Maybe in the future, we can let the distributed cache support this natively.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
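The idea described above (the actual patch lives in Hive's Java code; this is only an illustrative sketch with made-up file names) is simply to gzip the serialized hashtable dump before shipping it, and to decompress it once on the mapper side. A minimal Python sketch of the compress/decompress round trip, using a synthetic repetitive dump file to mimic serialized hashtable rows:

```python
import gzip
import os
import shutil
import tempfile

def compress_dump(src_path: str) -> str:
    """Gzip-compress a hashtable dump file; returns the .gz path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

def decompress_dump(gz_path: str) -> str:
    """Decompress the .gz dump back to a plain file (the mapper-side step)."""
    dst_path = gz_path[: -len(".gz")] + ".restored"
    with gzip.open(gz_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

# Demo: build a highly repetitive dump file (serialized key/value rows
# compress well, which is why gzip gave roughly 100M -> 13M in the issue).
tmpdir = tempfile.mkdtemp()
dump = os.path.join(tmpdir, "hashtable.dump")
with open(dump, "wb") as f:
    for i in range(100_000):
        f.write(b"key-%08d\tvalue-row-padding-padding\n" % i)

gz = compress_dump(dump)
restored = decompress_dump(gz)

original = os.path.getsize(dump)
compressed = os.path.getsize(gz)
print(f"original:   {original} bytes")
print(f"compressed: {compressed} bytes ({compressed / original:.1%})")
```

Note the trade-off the issue calls out: compression is done once by the local task, but every mapper pays the decompression cost, and mappers co-located on one machine repeat that work independently.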
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Tang updated HIVE-1797:
    Status: Patch Available  (was: Open)