[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache

2010-11-24 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1797:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed! Thanks Liyin!

 Compressed the hashtable dump file before put into distributed cache
 

 Key: HIVE-1797
 URL: https://issues.apache.org/jira/browse/HIVE-1797
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: hive-1797.patch, hive-1797_3.patch


 Clearly, the size of the small table is the performance bottleneck for map join,
 because it affects both the memory usage and the size of the dumped hashtable file.
 That means there are two boundaries on map join performance:
 1) the memory usage of the local task and the mapred task, and
 2) the size of the dumped hashtable file placed in the distributed cache.
 The reason the test case in the last email spends most of its execution time
 initializing is that it hits the second boundary.
 Since we have already bounded the memory usage, one thing we can do is make sure
 the job never hits the second boundary before it hits the first one.
 Assuming the heap size is 1.6 GB and the small table file is 15 MB compressed
 (75 MB uncompressed), the local task can hold roughly 1.5 million unique rows
 in memory.
 The dumped hashtable file will then be roughly 150 MB, which is too large to put
 into the distributed cache.
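 As a rough sanity check on these numbers, here is a minimal sketch; the
 serialized bytes-per-row figure is an assumption for illustration, not a
 measurement from the patch:

     // Back-of-the-envelope estimate of the dumped hashtable size.
     public class MapJoinSizingEstimate {
       public static void main(String[] args) {
         long uniqueRows = 1500000L;   // rows the local task can hold in a 1.6 GB heap
         long bytesPerRow = 100L;      // assumed serialized size per row (illustrative)
         long dumpedMB = uniqueRows * bytesPerRow / 1000000L;
         System.out.println("Estimated dump size: ~" + dumpedMB + " MB"); // ~150 MB
       }
     }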
  
 From experiments, we can basically conclude that when the dumped file is smaller
 than 30 MB, the distributed cache works well and all the mappers are initialized
 in a short time (less than 30 seconds).
 One easy implementation is to compress the hashtable file.
 Using gzip to compress the hashtable file, the file size drops from 100 MB to
 13 MB.
 After several tests, all the mappers are initialized in less than 23 seconds.
 But this solution adds some decompression overhead to each mapper, and mappers
 on the same machine will do duplicate decompression work.
 Maybe in the future, we can let the distributed cache support this directly.
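 A minimal sketch of the compression idea, assuming the hashtable is written with
 Java object serialization; the class and method names here are illustrative and
 do not reflect the actual patch:

     import java.io.*;
     import java.util.zip.GZIPInputStream;
     import java.util.zip.GZIPOutputStream;

     public class HashTableDumpUtil {

       // Local task side: dump the in-memory hashtable to a gzip-compressed file
       // before it is shipped through the distributed cache.
       public static void dumpCompressed(Serializable hashTable, File out)
           throws IOException {
         ObjectOutputStream oos = new ObjectOutputStream(
             new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(out))));
         try {
           oos.writeObject(hashTable);
         } finally {
           oos.close();
         }
       }

       // Mapper side: decompress while deserializing; this is the per-mapper
       // decompression overhead mentioned above.
       public static Object loadCompressed(File in)
           throws IOException, ClassNotFoundException {
         ObjectInputStream ois = new ObjectInputStream(
             new GZIPInputStream(new BufferedInputStream(new FileInputStream(in))));
         try {
           return ois.readObject();
         } finally {
           ois.close();
         }
       }
     }

 Wrapping the streams this way leaves the serialization code itself unchanged;
 only the outermost stream differs.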

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache

2010-11-22 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HIVE-1797:
-

Status: Patch Available  (was: Open)
