[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1797:
    Resolution: Fixed
    Status: Resolved  (was: Patch Available)

Committed! Thanks Liyin!

Compressed the hashtable dump file before put into distributed cache
    Key: HIVE-1797
    URL: https://issues.apache.org/jira/browse/HIVE-1797
    Project: Hive
    Issue Type: Improvement
    Components: Query Processor
    Affects Versions: 0.7.0
    Reporter: Liyin Tang
    Assignee: Liyin Tang
    Attachments: hive-1797.patch, hive-1797_3.patch

Clearly, the size of the small table is the performance bottleneck for map join, because it determines both the memory usage and the size of the dumped hashtable file. That means there are two bounds on map-join performance:
1) the memory usage of the local task and the mapred task, and
2) the size of the dumped hashtable file placed into the distributed cache.

The test case in the last email spends most of its execution time on initialization because it hits the second bound. Since we have already bounded the memory usage, one thing we can do is ensure performance never hits the second bound before it hits the first. Assuming the heap size is 1.6 G and the small table file is 15M compressed (75M uncompressed), the local task can hold roughly 1.5M unique rows in memory, and the dumped file will be roughly 150M, which is too large to put into the distributed cache. From experiments, we can basically conclude that when the dumped file is smaller than 30M, the distributed cache works well and all the mappers are initialized in a short time (less than 30 secs).

One easy implementation is to compress the hashtable file. Using gzip, the hashtable file shrinks from 100M to 13M. After several tests, all the mappers are initialized in less than 23 secs. But this solution adds some decompression overhead to each mapper.
Mappers on the same machine will do duplicate decompression work. Maybe in the future, we can let the distributed cache support this natively.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
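The idea described above (the actual patch lives in Hive's Java code; this is only an illustrative sketch with made-up file names) is simply to gzip the serialized hashtable dump before shipping it, and to decompress it once on the mapper side. A minimal Python sketch of the compress/decompress round trip, using a synthetic repetitive dump file to mimic serialized hashtable rows:

```python
import gzip
import os
import shutil
import tempfile

def compress_dump(src_path: str) -> str:
    """Gzip-compress a hashtable dump file; returns the .gz path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

def decompress_dump(gz_path: str) -> str:
    """Decompress the .gz dump back to a plain file (the mapper-side step)."""
    dst_path = gz_path[: -len(".gz")] + ".restored"
    with gzip.open(gz_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

# Demo: build a highly repetitive dump file (serialized key/value rows
# compress well, which is why gzip gave roughly 100M -> 13M in the issue).
tmpdir = tempfile.mkdtemp()
dump = os.path.join(tmpdir, "hashtable.dump")
with open(dump, "wb") as f:
    for i in range(100_000):
        f.write(b"key-%08d\tvalue-row-padding-padding\n" % i)

gz = compress_dump(dump)
restored = decompress_dump(gz)

original = os.path.getsize(dump)
compressed = os.path.getsize(gz)
print(f"original:   {original} bytes")
print(f"compressed: {compressed} bytes ({compressed / original:.1%})")
```

Note the trade-off the issue calls out: compression is done once by the local task, but every mapper pays the decompression cost, and mappers co-located on one machine repeat that work independently.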
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Tang updated HIVE-1797:
    Status: Patch Available  (was: Open)