[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Tang updated HIVE-1797:
-----------------------------

    Status: Patch Available  (was: Open)

> Compressed the hashtable dump file before put into distributed cache
> --------------------------------------------------------------------
>
>                 Key: HIVE-1797
>                 URL: https://issues.apache.org/jira/browse/HIVE-1797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: hive-1797.patch, hive-1797_3.patch
>
>
> Clearly, the size of the small table is the performance bottleneck for map join, 
> because it determines both the memory usage and the size of the dumped 
> hashtable file.
> That means map join performance has two boundaries:
> 1)    The memory usage of the local task and the mapred task
> 2)    The size of the dumped hashtable file in the distributed cache
> The test case in the last email spends most of its execution time on 
> initialization because it hits the second boundary.
> Since we have already bounded the memory usage, one thing we can do is 
> make sure performance never hits the second boundary before it hits the 
> first one.
> Assuming the heap size is 1.6 G and the small table file is 15M 
> compressed (75M uncompressed),
> the local task can hold roughly 1.5M unique rows in memory.
> The dumped file will then be roughly 150M, which is too large to put into 
> the distributed cache.
>  
> From experiments, we can basically conclude that when the dumped file is 
> smaller than 30M,
> the distributed cache works well and all the mappers are initialized in 
> a short time (less than 30 secs).
> One easy implementation is to compress the hashtable file.
> I used gzip to compress the hashtable file, which reduced the file size 
> from 100M to 13M.
> After several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper:
> mappers on the same machine will do duplicated decompression work.
> Maybe in the future the distributed cache can support this directly.
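The gzip round trip described above can be sketched in plain Java using the
standard `java.util.zip` streams. This is only an illustration of the idea,
not Hive's actual patch; the class and file names here are hypothetical.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class HashTableGzip {

    // Gzip-compress src into dst, as the local task would do to the
    // hashtable dump before registering it in the distributed cache.
    static void compress(File src, File dst) throws IOException {
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new GZIPOutputStream(new FileOutputStream(dst))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Decompress src into dst, as each mapper would do before loading
    // the hashtable (this is the per-mapper overhead mentioned above).
    static void decompress(File src, File dst) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(src));
             OutputStream out = new FileOutputStream(dst)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake a highly repetitive dump file, which compresses well.
        File plain = File.createTempFile("hashtable", ".dump");
        File gz = File.createTempFile("hashtable", ".dump.gz");
        File back = File.createTempFile("hashtable", ".roundtrip");
        try (Writer w = new FileWriter(plain)) {
            for (int i = 0; i < 10000; i++) {
                w.write("key" + (i % 100) + "\tvalue\n");
            }
        }
        compress(plain, gz);
        decompress(gz, back);
        System.out.println("original=" + plain.length()
                + " compressed=" + gz.length()
                + " roundtrip=" + back.length());
    }
}
```

The roundtrip file is byte-identical to the original, while the compressed
copy is far smaller; the trade-off is exactly the decompression cost that
every mapper pays.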

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
