[jira] [Updated] (TAJO-374) Investigate more efficient intermediate shuffle methods

Hyunsik Choi (JIRA) Wed, 19 Mar 2014 05:44:30 -0700

     [ https://issues.apache.org/jira/browse/TAJO-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyunsik Choi updated TAJO-374:
------------------------------

    Summary: Investigate more efficient intermediate shuffle methods  (was: 
Investigate more efficient Intermedaite data handling)

> Investigate more efficient intermediate shuffle methods
> -------------------------------------------------------
>
>                 Key: TAJO-374
>                 URL: https://issues.apache.org/jira/browse/TAJO-374
>             Project: Tajo
>          Issue Type: Improvement
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>
> h3. Motivation
> Currently, Tajo materializes intermediate data on local disks, storing one
> file for each partition. This becomes inefficient and does not scale as data
> volume increases. In MapReduce (MR), this challenge was resolved by sorting
> intermediate key-values, grouping data with the same key, and indexing on
> keys. However, that approach requires unnecessary sorting and disk I/O, which
> is not feasible in Tajo.
> h3. References
>  * TAJO-292 is an ad hoc fix that reduces the number of intermediate
> files, but it is still not scalable.
>  * Optimizing MapReduce Job Performance 
> (http://www.slideshare.net/cloudera/mr-perf)
>  * Multilevel aggregation for Hadoop/MapReduce 
> (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
>  * Sailfish: A Framework for Large Scale Data Processing
> (http://research.yahoo.com/files/yl-2012-002.pdf)
>  * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps



--
This message was sent by Atlassian JIRA
(v6.2#6252)
