[
https://issues.apache.org/jira/browse/TAJO-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyunsik Choi updated TAJO-374:
------------------------------
Summary: Investigate more efficient intermediate shuffle methods (was:
Investigate more efficient Intermedaite data handling)
> Investigate more efficient intermediate shuffle methods
> -------------------------------------------------------
>
> Key: TAJO-374
> URL: https://issues.apache.org/jira/browse/TAJO-374
> Project: Tajo
> Issue Type: Improvement
> Components: data shuffle
> Reporter: Hyunsik Choi
>
> h3. Motivation
> Currently, Tajo materializes intermediate data on local disks, storing one
> file for each partition. This approach becomes inefficient and does not scale
> as data volume increases. In MR, this challenge was resolved by sorting
> intermediate key-values, grouping data with the same key, and indexing on keys.
> However, that approach requires unnecessary sorting and disk I/O, so it is not
> feasible in Tajo.
> h3. References
> * TAJO-292 is an ad-hoc fix that reduces the number of intermediate
> files, but it still does not scale.
> * Optimizing MapReduce Job Performance
> (http://www.slideshare.net/cloudera/mr-perf)
> * Multilevel aggregation for Hadoop/MapReduce
> (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
> * Sailfish: A Framework for Large Scale Data Processing
> (http://research.yahoo.com/files/yl-2012-002.pdf)
> * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
--
This message was sent by Atlassian JIRA
(v6.2#6252)