[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817207#comment-13817207
 ] 

Lijie Xu commented on MAPREDUCE-5605:
-------------------------------------

How to implement a MapReduce-like framework both effectively and efficiently is a 
hard problem. I would like to know how you deal with the following issues, all of 
which matter for an in-memory processing framework.

Multi-threading vs. multi-processing. Hadoop runs each task in its own process, 
while Apache Spark runs tasks as threads. Spark chose threads because it wants 
to share data between tasks/jobs, and sharing data across processes (separate 
JVMs) is inefficient. Based on the architecture of this proposal, it seems that 
you want to aggregate the intermediate data of several mappers and reducers 
together in memory, so that more control can be exercised to optimize I/O. 
However, the concrete dataflow is not given, so I would like to know whether 
there is data sharing between tasks/jobs and how large the shared data will be.
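
(For illustration only, a rough Java sketch of the thread-per-task model I am 
asking about; the class and field names are hypothetical, not taken from the 
patch. The point is that tasks in one JVM can read each other's in-memory 
output directly, while separate processes would have to go through 
serialization, IPC, or the local disk.)

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical sketch: tasks run as threads in one JVM and can share
    // an in-memory structure directly.
    public class ThreadedTaskEngine {
        // Intermediate store visible to every task thread in this JVM.
        private final ConcurrentHashMap<String, byte[]> sharedIntermediate =
                new ConcurrentHashMap<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        public Future<?> submitMapTask(String taskId, Runnable mapLogic) {
            return pool.submit(() -> {
                mapLogic.run();
                // The mapper's output stays in memory and is directly
                // readable by reducer threads in the same JVM.
                sharedIntermediate.put(taskId, new byte[0] /* placeholder */);
            });
        }

        public byte[] readIntermediate(String taskId) {
            return sharedIntermediate.get(taskId);
        }
    }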

Fault tolerance: compared with multi-threading, the multi-process approach has 
its advantages: it is easy to manage and easy to make fault-tolerant. Hadoop is 
disk-based and process-based, so the failure of a mapper/reducer is easily 
handled by rerunning the lost task on an appropriate node. If the framework 
becomes memory-based, the safety of intermediate data (e.g., the outputs of a 
mapper) is harder to guarantee. Furthermore, crashes of individual threads, or 
of the JVM itself, also need attention.
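
(Again a hypothetical sketch: a failed task thread can be caught and 
resubmitted, but nothing at this level protects against the shared JVM itself 
dying, which takes every co-located task and all in-memory intermediate data 
with it at once.)

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical per-thread failure handling: retry a task whose thread
    // threw, up to a bounded number of attempts.
    public class RetryingRunner {
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        public void runWithRetry(Callable<Void> task, int maxAttempts)
                throws Exception {
            for (int attempt = 1; ; attempt++) {
                Future<Void> f = pool.submit(task);
                try {
                    f.get();      // block until the task finishes or fails
                    return;       // success
                } catch (ExecutionException e) {
                    if (attempt >= maxAttempts) {
                        throw e;  // give up; rerun the task elsewhere
                    }
                    // otherwise fall through and resubmit
                }
            }
        }
    }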

Idempotence: rerunning any task must not affect the final result. Putting the 
input/output/intermediate data of several tasks together requires special 
management to preserve this property.
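
(A minimal sketch of one common way to keep task commits idempotent, in the 
spirit of Hadoop's OutputCommitter: each attempt writes to a private temporary 
location and publishes it with a single atomic rename. The paths here are made 
up for illustration.)

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Hypothetical sketch: an attempt's output becomes visible only via an
    // atomic rename, so rerunning the attempt can never expose half-written
    // data.
    public class AttemptCommitter {
        public void commit(Path attemptTmpDir, Path finalOutDir)
                throws IOException {
            if (Files.exists(finalOutDir)) {
                // Another successful attempt already committed; this rerun
                // is a no-op, which is exactly the idempotence asked for.
                return;
            }
            Files.move(attemptTmpDir, finalOutDir,
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }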

Trade-off between memory usage and disk usage: since memory is limited and the 
data is huge, disk is still needed to store/swap some data. When and how to 
swap overflowing in-memory data to disk is therefore an important, 
performance-critical question.
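
(A hypothetical sketch of the kind of spill policy I mean; the 64 MB budget and 
the file naming are invented. The whole buffer is written out sequentially once 
the budget is exceeded, which keeps the disk access pattern friendly, as the 
proposal's "sequential disk accessing" presumably intends.)

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical spillable buffer: records stay in memory until a budget
    // is exceeded, then the whole buffer is flushed to disk sequentially.
    public class SpillableBuffer {
        private static final long MEMORY_BUDGET = 64L * 1024 * 1024; // assumed
        private final List<byte[]> records = new ArrayList<>();
        private long bytesInMemory = 0;
        private int spillCount = 0;

        public synchronized void add(byte[] record) throws IOException {
            records.add(record);
            bytesInMemory += record.length;
            if (bytesInMemory > MEMORY_BUDGET) {
                spill();
            }
        }

        private void spill() throws IOException {
            File out = new File("spill-" + (spillCount++) + ".bin");
            try (DataOutputStream dos = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(out)))) {
                for (byte[] r : records) {
                    dos.writeInt(r.length); // length-prefixed record framing
                    dos.write(r);
                }
            }
            records.clear();
            bytesInMemory = 0;
        }
    }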

> Memory-centric MapReduce aiming to solve the I/O bottleneck
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5605
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5605
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 1.0.1
>         Environment: x86-64 Linux/Unix
> 64-bit jdk7 preferred
>            Reporter: Ming Chen
>            Assignee: Ming Chen
>             Fix For: 1.0.1
>
>         Attachments: MAPREDUCE-5605-v1.patch, 
> hadoop-core-1.0.1-mammoth-0.9.0.jar
>
>
> Memory is a very important resource for bridging the gap between CPUs and I/O 
> devices, so the idea is to maximize the use of memory to relieve the I/O 
> bottleneck. We developed a multi-threaded task execution engine, which runs 
> in a single JVM on a node. In the execution engine, we implemented a 
> memory-scheduling algorithm to realize global memory management, on top of 
> which we further developed techniques such as sequential disk access and 
> multi-cache, and solved the problem of full garbage collection in the JVM. 
> The benchmark results show that it achieves impressive improvements in 
> typical cases. When a system is relatively short of memory (e.g., HPC, 
> small- and medium-size enterprises), the improvement is even more impressive.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
