Hassan Eslami created GIRAPH-1073:
-------------------------------------
Summary: Decouple out-of-core persistence infrastructure from
out-of-core computation
Key: GIRAPH-1073
URL: https://issues.apache.org/jira/browse/GIRAPH-1073
Project: Giraph
Issue Type: Improvement
Reporter: Hassan Eslami
Assignee: Hassan Eslami
In the current out-of-core infrastructure, the persistence layer is tightly
coupled with the scheduling logic and the out-of-core engine. This makes it
hard to try new features in the persistence layer. The following changes are
needed:
* The persistence layer should be decoupled from the rest of the out-of-core
infrastructure. This way one can implement and plug in different data accessors
for various persistence resources, e.g. a local file system data accessor, an
HDFS data accessor, a serialized in-memory data accessor, etc.
* We should be able to address out-of-core data in a more efficient and
flexible way. Currently, data are accessed/addressed through string literals in
various locations of the code. This should be changed so data can be accessed
through a unified, more flexible data indexing mechanism.
* With different data accessor implementations, there may be more emphasis on
using a larger number of IO threads. It is important that these IO threads are
load-balanced. Currently, partitions are assigned to IO threads using a hash
function. Hash functions tend not to balance load well when the number of data
points (partitions in this case) is small.
* Currently, out-of-core uses `BufferedInputStream` and `BufferedOutputStream`
along with the default (de)serialization mechanism. The IO bandwidth achieved
by the current implementation is low. Two changes could improve it: 1) use the
Unsafe (de)serialization mechanism to optimize memory bandwidth during
(de)serialization, and 2) use `RandomAccessFile`'s read and write interface to
get lower-level access to the local file system and avoid overhead when
reading from and writing to local files.
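
To illustrate the load-balancing concern in the third item, here is a minimal
sketch, not Giraph code. It assumes the hash-based assignment behaves like
"partition id modulo number of IO threads", and it uses round-robin assignment
as one possible alternative; the issue itself does not prescribe a specific
scheme. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper comparing two ways of mapping partitions to IO threads. */
final class IoThreadAssignment {

  /**
   * Hash-style assignment (assumed form): a partition goes to the thread
   * given by its id's hash code modulo the number of IO threads.
   * Returns the number of partitions each thread ends up owning.
   */
  static int[] loadByHash(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int id : partitionIds) {
      load[Math.floorMod(Integer.hashCode(id), numThreads)]++;
    }
    return load;
  }

  /**
   * Round-robin assignment (an assumed alternative): the i-th partition in
   * the list goes to thread i modulo the number of IO threads, regardless of
   * its id. Returns the number of partitions each thread ends up owning.
   */
  static int[] loadByRoundRobin(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int i = 0; i < partitionIds.length; i++) {
      load[i % numThreads]++;
    }
    return load;
  }

  public static void main(String[] args) {
    // A small set of partition ids that happen to share a residue modulo 4.
    int[] partitionIds = {0, 4, 8, 12, 16, 20};
    int numThreads = 4;
    System.out.println("hash:        "
        + Arrays.toString(loadByHash(partitionIds, numThreads)));
    System.out.println("round-robin: "
        + Arrays.toString(loadByRoundRobin(partitionIds, numThreads)));
  }
}
```

With these six partition ids and four threads, the hash-style mapping places
every partition on the same thread, while the round-robin mapping spreads them
so that no thread owns more than two.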