[ https://issues.apache.org/jira/browse/GIRAPH-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332260#comment-15332260 ]
Hassan Eslami commented on GIRAPH-1073:
---------------------------------------

https://reviews.facebook.net/D59691

> Decouple out-of-core persistence infrastructure from out-of-core computation
> ----------------------------------------------------------------------------
>
>                 Key: GIRAPH-1073
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1073
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Hassan Eslami
>            Assignee: Hassan Eslami
>
> In the current out-of-core infrastructure, the persistence layer is heavily
> intertwined with the scheduling and out-of-core engine. This makes it
> complicated to try new features for the persistence layer. The following
> changes are needed:
> * The persistence layer should be decoupled from the out-of-core
> infrastructure. This way one can simply implement and plug in different data
> accessors for various persistence resources, e.g. local file system data
> accessor, HDFS data accessor, serialized in-memory data accessor, etc.
> * We should be able to address out-of-core data in a more efficient and
> flexible way. Currently, data are accessed/addressed through string literals
> in various locations of the code. This should be changed so data can be
> accessed through a unified, more flexible data indexing mechanism.
> * With different implementations of the data accessor, there may be more
> emphasis on having more IO threads. It is important that these IO threads are
> load-balanced. Currently, partitions are assigned to IO threads using a hash
> function. Hash functions tend not to balance load well with a small number of
> data points (partitions in this case).
> * Currently, out-of-core uses `BufferedInputStream` and
> `BufferedOutputStream` along with the default (de)serialization mechanism.
> The IO bandwidth achieved in the current implementation is low. One can
> simply use: 1) an Unsafe (de)serialization mechanism to optimize for memory
> bandwidth during the (de)serialization process, 2) RandomAccessFile's read and
> write interface to have lower-level access to the local file system and avoid
> overheads in reading/writing from/to local files.
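To illustrate the load-balancing point in the description, here is a minimal sketch comparing hash-based assignment of partitions to IO threads with a simple round-robin assignment. It is not taken from the patch under review. The class `IoThreadAssignment`, its methods, and the partition ids are hypothetical, and the ids are deliberately chosen as a contrived worst case (all multiples of the thread count) to show how a small set of partitions can pile up on one thread.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Giraph: counts partitions per IO thread. */
final class IoThreadAssignment {

  /** Load per thread when thread = hash(partitionId) mod numThreads. */
  static int[] hashAssignment(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int id : partitionIds) {
      // Integer.hashCode(int) returns the value itself; floorMod keeps the index non-negative.
      load[Math.floorMod(Integer.hashCode(id), numThreads)]++;
    }
    return load;
  }

  /** Load per thread when partitions are dealt out to threads in turn. */
  static int[] roundRobinAssignment(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int i = 0; i < partitionIds.length; i++) {
      load[i % numThreads]++;
    }
    return load;
  }

  public static void main(String[] args) {
    // Assumed scenario: this worker owns 6 partitions whose ids are all multiples of 4,
    // and it runs 4 IO threads.
    int[] partitionIds = {0, 4, 8, 12, 16, 20};
    System.out.println(Arrays.toString(hashAssignment(partitionIds, 4)));       // [6, 0, 0, 0]
    System.out.println(Arrays.toString(roundRobinAssignment(partitionIds, 4))); // [2, 2, 1, 1]
  }
}
```

With round-robin the per-thread counts differ by at most one regardless of the ids, whereas the hash-based result depends entirely on how the ids happen to fall. Whether the actual change uses round-robin or some other scheme is not stated in the description above.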