[ 
https://issues.apache.org/jira/browse/SINGA-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei resolved SINGA-97.
--------------------------
    Resolution: Fixed

HDFS support (i.e., read/write functions against files in HDFS) could be added 
in the next version.

> SINGA-97 Add HDFS Store 
> ------------------------
>
>                 Key: SINGA-97
>                 URL: https://issues.apache.org/jira/browse/SINGA-97
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Anh Dinh
>            Assignee: Anh Dinh
>
> This ticket implements an HDFS Store for reading data from HDFS. It complements
> the existing CSV Store, which reads data from a CSV file. HDFS is a popular
> distributed file system with high (sequential) I/O throughput, so
> supporting it is necessary for SINGA to scale.
> The implementation will extend the `singa::io::Store` class, which is declared in
> `singa/io/store.h`. In particular, it will support the following I/O
> operations:
> + `bool Open(string& file, Mode mode)`
> + `bool Close()`
> + `bool Flush()`
> + `int Seek(int record_idx)`
> + `int Read(string *content)`
> + `int Write(string& content)`
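
The six operations listed above could be sketched roughly as follows. This is only an illustration, not the code in `singa/io/store.h`: the `Mode` values, the meaning of the `int` return codes, and the in-memory `MemStore` backend are all assumptions, and the string parameters are taken by const reference so that literals can be passed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for singa::io::Store; the real declaration is in
// singa/io/store.h and may differ.
enum Mode { kRead, kCreate };  // mode names are assumptions

class Store {
 public:
  virtual ~Store() {}
  virtual bool Open(const std::string& file, Mode mode) = 0;
  virtual bool Close() = 0;
  virtual bool Flush() = 0;
  virtual int Seek(int record_idx) = 0;                // assumed: new index, -1 on error
  virtual int Read(std::string* content) = 0;          // assumed: 1 if read, 0 at end
  virtual int Write(const std::string& content) = 0;   // assumed: 1 on success, 0 on failure
};

// In-memory backend used only to exercise the interface. An HDFS backend
// would replace the vector with calls into an HDFS client library.
class MemStore : public Store {
 public:
  bool Open(const std::string& /*file*/, Mode mode) override {
    mode_ = mode;
    pos_ = 0;
    open_ = true;
    if (mode == kCreate) records_.clear();  // creating truncates old content
    return true;
  }
  bool Close() override {
    bool was_open = open_;
    open_ = false;
    return was_open;
  }
  bool Flush() override { return open_; }  // nothing is buffered in memory
  int Seek(int record_idx) override {
    if (!open_ || record_idx < 0 ||
        static_cast<std::size_t>(record_idx) > records_.size()) {
      return -1;
    }
    pos_ = static_cast<std::size_t>(record_idx);
    return record_idx;
  }
  int Read(std::string* content) override {
    if (!open_ || mode_ != kRead || pos_ >= records_.size()) return 0;
    *content = records_[pos_++];
    return 1;
  }
  int Write(const std::string& content) override {
    if (!open_ || mode_ != kCreate) return 0;
    records_.push_back(content);
    return 1;
  }

 private:
  std::vector<std::string> records_;
  std::size_t pos_ = 0;
  Mode mode_ = kRead;
  bool open_ = false;
};

// Usage: write three records, reopen for reading, and fetch the second one.
std::string WriteThenReadSecond() {
  MemStore store;
  store.Open("demo", kCreate);
  store.Write("rec0");
  store.Write("rec1");
  store.Write("rec2");
  store.Close();
  store.Open("demo", kRead);
  store.Seek(1);
  std::string out;
  store.Read(&out);
  store.Close();
  return out;
}
```

Seeking by record index rather than by byte offset is what lets a worker start in the middle of a file without knowing how records map onto HDFS blocks.
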
> HDFS usage in SINGA differs from that in standard MapReduce applications.
> Specifically, each SINGA worker may train on sequences of records that do
> not lie within block boundaries, whereas in MapReduce each Mapper processes a
> number of complete blocks. In MapReduce, the runtime engine may fetch and
> cache an entire block over the network, knowing that the block will be
> processed in full. In SINGA, such a pre-fetching and caching strategy would be
> sub-optimal because it wastes I/O and network bandwidth on data records that
> are not used.
> We defer I/O optimization to a future ticket. 
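
To put a number on the waste described above, here is a small back-of-the-envelope helper. It is a sketch under assumptions, not part of the ticket: the name `WholeBlockFetchCost` is made up, and it assumes every block touched by a worker's byte range is fetched in full and that the file extends at least to the end of the last touched block.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved versus bytes actually consumed when whole blocks are fetched.
struct FetchCost {
  std::int64_t fetched;  // bytes transferred if every touched block is fetched whole
  std::int64_t used;     // bytes the worker actually reads
  std::int64_t wasted;   // fetched - used
};

// Cost of serving the byte range [begin, end) by fetching complete blocks.
// Hypothetical helper for illustration only.
FetchCost WholeBlockFetchCost(std::int64_t begin, std::int64_t end,
                              std::int64_t block_size) {
  assert(block_size > 0 && begin >= 0 && begin <= end);
  if (begin == end) return {0, 0, 0};  // empty range touches no block
  const std::int64_t first_block = begin / block_size;
  const std::int64_t last_block = (end - 1) / block_size;  // block of the last byte
  const std::int64_t fetched = (last_block - first_block + 1) * block_size;
  const std::int64_t used = end - begin;
  return {fetched, used, fetched - used};
}
```

For example, with an assumed block size of 128 MiB, a worker that reads the range from 100 MiB to 160 MiB touches two blocks: 256 MiB would be fetched for 60 MiB of records, so 196 MiB is wasted. A range that starts and ends on block boundaries wastes nothing, which is the MapReduce case.
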
> For the implementation, we choose `libhdfs3` from Pivotal as the C++ HDFS
> client. This library is built natively for C++, hence it is more optimized
> and easier to deploy than the original `libhdfs` library that is shipped
> with Hadoop. Finally, we test the implementation in a distributed environment
> set up from a number of Docker containers (see SINGA-11).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
