[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709401#comment-15709401 ]
Christian Kadner commented on BAHIR-67:
---------------------------------------

Hi Sourav,

as I understand your code, the problems in the Hadoop client code that you are trying to work around are (a) the user authentication (properties) and (b) making sure the Knox gateway path segment is included in the HTTP(S) URLs produced by the WebHDFS file system client code. Most of the remaining code in your connector is a duplication or close adaptation of the Spark CSV code (parser, reader, writer, ...).

Would it make sense to instead only override the class org.apache.hadoop.hdfs.web.WebHdfsFileSystem and provide our own implementation of it via the property fs.webhdfs.impl? This custom "BahirWebHdfsFileSystem" implementation could take care of the authentication (properties) and of injecting the Knox gateway path segment into the HTTP(S) URLs sent to the remote Hadoop cluster. Ideally this would be done in a configurable way that could be applied to other types of secured Hadoop systems besides Apache Knox.

> WebHDFS Data Source for Spark SQL
> ---------------------------------
>
>                 Key: BAHIR-67
>                 URL: https://issues.apache.org/jira/browse/BAHIR-67
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Ability to read/write data in Spark from/to the HDFS of a remote Hadoop cluster.
> In today's world of analytics, many use cases need the capability to access data from multiple remote data sources in Spark. Though Spark integrates well with a local Hadoop cluster, it largely lacks the capability to connect to a remote Hadoop cluster. In reality, however, not all enterprise data resides in a single Hadoop cluster, and running the Spark cluster co-located with the Hadoop cluster is not always a solution.
> In this improvement we propose to create a connector for reading and writing data from/to the HDFS of a remote Hadoop cluster from Spark, using the WebHDFS API.
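The override proposed in the comment above could be sketched roughly as follows. This is only an illustration, not committed code: the class "BahirWebHdfsFileSystem" does not exist yet, the authentication property keys are placeholders, and the hook for rewriting request URLs (to inject the Knox gateway path segment) is version-dependent in the Hadoop client; only the fs.webhdfs.impl configuration mechanism is standard Hadoop behavior.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

// Sketch only: the class name and the property keys below are placeholders,
// not real Hadoop or Bahir identifiers.
public class BahirWebHdfsFileSystem extends WebHdfsFileSystem {

    @Override
    public synchronized void initialize(URI uri, Configuration conf) throws IOException {
        // Pick up user credentials from the Hadoop configuration before the
        // stock WebHDFS client sets up its connection handling.
        String user = conf.get("webhdfs.auth.user");         // hypothetical key
        String password = conf.get("webhdfs.auth.password"); // hypothetical key
        // ... wire user/password into the HTTP authentication here, and
        // arrange for the Knox gateway path segment to be injected into
        // outgoing request URLs (the URL-construction hook varies across
        // Hadoop client versions) ...
        super.initialize(uri, conf);
    }
}
```

A client would then activate it for the webhdfs:// scheme via Hadoop's standard per-scheme override, e.g. conf.set("fs.webhdfs.impl", BahirWebHdfsFileSystem.class.getName()), without any changes to the Spark CSV read/write path.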
-- This message was sent by Atlassian JIRA (v6.3.4#6332)