[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709401#comment-15709401 ]

Christian Kadner commented on BAHIR-67:
---------------------------------------

Hi Sourav,

As I understand your code, the problems in the Hadoop client code that you are 
trying to work around are (a) user authentication (properties) and (b) making 
sure the Knox gateway path segment is included in the HTTP(S) URLs produced by 
the WebHDFS file system client code.

Most of the remaining code in your connector is a duplication or close 
adaptation of the Spark CSV code (parser, reader, writer, ...).

Would it make sense to instead override only the class 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem and provide our own implementation 
via the property fs.webhdfs.impl? This custom "BahirWebHdfsFileSystem" 
implementation could take care of injecting the authentication (properties) and 
the Knox gateway path segment into the HTTP(S) URLs sent to the remote Hadoop 
cluster, ideally in a configurable way that could be applied to other types of 
secured Hadoop systems besides Apache Knox.
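
To make the suggestion concrete, below is a minimal sketch (in Scala) of what 
such an override could look like. The property names bahir.webhdfs.user and 
bahir.webhdfs.password, as well as the gateway-path handling, are illustrative 
assumptions, not a worked-out design:

    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hdfs.web.WebHdfsFileSystem

    // Sketch only: the property names and the gateway-path handling are
    // illustrative assumptions, not an agreed-upon design.
    class BahirWebHdfsFileSystem extends WebHdfsFileSystem {

      override def initialize(uri: URI, conf: Configuration): Unit = {
        super.initialize(uri, conf)
        // Hypothetical connector properties for user authentication, read
        // from the Hadoop configuration instead of negotiated via SPNEGO.
        val user     = conf.get("bahir.webhdfs.user")     // assumption
        val password = conf.get("bahir.webhdfs.password") // assumption
        // ... wire these credentials into the HTTP connection setup ...
      }

      // Injecting the Knox gateway path segment (e.g. "/gateway/default")
      // means rewriting every request URL. In Hadoop 2.x that URL assembly
      // happens in WebHdfsFileSystem#toUrl, which is package-private, so
      // the override may have to live in the org.apache.hadoop.hdfs.web
      // package.
    }

Registering the class on the driver, e.g. with 
sc.hadoopConfiguration.set("fs.webhdfs.impl", 
classOf[BahirWebHdfsFileSystem].getName), would then make it the handler for 
all webhdfs:// URIs.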

> WebHDFS Data Source for Spark SQL
> ---------------------------------
>
>                 Key: BAHIR-67
>                 URL: https://issues.apache.org/jira/browse/BAHIR-67
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Ability to read/write data in Spark from/to HDFS of a remote Hadoop cluster.
> In today's world of analytics, many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark integrates 
> well with a local Hadoop cluster, it is sorely lacking the capability to 
> connect to a remote Hadoop cluster. In reality, however, not all enterprise 
> data resides in a single Hadoop cluster, and running the Spark cluster 
> co-located with the Hadoop cluster is not always a viable solution.
> In this improvement we propose to create a connector for reading and writing 
> data from/to HDFS of a remote Hadoop cluster from Spark, using the WebHDFS 
> API.
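
If the custom file system approach suggested in the comment above pans out, 
the connector could reuse Spark's stock CSV source instead of duplicating it. 
A hypothetical usage sketch (host, port, and path are placeholders, and 
BahirWebHdfsFileSystem is the class sketched above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("remote-webhdfs").getOrCreate()

    // Route webhdfs:// URIs through the custom implementation (assumption:
    // a BahirWebHdfsFileSystem class as sketched above is on the classpath).
    spark.sparkContext.hadoopConfiguration
      .set("fs.webhdfs.impl", classOf[BahirWebHdfsFileSystem].getName)

    // Read CSV data from the remote cluster with the built-in CSV source.
    val df = spark.read
      .option("header", "true")
      .csv("webhdfs://remote-gateway-host:8443/user/demo/data.csv")

    df.show()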



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
