[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992985#comment-15992985 ]
Steve Loughran commented on SPARK-19582:
----------------------------------------

All Spark is doing is taking a URL to data, mapping that to an FS implementation classname, and expecting that class to implement the methods in `org.apache.hadoop.fs.FileSystem` so as to provide FS-like behaviour. Given that minio is nominally an S3 clone, it sounds like there's a problem here setting up the Hadoop S3A client to bind to it. I'd isolate that to the Hadoop code before going near Spark: test on Hadoop 2.8 and file bugs against Hadoop and/or minio if there are problems.

AFAIK, nobody has run the Hadoop S3A [tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] against minio; doing that and documenting how to configure the client would be a welcome contribution. If minio is 100% S3 compatible (v2/v4 auth + multipart PUT; encryption optional), then the S3A client should work with it... it could work as another integration test for minio.

> DataFrameReader conceptually inadequate
> ---------------------------------------
>
>                 Key: SPARK-19582
>                 URL: https://issues.apache.org/jira/browse/SPARK-19582
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system,
> object stores, JDBC, ...). This seems limiting in the long term, imposing
> both development costs to accept new sources and dependency issues for
> existing sources (how to coordinate the XX jar for internal use vs. the XX
> jar used by the application). Unless I have missed how this can be done
> currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the
> application provide data as a String (or char[] or ...)---not a path, but the
> actual data.
> Alternatively, provide interfaces or abstract classes the application could
> implement, letting the application handle external data sources without
> forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS S3.
> A few configuration/parameter values differ for minio, but one can use the
> AWS library in the application to connect to the minio server.
> When trying to use minio objects through Spark, the s3://xxx paths are
> intercepted by Spark and handed to Hadoop. So far, I have been unable to
> find the right combination of configuration values and parameters to
> "convince" Hadoop to forward the right information to work with minio. If I
> could read the minio object in the application, and then hand the object
> contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two
> problems.
> - Spark seems to have taken on the responsibility of "knowing" the API
> details of all data sources. This seems iffy in the long run (and is the
> root of my current problem). It seems unwise to assume that Spark should
> understand all possible path names, protocols, etc. Moreover, passing S3
> paths to Hadoop seems a little odd (why not go directly to AWS, for
> example). This particular confusion about S3 shows the difficulties that
> are bound to occur.
> - Second, Spark appears not to have a way to bypass the path name
> interpretation. At the least, Spark could provide a text/blob interface,
> letting the application supply the data object and avoid path interpretation
> inside Spark. Alternatively, Spark could accept a reader/stream/...
> to build the object, again letting the application provide the
> implementation of the object input.
> As I mentioned above, I might be missing something in the API that lets us
> work around the problem. I'll keep looking, but the API as apparently
> structured seems too limiting.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
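For reference, binding the Hadoop S3A connector to a non-AWS endpoint such as minio is normally done through Hadoop configuration rather than Spark itself. A minimal, untested sketch of a `core-site.xml` along the lines of the comment above; the endpoint URL and credentials are placeholder values, and `fs.s3a.path.style.access` requires Hadoop 2.8+:

```xml
<!-- core-site.xml: point the S3A client at a minio server instead of AWS.
     Endpoint, access key, and secret key below are placeholder values. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio.example.com:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>MINIO_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>MINIO_SECRET_KEY</value>
  </property>
  <!-- minio buckets are typically addressed path-style rather than
       virtual-host-style; this switch is available in Hadoop 2.8+ -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With such a configuration, paths would be written as `s3a://bucket/object` rather than `s3://...`, since the `s3a` scheme is what binds to the S3A connector. Whether this combination actually works against minio is exactly what running the S3A test suite, as suggested in the comment, would establish.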