[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992985#comment-15992985 ]
Steve Loughran commented on SPARK-19582:
----------------------------------------

All Spark is doing is taking a URL to data, mapping that to an FS implementation classname, and expecting that class to implement the methods in `org.apache.hadoop.fs.FileSystem` so as to provide FS-like behaviour. Given that minio is nominally an S3 clone, it sounds like there's a problem here setting up the Hadoop S3A client to bind to it. I'd isolate that to the Hadoop code before going near Spark: test on Hadoop 2.8 and file bugs against Hadoop and/or minio if there are problems.

AFAIK, nobody has run the Hadoop S3A [tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] against minio; doing that and documenting how to configure the client would be a welcome contribution. If minio is 100% S3 compatible (v2/v4 auth + multipart PUT; encryption optional), then the S3A client should work with it... it could work as another integration test for minio.

> DataFrameReader conceptually inadequate
> ---------------------------------------
>
>                 Key: SPARK-19582
>                 URL: https://issues.apache.org/jira/browse/SPARK-19582
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system,
> object stores, JDBC, ...). This seems limiting in the long term, imposing
> both development costs to accept new sources and dependency issues for
> existing sources (how to coordinate the XX jar for internal use vs. the XX
> jar used by the application). Unless I have missed how this can be done
> currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the
> application provide data as a String (or char[] or ...)---not a path, but the
> actual data.
> Alternatively, provide interfaces or abstract classes the application could
> implement, letting the application handle external data sources without
> forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS S3.
> A few configuration/parameter values differ for minio, but one can use the
> AWS library in the application to connect to the minio server.
> When trying to use minio objects through Spark, the s3://xxx paths are
> intercepted by Spark and handed to Hadoop. So far, I have been unable to
> find the right combination of configuration values and parameters to
> "convince" Hadoop to forward the right information to work with minio. If I
> could read the minio object in the application, and then hand the object
> contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two
> problems.
> - Spark seems to have taken on the responsibility of "knowing" the API
> details of all data sources. This seems iffy in the long run (and is the
> root of my current problem). It seems unwise to assume that Spark should
> understand all possible path names, protocols, etc. Moreover, passing S3
> paths to Hadoop seems a little odd (why not go directly to AWS, for
> example). This particular confusion about S3 shows the difficulties that
> are bound to occur.
> - Second, Spark appears not to have a way to bypass the path name
> interpretation. At the least, Spark could provide a text/blob interface,
> letting the application supply the data object and avoid path interpretation
> inside Spark. Alternatively, Spark could accept a reader/stream/...
> to build the object, again letting the application provide the
> implementation of the object input.
> As I mentioned above, I might be missing something in the API that lets us
> work around the problem. I'll keep looking, but the API as apparently
> structured seems too limiting.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
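For reference, binding the Hadoop S3A connector to a non-AWS endpoint such as minio is normally done through Hadoop configuration rather than Spark itself. A minimal, untested sketch of a `core-site.xml` along the lines of the comment above; the endpoint URL and credentials are placeholder values, and `fs.s3a.path.style.access` requires Hadoop 2.8+:

```xml
<!-- core-site.xml: point the S3A client at a minio server instead of AWS.
     Endpoint, access key, and secret key below are placeholder values. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio.example.com:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>MINIO_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>MINIO_SECRET_KEY</value>
  </property>
  <!-- minio buckets are typically addressed path-style rather than
       virtual-host-style; this switch is available in Hadoop 2.8+ -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With such a configuration, paths would be written as `s3a://bucket/object` rather than `s3://...`, since the `s3a` scheme is what binds to the S3A connector. Whether this combination actually works against minio is exactly what running the S3A test suite, as suggested in the comment, would establish.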