[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426701#comment-17426701 ]
Colin Williams edited comment on SPARK-36936 at 10/9/21, 8:38 PM:
------------------------------------------------------------------

But the Spark 3.1.2 documentation [https://spark.apache.org/docs/3.1.2/cloud-integration.html] states:

<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  ...
</dependencyManagement>

For which I show that the artifact for 3.1.2 does not exist:

2021.10.09 13:34:47 INFO [error] (update) sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-hadoop-cloud_2.12:3.1.2
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] not found: /home/colin/.ivy2/local/org.apache.spark/spark-hadoop-cloud_2.12/3.1.2/ivys/ivy.xml
2021.10.09 13:34:47 INFO [error] not found: https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.pom
2021.10.09 13:34:47 INFO [error] not found: https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.po

was (Author: colin.williams):
But the Spark 3.1.2 documentation [https://spark.apache.org/docs/latest/cloud-integration.html] states:

<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  ...
</dependencyManagement>

For which I show that the artifact for 3.1.2 does not exist:

2021.10.09 13:34:47 INFO [error] (update) sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-hadoop-cloud_2.12:3.1.2
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] not found: /home/colin/.ivy2/local/org.apache.spark/spark-hadoop-cloud_2.12/3.1.2/ivys/ivy.xml
2021.10.09 13:34:47 INFO [error] not found: https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.pom
2021.10.09 13:34:47 INFO [error] not found: https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.po

> spark-hadoop-cloud broken on release and only published via 3rd party repositories
> ----------------------------------------------------------------------------------
>
>                  Key: SPARK-36936
>                  URL: https://issues.apache.org/jira/browse/SPARK-36936
>              Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>     Affects Versions: 3.1.1, 3.1.2
>         Environment: name := "spark-demo"
> version := "0.0.1"
> scalaVersion := "2.12.12"
>
> lazy val app = (project in file("app")).settings(
>   assemblyPackageScala / assembleArtifact := false,
>   assembly / assemblyJarName := "uber.jar",
>   assembly / mainClass := Some("com.example.Main"),
>   // more settings here ...
> )
>
> resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
>
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
> libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
>
> // test suite settings
> fork in Test := true
> javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
> // Show runtime of tests
> testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
>
> ___________________________________________________________________________________________
>
> import org.apache.spark.sql.SparkSession
>
> object SparkApp {
>   def main(args: Array[String]) {
>     val spark = SparkSession.builder().master("local")
>       //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
>       //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
>       .appName("spark session").getOrCreate
>     val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
>     val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
>     jsonDF.show()
>     csvDF.show()
>   }
> }
>
>             Reporter: Colin Williams
>             Priority: Major
>
> The Spark documentation suggests using `spark-hadoop-cloud` to read / write from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html]. However, artifacts are currently published only via 3rd party resolvers listed at [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud], including Cloudera and Palantir.
>
> So the Apache Spark documentation is effectively recommending a 3rd party solution for object stores, including S3.
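For reference, the Maven coordinates quoted from the cloud-integration docs correspond to the following sbt declaration — a sketch assuming `spark.version` is set to 3.1.2 — which resolves to exactly the coordinate the sbt log above reports as missing:

```scala
// build.sbt fragment (sketch): sbt equivalent of the documented Maven
// dependency, assuming spark.version = 3.1.2. Against Maven Central this
// asks for org.apache.spark:spark-hadoop-cloud_2.12:3.1.2, which is the
// artifact the resolver log above fails to find.
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.2" % Provided
```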
> Furthermore, if you follow the instructions and include one of the 3rd party jars, i.e. the Cloudera jar, with the Spark 3.1.2 release and try to access an object store, the following exception is returned:
>
> ```
> Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
>     at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
>     at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
>     at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>     at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>     at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
>     at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
> ```
>
> It looks like there are classpath conflicts when using the Cloudera-published `spark-hadoop-cloud` with Spark 3.1.2, again contradicting the documentation.
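The NoSuchMethodError above is a Guava version clash: the Hadoop 3.x S3A code calls a `Preconditions.checkArgument` overload that the older Guava on Spark's classpath does not provide. A possible mitigation — a sketch under the assumption that the conflicting classes come in via the application's uber jar, not a supported configuration — is to pin a newer Guava or shade Guava in the assembly:

```scala
// build.sbt fragment (sketch, hypothetical mitigation; versions are
// assumptions). Either pin a Guava new enough to have the 4-argument
// Preconditions.checkArgument overload that hadoop-aws calls...
dependencyOverrides += "com.google.guava" % "guava" % "27.0-jre"

// ...or shade Guava inside the uber jar with sbt-assembly, rewriting all
// references (including those from the bundled Hadoop classes) so they
// cannot collide with the Guava shipped in the Spark 3.1.2 distribution.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)
```

Shading only helps when the Hadoop cloud classes are packaged in the same assembly as the shaded Guava; classes loaded from the Spark distribution itself are unaffected.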
> So the documented `spark-hadoop-cloud` approach to using object stores is poorly supported: it is available only from 3rd party repositories, not from the released Apache Spark whose documentation refers to it.
>
> Perhaps one day Apache Spark will publish tested artifacts so that developers can quickly and easily access cloud object stores by following the documentation.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org