[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022807#comment-16022807 ]
Steve Loughran commented on SPARK-20799:
----------------------------------------

If what I think is happening is indeed happening, then it's the security tightening of HADOOP-3733 which has stopped this. It is sort of a regression, but as it has a security benefit ("stops leaking your secrets through logs") it's not something we want to revert. Anyway, it *never* worked if you had a "/" in your secret key, so the sole reason it worked for you in the past is that you don't (see: I now know something about your secret credentials :).

Hadoop 2.8 is much better for S3A support all round, so I'd encourage you to stay and play. In particular:
# switch from s3n:// to s3a:// in your URLs, to get the new high-performance client
# try setting {{fs.s3a.experimental.fadvise=random}} in your settings; you should see a significant speedup on ORC input

If the use case here is that you want separate credentials for a specific bucket, you can use per-bucket configuration now:
{code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}
Then when you refer to {{s3a://site-2/path}}, the bucket-specific key and secret are picked up. This is why you shouldn't need to use inline secrets at all.

> Unable to infer schema for ORC on reading ORC from S3
> -----------------------------------------------------
>
>                 Key: SPARK-20799
>                 URL: https://issues.apache.org/jira/browse/SPARK-20799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Jork Zijlstra
>
> We are getting the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC.
> It must be specified manually.
> {code}
> Combining the following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
>
> The problem is in {{PartitioningAwareFileIndex}}, in {{def allFiles()}}:
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data
> is read, and the schema cannot be inferred.
> Spark does log "S3xLoginHelper:90 - The Filesystem URI contains login
> details. This is insecure and may be unsupported in future.", but a warning
> should not mean that it stops working altogether.
>
> Workaround: move the AWS credentials from the path to the SparkSession:
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
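The lookup failure the reporter describes can be shown with a toy model: the file index is keyed by the path without credentials, while the lookup key still embeds them, so the {{.orElse}}/{{.getOrElse}} chain falls through to an empty result. This is a hypothetical illustration using plain Python dicts, not Spark's actual code; the bucket name and credential strings are made up:

```python
# Toy model of PartitioningAwareFileIndex.allFiles(): the index maps a leaf
# directory to its files, keyed by the path WITHOUT credentials.
leaf_dir_to_children_files = {
    "s3n://site-2/path": ["part-00000.orc", "part-00001.orc"],
}
leaf_files = {}

# The qualified path still embeds the inline credentials, so it is a
# different dictionary key entirely.
qualified_path = "s3n://ACCESSKEY:SECRET@site-2/path"

# Mirrors get(...).orElse(...).getOrElse(Array.empty): both lookups miss,
# so the result is empty and no schema can be inferred.
files = leaf_dir_to_children_files.get(
    qualified_path,
    leaf_files.get(qualified_path, []),
)
print(files)  # []

# With the credentials stripped from the key, the lookup succeeds again.
print(leaf_dir_to_children_files.get("s3n://site-2/path", []))
```

This is why moving the credentials out of the URL (into configuration) makes the two keys agree and the schema inference work.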
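Steve's suggestions can be combined in a single session builder. A minimal sketch in PySpark (the Scala builder takes the same {{.config}} calls), assuming Hadoop 2.8+ with the s3a connector on the classpath; the bucket name {{site-2}} and the key values are placeholders, not real credentials:

```python
from pyspark.sql import SparkSession

# Sketch only: per-bucket s3a credentials plus random-access fadvise, so no
# secrets ever appear in the URL itself. Values below are placeholders.
spark = (
    SparkSession.builder
    .appName("orc-on-s3a")
    # per-bucket credentials: applied only to s3a://site-2/...
    .config("spark.hadoop.fs.s3a.bucket.site-2.access.key", "MY_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.bucket.site-2.secret.key", "MY_SECRET_KEY")
    # optimise the input stream for columnar (ORC/Parquet) random access
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    .getOrCreate()
)

# Credentials stay out of the path, so the file-index keys and the
# qualified paths match and schema inference works.
df = spark.read.orc("s3a://site-2/path/to/table")
```

The {{spark.hadoop.}} prefix is how Spark forwards arbitrary Hadoop properties to the underlying filesystem client.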