[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017258#comment-16017258 ]

Steve Loughran commented on SPARK-20799:
----------------------------------------

bq. Spark does output the warning {{S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.}}, but this should not mean that it stops working.

It probably will stop working at some point in the future, as putting secrets in 
URIs is too dangerous: everything logs them on the assumption that they aren't 
sensitive data. The {{S3xLoginHelper}} not only warns you, it makes a best-effort 
attempt to strip the secrets from the public URI, hence the logs and the messages 
telling you off.

Prior to Hadoop 2.8, the sole *defensible* use case for secrets in URIs was that 
it was the only way to have different logins on different buckets. In Hadoop 2.8 
we added the ability to configure any of the {{fs.s3a.}} options on a per-bucket 
basis, including the secret logins, endpoints, and other important values.
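
As a sketch of how that looks from Spark (the bucket name {{mybucket}} and the 
env vars are placeholders; assumes Hadoop 2.8+ on the classpath):

{code}
import org.apache.spark.sql.SparkSession

// spark.hadoop.* settings are passed through to the Hadoop configuration,
// so per-bucket S3A secrets never have to appear in the URI:
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.bucket.mybucket.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.bucket.mybucket.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()
{code}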

I see what may be happening, in which case it probably constitutes a Hadoop 
regression: when the filesystem's URI is converted to a string, the login details 
are stripped, so anything that goes path -> URI -> String -> path will lose the 
secrets.

If you are seeing this stack trace, it means you are using Hadoop 2.8 or 
something else with the HADOOP-3733 patch in it. What version of Hadoop (or 
HDP, CDH, ...) are you using? If it is based on the full Apache 2.8 release, you 
get:

# per-bucket config to allow you to [configure each bucket separately|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
# the ability to use JCEKS files to keep the secrets out of the configs (see the sketch after this list)
# session token support.
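
For the JCEKS route, a minimal sketch (the store location is a placeholder; the 
store itself would be created beforehand with the {{hadoop credential create}} 
CLI, holding {{fs.s3a.access.key}} and {{fs.s3a.secret.key}} entries):

{code}
import org.apache.spark.sql.SparkSession

// Point Hadoop at a JCEKS credential store so no secret lives in configs or URIs:
val spark = SparkSession.builder()
  .config("spark.hadoop.hadoop.security.credential.provider.path",
    "jceks://hdfs@namenode/user/example/s3.jceks")  // placeholder location
  .getOrCreate()
{code}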

Accordingly, if you state the version, I may be able to look at what's happening 
in a bit more detail.


> Unable to infer schema for ORC on reading ORC from S3
> -----------------------------------------------------
>
>                 Key: SPARK-20799
>                 URL: https://issues.apache.org/jira/browse/SPARK-20799
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Jork Zijlstra
>
> We are getting the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.
> {code}
> Combining the following factors will cause it (a hypothetical reproduction follows this list):
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
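> A minimal sketch of such a read (the bucket name and credentials are placeholders):
> {code}
> // given an active SparkSession `spark`
> val df = spark.read.orc("s3n://ACCESSKEY:SECRET@bucket/orc-table/")
> // => org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC.
> //    It must be specified manually.
> {code}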
> The problem is in {{PartitioningAwareFileIndex.allFiles()}}:
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>           .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>           .getOrElse(Array.empty)
> {code}
> {{leafDirToChildrenFiles}} uses the path WITHOUT credentials as its key, while 
> the {{qualifiedPath}} contains the path WITH credentials.
> So {{leafDirToChildrenFiles.get(qualifiedPath)}} doesn't find any files, no 
> data is read, and the schema cannot be inferred.
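> A minimal sketch of the key mismatch (paths are hypothetical):
> {code}
> import org.apache.hadoop.fs.Path
>
> val withCreds    = new Path("s3n://ACCESSKEY:SECRET@bucket/orc-table")
> val withoutCreds = new Path("s3n://bucket/orc-table")
> val leafDirToChildrenFiles = Map(withoutCreds -> Array("part-00000.orc"))
>
> leafDirToChildrenFiles.get(withCreds)  // None: the userinfo makes the keys unequal
> {code}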
> Spark does output the warning {{S3xLoginHelper:90 - The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future.}}, but this 
> should not mean that it stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>       .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>       .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
>       .getOrCreate()
> {code}
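> With the credentials in the session config, the read path no longer needs to 
> embed them ({{bucket/orc-table}} is a placeholder):
> {code}
> val df = spark.read.orc("s3n://bucket/orc-table")  // no credentials in the URI
> {code}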


