Thank you for confirming, Steve.

I removed the dependency of SPARK-20799 on SPARK-20901.

Bests,
Dongjoon.

From: Steve Loughran <ste...@hortonworks.com>
Date: Friday, June 2, 2017 at 4:42 AM
To: Dong Joon Hyun <dh...@hortonworks.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: Spark Issues on ORC


On 26 May 2017, at 19:02, Dong Joon Hyun <dh...@hortonworks.com> wrote:

Hi, All.

Today, while looking over JIRA issues for Spark 2.2.0 in Apache Spark,
I noticed that there are many unresolved community requests and related efforts
around `Feature parity for ORC with Parquet`.
Some examples I found are the following. I created SPARK-20901 to organize
these, although I'm not in a position to do this myself.
Please let me know if this is not a proper way in the Apache Spark community.
I think we can leverage or transfer the improvements made for Parquet in Spark.


SPARK-20799   Unable to infer schema for ORC on reading ORC from S3


Fixed that one for you by changing the title: SPARK-20799 Unable to infer schema 
for ORC/Parquet on S3N when secrets are in the URL

I'd recommend closing that as a WONTFIX, as it's related to some security work 
in HADOOP-3733 where Path.toString/toURI now strip out the AWS credentials, and 
since things get passed around as Path.toString(), the credentials are lost. Under the 
previous model, everything that logged a path would be logging AWS 
secrets, and the logs & exceptions weren't being treated as the sensitive 
documents they became the moment that happened.

It could count as a regression, but as it never worked when there was a "/" in 
the secret, it's always been a bit patchy.
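(For anyone hitting this: the usual workaround is to keep the secrets out of the URL entirely and supply them through the Hadoop configuration instead. A minimal core-site.xml sketch, assuming plain long-lived keys; the placeholder values are of course illustrative:)

```xml
<!-- core-site.xml: supply S3A credentials via configuration
     instead of embedding them in the s3a:// URL -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With this in place, paths can be written as plain `s3a://bucket/path` and nothing secret ever appears in a Path.toString().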

If this is really needed then it could be pushed back into Hadoop 2.8.2 but 
disabled by default unless you set some option like 
"fs.s3a.insecure.secrets.in.URL".

Maybe also (somehow) change it to only support AWS session token triples (id, 
session-secret, session-token), so that the damage caused by secrets in logs, 
bug reports &c. is less destructive.
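(For reference, S3A already supports such session-token triples via its temporary-credentials provider, configured along these lines; again, the placeholder values are illustrative:)

```xml
<!-- core-site.xml: short-lived STS session credentials for S3A -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SESSION_TOKEN</value>
</property>
```

A leaked session secret then expires on its own, which limits the blast radius of a secret ending up in a log or stack trace.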
