Re: A new external catalog

2018-02-14 Thread Tayyebi, Ameen
Newbie question:

I want to add system/integration tests for the new functionality. There is a 
set of existing tests around Spark Catalog that I can leverage. Great. The 
provider I’m writing, though, is backed by a web service that is part of an AWS 
account. I can write the tests using a mocked client that somehow clones the 
behavior of the web service, but I’ll get the most value if I actually run the 
tests against a real AWS Glue account.
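
One way to keep both options open is to write the tests against a small client 
interface and choose the backing implementation at run time: an in-memory fake 
by default, and the real service only when someone opts in. The following is a 
minimal sketch in Java under stated assumptions. `CatalogClient`, 
`InMemoryCatalogClient` and the `GLUE_IT_ENABLED` flag are hypothetical names 
made up for illustration; none of them exist in Spark or the AWS SDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: tests talk to an interface, never to the web service directly. */
final class CatalogTestSupport {

    /** Assumed minimal set of operations a catalog test needs; not a real API. */
    interface CatalogClient {
        void createDatabase(String name, String description);
        Optional<String> getDatabaseDescription(String name);
        boolean dropDatabase(String name);
    }

    /** In-memory fake that mimics the service closely enough for unit tests. */
    static final class InMemoryCatalogClient implements CatalogClient {
        private final Map<String, String> databases = new HashMap<>();

        @Override
        public void createDatabase(String name, String description) {
            if (databases.containsKey(name)) {
                throw new IllegalStateException("Database already exists: " + name);
            }
            databases.put(name, description);
        }

        @Override
        public Optional<String> getDatabaseDescription(String name) {
            return Optional.ofNullable(databases.get(name));
        }

        @Override
        public boolean dropDatabase(String name) {
            if (!databases.containsKey(name)) {
                return false;
            }
            databases.remove(name);
            return true;
        }
    }

    /** True only when the (hypothetical) opt-in flag is set to "true". */
    static boolean integrationTestsEnabled(Map<String, String> env) {
        return "true".equalsIgnoreCase(env.get("GLUE_IT_ENABLED"));
    }

    /**
     * Returns the fake unless integration tests are explicitly enabled.
     * The real-service client is deliberately left out of this sketch.
     */
    static CatalogClient clientFor(Map<String, String> env) {
        if (integrationTestsEnabled(env)) {
            throw new UnsupportedOperationException("Wire in the real service client here");
        }
        return new InMemoryCatalogClient();
    }
}
```

A test suite would call `CatalogTestSupport.clientFor(System.getenv())`, so a 
normal build never needs AWS credentials, while a run with the flag set would 
exercise the real account.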

How do you guys deal with external dependencies for system tests? Is there an 
AWS account that is used for this purpose by any chance?

Thanks,
-Ameen

From: Steve Loughran 
Date: Tuesday, February 13, 2018 at 5:01 PM
To: "Tayyebi, Ameen" 
Cc: Apache Spark Dev 
Subject: Re: A new external catalog




On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Yes, I’m thinking about upgrading to these:

Kinesis client: 1.9.0
AWS SDK: 1.11.272

From:

Kinesis client: 1.7.3
AWS SDK: 1.11.76

272 is the earliest that has Glue.

How about I let the build system run the tests and if things start breaking I 
fall back to shading Glue’s specific SDK?


FWIW, some of the other trouble spots are not functional; they're log overflow:

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Cloudera collaborators and I are testing the shaded 1.11.271 JAR and will go 
with that into Hadoop 3.1 if we're happy. That's not so much for new features 
as for "stack traces throughout the log", which seems to be a recurrent issue 
with the JARs, and one which often slips by CI build runs. If it wasn't for 
that, we'd have stuck with 1.11.199, because it didn't have any issues that we 
hadn't already got under control 
(https://github.com/aws/aws-sdk-java/issues/1211).

Like I said: upgrades bring fear.


From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" <tayye...@amazon.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: A new external catalog





On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:


The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client, since Glue is a new service. So far, I 
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve 
made sure the version is in sync with the Kinesis client used by the 
spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc; the latest 
version says up front:

"Whatever problem you have, changing the AWS SDK version will not fix things, 
only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as 
upgrading Guava, especially if it's the unshaded version, which forces in a 
version of Jackson.

Which SDK version are you proposing? 1.11.x?




Re: A new external catalog

2018-02-14 Thread Tayyebi, Ameen
Thanks a lot, Steve. I’ll go through the JIRAs you linked in detail. I took a 
quick look and am sufficiently scared for now. I had run into that warning from 
the S3 stream before. Sigh.


Re: A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Yes, I’m thinking about upgrading to these:

Kinesis client: 1.9.0
AWS SDK: 1.11.272

From:

Kinesis client: 1.7.3
AWS SDK: 1.11.76

272 is the earliest that has Glue.

How about I let the build system run the tests and if things start breaking I 
fall back to shading Glue’s specific SDK?


A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Hello everyone,

For those of you not familiar with the AWS Glue Catalog, it’s a Hive Metastore 
implemented as a web service. The Glue service is composed of different 
components, but the one I’m interested in is the Catalog. Today, there’s a Hive 
metastore implementation, and you can plug the catalog into Spark as instructed 
here. Basically, the Hive metastore Java class is swapped with an 
implementation that calls into Glue’s web service.

I don’t like this implementation because:

  *   It puts Hive as a middleman between Spark and Glue
  *   It prevents Glue-specific implementations

As an example of the second issue, the Hive version embedded in Spark today 
does not support partition pruning for column types that are fractionals or 
timestamps. I have a pull request to fix this but, as rxin correctly pointed 
out, I would have to fake a new Hive version (called “Glue” or something 
similar) and put the fix under the Hive shim for it.
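
To make the pruning limitation concrete, here is a simplified sketch in Java of 
how a partition filter might be built for a metastore call. It assumes, for 
illustration only, that predicates can be pushed down for integral and string 
columns and are silently dropped for other types, so the caller falls back to 
fetching more partitions and filtering them itself. `PartitionFilterSketch`, 
`ColumnType` and `EqualsPredicate` are hypothetical names, not Spark or Hive 
classes, and the filter syntax is only an approximation.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of partition-filter pushdown limited to some column types. */
final class PartitionFilterSketch {

    enum ColumnType { INT, LONG, STRING, DOUBLE, TIMESTAMP }

    /** A single "column = literal" predicate on a partition column. */
    static final class EqualsPredicate {
        final String column;
        final ColumnType type;
        final String literal;

        EqualsPredicate(String column, ColumnType type, String literal) {
            this.column = column;
            this.type = type;
            this.literal = literal;
        }
    }

    /** Assumption for this sketch: only these types can be pushed to the metastore. */
    private static final Set<ColumnType> PUSHABLE =
        EnumSet.of(ColumnType.INT, ColumnType.LONG, ColumnType.STRING);

    /**
     * Builds a filter string from the pushable predicates only.
     * Predicates on other types are skipped, which means no pruning happens for them.
     * String literals are quoted naively; escaping is out of scope here.
     */
    static String toMetastoreFilter(List<EqualsPredicate> predicates) {
        List<String> parts = new ArrayList<>();
        for (EqualsPredicate p : predicates) {
            if (!PUSHABLE.contains(p.type)) {
                continue; // fractional or timestamp column: not pushed down
            }
            String value = p.type == ColumnType.STRING ? "\"" + p.literal + "\"" : p.literal;
            parts.add(p.column + " = " + value);
        }
        return String.join(" and ", parts);
    }
}
```

With predicates on an INT `year`, a DOUBLE `price` and a STRING `region`, only 
the `year` and `region` conditions end up in the filter string; the `price` 
condition has to be evaluated after the partitions are listed.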

I have locally implemented a version of ExternalCatalog on top of Glue and 
would like to productionize it and submit it as a pull request. You can set the 
spark.catalog.implementation config to “glue” and then it will use Glue instead 
of either the in-memory catalog or Hive.
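
As a rough illustration of that kind of switch, the sketch below picks a 
catalog implementation from a configuration value. It is written in Java and 
does not use Spark's actual classes: `CatalogRegistry`, `SimpleCatalog` and the 
three stub catalogs are hypothetical, and defaulting to the in-memory catalog 
when the value is unset is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: choose a catalog implementation from a config value. */
final class CatalogRegistry {

    /** Stand-in for a catalog interface; each stub only reports its own name. */
    interface SimpleCatalog {
        String name();
    }

    private static final Map<String, Supplier<SimpleCatalog>> FACTORIES = new LinkedHashMap<>();

    static {
        // Each factory returns a stub catalog; real implementations would go here.
        FACTORIES.put("in-memory", () -> () -> "in-memory");
        FACTORIES.put("hive", () -> () -> "hive");
        FACTORIES.put("glue", () -> () -> "glue");
    }

    /** Resolves a config value such as "glue" to a catalog, rejecting unknown values. */
    static SimpleCatalog forConfigValue(String value) {
        // Assumption: an unset value falls back to the in-memory catalog.
        String key = value == null ? "in-memory" : value.trim().toLowerCase(Locale.ROOT);
        Supplier<SimpleCatalog> factory = FACTORIES.get(key);
        if (factory == null) {
            throw new IllegalArgumentException(
                "Unknown catalog implementation '" + value + "', expected one of " + FACTORIES.keySet());
        }
        return factory.get();
    }
}
```

Keeping the mapping in one place means adding a new backend is a single extra 
entry, and a misspelled value fails fast with the list of accepted names.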

Rudimentary tests are promising, and I can hook up Parquet tables directly 
without going through Hive at all. I really need this because I need to fix a 
data consistency issue with the InsertIntoHiveTable command when data is backed 
by S3, but that is a different topic.

The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client, since Glue is a new service. So far, I 
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve 
made sure the version is in sync with the Kinesis client used by the 
spark-streaming module.

Are there any objections to this? Any guidance around upgrading the AWS client? 
Who would be a good person to review this pull request?

Thanks,
-Ameen