[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...
Github user piaozhexiu closed the pull request at:

    https://github.com/apache/spark/pull/8512

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user piaozhexiu commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-207493828

    @srowen sure, done!
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-207380286

    @piaozhexiu close this in favor of https://github.com/apache/spark/pull/11242 it seems?
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r47081031

    --- Diff: core/pom.xml ---
    @@ -40,6 +40,16 @@ ${avro.mapred.classifier}
    +    <dependency>
    +      <groupId>com.amazonaws</groupId>
    +      <artifactId>aws-java-sdk-s3</artifactId>
    +      <version>${aws.java.sdk.version}</version>
    +    </dependency>

    --- End diff --

    1. Shouldn't the version declaration go into /pom.xml, with /core/pom.xml just declaring its use?
    2. Be aware that Hadoop branch-2 is on v1.10.6; HADOOP-12269. Between 1.7.4 and 1.10.6, Amazon changed the signature of one of the methods (int -> long on multipart upload). This means that s3a is very brittle about libraries, and some shading may be wise here. Or be in sync with the Hadoop 2.7 version and say "Don't use s3a on Hadoop 2.6.x", which is something the Hadoop team would concur with.
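Steve's first point amounts to pinning the SDK version once in the parent build. A hedged sketch of that layout, assuming the `aws.java.sdk.version` property from the diff above and standard Maven `dependencyManagement` wiring (the exact section placement in Spark's poms is an assumption):

```xml
<!-- Root /pom.xml: pin the SDK version once via dependencyManagement. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk-s3</artifactId>
      <version>${aws.java.sdk.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- core/pom.xml: declare use of the dependency, inheriting the version. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-s3</artifactId>
</dependency>
```

With this split, bumping the SDK (e.g. to match Hadoop's 1.10.6) is a one-line change in the root pom.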
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r47081146

    --- Diff: core/src/test/scala/org/apache/spark/deploy/SparkS3UtilSuite.scala ---
    @@ -0,0 +1,90 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.deploy
    +
    +import java.util.Date
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.hadoop.fs.{FileStatus, Path}
    +import org.apache.hadoop.mapred.{InputSplit, FileInputFormat, JobConf}
    +
    +import org.apache.spark.SparkFunSuite
    +
    +class SparkS3UtilSuite extends SparkFunSuite {
    +  test("s3ListingEnabled function") {
    +    val jobConf = new JobConf()
    +
    +    // Disabled by user
    +    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "false")
    +    FileInputFormat.setInputPaths(jobConf, "s3://bucket/dir/file")
    +    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
    +
    +    // Input paths contain wildcards
    +    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "true")
    +    FileInputFormat.setInputPaths(jobConf, "s3://bucket/dir/*")
    +    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
    +
    +    // Input paths contain non-S3 files
    +    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "true")
    +    FileInputFormat.setInputPaths(jobConf, "file://dir/file")
    +    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
    +  }
    +
    +  test("isSplitable function") {

    --- End diff --

    go on, spell Splittable correctly, even if the mis-spelling is hard-coded in the Hadoop API
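For readers outside the Hadoop code base: `isSplitable` (the misspelling is Hadoop's) decides whether an input file may be divided into multiple splits. A minimal, Hadoop-free sketch of that decision, with illustrative suffix sets only (the real check goes through `CompressionCodecFactory` and `SplittableCompressionCodec`, per the imports in the class under test):

```scala
// Sketch only: a file is treated as splittable unless its suffix maps to a
// codec known not to support splitting. The suffix set is an assumption for
// illustration; gzip and raw snappy streams cannot be split, bzip2 can.
object SplittableCheck {
  private val unsplittableSuffixes = Set(".gz", ".snappy")

  def isSplittable(fileName: String): Boolean = {
    val lower = fileName.toLowerCase
    !unsplittableSuffixes.exists(lower.endsWith)
  }
}
```

An uncompressed `part-00000` is splittable; `logs.gz` is not; `data.bz2` is, because bzip2 is a block-oriented, splittable format.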
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r47081737

    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
    @@ -0,0 +1,342 @@
    +package org.apache.spark.deploy
    +
    +import java.net.URI
    +import java.util
    +
    +import scala.collection.JavaConverters._
    +import scala.collection.mutable.ArrayBuffer
    +
    +import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
    +import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
    +import com.amazonaws.internal.StaticCredentialsProvider
    +import com.amazonaws.services.s3.AmazonS3Client
    +import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
    +
    +import com.google.common.base.{Preconditions, Strings}
    +import com.google.common.cache.{Cache, CacheBuilder}
    +import com.google.common.collect.AbstractSequentialIterator
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
    +import org.apache.hadoop.fs.s3.S3Credentials
    +import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
    +import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
    +
    +import org.apache.spark.{Logging, SparkEnv}
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: DeveloperApi ::
    + * Contains util methods to interact with S3 from Spark.
    + */
    +@DeveloperApi
    +object SparkS3Util extends Logging {
    +  val sparkConf = SparkEnv.get.conf
    +  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
    +
    +  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
    +    .newBuilder
    +    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
    +    .build[String, AmazonS3Client]
    +
    +  // Flag to enable S3 bulk listing. It is true by default.
    +  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
    +
    +  // Properties for AmazonS3Client. Default values should just work most of the time.
    +  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
    +  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
    +  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
    +  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
    +  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
    +  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
    +
    +  // Ignore hidden files whose names start with "_" or ".", or end with "$folder$".
    +  private val hiddenFileFilter = new PathFilter() {
    +    override def accept(p: Path): Boolean = {
    +      val name: String = p.getName()
    +      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
    +    }
    +  }
    +
    +  /**
    +   * Initialize AmazonS3Client per bucket. Since access permissions might differ from bucket
    +   * to bucket, it is necessary to initialize AmazonS3Client on a per-bucket basis.
    +   */
    +  private def getS3Client(bucket: String): AmazonS3Client = {
    +    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
    +    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
    +    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
    +    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
    +    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
    +    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
    +
    +    val clientConf: ClientConfiguration = new ClientConfiguration
    +    clientConf.setMaxErrorRetry(maxErrorRetries)
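The `hiddenFileFilter` in the diff above reduces to a plain string predicate. Here it is extracted as a sketch (not the PR's code) so the rule can be exercised without Hadoop's `Path` on the classpath:

```scala
// Same rule as the quoted hiddenFileFilter: names starting with "_" or "."
// or ending with "$folder$" (an S3 directory-marker convention) are hidden.
object HiddenFiles {
  def isVisible(name: String): Boolean =
    !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
}
```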
Github user piaozhexiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r46992201

    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
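The per-bucket client idea from `getS3Client` can be sketched without the AWS SDK: settings are read once with defaults matching the quoted `spark.s3.*` properties, and one client per bucket is created lazily and reused. `FakeClient` and `ClientCache` are hypothetical stand-ins for `AmazonS3Client` and the Guava cache used in the diff:

```scala
import scala.collection.mutable

// Stand-in for AmazonS3Client; holds only one illustrative setting.
final case class FakeClient(bucket: String, maxConnections: Int)

// Sketch of the per-bucket cache: since access permissions may differ per
// bucket, a client is built lazily for each bucket and then reused.
class ClientCache(conf: Map[String, String]) {
  private val cache = mutable.Map.empty[String, FakeClient]

  private def intConf(key: String, default: Int): Int =
    conf.get(key).map(_.toInt).getOrElse(default)

  def clientFor(bucket: String): FakeClient = synchronized {
    cache.getOrElseUpdate(
      bucket,
      // 5 is the same default the quoted diff uses for max connections.
      FakeClient(bucket, intConf("spark.s3.max.connections", 5)))
  }
}
```

Repeated calls for the same bucket return the identical cached instance, which is the property the Guava `Cache` in the real code provides.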
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r46991325

    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8512#discussion_r46992376

    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162974033

    Has anyone looked at the performance of this versus s3a in Hadoop 2.7+? Because while I do agree this will dramatically improve s3n: and s3: perf, all ongoing Hadoop work is on the s3a FS, with s3n left alone on the grounds that every upgrade of jets3t or other change breaks things. S3a does use {{ListRequest}} and I'd expect it to not only list faster, but have faster reads too.

    That doesn't mean this patch won't be useful: if anyone still uses s3: it'll be essential (there's no maintenance going on there), and the code here will also benefit Hadoop <= 2.6. It's just for 2.7+ I would say "use s3a and be done with it". That said, there's lots of work on s3a which remains to be looked at, especially in lazy seeks.

    What could be very useful for the Hadoop team here is some tests for Spark using S3, so as to catch regressions in functionality, performance, and scale:

    1. Measure that ls() performance. Maybe we can find/get someone to create an S3 store pre-populated with many files.
    2. Look at the costs of read + seek + close on big files. [HADOOP-12376](https://issues.apache.org/jira/browse/HADOOP-12376) turned out to be a surprise there: if you close() a multi-GB file 3 bytes in, that close() still completes. Again, having some public reference files would aid testing here.
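The ls() measurement suggested in point 1 needs nothing more than a timing wrapper around whichever listing call is under test. A minimal sketch (the operation being timed is a placeholder, not any real listing API):

```scala
// Generic timing harness: runs the by-name operation once and returns its
// result together with the elapsed wall-clock time in milliseconds.
object ListTimer {
  def timeMillis[A](op: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = op
    (result, (System.nanoTime() - start) / 1000000L)
  }
}
```

Usage would be `ListTimer.timeMillis(fs.listStatus(path))` against a bucket pre-populated with many keys, run once per listing implementation being compared.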
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162996871

    Test FAILed. Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47360/
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162996870

    Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162996514

    **[Test build #47360 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47360/consoleFull)** for PR 8512 at commit [`4ec276b`](https://github.com/apache/spark/commit/4ec276b5ac1fc023087fea2aefb1278cf8a33e80).
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162996866

    **[Test build #47360 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47360/consoleFull)** for PR 8512 at commit [`4ec276b`](https://github.com/apache/spark/commit/4ec276b5ac1fc023087fea2aefb1278cf8a33e80).

    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
      * `public class JavaQuantileDiscretizerExample`
      * `public class JavaSQLTransformerExample`
      * `final class DecisionTreeClassifier @Since("1.4.0") (`
      * `final class GBTClassifier @Since("1.4.0") (`
      * `class LogisticRegression @Since("1.2.0") (`
      * `class MultilayerPerceptronClassifier @Since("1.5.0") (`
      * `class NaiveBayes @Since("1.5.0") (`
      * `final class OneVsRest @Since("1.4.0") (`
      * `final class RandomForestClassifier @Since("1.4.0") (`
      * `public abstract static class PrefixComputer`
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-163002146

    **[Test build #47362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47362/consoleFull)** for PR 8512 at commit [`8e4a14c`](https://github.com/apache/spark/commit/8e4a14c5d3e4c708db0284aa02339178aae77158).
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-163065488

    Test FAILed. Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47362/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163065410

**[Test build #47362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47362/consoleFull)** for PR 8512 at commit [`8e4a14c`](https://github.com/apache/spark/commit/8e4a14c5d3e4c708db0284aa02339178aae77158).
* This patch **fails from timeout after a configured wait of `250m`**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163065486

Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163067717

**[Test build #47377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47377/consoleFull)** for PR 8512 at commit [`5bdd924`](https://github.com/apache/spark/commit/5bdd92443c33fc2ec72fabb141c878cf3488d1f4).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163085307

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47377/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163085306

Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-163085217

**[Test build #47377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47377/consoleFull)** for PR 8512 at commit [`5bdd924`](https://github.com/apache/spark/commit/5bdd92443c33fc2ec72fabb141c878cf3488d1f4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r44330648

--- Diff: core/pom.xml ---
```xml
@@ -40,6 +40,11 @@
       ${avro.mapred.classifier}
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
```
--- End diff --

@pwendell @yhuai here is what the dependencies look like now:
```
[INFO] +- com.amazonaws:aws-java-sdk-s3:jar:1.9.16:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-kms:jar:1.9.16:compile
[INFO] +- com.amazonaws:aws-java-sdk-sts:jar:1.9.16:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-core:jar:1.9.16:compile
[INFO] |     +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] |     +- org.apache.httpcomponents:httpclient:jar:4.3.2:compile
[INFO] |     |  \- org.apache.httpcomponents:httpcore:jar:4.3.2:compile
[INFO] |     \- joda-time:joda-time:jar:2.5:compile
```
I could further exclude transitive deps if that helps. Btw, there is a [known issue](https://github.com/aws/aws-sdk-java/issues/444) in the AWS SDK when using it with Java 8u60 and joda-time < 2.8.1, so I had to bump joda-time to 2.8.1+ in my build at Netflix.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154915394

Build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154915385

Build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154917592

[Test build #45338 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45338/consoleFull) for PR 8512 at commit [`5509ac8`](https://github.com/apache/spark/commit/5509ac80108b1abe2ef75649c85dec08a930520d).
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r44240260

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
```scala
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]]) extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
```
--- End diff --

Fixed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154915335

[Test build #45337 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45337/consoleFull) for PR 8512 at commit [`d09fb6a`](https://github.com/apache/spark/commit/d09fb6a4ae27ac1af5087c064b1bf1ef20bf3cee).
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r44240487

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
```scala
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{Logging, SparkEnv}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
```
--- End diff --

We could, but there are use cases where users need to override AWS credentials at runtime by accessing this object. For example, at Netflix there was an S3 bucket called `vault` with different access permissions from the default buckets; to access that bucket, users had to explicitly set a special IAM role in user code.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154972016

[Test build #45337 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45337/console) for PR 8512 at commit [`d09fb6a`](https://github.com/apache/spark/commit/d09fb6a4ae27ac1af5087c064b1bf1ef20bf3cee).
* This patch **passes all tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154972535

Build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154984359

Build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154984021

[Test build #45338 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45338/console) for PR 8512 at commit [`5509ac8`](https://github.com/apache/spark/commit/5509ac80108b1abe2ef75649c85dec08a930520d).
* This patch **passes all tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154915021

Build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-154915032

Build started.
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r44240249

--- Diff: core/pom.xml ---
```xml
@@ -40,6 +40,11 @@
       ${avro.mapred.classifier}
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
```
--- End diff --

@pwendell OK, I changed `sdk` to `sdk-s3` and `sdk-sts`. Will this help?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43717724

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
```scala
@@ -0,0 +1,336 @@
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
```
--- End diff --

Shouldn't this just be private[spark]?
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43717704

--- Diff: core/pom.xml ---
```xml
@@ -40,6 +40,11 @@
       ${avro.mapred.classifier}
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
```
--- End diff --

I took a quick look, and unfortunately this has more than 50 transitive dependencies (Jackson, Joda-Time, Apache HttpClient) that are likely to cause conflicts. I don't think we can merge this until we look into it more deeply. Can we use a narrower artifact, for instance only the S3 SDK? Even then we'll still have many potential conflicts, but it would at least reduce the amount of auditing we need to do.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43576793

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
```scala
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]]) extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
```
--- End diff --

We can do it in a follow-up PR.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43576761

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
```scala
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]]) extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
+      parArray.foreach(_.partitions)
+    } else {
+      rdds.foreach(_.partitions)
+    }
+  }
+
```
--- End diff --

I am not very comfortable with this lazy val. It is not obvious what it is doing, and there is no way to disable it.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43576785

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
```scala
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]]) extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
```
--- End diff --

Can we remove this check? If `threshold <= 0`, we can just fall through to the `else` block.
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152777962

@pwendell If you are good with the added dependency to core, I will merge it.
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657561

test this please

Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657574

test this please

Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657569

test this please
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152675362

**[Test build #44706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152675441

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657162

**[Test build #44701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657246

Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657247

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152675440

Merged build finished. Test PASSed.
Github user piaozhexiu commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152688590

@yhuai Thank you for reviewing! I think I have addressed all your comments. Let me know if you have any further ones.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43567327

--- Diff: core/pom.xml ---
@@ -40,6 +40,11 @@
       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>

--- End diff --

@pwendell Is it good to add this dependency to core?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152659191

**[Test build #44706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152624053

Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152623983

**[Test build #44691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152624055

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43530868

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel (will be cached in each rdd)

--- End diff --

Elaborated the comment.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152599055

**[Test build #44691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43530739

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" and ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43530758

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()

--- End diff --

Fixed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152598139

Merged build started.
Github user piaozhexiu commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43530971

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel (will be cached in each rdd)
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))

--- End diff --

Added `Preconditions.checkArgument` to ensure that the threshold is positive.
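The pattern discussed here, falling back to sequential evaluation below a threshold and bounding parallelism with an explicit `ForkJoinPool` above it, can be sketched outside Spark as follows. `evalAll` and `ParallelListingSketch` are illustrative names, not Spark API, and the `scala.concurrent.forkjoin` import matches Scala 2.11 as used by Spark at the time (on 2.12+ it would be `java.util.concurrent.ForkJoinPool`):

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

object ParallelListingSketch {
  // Apply an expensive per-element function, in parallel only when the
  // collection is large enough to amortize the thread-pool overhead.
  def evalAll[T, R](items: Seq[T], threshold: Int)(f: T => R): Seq[R] = {
    // Mirrors the Preconditions.checkArgument added in the patch.
    require(threshold > 0, s"threshold must be positive: $threshold")
    if (items.length > threshold) {
      val par = items.par
      // Bound parallelism explicitly instead of using the default global pool.
      par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
      par.map(f).seq
    } else {
      items.map(f)
    }
  }
}
```

Capping the pool at the threshold keeps a union of many RDDs from flooding the driver with partition-listing threads while still overlapping the slow S3 round trips.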
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152598097

Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657775

Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152657805

Merged build started.
Github user piaozhexiu commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152635891

The test failures (`FlumeStreamSuite`) seem unrelated. I'll rebase to force another build run.
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152636209

test this please
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152636758

Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152636725

Merged build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152637655

**[Test build #44701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/consoleFull)** for PR 8512 at commit [`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152255040

LGTM. @yhuai Could you take a final look?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152248073

Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152248104

Merged build started.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152251272

**[Test build #44612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/consoleFull)** for PR 8512 at commit [`c1fa9ce`](https://github.com/apache/spark/commit/c1fa9ce06b5bace1be7ff702e2ee0bc223311076).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152290834

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152290672

**[Test build #44612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/consoleFull)** for PR 8512 at commit [`c1fa9ce`](https://github.com/apache/spark/commit/c1fa9ce06b5bace1be7ff702e2ee0bc223311076).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152290833

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43437967

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" and ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
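The `hiddenFileFilter` in the diff above is self-contained enough to illustrate on its own. A stdlib-only sketch of the same predicate, inverted to `isHidden` for readability (the function name is illustrative, not from the patch):

```scala
// Mirrors the filter logic in the diff: a file is treated as hidden (and
// skipped during listing) if its name starts with "_" or ".", or ends with
// "$folder$" — the marker some S3 tools use for pseudo-directories.
def isHidden(name: String): Boolean =
  name.startsWith("_") || name.startsWith(".") || name.endsWith("$folder$")
```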
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43439266

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
[... same new-file diff as quoted in the previous comment, down to ...]
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
--- End diff --

Instead of creating a `SparkConf` here, can we use `SparkEnv.get.conf` to get the conf associated with the `SparkContext`? If we create a new one here and users set the confs this tool uses in their application (not in the default conf file), we will not be able to pick up those settings, right?
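The reviewer's suggestion to read `SparkEnv.get.conf` has a wrinkle: `SparkEnv.get` returns null before a SparkContext is initialized, so a null-safe fallback is the usual pattern. A runnable sketch using stand-in stubs (the stub types below are illustrative placeholders, not Spark's real classes):

```scala
// Stand-in stubs so the pattern is runnable without a Spark dependency.
// In real Spark code these would be org.apache.spark.{SparkConf, SparkEnv}.
class SparkConf(val fromEnv: Boolean = false)
class SparkEnvStub(val conf: SparkConf)
object SparkEnvStub { var get: SparkEnvStub = null }

// Prefer the conf of the running application when an env exists; otherwise
// fall back to a freshly constructed SparkConf (defaults + conf file).
def resolveConf(): SparkConf =
  Option(SparkEnvStub.get).map(_.conf).getOrElse(new SparkConf())
```

This picks up user-set confs whenever the application's env is available, which is exactly the gap the review comment points out.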
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43439962

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel (will be cached in each rdd)
--- End diff --

What will be cached?
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-152310099 @piaozhexiu Thank you for updating it! I left a few comments.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8512#discussion_r43440241

--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

+  // Evaluate partitions in parallel (will be cached in each rdd)
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
--- End diff --

What will happen if `threshold = 0` or `threshold = -1`?
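The question is pointed: `java.util.concurrent.ForkJoinPool` rejects a non-positive parallelism with an `IllegalArgumentException`, so a user-configured threshold of 0 or -1 would crash the listing instead of disabling it. A minimal sketch of a guard (the function name and boolean return value are illustrative, not from the patch):

```scala
import java.util.concurrent.ForkJoinPool

// Sketch only: decide whether parallel partition evaluation is both safe and
// worthwhile. A non-positive threshold falls back to sequential evaluation
// rather than constructing a ForkJoinPool with an illegal parallelism value.
def evaluateInParallel(numRdds: Int, threshold: Int): Boolean = {
  if (threshold > 0 && numRdds > threshold) {
    val pool = new ForkJoinPool(threshold) // safe: threshold > 0 here
    pool.shutdown()
    true  // partitions would be evaluated in parallel
  } else {
    false // sequential fallback for threshold <= 0 or too few RDDs
  }
}
```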
Github user piaozhexiu commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151907100 @yhuai @liancheng @marmbrus Sorry to bug you again, but can you review this patch sometime soon? In fact, I am leaving Netflix on Nov 3rd, and I am trying to wrap up what I've been doing before my leave. We've been using this patch in production for a couple of months, and I fixed all the bugs that I discovered so far. At this point, this patch is quite stable. I would really appreciate it if you could help.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151674046 **[Test build #44476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/consoleFull)** for PR 8512 at commit [`7150da7`](https://github.com/apache/spark/commit/7150da7259148858c419f71ae5d9d0e24c55f4ea).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151689877 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151689879 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151689805 **[Test build #44476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/consoleFull)** for PR 8512 at commit [`7150da7`](https://github.com/apache/spark/commit/7150da7259148858c419f71ae5d9d0e24c55f4ea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151673388 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-151673408 Merged build started.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150748668 **[Test build #44280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/consoleFull)** for PR 8512 at commit [`ad11b08`](https://github.com/apache/spark/commit/ad11b082ba363afe97f16de9297e21f09f994e9d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150748702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150748701 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150734465 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150734581 Merged build started.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-150736390 **[Test build #44280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/consoleFull)** for PR 8512 at commit [`ad11b08`](https://github.com/apache/spark/commit/ad11b082ba363afe97f16de9297e21f09f994e9d).
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-148841351 LGTM
Github user piaozhexiu commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-148582745 @yhuai @liancheng @davies I updated the patch incorporating all your comments. Can you please take a look again? Thanks!
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-148560037 [Test build #43818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43818/consoleFull) for PR 8512 at commit [`f9b2938`](https://github.com/apache/spark/commit/f9b2938f93b65c494160cac64f31c2454097a47b).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-148559743 Merged build triggered.