[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2016-04-08 Thread piaozhexiu
Github user piaozhexiu closed the pull request at:

https://github.com/apache/spark/pull/8512


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2016-04-08 Thread piaozhexiu
Github user piaozhexiu commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-207493828
  
@srowen sure, done!





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2016-04-08 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-207380286
  
@piaozhexiu close this in favor of 
https://github.com/apache/spark/pull/11242 it seems?





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-09 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r47081031
  
--- Diff: core/pom.xml ---
@@ -40,6 +40,16 @@
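       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
     <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk-s3</artifactId>
+      <version>${aws.java.sdk.version}</version>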
--- End diff --

1. Shouldn't the version declaration go into /pom.xml, with /core/pom.xml just 
declaring its use?
2. Be aware that Hadoop branch-2 is on v. 1.10.6 (HADOOP-12269). Between 
1.7.4 and 1.10.6, Amazon changed the signature of one of the methods (int -> 
long on multipart upload). This means that s3a is very brittle about library 
versions, and some shading may be wise here. Alternatively, stay in sync with 
the Hadoop 2.7 version and say "Don't use S3a on Hadoop 2.6.x", which is 
something the Hadoop team would concur with.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-09 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r47081146
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkS3UtilSuite.scala ---
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.util.Date
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapred.{InputSplit, FileInputFormat, JobConf}
+
+import org.apache.spark.SparkFunSuite
+
+class SparkS3UtilSuite extends SparkFunSuite {
+  test("s3ListingEnabled function") {
+    val jobConf = new JobConf()
+
+    // Disabled by user
+    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "false")
+    FileInputFormat.setInputPaths(jobConf, "s3://bucket/dir/file")
+    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
+
+    // Input paths contain wildcards
+    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "true")
+    FileInputFormat.setInputPaths(jobConf, "s3://bucket/dir/*")
+    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
+
+    // Input paths contain non-S3 files
+    SparkS3Util.sparkConf.set("spark.s3.bulk.listing.enabled", "true")
+    FileInputFormat.setInputPaths(jobConf, "file://dir/file")
+    assert(SparkS3Util.s3BulkListingEnabled(jobConf) == false)
+  }
+
+  test("isSplitable function") {
--- End diff --

Go on, spell Splittable correctly, even if the misspelling is hard-coded 
in the Hadoop API.
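
For readers following the first test in the diff above, here is a minimal sketch of the kind of eligibility check those three assertions describe. It is not the PR's implementation, which is not shown in this thread: `BulkListingCheck`, `bulkListingEligible`, the set of schemes treated as S3, and the positive case are all assumptions made for illustration.

```scala
// Hypothetical stand-in for the check the suite exercises; not the PR's code.
object BulkListingCheck {
  // Which schemes count as S3 is an assumption for this sketch.
  private val s3Schemes = Set("s3", "s3n")

  // Glob metacharacters used in Hadoop-style path patterns.
  private val globChars = Set('*', '?', '[', ']', '{', '}')

  def hasWildcard(path: String): Boolean = path.exists(globChars.contains)

  // Extracts the text before "://", lower-cased; empty if there is no scheme.
  def scheme(path: String): String = {
    val i = path.indexOf("://")
    if (i > 0) path.substring(0, i).toLowerCase else ""
  }

  // Eligible only if the flag is on and every input path is a
  // wildcard-free S3 path.
  def bulkListingEligible(enabled: Boolean, inputPaths: Seq[String]): Boolean =
    enabled && inputPaths.nonEmpty &&
      inputPaths.forall(p => !hasWildcard(p) && s3Schemes.contains(scheme(p)))
}
```

Under these assumptions, `BulkListingCheck.bulkListingEligible(true, Seq("s3://bucket/dir/*"))` is false because of the wildcard, matching the second assertion in the suite.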





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-09 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r47081737
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{Logging, SparkEnv}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = SparkEnv.get.conf
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of the time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" or ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
+    clientConf.setMaxErrorRetry(maxErrorRetries)
+ 
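
The hidden-file rule in the quoted diff can be tried in isolation. The sketch below restates that predicate on plain strings so it runs without Hadoop on the classpath; `HiddenFiles`, `isVisible`, and `baseName` are hypothetical names, and `baseName` is only a rough approximation of what `Path.getName` returns.

```scala
// Stand-alone illustration of the filter's rule; names here are hypothetical.
object HiddenFiles {
  // Same three conditions as the quoted PathFilter, applied to a bare name.
  def isVisible(name: String): Boolean =
    !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")

  // Last component of a slash-separated key (a simplification of Path.getName).
  def baseName(key: String): String = key.stripSuffix("/").split('/').last

  // Keeps only the keys whose last component passes the filter.
  def visibleKeys(keys: Seq[String]): Seq[String] =
    keys.filter(k => isVisible(baseName(k)))
}
```

For example, `HiddenFiles.visibleKeys(Seq("data/part-00000", "data/_SUCCESS", "data/.part.crc", "data_$folder$"))` keeps only `data/part-00000`.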

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r46992201
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{Logging, SparkEnv}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = SparkEnv.get.conf
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of the time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" or ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
   

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r46991325
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r46992376
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread steveloughran
Github user steveloughran commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-162974033
  
Has anyone looked at the performance of this versus S3a in Hadoop 2.7+? 
While I agree this will dramatically improve s3n: and s3: performance, all 
ongoing Hadoop work is on the s3a FS, with s3n left alone on the grounds that 
every jets3t upgrade or other change breaks things. S3a does use `ListRequest`, 
and I'd expect it not only to list faster, but to have faster reads too.

That doesn't mean this patch won't be useful: if anyone still uses s3: 
it'll be essential (there's no maintenance going on there), and the code here 
will also benefit Hadoop <= 2.6. It's just that for 2.7+ I would say "use s3a 
and be done with it". That said, a lot of work on s3a remains to be looked at, 
especially around lazy seeks.

What could be very useful for the Hadoop team here is some tests for Spark 
using S3, so as to catch regressions in functionality, performance, and scale:

1. Measure ls() performance. Maybe we can find or get someone to create 
an S3 store pre-populated with many files.
2. Look at the costs of read + seek + close on big files. 
[HADOOP-12376](https://issues.apache.org/jira/browse/HADOOP-12376) turned out 
to be a surprise there: if you close() a multi-GB file 3 bytes in, that close() 
still reads the stream through to completion. Again, having some public 
reference files would aid testing here.
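
A minimal sketch of the kind of listing measurement suggested in item 1, assuming a local temporary directory as a stand-in for an S3 store. `ListingBenchmark`, `timed`, and `listNames` are hypothetical names, not part of Spark or Hadoop; a real test would call the filesystem client under test in place of `listNames` and would repeat the measurement rather than rely on a single run.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical timing harness; the local directory stands in for an object store.
object ListingBenchmark {
  // Runs `body` once and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  // Lists entry names in a directory; empty if `dir` is not a readable directory.
  def listNames(dir: File): Seq[String] =
    Option(dir.listFiles()).map(_.toSeq.map(_.getName)).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    // Pre-populate a scratch directory with many small files.
    val dir = Files.createTempDirectory("ls-bench").toFile
    (1 to 1000).foreach(i => new File(dir, s"part-$i").createNewFile())

    val (names, millis) = timed(listNames(dir))
    println(f"listed ${names.size} entries in $millis%.2f ms")

    // Clean up the scratch files.
    Option(dir.listFiles()).foreach(_.foreach(_.delete()))
    dir.delete()
  }
}
```

Keeping `timed` generic means the same wrapper could be put around a read + seek + close sequence for item 2.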






[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-162996871
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47360/
Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-162996870
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-162996514
  
**[Test build #47360 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47360/consoleFull)**
 for PR 8512 at commit 
[`4ec276b`](https://github.com/apache/spark/commit/4ec276b5ac1fc023087fea2aefb1278cf8a33e80).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-162996866
  
**[Test build #47360 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47360/consoleFull)**
 for PR 8512 at commit 
[`4ec276b`](https://github.com/apache/spark/commit/4ec276b5ac1fc023087fea2aefb1278cf8a33e80).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `public class JavaQuantileDiscretizerExample `
   * `public class JavaSQLTransformerExample `
   * `final class DecisionTreeClassifier @Since("1.4.0") (`
   * `final class GBTClassifier @Since("1.4.0") (`
   * `class LogisticRegression @Since("1.2.0") (`
   * `class MultilayerPerceptronClassifier @Since("1.5.0") (`
   * `class NaiveBayes @Since("1.5.0") (`
   * `final class OneVsRest @Since("1.4.0") (`
   * `final class RandomForestClassifier @Since("1.4.0") (`
   * `  public abstract static class PrefixComputer `





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163002146
  
**[Test build #47362 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47362/consoleFull)**
 for PR 8512 at commit 
[`8e4a14c`](https://github.com/apache/spark/commit/8e4a14c5d3e4c708db0284aa02339178aae77158).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163065488
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47362/
Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163065410
  
**[Test build #47362 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47362/consoleFull)**
 for PR 8512 at commit 
[`8e4a14c`](https://github.com/apache/spark/commit/8e4a14c5d3e4c708db0284aa02339178aae77158).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163065486
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163067717
  
**[Test build #47377 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47377/consoleFull)**
 for PR 8512 at commit 
[`5bdd924`](https://github.com/apache/spark/commit/5bdd92443c33fc2ec72fabb141c878cf3488d1f4).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163085307
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47377/
Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163085306
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-163085217
  
**[Test build #47377 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47377/consoleFull)**
 for PR 8512 at commit 
[`5bdd924`](https://github.com/apache/spark/commit/5bdd92443c33fc2ec72fabb141c878cf3488d1f4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-09 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r44330648
  
--- Diff: core/pom.xml ---
@@ -40,6 +40,11 @@
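       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
     <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
+    <dependency>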
--- End diff --

@pwendell @yhuai here is what the dependencies look like now:
```
[INFO] +- com.amazonaws:aws-java-sdk-s3:jar:1.9.16:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-kms:jar:1.9.16:compile
[INFO] +- com.amazonaws:aws-java-sdk-sts:jar:1.9.16:compile
[INFO] |  \- com.amazonaws:aws-java-sdk-core:jar:1.9.16:compile
[INFO] | +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.3.2:compile
[INFO] | |  \- org.apache.httpcomponents:httpcore:jar:4.3.2:compile
[INFO] | \- joda-time:joda-time:jar:2.5:compile
```
I could further exclude transitive deps if that helps.

Btw, there is a [known issue](https://github.com/aws/aws-sdk-java/issues/444) in the AWS SDK
when it is used with Java 8u60 and joda-time < 2.8.1, so I had to update
joda-time to 2.8.1+ in my build at Netflix.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154915394
  
Build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154915385
  
Build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154917592
  
[Test build #45338 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45338/consoleFull)
for PR 8512 at commit
[`5509ac8`](https://github.com/apache/spark/commit/5509ac80108b1abe2ef75649c85dec08a930520d).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r44240260
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
--- End diff --

Fixed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154915335
  
[Test build #45337 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45337/consoleFull)
for PR 8512 at commit
[`d09fb6a`](https://github.com/apache/spark/commit/d09fb6a4ae27ac1af5087c064b1bf1ef20bf3cee).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r44240487
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{Logging, SparkEnv}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
--- End diff --

We could, but there are use cases where users need to override AWS
credentials at runtime by accessing this object. For example, at Netflix there
was an S3 bucket called `vault` that had different access permissions from the
default buckets. To access this bucket, users had to explicitly set a
special IAM role in user code.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154972016
  
[Test build #45337 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45337/console)
for PR 8512 at commit
[`d09fb6a`](https://github.com/apache/spark/commit/d09fb6a4ae27ac1af5087c064b1bf1ef20bf3cee).
 * This patch **passes all tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154972535
  
Build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154984359
  
Build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154984021
  
[Test build #45338 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45338/console)
for PR 8512 at commit
[`5509ac8`](https://github.com/apache/spark/commit/5509ac80108b1abe2ef75649c85dec08a930520d).
 * This patch **passes all tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154915021
  
Build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-154915032
  
Build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-08 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r44240249
  
--- Diff: core/pom.xml ---
@@ -40,6 +40,11 @@
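       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
     <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
+    <dependency>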
--- End diff --

@pwendell OK, I changed `aws-java-sdk` to `aws-java-sdk-s3` and `aws-java-sdk-sts`. Will this help?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-02 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43717724
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{Logging, SparkEnv}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
--- End diff --

Shouldn't this just be private[spark]? 





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-11-02 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43717704
  
--- Diff: core/pom.xml ---
@@ -40,6 +40,11 @@
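       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
     <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
+    <dependency>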
--- End diff --

I took a quick look, and unfortunately this has more than 50 transitive
dependencies (Jackson, Joda-Time, Apache HttpClient) that are likely to cause
conflicts. I don't think we can merge this until we look into it more deeply.
Can we use a narrower artifact, for instance only the S3 SDK? Even then we'll
still have many potential conflicts, but it would at least reduce the amount of
auditing we need to do.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-31 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43576793
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
--- End diff --

We can do it in a follow-up pr.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-31 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43576761
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
+      parArray.foreach(_.partitions)
+    } else {
+      rdds.foreach(_.partitions)
+    }
+  }
+
--- End diff --

I am not very comfortable with this lazy val. It is not obvious what it is 
doing and there is no way to disable it. 
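
For readers unfamiliar with the idiom in question, here is a minimal sketch of how a side-effecting `lazy val` of type `Unit` behaves, and of what an opt-out could look like. `Listing`, `listingLog`, and the `eager` flag are hypothetical names used only for illustration; none of them come from the PR.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for UnionRDD; `listingLog` records the work that
// `_.partitions` would trigger in the real class.
class Listing(inputs: Seq[String], eager: Boolean = true) {
  val listingLog = ArrayBuffer.empty[String]

  // A `lazy val` of type Unit: the body runs once, on first reference,
  // purely for its side effect. Later references do nothing.
  private lazy val evaluateAll: Unit = {
    inputs.foreach(in => listingLog += s"listed $in")
  }

  def partitions: Seq[String] = {
    // Assumed opt-out switch (not in the PR): skip the eager pass entirely.
    if (eager) evaluateAll
    inputs
  }
}
```

Calling `partitions` twice on a `Listing` built with the default `eager = true` should record each input once, while `eager = false` should leave `listingLog` empty. A plain method plus an explicit flag would make the same behavior visible at the call site.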





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-31 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43576785
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +66,23 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
+  // val in `RDD`.
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    Preconditions.checkArgument(threshold > 0,
+      "spark.rdd.parallelListingThreshold must be positive: %s", threshold.toString)
--- End diff --

Can we remove this check? If `threshold <= 0`, we can just go to the `else`
block.
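
One caveat if the check is dropped: the condition `rdds.length > threshold` by itself would still send a non-positive threshold into the parallel branch, and `new ForkJoinPool(threshold)` rejects a parallelism of zero or less. A rough sketch of a guard that falls through to the sequential path instead is below. It is not the PR's code: `ParallelListing` and `foreachMaybeParallel` are hypothetical names, a fixed thread pool from `java.util.concurrent` is swapped in for the parallel collection so the snippet needs no extra library, and the `Boolean` result exists only to make the chosen branch observable.

```scala
import java.util.concurrent.{Callable, Executors}

object ParallelListing {
  /** Applies `f` to every item. Uses `threshold` threads only when `threshold`
    * is positive and there are more than `threshold` items; otherwise runs
    * sequentially. Returns true if the parallel branch was taken. */
  def foreachMaybeParallel[A](items: Seq[A], threshold: Int)(f: A => Unit): Boolean = {
    if (threshold > 0 && items.length > threshold) {
      val pool = Executors.newFixedThreadPool(threshold)
      try {
        val tasks = new java.util.ArrayList[Callable[Unit]]()
        items.foreach { item =>
          tasks.add(new Callable[Unit] { def call(): Unit = f(item) })
        }
        // invokeAll blocks until every task is done; get() rethrows failures.
        val futures = pool.invokeAll(tasks)
        val it = futures.iterator()
        while (it.hasNext) it.next().get()
      } finally {
        pool.shutdown()
      }
      true
    } else {
      // threshold <= 0, or too few items: plain sequential traversal.
      items.foreach(f)
      false
    }
  }
}
```

With this shape, a setting of `0` or a negative value acts as an off switch rather than an error, which would also give users a way to disable the parallel pass.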





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-31 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152777962
  
@pwendell If you are good with the added dependency to core, I will merge 
it.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657561
  
test this please





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657574
  
test this please





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657569
  
test this please





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152675362
  
**[Test build #44706 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152675441
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/
Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657162
  
**[Test build #44701 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657246
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657247
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/
Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152675440
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152688590
  
@yhuai Thank you for reviewing! I think I addressed all your comments. Let 
me know if you have any further comments.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43567327
  
--- Diff: core/pom.xml ---
@@ -40,6 +40,11 @@
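       <classifier>${avro.mapred.classifier}</classifier>
     </dependency>
     <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk</artifactId>
+      <version>${aws.java.sdk.version}</version>
+    </dependency>
+    <dependency>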
--- End diff --

@pwendell Is it good to add this dependency to core?





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152659191
  
**[Test build #44706 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44706/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152624053
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152623983
  
**[Test build #44691 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152624055
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/
Test FAILed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43530868
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel (will be cached in each rdd)
--- End diff --

Elaborated the comment.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152599055
  
**[Test build #44691 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44691/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43530739
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" and ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
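
The quoted diff describes two small mechanisms: a name-based filter for hidden files and one S3 client per bucket held in a cache. A minimal, dependency-free sketch of both follows. It is not the PR's code: `S3ListingSketch`, `isVisible`, `Client`, and `clientFor` are hypothetical names, `Client` stands in for `AmazonS3Client`, and a Scala `TrieMap` is swapped in for the Guava `Cache` used in the diff.

```scala
import scala.collection.concurrent.TrieMap

object S3ListingSketch {
  // Same predicate as the quoted hiddenFileFilter, applied to a bare file name
  // instead of a Hadoop Path (assumption: only the last path component matters).
  def isVisible(name: String): Boolean =
    !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")

  // Hypothetical stand-in for AmazonS3Client; it only remembers its bucket.
  final class Client(val bucket: String)

  // One client per bucket, created on first use and reused afterwards.
  private val clients = TrieMap.empty[String, Client]

  def clientFor(bucket: String): Client =
    clients.getOrElseUpdate(bucket, new Client(bucket))
}
```

Keying the cache by bucket name matches the reasoning in the diff's doc comment: access permissions can differ from bucket to bucket, so a client built for one bucket is not assumed to be valid for another.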

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43530758
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
--- End diff --

Fixed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152598139
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43530971
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
     var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel (will be cached in each rdd)
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
--- End diff --

Added `Preconditions.checkArgument` to ensure that threshold is positive. 
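
The idea in this diff (evaluate the inputs in parallel only when there are more of them than a threshold, and reject a non-positive threshold because it also sizes the pool) can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `ParallelListingSketch` and `listAll` are hypothetical names, a fixed thread pool with Scala futures replaces the `ForkJoinTaskSupport` on a parallel array, and `require` stands in for Guava's `Preconditions.checkArgument` (both throw `IllegalArgumentException`).

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelListingSketch {
  // Hypothetical helper: applies `list` to every input, in parallel when the
  // number of inputs exceeds `threshold`, sequentially otherwise.
  def listAll[A, B](inputs: Seq[A], threshold: Int)(list: A => B): Seq[B] = {
    // The threshold doubles as the pool size, so it must be positive.
    require(threshold > 0, s"threshold must be positive, got $threshold")
    if (inputs.length > threshold) {
      val pool = Executors.newFixedThreadPool(threshold)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
      try {
        // Future.sequence keeps the results in input order.
        val futures = inputs.map(in => Future(list(in)))
        Await.result(Future.sequence(futures), Duration.Inf)
      } finally {
        pool.shutdown()
      }
    } else {
      inputs.map(list)
    }
  }
}
```

A call such as `ParallelListingSketch.listAll(paths, 10)(listOnePath)` (with `paths` and `listOnePath` supplied by the caller) would run at most ten listings at a time once there are more than ten paths.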





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152598097
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657775
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152657805
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread piaozhexiu
Github user piaozhexiu commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152635891
  
The test failures (`FlumeStreamSuite`) seem unrelated. I'll rebase to force 
another build run.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152636209
  
test this please





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152636758
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152636725
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152637655
  
**[Test build #44701 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44701/consoleFull)**
 for PR 8512 at commit 
[`e00e248`](https://github.com/apache/spark/commit/e00e24840240cb7a418f17e339bd77f14bbe029d).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152255040
  
LGTM, @yhuai Could you take a final look?





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152248073
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152248104
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152251272
  
**[Test build #44612 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/consoleFull)**
 for PR 8512 at commit 
[`c1fa9ce`](https://github.com/apache/spark/commit/c1fa9ce06b5bace1be7ff702e2ee0bc223311076).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152290834
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152290672
  
**[Test build #44612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44612/consoleFull)** for PR 8512 at commit [`c1fa9ce`](https://github.com/apache/spark/commit/c1fa9ce06b5bace1be7ff702e2ee0bc223311076).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152290833
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43437967
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
+  val conf: Configuration = SparkHadoopUtil.get.newConfiguration(sparkConf)
+
+  private val s3ClientCache: Cache[String, AmazonS3Client] = CacheBuilder
+    .newBuilder
+    .concurrencyLevel(Runtime.getRuntime.availableProcessors)
+    .build[String, AmazonS3Client]
+
+  // Flag to enable S3 bulk listing. It is true by default.
+  private val S3_BULK_LISTING_ENABLED: String = "spark.s3.bulk.listing.enabled"
+
+  // Properties for AmazonS3Client. Default values should just work most of time.
+  private val S3_CONNECT_TIMEOUT: String = "spark.s3.connect.timeout"
+  private val S3_MAX_CONNECTIONS: String = "spark.s3.max.connections"
+  private val S3_MAX_ERROR_RETRIES: String = "spark.s3.max.error.retries"
+  private val S3_SOCKET_TIMEOUT: String = "spark.s3.socket.timeout"
+  private val S3_SSL_ENABLED: String = "spark.s3.ssl.enabled"
+  private val S3_USE_INSTANCE_CREDENTIALS: String = "spark.s3.use.instance.credentials"
+
+  // Ignore hidden files whose name starts with "_" and ".", or ends with "$folder$".
+  private val hiddenFileFilter = new PathFilter() {
+    override def accept(p: Path): Boolean = {
+      val name: String = p.getName()
+      !name.startsWith("_") && !name.startsWith(".") && !name.endsWith("$folder$")
+    }
+  }
+
+  /**
+   * Initialize AmazonS3Client per bucket. Since access permissions might be different from bucket
+   * to bucket, it is necessary to initialize AmazonS3Client on a per bucket basis.
+   */
+  private def getS3Client(bucket: String): AmazonS3Client = {
+    val sslEnabled: Boolean = sparkConf.getBoolean(S3_SSL_ENABLED, true)
+    val maxErrorRetries: Int = sparkConf.getInt(S3_MAX_ERROR_RETRIES, 10)
+    val connectTimeout: Int = sparkConf.getInt(S3_CONNECT_TIMEOUT, 5000)
+    val socketTimeout: Int = sparkConf.getInt(S3_SOCKET_TIMEOUT, 5000)
+    val maxConnections: Int = sparkConf.getInt(S3_MAX_CONNECTIONS, 5)
+    val useInstanceCredentials: Boolean = sparkConf.getBoolean(S3_USE_INSTANCE_CREDENTIALS, false)
+
+    val clientConf: ClientConfiguration = new ClientConfiguration
+

[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43439266
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkS3Util.scala ---
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.net.URI
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.amazonaws.{AmazonClientException, AmazonServiceException, ClientConfiguration, Protocol}
+import com.amazonaws.auth.{AWSCredentialsProvider, BasicAWSCredentials, InstanceProfileCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
+import com.amazonaws.internal.StaticCredentialsProvider
+import com.amazonaws.services.s3.AmazonS3Client
+import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing, S3ObjectSummary}
+
+import com.google.common.annotations.VisibleForTesting
+import com.google.common.base.{Preconditions, Strings}
+import com.google.common.cache.{Cache, CacheBuilder}
+import com.google.common.collect.AbstractSequentialIterator
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, GlobPattern, Path, PathFilter}
+import org.apache.hadoop.fs.s3.S3Credentials
+import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
+import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.util.Utils
+
+/**
+ * :: DeveloperApi ::
+ * Contains util methods to interact with S3 from Spark.
+ */
+@DeveloperApi
+object SparkS3Util extends Logging {
+  val sparkConf = new SparkConf()
--- End diff --

Instead of creating a `SparkConf` here, can we use `SparkEnv.get.conf` 
to get the conf associated with the `SparkContext`? If we create a new one 
here and users set the confs this tool uses in their application (rather than in the 
default conf file), we will not be able to pick up those settings, right?





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43439962
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
 var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel (will be cached in each rdd)
--- End diff --

What will be cached?





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-152310099
  
@piaozhexiu Thank you for updating it! I left a few comments.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/8512#discussion_r43440241
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
@@ -62,7 +64,20 @@ class UnionRDD[T: ClassTag](
 var rdds: Seq[RDD[T]])
   extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
 
+  // Evaluate partitions in parallel (will be cached in each rdd)
+  private lazy val evaluatePartitions: Unit = {
+    val threshold = conf.getInt("spark.rdd.parallelListingThreshold", 10)
+    if (rdds.length > threshold) {
+      val parArray = rdds.toParArray
+      parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threshold))
--- End diff --

What will happen if `threshold = 0` or `threshold = -1`?
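
A minimal sketch of why this matters, assuming the pool behaves like `java.util.concurrent.ForkJoinPool` (the diff may import the `scala.concurrent.forkjoin` variant instead, which is not checked here). The constructor rejects a non-positive parallelism, so a configured value of 0 or -1 would pass the `rdds.length > threshold` check and then fail when the pool is built. `ThresholdCheck`, `safeParallelism`, and `poolRejects` are hypothetical names for illustration and are not part of the patch.

```scala
import java.util.concurrent.ForkJoinPool
import scala.util.Try

object ThresholdCheck {
  // Hypothetical guard: never hand the pool fewer than one thread.
  def safeParallelism(threshold: Int): Int = math.max(threshold, 1)

  // True when the ForkJoinPool constructor throws for this parallelism.
  // A pool that is created successfully is shut down right away.
  def poolRejects(parallelism: Int): Boolean =
    Try(new ForkJoinPool(parallelism)).map(_.shutdown()).isFailure

  def main(args: Array[String]): Unit = {
    // Probe the two values asked about in the review comment.
    Seq(0, -1).foreach { t =>
      println(s"threshold=$t rejected=${poolRejects(t)}")
    }
    // With the guard in place, the pool can always be constructed.
    val pool = new ForkJoinPool(safeParallelism(-1))
    println(s"clamped parallelism=${pool.getParallelism}")
    pool.shutdown()
  }
}
```

One possible option would be to clamp the configured value like this before building the pool; another would be to validate the setting and fail early with a clear message.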





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-28 Thread piaozhexiu
Github user piaozhexiu commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151907100
  
@yhuai @liancheng @marmbrus 

Sorry to bug you again, but can you review this patch sometime soon?

In fact, I am leaving Netflix on Nov 3rd, and I am trying to wrap up what 
I've been working on before I leave. We've been using this patch in production for 
a couple of months, and I have fixed all the bugs I have discovered so far. At this 
point, the patch is quite stable.

I would really appreciate it if you could help.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151674046
  
**[Test build #44476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/consoleFull)** for PR 8512 at commit [`7150da7`](https://github.com/apache/spark/commit/7150da7259148858c419f71ae5d9d0e24c55f4ea).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151689877
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151689879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151689805
  
**[Test build #44476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44476/consoleFull)** for PR 8512 at commit [`7150da7`](https://github.com/apache/spark/commit/7150da7259148858c419f71ae5d9d0e24c55f4ea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151673388
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-151673408
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150748668
  
**[Test build #44280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/consoleFull)** for PR 8512 at commit [`ad11b08`](https://github.com/apache/spark/commit/ad11b082ba363afe97f16de9297e21f09f994e9d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150748702
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150748701
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150734465
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150734581
  
Merged build started.





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-150736390
  
**[Test build #44280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44280/consoleFull)** for PR 8512 at commit [`ad11b08`](https://github.com/apache/spark/commit/ad11b082ba363afe97f16de9297e21f09f994e9d).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-16 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-148841351
  
LGTM





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-15 Thread piaozhexiu
Github user piaozhexiu commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-148582745
  
@yhuai @liancheng @davies 

I updated the patch incorporating all your comments. Can you please take a 
look again? Thanks!





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-148560037
  
[Test build #43818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43818/consoleFull) for PR 8512 at commit [`f9b2938`](https://github.com/apache/spark/commit/f9b2938f93b65c494160cac64f31c2454097a47b).





[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...

2015-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8512#issuecomment-148559743
  
 Merged build triggered.




