[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2017-05-23 Thread maropu
Github user maropu closed the pull request at:

https://github.com/apache/spark/pull/14038





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2017-05-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r118045500
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
@@ -31,6 +31,34 @@ import org.apache.spark.sql.types.StructType
 
 
 /**
+ * A filter class to list qualified paths in parallel.
+ */
+abstract class PathFilter extends Serializable {
+  final def accept(path: Path): Boolean = isDataPath(path) || isMetaDataPath(path)
+  def isDataPath(path: Path): Boolean = false
+  def isMetaDataPath(path: Path): Boolean = false
+}
+
+object PathFilter {
+
+  /** A default path filter to pass through all input paths. */
+  val defaultPathFilter = new PathFilter {
+
+    override def isDataPath(path: Path): Boolean = {
+      // We filter out the following paths:
+      // 1. everything that starts with _ or ., except _common_metadata and
+      //    _metadata, because Parquet needs to find those metadata files among
+      //    the leaf files returned by this method (we should refactor this
+      //    logic to not mix metadata files with data files);
+      // 2. everything that ends with `._COPYING_`, because this is an
+      //    intermediate state of a file and we should skip it to avoid double
+      //    reading.
+      val name = path.getName
+      !((name.startsWith("_") && !name.contains("=")) || name.startsWith(".") ||
+        name.endsWith("._COPYING_"))
--- End diff --

Like @rxin said, this sounds risky to me too. 





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-11-28 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r89839965
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
@@ -441,6 +441,44 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with Predi
 }
   }
 
+  test("filter out invalid files in a driver") {
+    withSQLConf(
+      "fs.file.impl" -> classOf[MockDistributedFileSystem].getName,
+      SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD.key -> "3") {
+      val table =
+        createTable(
+          files = Seq(
+            "p1=1/file1" -> 1,
+            "p1=1/file2" -> 1,
+            "p1=2/file3" -> 1,
+            "p1=2/invalid_file" -> 1))
--- End diff --

I'd consider adding the full set of invalid files:
```
"p1=2/file=3" -> 1,
"p1=2/.temp" -> 1
```
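
For example, the fixture could then grow to something like this (a sketch; `createTable` is the suite's existing helper, and the `._COPYING_` entry is an extra case taken from the default filter's rules rather than from this review):

```scala
val table =
  createTable(
    files = Seq(
      "p1=1/file1" -> 1,
      "p1=1/file2" -> 1,
      "p1=2/file3" -> 1,
      "p1=2/invalid_file" -> 1,      // plain invalid name
      "p1=2/file=3" -> 1,            // '=' inside a file name, not a partition dir
      "p1=2/.temp" -> 1,             // hidden file starting with '.'
      "p1=2/file4._COPYING_" -> 1))  // in-flight copy that should be skipped
```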







[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69713606
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -156,3 +162,10 @@ class ListingFileCatalog(
 
   override def hashCode(): Int = paths.toSet.hashCode()
 }
+
+object ListingFileCatalog {
+  /** A default path filter to pass through all input paths. */
+  val passThroughPathFilter = new SerializablePathFilter {
+    override def accept(path: Path): Boolean = true
+  }
--- End diff --

Do the latest fixes satisfy your intention?





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69713543
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -30,6 +29,13 @@ import org.apache.spark.sql.types.StructType
 
 
 /**
+ * A filter class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends Serializable {
--- End diff --

yea, fixed.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69706488
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +229,15 @@ trait FileFormat {
   }
 
   /**
+   * Return a `SerializablePathFilter` to filter qualified files for this format.
+   */
+  def getPathFilter(options: Map[String, String]): SerializablePathFilter = {
+    new SerializablePathFilter {
+      override def accept(path: Path): Boolean = true
+    }
--- End diff --

We can replace this with `passThroughPathFilter` after moving it into this 
file.
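
Something like this (a sketch of the suggested refactor, assuming `passThroughPathFilter` has already been moved into this file):

```scala
/**
 * Return a `SerializablePathFilter` to filter qualified files for this format.
 */
def getPathFilter(options: Map[String, String]): SerializablePathFilter =
  // Reuse the shared accept-all default instead of allocating a fresh
  // anonymous filter on every call.
  passThroughPathFilter
```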





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69706413
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -156,3 +162,10 @@ class ListingFileCatalog(
 
   override def hashCode(): Int = paths.toSet.hashCode()
 }
+
+object ListingFileCatalog {
+  /** A default path filter to pass through all input paths. */
+  val passThroughPathFilter = new SerializablePathFilter {
+    override def accept(path: Path): Boolean = true
+  }
--- End diff --

Let's move this into `fileSourceInterfaces.scala`.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69706379
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -30,6 +29,13 @@ import org.apache.spark.sql.types.StructType
 
 
 /**
+ * A filter class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends Serializable {
--- End diff --

Also, it probably makes more sense to move this class into 
`fileSourceInterfaces.scala` since it's part of the public interface.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69706249
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -30,6 +29,13 @@ import org.apache.spark.sql.types.StructType
 
 
 /**
+ * A filter class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends Serializable {
--- End diff --

okay





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-06 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69706044
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -30,6 +29,13 @@ import org.apache.spark.sql.types.StructType
 
 
 /**
+ * A filter class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends Serializable {
--- End diff --

Maybe just `PathFilter`.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69521137
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }
 
   /**
+   * Return a `SerializablePathFilter` to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

yea, my bad. I'll re-check the whole code to remove the `Option` wrappers.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69520870
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
 }
 
 /**
+ * A helper class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

okay, I'll remove the dependency.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69520532
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
 }
 
 /**
+ * A helper class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

yea, we need `Serializable` here.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69520473
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -437,11 +442,26 @@ private[sql] object HadoopFsRelation extends Logging {
   accessTime: Long,
   blockLocations: Array[FakeBlockLocation])
 
+  private[sql] def mergePathFilter(
+      filter1: Option[PathFilter], filter2: Option[PathFilter]): Path => Boolean = {
+    (filter1, filter2) match {
+      case (Some(f1), Some(f2)) =>
+        (path: Path) => f1.accept(path) && f2.accept(path)
+      case (Some(f1), None) =>
+        (path: Path) => f1.accept(path)
+      case (None, Some(f2)) =>
+        (path: Path) => f2.accept(path)
+      case (None, None) =>
+        (path: Path) => true
+    }
--- End diff --

This can be more concise (`Option ++ Option` lifts both filters into an `Iterable` with zero to two elements, so we only fold over the filters that are present; mapping to plain functions first keeps `reduceOption`'s types simple):

```scala
(filter1 ++ filter2)
  .map(f => (path: Path) => f.accept(path))
  .reduceOption { (g1, g2) =>
    (path: Path) => g1(path) && g2(path)
  }.getOrElse { (_: Path) => true }
```





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69520156
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
 }
 
 /**
+ * A helper class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

Oh I see, because parallel file listing may filter input files on the executor side.
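
Roughly this shape (a sketch with assumed names `format`, `options`, `paths`, `sparkContext`, and `listLeafFiles`, not the PR's exact code):

```scala
// The filter is captured by the closure below and shipped to executors,
// which is why it must be Serializable. Paths are parallelized as strings
// because Hadoop's Path itself is not serializable (the real code likewise
// converts statuses into serializable "fake" forms before collecting).
val filter: SerializablePathFilter = format.getPathFilter(options)
val leafFiles = sparkContext
  .parallelize(paths.map(_.toString), paths.length)
  .flatMap { p => listLeafFiles(new Path(p)).filter(s => filter.accept(s.getPath)) }
  .collect()
```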





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69519931
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
 }
 
 /**
+ * A helper class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

I probably missed something here, but why does it have to be `Serializable`?





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69519846
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }
 
   /**
+   * Return a `SerializablePathFilter` to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

What are the semantics of this method's return value? It seems that it should never return a null filter, since it defaults to an "accept all" filter. If that is true, it's unnecessary to wrap the returned filters in `Option` elsewhere in this PR.
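
If so, call sites could take the filter directly; a minimal sketch (with hypothetical names `fileFormat` and `leafFiles`):

```scala
// getPathFilter never returns null, so no Option wrapper is needed at call sites.
val pathFilter: SerializablePathFilter = fileFormat.getPathFilter()
val qualifiedFiles = leafFiles.filter(status => pathFilter.accept(status.getPath))
```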





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69519641
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
 }
 
 /**
+ * A helper class to list qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

Extending from `PathFilter` makes internal implementation a little bit 
easier, but I'd prefer to avoid depending on Hadoop classes/interfaces in Spark 
SQL public interfaces whenever possible.





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69519099
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }
 
   /**
+   * Return a `SerializablePathFilter` to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

okay, I'll fix now





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14038#discussion_r69518421
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }
 
   /**
+   * Return a `SerializablePathFilter` to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

Shall we add either the data source options map or the Hadoop conf as an argument of this method?

For example, the Avro data source may filter out all input files whose names don't end with ".avro" if the Hadoop conf "avro.mapred.ignore.inputs.without.extension" is set to true. This is consistent with the default behavior of `AvroInputFormat`.
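
Roughly like this (a hypothetical sketch of an Avro-style override taking the options map; reading the flag from options rather than the Hadoop conf is an assumption here, not the actual Avro implementation):

```scala
override def getPathFilter(options: Map[String, String]): SerializablePathFilter = {
  // Hypothetical: mirror AvroInputFormat's flag as a data source option.
  val ignoreWithoutExtension =
    options.get("avro.mapred.ignore.inputs.without.extension").exists(_.toBoolean)
  new SerializablePathFilter {
    override def accept(path: Path): Boolean =
      !ignoreWithoutExtension || path.getName.endsWith(".avro")
  }
}
```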





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-03 Thread maropu
GitHub user maropu reopened a pull request:

https://github.com/apache/spark/pull/14038

[SPARK-16317][SQL] Add a new interface to filter files in FileFormat

## What changes were proposed in this pull request?
This PR adds an interface for filtering files in `FileFormat` so that invalid files are not passed into `FileFormat#buildReader`.

## How was this patch tested?
Added tests that filter files both in the driver and in parallel.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-16317

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14038


commit 67703098f96da37fbe23e0f2d76017698671d5e2
Author: Takeshi YAMAMURO 
Date:   2016-07-04T02:13:34Z

Add a new interface to filter files in FileFormat







[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-03 Thread maropu
Github user maropu closed the pull request at:

https://github.com/apache/spark/pull/14038





[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...

2016-07-03 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/14038

[SPARK-16317][SQL] Add a new interface to filter files in FileFormat

## What changes were proposed in this pull request?
This PR adds an interface for filtering files in `FileFormat` so that invalid files are not passed into `FileFormat#buildReader`.

## How was this patch tested?
Added tests that filter files both in the driver and in parallel.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-16317

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14038


commit 67703098f96da37fbe23e0f2d76017698671d5e2
Author: Takeshi YAMAMURO 
Date:   2016-07-04T02:13:34Z

Add a new interface to filter files in FileFormat



