[GitHub] spark pull request #17346: [SPARK-19965][SS] DataFrame batch reader may fail...

lw-lin Tue, 02 May 2017 20:54:12 -0700

Github user lw-lin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17346#discussion_r114468906
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala ---
    @@ -36,20 +37,27 @@ import org.apache.spark.util.SerializableConfiguration
      * A [[FileIndex]] that generates the list of files to process by recursively listing all the
      * files present in `paths`.
      *
    - * @param rootPaths the list of root table paths to scan
    + * @param rootPathsSpecified the list of root table paths to scan (some of which might be
    + *                           filtered out later)
      * @param parameters as set of options to control discovery
      * @param partitionSchema an optional partition schema that will be use to provide types for the
      *                        discovered partitions
      */
     class InMemoryFileIndex(
         sparkSession: SparkSession,
    -    override val rootPaths: Seq[Path],
    +    rootPathsSpecified: Seq[Path],
         parameters: Map[String, String],
         partitionSchema: Option[StructType],
         fileStatusCache: FileStatusCache = NoopCache)
       extends PartitioningAwareFileIndex(
         sparkSession, parameters, partitionSchema, fileStatusCache) {
     
    +  // Filter out streaming metadata dirs or files such as "/.../_spark_metadata" (the metadata dir)
    +  // or "/.../_spark_metadata/0" (a file in the metadata dir). `rootPathsSpecified` might contain
    +  // such streaming metadata dir or files, e.g. when after globbing "basePath/*" where "basePath"
    +  // is the output of a streaming query.
    +  override val rootPaths = rootPathsSpecified.filterNot(FileStreamSink.ancestorIsMetadataDirectory)
    --- End diff --
    
    Yeah, you are quite correct! They will be filtered out by `InMemoryFileIndex.shouldFilterOut`.
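
For readers following the diff, the root-path filtering it adds can be sketched roughly as below. This is a standalone illustration only, not Spark's implementation: it uses `java.nio.file.Path` instead of Hadoop's `Path`, and the object name `MetadataDirFilterSketch` and the sample paths are hypothetical. Only the directory name `_spark_metadata` and the `filterNot(...ancestorIsMetadataDirectory)` call shape come from the diff.

```scala
import java.nio.file.{Path, Paths}
import scala.annotation.tailrec

// Hypothetical stand-in for the helper referenced in the diff; assumes plain
// java.nio paths rather than Hadoop paths.
object MetadataDirFilterSketch {
  // Name of the streaming sink's metadata directory, as quoted in the diff comment.
  val metadataDir = "_spark_metadata"

  // Walks from `path` up to the filesystem root and returns true if the path
  // itself or any ancestor is named `_spark_metadata`.
  @tailrec
  def ancestorIsMetadataDirectory(path: Path): Boolean =
    if (path == null) false
    else if (path.getFileName != null && path.getFileName.toString == metadataDir) true
    else ancestorIsMetadataDirectory(path.getParent)

  def main(args: Array[String]): Unit = {
    // Sample input resembling the result of globbing "basePath/*" on a
    // streaming query's output directory (paths are made up).
    val rootPathsSpecified = Seq(
      Paths.get("/out/part-00000.parquet"),
      Paths.get("/out/_spark_metadata"),
      Paths.get("/out/_spark_metadata/0"))

    // Same shape as the line added in the diff: drop the metadata dir and its files.
    val rootPaths = rootPathsSpecified.filterNot(ancestorIsMetadataDirectory)
    rootPaths.foreach(println)
  }
}
```

With these sample inputs, only the data file is kept as a root path; the metadata directory and the file inside it are dropped before any listing happens.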


