[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

cloud-fan Wed, 28 Nov 2018 03:35:12 -0800

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23130#discussion_r237045706
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
    @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA
           assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog"))
         }
       }
    +
    +  test("skip empty files in non bucketed read") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +      Files.write(Paths.get(path, "empty"), Array.empty[Byte])
    +      Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8))
    +      val readback = spark.read.option("wholetext", true).text(path)
    +
    +      assert(readback.rdd.getNumPartitions === 1)
    --- End diff --
    
    Does this test fail without your change? IIUC, one partition can read
    multiple files, so a partition count of 1 does not by itself show that the
    empty file was skipped. Is JSON the only data source that may return a row
    for an empty file?
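
To illustrate the concern about partition counts, here is a minimal sketch of size-based file packing. It is a simplified model written for this comment, not Spark's actual implementation: `PackingSketch`, `FileEntry`, and `pack` are hypothetical names, and the 128 MB split size and 4 MB per-file open cost are assumed values chosen to resemble typical defaults.

```scala
// Simplified model of size-based file packing. NOT Spark's actual code.
object PackingSketch {
  // Hypothetical stand-in for a file to be read: name and length in bytes.
  final case class FileEntry(name: String, length: Long)

  // Greedily pack files into partitions. A new partition is started only when
  // adding the next file would push the current one past maxSplitBytes.
  // Each file is charged openCost bytes on top of its length (assumed model).
  def pack(
      files: Seq[FileEntry],
      maxSplitBytes: Long,
      openCost: Long): Seq[Seq[FileEntry]] = {
    val partitions = Seq.newBuilder[Seq[FileEntry]]
    var current = Vector.empty[FileEntry]
    var currentSize = 0L
    files.foreach { f =>
      if (current.nonEmpty && currentSize + f.length > maxSplitBytes) {
        partitions += current
        current = Vector.empty
        currentSize = 0L
      }
      current = current :+ f
      currentSize += f.length + openCost
    }
    if (current.nonEmpty) partitions += current
    partitions.result()
  }

  def main(args: Array[String]): Unit = {
    // Same layout as the test: one empty file and one 1-byte file.
    val files = Seq(FileEntry("empty", 0L), FileEntry("notEmpty", 1L))
    val maxSplit = 128L * 1024 * 1024 // assumed split size
    val openCost = 4L * 1024 * 1024   // assumed per-file open cost

    val withEmpty = pack(files, maxSplit, openCost)
    val withoutEmpty = pack(files.filter(_.length > 0), maxSplit, openCost)

    println(withEmpty.size)    // 1
    println(withoutEmpty.size) // 1
  }
}
```

Under these assumptions both inputs pack into a single partition, so an assertion on the number of partitions alone would not distinguish "empty file skipped" from "empty file read". Checking which files actually end up in the partition would be a stronger signal.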



