Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23130#discussion_r237045706

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
    @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA
           assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog"))
         }
       }
    +
    +  test("skip empty files in non bucketed read") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +      Files.write(Paths.get(path, "empty"), Array.empty[Byte])
    +      Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8))
    +      val readback = spark.read.option("wholetext", true).text(path)
    +
    +      assert(readback.rdd.getNumPartitions === 1)
    --- End diff --

    Does this test fail without your change? IIUC, one partition can read multiple files.

    Is JSON the only data source that may return a row for an empty file?
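To illustrate the concern that one partition can read multiple files, here is a minimal sketch of size-based file packing. It is a simplified model, not Spark's actual planning code: `PackingSketch`, `pack`, and `skipEmpty` are hypothetical names, and the two constants are assumptions chosen to match the documented defaults of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB).

```scala
// Simplified model of packing files into read partitions by size.
// NOT Spark's implementation; all names here are hypothetical.
object PackingSketch {
  // Assumed values for illustration only.
  val MaxSplitBytes: Long = 128L * 1024 * 1024
  val OpenCostBytes: Long = 4L * 1024 * 1024

  /** Greedily packs file sizes into partitions, largest first.
    * Returns the file sizes grouped per partition. */
  def pack(fileSizes: Seq[Long], skipEmpty: Boolean): Seq[Seq[Long]] = {
    val candidates = if (skipEmpty) fileSizes.filter(_ > 0) else fileSizes
    val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[Long]]
    var current = Vector.empty[Long]
    var currentBytes = 0L
    candidates.sortBy(-_).foreach { size =>
      // Close the current partition if this file would overflow it.
      if (current.nonEmpty && currentBytes + size > MaxSplitBytes) {
        partitions += current
        current = Vector.empty[Long]
        currentBytes = 0L
      }
      current = current :+ size
      // Each file also counts a fixed "open cost" toward the partition.
      currentBytes += size + OpenCostBytes
    }
    if (current.nonEmpty) partitions += current
    partitions.toSeq
  }

  def main(args: Array[String]): Unit = {
    val sizes = Seq(0L, 1L) // the "empty" and "notEmpty" files from the test
    println(pack(sizes, skipEmpty = false).length)
    println(pack(sizes, skipEmpty = true).length)
  }
}
```

Under these assumptions both calls report a single partition, so an assertion on the partition count alone would not distinguish the two behaviours for such small files. A check would need files large enough to land in separate partitions, or a smaller split size.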