[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r203148694 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- Here is the PR: https://github.com/apache/spark/pull/21798 Please, have a look at it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21769 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202831899 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- Can we submit a separate PR to add a new option for AVRO? We should not rely on hadoopConf to control the behaviors of AVRO. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202541358 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -680,12 +689,22 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { Files.createFile(new File(tempSaveDir, "non-avro").toPath) - val newDf = spark -.read -.option(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true") -.avro(tempSaveDir) + val count = try { --- End diff -- Nit: consider writing the `try...finally` like this: ``` val hadoopConf = spark.sqlContext.sparkContext.hadoopConfiguration try { hadoopConf.set(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true") val count = spark.read.avro(tempSaveDir).count() assert(count == 8) } finally { hadoopConf.unset(AvroFileFormat.IgnoreFilesWithoutExtensionProperty) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202526423 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- The Hadoop config can be changed like: ```scala spark .sqlContext .sparkContext .hadoopConfiguration .set("avro.mapred.ignore.inputs.without.extension", "true") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202525034 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite { spark.read.avro("*/*/*/*/*/*/*/something.avro") } -intercept[FileNotFoundException] { +intercept[java.io.IOException] { TestUtils.withTempDir { dir => FileUtils.touch(new File(dir, "test")) spark.read.avro(dir.toString) --- End diff -- I would actually remove this piece of code from the test, and write a separate test (reading a folder with files without `.avro` extensions) in which the Hadoop's parameter `avro.mapred.ignore.inputs.without.extension` is set to `true` explicitly. Unfortunatelly there is no method like `withSQLConf` for Hadoop's configs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202524884 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- Here is how people use the option so far: https://github.com/databricks/spark-avro/issues/71#issuecomment-280844562 . Probably we should seperatly from the PR discuss how we are going to fix the "bug" and break backward compatibily. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202524696 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite { assert(readDf.collect().sameElements(writeDf.collect())) } } + + test("SPARK-24805: reading files without .avro extension") { --- End diff -- Does it just introduce unnesseccary dependency here and overcomplicate the test? I can create small (with just one row) avro file without `.avro` extension especially for the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202524552 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- The `avro.mapred.ignore.inputs.without.extension` is hadoop's parameter. This PR aims to change the default behavior only. I don't want to convert the hadoop parameter to Avro datasource option here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202521397 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala --- @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister { // Schema evolution is not supported yet. Here we only pick a single random sample file to // figure out the schema of the whole dataset. val sampleFile = - if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) { + if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) { --- End diff -- I tried running queries. The option `avro.mapred.ignore.inputs.without.extension` is not set in `conf`. This is a bug in `spark-avro`. Please read the value from `options`. It would be good to have a new test case with `avro.mapred.ignore.inputs.without.extension` as true. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202521471 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite { spark.read.avro("*/*/*/*/*/*/*/something.avro") } -intercept[FileNotFoundException] { +intercept[java.io.IOException] { TestUtils.withTempDir { dir => FileUtils.touch(new File(dir, "test")) spark.read.avro(dir.toString) --- End diff -- We can fix the case as ``` spark.read.option("avro.mapred.ignore.inputs.without.extension", false).avro(dir.toString) ``` The behavior will be the same as before. And we don't need to modify the expected `FileNotFoundException` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202521435 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite { assert(readDf.collect().sameElements(writeDf.collect())) } } + + test("SPARK-24805: reading files without .avro extension") { --- End diff -- Nit: Can we create a temp path and copy the original `episodes.avro` to the path? So that we don't need to have two duplicated resource file. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21769 [SPARK-24805][SQL] Do not ignore avro files without extensions ## What changes were proposed in this pull request? In the PR, I propose to change default behaviour of AVRO datasource which currently ignores files without `.avro` extension in read by default. This PR sets the default value for `avro.mapred.ignore.inputs.without.extension` to `false` in the case if the parameter is not set by an user. ## How was this patch tested? Added a test file without extension in AVRO format, and new test for reading the file with and wihout specified schema. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 avro-without-extension Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21769.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21769 commit 35063ef8e734bdeb39316e02b2e8451d0d75d43a Author: Maxim Gekk Date: 2018-07-14T09:49:30Z Test for reading files without avro extension commit 760f98e7aecb5a4c267599b318479d7f2ade165a Author: Maxim Gekk Date: 2018-07-14T10:38:33Z Fix tests commit 8562a8d43868d551efa6a0e9a00d50cdc838a178 Author: Maxim Gekk Date: 2018-07-14T10:53:24Z Adding ticket number to test title --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org