[GitHub] spark pull request #22611: [SPARK-25595] Ignore corrupt Avro files if flag I...

HyukjinKwon Tue, 02 Oct 2018 19:39:01 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22611#discussion_r222167036
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
    @@ -100,6 +77,50 @@ private[avro] class AvroFileFormat extends FileFormat
         }
       }
     
    +  private def inferAvroSchemaFromFiles(
    +      files: Seq[FileStatus],
    +      conf: Configuration,
    +      ignoreExtension: Boolean): Schema = {
    +    val ignoreCorruptFiles = SQLConf.get.ignoreCorruptFiles
    +    // Schema evolution is not supported yet. Here we only pick first random readable sample file to
    +    // figure out the schema of the whole dataset.
    +    val avroReader = files.iterator.map { f =>
    +      val path = f.getPath
    +      if (!ignoreExtension && !path.getName.endsWith(".avro")) {
    +        None
    +      } else {
    +        val in = new FsInput(path, conf)
    +        try {
    +          Some(DataFileReader.openReader(in, new GenericDatumReader[GenericRecord]()))
    +        } catch {
    +          case e: IOException =>
    +            if (ignoreCorruptFiles) {
    +              logWarning(s"Skipped the footer in the corrupted file: $path", e)
    +              None
    +            } else {
    +              throw new SparkException(s"Could not read file: $path", e)
    +            }
    +        } finally {
    +          in.close()
    +        }
    +      }
    +    }.collectFirst {
    +      case Some(reader) => reader
    +    }
    +
    +    avroReader match {
    +      case Some(reader) =>
    +        try {
    --- End diff --
    
    ditto for `tryWithResource`
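
For readers unfamiliar with the suggestion: a minimal sketch of the loan pattern, assuming the comment refers to a helper along the lines of Spark's `Utils.tryWithResource`. The helper is redefined locally here so the example is self-contained, and `TrackedStream` and `firstByte` are hypothetical names used only for illustration. They are not part of the PR.

```scala
import java.io.{ByteArrayInputStream, Closeable, InputStream}

object TryWithResourceSketch {
  // Local stand-in for a loan-pattern helper (an assumption for this sketch):
  // create the resource, run `f` on it, and always close it afterwards.
  def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
    val resource = createResource
    try f(resource) finally resource.close()
  }

  // Hypothetical stream that records whether close() was called.
  final class TrackedStream(bytes: Array[Byte]) extends ByteArrayInputStream(bytes) {
    @volatile var closed = false
    override def close(): Unit = { closed = true; super.close() }
  }

  // Reads the first byte; the stream is closed whether or not the read succeeds.
  def firstByte(in: InputStream): Int = tryWithResource(in)(_.read())
}
```

Compared with a hand-written `try { ... } finally { reader.close() }`, the helper keeps the open and close steps in one place, so the resource cannot be left open on an early return or an exception.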



