[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816771#comment-15816771 ]
Michael Allman commented on SPARK-17993: ---------------------------------------- Cool. I'll work on a simple PR to silence those warnings in the default log configuration. > Spark prints an avalanche of warning messages from Parquet when reading > parquet files written by older versions of Parquet-mr > ----------------------------------------------------------------------------------------------------------------------------- > > Key: SPARK-17993 > URL: https://issues.apache.org/jira/browse/SPARK-17993 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Michael Allman > Assignee: Michael Allman > Fix For: 2.1.0 > > > It looks like https://github.com/apache/spark/pull/14690 broke parquet log > output redirection. After that patch, when querying parquet files written by > Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from > the Parquet reader: > {code} > Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: > Ignoring statistics because created_by could not be parsed (see PARQUET-251): > parquet-mr version 1.6.0 > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) > )?\(build ?(.*)\) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at > org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This only happens during execution, not planning, and it doesn't matter what > log level the {{SparkContext}} is set to. > This is a regression I noted as something we needed to fix as a follow up to > PR 14690. I feel responsible, so I'm going to expedite a fix for it. I > suspect that PR broke Spark's Parquet log output redirection. That's the > premise I'm going by. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org