[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655355#comment-15655355
 ] 

Michael Allman commented on SPARK-17993:
----------------------------------------

This patch will be part of Spark 2.1, but it looks like it won't make it into 
2.0.2. If you'd like help backporting this patch to 2.0, mail me privately and 
I can send you a patch.

> Spark prints an avalanche of warning messages from Parquet when reading 
> parquet files written by older versions of Parquet-mr
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17993
>                 URL: https://issues.apache.org/jira/browse/SPARK-17993
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Allman
>            Assignee: Michael Allman
>             Fix For: 2.1.0
>
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0, Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
>       at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>       at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>       at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>       at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>       at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>       at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>       at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>       at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>       at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>       at org.apache.spark.scheduler.Task.run(Task.scala:99)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow-up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection, and that is the 
> premise I'm working from.
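
As an aside on why the warning fires at all: the exception message quotes both the created_by string and the format it was checked against. The sketch below applies that format, copied verbatim from the message, using Python's re module. It is only an illustration, not Parquet's actual parser code. It assumes the parser requires the whole string to match, and the helper name `parses` and the second sample string (with a made-up build id) are hypothetical.

```python
import re

# Format copied verbatim from the VersionParseException message in the log.
CREATED_BY_FORMAT = r"(.+) version ((.*) )?\(build ?(.*)\)"


def parses(created_by: str) -> bool:
    """Return True if the whole created_by string matches the format.

    Assumption: the real parser requires a full match, not a partial one.
    """
    return re.fullmatch(CREATED_BY_FORMAT, created_by) is not None


# The string reported in the warning. It has no "(build ...)" suffix,
# which the format requires, so it cannot match.
old_style = "parquet-mr version 1.6.0"

# Hypothetical string with a build suffix, invented for illustration only.
with_build = "parquet-mr version 1.8.1 (build abc123)"

if __name__ == "__main__":
    for s in (old_style, with_build):
        print(f"{s!r}: {'parses' if parses(s) else 'does not parse'}")
```

Under these assumptions, the 1.6.0 string fails because the format insists on a literal "(build" section, which is consistent with the warning appearing for every file written by that version.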



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
