Re: Programmatic: parquet file corruption error

2020-03-27 Thread Zahid Rahman
Thanks Wenchen.  SOLVED! KINDA!

I removed all dependencies from the pom.xml in my IDE so I wouldn't be
picking up any libraries from the Maven repository.
I *instead* included the libraries (jars) from the *spark download* of
*spark-3.0.0-preview2-bin-hadoop2.7*.
This way I am using the *same libraries* that are used when running the
*spark-submit scripts*.
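
As a quick sanity check, something like the following confirms at runtime
which jar SparkSession is actually loaded from (a minimal sketch only; the
object name, master setting and app name are just illustrative):

import org.apache.spark.sql.SparkSession

object WhichSparkJar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("which-spark-jar")
      .getOrCreate()
    // Print the Spark version and the jar that provides SparkSession on the
    // runtime classpath, to confirm the IDE run uses the distribution jars.
    println(s"Spark version: ${spark.version}")
    println(classOf[SparkSession].getProtectionDomain.getCodeSource.getLocation)
    spark.stop()
  }
}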

I believe I managed to trace the issue.
I copied the log4j.properties.template into IntelliJ's resources
directory in my project, renaming it to log4j.properties.
So now I am also using the *same log4j.properties* as when running the
*spark-submit script*.

I noticed the settings *log4j.logger.org.apache.parquet=ERROR* and
*log4j.logger.parquet=ERROR*.
It appears that this parquet corruption warning is an *outstanding bug*
(PARQUET-251, referenced in the warning itself) and the *workaround* is to
quieten the warning.


#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
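
The same two parquet loggers can also be quietened programmatically instead
of (or in addition to) the properties file. A minimal sketch, assuming the
log4j 1.x API bundled with this Spark build; these lines would go at the top
of main() (or be pasted into spark-shell) before the parquet file is read:

import org.apache.log4j.{Level, Logger}

// Silence the two loggers that emit the PARQUET-251 "corrupt statistics" warning
Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)
Logger.getLogger("parquet").setLevel(Level.ERROR)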




Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Fri, 27 Mar 2020 at 07:44, Wenchen Fan  wrote:

> Running a Spark application with an IDE is not officially supported. It may
> work in some cases, but there is no guarantee at all. The official way is to
> run interactive queries with spark-shell or to package your application into
> a jar and use spark-submit.
>
> On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman  wrote:
>
>> Hi,
>>
>> When I run the code for a user-defined data type dataset, using a case
>> class in Scala, in the interactive spark-shell against a parquet file,
>> the results are as expected.
>> However, when I run the same code programmatically in the IntelliJ IDE,
>> Spark gives a file corruption error.
>>
>> Steps I have taken to determine the source of the error are:
>> I have tested the file permissions and made sure to chmod 777, just in
>> case.
>> I tried a fresh copy of the same parquet file.
>> I ran both programmes before and after the fresh copy.
>> I also rebooted, then ran programmatically against a fresh parquet file.
>> The corruption error was consistent in all cases.
>> I have copied and pasted the spark-shell output, the error message, the
>> code in the IDE, the pom.xml, and the IntelliJ java classpath command line.
>>
>> Perhaps the libraries used when running programmatically are different
>> from the ones used by spark-shell.
>> I don't believe it is an error on my part.
>>
>> <--
>>
>> 07:28:45 WARN  CorruptStatistics:117 - Ignoring statistics because
>> created_by could not be parsed (see PARQUET-251): parquet-mr (build
>> 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
>> org.apache.parquet.VersionParser$VersionParseException: Could not parse
>> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
>> using format:
>> (.*?)\s+version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?
>> at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>> at
>> 

Re: Programmatic: parquet file corruption error

2020-03-27 Thread Wenchen Fan
Running a Spark application with an IDE is not officially supported. It may
work in some cases, but there is no guarantee at all. The official way is to
run interactive queries with spark-shell or to package your application into
a jar and use spark-submit.

On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman  wrote:

> Hi,
>
> When I run the code for a user-defined data type dataset, using a case
> class in Scala, in the interactive spark-shell against a parquet file,
> the results are as expected.
> However, when I run the same code programmatically in the IntelliJ IDE,
> Spark gives a file corruption error.
>
> Steps I have taken to determine the source of the error are:
> I have tested the file permissions and made sure to chmod 777, just in
> case.
> I tried a fresh copy of the same parquet file.
> I ran both programmes before and after the fresh copy.
> I also rebooted, then ran programmatically against a fresh parquet file.
> The corruption error was consistent in all cases.
> I have copied and pasted the spark-shell output, the error message, the
> code in the IDE, the pom.xml, and the IntelliJ java classpath command line.
>
> Perhaps the libraries used when running programmatically are different
> from the ones used by spark-shell.
> I don't believe it is an error on my part.
>
> <--
>
> 07:28:45 WARN  CorruptStatistics:117 - Ignoring statistics because
> created_by could not be parsed (see PARQUET-251): parquet-mr (build
> 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> using format:
> (.*?)\s+version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?
> at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
> at
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:72)
> at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:435)
> at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:454)
> at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:914)
> at
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:885)
> at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
> at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
> at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
> at
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
> at
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:105)
> at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
> at
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildReaderBase(ParquetPartitionReaderFactory.scala:174)
> at
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.createVectorizedReader(ParquetPartitionReaderFactory.scala:205)
> at
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildColumnarReader(ParquetPartitionReaderFactory.scala:103)
> at
> org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createColumnarReader$1(FilePartitionReaderFactory.scala:38)
> at
> org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory$$Lambda$2018/.apply(Unknown
> Source)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at
> org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:109)
> at
> org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
> at
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:62)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
> at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:321)
> at

Programmatic: parquet file corruption error

2020-03-26 Thread Zahid Rahman
Hi,

When I run the code for a user-defined data type dataset, using a case class
in Scala, in the interactive spark-shell against a parquet file, the results
are as expected.
However, when I run the same code programmatically in the IntelliJ IDE, Spark
gives a file corruption error.
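
Roughly, the code in question looks like this (a minimal sketch only; the
case class fields, file path and object name are illustrative, not the
original programme):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object ReadParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-parquet")
      .getOrCreate()
    import spark.implicits._

    // Read the parquet file and map it to a strongly typed Dataset[Person]
    val people = spark.read.parquet("people.parquet").as[Person]
    people.show()
    spark.stop()
  }
}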

Steps I have taken to determine the source of the error are:
I have tested the file permissions and made sure to chmod 777, just in
case.
I tried a fresh copy of the same parquet file.
I ran both programmes before and after the fresh copy.
I also rebooted, then ran programmatically against a fresh parquet file.
The corruption error was consistent in all cases.
I have copied and pasted the spark-shell output, the error message, the code
in the IDE, the pom.xml, and the IntelliJ java classpath command line.

Perhaps the libraries used when running programmatically are different from
the ones used by spark-shell.
I don't believe it is an error on my part.
<--

07:28:45 WARN  CorruptStatistics:117 - Ignoring statistics because
created_by could not be parsed (see PARQUET-251): parquet-mr (build
32c46643845ea8a705c35d4ec8fc654cc8ff816d)
org.apache.parquet.VersionParser$VersionParseException: Could not parse
created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
using format:
(.*?)\s+version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:72)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:435)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:454)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:914)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:885)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:532)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
at
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:105)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
at
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildReaderBase(ParquetPartitionReaderFactory.scala:174)
at
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.createVectorizedReader(ParquetPartitionReaderFactory.scala:205)
at
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory.buildColumnarReader(ParquetPartitionReaderFactory.scala:103)
at
org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createColumnarReader$1(FilePartitionReaderFactory.scala:38)
at
org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory$$Lambda$2018/.apply(Unknown
Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at
org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:109)
at
org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:42)
at
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:62)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
at
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:321)
at
org.apache.spark.sql.execution.SparkPlan$$Lambda$1879/.apply(Unknown
Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.RDD$$Lambda$1875/.apply(Unknown
Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at