Hello Gabor,

When I brought up similar concerns on the dev@spark mailing list, I was told to bring that discussion to the Parquet community first.
I've been on the Avro and Parquet mailing lists since, hoping I might help coordinate among the three communities. If Avro and Parquet are not upgraded in lock step, many compatibility issues are pushed to downstream projects to work around, e.g.

https://beam.apache.org/documentation/io/built-in/parquet/

Our current workaround is so embarrassing I'd rather not mention it here.

michael

> On Jan 24, 2020, at 5:18 AM, Gabor Szadovszky <[email protected]> wrote:
>
> Thanks a lot, Michael, for highlighting this. However, it is more a Spark
> issue than a Parquet one.
> Could you add your concerns to the Spark PR/jira?
>
> Thanks a lot,
> Gabor
>
> On Thu, Jan 23, 2020 at 7:08 PM Michael Heuer <[email protected]> wrote:
>
>> For example,
>>
>> https://github.com/bigdatagenomics/adam/pull/2245
>>
>> ...
>> Caused by: java.lang.NoSuchMethodError:
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>     at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>>
>>> On Jan 23, 2020, at 10:52 AM, Michael Heuer <[email protected]> wrote:
>>>
>>> Hello Gabor,
>>>
>>> This Spark PR upgrades Parquet but does not upgrade Avro; note the
>>> exclusion for parquet-avro:
>>>
>>> https://github.com/apache/spark/pull/26804/files#diff-600376dffeb79835ede4a0b285078036R2104
>>>
>>> Parquet 1.11.0/1.11.1 depends on Avro 1.9.1 and Spark depends on Avro
>>> 1.8.2. How will this Spark PR be compatible?
>>>
>>> michael
>>>
>>>
>>>> On Jan 23, 2020, at 3:38 AM, Gabor Szadovszky <[email protected]> wrote:
>>>>
>>>> Thanks, Fokko. I've linked the related issues to the release jira as
>>>> blockers.
>>>> Currently, every issue is resolved. We are waiting for feedback on
>>>> whether the fixes/descriptions are correct and whether we need to fix
>>>> anything else for Spark.
>>>>
>>>> On Wed, Jan 22, 2020 at 5:18 PM Driesprong, Fokko <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you, Gabor.
>>>>>
>>>>> What kind of issues were found? Let me know if I can help in any way.
>>>>>
>>>>> Cheers, Fokko
>>>>>
>>>>> On Wed, Jan 22, 2020 at 11:10 AM Gabor Szadovszky <[email protected]> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> During the migration to 1.11.0 in Spark, we discovered some issues in
>>>>>> the Parquet release. I am preparing the minor release 1.11.1 to fix
>>>>>> these issues. I have created the jira
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1774
>>>>>> to track this effort. Feel free to link any bug jiras if they are
>>>>>> regressions in 1.11.0.
>>>>>> The release will be prepared in the separate branch "parquet-1.11.x".
>>>>>> I'll do the backports as required.
>>>>>>
>>>>>> Regards,
>>>>>> Gabor
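[Editor's note] The NoSuchMethodError in the trace above is characteristic of a mixed classpath: parquet-avro 1.11.x calls the Types.Builder.as(LogicalTypeAnnotation) overload introduced in Parquet 1.11, but an older parquet-column (for example the 1.10.x that Spark ships) wins dependency resolution and lacks that method. As a minimal sketch only (the coordinates and version numbers here are illustrative assumptions, not a workaround endorsed in this thread), a downstream Maven project can keep the Parquet artifacts and Avro from drifting apart by pinning them together in dependencyManagement:

```xml
<!-- Sketch only: pin all Parquet artifacts, and Avro, to matching versions
     so parquet-avro and parquet-column cannot diverge on the classpath.
     Versions shown are illustrative. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.11.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.11.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-avro</artifactId>
      <version>1.11.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.9.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree -Dincludes=org.apache.parquet` afterwards is one way to confirm that only a single Parquet version remains on the classpath.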
