Hello Gabor,

When I brought similar concerns up on the dev@spark mailing list, I was told to 
bring that discussion to the Parquet community first.

I've been on the Avro and Parquet mailing lists since then, hoping I might 
help coordinate between the three communities.  If Avro and Parquet are not 
upgraded in lock step, many compatibility issues are pushed to downstream 
projects to work around.  For example:

https://beam.apache.org/documentation/io/built-in/parquet/
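For anyone hitting the same mismatch, the generic shape of such downstream workarounds (not our actual fix, and the versions below are purely illustrative) is to pin a single Avro version and exclude the transitive copy that parquet-avro pulls in, e.g. in a Maven POM:

```xml
<!-- Hypothetical Maven fragment: force one Avro version so the
     application and parquet-avro agree at runtime. Coordinates are
     real, but the version choices are illustrative only. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.8.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.10.1</version>
    <exclusions>
      <!-- drop parquet-avro's transitive Avro; the managed version wins -->
      <exclusion>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
</dependencies>
```

This only papers over the problem, of course; it does nothing if the two libraries actually require incompatible Avro APIs at runtime.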

Our current workaround is so embarrassing I'd rather not mention it here.

   michael


> On Jan 24, 2020, at 5:18 AM, Gabor Szadovszky <[email protected]> wrote:
> 
> Thanks a lot, Michael, for highlighting this. However, it is more a Spark
> issue than a Parquet one.
> Could you add your concerns to the Spark PR/jira?
> 
> Thanks a lot,
> Gabor
> 
> On Thu, Jan 23, 2020 at 7:08 PM Michael Heuer <[email protected]> wrote:
> 
>> For example,
>> 
>> https://github.com/bigdatagenomics/adam/pull/2245
>> 
>> ...
>> Caused by: java.lang.NoSuchMethodError:
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>        at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>        at java.lang.Thread.run(Thread.java:748)
>> 
>> 
>>> On Jan 23, 2020, at 10:52 AM, Michael Heuer <[email protected]> wrote:
>>> 
>>> Hello Gabor,
>>> 
>>> This Spark PR upgrades Parquet but does not upgrade Avro; note the
>>> exclusion for parquet-avro:
>>> 
>>> 
>>> https://github.com/apache/spark/pull/26804/files#diff-600376dffeb79835ede4a0b285078036R2104
>>> 
>>> 
>>> Parquet 1.11.0/1.11.1 depends on Avro 1.9.1, while Spark depends on Avro
>>> 1.8.2. How will this Spark PR be compatible?
>>> 
>>>   michael
>>> 
>>> 
>>>> On Jan 23, 2020, at 3:38 AM, Gabor Szadovszky <[email protected]> wrote:
>>>> 
>>>> Thanks, Fokko. I've linked the related issues to the release jira as
>>>> blockers.
>>>> Currently, every issue is resolved. We are waiting for feedback on
>>>> whether the fixes/descriptions are correct and whether anything else
>>>> needs to be fixed for Spark.
>>>> 
>>>> On Wed, Jan 22, 2020 at 5:18 PM Driesprong, Fokko <[email protected]>
>>>> wrote:
>>>> 
>>>>> Thank you Gabor,
>>>>> 
>>>>> What kind of issues are found? Let me know if I can help in any way.
>>>>> 
>>>>> Cheers, Fokko
>>>>> 
>>>>> On Wed, 22 Jan 2020 at 11:10, Gabor Szadovszky <[email protected]> wrote:
>>>>> 
>>>>>> Dear All,
>>>>>> 
>>>>>> During the migration to 1.11.0 in Spark, we discovered some issues in
>>>>>> the parquet release. I am preparing the minor release 1.11.1 to fix
>>>>>> these issues. I have created the jira
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1774
>>>>>> to track this effort. Feel free to link any bug jiras if they are
>>>>>> regressions in 1.11.0.
>>>>>> The release will be prepared in the separate branch "parquet-1.11.x".
>>>>>> I'll do the backports as required.
>>>>>> 
>>>>>> Regards,
>>>>>> Gabor
