[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362804#comment-17362804
 ] 

Steven Aerts commented on SPARK-35744:
--------------------------------------

[~Gengliang.Wang] In the Avro Java/Scala world, there are two ways of handling 
data.

You can use 
[GenericData|https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/generic/GenericData.html],
which gives you a generic way to handle any Avro data.  This is also what 
spark-avro uses internally.
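
As a minimal sketch of the generic approach (assuming the Avro Java library is on the classpath; the User schema and the class and method names are hypothetical, made up for illustration), a record is created from a schema at runtime and its fields are accessed by name:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericRecordSketch {
    // Hypothetical schema, used only for illustration.
    static final String USER_SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example\","
        + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    // Builds a record without any generated class: fields are set by name.
    static GenericRecord buildUser(String name, int age) {
        Schema schema = new Schema.Parser().parse(USER_SCHEMA_JSON);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", name);
        user.put("age", age);
        return user;
    }

    public static void main(String[] args) {
        GenericRecord user = buildUser("Ada", 36);
        // Field values come back as Object and must be cast by the caller.
        System.out.println(user.get("name") + " " + user.get("age"));
    }
}
```

No code generation is involved here, so this path does not create specific records.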

The other option is to use 
[SpecificData|https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/specific/SpecificData.html],
where you let the [Avro codegen 
generate|https://avro.apache.org/docs/1.10.2/gettingstartedjava.html#Serializing+and+deserializing+with+code+generation]
classes specific to your Avro schema and work with those classes directly.  
If you use these generated classes in Spark, you will hit the issue 
mentioned.
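
To give an idea of what such a class looks like, here is a hand-written, heavily simplified stand-in (assuming the Avro Java library is on the classpath; the User schema and class names are hypothetical).  Real codegen output is larger and also contains a Builder, which is not reproduced here:

```java
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificRecordBase;

public class SpecificRecordSketch {
    // Hand-written stand-in for a generated class; NOT actual codegen output.
    public static class User extends SpecificRecordBase {
        // Hypothetical schema, used only for illustration.
        public static final Schema SCHEMA$ = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        private CharSequence name;
        private int age;

        @Override public Schema getSchema() { return SCHEMA$; }

        // Positional access used by Avro's readers and writers.
        @Override public Object get(int field) {
            switch (field) {
                case 0: return name;
                case 1: return age;
                default: throw new AvroRuntimeException("Bad index: " + field);
            }
        }

        @Override public void put(int field, Object value) {
            switch (field) {
                case 0: name = (CharSequence) value; break;
                case 1: age = (Integer) value; break;
                default: throw new AvroRuntimeException("Bad index: " + field);
            }
        }

        // Typed accessors: the main convenience over generic records.
        public CharSequence getName() { return name; }
        public void setName(CharSequence value) { this.name = value; }
        public int getAge() { return age; }
        public void setAge(int value) { this.age = value; }
    }

    public static void main(String[] args) {
        User user = new User();
        user.setName("Ada");
        user.setAge(36);
        System.out.println(user.getName() + " " + user.getAge());
    }
}
```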

I am not sure how common this issue is, and I would totally understand if you 
closed it as too exotic.

> Performance degradation in avro SpecificRecordBuilders
> ------------------------------------------------------
>
>                 Key: SPARK-35744
>                 URL: https://issues.apache.org/jira/browse/SPARK-35744
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Steven Aerts
>            Priority: Minor
>
> Creating this bug to let you know that when we tested Spark 3.2.0 we saw 
> a significant performance degradation in code that handles Avro 
> specific record objects.  This slowed down some of our jobs by a factor of 4.
> Spark 3.2.0 upgrades the Avro version from 1.8.2 to 1.10.2.
> The degradation is caused by a change introduced in Avro 1.9.0.  This change 
> degrades performance when creating Avro specific records in certain 
> classloader topologies, like the ones used in Spark.
> We reported this and [proposed|https://github.com/apache/avro/pull/1253] a 
> simple fix upstream in the Avro project.  (Links contain more details.)
> It is unclear to us how many other projects use Avro specific records 
> in a Spark context and will be impacted by this degradation.
> Feel free to close this issue if you think it is too much of a 
> corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
