+1 for making use_fastavro the default for Python 3. I don't see any significant drawbacks in doing this from Beam's point of view. One concern is whether avro and fastavro can safely co-exist in the same environment, so that Beam continues to work for users who already have the avro library installed.
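
For the co-existence question, a rough first check is to see which of the two libraries are importable in a given environment. The snippet below is only a sketch: `avro_libs_available` is a hypothetical helper, not part of Beam, and it assumes the packages expose the top-level module names `avro` and `fastavro`. It only tells us whether both can be found side by side, not whether Beam behaves correctly with both installed.

```python
import importlib.util


def avro_libs_available(names=("avro", "fastavro")):
    """Map each top-level module name to whether it can be found here.

    Hypothetical helper for illustration only; not part of Beam.
    """
    # find_spec locates a top-level module without importing it and
    # returns None when the module is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in avro_libs_available().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

If both show up as installed, the next step would still be to run the avroio tests in that environment.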
Note that there are two use_fastavro flags (confusingly enough): (1) the flag on the Avro file source [1], and (2) an experiment flag [2] with the same name that makes the Dataflow runner use the fastavro library for reading/writing intermediate files and for reading Avro files exported by BigQuery. I can help with the latter.

[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/avroio.py#L81
[2] https://lists.apache.org/thread.html/94bd362a3a041654e6ef9003fb3fa797e25274fdb4766065481a0796@%3Cuser.beam.apache.org%3E

Thanks,
Cham

On Wed, Mar 27, 2019 at 3:27 PM Valentyn Tymofieiev <[email protected]> wrote:

> Thanks, Robbe and Frederik, for raising this.
>
> Over the course of making Beam Python 3 compatible, this is at least the
> second time [1] we have had to deal with an error in the avro-python3
> package. The release cadence of Apache Avro (1 release a year) is
> concerning to me [2]. Even if we have a new release with Python 3 fixes
> soon, as Beam users start using Beam more actively on Python 3, we may
> encounter more issues in avro-python3. If this happens, Beam will have to
> monkey-patch its way around the avro-python3 issues, because waiting for
> the next Avro release may not be practical.
>
> So, I agree that it is a good time to start transitioning off of the
> avro/avro-python3 dependency, given that fastavro is known to be a faster
> alternative [3] and is released monthly [4].
>
> There are a couple of ways to make this transition, depending on how
> careful we want to be. We should:
>
> 1. Remove the dependency on avro in the current codepath whenever
>    fastavro is used, as you propose.
> 2. Remove the Beam dependency on avro-python3 now, OR, if we want to be
>    safer, set use_fastavro=True as the default option on Python 3, but
>    keep the dependency on avro-python3 and keep that codepath, even
>    though it may not work right now on Py3 but might work after the next
>    Avro release.
> 3. Set use_fastavro=True as the default option on Python 2.
> 4. Remove the Beam dependency on avro and avro-python3 after several
>    releases.
>
> Adding +Chamikara Jayalath <[email protected]> and +Udi Meiri
> <[email protected]>, who have been working on Beam IOs and may have some
> thoughts here. Do you think that it is safe to make use_fastavro=True the
> default option for both Py2 and Py3 now? If we make use_fastavro the
> default option on Py3, do you think there is a benefit to still keeping
> the Avro codepath on Py3, or can we remove it?
>
> Thanks,
> Valentyn
>
> [1] https://github.com/apache/avro/pull/436
> [2] https://avro.apache.org/releases.html
> [3] https://medium.com/@abrarsheikh/benchmarking-avro-and-fastavro-using-pytest-benchmark-tox-and-matplotlib-bd7a83964453
> [4] https://pypi.org/project/fastavro/#history
>
> On Wed, Mar 27, 2019 at 10:49 AM Robbe Sneyders <[email protected]> wrote:
>
>> Hi all,
>>
>> We're looking at fixing avroio on Python 3, which still fails due to a
>> non-picklable schema class in Avro [1]. This is fixed when using the
>> latest Avro master, but the last release dates back to May 2017.
>>
>> Fastavro does not have the same problem, but is currently also failing
>> due to a dependency of avroio on Avro for schema parsing.
>>
>> We would therefore propose to (temporarily?) deprecate Avro on Python 3,
>> and implement a pure fastavro solution instead. +Frederik Bode
>> <[email protected]> already submitted a PR for this [2].
>>
>> Use of fastavro is currently activated with the `use_fastavro` flag,
>> which defaults to False. Since this flag would not make sense anymore on
>> Python 3, we would like to switch the default value to True. The
>> documentation already mentions that this will probably become the
>> default in the long term, but this change would also impact Python 2.
>> Is this a problem?
>>
>> Also, looking at the performance gain of fastavro, is there any reason
>> not to deprecate Avro in favor of fastavro on Python 3 indefinitely?
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6522#comment-16784499
>> [2] https://github.com/apache/beam/pull/8130
>>
>> Kind regards,
>> Robbe
