Hi Ryan,

Thanks for the input. When I last ran my pipeline, this problem didn't seem to be a huge bottleneck; I probably had other things that were making it worse. I still think it is weird that when you take a thread dump "snapshot" most of the threads are waiting on that lock, so once I have fixed my other problems I might come back to this.

Best regards,
Augusto

On 2019/04/04 12:39:52, Ryan Skraba <[email protected]> wrote:
> Hello Augusto!
>
> I just took a look. The behaviour that you're seeing looks like it's set
> in Avro ReflectData -- to avoid doing expensive reflection calls for each
> serialization/deserialization, it uses a cache per-class AND access is
> synchronized [1]. Only one thread in your executor JVM is accessing the
> cached ClassAccessorData at a time, and so it's "normal" that the others
> are waiting... Of course, this doesn't mean that only one thread in the
> executor is running at a time, just that they always need to wait their
> turn before passing through that one method.
>
> You could have more executors with fewer cores per executor. That might
> shed some light, but it's not really a workaround or solution.
>
> We've had really good results with AvroCoder.of(Schema), which uses
> GenericData underneath. We already knew the schema we wanted, so it was
> ok to lose the "magic" of ReflectData and its automatic schema inference,
> etc. I'm a bit surprised that this hasn't come up as a bottleneck before
> in Avro, but I didn't find an existing JIRA.
>
> If Avro serialization isn't important to you, you might want to check out
> the custom Coder route. I'd love to hear if you see a big gain in perf!
>
> I hope this helps, Ryan
>
> [1]
> https://github.com/apache/avro/blame/branch-1.8/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java#L262
>
> On Tue, Apr 2, 2019 at 5:52 PM Maximilian Michels <[email protected]> wrote:
>
> > Hey Augusto,
> >
> > I haven't used @DefaultCoder, but it could be the problem here.
> >
> > What if you specify the coder directly for your PCollection? For example:
> >
> > pCol.setCoder(AvroCoder.of(YourClazz.class));
> >
> > Thanks,
> > Max
> >
> > On 01.04.19 17:52, Augusto Ribeiro wrote:
> > > Hi Max,
> > >
> > > I tried to run the job again in a cluster; this is a thread dump from
> > > one of the Spark executors (16 cores).
> > >
> > > https://imgur.com/u2Gz0xY
> > >
> > > As you can see, almost all threads are blocked on that single Avro
> > > reflection method.
> > >
> > > Best regards,
> > > Augusto
> > >
> > > On 2019/03/27 07:43:17, Augusto Ribeiro <[email protected]> wrote:
> > > > Hi Max,
> > > >
> > > > Thanks for the answer. I will give it another try after I have sorted
> > > > out some other things. I will try to save more data next time
> > > > (screenshots, thread dumps) so that if it happens again I can be more
> > > > specific in my questions.
> > > >
> > > > Best regards,
> > > > Augusto
> > > >
> > > > On 2019/03/26 12:31:54, Maximilian Michels <[email protected]> wrote:
> > > > > Hi Augusto,
> > > > >
> > > > > Generally speaking Avro should provide very good performance. The
> > > > > calls you are seeing should not be significant because Avro caches
> > > > > the schema information for a type. It only creates a schema via
> > > > > Reflection the first time it sees a new type.
> > > > >
> > > > > You can optimize further by using your domain knowledge and creating
> > > > > a custom coder. However, if you do not do anything fancy, I think
> > > > > the odds are low that you will see a performance increase.
> > > > >
> > > > > Cheers,
> > > > > Max
> > > > >
> > > > > On 26.03.19 09:35, Augusto Ribeiro wrote:
> > > > > > Hi again,
> > > > > >
> > > > > > Sorry for bumping this thread but nobody really came up with any
> > > > > > insight.
> > > > > >
> > > > > > Should I be defining my own coders for my objects or is it common
> > > > > > practice to use the AvroCoder or maybe some other coder?
> > > > > >
> > > > > > Best regards,
> > > > > > Augusto
> > > > > >
> > > > > > On 2019/03/21 07:35:07, [email protected] wrote:
> > > > > > > Hi
> > > > > > >
> > > > > > > I am trying out Beam to do some data aggregations. Many of the
> > > > > > > inputs/outputs of my transforms are complex objects (not super
> > > > > > > complex, but containing Maps/Lists/Sets sometimes) so when I was
> > > > > > > prompted to define a coder for these objects I added the
> > > > > > > annotation @DefaultCoder(AvroCoder.class) and things worked in
> > > > > > > my development environment.
> > > > > > >
> > > > > > > Now that I am trying to run it on "real" data, I noticed, after
> > > > > > > deploying it to a Spark runner and looking at some thread dumps,
> > > > > > > that many of the threads were blocked on the following method in
> > > > > > > the Avro library (ReflectData.getAccessorsFor). So my question
> > > > > > > is, did I do the wrong thing by using the AvroCoder or is there
> > > > > > > some other coder that can easily solve my problem?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Augusto
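
For readers of the archive: the contention Ryan describes (a per-class cache whose access is synchronized) can be illustrated with a small, self-contained sketch. This is a simplified model and not Avro's actual code; the names AccessorCacheSketch, SynchronizedCache, ConcurrentCache and describe are hypothetical and exist only for this illustration. The first variant makes every lookup, hit or miss, pass through one monitor, which is roughly the pattern that would make many executor threads show up as blocked in a thread dump. The second variant, one possible alternative, only pays for coordination on a cache miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of a per-class metadata cache. NOT Avro's implementation.
class AccessorCacheSketch {

    // Counts how often the expensive step runs, so the caching is observable.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for expensive reflective work on a class.
    static String describe(Class<?> type) {
        LOADS.incrementAndGet();
        return type.getSimpleName();
    }

    // Variant 1: every lookup takes the same monitor, even on a cache hit.
    // With many threads, all but one wait here at any given moment.
    static final class SynchronizedCache {
        private final Map<Class<?>, String> cache = new HashMap<>();

        synchronized String get(Class<?> type) {
            return cache.computeIfAbsent(type, AccessorCacheSketch::describe);
        }
    }

    // Variant 2: cache hits are served by a plain concurrent read;
    // only a miss goes through the atomic computeIfAbsent path.
    static final class ConcurrentCache {
        private final ConcurrentHashMap<Class<?>, String> cache = new ConcurrentHashMap<>();

        String get(Class<?> type) {
            String hit = cache.get(type);
            return hit != null ? hit : cache.computeIfAbsent(type, AccessorCacheSketch::describe);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedCache contended = new SynchronizedCache();
        ConcurrentCache relaxed = new ConcurrentCache();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                contended.get(String.class); // threads queue up on the monitor
                relaxed.get(String.class);   // threads read concurrently
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(work);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("expensive loads: " + LOADS.get());
    }
}
```

Taking a thread dump while main is running should show most worker threads waiting on the SynchronizedCache monitor, which matches Ryan's point that blocked threads in a snapshot do not by themselves prove the lock is the dominant cost.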
