Hi Gyula, as a follow-up, you may be interested in https://issues.apache.org/jira/browse/FLINK-9413
Nico On 04/05/18 15:36, Gyula Fóra wrote: > Looks pretty clear that one operator takes too long to start (even on > the UI it shows it in the created state for far too long). Any idea what > might cause this delay? It actually often crashes on Akka ask timeout > during scheduling the node. > > Gyula > > Piotr Nowojski <pi...@data-artisans.com > <mailto:pi...@data-artisans.com>> ezt írta (időpont: 2018. máj. 4., P, > 15:33): > > Ufuk: I don’t know why. > > +1 for your other suggestions. > > Piotrek > > > On 4 May 2018, at 14:52, Ufuk Celebi <u...@data-artisans.com > <mailto:u...@data-artisans.com>> wrote: > > > > Hey Gyula! > > > > I'm including Piotr and Nico (cc'd) who have worked on the network > > stack in the last releases. > > > > Registering the network structures including the intermediate results > > actually happens **before** any state is restored. I'm not sure why > > this reproducibly happens when you restore state. @Nico, Piotr: any > > ideas here? > > > > In general I think what happens here is the following: > > - a task requests the result of a local upstream producer, but that > > one has not registered its intermediate result yet > > - this should result in a retry of the request with some backoff > > (controlled via the config params you mention > > taskmanager.network.request-backoff.max, > > taskmanager.network.request-backoff.initial) > > > > As a first step I would set logging to DEBUG and check the TM logs for > > messages like "Retriggering partition request {}:{}." > > > > You can also check the SingleInputGate code which has the logic for > > retriggering requests. > > > > – Ufuk > > > > > > On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.f...@gmail.com > <mailto:gyula.f...@gmail.com>> wrote: > >> Hi Ufuk, > >> > >> Do you have any quick idea what could cause this problems in > flink 1.4.2? > >> Seems like one operator takes too long to deploy and downstream > tasks error > >> out on partition not found. This only seems to happen when the job is > >> restored from state and in fact that operator has some keyed and > operator > >> state as well. > >> > >> Deploying the same job from empty state works well. We tried > increasing the > >> taskmanager.network.request-backoff.max that didnt help. > >> > >> It would be great if you have some pointers where to look > further, I havent > >> seen this happening before. > >> > >> Thank you! > >> Gyula > >> > >> The errror: > >> org.apache.flink.runtime.io > <http://org.apache.flink.runtime.io>.network.partition.: Partition > >> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd > not found. > >> at > >> org.apache.flink.runtime.io > > <http://org.apache.flink.runtime.io>.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77) > >> at > >> org.apache.flink.runtime.io > > <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115) > >> at > >> org.apache.flink.runtime.io > > <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159) > >> at java.util.TimerThread.mainLoop(Timer.java:555) > >> at java.util.TimerThread.run(Timer.java:505) > > > > > > > > -- > > Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin > > > > i...@data-artisans.com <mailto:i...@data-artisans.com> > > +49-30-43208879 <tel:+49%2030%2043208879> > > > > Registered at Amtsgericht Charlottenburg - HRB 158244 B > > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen > -- Nico Kruber | Software Engineer data Artisans Follow us @dataArtisans -- Join Flink Forward - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA -- Data Artisans GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
signature.asc
Description: OpenPGP digital signature