Hi Jungtaek,

Thanks, we thought that might be the issue but haven't tested it yet, as building against an unreleased version of Spark is tough for us due to network restrictions. We will try though. I will report back if we find anything.

Best regards,
Patrick

On Fri, Oct 12, 2018, 2:57 PM Jungtaek Lim <kabh...@gmail.com> wrote:

> Hi Patrick,
>
> Looks like you might be struggling with state memory, for which multiple
> issues are going to be resolved in Spark 2.4.
>
> 1. SPARK-24441 [1]: Expose total estimated size of states in
>    HDFSBackedStateStoreProvider
> 2. SPARK-24637 [2]: Add metrics regarding state and watermark to
>    dropwizard metrics
> 3. SPARK-24717 [3]: Split out min retain version of state for memory in
>    HDFSBackedStateStoreProvider
>
> There are other patches relevant to the state store as well, but the
> issues above apply to map/flatMapGroupsWithState.
>
> Since the Spark community is in the process of releasing Spark 2.4.0,
> could you try experimenting with a Spark 2.4.0 RC, if you don't mind?
> You could try applying individual patches and see whether it helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24441
> 2. https://issues.apache.org/jira/browse/SPARK-24637
> 3. https://issues.apache.org/jira/browse/SPARK-24717
>
> On Fri, Oct 12, 2018 at 9:31 PM, Patrick McGloin <mcgloin.patr...@gmail.com> wrote:
>
>> Hi all,
>>
>> I sent this earlier but the screenshots were not attached.
>> Hopefully this time it is correct.
>>
>> We have a Spark Structured Streaming stream which is using
>> mapGroupsWithState. After some time of processing in a stable manner,
>> suddenly each mini batch starts taking 40 seconds. Suspiciously, it looks
>> like exactly 40 seconds each time. Before this the batches were taking
>> less than a second.
>>
>> Looking at the details for a particular task, most partitions are
>> processed really quickly but a few take exactly 40 seconds:
>>
>> The GC was looking ok as the data was being processed quickly but
>> suddenly the full GCs etc stop (at the same time as the 40 second issue):
>>
>> I have taken a thread dump from one of the executors as this issue is
>> happening but I cannot see any resource they are blocked on:
>>
>> Are we hitting a GC problem and why is it manifesting in this way? Is
>> there another resource that is blocking and what is it?
>>
>> Thanks,
>> Patrick
>>
>> This message has been sent by ABN AMRO Bank N.V., which has its seat at
>> Gustav Mahlerlaan 10 (1082 PP) Amsterdam, the Netherlands, and is
>> registered in the Commercial Register of Amsterdam under number 34334259.