[ https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192395#comment-16192395 ]
Michael N commented on SPARK-22163:
-----------------------------------

Committers: The actual issues are with Sean Owen himself:

1. Sean Owen does not completely understand the issues in the tickets.
2. He does not know the answers to the questions posted in the tickets to help clarify the issues.
3. Instead of either finding the correct answers or asking people who know them, he blindly closes the tickets.

For instance, the issue in https://issues.apache.org/jira/browse/SPARK-22163 occurs on both the slaves and the driver. Further, while Spark uses multiple threads to run stream processing asynchronously, there is still synchronization from batch to batch. Say the batch interval is 5 seconds but batch #3 takes 1 minute to complete: Spark does not start batch #4 until batch #3 is completely done. So in effect, batch processing from one interval to the next is synchronous. Yet Sean Owen assumes everything in Spark Streaming is asynchronous.

Another instance: Sean Owen does not understand the difference between a design flaw and a coding bug. Code may be perfect and match the design; however, if the design is flawed, then it is the design that needs to change. An example is the early Spark 1.x API, where the map interface passes in only one object at a time. That is a design flaw, because it causes massive per-call overhead across billions of objects. The Iterator-based mapPartitions interface instead passes in a whole partition of objects at once.

Therefore, the broader issue is Sean Owen acting as a blocker and closing tickets when he does not understand them. When such cases come up, he should either learn about the issues or ask someone else who can. (Minimal sketches illustrating these points and the reported failure mode follow the quoted issue below.)

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22163
>                 URL: https://issues.apache.org/jira/browse/SPARK-22163
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Structured Streaming
>    Affects Versions: 2.2.0
>         Environment: Spark Streaming
>             Kafka
>             Linux
>            Reporter: Michael N
>            Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically
> as well. However, the Spark Streaming framework serializes the application's
> objects asynchronously while the application runs. This causes random
> run-time exceptions on a List when the framework happens to serialize the
> application's objects at the same time the application is modifying that
> List.
>
> In fact, there are multiple reported bugs about
>
> Caused by: java.util.ConcurrentModificationException
>         at java.util.ArrayList.writeObject
>
> that are permutations of the same root cause. So the design issue of the
> Spark Streaming framework is that it does this serialization asynchronously.
> Instead, it should either
> 1. do this serialization synchronously, which is preferred because it
> eliminates the issue completely, or
> 2. allow each application to configure whether this serialization is done
> synchronously or asynchronously, depending on the nature of the application.
>
> Also, the Spark documentation should describe the conditions that trigger
> Spark to do this type of serialization asynchronously, so applications can
> work around them until a fix is provided.
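To make the batch-to-batch synchronization point concrete, here is a minimal Spark Streaming sketch. The socket source, host/port, and the 60-second sleep are placeholders for illustration only; the point is that output operations for successive batches run one at a time by default (spark.streaming.concurrentJobs defaults to 1), so a slow batch delays the next one rather than overlapping with it.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchOrderingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-ordering-sketch").setMaster("local[2]")
    // 5-second batch interval, matching the example above.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: any text stream will do for illustration.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      val n = rdd.count()
      // Simulate batch #3 taking ~1 minute: until this output operation
      // finishes, jobs for later batches queue up behind it; they do not
      // run concurrently with it.
      Thread.sleep(60000)
      println(s"processed $n records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}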
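On the per-element versus Iterator-based design point, the two interfaces in question are RDD.map and RDD.mapPartitions. A minimal sketch, assuming an existing SparkContext named sc (as in spark-shell):

{code:scala}
// Assumes an existing SparkContext named `sc` (e.g., in spark-shell).
val rdd = sc.parallelize(1 to 1000000)

// map: the supplied function is invoked once per element, so any fixed
// per-invocation cost is paid once per record.
val perElement = rdd.map(x => x * 2)

// mapPartitions: the supplied function is invoked once per partition and
// receives an Iterator over that partition's elements, so per-invocation
// setup (buffers, connections, lookups) is amortized over many records.
val perPartition = rdd.mapPartitions(iter => iter.map(x => x * 2))

// Same results, different call granularity.
println(perElement.count() == perPartition.count())
{code}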
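As for the quoted issue itself, the failure mode can be reproduced without Spark at all: java.util.ArrayList.writeObject checks the list's modCount after writing, so a structural modification by another thread mid-serialization surfaces as ConcurrentModificationException, matching the stack trace above. A standalone sketch (the race is timing-dependent, so it may take a few passes to trigger):

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.ArrayList

object ConcurrentSerializationSketch {
  def main(args: Array[String]): Unit = {
    val list = new ArrayList[Int]()
    for (i <- 1 to 1000) list.add(i)

    // Mutator thread stands in for the application updating its own state.
    // add/remove are structural changes, so each one bumps modCount.
    val mutator = new Thread(() => {
      while (true) { list.add(42); list.remove(0) }
    })
    mutator.setDaemon(true)
    mutator.start()

    // Serializer stands in for the framework serializing application
    // objects while the application runs.
    var cleanPasses = 0
    try {
      while (true) {
        val oos = new ObjectOutputStream(new ByteArrayOutputStream())
        oos.writeObject(list) // races with the mutator thread
        cleanPasses += 1
      }
    } catch {
      case e: java.util.ConcurrentModificationException =>
        println(s"failed after $cleanPasses clean serializations: $e")
    }
  }
}
{code}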
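Until the framework's behavior changes, one possible application-side workaround, sketched below with invented names (AppState, add, snapshot), is to guard mutations with a lock and only ever expose an immutable snapshot, taken under the same lock, for Spark to serialize:

{code:scala}
import java.util.{ArrayList, Collections, List => JList}

// Hypothetical holder for the application's mutable state; all names here
// are invented for illustration.
class AppState extends Serializable {
  private val live = new ArrayList[String]()

  def add(item: String): Unit = live.synchronized {
    live.add(item)
  }

  // Copy under the same lock, then expose only an immutable view of the
  // copy. The snapshot can be serialized at any time because nothing ever
  // mutates it.
  def snapshot(): JList[String] = live.synchronized {
    Collections.unmodifiableList(new ArrayList[String](live))
  }
}
{code}

Copying costs time and memory proportional to the list size, but it closes the window in which the framework can observe a half-modified collection.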