[ https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192395#comment-16192395 ]

Michael N commented on SPARK-22163:
-----------------------------------

Committers: The actual issues are with Sean Owen himself, where

1. Sean Owen does not completely understand the issues in the tickets.
2. He does not know the answers to the questions posted in the tickets to help
with understanding the issues.
3. Instead of either finding the correct answers to those questions or asking
people who know them, he blindly closes the tickets.

For instance, the issue in ticket
https://issues.apache.org/jira/browse/SPARK-22163 occurs on both the slaves and
the driver. Further, while Spark uses multiple threads to run stream processing
asynchronously, there is still synchronization from batch to batch. For
instance, say the batch interval is 5 seconds, but batch #3 takes 1 minute to
complete; Spark does not start batch #4 until batch #3 is completely done. So
in that respect, batch processing from one interval to the next is synchronous.
Yet Sean Owen assumes everything in Spark Streaming is asynchronous.
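
A minimal sketch of that sequencing (an illustration only, assuming the Spark
2.x DStream API, a local master, and a hypothetical socket source on port
9999):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchSequencingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("batch-sequencing-sketch")
      .setMaster("local[2]") // assumption: local run, for illustration only

    // A 5-second batch interval: a new batch is *scheduled* every 5 seconds,
    // but with the default spark.streaming.concurrentJobs = 1 its jobs do not
    // start until the previous batch's jobs complete. If batch #3 takes a
    // minute, batch #4 queues behind it.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}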

Another instance: Sean Owen does not understand the difference between a design
flaw and a coding bug. Code may be perfect and match the design; however, if
the design is flawed, then it is the design that needs to be changed. An
example is the earlier Spark API around 1.x, where Spark's map interface passed
in only one object at a time. That is a design flaw because it causes massive
per-record overhead for billions of objects. The newer iterator-based
interface, mapPartitions, passes in all of a partition's objects via an
Iterator.
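
As a sketch of the two interfaces (again an illustration, not code from the
ticket): map invokes its function once per record, while mapPartitions invokes
it once per partition and hands over the records through an Iterator, so any
fixed setup cost is paid per partition instead of per record.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("map-vs-mapPartitions").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000000)

    // Per-element interface: the function runs once per record, so a fixed
    // per-call cost is paid for every one of billions of records.
    val perRecord = rdd.map(x => x * 2)

    // Iterator-based interface: the function runs once per partition; any
    // expensive setup done inside it is amortized over that partition's
    // records.
    val perPartition = rdd.mapPartitions { records =>
      records.map(x => x * 2)
    }

    println(s"${perRecord.count()} ${perPartition.count()}")
    sc.stop()
  }
}
{code}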

Therefore, the broader issue is that Sean Owen acts more as a blocker, closing
tickets when he does not understand them. When such cases come up, he should
either learn about the issues or ask someone else to do so.


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22163
>                 URL: https://issues.apache.org/jira/browse/SPARK-22163
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Structured Streaming
>    Affects Versions: 2.2.0
>         Environment: Spark Streaming
> Kafka
> Linux
>            Reporter: Michael N
>            Priority: Critical
>
> The application's objects can contain Lists, which the application may also 
> modify dynamically as it runs.  However, the Spark Streaming framework 
> asynchronously serializes the application's objects while the application 
> runs.  Therefore, a random run-time exception occurs on a List when the Spark 
> Streaming framework happens to serialize the application's objects while the 
> application is modifying that List in its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the 
> Spark Streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred, to eliminate the 
> issue completely.  Or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until the fix is provided. 
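
To illustrate the failure mode quoted above, here is a standalone sketch (no
Spark involved: it serializes a plain java.util.ArrayList on one thread while
another thread mutates it, which is the same race the ticket attributes to
Spark's asynchronous serialization):

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util

object ConcurrentSerializationRepro {
  def main(args: Array[String]): Unit = {
    val list = new util.ArrayList[Integer]()
    (1 to 1000000).foreach(i => list.add(i))

    // The "application" keeps mutating its List while it runs.
    val mutator = new Thread(() => {
      while (!Thread.currentThread().isInterrupted) list.add(0)
    })
    mutator.start()

    // The "framework" serializes the same object concurrently.
    // ArrayList.writeObject checks its modCount after writing the elements
    // and throws java.util.ConcurrentModificationException if the list
    // changed mid-write.
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(list)
      out.close()
      println("no conflict this run (the race is timing-dependent)")
    } catch {
      case e: java.util.ConcurrentModificationException =>
        println(s"reproduced: $e")
    } finally {
      mutator.interrupt()
    }
  }
}
{code}

Until serialization timing is configurable, one application-side workaround is
to guard both the mutation and the serialized object with the same lock, or to
hand Spark an immutable snapshot of the List rather than the live one.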



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
