[ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192110#comment-16192110
 ] 

Michael N edited comment on SPARK-22163 at 10/4/17 10:33 PM:
-------------------------------------------------------------

Please distinguish between code bugs and design flaws.  That is why this ticket 
is separate from the other ticket.

Here is an analogy to give more guidance as to why this is a design flaw. 
Spark's older map framework has a major design flaw: it makes a function call 
for every single object. Its code implementation matches its design, but the 
design itself has massive overhead when there are millions or billions of 
objects, because it must make the same function call millions or billions of 
times, once for every single object. 
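To make the analogy concrete, here is a minimal Java sketch of the two calling 
conventions. The helper names are hypothetical and this is not Spark's actual 
API; it only illustrates one invocation per element versus one invocation per 
chunk via an Iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class CallOverhead {
    // Per-element convention: the framework calls f once per object, so for
    // billions of objects it pays billions of call overheads.
    static <A, B> List<B> mapPerElement(List<A> in, Function<A, B> f) {
        List<B> out = new ArrayList<>(in.size());
        for (A a : in) {
            out.add(f.apply(a));  // one function call per element
        }
        return out;
    }

    // Iterator convention: a single call hands the whole chunk to the user
    // function, so per-call setup is paid once instead of N times.
    static <A, B> List<B> mapPerChunk(List<A> in,
                                      Function<Iterator<A>, Iterator<B>> f) {
        Iterator<B> it = f.apply(in.iterator());  // one function call total
        List<B> out = new ArrayList<>(in.size());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> in = Arrays.asList(1, 2, 3);
        System.out.println(mapPerElement(in, x -> x * 2));   // [2, 4, 6]
        System.out.println(mapPerChunk(in, elems -> new Iterator<Integer>() {
            public boolean hasNext() { return elems.hasNext(); }
            public Integer next() { return elems.next() * 2; }
        }));                                                 // [2, 4, 6]
    }
}
```

Both produce the same result; the difference is only in how many times the 
user function is entered.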

The questions posted previously and re-posted below are intended to provide 
insight into why this issue is a design flaw in Spark's framework, which tries 
to serialize the application objects of a Streaming application that runs 
continuously.  Please make sure you understand the difference between code 
bugs and design flaws first, then answer and resolve the questions below 
before responding further, instead of arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch ***synchronously***?
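To make the failure mode concrete outside of Spark: the sketch below is a 
deterministic simulation, not Spark code. The self-mutating element is purely 
a device to force the timing; it makes java.util.ArrayList's own writeObject 
observe a modification made while the list is being serialized, which is 
exactly the reported ConcurrentModificationException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class SerializationRace {
    // Element whose own writeObject mutates the enclosing list. This
    // deterministically simulates "the application modifies the List while
    // the framework serializes it"; in real Spark the overlap comes from
    // timing between threads, not from a self-mutating element.
    static class Elem implements Serializable {
        static List<Elem> owner;  // the list currently being serialized
        private void writeObject(ObjectOutputStream out) throws IOException {
            owner.add(null);                 // structural modification...
            owner.remove(owner.size() - 1);  // ...changes ArrayList.modCount
            out.defaultWriteObject();
        }
    }

    static String serializeWhileMutating() {
        List<Elem> list = new ArrayList<>();
        list.add(new Elem());
        Elem.owner = list;
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            // ArrayList.writeObject checks modCount after writing elements
            // and throws if the list changed mid-serialization.
            out.writeObject(list);
            return "serialized cleanly";
        } catch (ConcurrentModificationException e) {
            return "ConcurrentModificationException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeWhileMutating());
        // prints: ConcurrentModificationException
    }
}
```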



was (Author: michaeln_apache):
Please distinguish between code bugs and design flaws.  That is why this ticket 
is separate from the other ticket.

The analogy is the design flaw in Spark's older map framework, where it makes 
a function call for every single object. Its code implementation is fine, but 
its design has massive overhead when there are millions or billions of 
objects.  On the other hand, the newer flatMap framework makes one function 
call for a list of objects via the Iterator. 

Here are the questions intended to provide insight into why this issue is a 
design flaw in Spark's framework, which tries to serialize the application 
objects of a Streaming application that runs continuously.  Please make sure 
you understand the difference between code bugs and design flaws first, then 
answer and resolve the questions below before responding further, instead of 
arbitrarily closing this ticket.

1. In the first place, why does Spark serialize the application objects 
***asynchronously*** while the streaming application is running continuously 
from batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch ***synchronously***?


> Design Issue of Spark Streaming that Causes Random Run-time Exception
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22163
>                 URL: https://issues.apache.org/jira/browse/SPARK-22163
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Structured Streaming
>    Affects Versions: 2.2.0
>         Environment: Spark Streaming
> Kafka
> Linux
>            Reporter: Michael N
>            Priority: Critical
>
> The application objects can contain Lists and can be modified dynamically as 
> well.   However, the Spark Streaming framework asynchronously serializes the 
> application's objects while the application runs.  Therefore, it causes 
> random run-time exceptions on a List when the Spark Streaming framework 
> happens to serialize the application's objects while the application is 
> modifying a List in its own object.  
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the 
> Spark Streaming framework is that it does this serialization asynchronously.  
> Instead, it should either
> 1. do this serialization synchronously. This is preferred because it 
> eliminates the issue completely.  Or
> 2. allow each application to configure whether this serialization is done 
> synchronously or asynchronously, depending on the nature of the application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so that applications 
> can work around them until a fix is provided. 
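Until such documentation or a fix exists, one application-side workaround 
sketch (an assumption about what the application can change, not official 
Spark guidance) is to keep the mutable state in a collection whose serialized 
form is always a consistent snapshot, such as CopyOnWriteArrayList:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SnapshotListDemo {
    // CopyOnWriteArrayList copies its backing array on every write, so a
    // reader (including Java serialization) always sees a consistent
    // snapshot and never throws ConcurrentModificationException.
    static byte[] serialize(List<Integer> list) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(list);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        List<Integer> list = new CopyOnWriteArrayList<>();
        list.add(1);

        // Mutate concurrently while serializing repeatedly; with ArrayList
        // this pattern can throw ConcurrentModificationException, with
        // CopyOnWriteArrayList it cannot.
        Thread mutator = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                list.add(i);
                list.remove(list.size() - 1);
            }
        });
        mutator.start();
        for (int i = 0; i < 100; i++) {
            serialize(list);
        }
        mutator.join();
        System.out.println("serialized without exception");
    }
}
```

The trade-off is that every write copies the backing array, so this only 
suits lists that are read and serialized far more often than they are 
modified.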



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
