[ 
https://issues.apache.org/jira/browse/SPARK-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166832#comment-14166832
 ] 

Saisai Shao commented on SPARK-3876:
------------------------------------

Hi [~afilip], is there a specific reason you need to run an RDD's map and reduce 
operations inside a DStream's map function? I don't think this code can work 
correctly on the remote side; in each batch duration it translates into an RDD 
transformation like:

rdd.map { r => rdd1.map(c => op(c, r)).reduce(...) }.foreach(...)

since an RDD's transformations must be planned into stages on the driver side and 
executed on the executor side, referencing an RDD inside a remotely-executed 
closure can fail.

If you want to use this RDD as a lookup table, you can build a local hashmap 
and broadcast it to the remote side for lookups. So this is probably not a bug.
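A minimal plain-Java sketch of that workaround (the class name `BroadcastSketch` and the helper `sumLengths` are illustrative, not from the report; in the actual streaming job the small list would be wrapped with `sc.broadcast(objList)` on the driver and read via `.value()` inside the DStream map, so the closure carries no RDD reference):

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class BroadcastSketch {
    static class LengthGetter implements Serializable {
        public int getStrLength(String s) { return s.length(); }
    }

    // Driver-side replacement for the nested objRdd.map(...).reduce(...):
    // the small list lives locally (or in a Broadcast<List<LengthGetter>>),
    // so no RDD is touched inside the streaming map closure.
    static int sumLengths(List<LengthGetter> getters, String input) {
        int total = 0;
        for (LengthGetter lg : getters) {
            total += lg.getStrLength(input);
        }
        return total;
    }

    public static void main(String[] args) {
        List<LengthGetter> objList = Arrays.asList(
                new LengthGetter(), new LengthGetter(), new LengthGetter());
        // In the Spark job (sketch): Broadcast<List<LengthGetter>> bcast =
        //     sc.broadcast(objList);
        // then inside sentences.map(...): sumLengths(bcast.value(), input)
        System.out.println(sumLengths(objList, "hello")); // prints 15
    }
}
```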

> Doing a RDD map/reduce within a DStream map fails with a high enough input 
> rate
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-3876
>                 URL: https://issues.apache.org/jira/browse/SPARK-3876
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.2
>            Reporter: Andrei Filip
>
> Having a custom receiver that generates random strings at custom rates: 
> JavaRandomSentenceReceiver
> A class that does work on a received string:
> class LengthGetter implements Serializable {
>     public int getStrLength(String s) {
>         return s.length();
>     }
> }
> The following code:
> List<LengthGetter> objList = Arrays.asList(new LengthGetter(),
>         new LengthGetter(), new LengthGetter());
>
> final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList);
>
> JavaInputDStream<String> sentences =
>         jssc.receiverStream(new JavaRandomSentenceReceiver(frequency));
>
> sentences.map(new Function<String, Integer>() {
>     @Override
>     public Integer call(final String input) throws Exception {
>         Integer res = objRdd.map(new Function<LengthGetter, Integer>() {
>             @Override
>             public Integer call(LengthGetter lg) throws Exception {
>                 return lg.getStrLength(input);
>             }
>         }).reduce(new Function2<Integer, Integer, Integer>() {
>             @Override
>             public Integer call(Integer left, Integer right) throws Exception {
>                 return left + right;
>             }
>         });
>         return res;
>     }
> }).print();
> fails for high enough frequencies with the following stack trace:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception 
> failure in TID 3 on host localhost: java.lang.NullPointerException
>         org.apache.spark.rdd.RDD.map(RDD.scala:270)
>         org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72)
>         org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29)
> Other information that might be useful: my current batch duration is set to 
> 1 sec, and JavaRandomSentenceReceiver frequencies as low as 2 Hz make the 
> application fail (1 Hz, for example, works).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
