[
https://issues.apache.org/jira/browse/SPARK-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169019#comment-14169019
]
Andrei Filip commented on SPARK-3876:
-
In a nutshell, the use case aims to parallelize many operations performed on
the same input, and aggregate the outputs. This suggestion was given to me in
this discussion:
http://chat.stackoverflow.com/rooms/61251/discussion-between-smola-and-andrei
(towards the end)
To be honest, the more fundamental question is whether spark streaming is
actually appropriate for this sort of use case.
> Doing a RDD map/reduce within a DStream map fails with a high enough input
> rate
> ---
>
> Key: SPARK-3876
> URL: https://issues.apache.org/jira/browse/SPARK-3876
> Project: Spark
> Issue Type: Bug
> Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Andrei Filip
>
> Having a custom receiver than generates random strings at custom rates:
> JavaRandomSentenceReceiver
> A class that does work on a received string:
> class LengthGetter implements Serializable{
> public int getStrLength(String s){
> return s.length();
> }
> }
> The following code:
> List objList = Arrays.asList(new LengthGetter(), new
> LengthGetter(), new LengthGetter());
>
> final JavaRDD objRdd = sc.parallelize(objList);
>
>
> JavaInputDStream sentences = jssc.receiverStream(new
> JavaRandomSentenceReceiver(frequency));
>
> sentences.map(new Function() {
> @Override
> public Integer call(final String input) throws
> Exception {
> Integer res = objRdd.map(new
> Function() {
> @Override
> public Integer call(LengthGetter lg)
> throws Exception {
> return lg.getStrLength(input);
> }
> }).reduce(new Function2 Integer>() {
>
> @Override
> public Integer call(Integer left,
> Integer right) throws Exception {
> return left + right;
> }
> });
>
>
> return res;
> }
> }).print();
> fails for high enough frequencies with the following stack trace:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception
> failure in TID 3 on host localhost: java.lang.NullPointerException
> org.apache.spark.rdd.RDD.map(RDD.scala:270)
> org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72)
> org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29)
> Other information that might be useful is that my current batch duration is
> set to 1sec and the frequencies for JavaRandomSentenceReceiver at which the
> application fails are as low as 2Hz (1Hz for example works)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org