[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043379#comment-16043379
 ] 

meng xi commented on SPARK-21025:
---------------------------------

no, I just comment out one line, the system incorrectly format my code in this 
way...

Okey, let me explain a little bit about our code logic: we would like to do a 
"carry forward" data cleansing, which uses the previous data point to fill up 
missing field in current data. After scan the whole RDD, we reconstruct the 
RDD. this snippet is just clone the original one, but if you run it, the result 
RDD is empty

> missing data in jsc.union
> -------------------------
>
>                 Key: SPARK-21025
>                 URL: https://issues.apache.org/jira/browse/SPARK-21025
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0, 2.1.1
>         Environment: Ubuntu 16.04
>            Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
>         SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
>         SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
>         JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
>         JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
>         Iterator<String[]> it = src.toLocalIterator();
>         List<JavaRDD<String[]>> rddList = new LinkedList<>();
>         List<String[]> resultBuffer = new LinkedList<>();
>         while (it.hasNext()) {
>             resultBuffer.add(it.next());
>             if (resultBuffer.size() == 1000) {
>                 JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
> //                rdd.count();
>                 rddList.add(rdd);
>                 resultBuffer.clear();
>             }
>         }
>         JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
>         System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to