[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21025.
-------------------------------
    Resolution: Invalid

Oh, and I realized the mistake here. You put the body of the loop on one line 
and attempted to comment out just one statement, but you comment out all of the 
other statements, including the one that updates resultbuffer.

> missing data in jsc.union
> -------------------------
>
>                 Key: SPARK-21025
>                 URL: https://issues.apache.org/jira/browse/SPARK-21025
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0, 2.1.1
>         Environment: Ubuntu 16.04
>            Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
>         SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
>         SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
>         JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
>         JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
>         Iterator<String[]> it = src.toLocalIterator();
>         List<JavaRDD<String[]>> rddList = new LinkedList<>();
>         List<String[]> resultBuffer = new LinkedList<>();
>         while (it.hasNext()) {
>             resultBuffer.add(it.next());
>             if (resultBuffer.size() == 1000) {
>                 JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
> //                rdd.count();
>                 rddList.add(rdd);
>                 resultBuffer.clear();
>             }
>         }
>         JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
>         System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to