[jira] [Updated] (SPARK-21025) missing data in jsc.union

meng xi (JIRA) Thu, 08 Jun 2017 13:35:40 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


meng xi updated SPARK-21025:
----------------------------
    Description: 
We iterate over an RDD with toLocalIterator for some special data processing, 
and then use union to rebuild a new RDD. The resulting RDD is often empty or 
missing most of its data. Here is a simplified code snippet that reproduces 
the bug:

        SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
        SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
        JavaRDD<String[]> src = jsc.parallelize(
                IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
        Iterator<String[]> it = src.toLocalIterator();
        List<JavaRDD<String[]>> rddList = new LinkedList<>();
        List<String[]> resultBuffer = new LinkedList<>();
        while (it.hasNext()) {
            resultBuffer.add(it.next());
            if (resultBuffer.size() == 1000) {
                JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
                // rdd.count();
                rddList.add(rdd);
                resultBuffer.clear();
            }
        }
        JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
        System.out.println(desc.count());

This code should duplicate the original RDD, but it returns an empty RDD. 
Note that if I uncomment the rdd.count() call, it returns the correct 
result. 
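
One possible explanation, which this report does not confirm, is that 
jsc.parallelize(resultBuffer) keeps a reference to resultBuffer instead of 
copying it, and only reads the list once an action such as count() runs. The 
plain-Java sketch below models that assumption without Spark. LazyDataset and 
totalCount are hypothetical names made up for illustration, not Spark APIs.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class BufferAliasingSketch {

    // Hypothetical stand-in for a lazily evaluated dataset (NOT a Spark class).
    // Assumption: it holds a reference to the caller's list and reads it only
    // on the first count(), after which the contents are fixed.
    static final class LazyDataset {
        private final List<String[]> source;
        private List<String[]> snapshot;

        LazyDataset(List<String[]> source) {
            this.source = source;
        }

        long count() {
            if (snapshot == null) {
                snapshot = new ArrayList<>(source); // materialize on first use
            }
            return snapshot.size();
        }
    }

    // Mirrors the loop in the report: 3000 elements, chunks of 1000.
    // copyBuffer:   hand each chunk its own copy of the buffer.
    // countEagerly: call count() before the buffer is cleared, like the
    //               commented-out rdd.count() in the report.
    static long totalCount(boolean copyBuffer, boolean countEagerly) {
        List<LazyDataset> chunks = new ArrayList<>();
        List<String[]> buffer = new LinkedList<>();
        for (int i = 0; i < 3000; i++) {
            buffer.add(new String[10]);
            if (buffer.size() == 1000) {
                List<String[]> data = copyBuffer ? new ArrayList<>(buffer) : buffer;
                LazyDataset chunk = new LazyDataset(data);
                if (countEagerly) {
                    chunk.count();
                }
                chunks.add(chunk);
                buffer.clear();
            }
        }
        chunks.add(new LazyDataset(buffer)); // trailing, possibly empty, chunk

        long total = 0;
        for (LazyDataset chunk : chunks) {
            total += chunk.count();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalCount(false, false)); // shared buffer, lazy read: 0
        System.out.println(totalCount(false, true));  // shared buffer, eager count: 3000
        System.out.println(totalCount(true, false));  // copied buffer, lazy read: 3000
    }
}
```

If the assumption holds for Spark as well, passing a copy such as 
new ArrayList<>(resultBuffer) to jsc.parallelize would be a workaround worth 
trying in place of the extra rdd.count() call.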


> missing data in jsc.union
> -------------------------
>
>                 Key: SPARK-21025
>                 URL: https://issues.apache.org/jira/browse/SPARK-21025
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0, 2.1.1
>         Environment: Ubuntu 16.04
>            Reporter: meng xi


