[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
meng xi updated SPARK-21025: ---------------------------- Description: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator<String[]> it = src.toLocalIterator(); List<JavaRDD<String[]>> rddList = new LinkedList<>(); List<String[]> resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer); // rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. was: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator<String[]> it = src.toLocalIterator(); List<JavaRDD<String[]>> rddList = new LinkedList<>(); List<String[]> resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer); // rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. > missing data in jsc.union > ------------------------- > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API > Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 > Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator<String[]> it = src.toLocalIterator(); > List<JavaRDD<String[]>> rddList = new LinkedList<>(); > List<String[]> resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer); > // rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org