[ https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997396#comment-16997396 ]
moshe ohaion commented on SPARK-30264:
--------------------------------------

Steps to reproduce:
 # File users8.avro was created by GenericMain.java.
 # Run the following Spark job:

public static void main(String[] args) throws IOException {
    SparkConf sparkConf = new SparkConf()
            .setAppName("Test cache");
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    sparkConf.set("spark.kryo.registrator", SparkKryoRegistrator.class.getName());
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile("<path_to_users8_directory>/*.avro", AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
    JavaRDD<GenericRecord> genericRecordJavaRDD = records.keys().map(x -> ((GenericRecord) x.datum()));
    JavaRDD<GenericRecord> cache = genericRecordJavaRDD.persist(StorageLevel.MEMORY_ONLY_SER());
    long count = cache.map(genericRecord -> genericRecord.get("username")).distinct().count();
    System.out.println(count);
}

 # The count printed will be 5, as it should be.
 # Replace MEMORY_ONLY_SER with MEMORY_ONLY and run the job again.
 # The count printed will be 1.

If you also add cache.saveAsTextFile(), you will see that when running with MEMORY_ONLY you get the same user 5 times.

I tried on 2.4.0, 2.4.4 and 3.0.0 preview.

[^GenericMain.java] . [^users8.avro]

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> ----------------------------------------------------------
>
>                 Key: SPARK-30264
>                 URL: https://issues.apache.org/jira/browse/SPARK-30264
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.4.0
>            Reporter: moshe ohaion
>            Priority: Major
>         Attachments: GenericMain.java, users8.avro
>
>
> The persist method behaves differently with MEMORY_ONLY than with
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same result. The right result is 100; for some
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one.
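
The symptom described above (every cached element being "the same one") is what one would expect if the input reader refills a single mutable record object for each row and the cache keeps references to it. The Spark API docs for the Hadoop file methods warn about this kind of object reuse, but nothing in this thread confirms that it is the cause here. The following is a minimal plain-Java sketch of that assumed mechanism, with no Spark or Avro involved; MutableRecord, reusingReader and both counting methods are hypothetical names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class RecordReuseDemo {

    /** Hypothetical mutable record, standing in for an Avro GenericRecord. */
    static final class MutableRecord {
        String username;

        MutableRecord copy() {
            MutableRecord r = new MutableRecord();
            r.username = username;
            return r;
        }
    }

    /** Hypothetical reader that refills ONE shared instance for every row (assumed behaviour). */
    static Iterator<MutableRecord> reusingReader(List<String> names) {
        final MutableRecord shared = new MutableRecord();
        final Iterator<String> it = names.iterator();
        return new Iterator<MutableRecord>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public MutableRecord next() {
                shared.username = it.next(); // overwrite in place, return the same object
                return shared;
            }
        };
    }

    /** Caches references as-is, roughly like a deserialized (object) cache. */
    static long distinctAfterCachingReferences(List<String> names) {
        List<MutableRecord> cache = new ArrayList<>();
        reusingReader(names).forEachRemaining(cache::add);
        return countDistinctUsernames(cache);
    }

    /** Snapshots each record before caching, roughly like serializing it on write. */
    static long distinctAfterCachingCopies(List<String> names) {
        List<MutableRecord> cache = new ArrayList<>();
        reusingReader(names).forEachRemaining(r -> cache.add(r.copy()));
        return countDistinctUsernames(cache);
    }

    private static long countDistinctUsernames(List<MutableRecord> cache) {
        Set<String> distinct = new HashSet<>();
        for (MutableRecord r : cache) {
            distinct.add(r.username);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("u1", "u2", "u3", "u4", "u5");
        System.out.println(distinctAfterCachingReferences(names)); // prints 1
        System.out.println(distinctAfterCachingCopies(names));     // prints 5
    }
}
```

If this is indeed what happens in the job above, the analogous change would be to copy each datum inside the map() before persist(), for example with Avro's GenericData.get().deepCopy(record.getSchema(), record). Whether that resolves this particular report has not been verified here.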