[ 
https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997396#comment-16997396
 ] 

moshe ohaion commented on SPARK-30264:
--------------------------------------

Steps to reproduce:
 # File users8.avro was created by GenericMain.java.
 # Run the following Spark job:

    public static void main(String[] args) throws IOException {
        SparkConf sparkConf = new SparkConf()
                .setAppName("Test cache");

        sparkConf.set("spark.serializer",
                "org.apache.spark.serializer.KryoSerializer");
        sparkConf.set("spark.kryo.registrator",
                SparkKryoRegistrator.class.getName());

        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaPairRDD<AvroKey, NullWritable> records =
                sc.newAPIHadoopFile("<path_to_users8_directory>/*.avro",
                        AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
                        sc.hadoopConfiguration());
        JavaRDD<GenericRecord> genericRecordJavaRDD = records.keys().map(x ->
                ((GenericRecord) x.datum()));
        JavaRDD<GenericRecord> cache =
                genericRecordJavaRDD.persist(StorageLevel.MEMORY_ONLY_SER());
        long count = cache.map(genericRecord ->
                genericRecord.get("username")).distinct().count();

        System.out.println(count);
    }

 # The count printed is 5, as expected.
 # Replace MEMORY_ONLY_SER with MEMORY_ONLY and run the job again.
 # The count printed is 1.

 

If you also add cache.saveAsTextFile(), you will see that when running with 
MEMORY_ONLY you get the same user 5 times.
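
One possible explanation, which this report does not confirm: the Spark API documentation for newAPIHadoopFile notes that Hadoop's RecordReader re-uses the same object for each record, so caching the returned RDD directly can create many references to one object. The sketch below imitates that effect in plain Java without Spark or Avro. The class ReusedRecordDemo, the MutableRecord type and both method names are hypothetical and exist only for this illustration. Keeping references stands in for a deserialized cache such as MEMORY_ONLY, and copying each record at read time stands in for a cache that serializes records as they arrive, such as MEMORY_ONLY_SER.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReusedRecordDemo {
    // Hypothetical stand-in for a mutable record object that a reader re-uses.
    static final class MutableRecord {
        String username;
    }

    // Stores a reference to the reader's single instance for every record,
    // so every cached entry ends up showing the last value written.
    static int distinctWhenStoringReferences(String[] names) {
        MutableRecord reused = new MutableRecord();
        List<MutableRecord> cached = new ArrayList<>();
        for (String name : names) {
            reused.username = name; // the "reader" overwrites the same instance
            cached.add(reused);     // only the reference is kept
        }
        return countDistinct(cached);
    }

    // Snapshots each record before the reader overwrites it,
    // so every cached entry keeps its own value.
    static int distinctWhenStoringCopies(String[] names) {
        MutableRecord reused = new MutableRecord();
        List<MutableRecord> cached = new ArrayList<>();
        for (String name : names) {
            reused.username = name;
            MutableRecord copy = new MutableRecord();
            copy.username = reused.username; // copy taken at read time
            cached.add(copy);
        }
        return countDistinct(cached);
    }

    // Counts the distinct usernames visible in the cached list.
    static int countDistinct(List<MutableRecord> records) {
        Set<String> seen = new HashSet<>();
        for (MutableRecord record : records) {
            seen.add(record.username);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String[] names = {"user1", "user2", "user3", "user4", "user5"};
        System.out.println(distinctWhenStoringReferences(names)); // prints 1
        System.out.println(distinctWhenStoringCopies(names));     // prints 5
    }
}
```

If this is indeed the cause, copying each record inside the map() that precedes persist() would be the analogous change in the job above. I have not verified that against the attached files.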

 

I tried this on 2.4.0, 2.4.4 and 3.0.0 preview.

[^GenericMain.java], [^users8.avro]

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> ----------------------------------------------------------
>
>                 Key: SPARK-30264
>                 URL: https://issues.apache.org/jira/browse/SPARK-30264
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.4.0
>            Reporter: moshe ohaion
>            Priority: Major
>         Attachments: GenericMain.java, users8.avro
>
>
> The persist method behaves differently with MEMORY_ONLY than with 
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same result. The right result is 100; for some 
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one. 


