[ https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782093#comment-16782093 ]
Sean Owen commented on SPARK-26906:
-----------------------------------

I can't reproduce this on 2.4.0. It shows "2x replicated". What are you running, and is it local or on a cluster?

> Pyspark RDD Replication Potentially Not Working
> -----------------------------------------------
>
>                 Key: SPARK-26906
>                 URL: https://issues.apache.org/jira/browse/SPARK-26906
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Web UI
>    Affects Versions: 2.3.2
>        Environment: I am using Google Cloud's Dataproc version [1.3.19-deb9 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018] (Spark 2.3.2 and Hadoop 2.9.0) on Debian 9, with Python 3.7. The PySpark shell is started with pyspark --num-executors 100
>            Reporter: Han Altae-Tran
>            Priority: Minor
>        Attachments: spark_ui.png
>
> PySpark RDD replication doesn't seem to be functioning properly. Even in a simple example, the UI reports only 1x replication, despite the storage level requesting 2x replication:
>
> {code:python}
> import pyspark
>
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
> mapped.count()
> {code}
>
> Interestingly, if you catch the UI page at just the right time, you can see that it starts off 2x replicated but ends up 1x replicated afterward. Perhaps the RDD is in fact replicated, and it is just the UI that fails to register this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org