[ https://issues.apache.org/jira/browse/SPARK-25870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668313#comment-16668313 ]
Marco Gaido commented on SPARK-25870:
-------------------------------------

Any transformation, simple or complex, creates a new DataFrame, so there is no guarantee that setting the same seed returns the same rows. This looks to me more like a problem in the logic of your use case than in Spark. To achieve what you are trying to do, you can, for instance, add an ID to the rows, sample the IDs, and then filter the rows by the sampled IDs.

> RandomSplit with seed gives different results depending on column order
> -----------------------------------------------------------------------
>
>                 Key: SPARK-25870
>                 URL: https://issues.apache.org/jira/browse/SPARK-25870
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Daniel
>            Priority: Minor
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Co-discovered by Zhihui Hong (zhon...@syr.edu):
> {{If you run the following example, the resulting dataframes will have different rows even though they have the same seed:}}
> {{from pyspark.sql import SparkSession, functions as fn}}
> {{spark = SparkSession.builder.getOrCreate()}}
> {{df = spark.range(0, 10).withColumn('r', (fn.rand()*10).cast('int'))}}
> {{# sample 1}}
> {{df.randomSplit([0.8, 0.2], seed=0)[0].show(5)}}
> {{# sample 2}}
> {{df.select('r', 'id').randomSplit([0.8, 0.2], seed=0)[0].show(5)}}
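
The suggestion in the comment (add an ID, sample the IDs, then filter by them) might be sketched roughly as follows. This is a plain-Python model of the idea, not Spark code: rows are modeled as dicts, and the helper names sample_ids and split_by_id are hypothetical. The point it tries to show is that membership in the split depends only on the ID values and the seed, so reordering the columns does not change which rows are selected.

```python
import random


def sample_ids(ids, fraction, seed):
    """Pick a reproducible subset of IDs that depends only on the ID values and the seed."""
    ordered = sorted(set(ids))  # canonical order, independent of row or column layout
    rng = random.Random(seed)
    k = round(len(ordered) * fraction)
    return set(rng.sample(ordered, k))


def split_by_id(rows, fraction, seed, id_col="id"):
    """Split rows (dicts) into two lists by whether their ID is in the sampled set."""
    keep = sample_ids((row[id_col] for row in rows), fraction, seed)
    first = [row for row in rows if row[id_col] in keep]
    second = [row for row in rows if row[id_col] not in keep]
    return first, second


# The same data in two column orders, mirroring df and df.select('r', 'id') in the report.
# The 'r' values here are made up for illustration.
rows_a = [{"id": i, "r": (i * 7) % 10} for i in range(10)]
rows_b = [{"r": row["r"], "id": row["id"]} for row in rows_a]

train_a, test_a = split_by_id(rows_a, 0.8, seed=0)
train_b, test_b = split_by_id(rows_b, 0.8, seed=0)

ids_a = sorted(row["id"] for row in train_a)
ids_b = sorted(row["id"] for row in train_b)
print(ids_a == ids_b)
```

Unlike randomSplit, whose weights are only approximate, this sketch takes an exact fraction of the IDs. In Spark the analogous step would presumably be sampling the ID column once and then joining or filtering the DataFrame against the sampled IDs, whatever the column order.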