[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

Paul Shearer (JIRA) Tue, 29 Mar 2016 06:53:15 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Shearer updated SPARK-14241:
---------------------------------
    Description: 
If you use monotonically_increasing_id() to append a column of IDs to a 
DataFrame, the IDs do not have a stable, deterministic relationship to the rows 
they are appended to. A given ID value can land on different rows depending on 
what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one
would normally like to do with an ID column are in fact only possible under
very narrow circumstances. The function should either be made deterministic,
or there should be a prominent warning note in the API docs regarding its
behavior.

  was:
If you use monotonically_increasing_id() to append a column of IDs to a 
DataFrame, the IDs do not have a stable, deterministic relationship to the rows 
they are appended to. A given ID value can land on different rows depending on 
what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one
would like to do with an ID column are only possible under very narrow
circumstances. The function should either be made deterministic, or there
should be a prominent warning note in the API docs regarding its behavior.


> Output of monotonically_increasing_id lacks stable relation with rows of 
> DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14241
>                 URL: https://issues.apache.org/jira/browse/SPARK-14241
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.6.0, 1.6.1
>            Reporter: Paul Shearer
>
> If you use monotonically_increasing_id() to append a column of IDs to a 
> DataFrame, the IDs do not have a stable, deterministic relationship to the 
> rows they are appended to. A given ID value can land on different rows 
> depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
> From a user perspective this behavior is very unexpected, and many things one 
> would normally like to do with an ID column are in fact only possible under 
> very narrow circumstances. The function should either be made deterministic, 
> or there should be a prominent warning note in the API docs regarding its 
> behavior.
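
A rough illustration of why the IDs can move between rows, written as a
pure-Python sketch rather than actual Spark code. It assumes the layout given
in the Spark API docs for this function (partition index in the upper 31 bits,
record number within the partition in the lower 33 bits). The helper name
simulated_ids and the sample rows are hypothetical, and the sketch does not
reproduce Spark's scheduling; it only shows that the ID depends on where a row
sits, not on the row itself.

```python
# Pure-Python sketch; no Spark required. Names here are hypothetical.

def simulated_ids(partitions):
    """Map each row to (partition index << 33) + position within partition.

    This mirrors the documented bit layout, under the assumption that each
    row value appears only once across all partitions.
    """
    ids = {}
    for part_index, rows in enumerate(partitions):
        for position, row in enumerate(rows):
            ids[row] = (part_index << 33) + position
    return ids


rows = ["a", "b", "c", "d"]

# Same rows, evaluated once as two partitions and once as a single partition.
two_parts = simulated_ids([rows[:2], rows[2:]])
one_part = simulated_ids([rows])

print(two_parts["c"])  # 8589934592, i.e. 1 << 33
print(one_part["c"])   # 2
```

If the DataFrame is re-evaluated after its partitioning or row order has
changed, the same row can therefore receive a different ID, which matches the
behavior described in this issue.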



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
