[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996936#comment-15996936 ]

Hyukjin Kwon commented on SPARK-12467:
--------------------------------------

Yes, I agree with the advantage you mention and with the rest of your comment. Let's resolve this.

Please reopen this if anyone disagrees and has a good idea for fixing it, or believes it is worth breaking backward compatibility. I am resolving this.

> Get rid of sorting in Row's constructor in pyspark
> --------------------------------------------------
>
>                 Key: SPARK-12467
>                 URL: https://issues.apache.org/jira/browse/SPARK-12467
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.2, 2.2.0
>            Reporter: Irakli Machabeli
>            Priority: Minor
>
> The current implementation of Row's __new__ sorts columns by name.
> First of all, there is no obvious reason to sort. Second, if one converts a
> dataframe to an rdd and then back to a dataframe, the order of the columns changes. While
> this is not a bug, it nevertheless makes looking at the data really
> inconvenient.
>     def __new__(self, *args, **kwargs):
>         if args and kwargs:
>             raise ValueError("Can not use both args "
>                              "and kwargs to create Row")
>         if args:
>             # create row class or objects
>             return tuple.__new__(self, args)
>         elif kwargs:
>             # create row objects
>             names = sorted(kwargs.keys()) # just get rid of sorting here!!!
>             row = tuple.__new__(self, [kwargs[n] for n in names])
>             row.__fields__ = names
>             return row
>         else:
>             raise ValueError("No args or kwargs")

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
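The sorting described in the issue can be reproduced without a Spark installation. The sketch below is only an illustration: SortedRow is a hypothetical stand-in that copies the __new__ quoted in the issue description, not the real pyspark.sql.Row class.

```python
# Stand-in for pyspark.sql.Row, reduced to the __new__ quoted in the issue.
# SortedRow is a hypothetical name used only for this illustration.
class SortedRow(tuple):
    def __new__(cls, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args and kwargs to create Row")
        if args:
            # positional form: values are kept in the order given
            return tuple.__new__(cls, args)
        elif kwargs:
            # keyword form: field names are sorted before the values are stored
            names = sorted(kwargs.keys())
            row = tuple.__new__(cls, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
        else:
            raise ValueError("No args or kwargs")


row = SortedRow(number=1, letters="real1")
print(row.__fields__)  # field names in alphabetical order, not call order
print(tuple(row))      # values stored in that same sorted order
```

The keyword form therefore stores "letters" before "number" even though the caller wrote "number" first, which is the reordering the reporter describes.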
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996901#comment-15996901 ]

Maciej Szymkiewicz commented on SPARK-12467:
--------------------------------------------

[~hyukjin.kwon] Personally, I like {{namedtuple}} because it is friendly to static type checkers. This is a huge advantage over {{Row}}. But it is just a preference.

Regarding this JIRA, my opinion is the same as for the other one: it is simply a "Won't Fix". Considering that we are still committed to supporting Python 2.7, dropping support for Python <= 3.5 is at least a decade away. Any other attempt to "fix" this will break backward compatibility, and I've seen user code that depends on the sorting behavior. Finally, as you said, it is documented.
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996761#comment-15996761 ]

Hyukjin Kwon commented on SPARK-12467:
--------------------------------------

I added 2.2.0 to Affects Versions, since I tested this behavior while working on other JIRAs.
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996758#comment-15996758 ]

Hyukjin Kwon commented on SPARK-12467:
--------------------------------------

I actually quite like the {{**kwargs}} usage, and I think it is arguably more straightforward and easier than the namedtuple-like way. The sorting is documented, so as long as users are aware of it, it is probably not worth deprecating or removing yet. In any case, we will be able to support this easily in the far future, once we drop Python versions before 3.6.
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996702#comment-15996702 ]

Maciej Szymkiewicz commented on SPARK-12467:
--------------------------------------------

??Row has named fields, so it shouldn't depend upon the ordering in order to make a match.??

Unfortunately, this is not so simple. There is no requirement for schema names to match the input. Moreover, this:

{code}
schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema # C-works
{code}

doesn't work. It just fails silently by casting data to incorrect types.

Finally:

??If you can't write data into it's own implied schema??

is a good point, but that is not its own schema. Its "own implied schema" is:

{code}
spark.table('test_trash.thingy').schema
{code}

Maybe the best solution here is to deprecate and remove the {{**kwargs}} variant? It is not really necessary and, given the language limitations, it is more confusing than useful. Or at least remove it from the examples and encourage users to use the "long form":

{code}
Row("number", "letters", "some_date")(1, "real1", datetime(2017,12,1,3,15))
{code}

or {{namedtuple}}.
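As a minimal sketch of the {{namedtuple}} alternative mentioned above, using only the standard library: the field order comes from the type declaration rather than from the keywords at the call site. Thingy is a hypothetical type name chosen for this illustration, and nothing here touches Spark.

```python
from collections import namedtuple
from datetime import datetime

# Thingy is a hypothetical record type; its field order is fixed by the
# declaration below, not by the order in which keywords are passed.
Thingy = namedtuple("Thingy", ["number", "letters", "some_date"])

# Positional construction, analogous to the "long form" of Row.
a = Thingy(1, "real1", datetime(2017, 12, 1, 3, 15))

# Keyword construction in a different order yields the same layout.
b = Thingy(some_date=datetime(2017, 12, 1, 3, 15), letters="real1", number=1)

print(a._fields)  # declared order
print(a == b)     # both records hold their values in the same positions
```

Because the layout is declared once, it does not depend on the Python version or on any sorting step.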
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994849#comment-15994849 ]

John Berryman commented on SPARK-12467:
---------------------------------------

Here's a slightly different example that I think points out another problem:

{code}
from datetime import datetime
from pyspark.sql import Row

rows = [
    Row(number=1, letters='real1', some_date=datetime(2017,12,1,3,15)),
    Row(number=2, letters='real2', some_date=datetime(2017,12,2,3,15)),
    Row(number=3, letters='real3', some_date=datetime(2017,12,3,3,15)),
]
rows_rdd = spark.sparkContext.parallelize(rows)
df = spark.createDataFrame(rows_rdd)
spark.sql('CREATE DATABASE test_trash')
df.write.mode(saveMode='overwrite').saveAsTable('test_trash.thingy')

schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema

df = spark.createDataFrame(rows_rdd, schema)
df.count()
{code}

- In the first part of the code I define a bunch of Rows with the implicit schema {{'number'=int, 'letters'=string, 'some_date'=date}}.
- In the second part of the code I query a table made from that data set, and I query the fields in the same order: {{number, letters, some_date}}, so the schema should be exactly the same. (Though I still think order shouldn't matter, since Rows have named fields.)
- In the third part of the code I attempt to create a dataframe using the original data and the schema that was created _from_ the original data. But I get an error saying that the original data doesn't fit _in its own implied schema_.

If you can't write data into its own implied schema, then this is a bug.
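One way to sidestep the ordering question in user code is to line values up by name before handing them over. The following is a rough sketch under assumptions, not something proposed in this thread: align_to_schema is a hypothetical helper, and the list of names is written out by hand here (with a real schema it would presumably come from the schema object itself).

```python
from datetime import datetime


def align_to_schema(record, schema_names):
    """Return the record's values as a tuple ordered like schema_names.

    record is a plain dict; schema_names is the desired column order.
    Both the helper and its name are assumptions made for illustration.
    """
    missing = [n for n in schema_names if n not in record]
    if missing:
        raise KeyError("record has no value for: " + ", ".join(missing))
    return tuple(record[n] for n in schema_names)


record = dict(number=1, letters="real1", some_date=datetime(2017, 12, 1, 3, 15))
schema_names = ["number", "letters", "some_date"]

aligned = align_to_schema(record, schema_names)
print(aligned)
```

Plain tuples ordered this way carry no field names of their own, so there is nothing left to sort.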
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994833#comment-15994833 ]

John Berryman commented on SPARK-12467:
---------------------------------------

I believe there is still something here that needs to be fixed. Consider the following code:

{code}
from pyspark.sql import Row
from datetime import datetime

rows = [
    dict(number=1, letters='real1', some_date=datetime(2017,12,1,3,15)),
    dict(number=2, letters='real2', some_date=datetime(2017,12,2,3,15)),
    dict(number=3, letters='real3', some_date=datetime(2017,12,3,3,15)),
]
rows_rdd = spark.sparkContext.parallelize(rows).map(lambda r: Row(**r))
df = spark.createDataFrame(rows_rdd)
spark.sql('CREATE DATABASE test_trash')
df.write.mode(saveMode='overwrite').saveAsTable('test_trash.thingy')

schema = spark.sql('SELECT letters, number, some_date FROM test_trash.thingy').schema # A-works
# schema = spark.sql('SELECT some_date, number, letters FROM test_trash.thingy').schema # B-fails
schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema # C-works

rows_rdd = spark.sparkContext.parallelize(rows).map(lambda r: Row(**r))
df = spark.createDataFrame(rows_rdd, schema)
df.count()
{code}

If I uncomment line #A it works, line #B fails, and line #C works. The only difference is the ordering of the named fields. The behavior is inconsistent.

Also, ``Row`` objects have named fields, so why should there be any dependence upon ordering at all?

Also, the errors don't really convey the problem: ``AttributeError: 'int' object has no attribute 'tzinfo'``. The error should be about some explicit schema mismatch (though I contend that this isn't really a mismatch at all; the above lines should all work).
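The three orderings can be compared in plain Python. This is only a sketch under a stated assumption: that values from a keyword-built Row are paired with schema fields purely by position, which is what the replies in this thread suggest. It does not call Spark, and pair_by_position and mismatches are hypothetical helpers.

```python
from datetime import datetime

record = dict(number=1, letters="real1", some_date=datetime(2017, 12, 1, 3, 15))

# Values as a keyword-built Row would hold them: sorted by field name.
sorted_names = sorted(record)
values = [record[n] for n in sorted_names]


def pair_by_position(schema_names):
    """Pair each schema field with the value found at the same position."""
    return list(zip(schema_names, values))


def mismatches(pairs):
    """List the schema fields that received some other field's value."""
    return [name for name, value in pairs if record[name] != value]


order_a = pair_by_position(["letters", "number", "some_date"])
order_b = pair_by_position(["some_date", "number", "letters"])
order_c = pair_by_position(["number", "letters", "some_date"])

for label, pairs in (("A", order_a), ("B", order_b), ("C", order_c)):
    print(label, mismatches(pairs))
```

Under that assumption only ordering A lines every value up with its own field. Ordering C still swaps the values of number and letters even when no error is raised.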
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994759#comment-15994759 ]

Hyukjin Kwon commented on SPARK-12467:
--------------------------------------

[~imachabeli] I will resolve this if there are no objections to the comment above.
[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
    [ https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994748#comment-15994748 ]

Maciej Szymkiewicz commented on SPARK-12467:
--------------------------------------------

Python versions before 3.6 do not preserve the order of keyword arguments (see PEP 468), so without sorting, the field order of a {{Row}} built from keywords would be nondeterministic.
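The point about keyword order can be checked with a few lines of standard Python. In this small sketch, keyword_order is a hypothetical helper that reports the order in which a function receives its keyword arguments, next to the sorted order that gives the same answer on every interpreter version.

```python
import sys


def keyword_order(**kwargs):
    """Return keyword names in the order the function received them."""
    return list(kwargs)


received = keyword_order(number=1, letters="real1", some_date=None)

print(sys.version_info[:2])
print(received)          # call order on Python 3.6+ (PEP 468)
print(sorted(received))  # identical on every Python version
```

On Python 3.6 and later the first list follows the call site. On earlier versions it may not, which is why sorting was the only way to get a stable field order there.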