[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977749#comment-16977749 ]

Bryan Cutler commented on SPARK-29748:
--------------------------------------

[~zero323] and [~jhereth] this is targeted for Spark 3.0, and I agree: the 
behavior of Row should be very well defined to avoid any further confusion.

bq. Introducing {{LegacyRow}} seems to make little sense if implementation of 
{{Row}} stays the same otherwise. Sorting or not, depending on the config, 
should be enough.

LegacyRow isn't meant to be public, and the user will not be aware of it. The 
reasons for it are to separate the different implementations and to allow a 
clean removal in the future without affecting the standard Row class. Having a 
separate implementation will also make it easier to debug and diagnose problems 
- I don't want to get into a situation where a Row might or might not sort 
fields, and then receive bug reports without knowing which way it was 
configured.
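To illustrate the kind of separation I mean, here is a rough, hypothetical sketch (stand-in classes for illustration only, not Spark's actual implementation) - the class itself records which behavior produced the Row:

```python
class Row(tuple):
    """Stand-in for pyspark.sql.Row: keeps kwargs in the order entered
    (guaranteed by Python 3.6+)."""
    def __new__(cls, **kwargs):
        row = tuple.__new__(cls, kwargs.values())
        row.__fields__ = list(kwargs)
        return row

class LegacyRow(Row):
    """Stand-in for the proposed LegacyRow: sorts fields alphabetically,
    matching the old behavior. Being a distinct class, a Row seen in a bug
    report reveals which code path created it."""
    def __new__(cls, **kwargs):
        return Row.__new__(cls, **dict(sorted(kwargs.items())))
```

With this split, {{isinstance(row, LegacyRow)}} answers the "which way was it configured?" question directly, and deleting LegacyRow later leaves the standard Row untouched.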

bq. I don't think we should introduce such behavior now, when 3.5 is 
deprecated. Having yet another way to initialize Row will be confusing at best 

That's reasonable. I'm not crazy about an option for OrderedDict as input, but 
I think users of Python < 3.6 should have a way to create a Row with ordered 
fields other than the 2-step process in the pydoc. We can explore other options 
for this.
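For reference, the 2-step process documented in the Row pydoc looks like the following (sketched with a minimal stand-in class so the snippet runs without a Spark installation; in PySpark itself this is {{from pyspark.sql import Row}}):

```python
class Row(tuple):
    """Minimal stand-in mimicking pyspark.sql.Row's factory-style usage."""
    def __new__(cls, *args):
        return tuple.__new__(cls, args)

    def __call__(self, *values):
        # Step 2: calling the field-name Row yields a Row whose fields keep
        # the declared order, regardless of Python version.
        row = tuple.__new__(type(self), values)
        row.__fields__ = list(self)
        return row

# Step 1: declare the field names, fixing the order explicitly.
Person = Row("name", "age")
# Step 2: instantiate values positionally.
alice = Person("Alice", 11)
```

Because the order is fixed in step 1, this works the same on Python < 3.6, but it is an extra hoop for users who just want to pass kwargs once.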

bq. Make legacy behavior the only option for Python < 3.6.

I don't think we should have 2 very different behaviors that are chosen based 
on your Python version. The user should be aware of what is happening and 
needs to make the decision to use the legacy sorting. Some users will not know 
this, then upgrade their Python version and see Rows breaking. We should allow 
users with Python < 3.6 to make Rows with ordered fields and then be able to 
upgrade their Python version without breaking their Spark app.
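To make the upgrade hazard concrete, here's the difference in plain Python (no Spark needed):

```python
# The same call, Row(name="Alice", age=11), yields different field orders
# under the legacy (sorted) vs. Python 3.6+ (insertion-order) behavior.
kwargs = {"name": "Alice", "age": 11}

legacy_fields = sorted(kwargs)  # legacy behavior: alphabetical
modern_fields = list(kwargs)    # Python 3.6+: order as entered

print(legacy_fields)  # ['age', 'name']
print(modern_fields)  # ['name', 'age']
# A schema written against one ordering silently mismatches the other,
# which is exactly what would break on a Python upgrade.
```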

bq. For Python 3.6 let's introduce legacy sorting mechanism (keeping only 
single Row) class, enabled by default and deprecated.

Yeah, I'm not sure whether we should enable the legacy sorting by default or 
not - what do others think?
 

> Remove sorting of fields in PySpark SQL Row creation
> ----------------------------------------------------
>
>                 Key: SPARK-29748
>                 URL: https://issues.apache.org/jira/browse/SPARK-29748
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Bryan Cutler
>            Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is that kwargs in Python < 3.6 are 
> not guaranteed to be in the same order that they were entered; sorting 
> alphabetically ensured a consistent order.  Matters are further complicated 
> by the flag {{__from_dict__}}, which allows the {{Row}} fields to be 
> referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields be removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so, since Python 
> will ensure the order is the same as entered. Users with Python < 3.6 will 
> have to create Rows with an OrderedDict or by using the Row class as a 
> factory (explained in the pydoc).  If kwargs are used, an error will be 
> raised, or, based on a conf setting, it can fall back to a LegacyRow that 
> will sort the fields as before. This LegacyRow will be immediately 
> deprecated and removed once support for Python < 3.6 is dropped.
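A rough sketch of the OrderedDict route for Python < 3.6 (stand-in class for illustration; the exact constructor Spark would expose is not settled in this thread):

```python
from collections import OrderedDict

class Row(tuple):
    """Stand-in: build a Row from an explicitly ordered mapping."""
    def __new__(cls, mapping):
        row = tuple.__new__(cls, mapping.values())
        row.__fields__ = list(mapping)
        return row

# OrderedDict preserves insertion order on every supported Python version,
# so the field order is explicit even where kwargs order is not guaranteed.
r = Row(OrderedDict([("name", "Alice"), ("age", 11)]))
```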



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
