Thanks all. I created a WIP PR at https://github.com/apache/spark/pull/26496; we can discuss the details further there.

On Thu, Nov 7, 2019 at 7:01 PM Takuya UESHIN <ues...@happy-camper.st> wrote:

> +1
>
> On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp <skn...@berkeley.edu> wrote:
>
>> +1
>>
>> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>> >
>> > +1
>> >
>> > On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >>
>> >> Sounds reasonable to me. We should make the behavior consistent within Spark.
>> >>
>> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cutl...@gmail.com> wrote:
>> >>>
>> >>> Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has caused a lot of confusion for users because it is not obvious (although it is stated in the pydocs) that the fields will be sorted alphabetically. Later, when a schema is applied and the field order does not match, an error occurs. Here is a list of some of the JIRAs I have been tracking that are all related to this issue: SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion of the issue [1].
>> >>>
>> >>> The original reason for sorting fields is that kwargs in Python < 3.6 are not guaranteed to be in the same order in which they were entered [2]. Sorting alphabetically ensures a consistent order. Matters are further complicated by the flag __from_dict__, which allows the Row fields to be referenced by name when the Row is made from kwargs; this flag is not serialized with the Row, which leads to inconsistent behavior. For instance:
>> >>>
>> >>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>> >>> Row(B='2', A='1')
>> >>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
>> >>> Row(B='1', A='2')
>> >>>
>> >>> I think the best way to fix this is to remove the sorting of fields when constructing a Row.
>> >>> For users with Python 3.6+, nothing would change because these versions of Python ensure that the kwargs stay in the order entered. For users with Python < 3.6, using kwargs would check a conf to either raise an error or fall back to a LegacyRow that sorts the fields as before. With Python < 3.6 being deprecated now, this LegacyRow can also be removed at the same time. There are also other ways to create Rows that will not be affected. I have opened a JIRA [3] to capture this, but I am wondering what others think about fixing this for Spark 3.0?
>> >>>
>> >>> [1] https://github.com/apache/spark/pull/20280
>> >>> [2] https://www.python.org/dev/peps/pep-0468/
>> >>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
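
For anyone who wants to see the ordering problem from the quoted proposal without a Spark session, here is a minimal plain-Python sketch. The helpers make_sorted_row, make_ordered_row and apply_schema_by_position are hypothetical names made up for illustration; they are not PySpark's Row implementation and only mimic the two orderings discussed in the thread, assuming Python 3.6+ so that kwargs order is preserved (PEP 468).

```python
# Illustrative stand-ins only; these are NOT PySpark's Row implementation.


def make_sorted_row(**kwargs):
    """Mimic the legacy behavior: field names are sorted alphabetically."""
    names = sorted(kwargs)
    return tuple(names), tuple(kwargs[name] for name in names)


def make_ordered_row(**kwargs):
    """Mimic the proposed behavior: keep fields in the order they were passed.

    Relies on PEP 468 (Python 3.6+), which preserves keyword-argument order.
    """
    return tuple(kwargs), tuple(kwargs.values())


def apply_schema_by_position(row, schema_names):
    """Pair values with schema field names by position, ignoring row names."""
    _, values = row
    return dict(zip(schema_names, values))


schema = ["B", "A"]
legacy = make_sorted_row(B="2", A="1")     # fields reordered to A, B
proposed = make_ordered_row(B="2", A="1")  # fields stay as B, A

print(apply_schema_by_position(legacy, schema))    # {'B': '1', 'A': '2'}
print(apply_schema_by_position(proposed, schema))  # {'B': '2', 'A': '1'}
```

With sorted fields, a schema applied by position silently pairs values with the wrong names, which is the same kind of mismatch as the Row(B='1', A='2') result in the quoted example. With insertion order kept, the values line up with the schema as the caller wrote them.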