[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996936#comment-15996936
 ] 

Hyukjin Kwon commented on SPARK-12467:
--

Yes, I agree with the advantage you describe and with the rest of your comment. Let's 
resolve this.

If anyone disagrees and has a good idea for fixing this, or believes it is worth 
breaking backward compatibility, please reopen this issue. I am resolving it now.

> Get rid of sorting in Row's constructor in pyspark
> --
>
> Key: SPARK-12467
> URL: https://issues.apache.org/jira/browse/SPARK-12467
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 2.2.0
>Reporter: Irakli Machabeli
>Priority: Minor
>
> The current implementation of Row's __new__ sorts columns by name.
> First, there is no obvious reason to sort. Second, if one converts a 
> DataFrame to an RDD and then back to a DataFrame, the order of the columns changes. While 
> this is not a bug, it nevertheless makes looking at the data really 
> inconvenient.
> def __new__(self, *args, **kwargs):
>     if args and kwargs:
>         raise ValueError("Can not use both args "
>                          "and kwargs to create Row")
>     if args:
>         # create row class or objects
>         return tuple.__new__(self, args)
>     elif kwargs:
>         # create row objects
>         names = sorted(kwargs.keys()) # just get rid of sorting here!!!
>         row = tuple.__new__(self, [kwargs[n] for n in names])
>         row.__fields__ = names
>         return row
>     else:
>         raise ValueError("No args or kwargs")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-04 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996901#comment-15996901
 ] 

Maciej Szymkiewicz commented on SPARK-12467:


[~hyukjin.kwon] Personally I like {{namedtuple}} because it is friendly to static type 
checkers. This is a huge advantage over {{Row}}. But it is just a 
preference. 

Regarding this JIRA, my opinion is the same as for the other one: it is simply a 
"won't fix". Considering we are still committed to supporting Python 2.7, 
dropping support for Python <= 3.5 is at least a decade away. Any other attempt to 
"fix" this will break backward compatibility, and I've seen user code that depends 
on the sorting behavior. Finally, as you said, it is documented.




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996761#comment-15996761
 ] 

Hyukjin Kwon commented on SPARK-12467:
--

I added 2.2.0 to the affected versions, as I tested this against it while working on other JIRAs.




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996758#comment-15996758
 ] 

Hyukjin Kwon commented on SPARK-12467:
--

I actually quite like the {{**kwargs}} usage, and I think it is arguably more 
straightforward and easier than the namedtuple-like way. The behavior is documented, so as 
long as users are aware of it, it is probably not worth deprecating or removing yet.

In any case, this will be easy to support in the far future, once we drop Python 
versions before 3.6.





[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-04 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996702#comment-15996702
 ] 

Maciej Szymkiewicz commented on SPARK-12467:


??Row has named fields, so it shouldn't depend upon the ordering in order to 
make a match.??

Unfortunately this is not so simple. There is no requirement for schema names 
to match the input.

Moreover, this:

{code}
schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema # C-works
{code}

doesn't work. It just fails silently by casting the data to incorrect types.

Finally:

??If you can't write data into it's own implied schema??

is a good point, but it is not its own schema. Its "own implied schema" is:

{code}
spark.table('test_trash.thingy').schema
{code}

Maybe the best solution here is to deprecate and remove the {{**kwargs}} variant? 
It is not really necessary and, given the language limitations, it is more 
confusing than useful. Or at least remove it from the examples and encourage users 
to use the "long form":

{code}
Row("number", "letters", "some_date")(1, "real1", datetime(2017, 12, 1, 3, 15))
{code} 

or {{namedtuple}}.
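
As a rough illustration of the {{namedtuple}} option, here is a minimal sketch in plain Python. The type name {{Thingy}} is hypothetical and not part of PySpark; the point is only that the field order is fixed by the declaration, not by the order of keywords at the call site.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; fields keep the order in which they are declared.
Thingy = namedtuple("Thingy", ["number", "letters", "some_date"])

row = Thingy(number=1, letters="real1", some_date=datetime(2017, 12, 1, 3, 15))

# Keyword order at the call site does not matter; the declaration order wins.
same = Thingy(some_date=datetime(2017, 12, 1, 3, 15), letters="real1", number=1)

print(row._fields)  # ('number', 'letters', 'some_date')
print(row == same)  # True
```

Because the order lives in the type definition, it does not depend on how the Python version in use handles keyword arguments.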




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-03 Thread John Berryman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994849#comment-15994849
 ] 

John Berryman commented on SPARK-12467:
---

Here's a slightly different example that I think points out another 
problem:

{code}
from datetime import datetime
from pyspark.sql import Row

# Assumes `spark` is an active SparkSession, as in the pyspark shell.
rows = [
    Row(number=1, letters='real1', some_date=datetime(2017, 12, 1, 3, 15)),
    Row(number=2, letters='real2', some_date=datetime(2017, 12, 2, 3, 15)),
    Row(number=3, letters='real3', some_date=datetime(2017, 12, 3, 3, 15)),
]
rows_rdd = spark.sparkContext.parallelize(rows)
df = spark.createDataFrame(rows_rdd)

spark.sql('CREATE DATABASE test_trash')
df.write.mode(saveMode='overwrite').saveAsTable('test_trash.thingy')
schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema

df = spark.createDataFrame(rows_rdd, schema)
df.count()
{code}

- In the first part of the code I define a bunch of Rows with the 
implicit schema {{'number': int, 'letters': string, 'some_date': date}}.
- In the second part of the code I query a table made from that data set, and I 
query the fields in the same order ({{number, letters, some_date}}), so the 
schema should be exactly the same. (Though I still think order shouldn't matter, 
since Rows have named fields.)
- In the third part of the code I attempt to create a DataFrame using the 
original data and the schema that was created _from_ the original data. But I 
get an error saying that the original data doesn't fit _in its own 
implied schema_.

If you can't write data into its own implied schema, then this is a bug.
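
The mismatch described above can be imitated without Spark. The following is only a sketch under stated assumptions: plain Python types stand in for Spark SQL types, and {{positional_mismatches}} is a hypothetical helper, not Spark's actual verification code. It assumes, as the issue description says, that {{Row(**kwargs)}} stores values in sorted-name order and that values are then compared with the schema by position.

```python
from datetime import datetime

record = {"number": 1, "letters": "real1", "some_date": datetime(2017, 12, 1, 3, 15)}

# Row(**record) keeps its values in sorted-name order: letters, number, some_date.
row_fields = sorted(record)
row_values = [record[name] for name in row_fields]

# Schema in the order the SELECT listed the columns
# (Python types are stand-ins for the Spark SQL types).
schema = [("number", int), ("letters", str), ("some_date", datetime)]


def positional_mismatches(values, schema):
    """Return (schema_name, value) pairs whose value has the wrong type at that position."""
    return [
        (name, value)
        for (name, expected_type), value in zip(schema, values)
        if not isinstance(value, expected_type)
    ]


mismatches = positional_mismatches(row_values, schema)
print(row_fields)  # ['letters', 'number', 'some_date']
print(mismatches)  # [('number', 'real1'), ('letters', 1)]
```

The names are the same on both sides; only the positions differ, which is why the data appears not to fit a schema derived from it.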




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-03 Thread John Berryman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994833#comment-15994833
 ] 

John Berryman commented on SPARK-12467:
---

I believe there is still something here that needs to be fixed.

Consider the following code:

{code}
from pyspark.sql import Row
from datetime import datetime

# Assumes `spark` is an active SparkSession, as in the pyspark shell.
rows = [
    dict(number=1, letters='real1', some_date=datetime(2017, 12, 1, 3, 15)),
    dict(number=2, letters='real2', some_date=datetime(2017, 12, 2, 3, 15)),
    dict(number=3, letters='real3', some_date=datetime(2017, 12, 3, 3, 15)),
]
rows_rdd = spark.sparkContext.parallelize(rows).map(lambda r: Row(**r))
df = spark.createDataFrame(rows_rdd)
spark.sql('CREATE DATABASE test_trash')
df.write.mode(saveMode='overwrite').saveAsTable('test_trash.thingy')

schema = spark.sql('SELECT letters, number, some_date FROM test_trash.thingy').schema # A-works
# schema = spark.sql('SELECT some_date, number, letters FROM test_trash.thingy').schema # B-fails
schema = spark.sql('SELECT number, letters, some_date FROM test_trash.thingy').schema # C-works


rows_rdd = spark.sparkContext.parallelize(rows).map(lambda r: Row(**r))
df = spark.createDataFrame(rows_rdd, schema)
df.count()
{code}

If I use line #A it works, line #B fails, and line #C works. The only 
difference is the ordering of the named fields, so the behavior is inconsistent. 
Also, {{Row}} objects have named fields, so why should there be any dependence 
upon ordering at all? Finally, the error doesn't really convey the problem: 
{{AttributeError: 'int' object has no attribute 'tzinfo'}}. The error should 
be about an explicit schema mismatch (though I contend that this isn't really 
a mismatch at all; the above lines should all work).
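
One possible way to sidestep the ordering dependence is to order each record's values by the schema's field names before handing them to Spark. The sketch below is plain Python; {{align_to_fields}} is a hypothetical helper, not a PySpark API. With PySpark, the list of names could presumably be taken from {{schema.names}} and the resulting tuples passed to {{spark.createDataFrame}} together with the schema, but that part is an assumption and is not exercised here.

```python
from datetime import datetime


def align_to_fields(record, field_names):
    """Hypothetical helper: return the record's values ordered to follow field_names."""
    missing = [name for name in field_names if name not in record]
    if missing:
        raise KeyError("record lacks fields: %s" % ", ".join(missing))
    return tuple(record[name] for name in field_names)


record = dict(number=1, letters="real1", some_date=datetime(2017, 12, 1, 3, 15))

# Field order of "schema B" from the example above.
aligned = align_to_fields(record, ["some_date", "number", "letters"])
print(aligned[1:])  # (1, 'real1')
```

Matching by name in user code keeps the result independent of both the sorting in {{Row}} and the column order of the query that produced the schema.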




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994759#comment-15994759
 ] 

Hyukjin Kwon commented on SPARK-12467:
--

[~imachabeli] I will resolve this if there are no objections to the comment above.




[jira] [Commented] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark

2017-05-03 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994748#comment-15994748
 ] 

Maciej Szymkiewicz commented on SPARK-12467:


Python before 3.6 does not preserve the order of keyword arguments (PEP 
468), so without sorting, the field order becomes nondeterministic.
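
A small sketch in plain Python of why sorting helps. The function names are hypothetical and only imitate the relevant part of {{Row.__new__}}: sorting the keyword names yields one canonical order regardless of how the caller wrote them or how the interpreter ordered {{**kwargs}}.

```python
def call_order(**kwargs):
    # On Python 3.6+ (PEP 468) this is the order written by the caller;
    # on earlier versions the order is not guaranteed.
    return list(kwargs)


def sorted_order(**kwargs):
    # Imitates the sorting step: a canonical order independent of the call site.
    return sorted(kwargs)


a = sorted_order(number=1, letters="real1", some_date=None)
b = sorted_order(some_date=None, letters="real1", number=1)

print(a)       # ['letters', 'number', 'some_date']
print(a == b)  # True
```

The cost of this determinism is the behavior the issue complains about: the resulting order is alphabetical rather than the order the user wrote.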
