Max Moroz created SPARK-16204:
---------------------------------

             Summary: Row() interface
                 Key: SPARK-16204
                 URL: https://issues.apache.org/jira/browse/SPARK-16204
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Max Moroz
            Priority: Trivial


Row('a', 'b') creates a Row-like class, which is slightly unexpected. To create 
an actual Row, one needs Row(field1 = 'a', field2 = 'b'). Of course, 
Row('a', 'b')('a', 'b') does create a row.

I understand the logic, it's similar to namedtuple. But there's a difference in 
that namedtuple *only* creates classes, while Row creates both Row-like classes 
and record-like instances. 
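
For comparison, here is a minimal sketch of the namedtuple behaviour referred to above, using only the standard library. The names Point and p are made up for illustration.

```python
from collections import namedtuple

# namedtuple() only ever returns a class ...
Point = namedtuple("Point", ["x", "y"])

# ... and records come from calling that class, as a separate step.
p = Point(1, 2)

print(isinstance(Point, type))  # the factory result is a class
print(p.x, p.y)                 # the record exposes its fields by name
```

With namedtuple there is no call that hands back a record in one case and a class in another, which is the distinction being drawn here.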

Wouldn't it be possible to do something slightly safer? For example, expose 
the class-creation interface through something else, like a global function, 
a Row class method, or a brand new class like RowFactory? Overloading 
__init__ to create both records and classes seems unnecessarily dangerous.
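
A rough sketch of what a separate class-creation entry point could look like. This is not PySpark code: row_class is a hypothetical helper name, and it is built on collections.namedtuple purely to keep the example self-contained.

```python
from collections import namedtuple

def row_class(*field_names):
    """Hypothetical helper: always returns a class, never a record."""
    return namedtuple("Row", field_names)

# Class creation and record creation are two visibly different calls.
Person = row_class("name", "age")
alice = Person("Alice", 11)

print(isinstance(Person, type))  # a class
print(alice.name, alice.age)     # a record with named fields
```

Under this kind of split, positional arguments passed to the record constructor could never be mistaken for field names.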

In addition, the Row('a', 'b') interface allows creation of invalid classes 
(where the field names are not strings); it would be better to catch that 
early rather than let it happen silently and then fail later (for example, 
when someone tries to print(Row('a', 42))).
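
An early check could be as small as the sketch below. check_field_names is a hypothetical function, not part of PySpark, and the example assumes Python 3, where str is the only text type.

```python
def check_field_names(*field_names):
    """Hypothetical check: reject non-string field names up front."""
    for name in field_names:
        if not isinstance(name, str):
            raise TypeError(
                "field names must be strings, got %r (%s)"
                % (name, type(name).__name__)
            )
    return field_names

check_field_names("a", "b")      # passes

try:
    check_field_names("a", 42)   # fails at creation time, not at print time
except TypeError as exc:
    error = str(exc)
    print(error)
```

The point is only that the failure surfaces where the bad name is supplied, rather than at some later use.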

And finally, key in Row(field1 = 'a', field2 = 'b') seems to search through the 
values instead of the keys as promised in the documentation, at least in 1.6.1 
(admittedly the docs only mention it in 2.0.0, but I thought it wasn't a change 
between the versions?).
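
To illustrate the difference between the two lookups, here is a toy model. ValueRecord and KeyRecord are invented for this example and are not PySpark's implementation. A tuple subclass that does not override __contains__ inherits tuple's value-based membership test; checking field names requires an explicit override.

```python
class ValueRecord(tuple):
    """Toy record: a plain tuple subclass, so `in` checks the values."""
    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        self = super().__new__(cls, [kwargs[n] for n in names])
        self._names = names
        return self

class KeyRecord(ValueRecord):
    """Same record, but `in` checks the field names instead."""
    def __contains__(self, item):
        return item in self._names

v = ValueRecord(field1="a", field2="b")
k = KeyRecord(field1="a", field2="b")

print("a" in v, "field1" in v)   # membership against values
print("a" in k, "field1" in k)   # membership against field names
```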



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

