Hey Friends,

I have recently been using Spark 1.3.1, mainly pyspark.sql. I noticed that
a Row object collected from a DataFrame is different from a Row object
defined directly via Row(*args, **kwargs): the nested Row comes back as a
plain tuple.

>>> from pyspark.sql.types import Row
>>> aaa = Row(a=1, b=2, c=Row(a=1, b=2))
>>> tuple(sc.parallelize([aaa]).toDF().collect()[0])
(1, 2, (1, 2))
>>> tuple(aaa)
(1, 2, Row(a=1, b=2))


This matters to me because I want to create a DataFrame with one of its
columns being a Row object, via sqlContext.createDataFrame(data, schema),
where I pass in the schema explicitly. However, if the data is an RDD of
Row objects like "aaa" in my example above, it fails in the _verify_type
function.
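
For concreteness, here is roughly what I am trying, plus one workaround
I found: mapping the Rows to plain tuples before calling createDataFrame.
(The IntegerType fields in the schema below are just illustrative, and the
workaround mapping is my own, not anything from the Spark docs.)

from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructType, StructField, IntegerType

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# Schema with a nested struct for column `c` (types are illustrative).
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", IntegerType(), True),
    StructField("c", StructType([
        StructField("a", IntegerType(), True),
        StructField("b", IntegerType(), True),
    ]), True),
])

aaa = Row(a=1, b=2, c=Row(a=1, b=2))
rdd = sc.parallelize([aaa])

# This is the call that fails for me in _verify_type, since the values
# are Row objects (a tuple subclass) rather than plain tuples/lists:
# df = sqlContext.createDataFrame(rdd, schema)

# Workaround: map each Row to a plain tuple first, nested Row included.
plain = rdd.map(lambda r: (r.a, r.b, tuple(r.c)))
df = sqlContext.createDataFrame(plain, schema)
df.show()

This works, but it would be nice if createDataFrame accepted Row values
directly when a schema is given.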



Thank you,

Wei
