[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825
 ] 

Thomas Dunne edited comment on SPARK-13802 at 10/14/16 4:45 PM:
----------------------------------------------------------------

This is especially troublesome when creating a DataFrame with your own 
schema.

The data I am working with can contain a lot of empty fields, so schema 
inference may have to scan every row to determine their types. 
Providing our own schema should fix this, right?

Nope... Rather than matching up the keys of the Row with the field names of 
the provided schema, let's just change the order of one (the Row) and naively 
use zip(row, schema.fields). This means that even keeping the schema field 
order and the Row keyword order the same is not enough: because Row sorts its 
keys, we need to manually sort the schema fields too.

This doesn't seem like consistent or desirable behavior at all.
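
To make the mismatch concrete without needing a Spark install, here is a minimal pure-Python sketch. It only mimics the behavior described above (keyword names get sorted, then values are paired with schema fields by position); it is not PySpark's actual code, and values_in_schema_order is a hypothetical helper name used for illustration.

```python
# Pure-Python illustration of the ordering problem; no Spark required.
# Field names and values are taken from the issue description.

schema_fields = ["id", "first_name"]            # order declared in the schema
record = {"id": "39", "first_name": "Szymon"}   # data keyed by field name

# Mimic Row(**kwargs) in the affected versions: values ordered by sorted key.
sorted_values = tuple(record[key] for key in sorted(record))

# Mimic zip(row, schema.fields): pairing is purely positional,
# so the values end up under the wrong field names.
mismatched = dict(zip(schema_fields, sorted_values))


def values_in_schema_order(rec, field_names):
    """Hypothetical helper: build a plain tuple following the schema order."""
    return tuple(rec[name] for name in field_names)


# Possible workaround: order the values by the schema, not by sorted key.
matched = dict(zip(schema_fields, values_in_schema_order(record, schema_fields)))

print(mismatched)
print(matched)
```

With PySpark itself, the same idea would presumably mean passing plain tuples built in schema order to createDataFrame instead of Row(**kwargs); that is an assumption and has not been checked here against the affected versions.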


> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When a Row is constructed from kwargs, the fields in the underlying tuple are 
> sorted by name. When the schema reads the row, it does not use the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, StringType
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}


