[ 
https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-9970:
----------------------------------
    Priority: Major  (was: Minor)

> SQLContext.createDataFrame failed to properly determine column names
> --------------------------------------------------------------------
>
>                 Key: SPARK-9970
>                 URL: https://issues.apache.org/jira/browse/SPARK-9970
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Maciej Bryński
>
> Hi,
> I'm trying to do "nested join" of tables.
> After first join everything is ok, but second join made some of the column 
> names lost.
> My code is following:
> {code:java}
> def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
> joinType = "left_outer"):
>     tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
> r.asDict()[columnRight]))
>     tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
> tmpTable._2.data.alias(columnNested))
>     return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
> tmpTable["joinColumn"], joinType).drop("joinColumn")
> user = sqlContext.read.json(path + "user.json")
> user.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- name: string (nullable = true)
> order =  sqlContext.read.json(path + "order.json")
> order.printSchema();
> root
>  |-- id: long (nullable = true)
>  |-- price: double (nullable = true)
>  |-- userid: long (nullable = true)
> lines =  sqlContext.read.json(path + "lines.json")
> lines.printSchema();
> root
>  |-- id: long (nullable = true)
>  |-- orderid: long (nullable = true)
>  |-- product: string (nullable = true)
> {code}
> And joining code:
> Please look for the result of 2nd join. There are columns _1,_2,_3. Should be 
> 'id', 'orderid', 'product'
> {code:java}
> orders = joinTable(order, lines, "id", "orderid", "lines")
> orders.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- price: double (nullable = true)
>  |-- userid: long (nullable = true)
>  |-- lines: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- id: long (nullable = true)
>  |    |    |-- orderid: long (nullable = true)
>  |    |    |-- product: string (nullable = true)
> clients = joinTable(user, orders, "id", "userid", "orders")
> clients.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- name: string (nullable = true)
>  |-- orders: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- id: long (nullable = true)
>  |    |    |-- price: double (nullable = true)
>  |    |    |-- userid: long (nullable = true)
>  |    |    |-- lines: array (nullable = true)
>  |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |-- _1: long (nullable = true)
>  |    |    |    |    |-- _2: long (nullable = true)
>  |    |    |    |    |-- _3: string (nullable = true)
> {code}
> I tried to check if groupBy isn't the cause of the problem but it looks right
> {code:java}
> grouped =  orders.rdd.groupBy(lambda r: r.userid)
> print grouped.map(lambda x: list(x[1])).collect()
> [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, 
> product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, 
> price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], 
> [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, 
> product=u'XXX')])]]
> {code}
> So I assume that the problem is with createDataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to