GitHub user michalsenkyr opened a pull request:

    https://github.com/apache/spark/pull/22527

    [SPARK-17952][SQL] Nested Java beans support in createDataFrame

    ## What changes were proposed in this pull request?
    
    When constructing a DataFrame from a Java bean, using nested beans throws an error despite [documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) stating otherwise. This PR aims to add that support.
    
    This PR does not yet add nested bean support in array or List fields. This can be added later or in another PR.
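    
    For illustration only (a hypothetical sketch, not part of this PR or its tests): a bean whose field is a Java collection of nested beans, such as `CategoryWithList` below (reusing the `SubCategory` bean from the shell example), would still not be covered by this change.
    
    ```
    import scala.beans.BeanProperty
    
    // Hypothetical bean, not from the PR: its field is a java.util.List of
    // nested beans, which this PR does not yet handle (see the note above).
    class CategoryWithList(
        @BeanProperty var id: String,
        @BeanProperty var subCategories: java.util.List[SubCategory]) extends Serializable
    ```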
    
    ## How was this patch tested?
    
    A nested bean was added to the appropriate unit test.
    
    Also manually tested in the Spark shell with code emulating the referenced JIRA:
    
    ```
    scala> import scala.beans.BeanProperty
    import scala.beans.BeanProperty
    
    scala> class SubCategory(@BeanProperty var id: String, @BeanProperty var name: String) extends Serializable
    defined class SubCategory
    
    scala> class Category(@BeanProperty var id: String, @BeanProperty var subCategory: SubCategory) extends Serializable
    defined class Category
    
    scala> import scala.collection.JavaConverters._
    import scala.collection.JavaConverters._
    
    scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category])
    java.lang.IllegalArgumentException: The value (SubCategory@65130cf2) of the type (SubCategory) cannot be converted to struct<id:string,name:string>
      at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
      at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
      at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
      at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
      at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108)
      at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
      at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1108)
      at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
      at scala.collection.Iterator$class.toStream(Iterator.scala:1320)
      at scala.collection.AbstractIterator.toStream(Iterator.scala:1334)
      at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
      at scala.collection.AbstractIterator.toSeq(Iterator.scala:1334)
      at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:423)
      ... 51 elided
    ```
    
    New behavior:
    
    ```
    scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category])
    res0: org.apache.spark.sql.DataFrame = [id: string, subCategory: struct<id: string, name: string>]
    
    scala> res0.show()
    +-----+---------------+
    |   id|    subCategory|
    +-----+---------------+
    |s-111|[sc-111, Sub-1]|
    +-----+---------------+
    ```
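    
    As a quick follow-up (a hedged usage sketch, not part of the run above): with nested beans mapped to struct columns, their fields should be selectable with the usual dot notation.
    
    ```
    // Assumes the DataFrame res0 created above; not from the original PR description.
    // Nested bean fields are ordinary struct fields, so dot notation selects them.
    res0.select("subCategory.id", "subCategory.name").show()
    // Expected to show the values sc-111 and Sub-1 from the single row above.
    ```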
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/michalsenkyr/spark SPARK-17952

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22527
    
----
commit ccea758b069c4622e9b1f71b92167c81cfcd81b8
Author: Michal Senkyr <mike.senkyr@...>
Date:   2018-09-22T18:25:36Z

    Add nested Java beans support to SQLContext.beansToRow

----


---
