[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736139#comment-14736139
 ] 

Xiangrui Meng commented on SPARK-10199:
---------------------------------------

Yes, please. Thanks for doing the benchmark! We will close the JIRAs as well. 
Next time, we should discuss on the JIRA page first and implement something 
minimal for more discussions before we implement everything.

> Avoid using reflections for parquet model save
> ----------------------------------------------
>
>                 Key: SPARK-10199
>                 URL: https://issues.apache.org/jira/browse/SPARK-10199
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Feynman Liang
>            Priority: Minor
>
> These items are not high priority, since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Several model save/load implementations in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to specify the schema for the data frame directly, 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}.
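
A minimal sketch of the approach proposed in the description, assuming the
Spark 1.x SQLContext API. The ModelData case class, its field names, and the
output path are hypothetical and do not come from any actual MLlib model. The
schema is written out by hand as a StructType and passed together with an
RDD[Row], so no schema is inferred from the case class.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

object ExplicitSchemaSave {
  // Hypothetical model payload, for illustration only.
  case class ModelData(weights: Array[Double], intercept: Double)

  // Schema declared directly; the field types are already known at save time.
  val schema: StructType = StructType(Seq(
    StructField("weights", ArrayType(DoubleType, containsNull = false), nullable = false),
    StructField("intercept", DoubleType, nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("explicit-schema-save").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val model = ModelData(Array(0.5, -1.25), 2.0)

    // Build Rows by hand instead of passing the case class itself,
    // which would trigger reflection-based schema inference.
    val dataRDD = sc.parallelize(Seq(Row(model.weights.toSeq, model.intercept)), 1)

    // Hypothetical output path.
    val path = "/tmp/explicit-schema-model"
    sqlContext.createDataFrame(dataRDD, schema).write.parquet(path)

    sc.stop()
  }
}
```

The trade-off is that the schema and the Row construction must be kept in sync
by hand, which the case-class approach otherwise does automatically.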



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

