[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731529#comment-14731529
 ] 

Xiangrui Meng commented on SPARK-10199:
---------------------------------------

The improvement numbers also depend on the model size. In unit tests, the 
model sizes are usually very small, so the overhead of reflection becomes 
significant. With real models, either the model itself is too small for the 
save time to matter, or the model is large and the overhead of reflection 
becomes insignificant. Keeping the code simple and easy to understand is also 
quite important. +[~josephkb]

> Avoid using reflections for parquet model save
> ----------------------------------------------
>
>                 Key: SPARK-10199
>                 URL: https://issues.apache.org/jira/browse/SPARK-10199
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Feynman Liang
>            Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Several model save/load implementations in MLlib use case classes to infer a 
> schema for the data frame saved to Parquet. However, inferring a schema from 
> case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}
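
For illustration only, the explicit-schema approach described in the issue
might look roughly like the sketch below. It assumes the Spark 1.x SQLContext
API; the object name, the column names, the sample data, and the output path
are all hypothetical and are not taken from any MLlib model.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Hypothetical example object; not part of MLlib.
object ExplicitSchemaSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("explicit-schema-save").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical output location; pass a real path as the first argument.
    val path = args.headOption.getOrElse("/tmp/explicit-schema-model")

    // Hypothetical model data: (id, weight) pairs.
    val dataRDD = sc.parallelize(Seq((0, 0.25), (1, 0.75)))

    // The schema is written out by hand, so nothing has to be inferred
    // from a case class or tuple via runtime reflection.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("weight", DoubleType, nullable = false)))

    // createDataFrame(rowRDD, schema) expects an RDD[Row].
    val rowRDD = dataRDD.map { case (id, weight) => Row(id, weight) }
    val df = sqlContext.createDataFrame(rowRDD, schema)

    df.write.parquet(path)
    sc.stop()
  }
}
```

The trade-off raised in the comment above still applies: the hand-written
StructType has to be kept in sync with the data by hand, which is more code
than a case class.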



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
