Hi all,

Spark SQL usually creates a DataFrame whose rows are GenericRowWithSchema (is that
right?). 'Row' is the parent type of both GenericRow and GenericRowWithSchema;
the only difference is that GenericRowWithSchema also stores its schema as a
StructType. But one DataFrame has only one schema, so each row should not have
to store the schema itself, because StructType is quite heavy and most RDDs have
many rows. To test this, I did the following (a code sketch follows the list):
1) create a DataFrame and call .rdd (RDD[Row]) <= GenericRowWithSchema
2) dataframe.map(row => Row.fromSeq(row.toSeq)) <= GenericRow
3) dataframe.map(row => row.toSeq) <= underlying sequence of a row
4) saveAsObjectFile, or use org.apache.spark.util.SizeEstimator.estimate
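Here is roughly what I ran, in case it helps reproduce the numbers (Spark 1.x
shell, so sqlContext is already in scope; "some_table" and the sample size are
just placeholders, not my real data):

    import org.apache.spark.sql.Row
    import org.apache.spark.util.SizeEstimator

    val df = sqlContext.table("some_table")   // any 5-column DataFrame

    // 1) rows as DataFrame.rdd returns them => GenericRowWithSchema
    val withSchema = df.rdd
    // 2) same values, rebuilt without the schema => GenericRow
    val plain = df.rdd.map(row => Row.fromSeq(row.toSeq))
    // 3) just the underlying value sequence of each row
    val seqs = df.rdd.map(row => row.toSeq)

    // 4) estimate the in-memory size of a driver-side sample of each variant
    val n = 100000
    println(SizeEstimator.estimate(withSchema.take(n)))
    println(SizeEstimator.estimate(plain.take(n)))
    println(SizeEstimator.estimate(seqs.take(n)))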
And my results (for a DataFrame with 5 columns) are:
GenericRowWithSchema => 13 GB
GenericRow => 8.2 GB
Seq => 7 GB
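
Since the whole DataFrame has exactly one schema, the StructType can also be
re-attached once at the DataFrame level instead of being carried by every row.
Roughly (again Spark 1.x, with df and sqlContext as above):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    val schema: StructType = df.schema                      // stored once
    val slim = df.rdd.map(row => Row.fromSeq(row.toSeq))    // GenericRow only
    val rebuilt = sqlContext.createDataFrame(slim, schema)  // schema re-applied per DataFrame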

Best regards
Kevin


