Internally, I believe we only actually create one struct (schema) object, shared by every row, so in most use cases each row is really only paying the cost of a pointer (as shown below).
scala> val df = Seq((1,2), (3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [3,4])

scala> res1(0).schema eq res1(1).schema
res3: Boolean = true

I'd strongly suggest that you use something like parquet
<https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
or avro <http://spark-packages.org/package/databricks/spark-avro> to store
DataFrames, as it is likely much more space-efficient and faster than
generic serialization.

Michael

On Mon, Jul 27, 2015 at 9:02 PM, Kevin Jung <itsjb.j...@samsung.com> wrote:

> Hi all,
>
> SparkSQL usually creates DataFrames with GenericRowWithSchema (is that
> right?). 'Row' is a superclass of both GenericRow and GenericRowWithSchema;
> the only difference is that GenericRowWithSchema also carries its schema
> information as a StructType. But one DataFrame has only one schema, so each
> row should not have to store the schema in it, because StructType is very
> heavy and most RDDs have many rows. To test this:
> 1) create a DataFrame and call .rdd (RDD[Row])        <= GenericRowWithSchema
> 2) dataframe.map(row => Row.fromSeq(row.toSeq))       <= GenericRow
> 3) dataframe.map(row => row.toSeq)                    <= underlying sequence of a row
> 4) saveAsObjectFile, or use org.apache.spark.util.SizeEstimator.estimate
> My results (DataFrame with 5 columns):
> GenericRowWithSchema => 13 GB
> GenericRow => 8.2 GB
> Seq => 7 GB
>
> Best regards
> Kevin
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GenericRowWithSchema-is-too-heavy-tp24018.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
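As a minimal sketch of the Parquet suggestion above (the output path and column names are hypothetical, and this assumes a spark-shell session where `sqlContext` is already in scope):

```scala
// Hypothetical sketch: round-tripping a DataFrame through Parquet
// instead of generic Java serialization (saveAsObjectFile).
import sqlContext.implicits._

val df = Seq((1, 2), (3, 4)).toDF("a", "b")

// Parquet is columnar and compressed on disk, and the schema is stored
// once in the file metadata rather than once per row.
df.write.parquet("/tmp/df.parquet")

// Reading back recovers the schema from the file metadata.
val loaded = sqlContext.read.parquet("/tmp/df.parquet")
loaded.printSchema()
```

Because the schema lives in the file footer, the per-row StructType overhead measured in the quoted experiment never reaches disk at all.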