Internally, I believe we only actually create one StructType object per
DataFrame, shared by every row, so in most use cases each row is really
only paying the cost of a pointer (as shown below).

scala> val df = Seq((1,2), (3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [3,4])

scala> res1(0).schema eq res1(1).schema
res3: Boolean = true
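
If you want to measure the per-row overhead yourself, something like the
following should work (this assumes SizeEstimator is accessible in your
build; within a single estimate() call it tracks objects it has already
visited, so the shared StructType should only be counted once per array):

import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

val rows = df.collect()                         // GenericRowWithSchema, all sharing one StructType
val plain = rows.map(r => Row.fromSeq(r.toSeq)) // GenericRow, no schema pointer

// The difference should be roughly one StructType plus a pointer per row.
println(SizeEstimator.estimate(rows))
println(SizeEstimator.estimate(plain))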

I'd strongly suggest using a format like Parquet
<https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
or Avro <http://spark-packages.org/package/databricks/spark-avro> to store
DataFrames, as these are likely to be much more space efficient and faster
than generic serialization.
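
For example (the paths here are just placeholders, and the Avro line
assumes the spark-avro package is on your classpath):

df.write.parquet("/tmp/df.parquet")
val back = sqlContext.read.parquet("/tmp/df.parquet")

// or, with spark-avro:
df.write.format("com.databricks.spark.avro").save("/tmp/df.avro")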

Michael

On Mon, Jul 27, 2015 at 9:02 PM, Kevin Jung <itsjb.j...@samsung.com> wrote:

> Hi all,
>
> Spark SQL usually creates DataFrames with GenericRowWithSchema (is that
> right?), and 'Row' is a superclass of both GenericRow and
> GenericRowWithSchema. The only difference is that GenericRowWithSchema
> carries its schema as a StructType. But one DataFrame has only one schema,
> so each row should not have to store the schema itself; StructType is very
> heavy and most RDDs have many rows. To test this, I did the following (a
> consolidated sketch follows the results below):
> 1) create a DataFrame and call rdd ( RDD[Row] ) <= GenericRowWithSchema
> 2) dataframe.map( row => Row.fromSeq(row.toSeq) ) <= GenericRow
> 3) dataframe.map( row => row.toSeq ) <= the underlying sequence of a row
> 4) saveAsObjectFile or use org.apache.spark.util.SizeEstimator.estimate
> My results (for a DataFrame with 5 columns):
> GenericRowWithSchema => 13 GB
> GenericRow => 8.2 GB
> Seq => 7 GB
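>
> A consolidated version of steps 1-4 (the output path is just a
> placeholder):
>
> import org.apache.spark.sql.Row
> import org.apache.spark.util.SizeEstimator
>
> val withSchema = dataframe.rdd                              // GenericRowWithSchema
> val noSchema = dataframe.map(row => Row.fromSeq(row.toSeq)) // GenericRow
> val seqs = dataframe.map(row => row.toSeq)                  // bare Seq per row
> seqs.saveAsObjectFile("/tmp/seqs") // or collect a sample and SizeEstimator.estimate it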
>
> Best regards
> Kevin