Now this is very important:
“Normal RDDs” refers to “batch RDDs”. However, the default in-memory storage of RDDs that are part of a DStream is “serialized” rather than actual (hydrated) objects. The Spark documentation states that serialization is required for space and garbage-collection efficiency (but creates higher CPU load), which makes sense considering the large number of RDDs that get discarded in a streaming app.

So what does Databricks actually recommend as the object-oriented model for RDD elements used in Spark Streaming apps: flat or not? And can you provide a detailed description / spec of both?

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?

Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDDs do not serialize data that is cached. I'll also note that until yesterday we were missing FloatType.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154

Christian, can you provide the schema of the fast and slow datasets?

On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

Michael, what exactly do you mean by "flattened" version/structure here, e.g.:

1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects

This question is about RDDs in general, not necessarily RDDs in the context of Spark SQL.

When answering, can you also score how bad the performance of each of the above options is?

-----Original Message-----
From: Christian Perez [mailto:christ...@svds.com]
Sent: Thursday, April 16, 2015 6:09 PM
To: Michael Armbrust
Cc: user
Subject: Re: Super slow caching in 1.3?

Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching the same flat parquet file.
Caching other file formats of the same data was faster by up to a factor of ~2. Note that the parquet file was created in Impala but the other formats were written by Spark SQL.

Cheers,
Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Do you think you are seeing a regression from 1.2? Also, are you
> caching nested data or flat rows? The in-memory caching is not really
> designed for nested data and so performs pretty slowly here (it's just
> falling back to kryo and even then there are some locking issues).
>
> If so, would it be possible to try caching a flattened version?
>
> CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
>
> On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez <christ...@svds.com> wrote:
>>
>> Hi all,
>>
>> Has anyone else noticed very slow time to cache a Parquet file? It
>> takes 14 s per 235 MB (1 block) uncompressed node local Parquet file
>> on M2 EC2 instances. Or are my expectations way off...
>>
>> Cheers,
>>
>> Christian
>>
>> --
>> Christian Perez
>> Silicon Valley Data Science
>> Data Analyst
>> christ...@svds.com
>> @cp_phd
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>> additional commands, e-mail: user-h...@spark.apache.org
>>
>
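
For readers trying to picture what a "flattened version" means in this thread, here is a minimal sketch in plain Python. It is not Spark code and does not reproduce Spark's caching logic. The helper names (is_flat, flatten), the PRIMITIVES tuple and the sample record are all hypothetical, and PRIMITIVES is only a stand-in: the types Spark SQL actually specializes are the ones in the ColumnBuilder.scala file linked in the thread. The sketch only shows the shape change: nested fields are lifted into top-level columns so that every column holds a primitive value.

```python
from typing import Any, Dict

# Assumption for illustration only: a stand-in for "primitive" column types.
# This is NOT the list of types that Spark SQL specializes.
PRIMITIVES = (bool, int, float, str, bytes, type(None))


def is_flat(record: Dict[str, Any]) -> bool:
    """Return True when every field of the record holds a primitive value."""
    return all(isinstance(value, PRIMITIVES) for value in record.values())


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Lift nested dict fields into top-level columns with dot-separated names.

    Lists are left untouched; turning them into rows would need an
    explode-style step, which this sketch does not attempt.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested object, extending the column prefix.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record, comparable to a row with struct columns.
nested = {"id": 7, "user": {"name": "ann", "geo": {"lat": 1.5, "lon": 2.5}}}
flat = flatten(nested)
```

In Spark SQL terms this corresponds roughly to the SELECT list in the CACHE TABLE statement quoted above, where each nested field would be selected as its own top-level column before caching.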