Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
Michael,

There is only one schema: both versions have 200 string columns in one file.
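
For anyone wanting to reproduce this without the data, here is a hypothetical
Scala sketch (Spark 1.3 shell, where sc and sqlContext are predefined) that
builds a comparable 200-string-column DataFrame; the column names, row count,
and values are invented:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// 200 string columns, as described above; names c1..c200 are made up.
val schema = StructType((1 to 200).map(i => StructField(s"c$i", StringType)))
val rows = sc.parallelize(1 to 100000).map(i => Row.fromSeq(Seq.fill(200)(s"v$i")))
val df = sqlContext.createDataFrame(rows, schema)

df.registerTempTable("wide")
sqlContext.cacheTable("wide")
df.count() // the first action is what materializes the cache; time this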

On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov evo.efti...@isecc.com wrote:
 Now this is very important:



 “Normal RDDs” refers to “batch RDDs”. However, the default in-memory
 persistence of RDDs which are part of a DStream is “serialized” rather than
 actual (hydrated) objects. The Spark documentation states that serialization
 is required for space and garbage-collection efficiency (but creates higher
 CPU load) – which makes sense considering the large number of RDDs which get
 discarded in a streaming app.



 So what does Databricks actually recommend as an object-oriented model for
 RDD elements used in Spark Streaming apps – flat or not? And can you provide
 a detailed description / spec of both?



 From: Michael Armbrust [mailto:mich...@databricks.com]
 Sent: Thursday, April 16, 2015 7:23 PM
 To: Evo Eftimov
 Cc: Christian Perez; user


 Subject: Re: Super slow caching in 1.3?



 Here are the types that we specialize; other types will be much slower.
 This is only for Spark SQL; normal RDDs do not serialize data that is
 cached.  I'll also note that until yesterday we were missing FloatType:

 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154



 Christian, can you provide the schema of the fast and slow datasets?



 On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Michael, what exactly do you mean by a flattened version/structure here, e.g.:

 1. An Object with only primitive data types as attributes
 2. An Object with no more than one level of other Objects as attributes
 3. An Array/List of primitive types
 4. An Array/List of Objects

 This question is in general about RDDs, not necessarily RDDs in the context
 of Spark SQL.

 When answering, can you also score how bad the performance of each of the
 above options is?


 -Original Message-
 From: Christian Perez [mailto:christ...@svds.com]
 Sent: Thursday, April 16, 2015 6:09 PM
 To: Michael Armbrust
 Cc: user
 Subject: Re: Super slow caching in 1.3?

 Hi Michael,

 Good question! We checked 1.2 and found that it is also slow caching the
 same flat Parquet file. Caching other file formats of the same data was
 faster by up to a factor of ~2. Note that the Parquet file was created in
 Impala but the other formats were written by Spark SQL.

 Cheers,

 Christian

 On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com
 wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you
 caching nested data or flat rows?  The in-memory caching is not really
 designed for nested data and so performs pretty slowly here (it's just
 falling back to Kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional commands, e-mail: user-h...@spark.apache.org





 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org





-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I am facing a similar issue in Spark 1.2: caching the schema RDD takes about
50 s for 400 MB of data. The schema is similar to the TPC-H LineItem table.

Here is the code I used for the caching test. I am wondering if I am missing
some setting?

Thank you so much!

// Register the SchemaRDD as a table, mark it for caching, then force
// materialization with count():
lineitemSchemaRDD.registerTempTable("lineitem");
sqlContext.sqlContext().cacheTable("lineitem");
System.out.println(lineitemSchemaRDD.count());
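
For what it's worth, cacheTable is lazy: the first action afterwards (the
count() above) is what actually builds the in-memory columnar buffers, so
that first action is where the ~50 s goes. A minimal Scala sketch separating
build time from read time (assuming a SQLContext named sqlContext and the
registered "lineitem" table; the time helper is ad hoc, not a Spark API):

sqlContext.cacheTable("lineitem")

// Ad-hoc timing helper, not part of Spark.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("first count (builds the cache)") {
  sqlContext.sql("SELECT COUNT(*) FROM lineitem").collect()
}
time("second count (reads the cache)") {
  sqlContext.sql("SELECT COUNT(*) FROM lineitem").collect()
}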


On Mon, Apr 6, 2015 at 8:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


RE: Super slow caching in 1.3?

2015-04-20 Thread Evo Eftimov
Now this is very important:

 

“Normal RDDs” refers to “batch RDDs”. However, the default in-memory
persistence of RDDs which are part of a DStream is “serialized” rather than
actual (hydrated) objects. The Spark documentation states that serialization
is required for space and garbage-collection efficiency (but creates higher
CPU load) – which makes sense considering the large number of RDDs which get
discarded in a streaming app.
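
A minimal Scala sketch of the knob in question (assuming an existing DStream
named stream; the storage levels are standard Spark constants):

import org.apache.spark.storage.StorageLevel

// Spark Streaming's default for DStream RDDs is serialized in-memory storage:
// stream.persist(StorageLevel.MEMORY_ONLY_SER)

// Opting into deserialized ("hydrated") objects instead:
stream.persist(StorageLevel.MEMORY_ONLY)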

 

So what does Databricks actually recommend as an object-oriented model for RDD
elements used in Spark Streaming apps – flat or not? And can you provide a
detailed description / spec of both?

 

From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?

 

Here are the types that we specialize; other types will be much slower.  This
is only for Spark SQL; normal RDDs do not serialize data that is cached.  I'll
also note that until yesterday we were missing FloatType:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154

 

Christian, can you provide the schema of the fast and slow datasets?

 

On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov evo.efti...@isecc.com wrote:

Michael, what exactly do you mean by a flattened version/structure here, e.g.:

1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects

This question is in general about RDDs, not necessarily RDDs in the context of
Spark SQL.

When answering, can you also score how bad the performance of each of the above
options is?


-Original Message-
From: Christian Perez [mailto:christ...@svds.com]
Sent: Thursday, April 16, 2015 6:09 PM
To: Michael Armbrust
Cc: user
Subject: Re: Super slow caching in 1.3?

Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching the same
flat Parquet file. Caching other file formats of the same data was faster by
up to a factor of ~2. Note that the Parquet file was created in Impala but the
other formats were written by Spark SQL.

Cheers,

Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you
 caching nested data or flat rows?  The in-memory caching is not really
 designed for nested data and so performs pretty slowly here (it's just
 falling back to Kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional commands, e-mail: user-h...@spark.apache.org





--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



 



Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching
the same flat Parquet file. Caching other file formats of the same
data was faster by up to a factor of ~2. Note that the Parquet file
was created in Impala but the other formats were written by Spark SQL.

Cheers,

Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you caching
 nested data or flat rows?  The in-memory caching is not really designed for
 nested data and so performs pretty slowly here (it's just falling back to
 Kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael, what exactly do you mean by a flattened version/structure here, e.g.:

1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects

This question is in general about RDDs, not necessarily RDDs in the context of
Spark SQL.

When answering, can you also score how bad the performance of each of the above
options is?
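
To make the four options concrete, a hypothetical Scala sketch (the case
classes and values are invented for illustration):

// 1. Flat object: only primitive-typed attributes
case class Flat(id: Long, name: String, score: Double)

// 2. One level of nested objects
case class Inner(a: Int, b: String)
case class Nested(id: Long, inner: Inner)

// 3. An array of primitives, and 4. an array of objects
val primitives = Array(1, 2, 3)
val objects = Array(Inner(1, "x"), Inner(2, "y"))

val flatRDD = sc.parallelize(Seq(Flat(1L, "x", 0.5)))
val nestedRDD = sc.parallelize(Seq(Nested(1L, Inner(1, "y"))))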

-Original Message-
From: Christian Perez [mailto:christ...@svds.com] 
Sent: Thursday, April 16, 2015 6:09 PM
To: Michael Armbrust
Cc: user
Subject: Re: Super slow caching in 1.3?

Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching the same
flat Parquet file. Caching other file formats of the same data was faster by
up to a factor of ~2. Note that the Parquet file was created in Impala but the
other formats were written by Spark SQL.

Cheers,

Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you
 caching nested data or flat rows?  The in-memory caching is not really
 designed for nested data and so performs pretty slowly here (it's just
 falling back to Kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
 additional commands, e-mail: user-h...@spark.apache.org





--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Super slow caching in 1.3?

2015-04-16 Thread Michael Armbrust
Here are the types that we specialize; other types will be much slower.
This is only for Spark SQL; normal RDDs do not serialize data that is
cached.  I'll also note that until yesterday we were missing FloatType:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154

Christian, can you provide the schema of the fast and slow datasets?
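
(If it helps, the schema can be shared straight from the shell; printSchema
exists on DataFrame in 1.3 and on SchemaRDD in 1.2. A sketch, with a
hypothetical path:)

val df = sqlContext.parquetFile("/path/to/slow.parquet")
df.printSchema()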

On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Michael, what exactly do you mean by a flattened version/structure here,
 e.g.:

 1. An Object with only primitive data types as attributes
 2. An Object with no more than one level of other Objects as attributes
 3. An Array/List of primitive types
 4. An Array/List of Objects

 This question is in general about RDDs, not necessarily RDDs in the context
 of Spark SQL.

 When answering, can you also score how bad the performance of each of the
 above options is?

 -Original Message-
 From: Christian Perez [mailto:christ...@svds.com]
 Sent: Thursday, April 16, 2015 6:09 PM
 To: Michael Armbrust
 Cc: user
 Subject: Re: Super slow caching in 1.3?

 Hi Michael,

 Good question! We checked 1.2 and found that it is also slow caching the
 same flat Parquet file. Caching other file formats of the same data was
 faster by up to a factor of ~2. Note that the Parquet file was created in
 Impala but the other formats were written by Spark SQL.

 Cheers,

 Christian

 On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com
 wrote:
  Do you think you are seeing a regression from 1.2?  Also, are you
  caching nested data or flat rows?  The in-memory caching is not really
  designed for nested data and so performs pretty slowly here (it's just
  falling back to Kryo and even then there are some locking issues).
 
  If so, would it be possible to try caching a flattened version?
 
  CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
 
  On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com
 wrote:
 
  Hi all,
 
  Has anyone else noticed very slow caching times for Parquet files? It
  takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
  on M2 EC2 instances. Or are my expectations way off...
 
  Cheers,
 
  Christian
 
  --
  Christian Perez
  Silicon Valley Data Science
  Data Analyst
  christ...@svds.com
  @cp_phd
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
  additional commands, e-mail: user-h...@spark.apache.org
 
 



 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org





RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Well, normal RDDs can also be serialized if you select that type of memory
persistence.
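
For reference, a minimal sketch of that choice for a batch RDD (standard Spark
storage levels; the RDD itself is a stand-in):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)

rdd.persist(StorageLevel.MEMORY_ONLY)  // deserialized objects; same as rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: less space, more CPU
// (a storage level can be set only once per RDD, hence one active call here)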

 

OK, thanks. So just to confirm:

 

IF a “normal” RDD is not going to be persisted in-memory as serialized objects
(which would mean it has to be persisted as “actual/hydrated” objects), THEN
there are no limitations on the level of hierarchy in the object-oriented model
of the RDD elements (limitations in terms of performance impact/degradation) –
right?

 

From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?

 

Here are the types that we specialize; other types will be much slower.  This
is only for Spark SQL; normal RDDs do not serialize data that is cached.  I'll
also note that until yesterday we were missing FloatType:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154

 

Christian, can you provide the schema of the fast and slow datasets?

 

On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov evo.efti...@isecc.com wrote:

Michael, what exactly do you mean by a flattened version/structure here, e.g.:

1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects

This question is in general about RDDs, not necessarily RDDs in the context of
Spark SQL.

When answering, can you also score how bad the performance of each of the above
options is?


-Original Message-
From: Christian Perez [mailto:christ...@svds.com]
Sent: Thursday, April 16, 2015 6:09 PM
To: Michael Armbrust
Cc: user
Subject: Re: Super slow caching in 1.3?

Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching the same
flat Parquet file. Caching other file formats of the same data was faster by
up to a factor of ~2. Note that the Parquet file was created in Impala but the
other formats were written by Spark SQL.

Cheers,

Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust mich...@databricks.com wrote:
 Do you think you are seeing a regression from 1.2?  Also, are you
 caching nested data or flat rows?  The in-memory caching is not really
 designed for nested data and so performs pretty slowly here (it's just
 falling back to Kryo and even then there are some locking issues).

 If so, would it be possible to try caching a flattened version?

 CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable

 On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional commands, e-mail: user-h...@spark.apache.org





--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



 



Re: Super slow caching in 1.3?

2015-04-06 Thread Michael Armbrust
Do you think you are seeing a regression from 1.2?  Also, are you caching
nested data or flat rows?  The in-memory caching is not really designed for
nested data and so performs pretty slowly here (it's just falling back to
Kryo and even then there are some locking issues).

If so, would it be possible to try caching a flattened version?

CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
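
(The column list depends on the actual schema. A hypothetical flattening, with
invented column names, might look like the Scala sketch below; the nested
struct's fields are pulled up into top-level columns before caching:)

// Hypothetical: `payload` is an invented struct column on parquetTable.
sqlContext.sql("""
  CACHE TABLE flattenedTable AS
  SELECT id,
         payload.a AS payload_a,
         payload.b AS payload_b
  FROM parquetTable
""")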

On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez christ...@svds.com wrote:

 Hi all,

 Has anyone else noticed very slow caching times for Parquet files? It
 takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
 on M2 EC2 instances. Or are my expectations way off...

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org