Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
To: Evo Eftimov Cc: Christian Perez; user Subject: Re: Super slow caching in 1.3? Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDDs do not serialize data that is cached. I'll also note that until yesterday we were missing FloatType

Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50 s for 400 MB of data. The schema is similar to the TPC-H LineItem. Here is the code I used to cache it. I am wondering whether any setting is missing. Thank you so much! lineitemSchemaRDD.registerTempTable(lineitem);

RE: Super slow caching in 1.3?

2015-04-20 Thread Evo Eftimov
a detailed description / spec of both From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, April 16, 2015 7:23 PM To: Evo Eftimov Cc: Christian Perez; user Subject: Re: Super slow caching in 1.3? Here are the types that we specialize, other types will be much slower

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
Hi Michael, Good question! We checked 1.2 and found that it is also slow caching the same flat Parquet file. Caching other file formats of the same data was faster by up to a factor of ~2. Note that the Parquet file was created in Impala but the other formats were written by Spark SQL. Cheers,

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
: user Subject: Re: Super slow caching in 1.3? Hi Michael, Good question! We checked 1.2 and found that it is also slow caching the same flat Parquet file. Caching other file formats of the same data was faster by up to a factor of ~2. Note that the Parquet file was created in Impala

Re: Super slow caching in 1.3?

2015-04-16 Thread Michael Armbrust
the performance of each of the above options is -Original Message- From: Christian Perez [mailto:christ...@svds.com] Sent: Thursday, April 16, 2015 6:09 PM To: Michael Armbrust Cc: user Subject: Re: Super slow caching in 1.3? Hi Michael, Good question! We checked 1.2 and found

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Subject: Re: Super slow caching in 1.3? Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDDs do not serialize data that is cached. I'll also note that until yesterday we were missing FloatType https://github.com/apache/spark/blob

Super slow caching in 1.3?

2015-04-06 Thread Christian Perez
Hi all, Has anyone else noticed very slow caching of a Parquet file? It takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file on M2 EC2 instances. Or are my expectations way off... Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst
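
To put the timings reported in this thread on a common scale, the arithmetic can be sketched as below. The figures are the ones quoted in the messages (235 MB in 14 s on Spark 1.3, and about 400 MB in 50 s on Spark 1.2); the function name `cache_throughput_mb_per_s` and the labels are made up for illustration, and the two reports come from different clusters and datasets, so this is only a rough comparison.

```python
def cache_throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Return the effective caching rate in MB/s (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


# Timings as quoted in the thread; the labels are illustrative only.
reports = {
    "Parquet block, Spark 1.3": (235, 14),
    "SchemaRDD, Spark 1.2": (400, 50),
}

for label, (size_mb, seconds) in reports.items():
    rate = cache_throughput_mb_per_s(size_mb, seconds)
    print(f"{label}: {rate:.1f} MB/s")
```

Both reports work out to well under 20 MB/s, which is the rate the posters are questioning.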

Re: Super slow caching in 1.3?

2015-04-06 Thread Michael Armbrust
Do you think you are seeing a regression from 1.2? Also, are you caching nested data or flat rows? The in-memory caching is not really designed for nested data and so performs pretty slowly here (it's just falling back to Kryo and even then there are some locking issues). If so, would it be