Hi Michael,

Good question! We checked 1.2 and found that it is also slow caching
the same flat Parquet file. Caching other file formats of the same
data was faster, by up to a factor of ~2. Note that the Parquet file
was created in Impala, while the other formats were written by Spark SQL.
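For anyone following the thread, a minimal sketch of the flatten-then-cache pattern Michael suggests below (the table and column names here are hypothetical, just to illustrate pulling nested struct fields up to top-level columns):

```sql
-- Hypothetical nested schema: parquetTable(id, meta STRUCT<name: STRING, score: DOUBLE>)
-- Project the nested fields into top-level columns, then cache the flat result,
-- so the in-memory columnar cache doesn't fall back to slow generic serialization.
CACHE TABLE flattenedTable AS
SELECT id,
       meta.name  AS meta_name,
       meta.score AS meta_score
FROM parquetTable;
```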

Cheers,

Christian

On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Do you think you are seeing a regression from 1.2? Also, are you caching
> nested data or flat rows? The in-memory caching is not really designed for
> nested data, so it performs pretty slowly there (it's just falling back to
> Kryo, and even then there are some locking issues).
>
> If so, would it be possible to try caching a flattened version?
>
> CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
>
> On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez <christ...@svds.com> wrote:
>>
>> Hi all,
>>
>> Has anyone else noticed very slow times to cache a Parquet file? It
>> takes 14 s per 235 MB (1 block) uncompressed node-local Parquet file
>> on M2 EC2 instances. Or are my expectations way off?
>>
>> Cheers,
>>
>> Christian
>>
>> --
>> Christian Perez
>> Silicon Valley Data Science
>> Data Analyst
>> christ...@svds.com
>> @cp_phd
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd
