Are you sure it is not spilling to disk?

How many rows are cached in the result set from sqlContext.sql("SELECT *
FROM raw WHERE (dt_year=2015 OR dt_year=2016)")?
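
A quick way to check, assuming the temp table is registered as "data" as in
your snippet below:

# a full scan forces materialisation and reports the cached row count
sqlContext.sql("SELECT COUNT(*) FROM data").show()

Also look at the Storage tab in the web UI: the "Fraction Cached" and "Size
on Disk" columns will show whether any partitions were evicted or spilled.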

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 25 April 2016 at 23:47, Imran Akbar <skunkw...@gmail.com> wrote:

> Hi,
>
> I'm running a simple query like this through Spark SQL:
>
> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
> ('cereal')").show()
>
> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>
> The data was 100% cached in memory before I ran the query (see screenshot
> 1).
> The data was cached like this:
> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR
> dt_year=2016)")
> data.cache()
> data.registerTempTable("data")
> and then I ran an action query to load the data into the cache.
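> (the exact action isn't shown here; a minimal materialising one would be
> something like data.count(), which scans every partition)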
>
> I see lots of rows of logs like this:
> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values
> in memory (estimated size 2.5 MB, free 9.7 GB)
> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from
> memory
>
> Screenshot 2 shows the job page of the longest job.
>
> The data was partitioned in Parquet by month, country, and product before
> I cached it.
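> (the write itself isn't shown; sketching it, it would have been something
> along the lines of
> df.write.partitionBy("dt_month", "country", "product").parquet(path),
> with the column names assumed from the query above)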
>
> Any ideas what the issue could be?  This is running on localhost.
>
> regards,
> imran
>
>
