Thanks Dr. Mich, Jorn,

It's about 150 million rows in the cached dataset.  How do I tell whether it's
spilling to disk?  I didn't see any log lines to that effect.
How do I determine the optimal number of partitions for a given input
dataset?  How many is too many?
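For what it's worth, here's the back-of-the-envelope arithmetic I've been
using so far, assuming the common rule of thumb of roughly 128 MB per
partition and a few tasks per core (both figures are rules of thumb I picked
up, not Spark-enforced defaults):

```python
import math

def suggest_partitions(dataset_bytes, num_cores,
                       target_partition_bytes=128 * 1024 * 1024,
                       tasks_per_core=3):
    """Rough heuristic, not a Spark API: aim for ~128 MB per partition,
    but keep at least a few tasks per core so all cores stay busy."""
    by_size = math.ceil(dataset_bytes / target_partition_bytes)
    by_cores = num_cores * tasks_per_core
    return max(by_size, by_cores)

# e.g. a 9 GB cache on an 8-core localhost:
print(suggest_partitions(9 * 1024**3, 8))  # -> 72
```

My understanding is that too many partitions mostly shows up as per-task
scheduling overhead, while too few leaves cores idle, but I'd welcome a
correction.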

regards,
imran
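p.s. Re-reading the "Dropping block rdd_13136_3866 from memory" lines in my
log below: that message comes from the in-memory store evicting cached blocks
once it's full. As I understand it, whether an evicted block gets spilled to
disk or simply recomputed later depends on the storage level, so the "Size on
Disk" column in the UI's Storage tab seems worth checking. Here's a toy
plain-Python sketch of the eviction behaviour as I picture it (my own LRU
approximation, not Spark's actual code):

```python
from collections import OrderedDict

class ToyMemoryStore:
    """Toy LRU block store. Assumption: a simplified model of cache
    eviction, not Spark's real MemoryStore algorithm."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # oldest-used block first

    def put(self, block_id):
        """Cache a block; return the list of blocks evicted to make room."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as recently used
            return []
        dropped = []
        while len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # evict oldest
            dropped.append(victim)  # spilled or recomputed, per storage level
        self.blocks[block_id] = True
        return dropped

store = ToyMemoryStore(capacity_blocks=3)
for b in ["rdd_13136_1", "rdd_13136_2", "rdd_13136_3", "rdd_13136_4"]:
    for victim in store.put(b):
        print("Dropping block %s from memory" % victim)
# -> Dropping block rdd_13136_1 from memory
```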

On Mon, Apr 25, 2016 at 3:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Are you sure it is not spilling to disk?
>
> How many rows are cached in your result set -> sqlContext.sql("SELECT *
> FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 25 April 2016 at 23:47, Imran Akbar <skunkw...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running a simple query like this through Spark SQL:
>>
>> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
>> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
>> ('cereal')").show()
>>
>> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>>
>> The data was 100% cached in memory before I ran the query (see screenshot
>> 1).
>> The data was cached like this:
>> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR
>> dt_year=2016)")
>> data.cache()
>> data.registerTempTable("data")
>> and then I ran an action query to load the data into the cache.
>>
>> I see lots of rows of logs like this:
>> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values
>> in memory (estimated size 2.5 MB, free 9.7 GB)
>> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
>> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
>> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from
>> memory
>>
>> Screenshot 2 shows the job page of the longest job.
>>
>> The data was partitioned in Parquet by month, country, and product before
>> I cached it.
>>
>> Any ideas what the issue could be?  This is running on localhost.
>>
>> regards,
>> imran
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
