Thanks Dr. Mich, Jorn,

It's about 150 million rows in the cached dataset. How do I tell if it's spilling to disk? I didn't really see any logs to that effect.

How do I determine the optimal number of partitions for a given input dataset? What's too much?
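For reference, a common rule of thumb is to size partitions at roughly 100-200 MB each and round up to a multiple of the available cores. This is general guidance rather than anything from this thread; the function name, the 128 MB target, and the core count below are illustrative assumptions:

```python
import math

def suggest_partitions(dataset_bytes, cores=8, target_partition_bytes=128 * 1024**2):
    """Heuristic sketch: aim for ~128 MB per partition, rounded up to a
    multiple of the core count so every core has work. Not a Spark API,
    just back-of-the-envelope sizing."""
    by_size = math.ceil(dataset_bytes / target_partition_bytes)
    return max(cores, math.ceil(by_size / cores) * cores)

# e.g. the ~9 GB cached dataset from this thread, assuming an 8-core machine
print(suggest_partitions(9 * 1024**3))  # -> 72
```

"Too many" partitions shows up as per-task scheduling overhead dominating the work; tasks finishing in well under ~100 ms are a sign to use fewer, larger partitions.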
regards,
imran

On Mon, Apr 25, 2016 at 3:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Are you sure it is not spilling to disk?
>
> How many rows are cached in your result set -> sqlContext.sql("SELECT *
> FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 25 April 2016 at 23:47, Imran Akbar <skunkw...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running a simple query like this through Spark SQL:
>>
>> sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND
>> dt_year=2015 AND dt_month BETWEEN 1 AND 11 AND product IN
>> ('cereal')").show()
>>
>> which takes 3 minutes to run against an in-memory cache of 9 GB of data.
>>
>> The data was 100% cached in memory before I ran the query (see screenshot 1).
>> The data was cached like this:
>>
>> data = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
>> data.cache()
>> data.registerTempTable("data")
>>
>> and then I ran an action query to load the data into the cache.
>>
>> I see lots of rows of logs like this:
>>
>> 16/04/25 22:39:11 INFO MemoryStore: Block rdd_13136_2856 stored as values in memory (estimated size 2.5 MB, free 9.7 GB)
>> 16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally
>> 16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping
>> 16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from memory
>>
>> Screenshot 2 shows the job page of the longest job.
>>
>> The data was partitioned in Parquet by month, country, and product before I cached it.
>>
>> Any ideas what the issue could be? This is running on localhost.
>>
>> regards,
>> imran
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
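Regarding the spill question: the "Dropping block ... from memory" lines already quoted in this thread are the relevant evidence. They mean the BlockManager is evicting cached partitions under memory pressure; depending on the storage level, dropped blocks are either written to disk or recomputed on the next access, both of which slow queries down. A minimal sketch of scanning a log for those lines (a hypothetical helper, not a Spark API):

```python
import re

# Matches the eviction lines the BlockManager logs, e.g.
# "INFO BlockManager: Dropping block rdd_13136_3866 from memory"
DROP_RE = re.compile(r"Dropping block (\S+) from memory")

def dropped_blocks(log_lines):
    """Return the block IDs that were evicted from the in-memory cache."""
    return [m.group(1) for line in log_lines
            for m in [DROP_RE.search(line)] if m]

log = [
    "16/04/25 22:39:11 INFO BlockManager: Found block rdd_13136_2856 locally",
    "16/04/25 22:39:11 INFO MemoryStore: 6 blocks selected for dropping",
    "16/04/25 22:39:11 INFO BlockManager: Dropping block rdd_13136_3866 from memory",
]
print(dropped_blocks(log))  # -> ['rdd_13136_3866']
```

If this list is non-empty during the query, the cache does not fully fit and the Storage tab of the Spark UI (fraction cached, size on disk) is the place to confirm.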