Re: slow SQL query with cached dataset

2016-04-28 Thread Imran Akbar
Thanks Dr. Mich, Jorn, It's about 150 million rows in the cached dataset. How do I tell if it's spilling to disk? I didn't really see any logs to that effect. How do I determine the optimal number of partitions for a given input dataset? What's too much? regards, imran On Mon, Apr 25, 2016
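The partition-count question is often answered with a rule of thumb: aim for partitions of roughly 128 MB of input each (Spark's default HDFS block size). A minimal sketch of that heuristic, assuming only the total input size is known (the ~40 bytes-per-row figure below is a hypothetical illustration, not from the thread):

```python
import math

def suggest_partitions(input_bytes, target_partition_bytes=128 * 1024 * 1024,
                       min_partitions=1):
    """Rule-of-thumb partition count: total input size divided by a
    target partition size (~128 MB, Spark's default HDFS block size)."""
    return max(min_partitions, math.ceil(input_bytes / target_partition_bytes))

# e.g. 150 million rows at an assumed ~40 bytes per row is ~6 GB,
# which suggests on the order of a few dozen partitions:
rows, bytes_per_row = 150_000_000, 40
print(suggest_partitions(rows * bytes_per_row))  # -> 45
```

The useful direction of the heuristic is the one Jörn raises below: far more partitions than this estimate mostly buys per-task scheduling overhead rather than parallelism.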

Re: slow SQL query with cached dataset

2016-04-28 Thread Mich Talebzadeh
Hi Imran, "How do I tell if it's spilling to disk?" Well that is a very valid question. I do not have a quantitative metric to state that out of X GB of data in Spark, Y GB has been spilled to disk because of the volume of data. Unlike an RDBMS Spark uses memory as opposed to shared
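One place such a quantitative figure does exist is the Storage tab of the Spark web UI, which reports "Size in Memory" and "Size on Disk" for each cached RDD; the spilled fraction is then simply the on-disk share. A minimal sketch of that arithmetic (the two byte counts are hypothetical inputs you would read off the UI, not values from this thread):

```python
def spilled_fraction(size_in_memory_bytes, size_on_disk_bytes):
    """Fraction of a cached dataset that spilled to disk, computed from
    the 'Size in Memory' and 'Size on Disk' figures shown per cached
    RDD in the Spark UI's Storage tab."""
    total = size_in_memory_bytes + size_on_disk_bytes
    return size_on_disk_bytes / total if total else 0.0

# e.g. 4 GB held in memory, 2 GB spilled to disk -> one third spilled:
print(spilled_fraction(4 * 1024**3, 2 * 1024**3))
```

Any non-zero "Size on Disk" means the cache did not fit in memory, which is usually the first thing to check before reasoning about logs.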

Re: slow SQL query with cached dataset

2016-04-25 Thread Jörn Franke
I do not know your data, but it looks like you have too many partitions for such a small data set. > On 26 Apr 2016, at 00:47, Imran Akbar wrote: > > Hi, > > I'm running a simple query like this through Spark SQL: > > sqlContext.sql("SELECT MIN(age) FROM data WHERE

Re: slow SQL query with cached dataset

2016-04-25 Thread Mich Talebzadeh
Are you sure it is not spilling to disk? How many rows are cached in your result set -> sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)") HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw