Thanks Dr. Mich, Jorn,
It's about 150 million rows in the cached dataset. How do I tell if it's
spilling to disk? I didn't see any logs to that effect.
How do I determine the optimal number of partitions for a given input
dataset? How many is too many?
regards,
imran
On Mon, Apr 25, 2016
Hi Imran,
" How do I tell if it's spilling to disk?"
Well, that is a very valid question. I do not have a quantitative metric
that would let me state that, out of X GB of data in Spark, Y GB has been
spilled to disk because of the volume of data.
Unlike an RDBMS, Spark uses memory as opposed to shared
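One way to check, sketched below in Scala (assuming a Spark 1.x spark-shell
where sqlContext and sc already exist, and a cache persisted with
MEMORY_AND_DISK; the DataFrame name is illustrative, not from your job):
sc.getRDDStorageInfo reports, per cached RDD, how many bytes sit in memory
and how many have been written to disk. The Storage tab of the Spark UI
shows the same figures.

import org.apache.spark.storage.StorageLevel

val cached = sqlContext.sql("SELECT * FROM raw WHERE dt_year = 2015 OR dt_year = 2016")
cached.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // force the cache to materialise

// memSize = bytes held in memory, diskSize = bytes spilled to disk
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: memory=${info.memSize} B, disk=${info.diskSize} B, " +
          s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}

If diskSize stays at 0 after the count, the cache fits in memory.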
I do not know your data, but it looks as if you have too many partitions for
such a small data set; see the sketch below.
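As a rough rule of thumb the Spark tuning guide suggests 2-3 tasks per CPU
core, and for Spark SQL the number of post-shuffle partitions is governed by
spark.sql.shuffle.partitions (200 by default). A quick sketch of both knobs,
again assuming a Spark 1.x sqlContext; the value 64 is purely illustrative:

// check how many partitions the cached result actually has
val df = sqlContext.sql("SELECT * FROM raw WHERE dt_year = 2015 OR dt_year = 2016")
println(df.rdd.partitions.length)

// lower the shuffle parallelism for aggregations such as MIN(age) ...
sqlContext.setConf("spark.sql.shuffle.partitions", "64")   // illustrative value

// ... or collapse an over-partitioned DataFrame without a full shuffle
val fewer = df.coalesce(64)
println(fewer.rdd.partitions.length)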
> On 26 Apr 2016, at 00:47, Imran Akbar wrote:
>
> Hi,
>
> I'm running a simple query like this through Spark SQL:
>
> sqlContext.sql("SELECT MIN(age) FROM data WHERE
Are you sure it is not spilling to disk?
How many rows are cached in your result set, i.e. the result of
sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")?
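One hedged way to answer that (Scala, Spark 1.x APIs; the temp table name
"raw_filtered" is made up for the example) is to cache the filtered result
as a table, count it, and then read its footprint off the Storage tab:

val filtered = sqlContext.sql("SELECT * FROM raw WHERE (dt_year=2015 OR dt_year=2016)")
filtered.registerTempTable("raw_filtered")   // illustrative table name
sqlContext.cacheTable("raw_filtered")

// COUNT(*) materialises the cache and gives the number of rows cached
val rows = sqlContext.sql("SELECT COUNT(*) FROM raw_filtered").collect()(0).getLong(0)
println(s"$rows rows cached")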
HTH
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw