That's exactly what I am doing, but my question is whether sc.parallelize sends the data to a worker node. From a performance perspective on small data sets, the ideal would be to keep the data in the local JVM memory of the driver. Even designating the current machine as a worker node, in addition to the driver, would still mean communication over the localhost loopback interface. I gather Spark is a batch-oriented system, and I am still checking whether there are ways to use it like this too: load the data manually, but process it with the functional and other Spark libraries, without the distribution or map/reduce part.
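For what it's worth, one way to keep everything in a single JVM is to run Spark in local mode, where the driver and executor share the same process. A minimal sketch (object name and sample data are mine, just for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalAggregate {
  def main(args: Array[String]): Unit = {
    // "local[1]" runs the driver and a single executor thread in the
    // same JVM, so parallelize() keeps the data in-process -- there is
    // no network hop to a separate worker node.
    val conf = new SparkConf()
      .setAppName("local-aggregate")
      .setMaster("local[1]")
    val sc = new SparkContext(conf)

    // Hypothetical small dataset, loaded manually (e.g. from HBase).
    val data = Seq(1L, 2L, 3L, 4L)

    // numSlices = 1: a single partition, processed locally.
    val rdd = sc.parallelize(data, numSlices = 1)
    val sum = rdd.reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}
```

With a master of `local[1]` the aggregation still goes through Spark's RDD machinery, but the data never leaves the driver JVM.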
________________________________
From: Andy Dang <nam...@gmail.com>
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1

Can't you just load the data from HBase first, and then call sc.parallelize on your dataset?

-Andy

-------
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com<mailto:nicolae.maras...@adswizz.com>> wrote:

Hi,

When calling sc.parallelize(data, 1), is there a preference where to put the data? I see two possibilities: sending it to a worker node, or keeping it on the driver program.

I would prefer to keep the data local to the driver. The use case is when I need to load just a bit of data from HBase, and then compute over it, e.g. aggregate, using Spark.

Thanks,
Nicu
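Andy's suggestion of loading from HBase first and then calling sc.parallelize might look roughly like the sketch below. The table name, column family, and qualifier are placeholders, and it assumes the HBase 1.x client API:

```scala
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object HBaseThenParallelize {
  def main(args: Array[String]): Unit = {
    // 1. Load the small dataset with the plain HBase client API,
    //    entirely on the driver -- no Spark involved yet.
    val hbaseConf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val table = connection.getTable(TableName.valueOf("my_table")) // placeholder
    val values: Seq[Long] =
      try {
        table.getScanner(new Scan()).asScala.map { result =>
          // Placeholder column family "cf" and qualifier "count".
          Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")))
        }.toList
      } finally {
        table.close()
        connection.close()
      }

    // 2. Hand the in-memory collection to Spark and aggregate.
    val sc = new SparkContext(
      new SparkConf().setAppName("hbase-agg").setMaster("local[1]"))
    val total = sc.parallelize(values, numSlices = 1).reduce(_ + _)
    println(s"total = $total")
    sc.stop()
  }
}
```

Since the collection already sits in driver memory, parallelize just wraps it in an RDD; in local mode the subsequent aggregation stays in the same JVM.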