That's exactly what I am doing, but my question is whether parallelize sends the
data to a worker node. From a performance perspective, on small datasets the ideal
would be to keep the data in the local JVM memory of the driver. Even designating
the current machine as a worker node, alongside the driver, would still mean
communication over the localhost loopback interface. I understand Spark is a
batch-oriented system, but I am still checking whether it can be used this way
too: load the data manually, then process it with the functional and other Spark
libraries, without the distribution or map/reduce part.
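
For what it's worth, a sketch of what I have in mind (names and data are made up,
not from any real code): with a local master, the driver and executors share one
JVM, so sc.parallelize over driver-fetched data should never cross the network,
not even over localhost.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalAggregateSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs the driver and all executor threads in this one JVM,
    // so there is no lo/net hop at all.
    val conf = new SparkConf()
      .setAppName("local-hbase-aggregate")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical stand-in for rows already fetched from HBase
    // by the driver itself.
    val rows = Seq(("a", 1), ("a", 2), ("b", 3))

    // The RDD is built from driver-local memory; with a local master
    // the aggregation also executes in this same JVM.
    val totals = sc.parallelize(rows, 1)
      .reduceByKey(_ + _)
      .collectAsMap()

    println(totals) // e.g. Map(a -> 3, b -> 3)
    sc.stop()
  }
}
```

This still goes through Spark's task scheduler, but entirely in-process, which I
believe is the closest Spark gets to "no distribution".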


________________________________
From: Andy Dang <nam...@gmail.com>
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1

Can't you just load the data from HBase first, and then call sc.parallelize on 
your dataset?

-Andy

-------
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu 
<nicolae.maras...@adswizz.com> wrote:

Hi,


When calling sc.parallelize(data, 1), is there a preference for where the data is
placed? I see two possibilities: sending it to a worker node, or keeping it in the
driver program.


I would prefer to keep the data local to the driver. The use case is one where I
just need to load a small amount of data from HBase and then compute over it,
e.g. aggregate, using Spark.


Thanks,

Nicu
