If you want to process the data locally, why do you need to use sc.parallelize?

Store the data in regular Scala collections and use their methods to
process it (they have pretty much the same set of methods as Spark
RDDs). Then, once you're happy with the result, hand the pre-processed
data to Spark.
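
For example, a rough sketch (loadFromHBase is just a hypothetical
stand-in for however you fetch your rows, and sc is your SparkContext):

    // plain Scala collections: filter/groupBy/map all run in the driver JVM
    val rows: Seq[(String, Long)] = loadFromHBase()  // hypothetical loader
    val aggregated = rows
      .filter { case (_, value) => value > 0 }
      .groupBy { case (key, _) => key }
      .map { case (key, group) => key -> group.map(_._2).sum }

    // hand the (already reduced) data to Spark only if you still need it there
    val rdd = sc.parallelize(aggregated.toSeq, 1)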

Or you can run Spark in "local" mode, in which case the executor runs
in the same JVM as the driver, so nothing ever leaves the local process.
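
Something along these lines (a minimal sketch; smallDataset is whatever
collection you've already loaded):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" runs the driver and executors in this single JVM,
    // so parallelize never touches the network; use "local" for one core
    val conf = new SparkConf().setAppName("local-aggregation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(smallDataset, 1)  // data stays in this JVM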

Unless I'm misunderstanding what it is you're trying to achieve here?


On Wed, Sep 30, 2015 at 10:25 AM, Nicolae Marasoiu
<nicolae.maras...@adswizz.com> wrote:
> That's exactly what I am doing, but my question is whether parallelize
> sends the data to a worker node. From a performance perspective on small
> sets, the ideal would be to load it into the local JVM memory of the
> driver. Even designating the current machine as a worker node, besides the
> driver, would still mean localhost loopback/network communication. I guess
> Spark is a batch-oriented system, and I am still checking whether there are
> ways to use it like this too: load the data manually, but process it with
> the functional and other Spark libraries, without the distribution or
> map/reduce part.
>
>
>
> ________________________________
> From: Andy Dang <nam...@gmail.com>
> Sent: Wednesday, September 30, 2015 8:17 PM
> To: Nicolae Marasoiu
> Cc: user@spark.apache.org
> Subject: Re: sc.parallelize with defaultParallelism=1
>
> Can't you just load the data from HBase first, and then call sc.parallelize
> on your dataset?
>
> -Andy
>
> -------
> Regards,
> Andy (Nam) Dang
>
> On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu
> <nicolae.maras...@adswizz.com> wrote:
>>
>> Hi,
>>
>>
>> When calling sc.parallelize(data, 1), is there a preference for where the
>> data is placed? I see two possibilities: sending it to a worker node, or
>> keeping it in the driver program.
>>
>>
>> I would prefer to keep the data local to the driver. The use case is when
>> I just need to load a bit of data from HBase and then compute over it,
>> e.g. aggregate, using Spark.
>>
>>
>> Thanks,
>>
>> Nicu
>
>



-- 
Marcelo
