Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Andy Dang
Can't you just load the data from HBase first, and then call sc.parallelize
on your dataset?
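
Roughly something like this (just a sketch: it assumes an already-created
SparkContext named sc, and the table name "metrics", column family "cf" and
qualifier "count" are made up):

  import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.util.Bytes
  import scala.collection.JavaConverters._

  // Read the small table with the plain HBase client API, on the driver.
  val hbaseConf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val table = connection.getTable(TableName.valueOf("metrics"))
  val scanner = table.getScanner(new Scan())

  val values: Vector[Long] = scanner.iterator().asScala
    .map(r => Bytes.toLong(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count"))))
    .toVector

  scanner.close(); table.close(); connection.close()

  // Hand the local collection to Spark; a single partition keeps the tiny
  // dataset together, and an action will still run it on an executor.
  val rdd = sc.parallelize(values, numSlices = 1)
  println(rdd.sum())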

-Andy

---
Regards,
Andy (Nam) Dang

On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <
nicolae.maras...@adswizz.com> wrote:

> Hi,
>
>
> When calling sc.parallelize(data, 1), is there a preference for where to
> put the data? I see two possibilities: sending it to a worker node, or
> keeping it in the driver program.
>
>
> I would prefer to keep the data local to the driver. The use case is when
> I just need to load a bit of data from HBase and then compute over it
> (e.g. aggregate) using Spark.
>
>
> Thanks,
>
> Nicu
>


Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
That's exactly what I am doing, but my question is whether parallelize sends the
data to a worker node. From a performance perspective, for small data sets the
ideal would be to keep the data in the local JVM memory of the driver. Even
designating the current machine as a worker node, alongside the driver, would
still mean localhost loopback/network communication. I understand Spark is a
batch-oriented system, and I am still checking whether there are ways to use it
like this too: load the data manually, but process it with the functional and
other Spark libraries, without the distribution or map/reduce part.
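
One way I was thinking of checking this empirically (just a sketch; smallData
stands in for whatever small collection I load, and sc is an existing
SparkContext): print the host that actually computes the single partition and
compare it with the driver's host.

  // Hypothetical locality check: which machine ends up computing the
  // single partition of the parallelized collection?
  val rdd = sc.parallelize(smallData, numSlices = 1)
  val hosts = rdd.mapPartitions { iter =>
    Iterator((java.net.InetAddress.getLocalHost.getHostName, iter.size))
  }.collect()
  hosts.foreach { case (host, n) =>
    println(s"partition with $n elements computed on $host")
  }
  println(s"driver runs on ${java.net.InetAddress.getLocalHost.getHostName}")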



Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Marcelo Vanzin
If you want to process the data locally, why do you need to use sc.parallelize?

Store the data in regular Scala collections and use their methods to
process them (they have pretty much the same set of methods as Spark
RDDs). Then when you're happy, finally use Spark to process the
pre-processed input data.
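
Something like this, roughly (the case class, field names and the
loadFromHBase() helper below are placeholders, and sc is an existing
SparkContext):

  case class Reading(key: String, value: Long)

  // Plain Scala collections on the driver: same combinators as RDDs.
  val rows: Seq[Reading] = loadFromHBase()   // placeholder local loader
  val aggregated: Map[String, Long] =
    rows
      .filter(_.value > 0)
      .groupBy(_.key)
      .map { case (k, vs) => k -> vs.map(_.value).sum }

  // Only involve Spark once there is actually distributed work to do.
  val rdd = sc.parallelize(aggregated.toSeq)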

Or you can run Spark in "local" mode, in which case the executor(s)
run in the same JVM as the driver.
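
For example (a minimal sketch; the app name is arbitrary):

  import org.apache.spark.{SparkConf, SparkContext}

  // Local mode: driver and executor threads share one JVM, so
  // parallelize() does not send the collection over the network.
  val conf = new SparkConf().setAppName("small-aggregation").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val total = sc.parallelize(1 to 100, 1).reduce(_ + _)
  println(total)

  sc.stop()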

Unless I'm misunderstanding what it is you're trying to achieve here?


-- 
Marcelo




sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
Hi,


When calling sc.parallelize(data, 1), is there a preference for where to put
the data? I see two possibilities: sending it to a worker node, or keeping it
in the driver program.


I would prefer to keep the data local to the driver. The use case is when I
just need to load a bit of data from HBase and then compute over it
(e.g. aggregate) using Spark.


Thanks,

Nicu