Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-23 Thread Neil Chang
One example of using dapply is to apply linear regression to many small
partitions.
I think R can do that with parallelism too, but I heard dapply is faster.
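
As a rough sketch of that use case (a sketch only, assuming Spark 2.0's
SparkR dapply API; the mtcars data and the mpg ~ wt model are placeholders),
dapply calls its function once per partition with a plain R data.frame, so
each call can fit its own model:

library(SparkR)
sparkR.session()   # Spark 2.0 entry point

df <- createDataFrame(mtcars)   # stand-in for a real, already-partitioned SparkDataFrame

# output schema: one row of coefficients per partition
schema <- structType(structField("intercept", "double"),
                     structField("slope", "double"))

fits <- dapply(df, function(pdf) {
  fit <- lm(mpg ~ wt, data = pdf)   # pdf is the R data.frame for one partition
  data.frame(intercept = unname(coef(fit)[1]),
             slope = unname(coef(fit)[2]))
}, schema)

head(collect(fits))   # one row of fitted coefficients per partition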

On Friday, July 22, 2016, Pedro Rodriguez  wrote:

> I haven't used SparkR/R before, only the Scala/Python APIs, so I don't know
> for sure.
>
> I am guessing that if things are in a DataFrame, they were either read from
> some disk source (S3/HDFS/file/etc.) or created from parallelize. For the
> first, Spark will for the most part choose a reasonable number of
> partitions; for parallelize, I think it depends on what your minimum
> parallelism is set to.
>
> From a brief Google search, it looks like dapply is an analogue of
> mapPartitions. Usually the reason to use this is that your map operation
> has some expensive initialization step. For example, if you need to open a
> connection to a database, it's better to re-use that connection for all of
> one partition's elements than to create it for each element.
>
> What are you trying to accomplish with dapply?
>
> On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang wrote:
>
>> Thanks Pedro,
>>   so to use SparkR dapply on a SparkDataFrame, don't we need to partition
>> the DataFrame first? The example in the docs doesn't seem to do this.
>> Without knowing how it is partitioned, how can one write the function to
>> process each partition?
>>
>> Neil
>>
>> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez wrote:
>>
>>> This should work, and I don't think it triggers any actions:
>>>
>>> df.rdd.partitions.length
>>>
>>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang wrote:
>>>
 Seems no function does this in Spark 2.0 preview?

>>>
>>>
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
I haven't used SparkR/R before, only the Scala/Python APIs, so I don't know
for sure.

I am guessing that if things are in a DataFrame, they were either read from
some disk source (S3/HDFS/file/etc.) or created from parallelize. For the
first, Spark will for the most part choose a reasonable number of
partitions; for parallelize, I think it depends on what your minimum
parallelism is set to.
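
For what it's worth, a minimal SparkR sketch of those two cases (the file
path is a placeholder, and how strongly spark.default.parallelism influences
createDataFrame is an assumption here, not something verified):

library(SparkR)
# spark.default.parallelism is the usual knob for the parallelize-style path
sparkR.session(sparkConfig = list(spark.default.parallelism = "8"))

# read from storage: the partition count generally follows the input splits
dfFromDisk <- read.df("hdfs:///path/to/data.parquet", source = "parquet")

# built from local R data: the parallelize-style case, where the split
# depends on the parallelism settings (again, an assumption)
dfFromLocal <- createDataFrame(iris)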

From a brief Google search, it looks like dapply is an analogue of
mapPartitions. Usually the reason to use this is that your map operation has
some expensive initialization step. For example, if you need to open a
connection to a database, it's better to re-use that connection for all of
one partition's elements than to create it for each element.
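
A sketch of that pattern with dapply (connectToDb, lookupRows, and closeDb
are invented stand-ins for whatever client library you would actually use;
df and the output schema are placeholders too):

# df is an existing SparkDataFrame; library(SparkR) / sparkR.session() assumed
enriched <- dapply(df, function(pdf) {
  conn <- connectToDb()           # expensive setup, paid once per partition ...
  out  <- lookupRows(conn, pdf)   # ... and reused for every row in the partition
  closeDb(conn)
  out
}, structType(structField("id", "integer"),
              structField("value", "double")))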

What are you trying to accomplish with dapply?

On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang  wrote:

> Thanks Pedro,
>   so to use SparkR dapply on a SparkDataFrame, don't we need to partition
> the DataFrame first? The example in the docs doesn't seem to do this.
> Without knowing how it is partitioned, how can one write the function to
> process each partition?
>
> Neil
>
> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez 
> wrote:
>
>> This should work, and I don't think it triggers any actions:
>>
>> df.rdd.partitions.length
>>
>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:
>>
>>> Seems no function does this in Spark 2.0 preview?
>>>
>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Thanks Pedro,
  so to use SparkR dapply on a SparkDataFrame, don't we need to partition
the DataFrame first? The example in the docs doesn't seem to do this.
Without knowing how it is partitioned, how can one write the function to
process each partition?

Neil
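
For what it's worth, one way to take explicit control before calling dapply
is the public repartition API; a minimal sketch (df, the count of 10, and
the trivial per-partition function are placeholders):

# df is an existing SparkDataFrame; library(SparkR) / sparkR.session() assumed
df10 <- repartition(df, 10L)   # dapply will now see exactly 10 partitions
sizes <- dapply(df10,
                function(pdf) data.frame(rows = nrow(pdf)),   # pdf = one partition
                structType(structField("rows", "integer")))
collect(sizes)                 # one row per partition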

On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez 
wrote:

> This should work, and I don't think it triggers any actions:
>
> df.rdd.partitions.length
>
> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:
>
>> Seems no function does this in Spark 2.0 preview?
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
This should work, and I don't think it triggers any actions:

df.rdd.partitions.length
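
Note that df.rdd.partitions.length is the Scala API; SparkR does not appear
to expose an equivalent public call on a SparkDataFrame. A rough SparkR-only
workaround, sketched below, is to let dapply emit one row per partition and
count the rows. Unlike the Scala one-liner it does run a job, and it assumes
dapply invokes the function even for empty partitions:

# df is an existing SparkDataFrame; library(SparkR) / sparkR.session() assumed
partRows <- dapply(df,
                   function(pdf) data.frame(n = nrow(pdf)),   # one output row per partition
                   structType(structField("n", "integer")))
nrow(collect(partRows))   # number of partitions (runs a Spark job)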

On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:

> Seems no function does this in Spark 2.0 preview?
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Seems no function does this in Spark 2.0 preview?