This has nothing to do with object serialization/deserialization. It is 
expected behavior that take(1) will usually run slower than count() on an empty RDD.

This is all about the algorithm with which take() is implemented. take():
1. Reads one partition to fetch the first elements.
2. If the fetched elements do not satisfy the requested limit, estimates the 
number of additional partitions to read and fetches elements from them.
take() repeats step 2 until it gets the desired number of elements or it has 
gone through all partitions.

So take(1) on an empty RDD ends up going through all partitions sequentially, round by round.
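
For illustration, here is a minimal runnable Scala sketch of that scan-and-scale 
loop, modeling the RDD as a plain Seq of partitions. This is an approximation of 
the algorithm described above, not the actual Spark source; the real code in 
RDD.take() differs in details such as the exact scale-up factor.

object TakeSketch {
  def take[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
    val totalParts = partitions.length
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // Round 1 reads a single partition; later rounds estimate how many
      // more partitions to try based on what earlier rounds yielded.
      val numPartsToTry =
        if (partsScanned == 0) 1
        else if (buf.isEmpty) partsScanned * 4  // saw nothing yet: scale up
        else math.ceil(1.5 * num * partsScanned / buf.size).toInt
      val upTo = math.min(partsScanned + numPartsToTry, totalParts)
      // Each round corresponds to one job; on an empty RDD every round
      // yields nothing, so the loop walks all partitions before giving up.
      partitions.slice(partsScanned, upTo).foreach { p =>
        buf ++= p.take(num - buf.size)
      }
      partsScanned = upTo
    }
    buf.toSeq
  }

  def main(args: Array[String]): Unit = {
    val empty = Seq.fill(200)(Seq.empty[Int])  // 200 empty partitions
    println(take(empty, 1))  // List(): scanned in rounds of 1, 4, 20, ...
  }
}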

Compared with take(), count() also computes all partitions, but it does so in a 
single job that evaluates all partitions in parallel at once.
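
Conceptually, count() launches one job that computes each partition's size in 
parallel and sums the results on the driver. A rough sketch of that idea, 
assuming a live SparkContext (it mirrors, but is not, the actual Spark source):

import org.apache.spark.rdd.RDD

// One runJob call = one job over every partition at once; the driver
// only receives one Long per partition and adds them up.
def count[T](rdd: RDD[T]): Long =
  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
    var n = 0L
    while (iter.hasNext) { iter.next(); n += 1 }
    n
  }).sum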

The take() implementation in SparkR is less optimized than the Scala one: SparkR 
does not estimate the number of additional partitions to read, but fetches just 
one partition per round.
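
Under the same toy model as the earlier sketch, that amounts to dropping the 
estimate and fetching exactly one partition per round, so an empty 200-partition 
RDD costs 200 sequential rounds. This matches the one-job-per-partition behavior 
reported below. Again a sketch, not the actual SparkR source:

// Same model as above, but with no scale-up estimate: exactly one
// partition per round, as described for SparkR's take().
def takeOnePerRound[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var i = 0
  while (buf.size < num && i < partitions.length) {
    buf ++= partitions(i).take(num - buf.size)  // one round (job) per partition
    i += 1
  }
  buf.toSeq
}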

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Wednesday, March 2, 2016 3:37 AM
To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: SparkR Count vs Take performance

Yeah one surprising result is that you can't call isEmpty on an RDD of 
nonserializable objects. You can't do much with an RDD of nonserializable 
objects anyway, but they can exist as an intermediate stage.

We could fix that pretty easily with a little copy and paste of the
take() code; right now isEmpty is simple but has this drawback.

On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Filho 
<dirceu.semigh...@gmail.com> wrote:
> Great, I hadn't noticed this isEmpty method.
> Well, serialization has been a problem in this project; we have noticed 
> a lot of time being spent serializing and deserializing things to 
> send to and get from the cluster.
>
> 2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:
>>
>> There is an "isEmpty" method that basically does exactly what your 
>> second version does.
>>
>> I have seen it be unusually slow at times because it must copy 1 
>> element to the driver, and it's possible that's slow. It still 
>> shouldn't be slow in general, and I'd be surprised if it's slower 
>> than a count in all but pathological cases.
>>
>>
>>
>> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho 
>> <dirceu.semigh...@gmail.com> wrote:
>> > Hello all.
>> > I have a script that creates a dataframe from this operation:
>> >
>> > mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
>> >
>> > rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
>> > dFrame <- 
>> > join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)
>> >
>> > After filtering this dFrame, I tried to execute the following:
>> >
>> > filteredDF <- filterRDD(toRDD(dFrame), function(row) {
>> >   row['COLUMN'] %in% c("VALUES", ...)
>> > })
>> >
>> > Now I need to know if the resulting dataframe is empty, and to do 
>> > that I tried these two pieces of code:
>> > if(count(filteredDF) > 0)
>> > and
>> > if(length(take(filteredDF,1)) > 0)
>> > I thought that the second one, using take, should run faster than 
>> > count, but that didn't happen.
>> > The take operation creates one job per partition of my RDD (which 
>> > had 200 partitions), and this makes it run slower than the count.
>> > Is this the expected behaviour?
>> >
>> > Regards,
>> > Dirceu
>
>
