Thanks Sun, this explains why I was getting so many jobs running: my RDDs
were empty.



2016-03-02 10:29 GMT-03:00 Sun, Rui <rui....@intel.com>:

> This has nothing to do with object serialization/deserialization. It is
> expected behavior that take(1) most likely runs slower than count() on an
> empty RDD.
>
> This is all about the algorithm with which take() is implemented. take():
> 1. Reads one partition to get the elements.
> 2. If the fetched elements do not satisfy the limit, estimates the number
> of additional partitions to scan and fetches the elements in them.
> take() repeats step 2 until it gets the desired number of elements or has
> gone through all partitions.
>
> So take(1) on an empty RDD will go through all partitions in a sequential
> way.
>
> Compared with take(), count() also computes all partitions, but the
> computation runs on all partitions at once, in a single parallel job.
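>
> For example, a quick way to see the difference (assuming a SparkR context
> sc; note the RDD functions may be internal, SparkR:::-prefixed, depending
> on the release):
>
> emptyRDD <- filterRDD(parallelize(sc, 1:200, 200L), function(x) FALSE)
> system.time(count(emptyRDD))    # one job, all 200 partitions in parallel
> system.time(take(emptyRDD, 1))  # one job per partition, one after another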
>
> The take() implementation in SparkR is less optimized than the one in
> Scala: SparkR won't estimate the number of additional partitions, but will
> read just one partition in each fetch.
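>
> The SparkR loop is roughly shaped like this (an illustrative sketch, not
> the actual source; numPartitions() and collectPartition() here just stand
> for "ask how many partitions there are" and "run one job that fetches a
> single partition to the driver"):
>
> takeSketch <- function(rdd, num) {
>   results <- list()
>   partIndex <- 0
>   while (length(results) < num && partIndex < numPartitions(rdd)) {
>     # On an empty RDD nothing is ever found, so this launches one job per
>     # partition, sequentially, until all partitions are exhausted.
>     results <- c(results, collectPartition(rdd, partIndex))
>     partIndex <- partIndex + 1
>   }
>   results[seq_len(min(num, length(results)))]
> }
>
> The Scala version has the same shape, but grows the number of partitions
> fetched per round instead of always fetching one.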
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, March 2, 2016 3:37 AM
> To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: SparkR Count vs Take performance
>
> Yeah one surprising result is that you can't call isEmpty on an RDD of
> nonserializable objects. You can't do much with an RDD of nonserializable
> objects anyway, but they can exist as an intermediate stage.
>
> We could fix that pretty easily with a little copy and paste of the
> take() code; right now isEmpty is simple but has this drawback.
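>
> The idea would be something like this (a sketch only, in SparkR-style R
> with hypothetical helper names; the point is that only a boolean comes
> back to the driver, so the elements never need to be serializable, and a
> real fix would also keep take()'s incremental partition scanning rather
> than computing everything):
>
> hasAnyElement <- function(rdd) {
>   # Each partition reports TRUE/FALSE instead of shipping an element back.
>   flags <- collect(mapPartitions(rdd, function(part) list(length(part) > 0)))
>   any(unlist(flags))
> }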
>
> On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
> > Great, I hadn't noticed this isEmpty method.
> > Well, serialization has been a problem in this project; we have noticed
> > a lot of time being spent serializing and deserializing things to send
> > to and get back from the cluster.
> >
> > 2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:
> >>
> >> There is an "isEmpty" method that basically does exactly what your
> >> second version does.
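> >>
> >> That is, roughly (a one-liner sketch in SparkR terms; isEmpty itself
> >> may not be exposed in SparkR, depending on the version):
> >>
> >> isEmptyRDD <- function(rdd) {
> >>   length(take(rdd, 1)) == 0
> >> }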
> >>
> >> I have seen it be unusually slow at times because it must copy 1
> >> element to the driver, and it's possible that's slow. It still
> >> shouldn't be slow in general, and I'd be surprised if it's slower
> >> than a count in all but pathological cases.
> >>
> >>
> >>
> >> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> >> <dirceu.semigh...@gmail.com> wrote:
> >> > Hello all,
> >> > I have a script that creates a dataframe from this operation:
> >> >
> >> > mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")
> >> >
> >> > rSparkDf <- createPartitionedDataFrame(sqlContext, myRdataframe)
> >> > dFrame <-
> >> >   join(mytable, rSparkDf, mytable$ID_PRODUCT == rSparkDf$ID_PRODUCT)
> >> >
> >> > After that, I filtered this dFrame by executing the following:
> >> >
> >> > filteredDF <- filterRDD(toRDD(dFrame), function(row) {
> >> >   row['COLUMN'] %in% c("VALUES", ...)
> >> > })
> >> >
> >> > Now I need to know whether the resulting dataframe is empty, and to
> >> > do that I tried these two snippets:
> >> >
> >> > if(count(filteredDF) > 0)
> >> >
> >> > and
> >> >
> >> > if(length(take(filteredDF, 1)) > 0)
> >> >
> >> > I thought that the second one, using take, should run faster than
> >> > count, but that didn't happen. The take operation creates one job per
> >> > partition of my RDD (which has 200 partitions), and this makes it run
> >> > slower than the count.
> >> > Is this the expected behaviour?
> >> >
> >> > Regards,
> >> > Dirceu
> >
> >
>
