Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
From: Sean Owen [mailto:so...@cloudera.com] > Sent: Wednesday, March 2, 2016 3:37 AM > To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com> > Cc: user <user@spark.apache.org> > Subject: Re: SparkR Count vs Take performance > > Yeah one surprising result is that you can't call i

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
fetch. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, March 2, 2016 3:37 AM To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com> Cc: user <user@spark.apache.org> Subject: Re: SparkR Count vs Take performance Yeah one surprising result is th

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
Yeah one surprising result is that you can't call isEmpty on an RDD of nonserializable objects. You can't do much with an RDD of nonserializable objects anyway, but they can exist as an intermediate stage. We could fix that pretty easily with a little copy and paste of the take() code; right now
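The failure mode described above can be illustrated outside Spark: a count-style operation never moves elements to the driver, while a take(1)-style operation must serialize one element and ship it back. A minimal Python sketch of that distinction (the `Unserializable` class is a hypothetical stand-in, not a Spark type):

```python
import pickle

# Hypothetical stand-in for a non-serializable object; real examples
# include objects wrapping sockets or native handles.
class Unserializable:
    def __reduce__(self):
        raise TypeError("this object cannot be serialized")

data = [Unserializable()]

# A count-style check never moves elements off the workers,
# so it works even when the contents are non-serializable.
assert len(data) == 1

# A take(1)/isEmpty-style check must ship one element back to the
# driver, which requires serialization -- and fails here.
try:
    pickle.dumps(data[0])
    serialized = True
except TypeError:
    serialized = False
assert not serialized
```

This mirrors why an RDD of non-serializable objects can exist as an intermediate stage (nothing is transferred) yet breaks as soon as isEmpty tries to fetch an element.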

Re: SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Great, I hadn't noticed this isEmpty method. Serialization has been a problem in this project; we have noticed a lot of time being spent serializing and deserializing things to send to and get from the cluster. 2016-03-01 15:47 GMT-03:00 Sean Owen : > There is an "isEmpty"
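The serialization overhead mentioned above can be made concrete with a small sketch: every driver/cluster round trip pays a serialize-on-send plus deserialize-on-receive cost, so chatty exchanges of large payloads add up. This uses Python's `pickle` purely as an analogy for Spark's wire serialization:

```python
import pickle

# Simulate one driver <-> cluster round trip: data is serialized on
# the way out and deserialized on the way back.
payload = list(range(100_000))
wire_bytes = pickle.dumps(payload)   # cost paid on every send
result = pickle.loads(wire_bytes)    # cost paid on every receive

# The round trip is lossless, but both steps burn CPU, and the
# serialized form still has to cross the network.
assert result == payload
```

The larger the objects crossing the boundary, the more time disappears into these two steps, which is why minimizing driver-side fetches (as isEmpty does, fetching at most one element) matters.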

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
There is an "isEmpty" method that basically does exactly what your second version does. I have seen it be unusually slow at times because it must copy 1 element to the driver, and it's possible that's slow. It still shouldn't be slow in general, and I'd be surprised if it's slower than a count in
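The semantics being compared in this thread can be sketched in a few lines: an isEmpty-style check implemented on top of take(1) stops at the first element it finds, while a count-style check scans every partition. A minimal Python model of the idea (not Spark's actual implementation, which runs distributed):

```python
def is_empty(partitions):
    # isEmpty-style check: ask for at most one element and return
    # as soon as anything turns up (mirrors take(1) semantics).
    for part in partitions:
        for _ in part:
            return False
    return True

def count_all(partitions):
    # count-style check: every partition is fully scanned.
    return sum(1 for part in partitions for _ in part)

parts = [[], [1, 2, 3], [4, 5]]
assert not is_empty(parts)    # stops after the first element found
assert count_all(parts) == 5  # touches all five elements
assert is_empty([[], []])
```

Even though isEmpty does less work in principle, it must copy that one element to the driver, which is the occasional slowness (and the serialization requirement) discussed elsewhere in the thread.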

SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Hello all. I have a script that creates a dataframe from this operation: mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable")) rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe) dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT) After filtering