Err, compiled for Spark 1.3.1, running on 1.5.1, if that makes any difference. 
The Spark dependency is “provided”, so it should be using the 1.5.1 code AFAIK.
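
For context, that means something like this in the build (a sketch only, assuming an sbt build; adjust for Maven):

    // build.sbt (sketch): “provided” keeps Spark out of the assembly jar,
    // so the cluster’s 1.5.1 classes are what actually run.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"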

The code is as you see below for isEmpty, so I’m not sure what else it could 
be measuring since it’s the only Spark call on the line. I can regen the 
timeline, but here is the .take(1) timeline. It is an order of magnitude faster 
(from my recollection), but even the take(1) still seems incredibly slow for an 
emptiness test. I was surprised that isEmpty is a distributed calculation. When 
run from the driver, couldn’t this value already have been calculated as a 
byproduct of creating the RDD?
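
For reference, isEmpty in 1.5.x is roughly the following (going by the PR referenced downthread), which is why it shows up as a distributed take(1) job rather than a driver-side check:

    // RDD.isEmpty, Spark 1.5.x (approximate; see the PR linked below)
    def isEmpty(): Boolean = withScope {
      partitions.length == 0 || take(1).length == 0
    }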

I could use an accumulator to count members as the RDD is created and get a 
negligible .isEmpty cost, right? The RDD creation might be slightly slower 
due to the accumulator.
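
Something like this sketch is what I have in mind (Spark 1.x accumulator API; parse is a hypothetical stand-in for whatever per-row work builds the RDD):

    // Count rows while the RDD is first materialized, then test
    // emptiness on the driver without launching another job.
    val rowCount = sc.accumulator(0L, "row count")

    val rdd = rawLines.map { line =>
      rowCount += 1L      // runs on the executors as partitions compute
      parse(line)         // hypothetical per-row transform
    }.cache()

    rdd.foreach(_ => ())  // force evaluation; accumulator values are only
                          // dependable on the driver after an action finishes

    val empty = rowCount.value == 0L

Task retries can make accumulator updates inside a map over-count, but for an emptiness test that’s harmless: retries only push the count further from zero.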

On Dec 9, 2015, at 9:29 AM, Sean Owen <so...@cloudera.com> wrote:

Are you sure it's isEmpty and not an upstream stage? isEmpty is
definitely the action here. It doesn't make sense that take(1) is so
much faster, since it's the "same thing".

On Wed, Dec 9, 2015 at 5:11 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Sure, I thought this might be a known issue.
> 
> I have a 122M dataset, which is the trust and rating data from epinions. The 
> data is split into two RDDs and there is an item properties RDD. The code is 
> just trying to remove any empty RDD from the list.
> 
> val esRDDs: List[RDD[(String, Map[String, Any])]] =
>   (correlators ::: properties).filterNot(c => c.isEmpty())
> 
> On my 16G MBP with 4g per executor and 4 executors, the isEmpty takes over a 
> hundred minutes (going from memory; I can supply the timeline given a few 
> hours to recalc it).
> 
> Running a different version of the code that does a .count (for debugging) and 
> a .take(1) instead of the .isEmpty, the count of one epinions RDD takes 8 
> minutes and the .take(1) takes 3 minutes.
> 
> Other users have seen a total runtime of 700 minutes on a 13G dataset, with the 
> execution time mostly spent in isEmpty.
> 
> 
> On Dec 9, 2015, at 8:50 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> It should at most collect 1 item to the driver. This means evaluating
> at least 1 element of 1 partition. I can imagine pathological cases
> where that's slow, but do you have any more info? How slow is slow,
> and what exactly is slow?
> 
> On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I’m getting *huge* execution times on a moderate-sized dataset during the
>> RDD.isEmpty. Everything in the calculation is fast except the RDD.isEmpty
>> step. I’m using Spark 1.5.1, and from researching I would expect this
>> calculation to be, at worst, linearly proportional to the number of
>> partitions, which should be a trivial amount of time, but it is taking many
>> minutes to hours to complete this single phase.
>> 
>> I know there has been a small amount of discussion about using this, so I would
>> love to hear what the current thinking on the subject is. Is there a better
>> way to find out whether an RDD has data? Can someone explain why this is happening?
>> 
>> reference PR
>> https://github.com/apache/spark/pull/4534
> 
