RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I’m getting *huge* execution times on a moderate-sized dataset during the RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation. I’m using Spark 1.5.1, and from researching I would expect this calculation to be linearly proportional to the number of partitions.

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but, do you have any more info? how slow is slow and what is slow?

On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I’m getting *huge* execution times on a moderate sized dataset during the
> RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation.
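Sean's description matches how RDD.isEmpty works: in Spark 1.5 it is roughly `partitions.length == 0 || take(1).length == 0`, so it should touch at most one element of one partition. Below is a minimal sketch of that semantics, modeling an "RDD" as a plain Seq of partitions; this is a stand-in for illustration, not the Spark API.

```scala
object IsEmptyDemo {
  // "RDD" modeled as a Seq of partitions. Like Spark's isEmpty, this
  // answers true when there are no partitions or no elements, and it
  // only ever pulls the first element it can find.
  def isEmpty[A](partitions: Seq[Seq[A]]): Boolean =
    partitions.isEmpty || partitions.iterator.flatMap(_.iterator).take(1).isEmpty

  def main(args: Array[String]): Unit = {
    assert(isEmpty[Int](Seq.empty))                      // no partitions at all
    assert(isEmpty(Seq(Seq.empty[Int], Seq.empty[Int]))) // partitions, no elements
    assert(!isEmpty(Seq(Seq.empty[Int], Seq(1, 2))))     // stops at the first element
  }
}
```

Because the probe stops at the first non-empty partition, the call itself should be near-constant cost; the pathological cases arise when computing that first element forces an expensive lineage to run.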

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
> It should at best collect 1 item to the driver. This means evaluating
> at least 1 element of 1 partition. I can imagine pathological cases
> where that's slow, but, do you have any more info? how slow is slow
> and what is slow?
>
> On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I’m getting *huge* execution times on a moderate sized dataset during the
>> RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation.

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I ran a 124M dataset on my laptop. With isEmpty it took 32 minutes; without isEmpty it took 18 minutes. All but 1.5 minutes were in writing to Elasticsearch, which is on the same laptop. So excluding the time writing to Elasticsearch, which was nearly the same in both cases, the core Spark code took roughly ten times longer with the isEmpty call.
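Working through those numbers: the Elasticsearch write was about 18 − 1.5 ≈ 16.5 minutes and nearly identical in both runs, so the non-ES Spark work went from roughly 1.5 minutes to roughly 15.5 minutes, around a 10x blow-up from a call that should read one element. The thread does not settle the cause, but one common explanation is that an uncached RDD's lineage is recomputed once for the isEmpty probe and again for the write. The sketch below models that effect with a lazy Scala view standing in for an uncached lineage; the Spark calls in the final comment, including the `writeToEs` placeholder, are illustrative only.

```scala
object RecomputeDemo {
  def main(args: Array[String]): Unit = {
    var evals = 0
    def expensive(x: Int): Int = { evals += 1; x * 2 }

    // A lazy view re-runs `expensive` on every traversal, much like the
    // lineage of an uncached RDD is re-run by every action.
    val data = (1 to 100).view.map(expensive)

    val probe = data.take(1).toList   // the "isEmpty" probe: 1 evaluation
    assert(probe.nonEmpty)
    val all = data.toList             // the "write": all 100 evaluated again
    assert(all.size == 100)
    assert(evals == 101)              // element 1 was computed twice

    // With Spark, persisting before probing avoids the double computation
    // (sketch only, needs a SparkContext):
    //   val rdd = raw.map(expensive).cache()
    //   if (!rdd.isEmpty()) writeToEs(rdd)   // writeToEs is a placeholder
  }
}
```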

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote:
> The “Any” is required by the code it is being passed to, which is the
> Elasticsearch Spark index writing code. The values are actually
> RDD[(String, Map[String, String])]

Is it frequently a big, big map by any chance?
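For context, the shape Sean is asking about looks like the following. The ids and field names here are invented for illustration, and the `saveToEs` call in the comment is the elasticsearch-spark (elasticsearch-hadoop) API, which needs a live cluster to run.

```scala
object DocShapeDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical documents in the shape Pat describes:
    // RDD[(String, Map[String, String])], an id paired with a field map.
    // A large map per key means a large payload serialized per document,
    // which is why its size matters.
    val docs: Seq[(String, Map[String, String])] = Seq(
      "item-1" -> Map("indicators" -> "item-7 item-9"),
      "item-2" -> Map("indicators" -> "item-3")
    )
    // With elasticsearch-spark this would be written as (sketch only):
    //   import org.elasticsearch.spark._
    //   sc.parallelize(docs).saveToEs("items/doc")
    assert(docs.forall { case (_, fields) => fields.nonEmpty })
  }
}
```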
