I’m getting *huge* execution times on a moderate-sized dataset during
RDD.isEmpty. Everything in the calculation is fast except the RDD.isEmpty
calculation. I’m using Spark 1.5.1 and from researching I would expect this
calculation to be linearly proportional to the number of partitions.
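[Editor's note: for context, in Spark releases around 1.5, `RDD.isEmpty` works by checking for zero partitions and otherwise trying to take one element, which forces that partition's whole lineage to run. The sketch below is a minimal plain-Python mimic of that logic, not Spark's actual code; `FakeRDD` and its thunk-per-partition model are invented for illustration.]

```python
# Mimic of isEmpty's logic: it must materialize at least one element of one
# partition, which runs that partition's entire compute chain ("lineage").

class FakeRDD:
    def __init__(self, partitions):
        # Each "partition" is a zero-arg function producing its elements
        # lazily, like an RDD partition's compute chain.
        self.partitions = partitions

    def take(self, n):
        out = []
        for compute in self.partitions:
            for x in compute():          # lineage for this partition runs here
                out.append(x)
                if len(out) == n:
                    return out
        return out

    def is_empty(self):
        # No partitions, or the first taken element does not exist.
        return len(self.partitions) == 0 or len(self.take(1)) == 0

evaluated = []
rdd = FakeRDD([
    lambda: (evaluated.append(0), [1, 2, 3])[1],
    lambda: (evaluated.append(1), [4, 5])[1],
])
print(rdd.is_empty())    # False
print(evaluated)         # [0] -- only the first partition was computed
```

So the check itself is cheap once one element exists; the cost is whatever it takes to produce that first element.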
> It should at best collect 1 item to the driver. This means evaluating
> at least 1 element of 1 partition. I can imagine pathological cases
> where that's slow, but, do you have any more info? how slow is slow
> and what is slow?
>
> On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I’m getting *huge* execution times on a moderate sized dataset during the
>> RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty
>> calculation. I’m using Spark 1.5.1 and from researching I would expect this
>> calculation to be linearly proportional to the number of partitions.
I ran a 124M dataset on my laptop:
with isEmpty it took 32 minutes;
without isEmpty it took 18 minutes, all but 1.5 minutes of which were spent
writing to Elasticsearch, which is on the same laptop.
So excluding the time writing to Elasticsearch, which was nearly the same in
both cases, the core Spark code went from roughly 1.5 minutes without the
isEmpty to about 15 minutes with it.
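[Editor's note: one plausible explanation for numbers like these, not confirmed anywhere in the thread, is that the RDD was not cached: `isEmpty` launches its own job, so an uncached lineage gets computed once for the emptiness check and again for the write. A toy plain-Python illustration of that effect (no Spark; `expensive_lineage` stands in for the real computation):]

```python
# Each Spark action re-runs the lineage unless the intermediate RDD is
# cached/persisted; here two "actions" on an uncached result double the work.

runs = {"count": 0}

def expensive_lineage():
    runs["count"] += 1              # stands in for the heavy computation
    return [("doc1", {"f": "v"})]

# Uncached: the isEmpty-style check and the write each run the lineage.
check = expensive_lineage()         # the isEmpty "job"
is_empty = len(check) == 0
to_write = expensive_lineage()      # the write-to-Elasticsearch "job"
print(runs["count"])                # 2

# "Cached": compute once, reuse for both the check and the write.
runs["count"] = 0
cached = expensive_lineage()
is_empty = len(cached) == 0
to_write = cached
print(runs["count"])                # 1
```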
> (Is it frequently a big big map by any chance?)
>
> On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote:
>> The “Any” is required by the code it is being passed to, which is the
>> Elasticsearch Spark index writing code. The values are actually
>> RDD[(String, Map[String, String])].
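[Editor's note: the "big map" question matters because `isEmpty` ships one element to the driver, so a single `(String, Map[String, String])` row with a very large map pays its full serialization cost even for a take-one check. A quick plain-Python way to gauge that cost (the field names and sizes below are made up; Spark uses its own serializer, but the shape of the cost is similar):]

```python
import pickle

# One small row versus one row carrying a very large map.
small = ("doc", {"f": "v"})
big = ("doc", {f"field{i}": "v" * 100 for i in range(100_000)})

small_bytes = len(pickle.dumps(small))
big_bytes = len(pickle.dumps(big))
print(big_bytes > 1000 * small_bytes)   # True: a single "row" can be megabytes
```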