I ran a 124M dataset on my laptop. With isEmpty it took 32 minutes; without 
isEmpty it took 18 minutes, and all but 1.5 minutes of that were spent writing 
to Elasticsearch, which is on the same laptop.

So excluding the time spent writing to Elasticsearch, which was nearly the same 
in both cases, the core Spark code took 10x longer with isEmpty. There are other 
isEmpty calls that I’ll optimize away, but they are much smaller gains. Also, 
strike the comparison to take(1); pretend I never said that.

I can avoid isEmpty, but it’s still a bit of a head-scratcher.
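
Roughly the shape of the workaround I have in mind, as a sketch only: cache the 
RDD and do a single guard whose evaluation the write can then reuse, instead of 
scattered isEmpty calls. Here writeToEs is just a stand-in for the Elasticsearch 
index-writing call, not the real API.

    import org.apache.spark.rdd.RDD

    // writeToEs is a stand-in for the Elasticsearch index-writing call.
    def indexIndicators(indicators: RDD[(String, Map[String, String])],
                        writeToEs: RDD[(String, Map[String, String])] => Unit): Unit = {
      indicators.cache()             // keep whatever the guard evaluates
      if (indicators.count() > 0) {  // one job; the count is reusable elsewhere
        writeToEs(indicators)        // the write reuses the cached partitions
      }
      indicators.unpersist()
    }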


> On Dec 9, 2015, at 11:53 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> The “Any” is required by the code it is being passed to, which is the
>> Elasticsearch Spark index writing code. The values are actually RDD[(String,
>> Map[String, String])]
> 
> (Is it frequently a big big map by any chance?)

No, 5000 chars or so per Map.

> 
>> No shuffle that I know of. RDDs are created from the output of Mahout
>> SimilarityAnalysis.cooccurrence and are turned into RDD[(String, Map[String,
>> String])], Since all actual values are simple there is no serialization
>> except for standard Java/Scala so no custom serializers or use of Kryo.
> 
> It's still worth looking at the stages in the job.
> 
> 
>> I understand that the driver can’t know, I was suggesting that isEmpty could
>> be backed by a boolean RDD member variable calculated for every RDD at
>> creation time in Spark. This is really the only way to solve generally since
>> sometimes you get an RDD from a lib, so wrapping it as I suggested is not
>> practical, it would have to be in Spark. BTW the same approach could be used
>> for count, an accumulator per RDD, then returned as a pre-calculated RDD
>> state value.
> 
> What would the boolean do? you can't cache the size in general even if
> you know it, but you don't know it at the time the RDD is created
> (before it's evaluated).

Sorry, maybe I misunderstand, but either referencing the accumulator causes the 
DAG to be executed up to the right spot, or you have to checkpoint. Either way 
we get the count from a fully executed closure.
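
To make the wrapping idea concrete (a minimal sketch only, with CountedRDD as a 
made-up name, not the in-Spark accumulator change I’m proposing): since the 
count, and hence emptiness, can only be known after a job has run, the wrapper 
just caches the result of the first count() rather than knowing it at creation 
time.

    import org.apache.spark.rdd.RDD

    // Sketch of wrapping an RDD to remember its count after the first action.
    class CountedRDD[T](val rdd: RDD[T]) {
      // First access runs a job (forcing the DAG up to this RDD);
      // later accesses reuse the cached value.
      lazy val size: Long = rdd.count()
      def isEmpty: Boolean = size == 0L
    }

The catch stands, though: when an RDD comes from a library you don’t control you 
can’t wrap it at the source, which is why I think something like this would have 
to live in Spark itself.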

> 
>> Are you suggesting that both take(1) and isEmpty are unusable for some
>> reason in my case? I can pass around this information if I have to, I just
>> thought the worst case was O(n) where n was number of partitions and
>> therefor always trivial to calculate.
> 
> No, just trying to guess at reasons you observe what you do. There's
> no difference between isEmpty and take(1) if there are > 0 partitions,
> so if they behave very differently it's something to do with what
> you're measuring and how.
> 

