Typically very large closures include some array, and the serialization itself
should be much more expensive than the closure check. Does anybody have actual
data showing this could be a problem? We don't want to add a config flag if it
virtually never makes sense to change it.

On Mon, Jan 21, 2019 at 12:37 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> 
> Agreed on the pros / cons, especially since the driver could be a data
> science notebook. Is it worthwhile making it configurable?
> 
> *From:* Sean Owen <srowen@gmail.com>
> *Sent:* Monday, January 21, 2019 10:42 AM
> *To:* Reynold Xin
> *Cc:* dev
> *Subject:* Re: Make proactive check for closure serializability optional?
>  
> None except the bug / PR I linked to, which is really just a bug in
> the RowMatrix implementation; a 2GB closure isn't reasonable.
> I doubt it's much overhead in the common case, because closures are
> small and this extra check happens once per execution of the closure.
> 
> I can also imagine middle-ground cases where people are dragging along
> largish 10MB closures (like a model or some data) and this could add
> non-trivial memory pressure on the driver. They should be broadcasting
> those things, sure.
> 
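> As a sketch of that alternative (loadModel, score and rdd here are just
> placeholders for whatever the large object and per-record work are):
> 
>   val model: Array[Double] = loadModel()   // the largish object to ship
>   val bcModel = sc.broadcast(model)
>   rdd.map(x => score(x, bcModel.value))    // closure now only captures the broadcast handle
> 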
> Given just that I'd leave it alone, but was wondering if anyone had
> ever had the same thought or more arguments that it should be
> disable-able. In 'production' one would imagine all the closures do
> serialize correctly, and so this is just a bit of overhead that could be
> skipped.
> 
> On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin <rxin@databricks.com> wrote:
> >
> > Did you actually observe a perf issue?
> >
> > On Mon, Jan 21, 2019 at 10:04 AM Sean Owen <srowen@gmail.com> wrote:
> >>
> >> The ClosureCleaner proactively checks that closures passed to
> >> transformations like RDD.map() are serializable, before they're
> >> executed. It does this by just serializing it with the JavaSerializer.
> >>
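> >> To make that concrete, a rough sketch of the idea (not the actual
> >> ClosureCleaner code): serialize the closure up front with plain Java
> >> serialization and fail fast if that throws.
> >>
> >>   import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
> >>
> >>   def ensureSerializable(closure: AnyRef): Unit = {
> >>     try {
> >>       val out = new ObjectOutputStream(new ByteArrayOutputStream())
> >>       out.writeObject(closure)  // cost scales with the size of the captured state
> >>       out.close()
> >>     } catch {
> >>       case e: NotSerializableException =>
> >>         throw new IllegalArgumentException("Task not serializable", e)
> >>     }
> >>   }
> >>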
> >> That's a nice feature, although there's overhead in always trying to
> >> serialize the closure ahead of time, especially if the closure is
> >> large. It shouldn't be large, usually. But I noticed it when coming up
> >> with this fix: https://github.com/apache/spark/pull/23600
> >>
> >> It made me wonder, should this be optional, or even not the default?
> >> Closures that don't serialize still fail, just later when an action is
> >> invoked. I don't feel strongly about it, just checking if anyone had
> >> pondered this before.
> >>
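> >> For illustration, a toy example of that deferred failure (NotSer is a
> >> made-up class, and sc the usual SparkContext):
> >>
> >>   class NotSer(val factor: Int)  // deliberately not Serializable
> >>   val ns = new NotSer(2)
> >>   val mapped = sc.parallelize(1 to 10).map(_ * ns.factor)  // today: fails here
> >>   mapped.count()  // without the check: would only fail here, at the action
> >>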
> 