Typically, very large closures include some array, and the serialization itself should be much more expensive than the closure check. Does anybody have actual data showing this could be a problem? We don't want to add a config flag if, in virtually every case, it doesn't make sense to change it.
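For a rough sense of scale, here is a small JDK-only sketch (no Spark involved; `CheckCost` and `serializedSize` are made-up names for illustration) measuring what one extra Java-serialization pass costs, in bytes written, when a closure captures a large array versus almost nothing:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntToDoubleFunction;

public class CheckCost {
    // Stand-in for a proactive serializability check: serialize the
    // closure once with plain Java serialization and discard the bytes.
    static int serializedSize(Object closure) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(closure);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        double[] bigArray = new double[1 << 20];   // ~8 MB captured payload
        IntToDoubleFunction big =
            (IntToDoubleFunction & Serializable) i -> bigArray[i % bigArray.length];
        IntToDoubleFunction small =
            (IntToDoubleFunction & Serializable) i -> i * 2.0;

        // The check on `big` writes the whole captured array; on `small`
        // it writes only a few hundred bytes of lambda metadata.
        System.out.println("big closure bytes:   " + serializedSize(big));
        System.out.println("small closure bytes: " + serializedSize(small));
    }
}
```

The asymmetry is the point under discussion: for small closures the extra pass is noise, while a closure dragging along a large captured array pays roughly the full serialization cost a second time.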
On Mon, Jan 21, 2019 at 12:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
> Agreed on the pros / cons, esp driver could be the data science notebook.
> Is it worthwhile making it configurable?
>
> *From:* Sean Owen <srowen@gmail.com>
> *Sent:* Monday, January 21, 2019 10:42 AM
> *To:* Reynold Xin
> *Cc:* dev
> *Subject:* Re: Make proactive check for closure serializability optional?
>
> None except the bug / PR I linked to, which is really just a bug in
> the RowMatrix implementation; a 2GB closure isn't reasonable.
> I doubt it's much overhead in the common case, because closures are
> small and this extra check happens once per execution of the closure.
>
> I can also imagine middle-ground cases where people are dragging along
> largeish 10MB closures (like, a model or some data) and this could add
> non-trivial memory pressure on the driver. They should be broadcasting
> those things, sure.
>
> Given just that I'd leave it alone, but was wondering if anyone had
> ever had the same thought or more arguments that it should be
> disable-able. In 'production' one would imagine all the closures do
> serialize correctly and so this is just a bit of overhead that could be
> skipped.
>
> On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin <rxin@databricks.com> wrote:
> >
> > Did you actually observe a perf issue?
> >
> > On Mon, Jan 21, 2019 at 10:04 AM Sean Owen <srowen@gmail.com> wrote:
> >>
> >> The ClosureCleaner proactively checks that closures passed to
> >> transformations like RDD.map() are serializable, before they're
> >> executed. It does this by just serializing it with the JavaSerializer.
> >>
> >> That's a nice feature, although there's overhead in always trying to
> >> serialize the closure ahead of time, especially if the closure is
> >> large. It shouldn't be large, usually. But I noticed it when coming up
> >> with this fix: https://github.com/apache/spark/pull/23600
> >>
> >> It made me wonder, should this be optional, or even not the default?
> >> Closures that don't serialize still fail, just later when an action is
> >> invoked. I don't feel strongly about it, just checking if anyone had
> >> pondered this before.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
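For readers following the thread: the fail-fast behavior Sean describes can be sketched with plain JDK serialization. This is not Spark's actual ClosureCleaner code; `ProactiveCheck` and `ensureSerializable` are illustrative names. The idea is that a closure capturing a non-serializable object fails at the check, at the point the transformation is defined, rather than later when an action runs:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

public class ProactiveCheck {
    // Toy version of a proactive serializability check: attempt a
    // throwaway Java serialization and surface any failure immediately.
    static void ensureSerializable(Object closure) {
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(closure);
        } catch (Exception e) {
            throw new IllegalArgumentException("Task not serializable: " + e, e);
        }
    }

    public static void main(String[] args) {
        Object lock = new Object();  // java.lang.Object is not Serializable
        IntUnaryOperator bad = (IntUnaryOperator & Serializable)
            i -> { synchronized (lock) { return i + 1; } };
        try {
            ensureSerializable(bad);  // fails here, before any job would run
            System.out.println("check passed");
        } catch (IllegalArgumentException e) {
            System.out.println("caught at definition time: " + e.getMessage());
        }
    }
}
```

Skipping the check would not make `bad` work; it would only move the same failure to a later, less obvious point, which is the usability trade-off the thread weighs against the check's overhead.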