Example to handle data skewness
Hi Dev community, A large data skew is causing memory problems in my cluster. Has anyone tackled this with a custom hash function and had it work on the same cluster configuration? Thanks, Sejal
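As an illustration of the kind of approach being asked about, here is a minimal sketch of key salting, one common way to spread a hot key over several partitions. It is plain Python rather than Spark code, and everything in it is an assumption made for the example: the partition count, the number of salt buckets, and the helper names `stable_hash`, `plain_partition`, `salted_partition` and `partition_loads` are all hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
SALT_BUCKETS = 8    # hypothetical number of salt values per key


def stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike Python's built-in hash for str)."""
    return zlib.crc32(key.encode("utf-8"))


def plain_partition(key: str) -> int:
    """Unsalted partitioning: all records with the same key land in one partition."""
    return stable_hash(key) % NUM_PARTITIONS


def salted_partition(key: str, salt: int) -> int:
    """Salted partitioning: the salt shifts the target partition for the same key."""
    return (stable_hash(key) + salt) % NUM_PARTITIONS


def partition_loads(keys, salted: bool) -> Counter:
    """Count how many records each partition receives."""
    loads = Counter()
    for i, key in enumerate(keys):
        if salted:
            # Round-robin salt assignment; a random salt would also work.
            loads[salted_partition(key, i % SALT_BUCKETS)] += 1
        else:
            loads[plain_partition(key)] += 1
    return loads


# Skewed input: one hot key dominates the data set.
keys = ["hot_key"] * 8000 + [f"key_{i}" for i in range(200)]
plain = partition_loads(keys, salted=False)
salted = partition_loads(keys, salted=True)
print("max partition load without salt:", max(plain.values()))
print("max partition load with salt:   ", max(salted.values()))
```

The trade-off is that records sharing a key no longer sit in one partition, so an aggregation needs a second pass that merges the per-salt partial results, and a join needs the other side replicated once per salt value. Whether this is enough to fix the memory problem on a given cluster is not something the sketch can show.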
BroadcastHashJoinExec cleanup
Hello, looking at BroadcastHashJoinExec, it seems to me that it never destroys the broadcast variables, and I think this can cause problems like SPARK-22575. However, when I tried to add a "cleanup" step to destroy the variable, I saw some test failures because something was trying to access the destroyed broadcast variable. I think the reason lies in BroadcastExchangeExec, which can hand out the same broadcast relation when it is invoked two or more times. So my questions are: first, am I right, or am I missing something? And if I am right, in which cases can a BroadcastExchangeExec be used more than once (I can't think of any)? Thanks, Marco
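The failure described above can be reproduced in a toy model. The sketch below is plain Python, not Spark code: `ToyBroadcast`, `ToyBroadcastExchange` and `toy_join` are hypothetical stand-ins, and the caching behaviour of the exchange is an assumption taken from the hypothesis in this message, not a statement about the real implementation.

```python
class ToyBroadcast:
    """Hypothetical stand-in for a broadcast variable."""

    def __init__(self, value):
        self._value = value
        self._destroyed = False

    @property
    def value(self):
        if self._destroyed:
            raise RuntimeError("attempted to use broadcast after it was destroyed")
        return self._value

    def destroy(self):
        self._destroyed = True
        self._value = None


class ToyBroadcastExchange:
    """Hypothetical exchange that builds its relation once and then reuses it."""

    def __init__(self, rows):
        self._rows = rows
        self._relation = None  # built lazily on first use, cached afterwards

    def execute_broadcast(self):
        if self._relation is None:
            self._relation = ToyBroadcast(dict(self._rows))
        return self._relation  # every caller gets the same object


def toy_join(exchange, probe_keys, cleanup):
    """Hash-join probe against the broadcast table, optionally destroying it afterwards."""
    relation = exchange.execute_broadcast()
    table = relation.value
    result = [(k, table[k]) for k in probe_keys if k in table]
    if cleanup:
        relation.destroy()  # the "cleanup" step under discussion
    return result


exchange = ToyBroadcastExchange([(1, "a"), (2, "b")])
first = toy_join(exchange, [1, 3], cleanup=True)  # succeeds, then destroys
try:
    toy_join(exchange, [2], cleanup=True)  # same exchange, relation already destroyed
    second_failed = False
except RuntimeError:
    second_failed = True
print(first, second_failed)
```

In this model the cleanup is only safe if the join is the sole consumer of the exchange, which is exactly the open question: if an exchange can be shared, the destroy call would have to move to whoever owns the relation's lifetime.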
Nondeterministic Catalyst expressions -- trait and property?!
Hi, Why does Spark SQL need both the Nondeterministic trait [1] and property [2]? That must be confusing for others too, not only me, right? Given the identical names, I suspect the Nondeterministic trait does more than its name says (and more than the property alone could express). Are there any plans to "fix" this (e.g. by renaming the trait)?
[1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L299
[2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala?utf8=%E2%9C%93#L83
Regards, Jacek Laskowski https://about.me/JacekLaskowski