Hi List,

I'm currently trying to naively implement a Data-Vault-type data warehouse using SparkSQL, and I was wondering whether there's an inherent practical limit to query complexity beyond which SparkSQL stops functioning, even for relatively small amounts of data.
I'm currently looking at a query that has 19 joins (including some cartesian joins) in the main query, plus another instance of the same 19 joins in a subquery. What I'm seeing is that even with very restrictive filtering, which does get pushed down the pipeline, I run out of driver memory (36 GB) after just a few minutes, partway into a ~4900-task stage. In fact, quite often just using the Spark UI pushes the driver over the GC overhead limit, and the job then fails.

Obviously, this way of organizing the data isn't ideal, and we're looking into moving most of the joins into a relational DB. Nonetheless, the way the driver blows up for no apparent reason is pretty worrying, and the behaviour is largely independent of how much memory I give the driver. I'm currently looking into getting a heap dump of the driver to figure out which objects are hogging its memory. Given that I don't consciously collect() any significant amount of data, this behaviour surprises me; I even suspect that the large query graph might be causing issues in the Spark UI itself, so I'll have to retry with the UI disabled.

If you have any experience with a significant number of joins in a single query, I'd love to hear from you - maybe someone has also experienced exploding-driver syndrome with non-obvious causes in this context.

Thanks for any input,
Rick
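
P.S. In case it's relevant, this is roughly what I'm planning to try next to take the UI out of the equation and to capture a heap dump. It's only a sketch - the app name, retention values, and dump path are placeholders, not what we actually run:

    import org.apache.spark.sql.SparkSession

    // These configs only take effect if no SparkContext exists yet
    // when getOrCreate() is called.
    val spark = SparkSession.builder()
      .appName("data-vault-join-repro")
      // Either disable the web UI entirely to rule it out ...
      .config("spark.ui.enabled", "false")
      // ... or keep it but retain much less job/stage/task history on the driver heap.
      .config("spark.ui.retainedJobs", "50")
      .config("spark.ui.retainedStages", "100")
      .config("spark.ui.retainedTasks", "5000")
      .config("spark.sql.ui.retainedExecutions", "10")
      .getOrCreate()

    // The heap-dump flags have to be on the driver JVM from startup, so they go on
    // spark-submit rather than into the builder, e.g.:
    //   --driver-java-options "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver.hprof"

If disabling the UI makes the problem go away, that would at least narrow it down to the UI's event bookkeeping rather than the query planner.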