Github user tzachz commented on the pull request: https://github.com/apache/spark/pull/8357#issuecomment-141569618

@yhuai I've just encountered this with my application, so I can give an example of the conditions that cause it. Here are some data points I hope will help:

- My program reads and writes a Parquet file about **2-3 times a minute**.
- The write operation uses `DataFrame.selectExpr` with a list of **~70 expressions** (renaming all columns before saving); these appear to be the expressions whose parsing creates the memory-consuming objects.
- After about **2 hours**, the heap contains **~25K `scala.util.parsing.combinator.Parsers$$anon$3` instances** taking up **1GB** of heap (that can't be collected by GC).
- Note that `2 (hours) * 60 (minutes) * 3 (queries per minute) * 70 (expressions) =~ 25K`, which matches the number of instances. This might be a coincidence, but it might also be exactly the calculation needed to predict the severity of this issue.

So I'd imagine this is a showstopper for any application that performs save/read operations *often enough* (i.e. multiple times a minute), with a large number of expressions to parse in each operation. Hope this helps.
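To make the scenario and the back-of-envelope estimate concrete, here is a small sketch (pure Python, using hypothetical column names in place of the real ~70-column schema) of the kind of rename-expression list that gets handed to `selectExpr`, plus the instance-count arithmetic from the comment:

```python
# Hypothetical column names standing in for the real ~70-column schema.
columns = [f"col_{i}" for i in range(70)]

# One rename expression per column, as would be passed to DataFrame.selectExpr.
# Each string goes through Spark's SQL expression parser on every write,
# which is where the leaked parser-combinator objects come from.
rename_exprs = [f"{name} AS {name}_renamed" for name in columns]
# In Spark this would be invoked as: df.selectExpr(*rename_exprs)

# Back-of-envelope estimate of leaked parser instances:
# hours * minutes/hour * writes/minute * expressions/write
leaked = 2 * 60 * 3 * 70
print(len(rename_exprs))  # 70
print(leaked)             # 25200, i.e. the ~25K instances seen on the heap
```

The multiplication matching the observed instance count is what suggests one unreclaimable parser object is retained per parsed expression.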