Github user tzachz commented on the pull request:

    https://github.com/apache/spark/pull/8357#issuecomment-141569618
  
    @yhuai I've just encountered this with my application, so I can give an 
example of the conditions that cause this - here are some datapoints I hope 
might help: 
     - My program reads and writes a parquet file about **2-3 times a minute**
     - The write operation uses `DataFrame.selectExpr` with a list of **~70 
expressions** (renaming all columns before saving); these seem to be the 
expressions whose parsing creates the memory-consuming objects (see the 
sketch after this list)
     - After about **2 hours**, the heap contains **~25K 
`scala.util.parsing.combinator.Parsers$$anon$3` instances**, taking up **1GB** 
that the GC can't reclaim
     - Note that `2 (hours) * 60 (minutes) * 3 (queries per minute) * 70 
(expressions) =~ 25K`, which matches the instance count - it could be a 
coincidence, but it suggests this is exactly the calculation needed to 
predict the severity of the issue
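    
    For illustration, here's a minimal sketch of the pattern (paths, column 
names, and the scheduling are made up - my actual job just has the same shape):
    
    ```scala
    import org.apache.spark.sql.SQLContext

    def renameAndSave(sqlContext: SQLContext): Unit = {
      val df = sqlContext.read.parquet("/data/input.parquet")

      // ~70 expressions of the form "oldName AS newName"; each one gets run
      // through the SQL expression parser, which seems to be where the leaked
      // scala.util.parsing.combinator.Parsers$$anon$3 instances come from
      val renameExprs = df.columns.map(c => s"$c AS ${c}_renamed")

      df.selectExpr(renameExprs: _*)
        .write.mode("overwrite").parquet("/data/output.parquet")
    }

    // called 2-3 times a minute from the application's scheduling loop
    ```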
    
    So I'd imagine this is a showstopper for any application that performs 
save/read operations *often enough* (i.e. multiple times a minute), with a 
large number of expressions to parse for each operation.
    
    Hope this helps.

