Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-31 Thread Justin Uang
Sweet! It's here: https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437 On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust mich...@databricks.com wrote: Can you add your description of the

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Michael Armbrust
Can you add your description of the problem as a comment to that ticket? We'll make sure to test both cases and break it out if the root cause ends up being different. On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang justin.u...@gmail.com wrote: Sweet! Does this cover DataFrame#rdd also using

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Joseph Bradley
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com wrote: Hey

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame#cache? I think ticket SPARK-9141 is mainly concerned with whether a DataFrame (B) derived from a cached DataFrame (A) uses the cached query of A, not whether the RDD from A.rdd or B.rdd uses the cached query of A. On Tue,