Re: Custom aggregations: modular and lightweight solutions?

2019-08-13 Thread Andrew Leverentz
t this example working (for arbitrary values of rowSize), I suspect that it would also give me a solution to the custom-aggregation issue I outlined in my previous email. Any suggestions would be much appreciated. Thanks, ~ Andrew On Mon, Aug 12, 2019 at 5:31 PM Andrew Leverentz < andrew

Custom aggregations: modular and lightweight solutions?

2019-08-12 Thread Andrew Leverentz
Hi All, I'm attempting to clean up some Spark code which performs groupByKey / mapGroups to compute custom aggregations, and I could use some help understanding the Spark APIs necessary to make my code more modular and maintainable. In particular, my current approach is as follows: - Start

RandomForest - subsamplingRate parameter

2015-06-03 Thread Andrew Leverentz
When training a RandomForest model, the Strategy class (in mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping to use this to cut down on processing time for large datasets (more than 2MM rows and 9K predictors), but I've found that the runtime stays approximately

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
, with cryptic error messages along the lines of “Missing an output location for shuffle.” Having some way to diagnose what’s really going on here would be helpful. ~ Andrew From: Reza Zadeh [mailto:r...@databricks.com] Sent: Thursday, April 23, 2015 4:58 PM To: Andrew Leverentz Cc: user Subject: Re

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
: Thursday, April 23, 2015 4:46 PM To: Andrew Leverentz Cc: user@spark.apache.org Subject: Re: Understanding Spark/MLlib failures Hi Andrew, I observed similar behavior under high GC pressure, when running ALS. What happened to me was that there would be very long Full GC pauses (over 600 seconds