t this example working (for arbitrary values of
rowSize), I suspect that it would also give me a solution to the
custom-aggregation issue I outlined in my previous email. Any suggestions
would be much appreciated.
Thanks,
~ Andrew
On Mon, Aug 12, 2019 at 5:31 PM Andrew Leverentz <
andrew
Hi All,
I'm attempting to clean up some Spark code which performs groupByKey /
mapGroups to compute custom aggregations, and I could use some help
understanding the Spark APIs necessary to make my code more modular and
maintainable.
In particular, my current approach is as follows:
- Start
When training a RandomForest model, the Strategy class (in
mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping
to use this to cut down on processing time for large datasets (more than 2MM
rows and 9K predictors), but I've found that the runtime stays approximately
, with cryptic error messages along the
lines of “Missing an output location for shuffle.” Having some way to diagnose
what’s really going on here would be helpful.
~ Andrew
From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, April 23, 2015 4:58 PM
To: Andrew Leverentz
Cc: user
Subject: Re
: Thursday, April 23, 2015 4:46 PM
To: Andrew Leverentz
Cc: user@spark.apache.org
Subject: Re: Understanding Spark/MLlib failures
Hi Andrew,
I observed similar behavior under high GC pressure, when running ALS. What
happened to me was that there would be very long Full GC pauses (over 600
seconds