Well yes, MLlib-like routines or pretty much anything else could be run on the derived results, but you would have to unload the results from Redshift and then load them into some other tool. So it's nicer to leave them in memory and operate on them there. That is a major architectural advantage of Spark.

Ron

From: Gary Malouf [mailto:malouf.g...@gmail.com]
Sent: Wednesday, August 06, 2014 1:17 PM
To: Nicholas Chammas
Cc: Daniel, Ronald (ELS-SDG); user@spark.apache.org
Subject: Re: Regarding tooling/performance vs RedShift

Also, regarding something like redshift not having MLlib built in, much of that could be done on the derived results.

On Aug 6, 2014 4:07 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:

On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

Mostly I was just objecting to "Redshift does very well, but Shark is on par or better than it in most of the tests" when that was not how I read the results, and Redshift was on HDDs.

My bad. You are correct; the only test Shark (mem) does better on is test #1 "Scan Query".

And indeed, it would be good to see an updated benchmark with Redshift running on SSDs.

Nick