Hi Rohit, I think the third question in the FAQ may help you.
https://spark.apache.org/faq.html

Some other links that talk about building bigger clusters and processing more data:
http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
http://apache-spark-user-list.1001560.n3.nabble.com/Largest-Spark-Cluster-td3782.html

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Wed, Jul 16, 2014 at 9:17 AM, Rohit Pujari <rpuj...@hortonworks.com> wrote:

> Hello Folks:
>
> There is a lot of buzz in the Hadoop community around Spark's inability
> to scale beyond 1 TB datasets (or 10-20 nodes). It is regarded as great
> tech for CPU-intensive workloads on smaller data (less than a TB) but is
> said to fail to scale and perform effectively on larger datasets. How
> true is this?
>
> Are there any customers who are running petabyte-scale workloads on
> Spark in production? Are there any benchmarks performed by Databricks or
> other companies to address this perception?
>
> I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like
> to better understand the boundaries of the tech and recommend the right
> solution for the right problem.
>
> Thanks,
> Rohit Pujari
> Solutions Engineer, Hortonworks
> rpuj...@hortonworks.com
> 716-430-6899
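One point from that FAQ that bears on the 1 TB perception: the working set does not have to fit in memory, because Spark operates on one partition at a time and only combines the small per-partition results. As a rough illustration of that idea (plain Python, not Spark code; the function names are just for this sketch):

```python
# Sketch of partition-at-a-time aggregation, the idea behind Spark's
# ability to process datasets much larger than cluster memory: each
# partition is aggregated independently, and only the small
# per-partition results are held and combined.

def iter_partitions(n_records, partition_size):
    """Yield ranges of record ids, standing in for on-disk partitions."""
    for start in range(0, n_records, partition_size):
        yield range(start, min(start + partition_size, n_records))

def aggregate(n_records, partition_size):
    """Sum all records, touching only one partition at a time."""
    total = 0
    for partition in iter_partitions(n_records, partition_size):
        # Only this partition is "in memory"; the running total is tiny.
        total += sum(partition)
    return total

print(aggregate(1_000_000, 10_000))  # same result as sum(range(1_000_000))
```

The same shape applies to a reduce over a many-terabyte dataset: the per-partition work is distributed across executors, so scaling out is a matter of adding partitions and nodes rather than memory per node.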