Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-16 Thread Rohit Pujari
Thanks Matei.


On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yup, as mentioned in the FAQ, we are aware of multiple deployments running
 jobs on over 1000 nodes. Some of our proofs of concept involved people
 running a 2000-node job on EC2.

 I wouldn't confuse buzz with FUD :).

 Matei
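
 The scaling model behind those 1000-node deployments is data
 partitioning: split the dataset into independent partitions, process
 each one locally, then merge the partial results. A minimal sketch of
 that map/reduce pattern in plain Python (illustrative only -- this is
 not Spark's actual API, just the shape of the computation):

 ```python
 from collections import Counter
 from functools import reduce

 def map_partition(lines):
     # Count words within one partition, independently of all others.
     # In a real cluster, each partition would live on a different node.
     counts = Counter()
     for line in lines:
         counts.update(line.split())
     return counts

 def word_count(partitions):
     # Merge per-partition results pairwise; only these small partial
     # counts cross the network, not the raw data.
     return reduce(lambda a, b: a + b,
                   (map_partition(p) for p in partitions),
                   Counter())

 partitions = [["spark scales", "spark runs"], ["spark on ec2"]]
 print(word_count(partitions)["spark"])  # -> 3
 ```

 Because partitions are processed independently, adding nodes adds
 capacity; the petabyte question is mostly about shuffle and merge
 costs, not the map side.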

 On Jul 15, 2014, at 9:17 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Rohit,

 I think the 3rd question on the FAQ may help you.

 https://spark.apache.org/faq.html

 Some other links that talk about building bigger clusters and processing
 more data:


 http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

 http://apache-spark-user-list.1001560.n3.nabble.com/Largest-Spark-Cluster-td3782.html



 Best Regards,
 Sonal
 Nube Technologies http://www.nubetech.co/

  http://in.linkedin.com/in/sonalgoyal




 On Wed, Jul 16, 2014 at 9:17 AM, Rohit Pujari rpuj...@hortonworks.com
 wrote:

 Hello Folks:

 There is a lot of buzz in the Hadoop community around Spark's supposed
 inability to scale beyond 1 TB datasets (or 10-20 nodes). It is regarded
 as great tech for CPU-intensive workloads on smaller data (less than 1 TB)
 but said to fail to scale and perform effectively on larger datasets. How
 true is this?

 Are there any customers who are running petabyte-scale workloads on Spark
 in production? Are there any benchmarks from Databricks or other companies
 that address this perception?

 I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like to
 better understand the boundaries of the tech and recommend the right
 solution for the right problem.

 Thanks,
 Rohit Pujari
 Solutions Engineer, Hortonworks
 rpuj...@hortonworks.com
 716-430-6899

 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.






-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Rohit Pujari
Hello Folks:

There is a lot of buzz in the Hadoop community around Spark's supposed
inability to scale beyond 1 TB datasets (or 10-20 nodes). It is regarded
as great tech for CPU-intensive workloads on smaller data (less than 1 TB)
but said to fail to scale and perform effectively on larger datasets. How
true is this?

Are there any customers who are running petabyte-scale workloads on Spark
in production? Are there any benchmarks from Databricks or other companies
that address this perception?

I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like to
better understand the boundaries of the tech and recommend the right
solution for the right problem.

Thanks,
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Sonal Goyal
Hi Rohit,

I think the 3rd question on the FAQ may help you.

https://spark.apache.org/faq.html

Some other links that talk about building bigger clusters and processing
more data:

http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
http://apache-spark-user-list.1001560.n3.nabble.com/Largest-Spark-Cluster-td3782.html



Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Wed, Jul 16, 2014 at 9:17 AM, Rohit Pujari rpuj...@hortonworks.com
wrote:

 Hello Folks:

 There is a lot of buzz in the Hadoop community around Spark's supposed
 inability to scale beyond 1 TB datasets (or 10-20 nodes). It is regarded
 as great tech for CPU-intensive workloads on smaller data (less than 1 TB)
 but said to fail to scale and perform effectively on larger datasets. How
 true is this?

 Are there any customers who are running petabyte-scale workloads on Spark
 in production? Are there any benchmarks from Databricks or other companies
 that address this perception?

 I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like to
 better understand the boundaries of the tech and recommend the right
 solution for the right problem.

 Thanks,
 Rohit Pujari
 Solutions Engineer, Hortonworks
 rpuj...@hortonworks.com
 716-430-6899
