Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
Hi all,

We are excited to announce that the benchmark entry has been reviewed by
the Sort Benchmark committee and Spark has officially won the Daytona
GraySort contest by sorting 100 TB of data.

Our entry tied with a UCSD research team building high-performance systems,
and we jointly set a new world record. This is an important milestone for
the project, as it validates the amount of engineering work the community
has put into Spark.

As Matei said, "For an engine to scale from these multi-hour petabyte batch
jobs down to 100-millisecond streaming and interactive queries is quite
uncommon, and it's thanks to all of you folks that we are able to make this
happen."

Updated blog post:
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html




On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
Congrats to everyone who helped make this happen. And if anyone has even more 
machines they'd like us to run on next year, let us know :).

Matei

 On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote:
 
 Hi all,
 
 We are excited to announce that the benchmark entry has been reviewed by
 the Sort Benchmark committee and Spark has officially won the Daytona
 GraySort contest by sorting 100 TB of data.
 
 Our entry tied with a UCSD research team building high-performance systems,
 and we jointly set a new world record. This is an important milestone for
 the project, as it validates the amount of engineering work the community
 has put into Spark.
 
 As Matei said, "For an engine to scale from these multi-hour petabyte batch
 jobs down to 100-millisecond streaming and interactive queries is quite
 uncommon, and it's thanks to all of you folks that we are able to make this
 happen."
 
 Updated blog post:
 http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
 
 
 
 
 On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
The biggest scaling issue was supporting a large number of reduce tasks 
efficiently, which the JIRAs in that post handle. In particular, our current 
default shuffle (the hash-based one) has each map task open a separate file 
output stream for each reduce task, which wastes a lot of memory (since each 
stream has its own buffer).
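
To put a rough number on that buffer cost, here's a back-of-envelope sketch
(standalone Scala; the task counts and the 32 KB per-stream buffer are
illustrative assumptions, not the exact benchmark settings):

// Back-of-envelope estimate of per-node buffer memory under the hash-based
// shuffle: C concurrently running map tasks x R reduce tasks open streams,
// each stream with its own write buffer. Numbers are illustrative.
object HashShuffleBufferEstimate {
  def main(args: Array[String]): Unit = {
    val concurrentMapTasks = 32          // e.g. one map task per core on a 32-core node
    val reduceTasks        = 30000       // on the order of the task counts in these jobs
    val bufferBytes        = 32 * 1024L  // assumed 32 KB buffer per open stream

    val openStreams = concurrentMapTasks.toLong * reduceTasks
    val totalBytes  = openStreams * bufferBytes
    println(f"Open streams per node: $openStreams%,d")
    println(f"Buffer memory per node: ${totalBytes / (1024.0 * 1024 * 1024)}%.1f GiB")
  }
}

Even with conservative numbers, the buffers alone run into tens of gigabytes
per node, which is why cutting down the number of open streams matters so much.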

A second thing that helped efficiency tremendously was Reynold's new network 
module (https://issues.apache.org/jira/browse/SPARK-2468). Doing I/O on 32 
cores, 10 Gbps Ethernet and 8+ disks efficiently is not easy, as can be seen 
when you try to scale up other software.

Finally, with 30,000 tasks even sending info about every map's output size to 
each reducer was a problem, so Reynold has a patch that avoids that if the 
number of tasks is large.
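
If anyone wants to try the new sort-based shuffle, here's a minimal sketch of
opting into it; the property names below are the ones from the Spark 1.1
configuration docs as I recall them, so please double-check them against the
configuration page for your version:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed before 1.3)
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: opting into the sort-based shuffle and running a tiny job
// with a shuffle stage so the setting is actually exercised.
object SortShuffleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sort-shuffle-sketch")
      .setMaster("local[4]")                      // local mode, just for trying it out
      .set("spark.shuffle.manager", "sort")       // one sorted output file per map task
      .set("spark.shuffle.file.buffer.kb", "32")  // per-stream write buffer (assumed default)

    val sc = new SparkContext(conf)
    val counts = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}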

Matei

On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote:

 Hi Matei - I read your post with great interest. Could you possibly comment 
 in more depth on some of the issues you guys saw when scaling up spark and 
 how you resolved them? I am interested specifically in spark-related 
 problems. I'm working on scaling up spark to very large datasets and have 
 been running into a variety of issues. Thanks in advance!
 
 On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some pretty 
 cool news for the project, which is that we've been able to use Spark to 
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x 
 fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few weeks, 
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
 addition, we'd really like to thank Amazon's EC2 team for providing the 
 machines to make this possible. Finally, this result would of course not be 
 possible without the many many other contributions, testing and feature 
 requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and it's 
 thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
Thank you for the details! Would you mind speaking to which tools proved
most useful for identifying bottlenecks or bugs? Thanks again.
On Oct 13, 2014 5:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 The biggest scaling issue was supporting a large number of reduce tasks
 efficiently, which the JIRAs in that post handle. In particular, our
 current default shuffle (the hash-based one) has each map task open a
 separate file output stream for each reduce task, which wastes a lot of
 memory (since each stream has its own buffer).

 A second thing that helped efficiency tremendously was Reynold's new
 network module (https://issues.apache.org/jira/browse/SPARK-2468). Doing
 I/O on 32 cores, 10 Gbps Ethernet and 8+ disks efficiently is not easy, as
 can be seen when you try to scale up other software.

 Finally, with 30,000 tasks even sending info about every map's output size
 to each reducer was a problem, so Reynold has a patch that avoids that if
 the number of tasks is large.

 Matei

 On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote:

  Hi Matei - I read your post with great interest. Could you possibly
 comment in more depth on some of the issues you guys saw when scaling up
 spark and how you resolved them? I am interested specifically in
 spark-related problems. I'm working on scaling up spark to very large
 datasets and have been running into a variety of issues. Thanks in advance!
 
  On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hi folks,
 
  I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
  I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.
 
  For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.
 
  Matei
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done, guys. The MapReduce sort was a good feat at the time, and Spark
has now raised the bar with the ability to sort a PB.
Like some of the other folks on the list, I'd welcome a summary of what
worked (and what didn't), as well as the monitoring practices you used.
Cheers
k/
P.S.: What are you folks planning next?

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Breaking the previous large-scale sort record with Spark

2014-10-11 Thread Henry Saputra
Congrats to Reynold et al leading this effort!

- Henry

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some pretty 
 cool news for the project, which is that we've been able to use Spark to 
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x 
 fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few weeks, 
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
 addition, we'd really like to thank Amazon's EC2 team for providing the 
 machines to make this possible. Finally, this result would of course not be 
 possible without the many many other contributions, testing and feature 
 requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and it's 
 thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei !

Congratulations to the databricks team and all the community members...

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Mridul Muralidharan
Brilliant stuff ! Congrats all :-)
This is indeed really heartening news !

Regards,
Mridul


On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some pretty 
 cool news for the project, which is that we've been able to use Spark to 
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x 
 fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few weeks, 
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
 addition, we'd really like to thank Amazon's EC2 team for providing the 
 machines to make this possible. Finally, this result would of course not be 
 possible without the many many other contributions, testing and feature 
 requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and it's 
 thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Nan Zhu
Great! Congratulations! 

-- 
Nan Zhu


On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:

 Brilliant stuff ! Congrats all :-)
 This is indeed really heartening news !
 
 Regards,
 Mridul
 
 
 On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com) wrote:
  Hi folks,
  
  I interrupt your regularly scheduled user / dev list to bring you some 
  pretty cool news for the project, which is that we've been able to use 
  Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
  faster on 10x fewer nodes. There's a detailed writeup at 
  http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
   Summary: while Hadoop MapReduce held last year's 100 TB world record by 
  sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
  206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
  
  I want to thank Reynold Xin for leading this effort over the past few 
  weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
  Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
  providing the machines to make this possible. Finally, this result would of 
  course not be possible without the many many other contributions, testing 
  and feature requests from throughout the community.
  
  For an engine to scale from these multi-hour petabyte batch jobs down to 
  100-millisecond streaming and interactive queries is quite uncommon, and 
  it's thanks to all of you folks that we are able to make this happen.
  
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread arthur.hk.c...@gmail.com
Wonderful !!

On 11 Oct, 2014, at 12:00 am, Nan Zhu zhunanmcg...@gmail.com wrote:

 Great! Congratulations!
 
 -- 
 Nan Zhu
 On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
 
 Brilliant stuff ! Congrats all :-)
 This is indeed really heartening news !
 
 Regards,
 Mridul
 
 
 On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some 
 pretty cool news for the project, which is that we've been able to use 
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
 faster on 10x fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few 
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
 providing the machines to make this possible. Finally, this result would of 
 course not be possible without the many many other contributions, testing 
 and feature requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and 
 it's thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Steve Nunez
Great stuff. Wonderful to see such progress in so short a time.

How about some links to code and instructions so that these benchmarks can
be reproduced?

Regards,
- Steve

From:  Debasish Das debasish.da...@gmail.com
Date:  Friday, October 10, 2014 at 8:17
To:  Matei Zaharia matei.zaha...@gmail.com
Cc:  user user@spark.apache.org, dev d...@spark.apache.org
Subject:  Re: Breaking the previous large-scale sort record with Spark

 Awesome news Matei !
 
 Congratulations to the databricks team and all the community members...
 
 On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some pretty
 cool news for the project, which is that we've been able to use Spark to
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x
 fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few weeks,
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In
 addition, we'd really like to thank Amazon's EC2 team for providing the
 machines to make this possible. Finally, this result would of course not be
 possible without the many many other contributions, testing and feature
 requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and it's
 thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 





Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Ilya Ganelin
Hi Matei - I read your post with great interest. Could you possibly comment
in more depth on some of the issues you guys saw when scaling up spark and
how you resolved them? I am interested specifically in spark-related
problems. I'm working on scaling up spark to very large datasets and have
been running into a variety of issues. Thanks in advance!
On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Hi folks,

 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.

 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.

 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org