Re: Breaking the previous large-scale sort record with Spark
Hi all,

We are excited to announce that the benchmark entry has been reviewed by the Sort Benchmark committee and Spark has officially won the Daytona GraySort contest for sorting 100 TB of data. Our entry tied with a UCSD research team building high-performance systems, and we jointly set a new world record. This is an important milestone for the project, as it validates the amount of engineering work the community has put into Spark. As Matei said, "For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen."

Updated blog post: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks,
>
> I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project: we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>
> I want to thank Reynold Xin for leading this effort over the past few weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for providing the machines to make this possible. Finally, this result would of course not be possible without the many, many other contributions, testing and feature requests from throughout the community.
>
> For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen.
>
> Matei
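A quick back-of-the-envelope check of the headline figures may be useful here. "3x faster on 10x fewer nodes" works out to roughly 30x higher per-node throughput; the plain Scala sketch below shows the arithmetic, using only the times and node counts quoted above (these are rough derived estimates, not official benchmark throughput figures):

// Rough per-node throughput from the figures quoted above
// (estimates only, not official benchmark numbers).
object SortRecordMath {
  def main(args: Array[String]): Unit = {
    val dataTb = 100.0                    // data sorted, in TB

    // Hadoop MapReduce, last year's record: 100 TB in 72 min on 2100 nodes
    val hadoopTbPerNodeMin = dataTb / 72 / 2100

    // Spark, this entry: 100 TB in 23 min on 206 nodes
    val sparkTbPerNodeMin = dataTb / 23 / 206

    println(f"Hadoop: ${hadoopTbPerNodeMin * 1e6}%.0f MB per node-minute")
    println(f"Spark:  ${sparkTbPerNodeMin * 1e6}%.0f MB per node-minute")
    println(f"Per-node speedup: ${sparkTbPerNodeMin / hadoopTbPerNodeMin}%.1fx")
    // => ~661 MB vs ~21,100 MB per node-minute, about a 31.9x per-node gain
  }
}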
Re: Breaking the previous large-scale sort record with Spark
Congrats to everyone who helped make this happen. And if anyone has even more machines they'd like us to run on next year, let us know :).

Matei

On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote:
> Hi all,
>
> We are excited to announce that the benchmark entry has been reviewed by the Sort Benchmark committee and Spark has officially won the Daytona GraySort contest for sorting 100 TB of data. [...]
Re: Breaking the previous large-scale sort record with Spark
The biggest scaling issue was supporting a large number of reduce tasks efficiently, which the JIRAs in that post handle. In particular, our current default shuffle (the hash-based one) has each map task open a separate file output stream for each reduce task, which wastes a lot of memory (since each stream has its own buffer).

A second thing that helped efficiency tremendously was Reynold's new network module (https://issues.apache.org/jira/browse/SPARK-2468). Doing I/O efficiently across 32 cores, 10 Gbps Ethernet and 8+ disks is not easy, as you can see when you try to scale up other software.

Finally, with 30,000 tasks, even sending info about every map's output size to each reducer was a problem, so Reynold has a patch that avoids that when the number of tasks is large.

Matei

On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote:
> Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some of the issues you guys saw when scaling up Spark and how you resolved them? [...]
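To make the buffer-memory arithmetic concrete, here is a minimal Scala sketch. It assumes the ~32 KB per-stream buffer that spark.shuffle.file.buffer.kb defaulted to around Spark 1.1, and the 32-core, 30,000-reduce-task setup described above; the toy job at the end simply exercises a shuffle under the sort-based manager and is illustrative, not the benchmark code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed pre-1.3)

object ShuffleMemorySketch {
  def main(args: Array[String]): Unit = {
    // Hash-based shuffle: each concurrently running map task keeps one open
    // file output stream per reduce task, and every stream has its own buffer.
    val coresPerNode = 32                 // concurrent map tasks per node
    val reduceTasks  = 30000
    val bufferBytes  = 32 * 1024L         // assumed ~32 KB default stream buffer

    val bufferGbPerNode =
      coresPerNode.toLong * reduceTasks * bufferBytes / math.pow(2, 30)
    println(f"Hash-shuffle stream buffers per node: ~$bufferGbPerNode%.1f GB")
    // ~29 GB of buffers per node before any data is even shuffled.

    // The sort-based shuffle (available since Spark 1.1) writes one sorted,
    // indexed file per map task, so open streams scale with the number of
    // cores rather than cores x reduce tasks.
    val conf = new SparkConf()
      .setAppName("shuffle-sketch")
      .set("spark.shuffle.manager", "sort")  // "hash" is still the default
    val sc = new SparkContext(conf)

    // Toy shuffle with many reduce partitions to exercise the code path.
    sc.parallelize(1 to 1000000)
      .map(i => (i % 1000, i))
      .reduceByKey(_ + _, 1000)
      .count()
    sc.stop()
  }
}

The one-file-per-map-task layout is the essential design change: it removes the cores x reduce-tasks scaling of open streams that makes the hash-based shuffle struggle at 30,000 tasks.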
Re: Breaking the previous large-scale sort record with Spark
Thank you for the details! Would you mind speaking to which tools proved most useful for identifying bottlenecks or bugs? Thanks again.

On Oct 13, 2014 5:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
> The biggest scaling issue was supporting a large number of reduce tasks efficiently, which the JIRAs in that post handle. [...]
Re: Breaking the previous large-scale sort record with Spark
Well done, guys. The MapReduce sort was a good feat in its time, and Spark has now raised the bar with the ability to sort a petabyte. Like some other folks on the list, I'd find a summary of what worked (and what didn't), as well as your monitoring practices, very useful.

Cheers
k/

P.S.: What are you folks planning next?

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project... [...]
Re: Breaking the previous large-scale sort record with Spark
Congrats to Reynold et al. for leading this effort!

- Henry

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks, [...]
Re: Breaking the previous large-scale sort record with Spark
Awesome news, Matei! Congratulations to the Databricks team and all the community members...

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks, [...]
Re: Breaking the previous large-scale sort record with Spark
Brilliant stuff! Congrats all :-) This is indeed really heartening news!

Regards,
Mridul

On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks, [...]
Re: Breaking the previous large-scale sort record with Spark
Great! Congratulations!

--
Nan Zhu

On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
> Brilliant stuff! Congrats all :-) This is indeed really heartening news!
> [...]
Re: Breaking the previous large-scale sort record with Spark
Wonderful!!

On 11 Oct, 2014, at 12:00 am, Nan Zhu zhunanmcg...@gmail.com wrote:
> Great! Congratulations!
> [...]
Re: Breaking the previous large-scale sort record with Spark
Great stuff. Wonderful to see such progress in so short a time. How about some links to code and instructions so that these benchmarks can be reproduced?

Regards,
- Steve

From: Debasish Das debasish.da...@gmail.com
Date: Friday, October 10, 2014 at 8:17
To: Matei Zaharia matei.zaha...@gmail.com
Cc: user user@spark.apache.org, dev d...@spark.apache.org
Subject: Re: Breaking the previous large-scale sort record with Spark

> Awesome news, Matei! Congratulations to the Databricks team and all the community members... [...]
Re: Breaking the previous large-scale sort record with Spark
Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some of the issues you guys saw when scaling up Spark and how you resolved them? I am interested specifically in Spark-related problems. I'm working on scaling up Spark to very large datasets and have been running into a variety of issues. Thanks in advance!

On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project... [...]