Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
FYI, another benchmark:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html

quote: I have observed a lot of fetch failures while running Spark, which
results in many restarted tasks and, therefore, takes the longest time. I
suspect that executors are incapable of serving shuffle data due to JVMs
doing long garbage-collection (I also tried large numbers for
spark.core.connection.ack.wait.timeout). Flink seems to be irrelevant to GC
issues thanks to its own internal memory management. MapReduce and Tez
execute each task in a separate process and rely on an external auxiliary
service for shuffling. Although the shuffle service could exhibit fetch
failures for other reasons, it works without any fetch failure in this
experiment for Hadoop MapReduce and Tez.


Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in 
graph processing and data-science).

Spark has a pretty large overhead per iteration; more optimisations and
planning only make this worse.

Sure, people have implemented things like Dijkstra's algorithm in Spark
(a problem where the number of iterations is bounded by the circumference of
the input graph),
but all the datasets I've seen it running on had a very small circumference
(which is common for e.g. social networks).
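
To make the iteration bound concrete, here is a minimal sketch in plain Python (no Spark involved). It uses the terms from the follow-up correction in this thread: shortest path and graph diameter. It is not Dijkstra's algorithm itself but a bulk-synchronous relaxation, which is closer to how a shortest-path job is usually expressed on a dataflow engine. Unit edge weights, a connected undirected graph, and all function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source, computed with breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(graph):
    """Longest shortest path between any two vertices (graph assumed connected)."""
    return max(max(bfs_distances(graph, v).values()) for v in graph)


def relaxation_rounds(graph, source):
    """Bulk-synchronous shortest paths: each round relaxes every edge once.

    Returns the final distances and the number of rounds that changed a value.
    """
    inf = float("inf")
    dist = {v: inf for v in graph}
    dist[source] = 0
    rounds = 0
    while True:
        updated = dict(dist)
        for node in graph:
            for neighbour in graph[node]:
                # Only distances from the previous round are read here.
                updated[neighbour] = min(updated[neighbour], dist[node] + 1)
        if updated == dist:  # nothing changed: converged
            return dist, rounds
        dist = updated
        rounds += 1


# A path graph has a large diameter; a star graph has a small one.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
for name, g, src in (("path", path, 0), ("star", star, 1)):
    _, rounds = relaxation_rounds(g, src)
    print(name, "rounds:", rounds, "diameter:", diameter(g))
```

The number of productive rounds never exceeds the diameter, so on small-diameter inputs such as social networks the per-iteration overhead is paid only a handful of times.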

Take Spark SQL for example. Catalyst is a really good query optimiser, but it
introduces significant overhead.
Since Spark has no iterative semantics of its own (unlike Flink),
one has to materialise the intermediate DataFrame at each iteration boundary to
determine whether a termination criterion has been reached.
This causes a huge amount of planning, especially since it looks like Catalyst
will try to optimise the dependency graph
regardless of caching: a dependency graph that grows with the number of
iterations, and thus with the size of the input dataset.
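
A toy model of that cost, in plain Python rather than Spark: each loop iteration stacks one more lazy node on the plan, and each convergence check is an action that walks the whole plan again. The `Plan` class, `run_driver_loop`, and the rule that planning cost equals plan depth are all assumptions of this sketch, not a description of Catalyst internals.

```python
class Plan:
    """A node in a toy lazy query plan: an operation plus its parent plan."""

    def __init__(self, op, parent=None):
        self.op = op
        self.parent = parent

    def depth(self):
        """Number of nodes from this node back to the source."""
        node, count = self, 0
        while node is not None:
            count += 1
            node = node.parent
        return count


def run_driver_loop(iterations):
    """Model a driver loop that triggers an action after every iteration.

    Toy assumption: every action re-plans the full lineage, so the cost of
    one action equals the current plan depth.
    """
    plan = Plan("source")
    planned_nodes = 0
    for i in range(iterations):
        plan = Plan(f"step-{i}", parent=plan)  # lazy transformation
        planned_nodes += plan.depth()          # action: convergence check
    return planned_nodes


for n in (10, 100):
    print(n, "iterations ->", run_driver_loop(n), "plan nodes visited")
```

Under these assumptions the total planning work grows quadratically with the number of iterations, which is the effect described above.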

In Flink, on the other hand, you can describe your entire iterative program
through transformations without ever calling an action.
This means that the optimiser only has to do its planning once.
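
As a rough illustration of the difference, here is a plain-Python sketch in which the step function and the termination criterion are handed to the engine as part of the program, so the engine plans once and loops internally. `IterateOperator` and `ToyEngine` are hypothetical names for this sketch and are not Flink's API.

```python
class IterateOperator:
    """Toy stand-in for a dataflow-level iteration operator (hypothetical)."""

    def __init__(self, step, converged, max_iterations):
        self.step = step                  # transformation applied each round
        self.converged = converged        # termination criterion
        self.max_iterations = max_iterations


class ToyEngine:
    """Counts how often a plan is built; the loop runs inside the engine."""

    def __init__(self):
        self.plans_built = 0

    def execute(self, operator, data):
        self.plans_built += 1  # whole program, loop included, planned once
        for _ in range(operator.max_iterations):
            new_data = operator.step(data)
            if operator.converged(data, new_data):
                return new_data
            data = new_data
        return data


engine = ToyEngine()
halve = IterateOperator(
    step=lambda xs: [x // 2 for x in xs],
    converged=lambda old, new: old == new,
    max_iterations=100,
)
print(engine.execute(halve, [8, 5, 1]), "plans built:", engine.plans_built)
```

The driver never inspects intermediate results, so no action is needed between rounds and the plan does not grow with the iteration count.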

Just my 2 cents :)
Cheers, Jan



Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph.
I shouldn't write emails before I get my morning coffee...




Benchmark results between Flink and Spark

2015-07-05 Thread Slim Baltagi
Hi

Apache Flink outperforms Apache Spark in processing machine learning and graph
algorithms and relational queries, but not in batch processing!

The results were published in the proceedings of the 18th International
Conference, Business Information Systems 2015, Poznań, Poland, June 24-26,
2015. 

Thanks to our friend Google, Chapter 3: 'Evaluating New Approaches of Big
Data Analytics Frameworks' by Norman Spangenberg, Martin Roth and Bogdan
Franczyk is available for preview at http://goo.gl/WocQci on pages 28-37. 

Enjoy!

Slim Baltagi
http://www.SparkBigData.com 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Benchmark-results-between-Flink-and-Spark-tp23626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Benchmark results between Flink and Spark

2015-07-05 Thread Ted Yu
There was no mention of the versions of Flink and Spark used in the
benchmark.

The cluster is also quite small.

Cheers





Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
Hi guys,

I just read the paper too. There is not much information on why Flink
is faster than Spark for data-science-type workloads in the benchmark.
From my point of view, it is very difficult to generalize the conclusions of
a benchmark. How much experience the authors have with Spark compared to
Flink is one of the immediate questions I have. It would be great if they
made the benchmark software available somewhere for other people to
experiment with.

just my 2 cents,

Jerry






RE: Benchmark results between Flink and Spark

2015-07-05 Thread nate
Maybe Flink benefits from some of the points they outline here:

http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html

If the benchmarks were re-run with the Spark 1.5/Tungsten line, the gap would
probably close a bit (or a lot), with Spark moving towards a similar style of
off-heap memory management and more planning optimizations.
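
For readers who have not seen the linked post: as I understand it, the core idea is to keep records in serialised binary form inside pre-allocated memory instead of as individual objects on the heap, and to operate on the bytes directly where possible. Below is a rough plain-Python sketch of that layout idea only. The record format (an 8-byte integer key plus an 8-byte float value) and all function names are assumptions for illustration; neither Flink nor Tungsten is implemented this way in detail.

```python
import struct

# Fixed-width record: little-endian 8-byte signed key + 8-byte float value.
RECORD = struct.Struct("<qd")


def pack_records(records):
    """Serialise (key, value) pairs into one contiguous byte buffer."""
    buffer = bytearray(RECORD.size * len(records))
    for i, (key, value) in enumerate(records):
        RECORD.pack_into(buffer, i * RECORD.size, key, value)
    return buffer


def sorted_offsets(buffer):
    """Sort record offsets by key, reading only the key bytes of each record."""
    offsets = range(0, len(buffer), RECORD.size)
    return sorted(offsets, key=lambda off: struct.unpack_from("<q", buffer, off)[0])


def read_record(buffer, offset):
    """Deserialise one full record, only when it is actually needed."""
    return RECORD.unpack_from(buffer, offset)


buf = pack_records([(3, 0.5), (1, 1.5), (2, 2.5)])
print([read_record(buf, off) for off in sorted_offsets(buf)])
```

One buffer replaces many small objects, which is the property that takes pressure off the garbage collector in the JVM setting discussed in this thread.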

 

 
