Re: Benchmark results between Flink and Spark
FYI, another benchmark: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html

Quote:

I have observed a lot of fetch failures while running Spark, which results in many restarted tasks and, therefore, takes the longest time. I suspect that executors are incapable of serving shuffle data due to JVMs doing long garbage-collection (I also tried large numbers for spark.core.connection.ack.wait.timeout). Flink seems to be irrelevant to GC issues thanks to its own internal memory management. MapReduce and Tez execute each task in a separate process and rely on an external auxiliary service for shuffling. Although the shuffle service could exhibit fetch failures for other reasons, it works without any fetch failure in this experiment for Hadoop MapReduce and Tez.

On Mon, Jul 6, 2015 at 3:13 AM, Jan-Paul Bultmann janpaulbultm...@me.com wrote:

Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee...

On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote:

I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration, and more optimisations and planning only make this worse.

Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem where the number of iterations is bounded by the circumference of the input graph), but all the datasets I've seen it running on had a very small circumference (which is common for e.g. social networks).

Take Spark SQL for example. Catalyst is a really good query optimiser, but it introduces significant overhead. Since Spark has no iterative semantics of its own (unlike Flink), one has to materialise the intermediate DataFrame at each iteration boundary to determine whether a termination criterion has been reached. This causes a huge amount of planning, especially since it looks like Catalyst will try to optimise the dependency graph regardless of caching: a dependency graph that grows with the number of iterations and thus with the size of the input dataset.

In Flink, on the other hand, you can describe your entire iterative program through transformations without ever calling an action. This means that the optimiser only has to do planning once.

Just my 2 cents :)

Cheers,
Jan

On 06 Jul 2015, at 06:10, n...@reactor8.com wrote:

Maybe Flink benefits from some of the points they outline here: http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html

Probably, if the benchmarks were re-run with the 1.5/Tungsten line, it would close the gap a bit (or a lot), with Spark moving towards a similar style of off-heap memory management and more planning optimizations.

From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Sunday, July 5, 2015 6:28 PM
To: Ted Yu
Cc: Slim Baltagi; user
Subject: Re: Benchmark results between Flink and Spark

Hi guys,

I just read the paper too. There is not much information regarding why Flink is faster than Spark for data-science-type workloads in the benchmark. From my point of view, it is very difficult to generalize the conclusion of a benchmark. How much experience the authors have with Spark in comparison to Flink is one of the immediate questions I have. It would be great if they made the benchmark software available somewhere for other people to experiment with.

Just my 2 cents,
Jerry

On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu yuzhih...@gmail.com wrote:

There was no mention of the versions of Flink and Spark used in the benchmark. The cluster is also quite small.

Cheers

On Sun, Jul 5, 2015 at 10:24 AM, Slim Baltagi sbalt...@gmail.com wrote:

Hi,

Apache Flink outperforms Apache Spark in processing machine learning, graph algorithms and relational queries, but not in batch processing!

The results were published in the proceedings of the 18th International Conference, Business Information Systems 2015, Poznań, Poland, June 24-26, 2015.

Thanks to our friend Google, Chapter 3: 'Evaluating New Approaches of Big Data Analytics Frameworks' by Norman Spangenberg, Martin Roth and Bogdan Franczyk is available for preview at http://goo.gl/WocQci on pages 28-37.

Enjoy!

Slim Baltagi
http://www.SparkBigData.com

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Benchmark-results-between-Flink-and-Spark-tp23626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Benchmark results between Flink and Spark
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration, and more optimisations and planning only make this worse.

Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem where the number of iterations is bounded by the circumference of the input graph), but all the datasets I've seen it running on had a very small circumference (which is common for e.g. social networks).

Take Spark SQL for example. Catalyst is a really good query optimiser, but it introduces significant overhead. Since Spark has no iterative semantics of its own (unlike Flink), one has to materialise the intermediate DataFrame at each iteration boundary to determine whether a termination criterion has been reached. This causes a huge amount of planning, especially since it looks like Catalyst will try to optimise the dependency graph regardless of caching: a dependency graph that grows with the number of iterations and thus with the size of the input dataset.

In Flink, on the other hand, you can describe your entire iterative program through transformations without ever calling an action. This means that the optimiser only has to do planning once.

Just my 2 cents :)

Cheers,
Jan
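To make the planning argument above a little more concrete, here is a toy model in plain Python. It uses neither Spark nor Flink, and every name in it is hypothetical. It only counts how many plan nodes an optimiser would visit if the whole dependency graph were re-planned at each iteration boundary, compared with planning a fixed-size loop body once. It is a sketch under those assumptions, not a measurement of either system.

```python
# Toy model only: no Spark or Flink involved, all names are made up.

def replan_every_iteration(iterations, ops_per_iteration):
    """Plan nodes visited when each iteration boundary triggers an action
    and the optimiser walks the whole lineage built so far."""
    visited = 0
    lineage = 0
    for _ in range(iterations):
        lineage += ops_per_iteration   # the dependency graph grows
        visited += lineage             # and is re-planned in full
    return visited


def plan_once(iterations, ops_per_iteration):
    """Plan nodes visited when the loop is part of the program itself,
    so the loop body is planned a single time (iterations is unused)."""
    return ops_per_iteration


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, replan_every_iteration(n, 5), plan_once(n, 5))
```

In this model the re-planning cost grows quadratically with the number of iterations, while the plan-once cost stays constant. Real optimisers cache and prune, so the actual gap may well be smaller.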
Re: Benchmark results between Flink and Spark
Sorry, that should be shortest path, and diameter of the graph (not circumference). I shouldn't write emails before I get my morning coffee...
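The corrected claim (iterations bounded by the diameter of the graph) can be illustrated with a small plain-Python sketch. It assumes an unweighted, undirected graph and level-by-level frontier expansion, which is a simplifying assumption and not the code anyone in this thread ran; the function name is hypothetical.

```python
def sssp_supersteps(edges, source):
    """Unweighted single-source shortest paths by repeated frontier expansion.

    Returns (distances, supersteps), where supersteps counts the rounds
    that discovered at least one new vertex.
    """
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    dist = {source: 0}
    frontier = {source}
    supersteps = 0
    while frontier:
        next_frontier = set()
        for v in frontier:
            for w in neighbours.get(v, ()):
                if w not in dist:          # first visit is the shortest distance
                    dist[w] = dist[v] + 1
                    next_frontier.add(w)
        frontier = next_frontier
        if frontier:
            supersteps += 1
    return dist, supersteps


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3), (3, 4)]   # diameter 4
    star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # diameter 2
    print(sssp_supersteps(path, 0)[1], sssp_supersteps(star, 1)[1])
```

The number of productive rounds equals the largest distance from the source, which can never exceed the diameter. That is why small-diameter inputs such as social networks need only a handful of iterations.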
Benchmark results between Flink and Spark
Hi,

Apache Flink outperforms Apache Spark in processing machine learning, graph algorithms and relational queries, but not in batch processing!

The results were published in the proceedings of the 18th International Conference, Business Information Systems 2015, Poznań, Poland, June 24-26, 2015.

Thanks to our friend Google, Chapter 3: 'Evaluating New Approaches of Big Data Analytics Frameworks' by Norman Spangenberg, Martin Roth and Bogdan Franczyk is available for preview at http://goo.gl/WocQci on pages 28-37.

Enjoy!

Slim Baltagi
http://www.SparkBigData.com
Re: Benchmark results between Flink and Spark
There was no mention of the versions of Flink and Spark used in the benchmark. The cluster is also quite small.

Cheers
Re: Benchmark results between Flink and Spark
Hi guys,

I just read the paper too. There is not much information regarding why Flink is faster than Spark for data-science-type workloads in the benchmark. From my point of view, it is very difficult to generalize the conclusion of a benchmark. How much experience the authors have with Spark in comparison to Flink is one of the immediate questions I have. It would be great if they made the benchmark software available somewhere for other people to experiment with.

Just my 2 cents,
Jerry
RE: Benchmark results between Flink and Spark
Maybe Flink benefits from some of the points they outline here: http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html

Probably, if the benchmarks were re-run with the 1.5/Tungsten line, it would close the gap a bit (or a lot), with Spark moving towards a similar style of off-heap memory management and more planning optimizations.
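As a rough illustration of the general idea behind this style of memory management (keeping records in serialized binary form in a few large buffers, rather than as many individual objects for the garbage collector to track), here is a plain-Python sketch. The record layout and function names are hypothetical, and a Python bytearray merely stands in for a managed memory segment; it does not show how Flink or Spark actually lay out memory.

```python
import struct

# Hypothetical fixed-width record: 64-bit integer key, 64-bit float value.
RECORD = struct.Struct("<qd")


def pack_records(records):
    """Serialize (key, value) pairs into one contiguous buffer."""
    buf = bytearray(RECORD.size * len(records))
    for i, (key, value) in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, key, value)
    return buf


def sum_values(buf):
    """Scan the binary buffer directly, decoding one record at a time."""
    total = 0.0
    for _key, value in RECORD.iter_unpack(buf):
        total += value
    return total


if __name__ == "__main__":
    buf = pack_records([(1, 1.5), (2, 2.5), (3, 4.0)])
    print(len(buf), sum_values(buf))
```

The point of the sketch is only that the data lives in one buffer of known size, so the number of long-lived objects does not grow with the number of records.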