Comparison of Trino, Spark, and Hive-MR3

2023-05-31 Thread Sungwoo Park
Hi everyone, We published an article on the performance and correctness of Trino, Spark, and Hive-MR3, and thought that it could be of interest to Spark users. https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/ Omitted in the article is the performance of Spark 2.3.1 vs

Re: Help with Shuffle Read performance

2022-09-30 Thread Sungwoo Park
Hi Leszek, For running YARN on Kubernetes and then running Spark on YARN, is there a lot of overhead for maintaining YARN on Kubernetes? I thought people usually want to move from YARN to Kubernetes because of the overhead of maintaining Hadoop. Thanks, --- Sungwoo On Fri, Sep 30, 2022 at

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: >> You are right -- Spark can't do this with its current architecture. My question was: if there was a new implementation supporting pipelined execution, what kind of Spark jobs would benefit (a lot) from it?

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park wrote: >> Hello Spark users, >> I have a question on the architecture of Spark (which could lead to a

Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users, I have a question on the architecture of Spark (which could lead to a research problem). In its current implementation, Spark finishes executing all the tasks in a stage before proceeding to child stages. For example, given a two-stage map-reduce DAG, Spark finishes executing
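
For readers less familiar with the stage boundary described in this thread, below is a minimal sketch (in Scala, with hypothetical input/output paths and key parsing) of such a two-stage map-reduce job; the only point is that the reduce-side stage starts only after every map task has finished and written its shuffle output.

import org.apache.spark.sql.SparkSession

object TwoStageMapReduce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("two-stage-map-reduce").getOrCreate()
    val sc = spark.sparkContext

    // Stage 1 (map side): read lines and emit (key, 1) pairs.
    val pairs = sc.textFile("hdfs:///tmp/input")   // hypothetical path
      .map(line => (line.split(",")(0), 1L))

    // reduceByKey introduces a shuffle boundary: the DAG scheduler launches the
    // reduce-side stage only after all map tasks above have completed and
    // materialized their shuffle output, i.e. the two stages are not pipelined.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///tmp/output")    // hypothetical path
    spark.stop()
  }
}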

Re: [Spark][Core] Resource Allocation

2022-07-15 Thread Sungwoo Park
For 1), this is a recurring question in this mailing list, and the answer is: no, Spark does not support coordination between multiple Spark applications. Spark relies on an external resource manager, such as YARN or Kubernetes, to allocate resources to multiple Spark applications. For
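
As a point of reference, here is a sketch (application name, queue name, and resource values are illustrative only) of how each application declares its own resource requests, which the external resource manager then satisfies independently per application:

import org.apache.spark.sql.SparkSession

object ResourceRequestSketch {
  def main(args: Array[String]): Unit = {
    // Each application asks the cluster manager (YARN here) for its own
    // executors; Spark does not coordinate resources across applications.
    // Cross-application policies such as queues, fair sharing, or preemption
    // are configured in the resource manager itself.
    val spark = SparkSession.builder()
      .appName("app-a")                            // hypothetical application
      .config("spark.executor.instances", "10")    // illustrative values
      .config("spark.executor.memory", "4g")
      .config("spark.executor.cores", "2")
      .config("spark.yarn.queue", "analytics")     // hypothetical YARN queue
      .getOrCreate()

    // ... application logic ...
    spark.stop()
  }
}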

Re: A scene with unstable Spark performance

2022-05-17 Thread Sungwoo Park
The problem you describe is the motivation for developing Spark on MR3. From the blog article (https://www.datamonad.com/post/2021-08-18-spark-mr3/): The main motivation for developing Spark on MR3 is to allow multiple Spark applications to share compute resources such as Yarn containers or

Re:

2022-04-02 Thread Sungwoo Park
/comparison-llap/ Thanks, -- SW On Sat, Apr 2, 2022 at 9:58 PM Bitfox wrote: > Nice reading. Can you give a comparison on Hive on MR3 and Hive on Tez? > Thanks > On Sat, Apr 2, 2022 at 7:17 PM Sungwoo Park wrote: >> Hi Spark users, >> We have pu

[no subject]

2022-04-02 Thread Sungwoo Park
Hi Spark users, We have published an article where we evaluate the performance of Spark 2.3.8 and Spark 3.2.1 (along with Hive 3). If interested, please see: https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/ --- SW

[Announce] Spark on MR3

2021-08-19 Thread Sungwoo Park
Hi Spark users, We would like to announce the release of Spark on MR3, which is Apache Spark using MR3 as the execution backend. MR3 is a general purpose execution engine for Hadoop and Kubernetes, and Hive on MR3 has been its main application. Spark on MR3 is a new application of MR3. The main