This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new d60afe5997 Update spark connect page based on pr feedback (#512)
d60afe5997 is described below

commit d60afe59979f3a8b728e4838acb624a9d3b37722
Author: Matthew Powers <matthewkevinpow...@gmail.com>
AuthorDate: Thu Apr 4 16:40:02 2024 -0400

    Update spark connect page based on pr feedback (#512)

    Addresses the comments from this PR:
    https://github.com/apache/spark-website/pull/511

    Also rewords some of the language.
---
 _layouts/home.html            |  1 +
 site/index.html               |  1 +
 site/spark-connect/index.html | 42 ++++++++++++++++++++----------------------
 spark-connect/index.md        | 42 ++++++++++++++++++++----------------------
 4 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/_layouts/home.html b/_layouts/home.html
index 8bae9e1680..ad8f255b7c 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -73,6 +73,7 @@
         </a>
         <ul class="dropdown-menu" aria-labelledby="libraries">
           <li><a class="dropdown-item" href="{{site.baseurl}}/sql/">SQL and DataFrames</a></li>
+          <li><a class="dropdown-item" href="{{site.baseurl}}/spark-connect/">Spark Connect</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/streaming/">Spark Streaming</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/mllib/">MLlib (machine learning)</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/graphx/">GraphX (graph)</a></li>
diff --git a/site/index.html b/site/index.html
index d2abd82fb3..3e6072f1d7 100644
--- a/site/index.html
+++ b/site/index.html
@@ -69,6 +69,7 @@
         </a>
         <ul class="dropdown-menu" aria-labelledby="libraries">
           <li><a class="dropdown-item" href="/sql/">SQL and DataFrames</a></li>
+          <li><a class="dropdown-item" href="/spark-connect/">Spark Connect</a></li>
           <li><a class="dropdown-item" href="/streaming/">Spark Streaming</a></li>
           <li><a class="dropdown-item" href="/mllib/">MLlib (machine learning)</a></li>
           <li><a class="dropdown-item" href="/graphx/">GraphX (graph)</a></li>
diff --git 
a/site/spark-connect/index.html b/site/spark-connect/index.html index 6dc75abe37..1877c0f32d 100644 --- a/site/spark-connect/index.html +++ b/site/spark-connect/index.html @@ -142,7 +142,7 @@ <div class="container"> <div class="row mt-4"> <div class="col-12 col-md-9"> - <p>This post explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.</p> + <p>This page explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.</p> <p>Let’s start by exploring the architecture of Spark Connect at a high level.</p> @@ -154,9 +154,9 @@ <ol> <li>A connection is established between the Client and Spark Server</li> - <li>The Client converts a DataFrame query to an unresolved logical plan</li> + <li>The Client converts a DataFrame query to an unresolved logical plan that describes the intent of the operation rather than how it should be executed</li> <li>The unresolved logical plan is encoded and sent to the Spark Server</li> - <li>The Spark Server runs the query</li> + <li>The Spark Server optimizes and runs the query</li> <li>The Spark Server sends the results back to the Client</li> </ol> @@ -164,11 +164,11 @@ <p>Let’s go through these steps in more detail to get a better understanding of the inner workings of Spark Connect.</p> -<p><strong>Establishing a connection between the Client and Spark Server</strong></p> +<p><strong>Establishing a connection between the Client and the Spark Server</strong></p> <p>The network communication for Spark Connect uses the <a href="https://grpc.io/">gRPC framework</a>.</p> -<p>gRPC is performant and language agnostic. 
Spark Connect uses language-agnostic technologies, so it’s portable.</p>
+<p>gRPC is performant and language agnostic, which makes Spark Connect portable.</p>
 
 <p><strong>Converting a DataFrame query to an unresolved logical plan</strong></p>
 
@@ -182,7 +182,7 @@ GlobalLimit 5
 +- LocalLimit 5
    +- SubqueryAlias spark_catalog.default.some_table
-      +- Relation spark_catalog.default.some_table[character#15,franchise#16] parquet
+      +- UnresolvedRelation spark_catalog.default.some_table
 </code></pre></div></div>
 
 <p>The Client is responsible for creating the unresolved logical plan and passing it to the Spark Server for execution.</p>
 
@@ -197,9 +197,9 @@ GlobalLimit 5
 <p><strong>Executing the query on the Spark Server</strong></p>
 
-<p>The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and executes it just like any other query.</p>
+<p>The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and analyzes, optimizes, and executes it just like any other query.</p>
 
-<p>Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server.</p>
+<p>Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server and are independent of the client application.</p>
 
 <p>Spark Connect lets you leverage Spark’s powerful query optimization capabilities, even with Clients that don’t depend on Spark or the JVM.</p>
 
@@ -207,9 +207,9 @@ GlobalLimit 5
 <p>The Spark Server sends the results back to the Client after executing the query.</p>
 
-<p>The results are sent to the client as Apache Arrow record batches. A record batch includes many rows of data.</p>
+<p>The results are sent to the client as Apache Arrow record batches. 
A single record batch includes many rows of data.</p> -<p>The record batch is streamed to the client, which means it is sent in partial chunks, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request.</p> +<p>The full result is streamed to the client in partial chunks of record batches, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request.</p> <p>Here’s a recap of how Spark Connect works in image form:</p> @@ -221,11 +221,11 @@ GlobalLimit 5 <p><strong>Spark Connect workloads are easier to maintain</strong></p> -<p>With the Spark JVM architecture, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster [...] +<p>When you do not use Spark Connect, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the clust [...] -<p>Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. For Spark JVM, you need to update your project locally, which may be challenging. 
For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.</p> +<p>Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. When you don’t use Spark Connect, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.</p> -<p>In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain.</p> +<p>In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain. 
In particular, for pure Python workloads, decoupling Python from the Java dependency on the client improves the overall user experience with Apache Spark.</p> <p><strong>Spark Connect lets you build Spark Connect Clients in non-JVM languages</strong></p> @@ -234,7 +234,7 @@ GlobalLimit 5 <ul> <li><a href="https://github.com/apache/spark/tree/master/python/pyspark/sql/connect">Spark Connect Python</a></li> <li><a href="https://github.com/apache/spark-connect-go">Spark Connect Go</a></li> - <li><a href="https://github.com/sjrusso8/spark-connect-rs">Spark Connect Rust</a></li> + <li><a href="https://github.com/sjrusso8/spark-connect-rs">Spark Connect Rust</a> (third-party project)</li> </ul> <p>For example, the Apache Spark Connect Client for Golang, <a href="https://github.com/apache/spark-connect-go">spark-connect-go</a>, implements the Spark Connect protocol and does not rely on Java. You can use this Spark Connect Client to develop Spark applications with Go without installing Java or Spark.</p> @@ -250,13 +250,13 @@ df.Show(100, false) <p>spark-connect-go is a magnificent example of how the decoupled nature of Spark Connect allows for a better end-user experience.</p> -<p>Go isn’t the only language that will benefit from this architecture. 
The Spark Community is also building <a href="https://github.com/sjrusso8/spark-connect-rs">a Rust </a>Spark Connect Client.</p> +<p>Go isn’t the only language that will benefit from this architecture.</p> -<p><strong>Spark Connect allows for better remote development</strong></p> +<p><strong>Spark Connect allows for better remote development and testing</strong></p> <p>Spark Connect also enables you to embed Spark in text editors on remote clusters without SSH (“remote development”).</p> -<p>Embedding Spark in text editors with Classic Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver.</p> +<p>When you do not use Spark Connect, embedding Spark in text editors with Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver.</p> <p>Spark Connect lets you connect to a remote Spark Driver with a connection that’s fully embedded in a text editor without SSH. This provides users with a better experience when developing code in a text editor like VS Code on a remote Spark cluster.</p> @@ -264,15 +264,13 @@ df.Show(100, false) <p><strong>Spark Connect makes debugging easier</strong></p> -<p>Spark Connect lets you connect a text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.</p> +<p>Spark Connect lets you connect your text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. 
This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.</p> -<p>Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows.</p> +<p>Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows. By simply switching the connection string for the Spark Connect session it becomes easy to configure the client to run tests in different execution environments without deploying a complicated Spark application.</p> <p><strong>Spark Connect is more stable</strong></p> -<p>When many users access the same Spark JVM cluster, they all have to run computations on the same driver node, which can cause instability. One user may execute code that causes the driver node to fail, rendering the Spark cluster unusable for the other cluster users.</p> - -<p>Spark Connect is more stable because the requests are formatted to unresolved logical plans on the client, not in the Spark Driver. Spark Connect code that errors out will only cause the client to raise an out-of-memory exception. It won’t cause the Spark Driver to have an out-of-memory exception that takes down the cluster for all users.</p> +<p>Due to the decoupled nature of client applications leveraging Spark Connect, failures of the client are now decoupled from the Spark Driver. 
This means that when a client application fails, its failure modes are completely independent of the other applications and don’t impact the running Spark Driver, which may continue serving other client applications.</p>
 
 <h2 id="upgrading-to-spark-connect">Upgrading to Spark Connect</h2>
diff --git a/spark-connect/index.md b/spark-connect/index.md
index 77cf3d0dc8..3eb4a26eeb 100644
--- a/spark-connect/index.md
+++ b/spark-connect/index.md
@@ -6,7 +6,7 @@ description: Spark Connect makes remote Spark development easier.
 subproject: Spark Connect
 ---
 
-This post explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.
+This page explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.
 
 Let’s start by exploring the architecture of Spark Connect at a high level.
 
@@ -17,20 +17,20 @@ Spark Connect is a protocol that specifies how a client application can communic
 Here’s how Spark Connect works at a high level:
 
 1. A connection is established between the Client and Spark Server
-2. The Client converts a DataFrame query to an unresolved logical plan
+2. The Client converts a DataFrame query to an unresolved logical plan that describes the intent of the operation rather than how it should be executed
 3. The unresolved logical plan is encoded and sent to the Spark Server
-4. The Spark Server runs the query
+4. The Spark Server optimizes and runs the query
 5. The Spark Server sends the results back to the Client
 
 <img src="{{site.baseurl}}/images/spark-connect1.png" style="width: 100%; max-width: 500px;">
 
 Let’s go through these steps in more detail to get a better understanding of the inner workings of Spark Connect.
 
-**Establishing a connection between the Client and Spark Server**
+**Establishing a connection between the Client and the Spark Server**
 
 The network communication for Spark Connect uses the [gRPC framework](https://grpc.io/). 
-gRPC is performant and language agnostic. Spark Connect uses language-agnostic technologies, so it’s portable.
+gRPC is performant and language agnostic, which makes Spark Connect portable.
 
 **Converting a DataFrame query to an unresolved logical plan**
 
@@ -45,7 +45,7 @@ Here’s the unresolved logical plan for the query:
 GlobalLimit 5
 +- LocalLimit 5
    +- SubqueryAlias spark_catalog.default.some_table
-      +- Relation spark_catalog.default.some_table[character#15,franchise#16] parquet
+      +- UnresolvedRelation spark_catalog.default.some_table
 ```
 
 The Client is responsible for creating the unresolved logical plan and passing it to the Spark Server for execution.
 
@@ -60,9 +60,9 @@ Now let’s look at how the Spark Server executes the query.
 
 **Executing the query on the Spark Server**
 
-The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and executes it just like any other query.
+The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and analyzes, optimizes, and executes it just like any other query.
 
-Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server.
+Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server and are independent of the client application.
 
 Spark Connect lets you leverage Spark’s powerful query optimization capabilities, even with Clients that don’t depend on Spark or the JVM.
 
@@ -70,9 +70,9 @@ Spark Connect lets you leverage Spark’s powerful query optimization capabiliti
 
 The Spark Server sends the results back to the Client after executing the query.
 
-The results are sent to the client as Apache Arrow record batches. A record batch includes many rows of data.
+The results are sent to the client as Apache Arrow record batches. A single record batch includes many rows of data. 
-The record batch is streamed to the client, which means it is sent in partial chunks, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request. +The full result is streamed to the client in partial chunks of record batches, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request. Here’s a recap of how Spark Connect works in image form: @@ -84,11 +84,11 @@ Let’s now turn our attention to the benefits of the Spark Connect architecture **Spark Connect workloads are easier to maintain** -With the Spark JVM architecture, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster wi [...] +When you do not use Spark Connect, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster [...] -Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. For Spark JVM, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. 
If some of those JAR files don’t exist, you can’t upgrade.
+Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. When you don’t use Spark Connect, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.
 
-In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain.
+In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain. In particular, for pure Python workloads, decoupling Python from the Java dependency on the client improves the overall user experience with Apache Spark.
 
 **Spark Connect lets you build Spark Connect Clients in non-JVM languages**
 
@@ -96,7 +96,7 @@ Spark Connect decouples the client and the Spark Driver so that you can write a
 
 * [Spark Connect Python](https://github.com/apache/spark/tree/master/python/pyspark/sql/connect)
 * [Spark Connect Go](https://github.com/apache/spark-connect-go)
-* [Spark Connect Rust](https://github.com/sjrusso8/spark-connect-rs)
+* [Spark Connect Rust](https://github.com/sjrusso8/spark-connect-rs) (third-party project)
 
 For example, the Apache Spark Connect Client for Golang, [spark-connect-go](https://github.com/apache/spark-connect-go), implements the Spark Connect protocol and does not rely on Java. You can use this Spark Connect Client to develop Spark applications with Go without installing Java or Spark. 
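The round trip these clients implement — build an unresolved plan on the client, encode it, let the server resolve and execute it, and stream results back in batches — can be sketched as a toy mock in Python. This is illustrative only: the function names, the JSON encoding, and the in-memory catalog are stand-ins, not the real gRPC/Protocol Buffer implementation.

```python
import json

def build_unresolved_plan(table: str, limit: int) -> dict:
    # The client only records intent ("read this table, limit N"),
    # not how the query should be executed.
    return {"op": "limit", "n": limit,
            "child": {"op": "unresolved_relation", "table": table}}

def encode_plan(plan: dict) -> bytes:
    # Stand-in for Protocol Buffer serialization.
    return json.dumps(plan).encode("utf-8")

def server_execute(encoded: bytes, catalog: dict, batch_size: int = 2):
    # The "server" decodes the plan, resolves the relation against its
    # catalog, executes, and streams results back in partial chunks
    # (the real server streams Apache Arrow record batches).
    plan = json.loads(encoded)
    rows = catalog[plan["child"]["table"]][: plan["n"]]
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

catalog = {"some_table": [("Frank", "Donkey Kong"), ("Mario", "Super Mario"),
                          ("Peach", "Super Mario"), ("Link", "Zelda"),
                          ("Kirby", "Kirby")]}
encoded = encode_plan(build_unresolved_plan("some_table", 3))
result = [row for batch in server_execute(encoded, catalog) for row in batch]
print(result)  # first three rows, delivered in two batches
```

Note that the client never touches the execution logic: it could be rewritten in any language that can serialize the same plan, which is exactly the property the Go and Rust clients exploit.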
@@ -112,13 +112,13 @@ When `df.Show()` is invoked, spark-connect-go processes the query into an unreso spark-connect-go is a magnificent example of how the decoupled nature of Spark Connect allows for a better end-user experience. -Go isn’t the only language that will benefit from this architecture. The Spark Community is also building [a Rust ](https://github.com/sjrusso8/spark-connect-rs)Spark Connect Client. +Go isn’t the only language that will benefit from this architecture. -**Spark Connect allows for better remote development** +**Spark Connect allows for better remote development and testing** Spark Connect also enables you to embed Spark in text editors on remote clusters without SSH (“remote development”). -Embedding Spark in text editors with Classic Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver. +When you do not use Spark Connect, embedding Spark in text editors with Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver. Spark Connect lets you connect to a remote Spark Driver with a connection that’s fully embedded in a text editor without SSH. This provides users with a better experience when developing code in a text editor like VS Code on a remote Spark cluster. @@ -126,15 +126,13 @@ With Spark Connect, switching from a local Spark Session to a remote Spark Sessi **Spark Connect makes debugging easier** -Spark Connect lets you connect a text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs. +Spark Connect lets you connect your text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. 
You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.
 
-Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows.
+Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows. By simply switching the connection string for the Spark Connect session, it becomes easy to configure the client to run tests in different execution environments without deploying a complicated Spark application.
 
 **Spark Connect is more stable**
 
-When many users access the same Spark JVM cluster, they all have to run computations on the same driver node, which can cause instability. One user may execute code that causes the driver node to fail, rendering the Spark cluster unusable for the other cluster users.
-
-Spark Connect is more stable because the requests are formatted to unresolved logical plans on the client, not in the Spark Driver. Spark Connect code that errors out will only cause the client to raise an out-of-memory exception. It won’t cause the Spark Driver to have an out-of-memory exception that takes down the cluster for all users.
+Due to the decoupled nature of client applications leveraging Spark Connect, failures of the client are now decoupled from the Spark Driver. This means that when a client application fails, its failure modes are completely independent of the other applications and don’t impact the running Spark Driver, which may continue serving other client applications. 
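The "switch the connection string" workflow mentioned above can be sketched in Python. The `resolve_remote` helper below is hypothetical glue code; `sc://host:port` endpoints and the `SPARK_REMOTE` environment variable are standard Spark Connect conventions, and the commented-out session creation assumes PySpark 3.4 or later with a reachable Spark Connect server.

```python
import os

def resolve_remote(default: str = "sc://localhost:15002") -> str:
    # Tests and production can point at different Spark Connect servers
    # without changing application code: just set SPARK_REMOTE
    # (e.g. in CI) or fall back to a local development endpoint.
    return os.environ.get("SPARK_REMOTE", default)

# With PySpark 3.4+ installed, the session itself would be created with:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(resolve_remote()).getOrCreate()

print(resolve_remote())  # e.g. sc://localhost:15002 when SPARK_REMOTE is unset
```

Because only the connection string changes, the same test suite can run against a local session during development and a remote cluster in CI.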
 ## Upgrading to Spark Connect

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org