This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new d60afe5997 Update spark connect page based on pr feedback (#512)
d60afe5997 is described below

commit d60afe59979f3a8b728e4838acb624a9d3b37722
Author: Matthew Powers <matthewkevinpow...@gmail.com>
AuthorDate: Thu Apr 4 16:40:02 2024 -0400

    Update spark connect page based on pr feedback (#512)

    Addresses the comments from this PR:
    https://github.com/apache/spark-website/pull/511

    Also rewords some of the language.
---
 _layouts/home.html            |  1 +
 site/index.html               |  1 +
 site/spark-connect/index.html | 42 ++++++++++++++++++++----------------------
 spark-connect/index.md        | 42 ++++++++++++++++++++----------------------
 4 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/_layouts/home.html b/_layouts/home.html
index 8bae9e1680..ad8f255b7c 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -73,6 +73,7 @@
         </a>
         <ul class="dropdown-menu" aria-labelledby="libraries">
           <li><a class="dropdown-item" href="{{site.baseurl}}/sql/">SQL and DataFrames</a></li>
+          <li><a class="dropdown-item" href="{{site.baseurl}}/spark-connect/">Spark Connect</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/streaming/">Spark Streaming</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/mllib/">MLlib (machine learning)</a></li>
           <li><a class="dropdown-item" href="{{site.baseurl}}/graphx/">GraphX (graph)</a></li>
diff --git a/site/index.html b/site/index.html
index d2abd82fb3..3e6072f1d7 100644
--- a/site/index.html
+++ b/site/index.html
@@ -69,6 +69,7 @@
         </a>
         <ul class="dropdown-menu" aria-labelledby="libraries">
           <li><a class="dropdown-item" href="/sql/">SQL and DataFrames</a></li>
+          <li><a class="dropdown-item" href="/spark-connect/">Spark Connect</a></li>
           <li><a class="dropdown-item" href="/streaming/">Spark Streaming</a></li>
           <li><a class="dropdown-item" href="/mllib/">MLlib (machine learning)</a></li>
           <li><a class="dropdown-item" href="/graphx/">GraphX (graph)</a></li>
diff --git 
a/site/spark-connect/index.html b/site/spark-connect/index.html index 6dc75abe37..1877c0f32d 100644 --- a/site/spark-connect/index.html +++ b/site/spark-connect/index.html @@ -142,7 +142,7 @@ <div class="container"> <div class="row mt-4"> <div class="col-12 col-md-9"> - <p>This post explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.</p> + <p>This page explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.</p> <p>Let’s start by exploring the architecture of Spark Connect at a high level.</p> @@ -154,9 +154,9 @@ <ol> <li>A connection is established between the Client and Spark Server</li> - <li>The Client converts a DataFrame query to an unresolved logical plan</li> + <li>The Client converts a DataFrame query to an unresolved logical plan that describes the intent of the operation rather than how it should be executed</li> <li>The unresolved logical plan is encoded and sent to the Spark Server</li> - <li>The Spark Server runs the query</li> + <li>The Spark Server optimizes and runs the query</li> <li>The Spark Server sends the results back to the Client</li> </ol> @@ -164,11 +164,11 @@ <p>Let’s go through these steps in more detail to get a better understanding of the inner workings of Spark Connect.</p> -<p><strong>Establishing a connection between the Client and Spark Server</strong></p> +<p><strong>Establishing a connection between the Client and the Spark Server</strong></p> <p>The network communication for Spark Connect uses the <a href="https://grpc.io/">gRPC framework</a>.</p> -<p>gRPC is performant and language agnostic. 
Spark Connect uses language-agnostic technologies, so it’s portable.</p>
+<p>gRPC is performant and language agnostic, which makes Spark Connect portable.</p>
 
 <p><strong>Converting a DataFrame query to an unresolved logical plan</strong></p>
 
@@ -182,7 +182,7 @@ GlobalLimit 5
 +- LocalLimit 5
    +- SubqueryAlias spark_catalog.default.some_table
-      +- Relation spark_catalog.default.some_table[character#15,franchise#16] parquet
+      +- UnresolvedRelation spark_catalog.default.some_table
 </code></pre></div></div>
 
 <p>The Client is responsible for creating the unresolved logical plan and passing it to the Spark Server for execution.</p>
 
@@ -197,9 +197,9 @@ GlobalLimit 5
 <p><strong>Executing the query on the Spark Server</strong></p>
 
-<p>The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and executes it just like any other query.</p>
+<p>The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and analyzes, optimizes, and executes it just like any other query.</p>
 
-<p>Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server.</p>
+<p>Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server and are independent of the client application.</p>
 
 <p>Spark Connect lets you leverage Spark’s powerful query optimization capabilities, even with Clients that don’t depend on Spark or the JVM.</p>
 
@@ -207,9 +207,9 @@ GlobalLimit 5
 <p>The Spark Server sends the results back to the Client after executing the query.</p>
 
-<p>The results are sent to the client as Apache Arrow record batches. A record batch includes many rows of data.</p>
+<p>The results are sent to the client as Apache Arrow record batches. 
A single record batch includes many rows of data.</p> -<p>The record batch is streamed to the client, which means it is sent in partial chunks, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request.</p> +<p>The full result is streamed to the client in partial chunks of record batches, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request.</p> <p>Here’s a recap of how Spark Connect works in image form:</p> @@ -221,11 +221,11 @@ GlobalLimit 5 <p><strong>Spark Connect workloads are easier to maintain</strong></p> -<p>With the Spark JVM architecture, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster [...] +<p>When you do not use Spark Connect, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the clust [...] -<p>Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. For Spark JVM, you need to update your project locally, which may be challenging. 
For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.</p> +<p>Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. When you don’t use Spark Connect, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.</p> -<p>In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain.</p> +<p>In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain. 
In particular, for pure Python workloads, decoupling Python from the Java dependency on the client improves the overall user experience with Apache Spark.</p> <p><strong>Spark Connect lets you build Spark Connect Clients in non-JVM languages</strong></p> @@ -234,7 +234,7 @@ GlobalLimit 5 <ul> <li><a href="https://github.com/apache/spark/tree/master/python/pyspark/sql/connect">Spark Connect Python</a></li> <li><a href="https://github.com/apache/spark-connect-go">Spark Connect Go</a></li> - <li><a href="https://github.com/sjrusso8/spark-connect-rs">Spark Connect Rust</a></li> + <li><a href="https://github.com/sjrusso8/spark-connect-rs">Spark Connect Rust</a> (third-party project)</li> </ul> <p>For example, the Apache Spark Connect Client for Golang, <a href="https://github.com/apache/spark-connect-go">spark-connect-go</a>, implements the Spark Connect protocol and does not rely on Java. You can use this Spark Connect Client to develop Spark applications with Go without installing Java or Spark.</p> @@ -250,13 +250,13 @@ df.Show(100, false) <p>spark-connect-go is a magnificent example of how the decoupled nature of Spark Connect allows for a better end-user experience.</p> -<p>Go isn’t the only language that will benefit from this architecture. 
The Spark Community is also building <a href="https://github.com/sjrusso8/spark-connect-rs">a Rust </a>Spark Connect Client.</p> +<p>Go isn’t the only language that will benefit from this architecture.</p> -<p><strong>Spark Connect allows for better remote development</strong></p> +<p><strong>Spark Connect allows for better remote development and testing</strong></p> <p>Spark Connect also enables you to embed Spark in text editors on remote clusters without SSH (“remote development”).</p> -<p>Embedding Spark in text editors with Classic Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver.</p> +<p>When you do not use Spark Connect, embedding Spark in text editors with Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver.</p> <p>Spark Connect lets you connect to a remote Spark Driver with a connection that’s fully embedded in a text editor without SSH. This provides users with a better experience when developing code in a text editor like VS Code on a remote Spark cluster.</p> @@ -264,15 +264,13 @@ df.Show(100, false) <p><strong>Spark Connect makes debugging easier</strong></p> -<p>Spark Connect lets you connect a text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.</p> +<p>Spark Connect lets you connect your text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. 
This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.</p> -<p>Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows.</p> +<p>Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows. By simply switching the connection string for the Spark Connect session it becomes easy to configure the client to run tests in different execution environments without deploying a complicated Spark application.</p> <p><strong>Spark Connect is more stable</strong></p> -<p>When many users access the same Spark JVM cluster, they all have to run computations on the same driver node, which can cause instability. One user may execute code that causes the driver node to fail, rendering the Spark cluster unusable for the other cluster users.</p> - -<p>Spark Connect is more stable because the requests are formatted to unresolved logical plans on the client, not in the Spark Driver. Spark Connect code that errors out will only cause the client to raise an out-of-memory exception. It won’t cause the Spark Driver to have an out-of-memory exception that takes down the cluster for all users.</p> +<p>Due to the decoupled nature of client applications leveraging Spark Connect, failures of the client are now decoupled from the Spark Driver. 
This means that when a client application fails, its failure modes are completely independent of the other applications and don’t impact the running Spark Driver, which may continue serving other client applications.</p>
 
 <h2 id="upgrading-to-spark-connect">Upgrading to Spark Connect</h2>
diff --git a/spark-connect/index.md b/spark-connect/index.md
index 77cf3d0dc8..3eb4a26eeb 100644
--- a/spark-connect/index.md
+++ b/spark-connect/index.md
@@ -6,7 +6,7 @@ description: Spark Connect makes remote Spark development easier.
 subproject: Spark Connect
 ---
 
-This post explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.
+This page explains the Spark Connect architecture, the benefits of Spark Connect, and how to upgrade to Spark Connect.
 
 Let’s start by exploring the architecture of Spark Connect at a high level.
 
@@ -17,20 +17,20 @@ Spark Connect is a protocol that specifies how a client application can communic
 Here’s how Spark Connect works at a high level:
 
 1. A connection is established between the Client and Spark Server
-2. The Client converts a DataFrame query to an unresolved logical plan
+2. The Client converts a DataFrame query to an unresolved logical plan that describes the intent of the operation rather than how it should be executed
 3. The unresolved logical plan is encoded and sent to the Spark Server
-4. The Spark Server runs the query
+4. The Spark Server optimizes and runs the query
 5. The Spark Server sends the results back to the Client
 
 <img src="{{site.baseurl}}/images/spark-connect1.png" style="width: 100%; max-width: 500px;">
 
 Let’s go through these steps in more detail to get a better understanding of the inner workings of Spark Connect.
 
-**Establishing a connection between the Client and Spark Server**
+**Establishing a connection between the Client and the Spark Server**
 
 The network communication for Spark Connect uses the [gRPC framework](https://grpc.io/). 
-gRPC is performant and language agnostic. Spark Connect uses language-agnostic technologies, so it’s portable.
+gRPC is performant and language agnostic, which makes Spark Connect portable.
 
 **Converting a DataFrame query to an unresolved logical plan**
 
@@ -45,7 +45,7 @@ Here’s the unresolved logical plan for the query:
 GlobalLimit 5
 +- LocalLimit 5
    +- SubqueryAlias spark_catalog.default.some_table
-      +- Relation spark_catalog.default.some_table[character#15,franchise#16] parquet
+      +- UnresolvedRelation spark_catalog.default.some_table
 ```
 
 The Client is responsible for creating the unresolved logical plan and passing it to the Spark Server for execution.
 
@@ -60,9 +60,9 @@ Now let’s look at how the Spark Server executes the query.
 
 **Executing the query on the Spark Server**
 
-The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and executes it just like any other query.
+The Spark Server receives the unresolved logical plan (once the Protocol Buffer is deserialized) and analyzes, optimizes, and executes it just like any other query.
 
-Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server.
+Spark performs many optimizations to an unresolved logical plan before executing the query. All of these optimizations happen on the Spark Server and are independent of the client application.
 
 Spark Connect lets you leverage Spark’s powerful query optimization capabilities, even with Clients that don’t depend on Spark or the JVM.
 
@@ -70,9 +70,9 @@ Spark Connect lets you leverage Spark’s powerful query optimization capabiliti
 
 The Spark Server sends the results back to the Client after executing the query.
 
-The results are sent to the client as Apache Arrow record batches. A record batch includes many rows of data.
+The results are sent to the client as Apache Arrow record batches. A single record batch includes many rows of data. 
-The record batch is streamed to the client, which means it is sent in partial chunks, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request. +The full result is streamed to the client in partial chunks of record batches, not all at once. Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large request. Here’s a recap of how Spark Connect works in image form: @@ -84,11 +84,11 @@ Let’s now turn our attention to the benefits of the Spark Connect architecture **Spark Connect workloads are easier to maintain** -With the Spark JVM architecture, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster wi [...] +When you do not use Spark Connect, the client and Spark Driver must run identical software versions. They need the same Java, Scala, and other dependency versions. Suppose you develop a Spark project on your local machine, package it as a JAR file, and deploy it in the cloud to run on a production dataset. You need to build the JAR file on your local machine with the same dependencies used in the cloud. If you compile the JAR file with Scala 2.13, you must also provision the cluster [...] -Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. For Spark JVM, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. 
If some of those JAR files don’t exist, you can’t upgrade.
+Suppose you are building your JAR with Scala 2.12, and your cloud provider releases a new runtime built with Scala 2.13. When you don’t use Spark Connect, you need to update your project locally, which may be challenging. For example, when you update your project to Scala 2.13, you must also upgrade all the project dependencies (and transitive dependencies) to Scala 2.13. If some of those JAR files don’t exist, you can’t upgrade.
 
-In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain.
+In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver including server-side dependencies without updating the client. This makes Spark projects much easier to maintain. In particular, for pure Python workloads, decoupling Python from the Java dependency on the client improves the overall user experience with Apache Spark.
 
 **Spark Connect lets you build Spark Connect Clients in non-JVM languages**
 
@@ -96,7 +96,7 @@ Spark Connect decouples the client and the Spark Driver so that you can write a
 
 * [Spark Connect Python](https://github.com/apache/spark/tree/master/python/pyspark/sql/connect)
 * [Spark Connect Go](https://github.com/apache/spark-connect-go)
-* [Spark Connect Rust](https://github.com/sjrusso8/spark-connect-rs)
+* [Spark Connect Rust](https://github.com/sjrusso8/spark-connect-rs) (third-party project)
 
 For example, the Apache Spark Connect Client for Golang, [spark-connect-go](https://github.com/apache/spark-connect-go), implements the Spark Connect protocol and does not rely on Java. You can use this Spark Connect Client to develop Spark applications with Go without installing Java or Spark. 
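The round trip these clients implement — build an unresolved plan on the client, encode it, let the server resolve and execute it, and stream results back in batches — can be sketched as a toy mock in Python. This is illustrative only: the function names, the JSON encoding, and the in-memory catalog are stand-ins, not the real gRPC/Protocol Buffer implementation.

```python
import json

def build_unresolved_plan(table: str, limit: int) -> dict:
    # The client only records intent ("read this table, limit N"),
    # not how the query should be executed.
    return {"op": "limit", "n": limit,
            "child": {"op": "unresolved_relation", "table": table}}

def encode_plan(plan: dict) -> bytes:
    # Stand-in for Protocol Buffer serialization.
    return json.dumps(plan).encode("utf-8")

def server_execute(encoded: bytes, catalog: dict, batch_size: int = 2):
    # The "server" decodes the plan, resolves the relation against its
    # catalog, executes, and streams results back in partial chunks
    # (the real server streams Apache Arrow record batches).
    plan = json.loads(encoded)
    rows = catalog[plan["child"]["table"]][: plan["n"]]
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

catalog = {"some_table": [("Frank", "Donkey Kong"), ("Mario", "Super Mario"),
                          ("Peach", "Super Mario"), ("Link", "Zelda"),
                          ("Kirby", "Kirby")]}
encoded = encode_plan(build_unresolved_plan("some_table", 3))
result = [row for batch in server_execute(encoded, catalog) for row in batch]
print(result)  # first three rows, delivered in two batches
```

Note that the client never touches the execution logic: it could be rewritten in any language that can serialize the same plan, which is exactly the property the Go and Rust clients exploit.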
@@ -112,13 +112,13 @@ When `df.Show()` is invoked, spark-connect-go processes the query into an unreso spark-connect-go is a magnificent example of how the decoupled nature of Spark Connect allows for a better end-user experience. -Go isn’t the only language that will benefit from this architecture. The Spark Community is also building [a Rust ](https://github.com/sjrusso8/spark-connect-rs)Spark Connect Client. +Go isn’t the only language that will benefit from this architecture. -**Spark Connect allows for better remote development** +**Spark Connect allows for better remote development and testing** Spark Connect also enables you to embed Spark in text editors on remote clusters without SSH (“remote development”). -Embedding Spark in text editors with Classic Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver. +When you do not use Spark Connect, embedding Spark in text editors with Spark requires a Spark Session running locally or an SSH connection to a remote Spark Driver. Spark Connect lets you connect to a remote Spark Driver with a connection that’s fully embedded in a text editor without SSH. This provides users with a better experience when developing code in a text editor like VS Code on a remote Spark cluster. @@ -126,15 +126,13 @@ With Spark Connect, switching from a local Spark Session to a remote Spark Sessi **Spark Connect makes debugging easier** -Spark Connect lets you connect a text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs. +Spark Connect lets you connect your text editor like IntelliJ to a remote Spark cluster and step through your code with a debugger. 
You can debug an application running on a production dataset, just like you would for a test dataset on your local machine. This gives you a great developer experience, especially when you want to leverage high-quality debugging tools built into IDEs.
 
-Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows.
+Spark JVM does not allow for this debugging experience because it does not fully integrate with text editors. Spark Connect allows you to build tight integrations in text editors with the wonderful debugging experience for remote Spark workflows. By simply switching the connection string for the Spark Connect session, it becomes easy to configure the client to run tests in different execution environments without deploying a complicated Spark application.
 
 **Spark Connect is more stable**
 
-When many users access the same Spark JVM cluster, they all have to run computations on the same driver node, which can cause instability. One user may execute code that causes the driver node to fail, rendering the Spark cluster unusable for the other cluster users.
-
-Spark Connect is more stable because the requests are formatted to unresolved logical plans on the client, not in the Spark Driver. Spark Connect code that errors out will only cause the client to raise an out-of-memory exception. It won’t cause the Spark Driver to have an out-of-memory exception that takes down the cluster for all users.
+Due to the decoupled nature of client applications leveraging Spark Connect, failures of the client are now decoupled from the Spark Driver. This means that when a client application fails, its failure modes are completely independent of the other applications and don’t impact the running Spark Driver, which may continue serving other client applications. 
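The "switch the connection string" workflow mentioned above can be sketched in Python. The `resolve_remote` helper below is hypothetical glue code; `sc://host:port` endpoints and the `SPARK_REMOTE` environment variable are standard Spark Connect conventions, and the commented-out session creation assumes PySpark 3.4 or later with a reachable Spark Connect server.

```python
import os

def resolve_remote(default: str = "sc://localhost:15002") -> str:
    # Tests and production can point at different Spark Connect servers
    # without changing application code: just set SPARK_REMOTE
    # (e.g. in CI) or fall back to a local development endpoint.
    return os.environ.get("SPARK_REMOTE", default)

# With PySpark 3.4+ installed, the session itself would be created with:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(resolve_remote()).getOrCreate()

print(resolve_remote())  # e.g. sc://localhost:15002 when SPARK_REMOTE is unset
```

Because only the connection string changes, the same test suite can run against a local session during development and a remote cluster in CI.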
 ## Upgrading to Spark Connect

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org