srowen commented on code in PR #511:
URL: https://github.com/apache/spark-website/pull/511#discussion_r1549628247


##########
spark-connect/index.md:
##########
@@ -0,0 +1,151 @@
+---
+layout: global
+type: "page singular"
+title: Spark Connect
+description: Spark Connect makes remote Spark development easier.
+subproject: Spark Connect
+---
+
+This post explains the Spark Connect architecture, the benefits of Spark 
Connect, and how to upgrade to Spark Connect.
+
+Let’s start by exploring the architecture of Spark Connect at a high level.
+
+## High-level Spark Connect architecture
+
+Spark Connect is a protocol that specifies how a client application can communicate with a remote Spark Server.  Clients that implement the Spark Connect protocol can connect and make requests to remote Spark Servers, much like client applications can connect to databases using a JDBC driver: a query such as `spark.table("some_table").limit(5)` simply returns the result, with the heavy lifting done remotely.  This architecture gives end users a great developer experience.
+
+Here’s how Spark Connect works at a high level:
+
+1. A connection is established between the Client and Spark Server
+2. The Client converts a DataFrame query to an unresolved logical plan
+3. The unresolved logical plan is encoded and sent to the Spark Server
+4. The Spark Server runs the query
+5. The Spark Server sends the results back to the Client
+
+<img src="{{site.baseurl}}/images/spark-connect1.png" style="width: 100%; 
max-width: 500px;">
+
+Let’s go through these steps in more detail to get a better understanding of 
the inner workings of Spark Connect.
+
+**Establishing a connection between the Client and Spark Server**
+
+The network communication for Spark Connect uses the [gRPC 
framework](https://grpc.io/).
+
+gRPC is performant and language-agnostic, which makes Spark Connect portable: any language with gRPC support can implement the protocol.
+
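+For example, here’s a minimal sketch of opening a connection with the PySpark Spark Connect client.  It assumes a Spark Connect server is already listening on `localhost:15002` (the default port):
+
+```python
+from pyspark.sql import SparkSession
+
+# Connect to a remote Spark Connect server over gRPC.
+# The "sc://localhost:15002" address is an assumption for a locally running server.
+spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
+
+spark.range(5).show()  # the query runs on the server, not in this process
+```
+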
+**Converting a DataFrame query to an unresolved logical plan**
+
+The Client parses DataFrame queries and converts them to unresolved logical 
plans.
+
+Suppose you have the following DataFrame query: 
`spark.table("some_table").limit(5)`.
+
+Here’s the unresolved logical plan for the query: 
+
+```
+== Parsed Logical Plan ==
+GlobalLimit 5
++- LocalLimit 5
+   +- SubqueryAlias spark_catalog.default.some_table
+      +- Relation spark_catalog.default.some_table[character#15,franchise#16] 
parquet
+```
+
+The Client is responsible for creating the unresolved logical plan and passing 
it to the Spark Server for execution.
+
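+As a quick sketch with the PySpark client (assuming `some_table` exists on the server), you can print these plans yourself with `explain()`; the parsed logical plan above is the first section of the output:
+
+```python
+df = spark.table("some_table").limit(5)
+
+# Prints the parsed (unresolved) logical plan, followed by the analyzed,
+# optimized, and physical plans.
+df.explain(extended=True)
+```
+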
+**Sending the unresolved logical plan to the Spark Server**
+
+The unresolved logical plan must be serialized so it can be sent over a 
network.  Spark Connect uses Protocol Buffers, which are “language-neutral, 
platform-neutral extensible mechanisms for serializing structured data”.
+
+The Client and the Spark Server must be able to communicate with a 
language-neutral format like Protocol Buffers because they might be using 
different programming languages or different software versions.
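+
+As an illustration, here’s a rough sketch of what the Client builds under the hood for `spark.table("some_table").limit(5)`, using the protobuf definitions bundled with PySpark.  Note that `pyspark.sql.connect.proto` is an internal module, shown here only to make the wire format concrete; applications never construct these messages by hand:
+
+```python
+from pyspark.sql.connect import proto
+
+# An unresolved logical plan: LIMIT 5 over a named table.
+plan = proto.Relation()
+plan.limit.limit = 5
+plan.limit.input.read.named_table.unparsed_identifier = "some_table"
+
+# Protocol Buffers serialize the plan to compact bytes for the gRPC request.
+print(len(plan.SerializeToString()), "bytes")
+```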
+
+Now let’s look at how the Spark Server executes the query. 
+
+**Executing the query on the Spark Server**
+
+The Spark Server deserializes the Protocol Buffer it receives back into an unresolved logical plan and executes the plan just like any other query.
+
+Spark applies many optimizations to the unresolved logical plan before executing the query.  All of these optimizations happen on the Spark Server.
+
+Spark Connect lets you leverage Spark’s powerful query optimization 
capabilities, even with Clients that don’t depend on Spark or the JVM.
+
+**Sending the results back to the Client**
+
+The Spark Server sends the results back to the Client after executing the 
query.
+
+The results are sent to the Client as Apache Arrow record batches.  A record batch contains many rows of data in a column-oriented format.
+
+The record batches are streamed to the Client, which means they are sent in partial chunks, not all at once.  Streaming the results from the Spark Server to the Client prevents memory issues caused by an excessively large result set.
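+
+The snippet below is a standalone sketch of the streaming idea using plain `pyarrow` (not Spark Connect internals): record batches are written to an Arrow IPC stream and read back one batch at a time, so the consumer never needs the entire result set in memory at once.  The schema mirrors the `some_table` example above:
+
+```python
+import pyarrow as pa
+
+schema = pa.schema([("character", pa.string()), ("franchise", pa.string())])
+batch = pa.record_batch(
+    [pa.array(["Mario", "Link"]), pa.array(["Nintendo", "Nintendo"])],
+    schema=schema,
+)
+
+# Write the batch to an in-memory Arrow IPC stream.
+sink = pa.BufferOutputStream()
+with pa.ipc.new_stream(sink, schema) as writer:
+    writer.write_batch(batch)
+
+# Read the stream back incrementally, one record batch at a time.
+with pa.ipc.open_stream(sink.getvalue()) as reader:
+    for b in reader:
+        print(b.num_rows, "rows:", b.to_pydict())
+```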
+
+Here’s a visual recap of how Spark Connect works:
+
+<img src="{{site.baseurl}}/images/spark-connect2.png" style="width: 100%; 
max-width: 500px;">
+
+## Benefits of Spark Connect
+
+Let’s now turn our attention to the benefits of the Spark Connect architecture.
+
+**Spark Connect workloads are easier to maintain**
+
+With the Spark JVM architecture, the client and Spark Driver must run 
identical software versions.  They need the same Java, Scala, and other 
dependency versions.  Suppose you develop a Spark project on your local 
machine, package it as a JAR file, and deploy it in the cloud to run on a 
production dataset.  You need to build the JAR file on your local machine with 
the same dependencies used in the cloud.  If you compile the JAR file with 
Scala 2.13, you must also provision the cluster with a Spark JAR compiled with 
Scala 2.13.
+
+Suppose you are building your JAR with Scala 2.12, and your cloud provider 
releases a new runtime built with Scala 2.13.  For Spark JVM, you need to 
update your project locally, which may be challenging.  For example, when you 
update your project to Scala 2.13, you must also upgrade all the project 
dependencies (and transitive dependencies) to Scala 2.13.  If some of those JAR 
files don’t exist, you can’t upgrade.
+
+In contrast, Spark Connect decouples the client and the Spark Driver, so you can update the Spark Driver, including its server-side dependencies, without updating the client.  This makes Spark projects much easier to maintain.
+
+**Spark Connect lets you build Spark Connect Clients in non-JVM languages**
+
+Spark Connect decouples the client and the Spark Driver so that you can write 
a Spark Connect Client in any language.  Here are some Spark Connect Clients 
that don’t depend on Java/Scala:
+
+* [Spark Connect 
Python](https://github.com/apache/spark/tree/master/python/pyspark/sql/connect)
+* [Spark Connect Go](https://github.com/apache/spark-connect-go)
+* [Spark Connect Rust](https://github.com/sjrusso8/spark-connect-rs)
+
+For example, the Apache Spark Connect Client for Golang, 
[spark-connect-go](https://github.com/apache/spark-connect-go), implements the 
Spark Connect protocol and does not rely on Java.  You can use this Spark 
Connect Client to develop Spark applications with Go without installing Java or 
Spark.
+
+Here’s how to execute a query with the Go programming language using 
spark-connect-go:
+
+```go
+// Assumes a Spark Connect server is reachable at this address.
+remote := "sc://localhost:15002"
+spark, _ := sql.SparkSession.Builder.Remote(remote).Build()
+df, _ := spark.Sql("select * from my_cool_table where age > 42")
+df.Show(100, false)
+```
+
+When `df.Show()` is invoked, spark-connect-go processes the query into an 
unresolved logical plan and sends it to the Spark Driver for execution.
+
+spark-connect-go is a compelling example of how the decoupled nature of Spark Connect allows for a better end-user experience.
+
+Go isn’t the only language that will benefit from this architecture.  The Spark Community is also building a [Rust Spark Connect Client](https://github.com/sjrusso8/spark-connect-rs).

Review Comment:
   If anything, remove reference to "Spark Community" here to avoid implying 
it's part of the project.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

