[GitHub] [spark] xinrong-meng commented on a diff in pull request #42284: [SPARK-44629][PYTHON][DOCS] Publish PySpark Test Guidelines webpage

via GitHub Fri, 04 Aug 2023 12:03:50 -0700


xinrong-meng commented on code in PR #42284:
URL: https://github.com/apache/spark/pull/42284#discussion_r1284751248



##########
python/docs/source/getting_started/testing_pyspark.ipynb:
##########
@@ -0,0 +1,457 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683",
+   "metadata": {},
+   "source": [
+    "# Testing PySpark\n",
+    "\n",
+    "This guide is a reference for writing robust tests for PySpark code.\n",
+    "\n",
+    "To view the docs for PySpark test utils, see here. To see the code for 
PySpark built-in test utils, check out the Spark repository here. To see the 
JIRA board tickets for the PySpark test framework, see here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d",
+   "metadata": {},
+   "source": [
+    "## Build a PySpark Application\n",
+    "Here is an example for how to start a PySpark application. Feel free to 
skip to the next section, “Testing your PySpark Application,” if you already 
have an application you’re ready to test.\n",
+    "\n",
+    "First, start your Spark Session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import SparkSession \n",
+    "from pyspark.sql.functions import col \n",
+    "\n",
+    "# Create a SparkSession \n",
+    "spark = SparkSession.builder.appName(\"Testing PySpark 
Example\").getOrCreate() "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4",
+   "metadata": {},
+   "source": [
+    "Next, create a DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_data = [{\"name\": \"John    D.\", \"age\": 30}, \n",
+    "  {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+    "  {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+    "  {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+    "\n",
+    "df = spark.createDataFrame(sample_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63",
+   "metadata": {},
+   "source": [
+    "Now, let’s define and apply a transformation function to our DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---+--------+\n",
+      "|age|    name|\n",
+      "+---+--------+\n",
+      "| 30| John D.|\n",
+      "| 25|Alice G.|\n",
+      "| 35|  Bob T.|\n",
+      "| 28|  Eve A.|\n",
+      "+---+--------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyspark.sql.functions import col, regexp_replace\n",
+    "\n",
+    "# Remove additional spaces in name\n",
+    "def remove_extra_spaces(df, column_name):\n",
+    "    # Remove extra spaces from the specified column\n",
+    "    df_transformed = df.withColumn(column_name, 
regexp_replace(col(column_name), \"\\\\s+\", \" \"))\n",
+    "    \n",
+    "    return df_transformed\n",
+    "\n",
+    "transformed_df = remove_extra_spaces(df, \"name\")\n",
+    "\n",
+    "transformed_df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c471be1b-c052-4f31-abc9-35668aebc9c1",
+   "metadata": {},
+   "source": [
+    "You can also do this using Spark Connect. Please reference ‘Live 
Notebook: Spark Connect’ for examples.\n",
+    "For more information on how to use Spark Connect and its benefits, see: 
https://spark.apache.org/docs/latest/spark-connect-overview.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "530beaa6-aabf-43a1-ad2b-361f267e9608",
+   "metadata": {},
+   "source": [
+    "## Testing your PySpark Application\n",
+    "Now let’s test our PySpark transformation function. \n",
+    "\n",
+    "One option is to simply eyeball the resulting DataFrame. However, this 
can be impractical for large DataFrames or large inputs.\n",
+    "\n",
+    "A better way is to write tests. Here are some examples of how we can test 
our code.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d84a9fc1-9768-4af4-bfbf-e832f23334dc",
+   "metadata": {},
+   "source": [
+    "### Option 1: No test framework\n",
+    "\n",
+    "For simple ad-hoc validation cases, PySpark testing utils like 
assertDataFrameEqual and assertSchemaEqual can also be used in a standalone 
context.\n",

Review Comment:
   nit: Shall we render `assertDataFrameEqual` and `assertSchemaEqual` in code 
font (backticks) so they stand out as API names?
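
   For readers who want to sanity-check the whitespace pattern from the diff without starting a Spark session, here is a minimal plain-Python sketch. It is not part of the PR: `collapse_spaces` is a hypothetical helper, and it uses Python's `re` module rather than Spark's `regexp_replace`. It applies the same `\s+` to single-space substitution to the sample names in the notebook and checks the result with a bare `assert`, in the spirit of the "no test framework" option.

```python
import re


def collapse_spaces(name: str) -> str:
    # Hypothetical helper (not in the PR): replace each run of whitespace
    # with a single space, the same substitution the notebook passes to
    # regexp_replace.
    return re.sub(r"\s+", " ", name)


# Sample names copied from the notebook cell in the diff.
sample_names = ["John    D.", "Alice   G.", "Bob  T.", "Eve   A."]
cleaned = [collapse_spaces(n) for n in sample_names]

# Ad-hoc validation with a bare assert; no test framework required.
assert cleaned == ["John D.", "Alice G.", "Bob T.", "Eve A."]
print(cleaned)
```

   The cleaned names match the `transformed_df.show()` output captured in the notebook cell above.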



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

