HyukjinKwon commented on code in PR #42284:
URL: https://github.com/apache/spark/pull/42284#discussion_r1283880817


##########
python/docs/source/getting_started/testing_pyspark.ipynb:
##########
@@ -0,0 +1,525 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683",
+   "metadata": {},
+   "source": [
+    "# Testing PySpark\n",
+    "\n",
+    "This guide is a reference for writing robust tests for PySpark code.\n",
+    "\n",
+    "To view the docs for PySpark test utils, see here. To see the code for 
PySpark built-in test utils, check out the Spark repository here. To see the 
JIRA board tickets for the PySpark test framework, see here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d",
+   "metadata": {},
+   "source": [
+    "## Build a PySpark Application\n",
+    "Here is an example for how to start a PySpark application. Feel free to 
skip to the next section, “Testing your PySpark Application,” if you already 
have an application you’re ready to test.\n",
+    "\n",
+    "First, start your Spark Session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "64e82e7c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'4.0.0.dev0'"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pyspark\n",
+    "pyspark.__version__"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Setting default log level to \"WARN\".\n",
+      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).\n",
+      "23/08/01 19:05:21 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyspark.sql import SparkSession \n",
+    "from pyspark.sql.functions import col \n",
+    "\n",
+    "# Create a SparkSession \n",
+    "spark = SparkSession.builder.appName(\"Testing PySpark 
Example\").getOrCreate() "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4",
+   "metadata": {},
+   "source": [
+    "Next, create a DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_data = [{\"name\": \"John    D.\", \"age\": 30}, \n",
+    "  {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+    "  {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+    "  {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+    "\n",
+    "df = spark.createDataFrame(sample_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63",
+   "metadata": {},
+   "source": [
+    "Now, let’s define and apply a transformation function to our DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                        
        \r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+---+--------+\n",
+      "|age|    name|\n",
+      "+---+--------+\n",
+      "| 30| John D.|\n",
+      "| 25|Alice G.|\n",
+      "| 35|  Bob T.|\n",
+      "| 28|  Eve A.|\n",
+      "+---+--------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyspark.sql.functions import col, regexp_replace\n",
+    "\n",
+    "# Remove additional spaces in name\n",
+    "def remove_extra_spaces(df, column_name):\n",
+    "    # Remove extra spaces from the specified column\n",
+    "    df_transformed = df.withColumn(column_name, 
regexp_replace(col(column_name), \"\\\\s+\", \" \"))\n",
+    "    \n",
+    "    return df_transformed\n",
+    "\n",
+    "transformed_df = remove_extra_spaces(df, \"name\")\n",
+    "\n",
+    "transformed_df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c471be1b-c052-4f31-abc9-35668aebc9c1",
+   "metadata": {},
+   "source": [
+    "You can also do this using Spark Connect. The only difference is using 
remote() when you create your SparkSession, for example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ba5f56e-dbfb-442f-a9ac-e0d0ad2d0931",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spark = SparkSession.builder.remote(\"sc://localhost\").appName(\"Sample 
PySpark ETL\").getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7ad40cf4-f1fb-460d-a3b1-68fb6a2ad9f6",
+   "metadata": {},
+   "source": [
+    "For more information on how to use Spark Connect and its benefits, see: 
https://spark.apache.org/docs/latest/spark-connect-overview.html";
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "530beaa6-aabf-43a1-ad2b-361f267e9608",
+   "metadata": {},
+   "source": [
+    "## Testing your PySpark Application\n",
+    "Now let’s test our PySpark transformation function. \n",
+    "\n",
+    "One option is to simply eyeball the resulting DataFrame. However, this 
can be impractical for large DataFrame or input sizes.\n",
+    "\n",
+    "A better way is to write tests. Here are some examples of how we can test 
our code.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d84a9fc1-9768-4af4-bfbf-e832f23334dc",
+   "metadata": {},
+   "source": [
+    "### Option 1: No test framework\n",
+    "\n",
+    "For simple ad-hoc validation cases, PySpark testing utils like 
assertDataFrameEqual and assertSchemaEqual can also be used in a standalone 
context.\n",
+    "You could easily test PySpark code in a notebook session. For example, 
say you want to assert equality between two DataFrames:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8e533732-ee40-4cd0-9669-8eb92973908a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/amanda.liu/Documents/Databricks/spark/python/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n",
+      "  warnings.warn(\n"
+     ]
+    }
+   ],
+   "source": [
+    "import pyspark.testing\n",
+    "\n",
+    "from pyspark.testing.utils import assertDataFrameEqual\n",
+    "\n",
+    "# Example 1\n",
+    "df1 = spark.createDataFrame(data=[(\"1\", 1000), (\"2\", 3000)], 
schema=[\"id\", \"amount\"])\n",
+    "df2 = spark.createDataFrame(data=[(\"1\", 1000), (\"2\", 3000)], 
schema=[\"id\", \"amount\"])\n",
+    "assertDataFrameEqual(df1, df2)  # pass, DataFrames are identical\n",
+    "\n",
+    "# Example 2\n",
+    "df1 = spark.createDataFrame(data=[(\"1\", 0.1), (\"2\", 3.23)], 
schema=[\"id\", \"amount\"])\n",
+    "df2 = spark.createDataFrame(data=[(\"1\", 0.109), (\"2\", 3.23)], 
schema=[\"id\", \"amount\"])\n",
+    "assertDataFrameEqual(df1, df2, rtol=1e-1)  # pass, DataFrames are approx 
equal by rtol"
+   ]
+  },
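+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the DataFrames do not match, assertDataFrameEqual raises an error describing the mismatch. As a quick, hypothetical sketch (the df_actual and df_expected names are only for illustration), you can catch and inspect that error:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hypothetical example: two DataFrames that intentionally differ in one row\n",
+    "df_actual = spark.createDataFrame(data=[(\"1\", 1000), (\"2\", 3000)], schema=[\"id\", \"amount\"])\n",
+    "df_expected = spark.createDataFrame(data=[(\"1\", 1000), (\"2\", 4000)], schema=[\"id\", \"amount\"])\n",
+    "\n",
+    "try:\n",
+    "    assertDataFrameEqual(df_actual, df_expected)\n",
+    "except Exception as e:\n",
+    "    # The error message summarizes the rows that differ\n",
+    "    print(type(e).__name__)\n",
+    "    print(e)"
+   ]
+  },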
+  {
+   "cell_type": "markdown",
+   "id": "76ade5f2-4a1f-4601-9d2a-80da9da950ff",
+   "metadata": {},
+   "source": [
+    "You can also simply compare two DataFrame schemas:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "74393af5-40fb-4d04-87cb-265971ffe6d0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.testing.utils import assertSchemaEqual\n",
+    "from pyspark.sql.types import StructType, StructField, ArrayType, 
DoubleType\n",
+    "\n",
+    "s1 = StructType([StructField(\"names\", ArrayType(DoubleType(), True), 
True)])\n",
+    "s2 = StructType([StructField(\"names\", ArrayType(DoubleType(), True), 
True)])\n",
+    "\n",
+    "assertSchemaEqual(s1, s2)  # pass, schemas are identical"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c67be105-f6b1-4083-ad11-9e819331eae8",
+   "metadata": {},
+   "source": [
+    "### Option 2: Unit Test\n",
+    "For more complex testing scenarios, you may want to use a testing 
framework.\n",
+    "\n",
+    "One of the most popular testing framework options is unit tests. Let’s 
walk through how you can use the built-in Python unittest library to write 
PySpark tests. \n",
+    "First, you will need a Spark session. You can use the @classmethod 
decorator from the unittest package to take care of setting up and tearing down 
a Spark session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "54093761-0b49-4aee-baec-2d29bcf13f9f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import unittest\n",
+    "\n",
+    "class PySparkTestCase(unittest.TestCase):\n",
+    "    @classmethod\n",
+    "    def setUpClass(cls):\n",
+    "        cls.spark = SparkSession.builder.appName(\"Sample PySpark 
ETL\").getOrCreate() \n",
+    "\n",
+    "    @classmethod\n",
+    "    def tearDownClass(cls):\n",
+    "        cls.spark.stop()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3de27500-8526-412e-bf09-6927a760c5d7",
+   "metadata": {},
+   "source": [
+    "Now let’s write a unittest class."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "34feb5e1-944f-4f6b-9c5f-3b0bf68c7d05",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.testing.utils import assertDataFrameEqual\n",
+    "\n",
+    "class TestTranformation(PySparkTestCase):\n",
+    "    def test_single_space(self):\n",
+    "        sample_data = [{\"name\": \"John    D.\", \"age\": 30}, \n",
+    "                       {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+    "                       {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+    "                       {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+    "                        \n",
+    "        # create a Spark DataFrame\n",
+    "        original_df = spark.createDataFrame(sample_data)\n",
+    "        \n",
+    "        # apply the transformation function from before\n",
+    "        transformed_df = remove_extra_spaces(original_df, \"name\")\n",
+    "        \n",
+    "        expected_data = [{\"name\": \"John D.\", \"age\": 30}, \n",
+    "        {\"name\": \"Alice G.\", \"age\": 25}, \n",
+    "        {\"name\": \"Bob T.\", \"age\": 35}, \n",
+    "        {\"name\": \"Eve A.\", \"age\": 28}]\n",
+    "        \n",
+    "        expected_df = spark.createDataFrame(expected_data)\n",
+    "    \n",
+    "        assertDataFrameEqual(transformed_df, expected_df)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "319a690f-71bd-4886-bd3a-424e866525c2",
+   "metadata": {},
+   "source": [
+    "When run, unittest will pick up all functions with a name beginning with 
“test.”"
+   ]
+  },
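+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To run these tests from inside a notebook session, one possible approach (a minimal sketch using the standard unittest API) is to call unittest.main with exit=False so the kernel keeps running:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run the test case defined above; argv=[''] keeps unittest from parsing\n",
+    "# Jupyter's own command-line arguments, and exit=False keeps the kernel alive.\n",
+    "unittest.main(argv=[''], verbosity=2, exit=False)"
+   ]
+  },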
+  {
+   "cell_type": "markdown",
+   "id": "7d79e53d-cc1e-4fdf-a069-478337bed83d",
+   "metadata": {},
+   "source": [
+    "### Option 3: pytest\n",
+    "\n",
+    "We can also write our tests with pytest, which is one of the most popular 
Python testing frameworks. \n",
+    "\n",
+    "Using the pytest fixture allows us to share a spark session across tests, 
tearing it down when the tests are complete."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "60a4f304-1911-4b4d-8ed9-00ecc8b0890b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pytest\n",
+    "\n",
+    "@pytest.fixture(scope=\"session\")\n",
+    "def spark():\n",
+    "    spark = SparkSession.builder.appName(\"Sample PySpark 
ETL\").getOrCreate()\n",
+    "    yield spark\n",
+    "    spark.stop()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fcb4e26a-9bfc-48a5-8aca-538697d66642",
+   "metadata": {},
+   "source": [
+    "We can then define our tests like this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "fa5db3a1-7305-44b7-ab84-f5ed55fd2ba9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.testing.utils import assertDataFrameEqual\n",
+    "\n",
+    "def test_single_space(self):\n",
+    "        sample_data = [{\"name\": \"John    D.\", \"age\": 30}, \n",
+    "                       {\"name\": \"Alice   G.\", \"age\": 25}, \n",
+    "                       {\"name\": \"Bob  T.\", \"age\": 35}, \n",
+    "                       {\"name\": \"Eve   A.\", \"age\": 28}] \n",
+    "                        \n",
+    "        # create a Spark DataFrame\n",
+    "        original_df = spark.createDataFrame(sample_data)\n",
+    "        \n",
+    "        # apply the transformation function from before\n",
+    "        transformed_df = remove_extra_spaces(original_df, \"name\")\n",
+    "        \n",
+    "        expected_data = [{\"name\": \"John D.\", \"age\": 30}, \n",
+    "        {\"name\": \"Alice G.\", \"age\": 25}, \n",
+    "        {\"name\": \"Bob T.\", \"age\": 35}, \n",
+    "        {\"name\": \"Eve A.\", \"age\": 28}]\n",
+    "        \n",
+    "        expected_df = spark.createDataFrame(expected_data)\n",
+    "    \n",
+    "        assertDataFrameEqual(transformed_df, expected_df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0fc3f394-3260-4e42-82cf-1a7edc859151",
+   "metadata": {},
+   "source": [
+    "When you run your test file with the pytest command, it will pick up all 
functions that have their name beginning with “test.”"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea1ee67d-db57-48c7-ba55-a32f64723e29",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pytest /path/to/file"

Review Comment:
   why comment?




