HyukjinKwon commented on code in PR #42284: URL: https://github.com/apache/spark/pull/42284#discussion_r1283880513
########## python/docs/source/getting_started/testing_pyspark.ipynb: ########## @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4ee2125b-f889-47e6-9c3d-8bd63a253683", + "metadata": {}, + "source": [ + "# Testing PySpark\n", + "\n", + "This guide is a reference for writing robust tests for PySpark code.\n", + "\n", + "To view the docs for PySpark test utils, see here. To see the code for PySpark built-in test utils, check out the Spark repository here. To see the JIRA board tickets for the PySpark test framework, see here." + ] + }, + { + "cell_type": "markdown", + "id": "0e8ee4b6-9544-45e1-8a91-e71ed8ef8b9d", + "metadata": {}, + "source": [ + "## Build a PySpark Application\n", + "Here is an example for how to start a PySpark application. Feel free to skip to the next section, “Testing your PySpark Application,” if you already have an application you’re ready to test.\n", + "\n", + "First, start your Spark Session." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "64e82e7c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'4.0.0.dev0'" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pyspark\n", + "pyspark.__version__" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9af4a35b-17e8-4e45-816b-34c14c5902f7", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting default log level to \"WARN\".\n", + "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", + "23/08/01 19:05:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n" + ] + } + ], + "source": [ + "from pyspark.sql import SparkSession \n", + "from pyspark.sql.functions import col \n", + "\n", + "# Create a SparkSession \n", + "spark = SparkSession.builder.appName(\"Testing PySpark Example\").getOrCreate() " + ] + }, + { + "cell_type": "markdown", + "id": "4a4c6efe-91f5-4e18-b4b2-b0401c2368e4", + "metadata": {}, + "source": [ + "Next, create a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "3b483dd8-3a76-41c6-9206-301d7ef314d6", + "metadata": {}, + "outputs": [], + "source": [ + "sample_data = [{\"name\": \"John D.\", \"age\": 30}, \n", + " {\"name\": \"Alice G.\", \"age\": 25}, \n", + " {\"name\": \"Bob T.\", \"age\": 35}, \n", + " {\"name\": \"Eve A.\", \"age\": 28}] \n", + "\n", + "df = spark.createDataFrame(sample_data)" + ] + }, + { + "cell_type": "markdown", + "id": "e0f44333-0e08-470b-9fa2-38f59e3dbd63", + "metadata": {}, + "source": [ + "Now, let’s define and apply a transformation function to our DataFrame." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "a6c0b766-af5f-4e1d-acf8-887d7cf0b0b2", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---+--------+\n", + "|age| name|\n", + "+---+--------+\n", + "| 30| John D.|\n", + "| 25|Alice G.|\n", + "| 35| Bob T.|\n", + "| 28| Eve A.|\n", + "+---+--------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.sql.functions import col, regexp_replace\n", + "\n", + "# Remove additional spaces in name\n", + "def remove_extra_spaces(df, column_name):\n", + " # Remove extra spaces from the specified column\n", + " df_transformed = df.withColumn(column_name, regexp_replace(col(column_name), \"\\\\s+\", \" \"))\n", + " \n", + " return df_transformed\n", + "\n", + "transformed_df = remove_extra_spaces(df, \"name\")\n", + "\n", + "transformed_df.show()" + ] + }, + { + "cell_type": "markdown", + "id": "c471be1b-c052-4f31-abc9-35668aebc9c1", + "metadata": {}, + "source": [ + "You can also do this using Spark Connect. The only difference is using remote() when you create your SparkSession, for example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ba5f56e-dbfb-442f-a9ac-e0d0ad2d0931", + "metadata": {}, + "outputs": [], + "source": [ + "spark = SparkSession.builder.remote(\"sc://localhost\").appName(\"Sample PySpark ETL\").getOrCreate()" Review Comment: this cell wasn't executed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
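The `remove_extra_spaces` transformation in the notebook diff above hinges on the regex replacement `regexp_replace(col(column_name), "\\s+", " ")`, which collapses any run of whitespace into a single space. As a quick plain-Python illustration of that same pattern (no Spark runtime required; the function name `normalize_spaces` is ours, not part of the notebook), the logic can be sketched as:

```python
import re

def normalize_spaces(text: str) -> str:
    # Collapse any run of whitespace (spaces, tabs, newlines) into a
    # single space, mirroring regexp_replace(col(name), "\\s+", " ")
    return re.sub(r"\s+", " ", text)

for raw in ["John    D.", "Alice  G.", "Bob \t T."]:
    print(normalize_spaces(raw))
# John D.
# Alice G.
# Bob T.
```

On the Spark side, the notebook's guide would presumably verify such a transformation with the built-in test utilities it references (e.g. `pyspark.testing.assertDataFrameEqual`, available in newer PySpark versions), comparing the transformed DataFrame against an expected one.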