nchammas commented on a change in pull request #29491:
URL: https://github.com/apache/spark/pull/29491#discussion_r474085613
##########
File path: python/docs/source/getting_started/quickstart.ipynb
##########
@@ -0,0 +1,1091 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Quickstart\n",
+ "\n",
+ "This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated and implemented on top of [RDDs](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview). When the data is [transformed](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations), Spark does not compute the result right away but plans how to compute it later. When [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
+ "This notebook shows the basic usage of DataFrames and is geared mainly toward new users. You can run the latest version of these examples yourself in a live notebook [here](https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+ "\n",
+ "There is also other useful information on the Apache Spark documentation site; see the latest versions of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), the [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), the [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), the [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
+ "\n",
+ "PySpark applications usually start by initializing a `SparkSession`, which is the entry point of PySpark, as shown below. When you run it in the PySpark shell via the <code>pyspark</code> executable, the shell automatically creates the session in the variable <code>spark</code> for you."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pyspark.sql import SparkSession\n",
+ "\n",
+ "spark = SparkSession.builder.getOrCreate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## DataFrame Creation\n",
+ "\n",
+ "A PySpark DataFrame can be created via `pyspark.sql.SparkSession.createDataFrame`, typically by passing a list of lists, tuples, dictionaries or `pyspark.sql.Row`s, a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), or an RDD consisting of such a list.\n",
+ "`pyspark.sql.SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.\n",
+ "\n",
+ "The example below creates a PySpark DataFrame from a list of rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import datetime\n",
+ "import pandas as pd\n",
+ "from pyspark.sql import Row\n",
+ "\n",
+ "df = spark.createDataFrame([\n",
+ "    Row(a=1, b=2., c='string1', d=datetime.date(2000, 1, 1), e=datetime.datetime(2000, 1, 1, 12, 0)),\n",
+ "    Row(a=2, b=3., c='string2', d=datetime.date(2000, 2, 1), e=datetime.datetime(2000, 1, 2, 12, 0)),\n",
+ "    Row(a=3, b=4., c='string3', d=datetime.date(2000, 3, 1), e=datetime.datetime(2000, 1, 3, 12, 0))\n",
+ "])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame with an explicit schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = spark.createDataFrame([\n",
+ "    (1, 2., 'string1', datetime.date(2000, 1, 1), datetime.datetime(2000, 1, 1, 12, 0)),\n",
+ "    (2, 3., 'string2', datetime.date(2000, 2, 1), datetime.datetime(2000, 1, 2, 12, 0)),\n",
+ "    (3, 4., 'string3', datetime.date(2000, 3, 1), datetime.datetime(2000, 1, 3, 12, 0))\n",
+ "], schema='a long, b double, c string, d date, e timestamp')\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame from a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pandas_df = pd.DataFrame({\n",
+ "    'a': [1, 2, 3],\n",
+ "    'b': [2., 3., 4.],\n",
+ "    'c': ['string1', 'string2', 'string3'],\n",
+ "    'd': [datetime.date(2000, 1, 1), datetime.date(2000, 2, 1), datetime.date(2000, 3, 1)],\n",
+ "    'e': [datetime.datetime(2000, 1, 1, 12, 0), datetime.datetime(2000, 1, 2, 12, 0), datetime.datetime(2000, 1, 3, 12, 0)]\n",
+ "})\n",
+ "df = spark.createDataFrame(pandas_df)\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a PySpark DataFrame from an RDD consisting of a list of tuples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rdd = spark.sparkContext.parallelize([\n",
+ "    (1, 2., 'string1', datetime.date(2000, 1, 1), datetime.datetime(2000, 1, 1, 12, 0)),\n",
+ "    (2, 3., 'string2', datetime.date(2000, 2, 1), datetime.datetime(2000, 1, 2, 12, 0)),\n",
+ "    (3, 4., 'string3', datetime.date(2000, 3, 1), datetime.datetime(2000, 1, 3, 12, 0))\n",
+ "])\n",
+ "df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The DataFrames created above all have the same data and schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---+---+-------+----------+-------------------+\n",
+ "|  a|  b|      c|         d|                  e|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
+ "|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|\n",
+ "|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "\n",
+ "root\n",
+ " |-- a: long (nullable = true)\n",
+ " |-- b: double (nullable = true)\n",
+ " |-- c: string (nullable = true)\n",
+ " |-- d: date (nullable = true)\n",
+ " |-- e: timestamp (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# All DataFrames above yield the same result.\n",
+ "df.show()\n",
+ "df.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Viewing Data\n",
+ "\n",
+ "The top rows of a DataFrame can be displayed using `DataFrame.show()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---+---+-------+----------+-------------------+\n",
+ "|  a|  b|      c|         d|                  e|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
+ "+---+---+-------+----------+-------------------+\n",
+ "only showing top 1 row\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "df.show(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Alternatively, you can enable the `spark.sql.repl.eagerEval.enabled` configuration for eager evaluation of PySpark DataFrames in notebooks such as Jupyter."

Review comment:
       Consider adding a note here that you wouldn't want to do this if you were dealing with really large amounts of data.
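       For illustration only (a sketch, not something in the diff), the note could be paired with a small example along these lines, reusing the notebook's existing `spark` session and `df` and the built-in `spark.sql.repl.eagerEval.maxNumRows` config to cap how much gets rendered:

```python
# Sketch: enable eager (HTML) rendering of DataFrames in notebooks, and cap the
# number of rows rendered so that displaying a very large DataFrame stays cheap.
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 20)  # render at most 20 rows

df  # the cell output is now rendered as an HTML table of the first rows
```

       Even with the row cap, every display of a DataFrame triggers a (limited) computation, which is the kind of caveat worth spelling out for users working with really large datasets.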