This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ef334ae6a68 [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide
ef334ae6a68 is described below

commit ef334ae6a6889bbfba8b6c7afeb71b1ca1df87eb
Author: Allan Folting <allan.folt...@databricks.com>
AuthorDate: Thu Mar 2 09:19:05 2023 +0900

    [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

    ### What changes were proposed in this pull request?
    Making Python the first tab for code examples in the Spark SQL, DataFrames and Datasets Guide.

    ### Why are the changes needed?
    Python is the most approachable and most popular language, so this change moves it to the first tab (shown by default) for code examples.

    ### Does this PR introduce _any_ user-facing change?
    Yes, the user-facing Spark documentation is updated.

    ### How was this patch tested?
    I built the website locally and manually tested the pages.

    Closes #40087 from allanf-db/spark_docs.

    Authored-by: Allan Folting <allan.folt...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 docs/sql-data-sources-load-save-functions.md |  96 +++++++++----------
 docs/sql-getting-started.md                  | 135 ++++++++++++++-------------
 2 files changed, 113 insertions(+), 118 deletions(-)

diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md
index 25df34ef5b0..c6cf8054f5f 100644
--- a/docs/sql-data-sources-load-save-functions.md
+++ b/docs/sql-data-sources-load-save-functions.md
@@ -28,6 +28,11 @@ In the simplest form, the default data source (`parquet` unless otherwise config
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example generic_load_save_functions python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -36,16 +41,10 @@ In the simplest form, the default data source (`parquet` unless otherwise config
 {% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-
-{% include_example generic_load_save_functions python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
-
 {% include_example generic_load_save_functions r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 ### Manually Specifying Options
@@ -64,6 +63,11 @@ as well. For other formats, refer to the API documentation of the particular for
 To load a JSON file you can use:
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example manual_load_options python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -72,18 +76,20 @@ To load a JSON file you can use:
 {% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example manual_load_options python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 {% include_example manual_load_options r/RSparkSQLExample.R %}
 </div>
+
 </div>
 
 To load a CSV file you can use:
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example manual_load_options_csv python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -92,14 +98,10 @@ To load a CSV file you can use:
 {% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example manual_load_options_csv python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 {% include_example manual_load_options_csv r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 The extra options are also used during write operation.
@@ -113,6 +115,10 @@ ORC data source:
 
 <div class="codetabs">
 
+<div data-lang="python" markdown="1">
+{% include_example manual_save_options_orc python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example manual_save_options_orc scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -121,16 +127,11 @@ ORC data source:
 {% include_example manual_save_options_orc java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example manual_save_options_orc python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 {% include_example manual_save_options_orc r/RSparkSQLExample.R %}
 </div>
 
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
 CREATE TABLE users_with_options (
   name STRING,
@@ -143,7 +144,6 @@ OPTIONS (
   orc.column.encoding.direct 'name'
 )
 {% endhighlight %}
-
 </div>
 
 </div>
@@ -152,6 +152,10 @@ Parquet data source:
 
 <div class="codetabs">
 
+<div data-lang="python" markdown="1">
+{% include_example manual_save_options_parquet python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example manual_save_options_parquet scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -160,16 +164,11 @@ Parquet data source:
 {% include_example manual_save_options_parquet java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example manual_save_options_parquet python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 {% include_example manual_save_options_parquet r/RSparkSQLExample.R %}
 </div>
 
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
 CREATE TABLE users_with_options (
   name STRING,
@@ -183,7 +182,6 @@ OPTIONS (
   parquet.page.write-checksum.enabled true
 )
 {% endhighlight %}
-
 </div>
 
 </div>
@@ -194,6 +192,11 @@ Instead of using read API to load a file into DataFrame and query it, you can al
 file directly with SQL.
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example direct_sql python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -202,14 +205,10 @@ file directly with SQL.
 {% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example direct_sql python/sql/datasource.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 {% include_example direct_sql r/RSparkSQLExample.R %}
-
 </div>
+
 </div>
 
 ### Save Modes
@@ -287,6 +286,10 @@ Bucketing and sorting are applicable only to persistent tables:
 
 <div class="codetabs">
 
+<div data-lang="python" markdown="1">
+{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -295,12 +298,7 @@ Bucketing and sorting are applicable only to persistent tables:
 {% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_bucketed_by_name(
@@ -311,9 +309,9 @@ CREATE TABLE users_bucketed_by_name(
 CLUSTERED BY(name) INTO 42 BUCKETS;
 
 {% endhighlight %}
-
 </div>
+
 </div>
 
 while partitioning can be used with both `save` and `saveAsTable` when using the Dataset APIs.
@@ -321,6 +319,10 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
 
 <div class="codetabs">
 
+<div data-lang="python" markdown="1">
+{% include_example write_partitioning python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -329,12 +331,7 @@ while partitioning can be used with both `save` and `saveAsTable` when using the
 {% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example write_partitioning python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_by_favorite_color(
@@ -344,7 +341,6 @@ CREATE TABLE users_by_favorite_color(
 ) USING csv
 PARTITIONED BY(favorite_color);
 
 {% endhighlight %}
-
 </div>
 
 </div>
@@ -353,6 +349,10 @@ It is possible to use both partitioning and bucketing for a single table:
 
 <div class="codetabs">
 
+<div data-lang="python" markdown="1">
+{% include_example write_partition_and_bucket python/sql/datasource.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
 </div>
@@ -361,12 +361,7 @@ It is possible to use both partitioning and bucketing for a single table:
 {% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example write_partition_and_bucket python/sql/datasource.py %}
-</div>
-
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
 
 CREATE TABLE users_bucketed_and_partitioned(
   name STRING,
   favorite_color STRING,
   favorite_numbers array<integer>
 ) USING parquet
@@ -378,7 +373,6 @@
 PARTITIONED BY (favorite_color)
 CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
 
 {% endhighlight %}
-
 </div>
 
 </div>
diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md
index 69396924e35..85da88a15c7 100644
--- a/docs/sql-getting-started.md
+++ b/docs/sql-getting-started.md
@@ -25,6 +25,13 @@ license: |
 ## Starting Point: SparkSession
 
 <div class="codetabs">
+<div data-lang="python" markdown="1">
+
+The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
+
+{% include_example init_session python/sql/basic.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
@@ -39,13 +46,6 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 {% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-
-The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
-
-{% include_example init_session python/sql/basic.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/R/reference/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
@@ -63,31 +63,31 @@ To use these features, you do not need to have an existing Hive setup.
 ## Creating DataFrames
 
 <div class="codetabs">
-<div data-lang="scala" markdown="1">
+<div data-lang="python" markdown="1">
 
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
+{% include_example create_df python/sql/basic.py %}
 </div>
 
-<div data-lang="java" markdown="1">
+<div data-lang="scala" markdown="1">
 
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
+{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
 
-<div data-lang="python" markdown="1">
+<div data-lang="java" markdown="1">
 
 With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
 from a Hive table, or from [Spark data sources](sql-data-sources.html).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
-{% include_example create_df python/sql/basic.py %}
+{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
 <div data-lang="r" markdown="1">
@@ -111,6 +111,21 @@ As mentioned above, in Spark 2.0, DataFrames are just Dataset of `Row`s in Scala
 Here we include some basic examples of structured data processing using Datasets:
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+In Python, it's possible to access a DataFrame's columns either by attribute
+(`df.age`) or by indexing (`df['age']`). While the former is convenient for
+interactive data exploration, users are highly encouraged to use the
+latter form, which is future proof and won't break with column names that
+are also attributes on the DataFrame class.
+
+{% include_example untyped_ops python/sql/basic.py %}
+For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
+
+</div>
+
 <div data-lang="scala" markdown="1">
 
 {% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
@@ -128,20 +143,6 @@ For a complete list of the types of operations that can be performed on a Datase
 In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
 </div>
 
-<div data-lang="python" markdown="1">
-In Python, it's possible to access a DataFrame's columns either by attribute
-(`df.age`) or by indexing (`df['age']`). While the former is convenient for
-interactive data exploration, users are highly encouraged to use the
-latter form, which is future proof and won't break with column names that
-are also attributes on the DataFrame class.
-
-{% include_example untyped_ops python/sql/basic.py %}
-For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).
-
-In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).
-
-</div>
-
 <div data-lang="r" markdown="1">
 
 {% include_example untyped_ops r/RSparkSQLExample.R %}
@@ -157,6 +158,13 @@ In addition to simple column references and expressions, DataFrames also have a
 ## Running SQL Queries Programmatically
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
+
+{% include_example run_sql python/sql/basic.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
 
 {% include_example run_sql scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -169,12 +177,6 @@ The `sql` function on a `SparkSession` enables applications to run SQL queries p
 {% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
-
-{% include_example run_sql python/sql/basic.py %}
-</div>
-
 <div data-lang="r" markdown="1">
 The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
 
@@ -193,6 +195,11 @@ view is tied to a system preserved database `global_temp`, and we must use the q
 refer it, e.g. `SELECT * FROM global_temp.view1`.
 
 <div class="codetabs">
+
+<div data-lang="python" markdown="1">
+{% include_example global_temp_view python/sql/basic.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 {% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -201,21 +208,14 @@ refer it, e.g. `SELECT * FROM global_temp.view1`.
 {% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-{% include_example global_temp_view python/sql/basic.py %}
-</div>
-
 <div data-lang="SQL" markdown="1">
-
 {% highlight sql %}
-
 CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl
 
 SELECT * FROM global_temp.temp_view
-
 {% endhighlight %}
-
 </div>
+
 </div>
 
@@ -229,6 +229,7 @@ that allows Spark to perform many operations like filtering, sorting and hashing
 the bytes back into an object.
 
 <div class="codetabs">
+
 <div data-lang="scala" markdown="1">
 {% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -252,6 +253,15 @@ you to construct Datasets when the columns and their types are not known until r
 ### Inferring the Schema Using Reflection
 
 <div class="codetabs">
+<div data-lang="python" markdown="1">
+
+Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
+key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
+and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
+
+{% include_example schema_inferring python/sql/basic.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 
 The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
@@ -276,21 +286,29 @@ Serializable and has getters and setters for all of its fields.
 {% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-
-Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
-key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
-and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
-
-{% include_example schema_inferring python/sql/basic.py %}
-</div>
-
 </div>
 
 ### Programmatically Specifying the Schema
 
 <div class="codetabs">
+<div data-lang="python" markdown="1">
+
+When a dictionary of kwargs cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
+fields will be projected differently for different users),
+a `DataFrame` can be created programmatically with three steps.
+
+1. Create an RDD of tuples or lists from the original RDD;
+2. Create the schema represented by a `StructType` matching the structure of
+tuples or lists in the RDD created in the step 1.
+3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
+
+For example:
+
+{% include_example programmatic_schema python/sql/basic.py %}
+</div>
+
 <div data-lang="scala" markdown="1">
 
 When case classes cannot be defined ahead of time (for example,
@@ -327,23 +345,6 @@ For example:
 {% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
 </div>
 
-<div data-lang="python" markdown="1">
-
-When a dictionary of kwargs cannot be defined ahead of time (for example,
-the structure of records is encoded in a string, or a text dataset will be parsed and
-fields will be projected differently for different users),
-a `DataFrame` can be created programmatically with three steps.
-
-1. Create an RDD of tuples or lists from the original RDD;
-2. Create the schema represented by a `StructType` matching the structure of
-tuples or lists in the RDD created in the step 1.
-3. Apply the schema to the RDD via `createDataFrame` method provided by `SparkSession`.
-
-For example:
-
-{% include_example programmatic_schema python/sql/basic.py %}
-</div>
-
 </div>
 
 ## Scalar Functions

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org