http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sql-programming-guide.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/sql-programming-guide.html b/site/docs/2.1.0/sql-programming-guide.html
index 17f5981..4534a98 100644
--- a/site/docs/2.1.0/sql-programming-guide.html
+++ b/site/docs/2.1.0/sql-programming-guide.html
@@ -127,95 +127,95 @@
 <ul id="markdown-toc">
-  <li><a href="#overview" id="markdown-toc-overview">Overview</a> <ul>
-      <li><a href="#sql" id="markdown-toc-sql">SQL</a></li>
-      <li><a href="#datasets-and-dataframes" id="markdown-toc-datasets-and-dataframes">Datasets and DataFrames</a></li>
+  <li><a href="#overview">Overview</a> <ul>
+      <li><a href="#sql">SQL</a></li>
+      <li><a href="#datasets-and-dataframes">Datasets and DataFrames</a></li>
     </ul>
   </li>
-  <li><a href="#getting-started" id="markdown-toc-getting-started">Getting Started</a> <ul>
-      <li><a href="#starting-point-sparksession" id="markdown-toc-starting-point-sparksession">Starting Point: SparkSession</a></li>
-      <li><a href="#creating-dataframes" id="markdown-toc-creating-dataframes">Creating DataFrames</a></li>
-      <li><a href="#untyped-dataset-operations-aka-dataframe-operations" id="markdown-toc-untyped-dataset-operations-aka-dataframe-operations">Untyped Dataset Operations (aka DataFrame Operations)</a></li>
-      <li><a href="#running-sql-queries-programmatically" id="markdown-toc-running-sql-queries-programmatically">Running SQL Queries Programmatically</a></li>
-      <li><a href="#global-temporary-view" id="markdown-toc-global-temporary-view">Global Temporary View</a></li>
-      <li><a href="#creating-datasets" id="markdown-toc-creating-datasets">Creating Datasets</a></li>
-      <li><a href="#interoperating-with-rdds" id="markdown-toc-interoperating-with-rdds">Interoperating with RDDs</a> <ul>
-          <li><a href="#inferring-the-schema-using-reflection" id="markdown-toc-inferring-the-schema-using-reflection">Inferring the Schema Using Reflection</a></li>
-          <li><a href="#programmatically-specifying-the-schema" id="markdown-toc-programmatically-specifying-the-schema">Programmatically Specifying the Schema</a></li>
+  <li><a href="#getting-started">Getting Started</a> <ul>
+      <li><a href="#starting-point-sparksession">Starting Point: SparkSession</a></li>
+      <li><a href="#creating-dataframes">Creating DataFrames</a></li>
+      <li><a href="#untyped-dataset-operations-aka-dataframe-operations">Untyped Dataset Operations (aka DataFrame Operations)</a></li>
+      <li><a href="#running-sql-queries-programmatically">Running SQL Queries Programmatically</a></li>
+      <li><a href="#global-temporary-view">Global Temporary View</a></li>
+      <li><a href="#creating-datasets">Creating Datasets</a></li>
+      <li><a href="#interoperating-with-rdds">Interoperating with RDDs</a> <ul>
+          <li><a href="#inferring-the-schema-using-reflection">Inferring the Schema Using Reflection</a></li>
+          <li><a href="#programmatically-specifying-the-schema">Programmatically Specifying the Schema</a></li>
         </ul>
       </li>
     </ul>
   </li>
-  <li><a href="#data-sources" id="markdown-toc-data-sources">Data Sources</a> <ul>
-      <li><a href="#generic-loadsave-functions" id="markdown-toc-generic-loadsave-functions">Generic Load/Save Functions</a> <ul>
-          <li><a href="#manually-specifying-options" id="markdown-toc-manually-specifying-options">Manually Specifying Options</a></li>
-          <li><a href="#run-sql-on-files-directly" id="markdown-toc-run-sql-on-files-directly">Run SQL on files directly</a></li>
-          <li><a href="#save-modes" id="markdown-toc-save-modes">Save Modes</a></li>
-          <li><a href="#saving-to-persistent-tables" id="markdown-toc-saving-to-persistent-tables">Saving to Persistent Tables</a></li>
+  <li><a href="#data-sources">Data Sources</a> <ul>
+      <li><a href="#generic-loadsave-functions">Generic Load/Save Functions</a> <ul>
+          <li><a href="#manually-specifying-options">Manually Specifying Options</a></li>
+          <li><a href="#run-sql-on-files-directly">Run SQL on files directly</a></li>
+          <li><a href="#save-modes">Save Modes</a></li>
+          <li><a href="#saving-to-persistent-tables">Saving to Persistent Tables</a></li>
         </ul>
       </li>
-      <li><a href="#parquet-files" id="markdown-toc-parquet-files">Parquet Files</a> <ul>
-          <li><a href="#loading-data-programmatically" id="markdown-toc-loading-data-programmatically">Loading Data Programmatically</a></li>
-          <li><a href="#partition-discovery" id="markdown-toc-partition-discovery">Partition Discovery</a></li>
-          <li><a href="#schema-merging" id="markdown-toc-schema-merging">Schema Merging</a></li>
-          <li><a href="#hive-metastore-parquet-table-conversion" id="markdown-toc-hive-metastore-parquet-table-conversion">Hive metastore Parquet table conversion</a> <ul>
-              <li><a href="#hiveparquet-schema-reconciliation" id="markdown-toc-hiveparquet-schema-reconciliation">Hive/Parquet Schema Reconciliation</a></li>
-              <li><a href="#metadata-refreshing" id="markdown-toc-metadata-refreshing">Metadata Refreshing</a></li>
+      <li><a href="#parquet-files">Parquet Files</a> <ul>
+          <li><a href="#loading-data-programmatically">Loading Data Programmatically</a></li>
+          <li><a href="#partition-discovery">Partition Discovery</a></li>
+          <li><a href="#schema-merging">Schema Merging</a></li>
+          <li><a href="#hive-metastore-parquet-table-conversion">Hive metastore Parquet table conversion</a> <ul>
+              <li><a href="#hiveparquet-schema-reconciliation">Hive/Parquet Schema Reconciliation</a></li>
+              <li><a href="#metadata-refreshing">Metadata Refreshing</a></li>
            </ul>
          </li>
-          <li><a href="#configuration" id="markdown-toc-configuration">Configuration</a></li>
+          <li><a href="#configuration">Configuration</a></li>
        </ul>
      </li>
-      <li><a href="#json-datasets" id="markdown-toc-json-datasets">JSON Datasets</a></li>
-      <li><a href="#hive-tables" id="markdown-toc-hive-tables">Hive Tables</a> <ul>
-          <li><a href="#interacting-with-different-versions-of-hive-metastore" id="markdown-toc-interacting-with-different-versions-of-hive-metastore">Interacting with Different Versions of Hive Metastore</a></li>
+      <li><a href="#json-datasets">JSON Datasets</a></li>
+      <li><a href="#hive-tables">Hive Tables</a> <ul>
+          <li><a href="#interacting-with-different-versions-of-hive-metastore">Interacting with Different Versions of Hive Metastore</a></li>
        </ul>
      </li>
-      <li><a href="#jdbc-to-other-databases" id="markdown-toc-jdbc-to-other-databases">JDBC To Other Databases</a></li>
-      <li><a href="#troubleshooting" id="markdown-toc-troubleshooting">Troubleshooting</a></li>
+      <li><a href="#jdbc-to-other-databases">JDBC To Other Databases</a></li>
+      <li><a href="#troubleshooting">Troubleshooting</a></li>
    </ul>
  </li>
-  <li><a href="#performance-tuning" id="markdown-toc-performance-tuning">Performance Tuning</a> <ul>
-      <li><a href="#caching-data-in-memory" id="markdown-toc-caching-data-in-memory">Caching Data In Memory</a></li>
-      <li><a href="#other-configuration-options" id="markdown-toc-other-configuration-options">Other Configuration Options</a></li>
+  <li><a href="#performance-tuning">Performance Tuning</a> <ul>
+      <li><a href="#caching-data-in-memory">Caching Data In Memory</a></li>
+      <li><a href="#other-configuration-options">Other Configuration Options</a></li>
    </ul>
  </li>
-  <li><a href="#distributed-sql-engine" id="markdown-toc-distributed-sql-engine">Distributed SQL Engine</a> <ul>
-      <li><a href="#running-the-thrift-jdbcodbc-server" id="markdown-toc-running-the-thrift-jdbcodbc-server">Running the Thrift JDBC/ODBC server</a></li>
-      <li><a href="#running-the-spark-sql-cli" id="markdown-toc-running-the-spark-sql-cli">Running the Spark SQL CLI</a></li>
+  <li><a href="#distributed-sql-engine">Distributed SQL Engine</a> <ul>
+      <li><a href="#running-the-thrift-jdbcodbc-server">Running the Thrift JDBC/ODBC server</a></li>
+      <li><a href="#running-the-spark-sql-cli">Running the Spark SQL CLI</a></li>
    </ul>
  </li>
-  <li><a href="#migration-guide" id="markdown-toc-migration-guide">Migration Guide</a> <ul>
-      <li><a href="#upgrading-from-spark-sql-20-to-21" id="markdown-toc-upgrading-from-spark-sql-20-to-21">Upgrading From Spark SQL 2.0 to 2.1</a></li>
-      <li><a href="#upgrading-from-spark-sql-16-to-20" id="markdown-toc-upgrading-from-spark-sql-16-to-20">Upgrading From Spark SQL 1.6 to 2.0</a></li>
-      <li><a href="#upgrading-from-spark-sql-15-to-16" id="markdown-toc-upgrading-from-spark-sql-15-to-16">Upgrading From Spark SQL 1.5 to 1.6</a></li>
-      <li><a href="#upgrading-from-spark-sql-14-to-15" id="markdown-toc-upgrading-from-spark-sql-14-to-15">Upgrading From Spark SQL 1.4 to 1.5</a></li>
-      <li><a href="#upgrading-from-spark-sql-13-to-14" id="markdown-toc-upgrading-from-spark-sql-13-to-14">Upgrading from Spark SQL 1.3 to 1.4</a> <ul>
-          <li><a href="#dataframe-data-readerwriter-interface" id="markdown-toc-dataframe-data-readerwriter-interface">DataFrame data reader/writer interface</a></li>
-          <li><a href="#dataframegroupby-retains-grouping-columns" id="markdown-toc-dataframegroupby-retains-grouping-columns">DataFrame.groupBy retains grouping columns</a></li>
-          <li><a href="#behavior-change-on-dataframewithcolumn" id="markdown-toc-behavior-change-on-dataframewithcolumn">Behavior change on DataFrame.withColumn</a></li>
+  <li><a href="#migration-guide">Migration Guide</a> <ul>
+      <li><a href="#upgrading-from-spark-sql-20-to-21">Upgrading From Spark SQL 2.0 to 2.1</a></li>
+      <li><a href="#upgrading-from-spark-sql-16-to-20">Upgrading From Spark SQL 1.6 to 2.0</a></li>
+      <li><a href="#upgrading-from-spark-sql-15-to-16">Upgrading From Spark SQL 1.5 to 1.6</a></li>
+      <li><a href="#upgrading-from-spark-sql-14-to-15">Upgrading From Spark SQL 1.4 to 1.5</a></li>
+      <li><a href="#upgrading-from-spark-sql-13-to-14">Upgrading from Spark SQL 1.3 to 1.4</a> <ul>
+          <li><a href="#dataframe-data-readerwriter-interface">DataFrame data reader/writer interface</a></li>
+          <li><a href="#dataframegroupby-retains-grouping-columns">DataFrame.groupBy retains grouping columns</a></li>
+          <li><a href="#behavior-change-on-dataframewithcolumn">Behavior change on DataFrame.withColumn</a></li>
        </ul>
      </li>
-      <li><a href="#upgrading-from-spark-sql-10-12-to-13" id="markdown-toc-upgrading-from-spark-sql-10-12-to-13">Upgrading from Spark SQL 1.0-1.2 to 1.3</a> <ul>
-          <li><a href="#rename-of-schemardd-to-dataframe" id="markdown-toc-rename-of-schemardd-to-dataframe">Rename of SchemaRDD to DataFrame</a></li>
-          <li><a href="#unification-of-the-java-and-scala-apis" id="markdown-toc-unification-of-the-java-and-scala-apis">Unification of the Java and Scala APIs</a></li>
-          <li><a href="#isolation-of-implicit-conversions-and-removal-of-dsl-package-scala-only" id="markdown-toc-isolation-of-implicit-conversions-and-removal-of-dsl-package-scala-only">Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)</a></li>
-          <li><a href="#removal-of-the-type-aliases-in-orgapachesparksql-for-datatype-scala-only" id="markdown-toc-removal-of-the-type-aliases-in-orgapachesparksql-for-datatype-scala-only">Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)</a></li>
-          <li><a href="#udf-registration-moved-to-sqlcontextudf-java--scala" id="markdown-toc-udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to <code>sqlContext.udf</code> (Java & Scala)</a></li>
-          <li><a href="#python-datatypes-no-longer-singletons" id="markdown-toc-python-datatypes-no-longer-singletons">Python DataTypes No Longer Singletons</a></li>
+      <li><a href="#upgrading-from-spark-sql-10-12-to-13">Upgrading from Spark SQL 1.0-1.2 to 1.3</a> <ul>
+          <li><a href="#rename-of-schemardd-to-dataframe">Rename of SchemaRDD to DataFrame</a></li>
+          <li><a href="#unification-of-the-java-and-scala-apis">Unification of the Java and Scala APIs</a></li>
+          <li><a href="#isolation-of-implicit-conversions-and-removal-of-dsl-package-scala-only">Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)</a></li>
+          <li><a href="#removal-of-the-type-aliases-in-orgapachesparksql-for-datatype-scala-only">Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)</a></li>
+          <li><a href="#udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to <code>sqlContext.udf</code> (Java & Scala)</a></li>
+          <li><a href="#python-datatypes-no-longer-singletons">Python DataTypes No Longer Singletons</a></li>
        </ul>
      </li>
-      <li><a href="#compatibility-with-apache-hive" id="markdown-toc-compatibility-with-apache-hive">Compatibility with Apache Hive</a> <ul>
-          <li><a href="#deploying-in-existing-hive-warehouses" id="markdown-toc-deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</a></li>
-          <li><a href="#supported-hive-features" id="markdown-toc-supported-hive-features">Supported Hive Features</a></li>
-          <li><a href="#unsupported-hive-functionality" id="markdown-toc-unsupported-hive-functionality">Unsupported Hive Functionality</a></li>
+      <li><a href="#compatibility-with-apache-hive">Compatibility with Apache Hive</a> <ul>
+          <li><a href="#deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</a></li>
+          <li><a href="#supported-hive-features">Supported Hive Features</a></li>
+          <li><a href="#unsupported-hive-functionality">Unsupported Hive Functionality</a></li>
        </ul>
      </li>
    </ul>
  </li>
-  <li><a href="#reference" id="markdown-toc-reference">Reference</a> <ul>
-      <li><a href="#data-types" id="markdown-toc-data-types">Data Types</a></li>
-      <li><a href="#nan-semantics" id="markdown-toc-nan-semantics">NaN Semantics</a></li>
+  <li><a href="#reference">Reference</a> <ul>
+      <li><a href="#data-types">Data Types</a></li>
+      <li><a href="#nan-semantics">NaN Semantics</a></li>
    </ul>
  </li>
 </ul>
@@ -275,7 +275,7 @@ While, in <a href="api/java/index.html?org/apache/spark/sql/Dataset.html">Java A
 <p>The entry point into all functionality in Spark is the <a href="api/scala/index.html#org.apache.spark.sql.SparkSession"><code>SparkSession</code></a> class.
 To create a basic <code>SparkSession</code>, just use <code>SparkSession.builder()</code>:</p>

-  <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span>
+  <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span>

 <span class="k">val</span> <span class="n">spark</span> <span class="k">=</span> <span class="nc">SparkSession</span>
   <span class="o">.</span><span class="n">builder</span><span class="o">()</span>
@@ -293,7 +293,7 @@ While, in <a href="api/java/index.html?org/apache/spark/sql/Dataset.html">Java A
 <p>The entry point into all functionality in Spark is the <a href="api/java/index.html#org.apache.spark.sql.SparkSession"><code>SparkSession</code></a> class.
 To create a basic <code>SparkSession</code>, just use <code>SparkSession.builder()</code>:</p>

-  <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span><span class="o">;</span>
+  <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.sql.SparkSession</span><span class="o">;</span>

 <span class="n">SparkSession</span> <span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span>
   <span class="o">.</span><span class="na">builder</span><span class="o">()</span>
@@ -308,12 +308,12 @@ While, in <a href="api/java/index.html?org/apache/spark/sql/Dataset.html">Java A
 <p>The entry point into all functionality in Spark is the <a href="api/python/pyspark.sql.html#pyspark.sql.SparkSession"><code>SparkSession</code></a> class.
 To create a basic <code>SparkSession</code>, just use <code>SparkSession.builder</code>:</p>

-  <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
+  <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>

 <span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span> \
     <span class="o">.</span><span class="n">builder</span> \
-    <span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"Python Spark SQL basic example"</span><span class="p">)</span> \
-    <span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.some.config.option"</span><span class="p">,</span> <span class="s">"some-value"</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">"Python Spark SQL basic example"</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s2">"spark.some.config.option"</span><span class="p">,</span> <span class="s2">"some-value"</span><span class="p">)</span> \
     <span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div>
@@ -323,7 +323,7 @@ While, in <a href="api/java/index.html?org/apache/spark/sql/Dataset.html">Java A
 <p>The entry point into all functionality in Spark is the <a href="api/R/sparkR.session.html"><code>SparkSession</code></a> class.
 To initialize a basic <code>SparkSession</code>, just call <code>sparkR.session()</code>:</p>

-  <div class="highlight"><pre>sparkR.session<span class="p">(</span>appName <span class="o">=</span> <span class="s">"R Spark SQL basic example"</span><span class="p">,</span> sparkConfig <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>spark.some.config.option <span class="o">=</span> <span class="s">"some-value"</span><span class="p">))</span>
+  <div class="highlight"><pre><span></span>sparkR.session<span class="p">(</span>appName <span class="o">=</span> <span class="s">"R Spark SQL basic example"</span><span class="p">,</span> sparkConfig <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>spark.some.config.option <span class="o">=</span> <span class="s">"some-value"</span><span class="p">))</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.</small></div>
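For readability, the Python variant of the SparkSession example encoded in the highlighted markup above decodes to the following plain text (indentation reconstructed, since the archive collapses whitespace; the code itself is unchanged):

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()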
@@ -344,7 +344,7 @@ from a Hive table, or from <a href="#data-sources">Spark data sources</a>.</p>

 <p>As an example, the following creates a DataFrame based on the content of a JSON file:</p>

-  <div class="highlight"><pre><span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">)</span>
+  <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">df</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">)</span>

 <span class="c1">// Displays the content of the DataFrame to stdout</span>
 <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="o">()</span>
@@ -365,7 +365,7 @@ from a Hive table, or from <a href="#data-sources">Spark data sources</a>.</p>

 <p>As an example, the following creates a DataFrame based on the content of a JSON file:</p>

-  <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span>
+  <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Row</span><span class="o">;</span>

 <span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">json</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">);</span>
@@ -389,17 +389,17 @@ from a Hive table, or from <a href="#data-sources">Spark data sources</a>.</p>

 <p>As an example, the following creates a DataFrame based on the content of a JSON file:</p>

-  <div class="highlight"><pre><span class="c"># spark is an existing SparkSession</span>
-<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span>
-<span class="c"># Displays the content of the DataFrame to stdout</span>
+  <div class="highlight"><pre><span></span><span class="c1"># spark is an existing SparkSession</span>
+<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s2">"examples/src/main/resources/people.json"</span><span class="p">)</span>
+<span class="c1"># Displays the content of the DataFrame to stdout</span>
 <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># | age| name|</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># |null|Michael|</span>
-<span class="c"># | 30| Andy|</span>
-<span class="c"># | 19| Justin|</span>
-<span class="c"># +----+-------+</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># | age| name|</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># |null|Michael|</span>
+<span class="c1"># | 30| Andy|</span>
+<span class="c1"># | 19| Justin|</span>
+<span class="c1"># +----+-------+</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div>
 </div>
@@ -410,7 +410,7 @@ from a Hive table, or from <a href="#data-sources">Spark data sources</a>.</p>

 <p>As an example, the following creates a DataFrame based on the content of a JSON file:</p>

-  <div class="highlight"><pre>df <span class="o"><-</span> read.json<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span>
+  <div class="highlight"><pre><span></span>df <span class="o"><-</span> read.json<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span>

 <span class="c1"># Displays the content of the DataFrame</span>
 <span class="kp">head</span><span class="p">(</span>df<span class="p">)</span>
class="c1">// Print the schema in a tree format</span> <span class="n">df</span><span class="o">.</span><span class="na">printSchema</span><span class="o">();</span> @@ -560,50 +560,50 @@ interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.</p> - <div class="highlight"><pre><span class="c"># spark, df are from the previous example</span> -<span class="c"># Print the schema in a tree format</span> + <div class="highlight"><pre><span></span><span class="c1"># spark, df are from the previous example</span> +<span class="c1"># Print the schema in a tree format</span> <span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span> -<span class="c"># root</span> -<span class="c"># |-- age: long (nullable = true)</span> -<span class="c"># |-- name: string (nullable = true)</span> - -<span class="c"># Select only the "name" column</span> -<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +-------+</span> -<span class="c"># | name|</span> -<span class="c"># +-------+</span> -<span class="c"># |Michael|</span> -<span class="c"># | Andy|</span> -<span class="c"># | Justin|</span> -<span class="c"># +-------+</span> - -<span class="c"># Select everybody, but increment the age by 1</span> -<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'name'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'age'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +-------+---------+</span> -<span class="c"># | name|(age + 1)|</span> -<span class="c"># +-------+---------+</span> -<span class="c"># |Michael| null|</span> -<span class="c"># | Andy| 31|</span> -<span class="c"># | Justin| 20|</span> -<span class="c"># +-------+---------+</span> - -<span class="c"># Select people older than 21</span> -<span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'age'</span><span class="p">]</span> <span class="o">></span> <span class="mi">21</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +---+----+</span> -<span class="c"># |age|name|</span> -<span class="c"># +---+----+</span> -<span class="c"># | 30|Andy|</span> -<span class="c"># +---+----+</span> - -<span class="c"># Count people by age</span> -<span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +----+-----+</span> -<span class="c"># | age|count|</span> -<span class="c"># +----+-----+</span> -<span class="c"># | 19| 1|</span> -<span class="c"># |null| 1|</span> -<span class="c"># | 30| 1|</span> -<span class="c"># +----+-----+</span> 
+<span class="c1"># root</span> +<span class="c1"># |-- age: long (nullable = true)</span> +<span class="c1"># |-- name: string (nullable = true)</span> + +<span class="c1"># Select only the "name" column</span> +<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> +<span class="c1"># +-------+</span> +<span class="c1"># | name|</span> +<span class="c1"># +-------+</span> +<span class="c1"># |Michael|</span> +<span class="c1"># | Andy|</span> +<span class="c1"># | Justin|</span> +<span class="c1"># +-------+</span> + +<span class="c1"># Select everybody, but increment the age by 1</span> +<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'name'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s1">'age'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> +<span class="c1"># +-------+---------+</span> +<span class="c1"># | name|(age + 1)|</span> +<span class="c1"># +-------+---------+</span> +<span class="c1"># |Michael| null|</span> +<span class="c1"># | Andy| 31|</span> +<span class="c1"># | Justin| 20|</span> +<span class="c1"># +-------+---------+</span> + +<span class="c1"># Select people older than 21</span> +<span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'age'</span><span class="p">]</span> <span class="o">></span> <span class="mi">21</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> +<span class="c1"># +---+----+</span> +<span class="c1"># |age|name|</span> +<span class="c1"># +---+----+</span> +<span class="c1"># | 30|Andy|</span> +<span class="c1"># +---+----+</span> + +<span class="c1"># Count people by age</span> +<span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> +<span class="c1"># +----+-----+</span> +<span class="c1"># | age|count|</span> +<span class="c1"># +----+-----+</span> +<span class="c1"># | 19| 1|</span> +<span class="c1"># |null| 1|</span> +<span class="c1"># | 30| 1|</span> +<span class="c1"># +----+-----+</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div> <p>For a complete list of the types of operations that can be performed on a DataFrame refer to the <a href="api/python/pyspark.sql.html#pyspark.sql.DataFrame">API Documentation</a>.</p> @@ -614,7 +614,7 @@ are also attributes on the DataFrame class.</p> <div data-lang="r"> - <div class="highlight"><pre><span class="c1"># Create the DataFrame</span> + <div class="highlight"><pre><span></span><span class="c1"># Create the DataFrame</span> df <span class="o"><-</span> read.json<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span> <span class="c1"># Show the content of the 
@@ -614,7 +614,7 @@ are also attributes on the DataFrame class.</p>

 <div data-lang="r">

-  <div class="highlight"><pre><span class="c1"># Create the DataFrame</span>
+  <div class="highlight"><pre><span></span><span class="c1"># Create the DataFrame</span>
 df <span class="o"><-</span> read.json<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span>

 <span class="c1"># Show the content of the DataFrame</span>
@@ -673,7 +673,7 @@ printSchema<span class="p">(</span>df<span class="p">)</span>
 <div data-lang="scala">
 <p>The <code>sql</code> function on a <code>SparkSession</code> enables applications to run SQL queries programmatically and returns the result as a <code>DataFrame</code>.</p>

-  <div class="highlight"><pre><span class="c1">// Register the DataFrame as a SQL temporary view</span>
+  <div class="highlight"><pre><span></span><span class="c1">// Register the DataFrame as a SQL temporary view</span>
 <span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span>

 <span class="k">val</span> <span class="n">sqlDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT * FROM people"</span><span class="o">)</span>
@@ -692,7 +692,7 @@ printSchema<span class="p">(</span>df<span class="p">)</span>
 <div data-lang="java">
 <p>The <code>sql</code> function on a <code>SparkSession</code> enables applications to run SQL queries programmatically and returns the result as a <code>Dataset<Row></code>.</p>

-  <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span>
+  <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Row</span><span class="o">;</span>

 <span class="c1">// Register the DataFrame as a SQL temporary view</span>
@@ -714,18 +714,18 @@ printSchema<span class="p">(</span>df<span class="p">)</span>
 <div data-lang="python">
 <p>The <code>sql</code> function on a <code>SparkSession</code> enables applications to run SQL queries programmatically and returns the result as a <code>DataFrame</code>.</p>

-  <div class="highlight"><pre><span class="c"># Register the DataFrame as a SQL temporary view</span>
-<span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span>
+  <div class="highlight"><pre><span></span><span class="c1"># Register the DataFrame as a SQL temporary view</span>
+<span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>

-<span class="n">sqlDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM people"</span><span class="p">)</span>
+<span class="n">sqlDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT * FROM people"</span><span class="p">)</span>
 <span class="n">sqlDF</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># | age| name|</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># |null|Michael|</span>
-<span class="c"># | 30| Andy|</span>
-<span class="c"># | 19| Justin|</span>
-<span class="c"># +----+-------+</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># | age| name|</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># |null|Michael|</span>
+<span class="c1"># | 30| Andy|</span>
+<span class="c1"># | 19| Justin|</span>
+<span class="c1"># +----+-------+</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div>
 </div>
@@ -733,7 +733,7 @@ printSchema<span class="p">(</span>df<span class="p">)</span>
 <div data-lang="r">
 <p>The <code>sql</code> function enables applications to run SQL queries programmatically and returns the result as a <code>SparkDataFrame</code>.</p>

-  <div class="highlight"><pre>df <span class="o"><-</span> sql<span class="p">(</span><span class="s">"SELECT * FROM table"</span><span class="p">)</span>
+  <div class="highlight"><pre><span></span>df <span class="o"><-</span> sql<span class="p">(</span><span class="s">"SELECT * FROM table"</span><span class="p">)</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.</small></div>
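The Python programmatic-SQL example in the hunk above, decoded to plain text (output table elided; whitespace reconstructed):

    # Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()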
@@ -750,7 +750,7 @@ refer it, e.g. <code>SELECT * FROM global_temp.view1</code>.</p>

 <div class="codetabs">
 <div data-lang="scala">

-  <div class="highlight"><pre><span class="c1">// Register the DataFrame as a global temporary view</span>
+  <div class="highlight"><pre><span></span><span class="c1">// Register the DataFrame as a global temporary view</span>
 <span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span>

 <span class="c1">// Global temporary view is tied to a system preserved database `global_temp`</span>
@@ -777,7 +777,7 @@ refer it, e.g. <code>SELECT * FROM global_temp.view1</code>.</p>
 </div>

 <div data-lang="java">

-  <div class="highlight"><pre><span class="c1">// Register the DataFrame as a global temporary view</span>
+  <div class="highlight"><pre><span></span><span class="c1">// Register the DataFrame as a global temporary view</span>
 <span class="n">df</span><span class="o">.</span><span class="na">createGlobalTempView</span><span class="o">(</span><span class="s">"people"</span><span class="o">);</span>

 <span class="c1">// Global temporary view is tied to a system preserved database `global_temp`</span>
@@ -804,37 +804,37 @@ refer it, e.g. <code>SELECT * FROM global_temp.view1</code>.</p>
 </div>

 <div data-lang="python">

-  <div class="highlight"><pre><span class="c"># Register the DataFrame as a global temporary view</span>
-<span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span>
-
-<span class="c"># Global temporary view is tied to a system preserved database `global_temp`</span>
-<span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM global_temp.people"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># | age| name|</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># |null|Michael|</span>
-<span class="c"># | 30| Andy|</span>
-<span class="c"># | 19| Justin|</span>
-<span class="c"># +----+-------+</span>
-
-<span class="c"># Global temporary view is cross-session</span>
-<span class="n">spark</span><span class="o">.</span><span class="n">newSession</span><span class="p">()</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM global_temp.people"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># | age| name|</span>
-<span class="c"># +----+-------+</span>
-<span class="c"># |null|Michael|</span>
-<span class="c"># | 30| Andy|</span>
-<span class="c"># | 19| Justin|</span>
-<span class="c"># +----+-------+</span>
+  <div class="highlight"><pre><span></span><span class="c1"># Register the DataFrame as a global temporary view</span>
+<span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
+
+<span class="c1"># Global temporary view is tied to a system preserved database `global_temp`</span>
+<span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT * FROM global_temp.people"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># | age| name|</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># |null|Michael|</span>
+<span class="c1"># | 30| Andy|</span>
+<span class="c1"># | 19| Justin|</span>
+<span class="c1"># +----+-------+</span>
+
+<span class="c1"># Global temporary view is cross-session</span>
+<span class="n">spark</span><span class="o">.</span><span class="n">newSession</span><span class="p">()</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT * FROM global_temp.people"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># | age| name|</span>
+<span class="c1"># +----+-------+</span>
+<span class="c1"># |null|Michael|</span>
+<span class="c1"># | 30| Andy|</span>
+<span class="c1"># | 19| Justin|</span>
+<span class="c1"># +----+-------+</span>
 </pre></div>
 <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div>
 </div>

 <div data-lang="sql">

-  <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">GLOBAL</span> <span class="k">TEMPORARY</span> <span class="k">VIEW</span> <span class="n">temp_view</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">b</span> <span class="o">*</span> <span class="mi">2</span> <span class="k">FROM</span> <span class="n">tbl</span>
+  <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span></span><span class="k">CREATE</span> <span class="k">GLOBAL</span> <span class="k">TEMPORARY</span> <span class="k">VIEW</span> <span class="n">temp_view</span> <span class="k">AS</span> <span class="k">SELECT</span> <span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">b</span> <span class="o">*</span> <span class="mi">2</span> <span class="k">FROM</span> <span class="n">tbl</span>

-<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">global_temp</span><span class="p">.</span><span class="n">temp_view</span></code></pre></div>
+<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">global_temp</span><span class="p">.</span><span class="n">temp_view</span></code></pre></figure>
 </div>

 </div>
class="mi">32</span><span class="o">);</span> @@ -982,7 +982,7 @@ reflection and become the names of the columns. Case classes can also be nested types such as <code>Seq</code>s or <code>Array</code>s. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.sql.catalyst.encoders.ExpressionEncoder</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.sql.catalyst.encoders.ExpressionEncoder</span> <span class="k">import</span> <span class="nn">org.apache.spark.sql.Encoder</span> <span class="c1">// For implicit conversions from RDDs to DataFrames</span> @@ -1037,7 +1037,7 @@ does not support JavaBeans that contain <code>Map</code> field(s). Nested JavaBe fields are supported though. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields.</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.Function</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.MapFunction</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Dataset</span><span class="o">;</span> @@ -1053,7 +1053,7 @@ Serializable and has getters and setters for all of its fields.</p> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">Person</span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">line</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span> <span class="n">String</span><span class="o">[]</span> <span class="n">parts</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">);</span> - <span class="n">Person</span> <span class="n">person</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Person</span><span class="o">();</span> + <span class="n">Person</span> <span class="n">person</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Person</span><span class="o">();</span> <span class="n">person</span><span class="o">.</span><span class="na">setName</span><span class="o">(</span><span class="n">parts</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span> <span class="n">person</span><span class="o">.</span><span class="na">setAge</span><span class="o">(</span><span class="n">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">parts</span><span class="o">[</span><span class="mi">1</span><span class="o">].</span><span class="na">trim</span><span class="o">()));</span> <span class="k">return</span> <span class="n">person</span><span class="o">;</span> @@ -1106,28 +1106,28 @@ Serializable and has getters and setters for all of its fields.</p> key/value pairs as kwargs to the Row class. 
The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.</p> - <div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span> + <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span> <span class="n">sc</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sparkContext</span> -<span class="c"># Load a text file and convert each line to a Row.</span> -<span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="p">)</span> -<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">","</span><span class="p">))</span> +<span class="c1"># Load a text file and convert each line to a Row.</span> +<span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"examples/src/main/resources/people.txt"</span><span class="p">)</span> +<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">","</span><span class="p">))</span> <span class="n">people</span> <span class="o">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">age</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">])))</span> -<span class="c"># Infer the schema, and register the DataFrame as a table.</span> +<span class="c1"># Infer the schema, and register the DataFrame as a table.</span> <span class="n">schemaPeople</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">people</span><span class="p">)</span> -<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> +<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span> -<span class="c"># SQL can be run over DataFrames that have been registered as a 
table.</span> -<span class="n">teenagers</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> +<span class="c1"># SQL can be run over DataFrames that have been registered as a table.</span> +<span class="n">teenagers</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> -<span class="c"># The results of SQL queries are Dataframe objects.</span> -<span class="c"># rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.</span> -<span class="n">teenNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span> +<span class="c1"># The results of SQL queries are Dataframe objects.</span> +<span class="c1"># rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.</span> +<span class="n">teenNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="s2">"Name: "</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">teenNames</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> -<span class="c"># Name: Justin</span> +<span class="c1"># Name: Justin</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div> </div> @@ -1155,7 +1155,7 @@ by <code>SparkSession</code>.</li> <p>For example:</p> - <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.sql.types._</span> + <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.sql.types._</span> <span class="c1">// Create an RDD</span> <span class="k">val</span> <span class="n">peopleRDD</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">)</span> @@ -1213,7 +1213,7 @@ by <code>SparkSession</code>.</li> <p>For example:</p> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.ArrayList</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.ArrayList</span><span class="o">;</span> <span 
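The Python schema-inference listing in the hunk above decodes to the following plain text (whitespace reconstructed):

    from pyspark.sql import Row

    sc = spark.sparkContext

    # Load a text file and convert each line to a Row.
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

    # Infer the schema, and register the DataFrame as a table.
    schemaPeople = spark.createDataFrame(people)
    schemaPeople.createOrReplaceTempView("people")

    # SQL can be run over DataFrames that have been registered as a table.
    teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    # The results of SQL queries are Dataframe objects.
    # rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
    teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
    for name in teenNames:
        print(name)
    # Name: Justin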
class="kn">import</span> <span class="nn">java.util.List</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> @@ -1296,43 +1296,43 @@ tuples or lists in the RDD created in the step 1.</li> <p>For example:</p> - <div class="highlight"><pre><span class="c"># Import data types</span> + <div class="highlight"><pre><span></span><span class="c1"># Import data types</span> <span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="o">*</span> <span class="n">sc</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sparkContext</span> -<span class="c"># Load a text file and convert each line to a Row.</span> -<span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="p">)</span> -<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">","</span><span class="p">))</span> -<span class="c"># Each line is converted to a tuple.</span> +<span class="c1"># Load a text file and convert each line to a Row.</span> +<span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s2">"examples/src/main/resources/people.txt"</span><span class="p">)</span> +<span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">","</span><span class="p">))</span> +<span class="c1"># Each line is converted to a tuple.</span> <span class="n">people</span> <span class="o">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()))</span> -<span class="c"># The schema is encoded in a string.</span> -<span class="n">schemaString</span> <span class="o">=</span> <span class="s">"name age"</span> +<span class="c1"># The schema is encoded in a string.</span> +<span class="n">schemaString</span> <span class="o">=</span> <span class="s2">"name age"</span> <span class="n">fields</span> <span class="o">=</span> <span class="p">[</span><span class="n">StructField</span><span class="p">(</span><span class="n">field_name</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">field_name</span> <span class="ow">in</span> <span 
class="n">schemaString</span><span class="o">.</span><span class="n">split</span><span class="p">()]</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">(</span><span class="n">fields</span><span class="p">)</span> -<span class="c"># Apply the schema to the RDD.</span> +<span class="c1"># Apply the schema to the RDD.</span> <span class="n">schemaPeople</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span> -<span class="c"># Creates a temporary view using the DataFrame</span> -<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> +<span class="c1"># Creates a temporary view using the DataFrame</span> +<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span> -<span class="c"># Creates a temporary view using the DataFrame</span> -<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> +<span class="c1"># Creates a temporary view using the DataFrame</span> +<span class="n">schemaPeople</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span> -<span class="c"># SQL can be run over DataFrames that have been registered as a table.</span> -<span class="n">results</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people"</span><span class="p">)</span> +<span class="c1"># SQL can be run over DataFrames that have been registered as a table.</span> +<span class="n">results</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT name FROM people"</span><span class="p">)</span> <span class="n">results</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +-------+</span> -<span class="c"># | name|</span> -<span class="c"># +-------+</span> -<span class="c"># |Michael|</span> -<span class="c"># | Andy|</span> -<span class="c"># | Justin|</span> -<span class="c"># +-------+</span> +<span class="c1"># +-------+</span> +<span class="c1"># | name|</span> +<span class="c1"># +-------+</span> +<span class="c1"># |Michael|</span> +<span class="c1"># | Andy|</span> +<span class="c1"># | Justin|</span> +<span class="c1"># +-------+</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.</small></div> </div> @@ -1354,14 +1354,14 @@ goes into specific options that are available for the built-in data sources.</p> <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><span class="k">val</span> <span class="n">usersDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="o">(</span><span 
class="s">"examples/src/main/resources/users.parquet"</span><span class="o">)</span> + <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">usersDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="o">)</span> <span class="n">usersDF</span><span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="s">"favorite_color"</span><span class="o">).</span><span class="n">write</span><span class="o">.</span><span class="n">save</span><span class="o">(</span><span class="s">"namesAndFavColors.parquet"</span><span class="o">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.</small></div> </div> <div data-lang="java"> - <div class="highlight"><pre><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">usersDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="o">);</span> + <div class="highlight"><pre><span></span><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">usersDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="o">);</span> <span class="n">usersDF</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="s">"favorite_color"</span><span class="o">).</span><span class="na">write</span><span class="o">().</span><span class="na">save</span><span class="o">(</span><span class="s">"namesAndFavColors.parquet"</span><span class="o">);</span> </pre></div> <div><small>Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.</small></div> @@ -1369,15 +1369,15 @@ goes into specific options that are available for the built-in data sources.</p> <div data-lang="python"> - <div class="highlight"><pre><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="p">)</span> -<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"favorite_color"</span><span class="p">)</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"namesAndFavColors.parquet"</span><span class="p">)</span> + <div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span 
class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"examples/src/main/resources/users.parquet"</span><span class="p">)</span> +<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"favorite_color"</span><span class="p">)</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">"namesAndFavColors.parquet"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.</small></div> </div> <div data-lang="r"> - <div class="highlight"><pre>df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="p">)</span> + <div class="highlight"><pre><span></span>df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/users.parquet"</span><span class="p">)</span> write.df<span class="p">(</span>select<span class="p">(</span>df<span class="p">,</span> <span class="s">"name"</span><span class="p">,</span> <span class="s">"favorite_color"</span><span class="p">),</span> <span class="s">"namesAndFavColors.parquet"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.</small></div> @@ -1395,14 +1395,14 @@ source type can be converted into other types using this syntax.</p> <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><span class="k">val</span> <span class="n">peopleDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"json"</span><span class="o">).</span><span class="n">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">)</span> + <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">peopleDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"json"</span><span class="o">).</span><span class="n">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">)</span> <span class="n">peopleDF</span><span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="s">"age"</span><span class="o">).</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="o">(</span><span class="s">"parquet"</span><span class="o">).</span><span class="n">save</span><span class="o">(</span><span class="s">"namesAndAges.parquet"</span><span class="o">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.</small></div> </div> <div data-lang="java"> - <div class="highlight"><pre><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">peopleDF</span> <span class="o">=</span> + <div 
class="highlight"><pre><span></span><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">peopleDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">read</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"json"</span><span class="o">).</span><span class="na">load</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">);</span> <span class="n">peopleDF</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="s">"age"</span><span class="o">).</span><span class="na">write</span><span class="o">().</span><span class="na">format</span><span class="o">(</span><span class="s">"parquet"</span><span class="o">).</span><span class="na">save</span><span class="o">(</span><span class="s">"namesAndAges.parquet"</span><span class="o">);</span> </pre></div> @@ -1410,14 +1410,14 @@ source type can be converted into other types using this syntax.</p> </div> <div data-lang="python"> - <div class="highlight"><pre><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s">"json"</span><span class="p">)</span> -<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"namesAndAges.parquet"</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s">"parquet"</span><span class="p">)</span> + <div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s2">"json"</span><span class="p">)</span> +<span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">"namesAndAges.parquet"</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s2">"parquet"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.</small></div> </div> <div data-lang="r"> - <div class="highlight"><pre>df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="s">"json"</span><span class="p">)</span> + <div class="highlight"><pre><span></span>df <span 
class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="s">"json"</span><span class="p">)</span> namesAndAges <span class="o"><-</span> select<span class="p">(</span>df<span class="p">,</span> <span class="s">"name"</span><span class="p">,</span> <span class="s">"age"</span><span class="p">)</span> write.df<span class="p">(</span>namesAndAges<span class="p">,</span> <span class="s">"namesAndAges.parquet"</span><span class="p">,</span> <span class="s">"parquet"</span><span class="p">)</span> </pre></div> @@ -1432,26 +1432,26 @@ file directly with SQL.</p> <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><span class="k">val</span> <span class="n">sqlDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="o">)</span> + <div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">sqlDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="o">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.</small></div> </div> <div data-lang="java"> - <div class="highlight"><pre><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">sqlDF</span> <span class="o">=</span> + <div class="highlight"><pre><span></span><span class="n">Dataset</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">sqlDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="o">);</span> </pre></div> <div><small>Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.</small></div> </div> <div data-lang="python"> - <div class="highlight"><pre><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="p">)</span> + <div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.</small></div> </div> <div data-lang="r"> - <div class="highlight"><pre>df <span class="o"><-</span> sql<span class="p">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span class="p">)</span> + <div class="highlight"><pre><span></span>df <span class="o"><-</span> sql<span class="p">(</span><span class="s">"SELECT * FROM parquet.`examples/src/main/resources/users.parquet`"</span><span 
class="p">)</span> </pre></div> <div><small>Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.</small></div> @@ -1531,7 +1531,7 @@ compatibility reasons.</p> <div class="codetabs"> <div data-lang="scala"> - <div class="highlight"><pre><span class="c1">// Encoders for most common types are automatically provided by importing spark.implicits._</span> + <div class="highlight"><pre><span></span><span class="c1">// Encoders for most common types are automatically provided by importing spark.implicits._</span> <span class="k">import</span> <span class="nn">spark.implicits._</span> <span class="k">val</span> <span class="n">peopleDF</span> <span class="k">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="o">)</span> @@ -1558,7 +1558,7 @@ compatibility reasons.</p> </div> <div data-lang="java"> - <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> + <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaRDD</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.JavaSparkContext</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.api.java.function.MapFunction</span><span class="o">;</span> <span class="kn">import</span> <span class="nn">org.apache.spark.sql.Encoders</span><span class="o">;</span> @@ -1595,32 +1595,32 @@ compatibility reasons.</p> <div data-lang="python"> - <div class="highlight"><pre><span class="n">peopleDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">)</span> + <div class="highlight"><pre><span></span><span class="n">peopleDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s2">"examples/src/main/resources/people.json"</span><span class="p">)</span> -<span class="c"># DataFrames can be saved as Parquet files, maintaining the schema information.</span> -<span class="n">peopleDF</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">"people.parquet"</span><span class="p">)</span> +<span class="c1"># DataFrames can be saved as Parquet files, maintaining the schema information.</span> +<span class="n">peopleDF</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">"people.parquet"</span><span class="p">)</span> -<span class="c"># Read in the Parquet file created above.</span> -<span class="c"># Parquet files are self-describing so the schema is preserved.</span> -<span class="c"># The result of loading a parquet file is also a DataFrame.</span> -<span class="n">parquetFile</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span 
class="s">"people.parquet"</span><span class="p">)</span> +<span class="c1"># Read in the Parquet file created above.</span> +<span class="c1"># Parquet files are self-describing so the schema is preserved.</span> +<span class="c1"># The result of loading a parquet file is also a DataFrame.</span> +<span class="n">parquetFile</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">"people.parquet"</span><span class="p">)</span> -<span class="c"># Parquet files can also be used to create a temporary view and then used in SQL statements.</span> -<span class="n">parquetFile</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s">"parquetFile"</span><span class="p">)</span> -<span class="n">teenagers</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> +<span class="c1"># Parquet files can also be used to create a temporary view and then used in SQL statements.</span> +<span class="n">parquetFile</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"parquetFile"</span><span class="p">)</span> +<span class="n">teenagers</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> <span class="n">teenagers</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> -<span class="c"># +------+</span> -<span class="c"># | name|</span> -<span class="c"># +------+</span> -<span class="c"># |Justin|</span> -<span class="c"># +------+</span> +<span class="c1"># +------+</span> +<span class="c1"># | name|</span> +<span class="c1"># +------+</span> +<span class="c1"># |Justin|</span> +<span class="c1"># +------+</span> </pre></div> <div><small>Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.</small></div> </div> <div data-lang="r"> - <div class="highlight"><pre>df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="s">"json"</span><span class="p">)</span> + <div class="highlight"><pre><span></span>df <span class="o"><-</span> read.df<span class="p">(</span><span class="s">"examples/src/main/resources/people.json"</span><span class="p">,</span> <span class="s">"json"</span><span class="p">)</span> <span class="c1"># SparkDataFrame can be saved as Parquet files, maintaining the schema information.</span> write.parquet<span class="p">(</span>df<span class="p">,</span> <span class="s">"people.parquet"</span><span class="p">)</span> @@ -1652,13 +1652,13 @@ teenNames <span class="o"><-</span> dapply<span class="p">(</span>df<span cla <div data-lang="sql"> - <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">VIEW</span> <span class="n">parquetTable</span> + <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span></span><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span 
class="k">VIEW</span> <span class="n">parquetTable</span> <span class="k">USING</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">spark</span><span class="p">.</span><span class="k">sql</span><span class="p">.</span><span class="n">parquet</span> <span class="k">OPTIONS</span> <span class="p">(</span> <span class="n">path</span> <span class="ss">"examples/src/main/resources/people.parquet"</span> <span class="p">)</span> -<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">parquetTable</span></code></pre></div> +<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">parquetTable</span></code></pre></figure> </div> @@ -1673,7 +1673,7 @@ partitioning information automatically. For example, we can store all our previo population data into a partitioned table using the following directory structure, with two extra columns, <code>gender</code> and <code>country</code> as partitioning columns:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">path +<figure class="highlight"><pre><code class="language-text" data-lang="text"><span></span>path âââ to âââ table âââ gender=male @@ -1691,17 +1691,17 @@ columns, <code>gender</code> and <code>country</code> as partitioning columns:</   â  âââ data.parquet   âââ country=CN   â  âââ data.parquet -   âââ ...</code></pre></div> +   âââ ...</code></pre></figure> <p>By passing <code>path/to/table</code> to either <code>SparkSession.read.parquet</code> or <code>SparkSession.read.load</code>, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:</p> -<div class="highlight"><pre><code class="language-text" data-lang="text">root +<figure class="highlight"><pre><code class="language-text" data-lang="text"><span></span>root |-- name: string (nullable = true) |-- age: long (nullable = true) |-- gender: strin
<TRUNCATED>
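The passages touched by this diff all exercise the same DataFrame reader/writer API: generic load/save, manually specified formats, SQL run directly on files, and partition discovery. For reference, a minimal PySpark sketch of those operations, assuming a local SparkSession and the sample data files shipped in the Spark repo; output paths are illustrative only:

    from pyspark.sql import SparkSession

    # Entry point for Spark SQL; the appName is arbitrary.
    spark = SparkSession.builder.appName("sql-guide-sketch").getOrCreate()

    # Generic load/save: Parquet is the default data source format.
    users = spark.read.load("examples/src/main/resources/users.parquet")
    users.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

    # Manually specifying the format on both read and write.
    people = spark.read.load("examples/src/main/resources/people.json", format="json")
    people.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

    # Running SQL directly on a file, without creating a temporary view first.
    sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    sqlDF.show()

    # Partition discovery: pointing the reader at a table root such as
    # path/to/table (hypothetical) makes Spark SQL infer partition columns
    # like gender and country from the directory names.
    # partitioned = spark.read.parquet("path/to/table")

    spark.stop()

As in the guide's examples, read.load() and write.save() fall back to Parquet whenever no format is given.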