Author: pwendell
Date: Mon Mar 16 05:30:06 2015
New Revision: 1666875

URL: http://svn.apache.org/r1666875
Log:
Updating to incorporate doc changes in SPARK-6275 and SPARK-5310
Modified:
    spark/site/docs/1.3.0/sql-programming-guide.html

Modified: spark/site/docs/1.3.0/sql-programming-guide.html
URL: http://svn.apache.org/viewvc/spark/site/docs/1.3.0/sql-programming-guide.html?rev=1666875&r1=1666874&r2=1666875&view=diff
==============================================================================
--- spark/site/docs/1.3.0/sql-programming-guide.html (original)
+++ spark/site/docs/1.3.0/sql-programming-guide.html Mon Mar 16 05:30:06 2015
@@ -113,7 +113,7 @@
 <ul id="markdown-toc">
 <li><a href="#overview">Overview</a></li>
 <li><a href="#dataframes">DataFrames</a> <ul>
- <li><a href="#starting-point-sqlcontext">Starting Point: SQLContext</a></li>
+ <li><a href="#starting-point-sqlcontext">Starting Point: <code>SQLContext</code></a></li>
 <li><a href="#creating-dataframes">Creating DataFrames</a></li>
 <li><a href="#dataframe-operations">DataFrame Operations</a></li>
 <li><a href="#running-sql-queries-programmatically">Running SQL Queries Programmatically</a></li>
@@ -133,6 +133,8 @@ </li>
 <li><a href="#parquet-files">Parquet Files</a> <ul>
 <li><a href="#loading-data-programmatically">Loading Data Programmatically</a></li>
+ <li><a href="#partition-discovery">Partition discovery</a></li>
+ <li><a href="#schema-merging">Schema merging</a></li>
 <li><a href="#configuration">Configuration</a></li>
 </ul> </li>
@@ -158,7 +160,7 @@
 <li><a href="#unification-of-the-java-and-scala-apis">Unification of the Java and Scala APIs</a></li>
 <li><a href="#isolation-of-implicit-conversions-and-removal-of-dsl-package-scala-only">Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)</a></li>
 <li><a href="#removal-of-the-type-aliases-in-orgapachesparksql-for-datatype-scala-only">Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)</a></li>
- <li><a href="#udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to sqlContext.udf (Java & Scala)</a></li>
+ <li><a href="#udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to <code>sqlContext.udf</code> (Java & Scala)</a></li>
 <li><a href="#python-datatypes-no-longer-singletons">Python DataTypes No Longer Singletons</a></li>
 </ul> </li>
@@ -191,14 +193,14 @@
 <p>All of the examples on this page use sample data included in the Spark distribution and can be
 run in the <code>spark-shell</code> or the <code>pyspark</code> shell.</p>

-<h2 id="starting-point-sqlcontext">Starting Point: SQLContext</h2>
+<h2 id="starting-point-sqlcontext">Starting Point: <code>SQLContext</code></h2>

 <div class="codetabs">
 <div data-lang="scala">

 <p>The entry point into all functionality in Spark SQL is the
-<a href="api/scala/index.html#org.apache.spark.sql.SQLContext">SQLContext</a> class, or one of its
-descendants. To create a basic SQLContext, all you need is a SparkContext.</p>
+<a href="api/scala/index.html#org.apache.spark.sql.SQLContext"><code>SQLContext</code></a> class, or one of its
+descendants. To create a basic <code>SQLContext</code>, all you need is a SparkContext.</p>
 <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="c1">// An existing SparkContext.</span>
<span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span>
@@ -211,8 +213,8 @@ descendants. To create a basic SQLConte
 <div data-lang="java">

 <p>The entry point into all functionality in Spark SQL is the
-<a href="api/java/index.html#org.apache.spark.sql.SQLContext">SQLContext</a> class, or one of its
-descendants. To create a basic SQLContext, all you need is a SparkContext.</p>
+<a href="api/java/index.html#org.apache.spark.sql.SQLContext"><code>SQLContext</code></a> class, or one of its
+descendants. To create a basic <code>SQLContext</code>, all you need is a SparkContext.</p>

 <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">JavaSparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="o">...;</span> <span class="c1">// An existing JavaSparkContext.</span>
<span class="n">SQLContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span></code></pre></div>
@@ -222,8 +224,8 @@ descendants. To create a basic SQLConte
 <div data-lang="python">

 <p>The entry point into all relational functionality in Spark is the
-<a href="api/python/pyspark.sql.SQLContext-class.html">SQLContext</a> class, or one
-of its decedents. To create a basic SQLContext, all you need is a SparkContext.</p>
+<a href="api/python/pyspark.sql.SQLContext-class.html"><code>SQLContext</code></a> class, or one
+of its descendants. To create a basic <code>SQLContext</code>, all you need is a SparkContext.</p>

 <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SQLContext</span>
<span class="n">sqlContext</span> <span class="o">=</span> <span class="n">SQLContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span></code></pre></div>
@@ -231,20 +233,20 @@ of its decedents. To create a basic SQL
 </div>
 </div>

-<p>In addition to the basic SQLContext, you can also create a HiveContext, which provides a
-superset of the functionality provided by the basic SQLContext. Additional features include
+<p>In addition to the basic <code>SQLContext</code>, you can also create a <code>HiveContext</code>, which provides a
+superset of the functionality provided by the basic <code>SQLContext</code>. Additional features include
 the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
-ability to read data from Hive tables. To use a HiveContext, you do not need to have an
-existing Hive setup, and all of the data sources available to a SQLContext are still available.
-HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default
-Spark build. If these dependencies are not a problem for your application then using HiveContext
-is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up
-to feature parity with a HiveContext.</p>
+ability to read data from Hive tables. To use a <code>HiveContext</code>, you do not need to have an
+existing Hive setup, and all of the data sources available to a <code>SQLContext</code> are still available.
+<code>HiveContext</code> is only packaged separately to avoid including all of Hive’s dependencies in the default
+Spark build. If these dependencies are not a problem for your application then using <code>HiveContext</code>
+is recommended for the 1.3 release of Spark. Future releases will focus on bringing <code>SQLContext</code> up
+to feature parity with a <code>HiveContext</code>.</p>
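<p>For example, a minimal sketch of creating a <code>HiveContext</code> in Scala (assuming an existing SparkContext <code>sc</code>, as above, and that the Hive module is on the classpath):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">val sc: SparkContext // An existing SparkContext.
// HiveContext accepts a plain SparkContext, just like SQLContext above.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)</code></pre></div>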
 <p>The specific variant of SQL that is used to parse queries can also be selected using the
 <code>spark.sql.dialect</code> option. This parameter can be changed using either the <code>setConf</code> method on
-a SQLContext or by using a <code>SET key=value</code> command in SQL. For a SQLContext, the only dialect
-available is “sql” which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the
+a <code>SQLContext</code> or by using a <code>SET key=value</code> command in SQL. For a <code>SQLContext</code>, the only dialect
+available is “sql” which uses a simple SQL parser provided by Spark SQL. In a <code>HiveContext</code>, the
 default is “hiveql”, though “sql” is also available. Since the HiveQL parser is much more complete,
 this is recommended for most use cases.</p>
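<p>As a concrete sketch of the two mechanisms just described in Scala (assuming the <code>sqlContext</code> created earlier; the dialect values are the ones named above):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Change the dialect programmatically via setConf...
sqlContext.setConf("spark.sql.dialect", "sql")
// ...or from SQL itself; both forms set the same property.
sqlContext.sql("SET spark.sql.dialect=sql")</code></pre></div>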
@@ -309,10 +311,10 @@ this is recommended for most use cases.<
 <span class="c1">// Show the content of the DataFrame</span>
 <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="o">()</span>
-<span class="c1">// age name </span>
+<span class="c1">// age name</span>
 <span class="c1">// null Michael</span>
-<span class="c1">// 30 Andy </span>
-<span class="c1">// 19 Justin </span>
+<span class="c1">// 30 Andy</span>
+<span class="c1">// 19 Justin</span>

 <span class="c1">// Print the schema in a tree format</span>
 <span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="o">()</span>
@@ -322,17 +324,17 @@ this is recommended for most use cases.<
 <span class="c1">// Select only the "name" column</span>
 <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">).</span><span class="n">show</span><span class="o">()</span>
-<span class="c1">// name </span>
+<span class="c1">// name</span>
 <span class="c1">// Michael</span>
-<span class="c1">// Andy </span>
-<span class="c1">// Justin </span>
+<span class="c1">// Andy</span>
+<span class="c1">// Justin</span>

 <span class="c1">// Select everybody, but increment the age by 1</span>
 <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="n">df</span><span class="o">(</span><span class="s">"age"</span><span class="o">)</span> <span class="o">+</span> <span class="mi">1</span><span class="o">).</span><span class="n">show</span><span class="o">()</span>
 <span class="c1">// name (age + 1)</span>
-<span class="c1">// Michael null </span>
-<span class="c1">// Andy 31 </span>
-<span class="c1">// Justin 20 </span>
+<span class="c1">// Michael null</span>
+<span class="c1">// Andy 31</span>
+<span class="c1">// Justin 20</span>

 <span class="c1">// Select people older than 21</span>
 <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">df</span><span class="o">(</span><span class="s">"age"</span><span class="o">)</span> <span class="o">></span> <span class="mi">21</span><span class="o">).</span><span class="n">show</span><span class="o">()</span>
@@ -358,10 +360,10 @@ this is recommended for most use cases.<
 <span class="c1">// Show the content of the DataFrame</span>
 <span class="n">df</span><span class="o">.</span><span class="na">show</span><span class="o">();</span>
-<span class="c1">// age name </span>
+<span class="c1">// age name</span>
 <span class="c1">// null Michael</span>
-<span class="c1">// 30 Andy </span>
-<span class="c1">// 19 Justin </span>
+<span class="c1">// 30 Andy</span>
+<span class="c1">// 19 Justin</span>

 <span class="c1">// Print the schema in a tree format</span>
 <span class="n">df</span><span class="o">.</span><span class="na">printSchema</span><span class="o">();</span>
@@ -371,17 +373,17 @@ this is recommended for most use cases.<
 <span class="c1">// Select only the "name" column</span>
 <span class="n">df</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">).</span><span class="na">show</span><span class="o">();</span>
-<span class="c1">// name </span>
+<span class="c1">// name</span>
 <span class="c1">// Michael</span>
-<span class="c1">// Andy </span>
-<span class="c1">// Justin </span>
+<span class="c1">// Andy</span>
+<span class="c1">// Justin</span>

 <span class="c1">// Select everybody, but increment the age by 1</span>
 <span class="n">df</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"name"</span><span class="o">,</span> <span class="n">df</span><span class="o">.</span><span class="na">col</span><span class="o">(</span><span class="s">"age"</span><span class="o">).</span><span class="na">plus</span><span class="o">(</span><span class="mi">1</span><span class="o">)).</span><span class="na">show</span><span class="o">();</span>
 <span class="c1">// name (age + 1)</span>
-<span class="c1">// Michael null </span>
-<span class="c1">// Andy 31 </span>
-<span class="c1">// Justin 20 </span>
+<span class="c1">// Michael null</span>
+<span class="c1">// Andy 31</span>
+<span class="c1">// Justin 20</span>

 <span class="c1">// Select people older than 21</span>
 <span class="n">df</span><span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">df</span><span class="o">.</span><span class="na">col</span><span class="o">(</span><span class="s">"age"</span><span class="o">).</span><span class="na">gt</span><span class="o">(</span><span class="mi">21</span><span class="o">)).</span><span class="na">show</span><span class="o">();</span>
@@ -407,10 +409,10 @@ this is recommended for most use cases.<
 <span class="c"># Show the content of the DataFrame</span>
 <span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c">## age name </span>
+<span class="c">## age name</span>
 <span class="c">## null Michael</span>
-<span class="c">## 30 Andy </span>
-<span class="c">## 19 Justin </span>
+<span class="c">## 30 Andy</span>
+<span class="c">## 19 Justin</span>

 <span class="c"># Print the schema in a tree format</span>
 <span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
@@ -420,17 +422,17 @@ this is recommended for most use cases.<
 <span class="c"># Select only the "name" column</span>
 <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
-<span class="c">## name </span>
+<span class="c">## name</span>
 <span class="c">## Michael</span>
-<span class="c">## Andy </span>
-<span class="c">## Justin </span>
+<span class="c">## Andy</span>
+<span class="c">## Justin</span>

 <span class="c"># Select everybody, but increment the age by 1</span>
 <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"name"</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
 <span class="c">## name (age + 1)</span>
-<span class="c">## Michael null </span>
-<span class="c">## Andy 31 </span>
-<span class="c">## Justin 20 </span>
+<span class="c">## Michael null</span>
+<span class="c">## Andy 31</span>
+<span class="c">## Justin 20</span>

 <span class="c"># Select people older than 21</span>
 <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">21</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
class="mi">21</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> @@ -509,7 +511,7 @@ registered as a table. Tables can be us <span class="k">case</span> <span class="k">class</span> <span class="nc">Person</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">age</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="c1">// Create an RDD of Person objects and register it as a table.</span> -<span class="k">val</span> <span class="n">people</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">p</span> <span class="k">=></span> <span class="nc">Person</span><span class="o">(</span><span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">p</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="n">trim</span><span class="o">.</span><span class="n">toInt</span><span class="o">))</span> +<span class="k">val</span> <span class="n">people</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">p</span> <span class="k">=></span> <span class="nc">Person</span><span class="o">(</span><span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">p</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="n">trim</span><span class="o">.</span><span class="n">toInt</span><span class="o">)).</span><span class="n">toDF</span><span class="o ">()</span> <span class="n">people</span><span class="o">.</span><span class="n">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span> <span class="c1">// SQL statements can be run by using the sql methods provided by sqlContext.</span> @@ -917,7 +919,7 @@ new data.</p> contents of the dataframe and create a pointer to the data in the HiveMetastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the <code>table</code> -method on a SQLContext with the name of the table.</p> +method on a <code>SQLContext</code> with the name of the table.</p> <p>By default <code>saveAsTable</code> will create a “managed table”, meaning that the location of the data will be controlled by the metastore. 
 <p>By default <code>saveAsTable</code> will create a “managed table”, meaning that the location of the data will
 be controlled by the metastore. Managed tables will also have their data deleted automatically
@@ -1017,9 +1019,120 @@ of the original data.</p>
 </div>
+<h3 id="partition-discovery">Partition discovery</h3>
+
+<p>Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning column values encoded in
+the path of each partition directory. The Parquet data source is now able to discover and infer
+partitioning information automatically. For example, we can store all our previously used
+population data into a partitioned table using the following directory structure, with two extra
+columns, <code>gender</code> and <code>country</code> as partitioning columns:</p>
+
+<div class="highlight"><pre><code class="language-text" data-lang="text">path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...</code></pre></div>
+
+<p>By passing <code>path/to/table</code> to either <code>SQLContext.parquetFile</code> or <code>SQLContext.load</code>, Spark SQL will
+automatically extract the partitioning information from the paths. Now the schema of the returned
+DataFrame becomes:</p>
+
+<div class="highlight"><pre><code class="language-text" data-lang="text">root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- gender: string (nullable = true)
+|-- country: string (nullable = true)</code></pre></div>
+
+<p>Notice that the data types of the partitioning columns are automatically inferred. Currently,
+numeric data types and string type are supported.</p>
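<p>For example, a minimal sketch of reading the partitioned layout above in Scala (the path is the illustrative <code>path/to/table</code> from the tree; <code>sqlContext</code> is assumed from earlier):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Point Spark SQL at the table root; gender and country are discovered
// from the directory names and appended to the schema shown above.
val people = sqlContext.parquetFile("path/to/table")
// Partition columns behave like ordinary columns, e.g. in filters.
people.filter(people("country") === "US").show()</code></pre></div>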
+<h3 id="schema-merging">Schema merging</h3>
+
+<p>Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
+a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
+up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
+source is now able to automatically detect this case and merge schemas of all these files.</p>
+
+<div class="codetabs">
+
+<div data-lang="scala">
+
+  <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sqlContext from the previous example is used in this example.</span>
+<span class="c1">// This is used to implicitly convert an RDD to a DataFrame.</span>
+<span class="k">import</span> <span class="nn">sqlContext.implicits._</span>
+
+<span class="c1">// Create a simple DataFrame, stored into a partition directory</span>
+<span class="k">val</span> <span class="n">df1</span> <span class="k">=</span> <span class="n">sparkContext</span><span class="o">.</span><span class="n">makeRDD</span><span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="mi">5</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">i</span> <span class="k">=></span> <span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">i</span> <span class="o">*</span> <span class="mi">2</span><span class="o">)).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"single"</span><span class="o">,</span> <span class="s">"double"</span><span class="o">)</span>
+<span class="n">df1</span><span class="o">.</span><span class="n">saveAsParquetFile</span><span class="o">(</span><span class="s">"data/test_table/key=1"</span><span class="o">)</span>
+
+<span class="c1">// Create another DataFrame in a new partition directory,</span>
+<span class="c1">// adding a new column and dropping an existing column</span>
+<span class="k">val</span> <span class="n">df2</span> <span class="k">=</span> <span class="n">sparkContext</span><span class="o">.</span><span class="n">makeRDD</span><span class="o">(</span><span class="mi">6</span> <span class="n">to</span> <span class="mi">10</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">i</span> <span class="k">=></span> <span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">i</span> <span class="o">*</span> <span class="mi">3</span><span class="o">)).</span><span class="n">toDF</span><span class="o">(</span><span class="s">"single"</span><span class="o">,</span> <span class="s">"triple"</span><span class="o">)</span>
+<span class="n">df2</span><span class="o">.</span><span class="n">saveAsParquetFile</span><span class="o">(</span><span class="s">"data/test_table/key=2"</span><span class="o">)</span>
+
+<span class="c1">// Read the partitioned table</span>
+<span class="k">val</span> <span class="n">df3</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">parquetFile</span><span class="o">(</span><span class="s">"data/test_table"</span><span class="o">)</span>
+<span class="n">df3</span><span class="o">.</span><span class="n">printSchema</span><span class="o">()</span>
+
+<span class="c1">// The final schema consists of all 3 columns in the Parquet files together</span>
+<span class="c1">// with the partitioning column appearing in the partition directory paths.</span>
+<span class="c1">// root</span>
+<span class="c1">// |-- single: int (nullable = true)</span>
+<span class="c1">// |-- double: int (nullable = true)</span>
+<span class="c1">// |-- triple: int (nullable = true)</span>
+<span class="c1">// |-- key: int (nullable = true)</span></code></pre></div>
+
+  </div>
+
data-lang="python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># sqlContext from the previous example is used in this example.</span> + +<span class="c"># Create a simple DataFrame, stored into a partition directory</span> +<span class="n">df1</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>\ + <span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">i</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">single</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">double</span><span class="o">=</span><span class="n">i</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)))</span> +<span class="n">df1</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"data/test_table/key=1"</span><span class="p">,</span> <span class="s">"parquet"</span><span class="p">)</span> + +<span class="c"># Create another DataFrame in a new partition directory,</span> +<span class="c"># adding a new column and dropping an existing column</span> +<span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">11</span><span class="p">))</span> + <span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">i</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">single</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">triple</span><span class="o">=</span><span class="n">i</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)))</span> +<span class="n">df2</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">"data/test_table/key=2"</span><span class="p">,</span> <span class="s">"parquet"</span><span class="p">)</span> + +<span class="c"># Read the partitioned table</span> +<span class="n">df3</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">parquetFile</span><span class="p">(</span><span class="s">"data/test_table"</span><span class="p">)</span> +<span class="n">df3</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span> + +<span class="c"># The final schema consists of all 3 columns in the Parquet files together</span> +<span class="c"># with the partiioning column appeared in the partition directory paths.</span> +<span class="c"># root</span> +<span class="c"># |-- single: int (nullable = true)</span> +<span class="c"># |-- double: int (nullable = true)</span> +<span class="c"># |-- triple: int (nullable = true)</span> +<span class="c"># |-- key : int (nullable = 
 <h3 id="configuration">Configuration</h3>

-<p>Configuration of Parquet can be done using the <code>setConf</code> method on SQLContext or by running
+<p>Configuration of Parquet can be done using the <code>setConf</code> method on <code>SQLContext</code> or by running
 <code>SET key=value</code> commands using SQL.</p>
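<p>For instance, a minimal sketch of both forms in Scala (using one option from the table below; <code>sqlContext</code> is assumed from earlier):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Programmatically: interpret binary Parquet columns as strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
// Equivalently, from SQL:
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")</code></pre></div>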
 <table class="table">
@@ -1082,7 +1195,7 @@ of the original data.</p>
 <div data-lang="scala">

 <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
-This conversion can be done using one of two methods in a SQLContext:</p>
+This conversion can be done using one of two methods in a <code>SQLContext</code>:</p>

 <ul>
 <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li>
@@ -1124,7 +1237,7 @@ a regular multi-line JSON file will most
 <div data-lang="java">

 <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
-This conversion can be done using one of two methods in a SQLContext :</p>
+This conversion can be done using one of two methods in a <code>SQLContext</code>:</p>

 <ul>
 <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li>
@@ -1167,7 +1280,7 @@ a regular multi-line JSON file will most
 <div data-lang="python">

 <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
-This conversion can be done using one of two methods in a SQLContext:</p>
+This conversion can be done using one of two methods in a <code>SQLContext</code>:</p>

 <ul>
 <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li>
@@ -1197,7 +1310,7 @@ a regular multi-line JSON file will most
 <span class="c"># Register this DataFrame as a table.</span>
 <span class="n">people</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span>

-<span class="c"># SQL statements can be run by using the sql methods provided by sqlContext.</span>
+<span class="c"># SQL statements can be run by using the sql methods provided by `sqlContext`.</span>
 <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="p">)</span>

 <span class="c"># Alternatively, a DataFrame can be created for a JSON dataset represented by</span>
@@ -1239,7 +1352,7 @@ on all of the worker nodes, as they will
 <p>When working with Hive one must construct a <code>HiveContext</code>, which inherits from <code>SQLContext</code>, and
 adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
-not have an existing Hive deployment can still create a HiveContext. When not configured by the
+not have an existing Hive deployment can still create a <code>HiveContext</code>. When not configured by the
 hive-site.xml, the context automatically creates <code>metastore_db</code> and <code>warehouse</code> in the current
 directory.</p>
@@ -1403,7 +1516,7 @@ turning on some experimental options.</p
 Then Spark SQL will scan only required columns and will automatically tune compression to minimize
 memory usage and GC pressure. You can call <code>sqlContext.uncacheTable("tableName")</code> to remove the table
 from memory.</p>

-<p>Configuration of in-memory caching can be done using the <code>setConf</code> method on SQLContext or by running
+<p>Configuration of in-memory caching can be done using the <code>setConf</code> method on <code>SQLContext</code> or by running
 <code>SET key=value</code> commands using SQL.</p>

 <table class="table">
@@ -1513,10 +1626,10 @@ your machine and a blank password. For s
 <p>You may also use the beeline script that comes with Hive.</p>

-<p>Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. 
-Use the following setting to enable HTTP mode as system property or in <code>hive-site.xml</code> file in <code>conf/</code>: </p>
+<p>Thrift JDBC server also supports sending thrift RPC messages over HTTP transport.
+Use the following setting to enable HTTP mode as system property or in <code>hive-site.xml</code> file in <code>conf/</code>:</p>

-<pre><code>hive.server2.transport.mode - Set this to value: http 
+<pre><code>hive.server2.transport.mode - Set this to value: http
 hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
 hive.server2.http.endpoint - HTTP endpoint; default is cliservice
 </code></pre>
@@ -1591,7 +1704,7 @@ case classes or tuples) with a method <c
 <p>Spark 1.3 removes the type aliases that were present in the base sql package for <code>DataType</code>.
 Users should instead import the classes in <code>org.apache.spark.sql.types</code>.</p>

-<h4 id="udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to sqlContext.udf (Java & Scala)</h4>
+<h4 id="udf-registration-moved-to-sqlcontextudf-java--scala">UDF Registration Moved to <code>sqlContext.udf</code> (Java & Scala)</h4>

 <p>Functions that are used to register UDFs, either for use in the DataFrame DSL or SQL, have been
 moved into the <code>udf</code> object in <code>SQLContext</code>.</p>
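<p>For example, a minimal sketch of the new registration style in Scala (the UDF name "strLen" and its body are illustrative; <code>sqlContext</code> is assumed from earlier):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Register a function under the name "strLen", then use it from SQL.
sqlContext.udf.register("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(name) FROM people")</code></pre></div>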