Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226239048

--- Diff: docs/sql-data-sources-parquet.md ---
@@ -0,0 +1,321 @@

---
layout: global
title: Parquet Files
displayTitle: Parquet Files
---

* Table of contents
{:toc}

[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
compatibility reasons.

### Loading Data Programmatically

Using the data from the above example:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example basic_parquet_example python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example basic_parquet_example r/RSparkSQLExample.R %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)

SELECT * FROM parquetTable
{% endhighlight %}

</div>

</div>

### Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
are able to discover and infer partitioning information automatically.
For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, `gender` and `country`, as partitioning columns:

{% highlight text %}

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

{% endhighlight %}

By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
will automatically extract the partitioning information from the paths.
The schema of the returned DataFrame then becomes:

{% highlight text %}

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

{% endhighlight %}

Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types, date, timestamp, and string types are supported. Sometimes users may not want
to automatically infer the data types of the partitioning columns. For these use cases, the
automatic type inference can be configured by
`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
inference is disabled, the string type will be used for the partitioning columns.
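As an illustration, here is a minimal Scala sketch of both behaviors; it assumes an existing `SparkSession` named `spark` and the hypothetical `path/to/table` layout shown above:

{% highlight scala %}
// Reading the table root discovers `gender` and `country` as partitioning
// columns; their data types are inferred automatically.
val df = spark.read.parquet("path/to/table")
df.printSchema()

// Disable automatic type inference so that every partitioning column
// (e.g. a numeric one such as `year=2018`) is read as a string instead.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
{% endhighlight %}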
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
by default. For the above example, if users pass `path/to/table/gender=male` to either
`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered a
partitioning column. If users need to specify the base path that partition discovery
should start with, they can set `basePath` in the data source options. For example,
when `path/to/table/gender=male` is the path of the data and
users set `basePath` to `path/to/table/`, `gender` will be a partitioning column, as sketched below.
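For example (a minimal Scala sketch, assuming an existing `SparkSession` named `spark` and the directory layout above):

{% highlight scala %}
// Without basePath, only the given directory is scanned, so `gender`
// is not treated as a partitioning column.
val males = spark.read.parquet("path/to/table/gender=male")

// With basePath pointing at the table root, partition discovery starts
// there and `gender` becomes a partitioning column again.
val malesWithGender = spark.read
  .option("basePath", "path/to/table/")
  .parquet("path/to/table/gender=male")
{% endhighlight %}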
### Schema Merging

Like Protocol Buffers, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge the schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, it
is turned off by default starting from 1.5.0. You may enable it by

1. setting the data source option `mergeSchema` to `true` when reading Parquet files (as shown in the
   examples below), or
2. setting the global SQL option `spark.sql.parquet.mergeSchema` to `true`.

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example schema_merging scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example schema_merging java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example schema_merging python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example schema_merging r/RSparkSQLExample.R %}
</div>

</div>

### Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the
`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.

#### Hive/Parquet Schema Reconciliation

There are two key differences between Hive and Parquet from the perspective of table schema
processing:

1. Hive is case insensitive, while Parquet is not.
1. Hive considers all columns nullable, while nullability in Parquet is significant.

For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

1. Fields that have the same name in both schemas must have the same data type regardless of
   nullability. The reconciled field should have the data type of the Parquet side, so that
   nullability is respected.

1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.

   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
     reconciled schema.

#### Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, the metadata of those converted tables is also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
metadata.

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}
// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

{% highlight java %}
// spark is an existing SparkSession
spark.catalog().refreshTable("my_table");
{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

{% highlight python %}
# spark is an existing SparkSession
spark.catalog.refreshTable("my_table")
{% endhighlight %}

</div>

<div data-lang="r" markdown="1">

{% highlight r %}
refreshTable("my_table")
{% endhighlight %}

</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}
REFRESH TABLE my_table;
{% endhighlight %}

</div>

</div>

### Configuration

Configuration of Parquet can be done using the `setConf` method on `SparkSession` or by running
`SET key=value` commands using SQL.
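For example (a minimal Scala sketch, assuming an existing `SparkSession` named `spark`; the runtime config object is used here in place of `setConf`):

{% highlight scala %}
// Set a Parquet option programmatically ...
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

// ... or run the equivalent SQL command.
spark.sql("SET spark.sql.parquet.mergeSchema=true")
{% endhighlight %}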
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
  <td><code>spark.sql.parquet.binaryAsString</code></td>
  <td>false</td>
  <td>
    Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do
--- End diff --

nit: `in particular Impala, ...` -> `in particular, Impala, ...`?