[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22746 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226288610 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Done in 17995f9. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
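For context, a minimal Scala sketch of the two ORC options described in the quoted section, assuming Spark 2.3+ with a local session; the output path below is illustrative, not taken from the PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-native-reader")
  .master("local[*]")
  // Select the native ORC implementation (built on Apache ORC 1.4) ...
  .config("spark.sql.orc.impl", "native")
  // ... and enable vectorized decoding for it, as the table above describes.
  .config("spark.sql.orc.enableVectorizedReader", "true")
  .getOrCreate()

// Write and read back some ORC data; with the settings above the read side
// goes through the vectorized reader.
spark.range(0, 1000).toDF("id").write.mode("overwrite").orc("/tmp/orc_demo")
spark.read.orc("/tmp/orc_demo").show(5)
```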
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226263066 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast --- End diff -- `explict` -> `explicit` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
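To make the "explicit cast" remark concrete, a small Scala sketch of the workaround under Spark 2.4's stricter promotion rules (query text only, error output not reproduced):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array-contains-cast").master("local[*]").getOrCreate()

// In Spark 2.4 this raises AnalysisException, because IntegerType cannot be
// promoted to StringType in a loss-less manner:
// spark.sql("SELECT array_contains(array(1), '1')")

// Casting one side explicitly resolves the ambiguity either way:
spark.sql("SELECT array_contains(array(1), CAST('1' AS INT))").show()           // true
spark.sql("SELECT array_contains(CAST(array(1) AS ARRAY<STRING>), '1')").show() // true
```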
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226262995 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast --- End diff -- `explict` -> `explicit` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226250306 --- Diff: docs/sql-performance-turing.md --- @@ -0,0 +1,151 @@ +--- +layout: global +title: Performance Tuning +displayTitle: Performance Tuning +--- + +* Table of contents +{:toc} + +For some workloads, it is possible to improve performance by either caching data in memory, or by +turning on some experimental options. + +## Caching Data In Memory + +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Then Spark SQL will scan only required columns and will automatically tune compression to minimize +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory. + +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running +`SET key=value` commands using SQL. + + +Property NameDefaultMeaning + + spark.sql.inMemoryColumnarStorage.compressed + true + +When set to true Spark SQL will automatically select a compression codec for each column based +on statistics of the data. + + + + spark.sql.inMemoryColumnarStorage.batchSize + 1 + +Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization +and compression, but risk OOMs when caching data. + + + + + +## Other Configuration Options + +The following options can also be used to tune the performance of query execution. It is possible +that these options will be deprecated in future release as more optimizations are performed automatically. + + + Property NameDefaultMeaning + +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, --- End diff -- nit: `It is better to over estimated` -> ` It is better to over-estimate`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
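As a concrete illustration of the quoted tuning options, a short Scala sketch that sets them at runtime and caches a table; the view name and values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-tuning").master("local[*]").getOrCreate()

// Runtime equivalents of the `SET key=value` form; defaults are in the tables above.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L) // 128 MB per file partition
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)     // 4 MB estimated file-open cost

// Cache a table in the in-memory columnar format, then release it.
spark.range(1000000).toDF("id").createOrReplaceTempView("ids")
spark.catalog.cacheTable("ids")
spark.sql("SELECT count(*) FROM ids").show()
spark.catalog.uncacheTable("ids")
```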
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226247607 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + + - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis. + - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization i
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226246375 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + + - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis. + - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization i
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226245945 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. --- End diff -- `the builder come` -> `the builder comes`? cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
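The practical upshot of that migration note, sketched in Scala (the behavior Java/Scala have had since 2.3, which PySpark adopts in 3.0); the config keys are only examples:

```scala
import org.apache.spark.sql.SparkSession

// Set SparkContext-level configuration before the first session is created ...
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("configure-up-front")
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()

// ... because a later builder reuses the existing context, and per the note
// above its config() calls no longer update the shared SparkConf.
val later = SparkSession.builder()
  .config("spark.executor.memory", "4g") // will not reconfigure the running context
  .getOrCreate()
```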
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226241683 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. + +To start the JDBC/ODBC server, run the following in the Spark directory: + +./sbin/start-thriftserver.sh + +This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to +specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of +all available options. By default, the server listens on localhost:1. You may override this +behaviour via either environment variables, i.e.: + +{% highlight bash %} +export HIVE_SERVER2_THRIFT_PORT= +export HIVE_SERVER2_THRIFT_BIND_HOST= +./sbin/start-thriftserver.sh \ + --master \ + ... +{% endhighlight %} + +or system properties: + +{% highlight bash %} +./sbin/start-thriftserver.sh \ + --hiveconf hive.server2.thrift.port= \ + --hiveconf hive.server2.thrift.bind.host= \ + --master + ... +{% endhighlight %} + +Now you can use beeline to test the Thrift JDBC/ODBC server: + +./bin/beeline + +Connect to the JDBC/ODBC server in beeline with: + +beeline> !connect jdbc:hive2://localhost:1 + +Beeline will ask you for a username and password. In non-secure mode, simply enter the username on +your machine and a blank password. For secure mode, please follow the instructions given in the +[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients). + +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`. + +You may also use the beeline script that comes with Hive. + +Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. +Use the following setting to enable HTTP mode as system property or in `hive-site.xml` file in `conf/`: + +hive.server2.transport.mode - Set this to value: http +hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001 +hive.server2.http.endpoint - HTTP endpoint; default is cliservice + +To test, use beeline to connect to the JDBC/ODBC server in http mode with: + +beeline> !connect jdbc:hive2://:/?hive.server2.transport.mode=http;hive.server2.thrift.http.path= + + +## Running the Spark SQL CLI + +The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute +queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server. + +To start the Spark SQL CLI, run the following in the Spark directory: + +./bin/spark-sql + +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`. +You may run `./bin/spark-sql --help` for a complete list of all available +options. --- End diff -- super nit: this line can be concatenated with the previous line. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
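Besides beeline, any plain JDBC client can talk to the Thrift server. A hedged Scala sketch, assuming the server is already running on the default localhost:10000 and that a Hive JDBC driver (e.g. org.apache.hive:hive-jdbc) is on the classpath; neither assumption comes from the guide itself:

```scala
import java.sql.DriverManager

// Register the Hive JDBC driver (older driver versions need this explicitly).
Class.forName("org.apache.hive.jdbc.HiveDriver")

// In non-secure mode the username is your OS user and the password is blank,
// as the beeline instructions above describe.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", sys.props("user.name"), "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()
```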
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226239048 --- Diff: docs/sql-data-sources-parquet.md --- @@ -0,0 +1,321 @@ +--- +layout: global +title: Parquet Files +displayTitle: Parquet Files +--- + +* Table of contents +{:toc} + +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +compatibility reasons. + +### Loading Data Programmatically + +Using the data from the above example: + + + + +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example basic_parquet_example python/sql/datasource.py %} + + + + +{% include_example basic_parquet_example r/RSparkSQLExample.R %} + + + + + +{% highlight sql %} + +CREATE TEMPORARY VIEW parquetTable +USING org.apache.spark.sql.parquet +OPTIONS ( + path "examples/src/main/resources/people.parquet" +) + +SELECT * FROM parquetTable + +{% endhighlight %} + + + + + +### Partition Discovery + +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned +table, data are usually stored in different directories, with partitioning column values encoded in +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used +population data into a partitioned table using the following directory structure, with two extra +columns, `gender` and `country` as partitioning columns: + +{% highlight text %} +
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+ +{% endhighlight %} + +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL +will automatically extract the partitioning information from the paths. +Now the schema of the returned DataFrame becomes: + +{% highlight text %} + +root +|-- name: string (nullable = true) +|-- age: long (nullable = true) +|-- gender: string (nullable = true) +|-- country: string (nullable = true) + +{% endhighlight %} + +Notice that the data types of the partitioning columns are automatically inferred. Currently, +numeric data types, date, timestamp and string type are supported. Sometimes users may not want +to automatically infer the data types of the partitioning columns. For these use cases, the +automatic type inference can be configured by +`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type +inference is disabled, string type will be used for the partitioning columns. + +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths +by default. For the above example, if users pass `path/to/table/gender=male` to either +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a +partitioning column. 
If users need to specify the base path that partition discovery +should start with, they can set `basePath` in the data source options. For example, +when `path/to/table/gender=male` is the path of the data and +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. + +### Schema Merging + +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with +a simple schema, and gradually add more columns to the schema as needed. In this way, users may end +up with multiple Parquet files with different but mutually compatible schemas. The Parquet d
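A small Scala sketch of the `basePath` behavior described above (directory locations are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-discovery").master("local[*]").getOrCreate()
import spark.implicits._

// Write a layout mirroring the gender/country example above.
Seq(("alice", 30L, "female", "US"), ("bob", 25L, "male", "CN"))
  .toDF("name", "age", "gender", "country")
  .write.mode("overwrite").partitionBy("gender", "country").parquet("/tmp/path/to/table")

// Reading a sub-directory directly drops `gender` from the schema ...
spark.read.parquet("/tmp/path/to/table/gender=male").printSchema()

// ... unless basePath tells partition discovery where the table root is.
spark.read.option("basePath", "/tmp/path/to/table")
  .parquet("/tmp/path/to/table/gender=male")
  .printSchema()
```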
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226237047 --- Diff: docs/sql-data-sources-parquet.md --- @@ -0,0 +1,321 @@ +--- +layout: global +title: Parquet Files +displayTitle: Parquet Files +--- + +* Table of contents +{:toc} + +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +compatibility reasons. + +### Loading Data Programmatically + +Using the data from the above example: + + + + +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example basic_parquet_example python/sql/datasource.py %} + + + + +{% include_example basic_parquet_example r/RSparkSQLExample.R %} + + + + + +{% highlight sql %} + +CREATE TEMPORARY VIEW parquetTable +USING org.apache.spark.sql.parquet +OPTIONS ( + path "examples/src/main/resources/people.parquet" +) + +SELECT * FROM parquetTable + +{% endhighlight %} + + + + + +### Partition Discovery + +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned +table, data are usually stored in different directories, with partitioning column values encoded in +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used +population data into a partitioned table using the following directory structure, with two extra +columns, `gender` and `country` as partitioning columns: + +{% highlight text %} +
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+ +{% endhighlight %} + +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL +will automatically extract the partitioning information from the paths. +Now the schema of the returned DataFrame becomes: + +{% highlight text %} + +root +|-- name: string (nullable = true) +|-- age: long (nullable = true) +|-- gender: string (nullable = true) +|-- country: string (nullable = true) + +{% endhighlight %} + +Notice that the data types of the partitioning columns are automatically inferred. Currently, +numeric data types, date, timestamp and string type are supported. Sometimes users may not want +to automatically infer the data types of the partitioning columns. For these use cases, the +automatic type inference can be configured by +`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type +inference is disabled, string type will be used for the partitioning columns. + +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths +by default. For the above example, if users pass `path/to/table/gender=male` to either +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a +partitioning column. 
If users need to specify the base path that partition discovery +should start with, they can set `basePath` in the data source options. For example, +when `path/to/table/gender=male` is the path of the data and +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. + +### Schema Merging + +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with --- End diff -- `ProtocolBuffer` -> `Protocol Buffers` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.or
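And for the "Schema Merging" section that the comment touches, a short Scala sketch of the `mergeSchema` read option (paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-merging").master("local[*]").getOrCreate()
import spark.implicits._

// Two Parquet partitions with different but compatible schemas.
Seq((1, 1)).toDF("value", "square").write.mode("overwrite").parquet("/tmp/merge_demo/key=1")
Seq((2, 8)).toDF("value", "cube").write.mode("overwrite").parquet("/tmp/merge_demo/key=2")

// Schema merging is not on by default, so request it for this read.
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema() // value, square, cube, plus the partition column `key`
```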
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226235672 --- Diff: docs/sql-data-sources-load-save-functions.md --- @@ -0,0 +1,283 @@ +--- +layout: global +title: Generic Load/Save Functions +displayTitle: Generic Load/Save Functions +--- + +* Table of contents +{:toc} + + +In the simplest form, the default data source (`parquet` unless otherwise configured by +`spark.sql.sources.default`) will be used for all operations. + + + + +{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example generic_load_save_functions python/sql/datasource.py %} + + + + +{% include_example generic_load_save_functions r/RSparkSQLExample.R %} + + + + +### Manually Specifying Options + +You can also manually specify the data source that will be used along with any extra options +that you would like to pass to the data source. Data sources are specified by their fully qualified +name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short +names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data +source type can be converted into other types using this syntax. + +To load a JSON file you can use: + + + +{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_load_options python/sql/datasource.py %} + + + +{% include_example manual_load_options r/RSparkSQLExample.R %} + + + +To load a CSV file you can use: + + + +{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_load_options_csv python/sql/datasource.py %} + + + +{% include_example manual_load_options_csv r/RSparkSQLExample.R %} + + + + +### Run SQL on files directly + +Instead of using read API to load a file into DataFrame and query it, you can also query that +file directly with SQL. + + + +{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example direct_sql python/sql/datasource.py %} + + + +{% include_example direct_sql r/RSparkSQLExample.R %} + + + + +### Save Modes + +Save operations can optionally take a `SaveMode`, that specifies how to handle existing data if +present. It is important to realize that these save modes do not utilize any locking and are not +atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the +new data. + + +Scala/JavaAny LanguageMeaning + + SaveMode.ErrorIfExists (default) + "error" or "errorifexists" (default) + +When saving a DataFrame to a data source, if data already exists, +an exception is expected to be thrown. + + + + SaveMode.Append + "append" + +When saving a DataFrame to a data source, if data/table already exists, +contents of the DataFrame are expected to be appended to existing data. 
+ + + + SaveMode.Overwrite + "overwrite" + +Overwrite mode means that when saving a DataFrame to a data source, +if data/table already exists, existing data is expected to be overwritten by the contents of +the DataFrame. + + + + SaveMode.Ignore + "ignore" + +Ignore mode means that when saving a DataFrame to a data source, if data already exists, +the save operation is expected to not save the contents of the DataFrame and to not --- End diff -- nit: `expected to not ... to not ...` -> `expected not to ... not to ...`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
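A brief Scala sketch of the modes listed in the quoted table, showing that the enum and string spellings are interchangeable (the output path is illustrative):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-modes").master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/ids")         // replaces any existing data
df.write.mode("append").parquet("/tmp/ids")                   // adds to the existing data
df.write.mode("ignore").parquet("/tmp/ids")                   // no-op, since data now exists
// df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/ids")  // default mode: would throw here
```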
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226231876 --- Diff: docs/sql-data-sources-jdbc.md --- @@ -0,0 +1,223 @@ +--- +layout: global +title: JDBC To Other Databases +displayTitle: JDBC To Other Databases +--- + +* Table of contents +{:toc} + +Spark SQL also includes a data source that can read data from other databases using JDBC. This +functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD). +This is because the results are returned +as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. +The JDBC data source is also easier to use from Java or Python as it does not require the user to +provide a ClassTag. +(Note that this is different than the Spark SQL JDBC server, which allows other applications to +run queries using Spark SQL). + +To get started you will need to include the JDBC driver for your particular database on the +spark classpath. For example, to connect to postgres from the Spark Shell you would run the +following command: + +{% highlight bash %} +bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar +{% endhighlight %} + +Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary view using +the Data Sources API. Users can specify the JDBC connection properties in the data source options. +user and password are normally provided as connection properties for +logging into the data sources. In addition to the connection properties, Spark also supports +the following case-insensitive options: + + + Property NameMeaning + +url + + The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret + + + + +dbtable + + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. It is not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. + As an example, spark will issue a query of the following form to the JDBC Source. + SELECTFROM ( ) spark_gen_alias + Below are couple of restrictions while using this option. + + It is not allowed to specify `dbtable` and `query` options at the same time. + It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying --- End diff -- `spcify` -> `specify` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
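To illustrate the `dbtable`/`query` distinction (and the restriction that they cannot be combined), a hedged Scala sketch; the URL, credentials, and table names are placeholders, and the PostgreSQL driver must already be on the classpath as shown earlier in the quoted page:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").master("local[*]").getOrCreate()
val url = "jdbc:postgresql://localhost/test" // placeholder connection URL

// Option 1: point `dbtable` at a table (or a parenthesized subquery).
val byTable = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "public.employees")
  .option("user", "fred")
  .option("password", "secret")
  .load()

// Option 2 (Spark 2.4+): hand over a query; Spark wraps it as a subquery.
// Never combine `query` with `dbtable` (or with `partitionColumn`).
val byQuery = spark.read.format("jdbc")
  .option("url", url)
  .option("query", "SELECT id, name FROM employees WHERE active")
  .option("user", "fred")
  .option("password", "secret")
  .load()
```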
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226227872 --- Diff: docs/sql-data-sources.md --- @@ -0,0 +1,42 @@ +--- +layout: global +title: Data Sources +displayTitle: Data Sources +--- + + +Spark SQL supports operating on a variety of data sources through the DataFrame interface. +A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. +Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section +describes the general methods for loading and saving data using the Spark Data Sources and then +goes into specific options that are available for the built-in data sources. + + +* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html) + * [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options) --- End diff -- `sepcifying` -> `specifying`. In other places, too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226226005 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Got it, will change it soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226213393 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- We support a typical JSON file, don't we? > For a regular multi-line JSON file, set the `multiLine` option to `true`. IMO, that notice means we provides more flexibility. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
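The point about "typical" JSON files, in a short Scala sketch: by default each line must hold one complete JSON object, and `multiLine` opts into whole-document parsing (the second path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-read").master("local[*]").getOrCreate()

// Default: JSON Lines input, one self-contained JSON object per line.
val people = spark.read.json("examples/src/main/resources/people.json")
people.printSchema()

// A regular, pretty-printed JSON document spanning many lines needs multiLine.
val prettyDoc = spark.read.option("multiLine", "true").json("/tmp/people_pretty.json")
```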
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226208608 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. --- End diff -- Thanks, done in 27b066d. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226202227 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,81 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations (DataFrame operations) + url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations +- text: Running SQL Queries Programmatically + url: sql-getting-started.html#running-sql-queries-programmatically +- text: Global Temporary View + url: sql-getting-started.html#global-temporary-view +- text: Creating Datasets + url: sql-getting-started.html#creating-datasets +- text: Interoperating with RDDs + url: sql-getting-started.html#interoperating-with-rdds +- text: Aggregations + url: sql-getting-started.html#aggregations +- text: Data Sources + url: sql-data-sources.html + subitems: +- text: "Generic Load/Save Functions" + url: sql-data-sources-load-save-functions.html +- text: Parquet Files + url: sql-data-sources-parquet.html +- text: ORC Files + url: sql-data-sources-other.html#orc-files +- text: JSON Datasets + url: sql-data-sources-other.html#json-datasets +- text: Hive Tables + url: sql-data-sources-hive-tables.html +- text: JDBC To Other Databases + url: sql-data-sources-jdbc.html +- text: Avro Files + url: sql-data-sources-avro.html +- text: Troubleshooting + url: sql-data-sources-other.html#troubleshooting --- End diff -- Make sense, will split into `sql-data-sources-orc`, `sql-data-sources-json` and `sql-data-sources-troubleshooting`(still need sql-data-sources prefix cause [here](https://github.com/apache/spark/pull/22746/files#diff-5075091c2498292f7afcac68bfd63e1eR13) we need "sql-data-sources" as the nav-left tag, otherwise the nav menu will not show the subitems). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226202492 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Maybe keep `Datasets`? As the below description `Note that the file that is offered as a json file is not a typical JSON file`. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226191366 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. --- End diff -- nit. `1.2.1 You` -> `1.2.1. You` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226190219 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- For consistency with the other data sources, `Datasets` -> `Files`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226189929 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,81 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations (DataFrame operations) + url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations +- text: Running SQL Queries Programmatically + url: sql-getting-started.html#running-sql-queries-programmatically +- text: Global Temporary View + url: sql-getting-started.html#global-temporary-view +- text: Creating Datasets + url: sql-getting-started.html#creating-datasets +- text: Interoperating with RDDs + url: sql-getting-started.html#interoperating-with-rdds +- text: Aggregations + url: sql-getting-started.html#aggregations +- text: Data Sources + url: sql-data-sources.html + subitems: +- text: "Generic Load/Save Functions" + url: sql-data-sources-load-save-functions.html +- text: Parquet Files + url: sql-data-sources-parquet.html +- text: ORC Files + url: sql-data-sources-other.html#orc-files +- text: JSON Datasets + url: sql-data-sources-other.html#json-datasets +- text: Hive Tables + url: sql-data-sources-hive-tables.html +- text: JDBC To Other Databases + url: sql-data-sources-jdbc.html +- text: Avro Files + url: sql-data-sources-avro.html +- text: Troubleshooting + url: sql-data-sources-other.html#troubleshooting --- End diff -- Hi, @xuanyuanking . Generally, it looks good. Can we split `sql-data-sources-other` into three files? For me, `troubleshooting` looks weird in terms of level of information. Actually, `sql-data-sources-other` has only two files and `troubleshooting` for JDBC. Maybe, `sql-data-sources-orc`, `sql-data-sources-json` and `toubleshooting`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226011439 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types + - `ByteType`: Represents 1-byte signed integer numbers. + The range of numbers is from `-128` to `127`. + - `ShortType`: Represents 2-byte signed integer numbers. + The range of numbers is from `-32768` to `32767`. + - `IntegerType`: Represents 4-byte signed integer numbers. + The range of numbers is from `-2147483648` to `2147483647`. + - `LongType`: Represents 8-byte signed integer numbers. + The range of numbers is from `-9223372036854775808` to `9223372036854775807`. + - `FloatType`: Represents 4-byte single-precision floating point numbers. + - `DoubleType`: Represents 8-byte double-precision floating point numbers. + - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. +* String type + - `StringType`: Represents character string values. +* Binary type + - `BinaryType`: Represents byte sequence values. +* Boolean type + - `BooleanType`: Represents boolean values. +* Datetime type + - `TimestampType`: Represents values comprising values of fields year, month, day, + hour, minute, and second. + - `DateType`: Represents values comprising values of fields year, month, day. +* Complex types + - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of + elements with the type of `elementType`. `containsNull` is used to indicate if + elements in a `ArrayType` value can have `null` values. + - `MapType(keyType, valueType, valueContainsNull)`: + Represents values comprising a set of key-value pairs. The data type of keys are + described by `keyType` and the data type of values are described by `valueType`. + For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` + is used to indicate if values of a `MapType` value can have `null` values. + - `StructType(fields)`: Represents values with the structure described by + a sequence of `StructField`s (`fields`). +* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. +The name of a field is indicated by `name`. The data type of a field is indicated +by `dataType`. `nullable` is used to indicate if values of this fields can have +`null` values. + + + + +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. 
+You can access them by doing + +{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + Data type + Value type in Scala + API to access or create a data type + + ByteType + Byte + + ByteType + + + + ShortType + Short + + ShortType + + + + IntegerType + Int + + IntegerType + + + + LongType + Long + + LongType + + + + FloatType + Float + + FloatType + + + + DoubleType + Double + + DoubleType + + + + DecimalType + java.math.BigDecimal + + DecimalType + + + + StringType + String + + StringType + + + + BinaryType + Array[Byte] + + BinaryType + + + + BooleanType + Boolean + + BooleanType + + + + TimestampType + java.sql.Timestamp + + TimestampType + + + + DateType + java.sql.Date + + DateType + + + + ArrayType + scala.collection.Seq + + ArrayType(elementType, [containsNull]) + Note: The default value of containsNull is true. + + + + MapType + scala.collection.Map + + MapType(keyType, valueType, [valueContainsNull]) + Note: The default value of valueContainsNull is true. + + + + StructType + org.apache.spark.sql.Row + + StructType(fields) + Note: fields is a Seq of StructFields. Also, two fields with the same + name are not allowed. + + + + StructField + The value type in Scala of the data type of this field + (For example, Int for a
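A compact Scala sketch of the type-to-Scala mapping in the quoted table, building a schema with the `org.apache.spark.sql.types` API (column names and values are made up):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("data-types").master("local[*]").getOrCreate()

// StructType/StructField built from the types listed above.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true),
  StructField("scores", ArrayType(DoubleType, containsNull = true), nullable = true),
  StructField("attrs", MapType(StringType, StringType, valueContainsNull = true), nullable = true)
))

// Rows carry the matching Scala value types: String, Long, Seq, Map.
val rows = Seq(Row("alice", 30L, Seq(1.5, 2.5), Map("team" -> "blue")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
```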
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226011057 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,79 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations --- End diff -- Makes sense; kept it the same as `sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations`, done in b3fc39d. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225921716 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types + - `ByteType`: Represents 1-byte signed integer numbers. + The range of numbers is from `-128` to `127`. + - `ShortType`: Represents 2-byte signed integer numbers. + The range of numbers is from `-32768` to `32767`. + - `IntegerType`: Represents 4-byte signed integer numbers. + The range of numbers is from `-2147483648` to `2147483647`. + - `LongType`: Represents 8-byte signed integer numbers. + The range of numbers is from `-9223372036854775808` to `9223372036854775807`. + - `FloatType`: Represents 4-byte single-precision floating point numbers. + - `DoubleType`: Represents 8-byte double-precision floating point numbers. + - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. +* String type + - `StringType`: Represents character string values. +* Binary type + - `BinaryType`: Represents byte sequence values. +* Boolean type + - `BooleanType`: Represents boolean values. +* Datetime type + - `TimestampType`: Represents values comprising values of fields year, month, day, + hour, minute, and second. + - `DateType`: Represents values comprising values of fields year, month, day. +* Complex types + - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of + elements with the type of `elementType`. `containsNull` is used to indicate if + elements in a `ArrayType` value can have `null` values. + - `MapType(keyType, valueType, valueContainsNull)`: + Represents values comprising a set of key-value pairs. The data type of keys are + described by `keyType` and the data type of values are described by `valueType`. + For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` + is used to indicate if values of a `MapType` value can have `null` values. + - `StructType(fields)`: Represents values with the structure described by + a sequence of `StructField`s (`fields`). +* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. +The name of a field is indicated by `name`. The data type of a field is indicated +by `dataType`. `nullable` is used to indicate if values of this fields can have +`null` values. + + + + +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. 
+You can access them by doing
+
+{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
+
+| Data type | Value type in Scala | API to access or create a data type |
+| --- | --- | --- |
+| ByteType | Byte | ByteType |
+| ShortType | Short | ShortType |
+| IntegerType | Int | IntegerType |
+| LongType | Long | LongType |
+| FloatType | Float | FloatType |
+| DoubleType | Double | DoubleType |
+| DecimalType | java.math.BigDecimal | DecimalType |
+| StringType | String | StringType |
+| BinaryType | Array[Byte] | BinaryType |
+| BooleanType | Boolean | BooleanType |
+| TimestampType | java.sql.Timestamp | TimestampType |
+| DateType | java.sql.Date | DateType |
+| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true. |
+| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true. |
+| StructType | org.apache.spark.sql.Row | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
+| StructField | The value type in Scala of the data type of this field (For example, Int for
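Since the same diff hunk is quoted again above, here is a second, complementary sketch (also not from the PR; the field names are invented) focusing on the complex types and the nullability flags the quoted text describes: `containsNull` for array elements, `valueContainsNull` for map values, and per-field `nullable` in a `StructType`:

```scala
import org.apache.spark.sql.types._

// Sketch of the complex-type constructors and their nullability flags.
// Field names are made up for this example; no SparkSession is needed
// just to construct and inspect a schema.
object ComplexTypesSketch extends App {
  // ArrayType: containsNull says whether individual elements may be null.
  val tags = ArrayType(StringType, containsNull = false)

  // MapType: keys may never be null; valueContainsNull governs the values only.
  val attributes = MapType(StringType, IntegerType, valueContainsNull = true)

  // StructType is built from StructFields; nullable applies per field.
  val address = StructType(Seq(
    StructField("street", StringType, nullable = true),
    StructField("zip", StringType, nullable = false)
  ))

  val schema = StructType(Seq(
    StructField("tags", tags, nullable = true),
    StructField("attributes", attributes, nullable = true),
    StructField("address", address, nullable = true)
  ))

  // treeString renders the schema in the same tree format printSchema() uses.
  println(schema.treeString)
}
```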
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225797461 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,79 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations --- End diff -- how about `Untyped Dataset Operations (DataFrame operations)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225794532 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types +- `ByteType`: Represents 1-byte signed integer numbers. --- End diff -- Thanks, done in 58115e5. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225794477 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- Done in 58115e5; also fixed the links in ml-pipeline.md, sparkr.md, and structured-streaming-programming-guide.md. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225789933 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- Sorry for missing that; I will check all the inner links by `
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225783658 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types +- `ByteType`: Represents 1-byte signed integer numbers. --- End diff -- nit: use 2 space indent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225780740 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- The link `[Spark data sources](#data-sources)` does not work after this change. Could you fix all the similar cases? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org