[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22746 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226288610 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Done in 17995f9. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
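For context, a minimal Scala sketch of the two ORC options described in the quoted section, assuming Spark 2.3+ with a local session; the output path below is illustrative, not taken from the PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-native-reader")
  .master("local[*]")
  // Select the native ORC implementation (built on Apache ORC 1.4) ...
  .config("spark.sql.orc.impl", "native")
  // ... and enable vectorized decoding for it, as the table above describes.
  .config("spark.sql.orc.enableVectorizedReader", "true")
  .getOrCreate()

// Write and read back some ORC data; with the settings above the read side
// goes through the vectorized reader.
spark.range(0, 1000).toDF("id").write.mode("overwrite").orc("/tmp/orc_demo")
spark.read.orc("/tmp/orc_demo").show(5)
```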
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226263066 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast --- End diff -- `explict` -> `explicit` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
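To make the "explicit cast" remark concrete, a small Scala sketch of the workaround under Spark 2.4's stricter promotion rules (query text only, error output not reproduced):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array-contains-cast").master("local[*]").getOrCreate()

// In Spark 2.4 this raises AnalysisException, because IntegerType cannot be
// promoted to StringType in a loss-less manner:
// spark.sql("SELECT array_contains(array(1), '1')")

// Casting one side explicitly resolves the ambiguity either way:
spark.sql("SELECT array_contains(array(1), CAST('1' AS INT))").show()           // true
spark.sql("SELECT array_contains(CAST(array(1) AS ARRAY<STRING>), '1')").show() // true
```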
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226262995 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast --- End diff -- `explict` -> `explicit` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226250306 --- Diff: docs/sql-performance-turing.md --- @@ -0,0 +1,151 @@ +--- +layout: global +title: Performance Tuning +displayTitle: Performance Tuning +--- + +* Table of contents +{:toc} + +For some workloads, it is possible to improve performance by either caching data in memory, or by +turning on some experimental options. + +## Caching Data In Memory + +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Then Spark SQL will scan only required columns and will automatically tune compression to minimize +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory. + +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running +`SET key=value` commands using SQL. + + +Property NameDefaultMeaning + + spark.sql.inMemoryColumnarStorage.compressed + true + +When set to true Spark SQL will automatically select a compression codec for each column based +on statistics of the data. + + + + spark.sql.inMemoryColumnarStorage.batchSize + 1 + +Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization +and compression, but risk OOMs when caching data. + + + + + +## Other Configuration Options + +The following options can also be used to tune the performance of query execution. It is possible +that these options will be deprecated in future release as more optimizations are performed automatically. + + + Property NameDefaultMeaning + +spark.sql.files.maxPartitionBytes +134217728 (128 MB) + + The maximum number of bytes to pack into a single partition when reading files. + + + +spark.sql.files.openCostInBytes +4194304 (4 MB) + + The estimated cost to open a file, measured by the number of bytes could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over estimated, --- End diff -- nit: `It is better to over estimated` -> ` It is better to over-estimate`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
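As a concrete illustration of the quoted tuning options, a short Scala sketch that sets them at runtime and caches a table; the view name and values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-tuning").master("local[*]").getOrCreate()

// Runtime equivalents of the `SET key=value` form; defaults are in the tables above.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L) // 128 MB per file partition
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)     // 4 MB estimated file-open cost

// Cache a table in the in-memory columnar format, then release it.
spark.range(1000000).toDF("id").createOrReplaceTempView("ids")
spark.catalog.cacheTable("ids")
spark.sql("SELECT count(*) FROM ids").show()
spark.catalog.uncacheTable("ids")
```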
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226247607 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + + - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis. + - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization i
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226246375 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + +## Upgrading From Spark SQL 2.3 to 2.4 + + - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. + + + +Query + + +Result Spark 2.3 or Prior + + +Result Spark 2.4 + + +Remarks + + + + +SELECT array_contains(array(1), 1.34D); + + +true + + +false + + +In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. + + + + +SELECT array_contains(array(1), '1'); + + +true + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + +SELECT array_contains(array(1), 'anystring'); + + +null + + +AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. + + +Users can use explict cast + + + + + - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite. + - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive. + - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis. + - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970. + - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. + - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization i
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226245945 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -0,0 +1,520 @@ +--- +layout: global +title: Spark SQL Upgrading Guide +displayTitle: Spark SQL Upgrading Guide +--- + +* Table of contents +{:toc} + +## Upgrading From Spark SQL 2.4 to 3.0 + + - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. --- End diff -- `the builder come` -> `the builder comes`? cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
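The practical upshot of that migration note, sketched in Scala (the behavior Java/Scala have had since 2.3, which PySpark adopts in 3.0); the config keys are only examples:

```scala
import org.apache.spark.sql.SparkSession

// Set SparkContext-level configuration before the first session is created ...
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("configure-up-front")
  .config("spark.sql.shuffle.partitions", "64")
  .getOrCreate()

// ... because a later builder reuses the existing context, and per the note
// above its config() calls no longer update the shared SparkConf.
val later = SparkSession.builder()
  .config("spark.executor.memory", "4g") // will not reconfigure the running context
  .getOrCreate()
```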
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226241683 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. + +To start the JDBC/ODBC server, run the following in the Spark directory: + +./sbin/start-thriftserver.sh + +This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to +specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of +all available options. By default, the server listens on localhost:1. You may override this +behaviour via either environment variables, i.e.: + +{% highlight bash %} +export HIVE_SERVER2_THRIFT_PORT= +export HIVE_SERVER2_THRIFT_BIND_HOST= +./sbin/start-thriftserver.sh \ + --master \ + ... +{% endhighlight %} + +or system properties: + +{% highlight bash %} +./sbin/start-thriftserver.sh \ + --hiveconf hive.server2.thrift.port= \ + --hiveconf hive.server2.thrift.bind.host= \ + --master + ... +{% endhighlight %} + +Now you can use beeline to test the Thrift JDBC/ODBC server: + +./bin/beeline + +Connect to the JDBC/ODBC server in beeline with: + +beeline> !connect jdbc:hive2://localhost:1 + +Beeline will ask you for a username and password. In non-secure mode, simply enter the username on +your machine and a blank password. For secure mode, please follow the instructions given in the +[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients). + +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`. + +You may also use the beeline script that comes with Hive. + +Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. +Use the following setting to enable HTTP mode as system property or in `hive-site.xml` file in `conf/`: + +hive.server2.transport.mode - Set this to value: http +hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001 +hive.server2.http.endpoint - HTTP endpoint; default is cliservice + +To test, use beeline to connect to the JDBC/ODBC server in http mode with: + +beeline> !connect jdbc:hive2://:/?hive.server2.transport.mode=http;hive.server2.thrift.http.path= + + +## Running the Spark SQL CLI + +The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute +queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server. + +To start the Spark SQL CLI, run the following in the Spark directory: + +./bin/spark-sql + +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`. +You may run `./bin/spark-sql --help` for a complete list of all available +options. --- End diff -- super nit: this line can be concatenated with the previous line. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
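Besides beeline, any plain JDBC client can talk to the Thrift server. A hedged Scala sketch, assuming the server is already running on the default localhost:10000 and that a Hive JDBC driver (e.g. org.apache.hive:hive-jdbc) is on the classpath; neither assumption comes from the guide itself:

```scala
import java.sql.DriverManager

// Register the Hive JDBC driver (older driver versions need this explicitly).
Class.forName("org.apache.hive.jdbc.HiveDriver")

// In non-secure mode the username is your OS user and the password is blank,
// as the beeline instructions above describe.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", sys.props("user.name"), "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()
```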
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226239048 --- Diff: docs/sql-data-sources-parquet.md --- @@ -0,0 +1,321 @@ +--- +layout: global +title: Parquet Files +displayTitle: Parquet Files +--- + +* Table of contents +{:toc} + +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +compatibility reasons. + +### Loading Data Programmatically + +Using the data from the above example: + + + + +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example basic_parquet_example python/sql/datasource.py %} + + + + +{% include_example basic_parquet_example r/RSparkSQLExample.R %} + + + + + +{% highlight sql %} + +CREATE TEMPORARY VIEW parquetTable +USING org.apache.spark.sql.parquet +OPTIONS ( + path "examples/src/main/resources/people.parquet" +) + +SELECT * FROM parquetTable + +{% endhighlight %} + + + + + +### Partition Discovery + +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned +table, data are usually stored in different directories, with partitioning column values encoded in +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used +population data into a partitioned table using the following directory structure, with two extra +columns, `gender` and `country` as partitioning columns: + +{% highlight text %} +
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+ +{% endhighlight %} + +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL +will automatically extract the partitioning information from the paths. +Now the schema of the returned DataFrame becomes: + +{% highlight text %} + +root +|-- name: string (nullable = true) +|-- age: long (nullable = true) +|-- gender: string (nullable = true) +|-- country: string (nullable = true) + +{% endhighlight %} + +Notice that the data types of the partitioning columns are automatically inferred. Currently, +numeric data types, date, timestamp and string type are supported. Sometimes users may not want +to automatically infer the data types of the partitioning columns. For these use cases, the +automatic type inference can be configured by +`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type +inference is disabled, string type will be used for the partitioning columns. + +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths +by default. For the above example, if users pass `path/to/table/gender=male` to either +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a +partitioning column. 
If users need to specify the base path that partition discovery +should start with, they can set `basePath` in the data source options. For example, +when `path/to/table/gender=male` is the path of the data and +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. + +### Schema Merging + +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with +a simple schema, and gradually add more columns to the schema as needed. In this way, users may end +up with multiple Parquet files with different but mutually compatible schemas. The Parquet d
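A small Scala sketch of the `basePath` behavior described above (directory locations are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-discovery").master("local[*]").getOrCreate()
import spark.implicits._

// Write a layout mirroring the gender/country example above.
Seq(("alice", 30L, "female", "US"), ("bob", 25L, "male", "CN"))
  .toDF("name", "age", "gender", "country")
  .write.mode("overwrite").partitionBy("gender", "country").parquet("/tmp/path/to/table")

// Reading a sub-directory directly drops `gender` from the schema ...
spark.read.parquet("/tmp/path/to/table/gender=male").printSchema()

// ... unless basePath tells partition discovery where the table root is.
spark.read.option("basePath", "/tmp/path/to/table")
  .parquet("/tmp/path/to/table/gender=male")
  .printSchema()
```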
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226237047 --- Diff: docs/sql-data-sources-parquet.md --- @@ -0,0 +1,321 @@ +--- +layout: global +title: Parquet Files +displayTitle: Parquet Files +--- + +* Table of contents +{:toc} + +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +compatibility reasons. + +### Loading Data Programmatically + +Using the data from the above example: + + + + +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example basic_parquet_example python/sql/datasource.py %} + + + + +{% include_example basic_parquet_example r/RSparkSQLExample.R %} + + + + + +{% highlight sql %} + +CREATE TEMPORARY VIEW parquetTable +USING org.apache.spark.sql.parquet +OPTIONS ( + path "examples/src/main/resources/people.parquet" +) + +SELECT * FROM parquetTable + +{% endhighlight %} + + + + + +### Partition Discovery + +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned +table, data are usually stored in different directories, with partitioning column values encoded in +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used +population data into a partitioned table using the following directory structure, with two extra +columns, `gender` and `country` as partitioning columns: + +{% highlight text %} +
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+ +{% endhighlight %} + +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL +will automatically extract the partitioning information from the paths. +Now the schema of the returned DataFrame becomes: + +{% highlight text %} + +root +|-- name: string (nullable = true) +|-- age: long (nullable = true) +|-- gender: string (nullable = true) +|-- country: string (nullable = true) + +{% endhighlight %} + +Notice that the data types of the partitioning columns are automatically inferred. Currently, +numeric data types, date, timestamp and string type are supported. Sometimes users may not want +to automatically infer the data types of the partitioning columns. For these use cases, the +automatic type inference can be configured by +`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type +inference is disabled, string type will be used for the partitioning columns. + +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths +by default. For the above example, if users pass `path/to/table/gender=male` to either +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a +partitioning column. 
If users need to specify the base path that partition discovery +should start with, they can set `basePath` in the data source options. For example, +when `path/to/table/gender=male` is the path of the data and +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. + +### Schema Merging + +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with --- End diff -- `ProtocolBuffer` -> `Protocol Buffers` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.or
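And for the "Schema Merging" section that the comment touches, a short Scala sketch of the `mergeSchema` read option (paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-merging").master("local[*]").getOrCreate()
import spark.implicits._

// Two Parquet partitions with different but compatible schemas.
Seq((1, 1)).toDF("value", "square").write.mode("overwrite").parquet("/tmp/merge_demo/key=1")
Seq((2, 8)).toDF("value", "cube").write.mode("overwrite").parquet("/tmp/merge_demo/key=2")

// Schema merging is not on by default, so request it for this read.
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema() // value, square, cube, plus the partition column `key`
```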
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226235672 --- Diff: docs/sql-data-sources-load-save-functions.md --- @@ -0,0 +1,283 @@ +--- +layout: global +title: Generic Load/Save Functions +displayTitle: Generic Load/Save Functions +--- + +* Table of contents +{:toc} + + +In the simplest form, the default data source (`parquet` unless otherwise configured by +`spark.sql.sources.default`) will be used for all operations. + + + + +{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +{% include_example generic_load_save_functions python/sql/datasource.py %} + + + + +{% include_example generic_load_save_functions r/RSparkSQLExample.R %} + + + + +### Manually Specifying Options + +You can also manually specify the data source that will be used along with any extra options +that you would like to pass to the data source. Data sources are specified by their fully qualified +name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short +names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data +source type can be converted into other types using this syntax. + +To load a JSON file you can use: + + + +{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_load_options python/sql/datasource.py %} + + + +{% include_example manual_load_options r/RSparkSQLExample.R %} + + + +To load a CSV file you can use: + + + +{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example manual_load_options_csv python/sql/datasource.py %} + + + +{% include_example manual_load_options_csv r/RSparkSQLExample.R %} + + + + +### Run SQL on files directly + +Instead of using read API to load a file into DataFrame and query it, you can also query that +file directly with SQL. + + + +{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + +{% include_example direct_sql python/sql/datasource.py %} + + + +{% include_example direct_sql r/RSparkSQLExample.R %} + + + + +### Save Modes + +Save operations can optionally take a `SaveMode`, that specifies how to handle existing data if +present. It is important to realize that these save modes do not utilize any locking and are not +atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the +new data. + + +Scala/JavaAny LanguageMeaning + + SaveMode.ErrorIfExists (default) + "error" or "errorifexists" (default) + +When saving a DataFrame to a data source, if data already exists, +an exception is expected to be thrown. + + + + SaveMode.Append + "append" + +When saving a DataFrame to a data source, if data/table already exists, +contents of the DataFrame are expected to be appended to existing data. 
+ + + + SaveMode.Overwrite + "overwrite" + +Overwrite mode means that when saving a DataFrame to a data source, +if data/table already exists, existing data is expected to be overwritten by the contents of +the DataFrame. + + + + SaveMode.Ignore + "ignore" + +Ignore mode means that when saving a DataFrame to a data source, if data already exists, +the save operation is expected to not save the contents of the DataFrame and to not --- End diff -- nit: `expected to not ... to not ...` -> `expected not to ... not to ...`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
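A brief Scala sketch of the modes listed in the quoted table, showing that the enum and string spellings are interchangeable (the output path is illustrative):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("save-modes").master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/ids")         // replaces any existing data
df.write.mode("append").parquet("/tmp/ids")                   // adds to the existing data
df.write.mode("ignore").parquet("/tmp/ids")                   // no-op, since data now exists
// df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/ids")  // default mode: would throw here
```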
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226231876 --- Diff: docs/sql-data-sources-jdbc.md --- @@ -0,0 +1,223 @@ +--- +layout: global +title: JDBC To Other Databases +displayTitle: JDBC To Other Databases +--- + +* Table of contents +{:toc} + +Spark SQL also includes a data source that can read data from other databases using JDBC. This +functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD). +This is because the results are returned +as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. +The JDBC data source is also easier to use from Java or Python as it does not require the user to +provide a ClassTag. +(Note that this is different than the Spark SQL JDBC server, which allows other applications to +run queries using Spark SQL). + +To get started you will need to include the JDBC driver for your particular database on the +spark classpath. For example, to connect to postgres from the Spark Shell you would run the +following command: + +{% highlight bash %} +bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar +{% endhighlight %} + +Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary view using +the Data Sources API. Users can specify the JDBC connection properties in the data source options. +user and password are normally provided as connection properties for +logging into the data sources. In addition to the connection properties, Spark also supports +the following case-insensitive options: + + + Property NameMeaning + +url + + The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret + + + + +dbtable + + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. It is not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. + As an example, spark will issue a query of the following form to the JDBC Source. + SELECTFROM ( ) spark_gen_alias + Below are couple of restrictions while using this option. + + It is not allowed to specify `dbtable` and `query` options at the same time. + It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying --- End diff -- `spcify` -> `specify` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
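To illustrate the `dbtable`/`query` distinction (and the restriction that they cannot be combined), a hedged Scala sketch; the URL, credentials, and table names are placeholders, and the PostgreSQL driver must already be on the classpath as shown earlier in the quoted page:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").master("local[*]").getOrCreate()
val url = "jdbc:postgresql://localhost/test" // placeholder connection URL

// Option 1: point `dbtable` at a table (or a parenthesized subquery).
val byTable = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "public.employees")
  .option("user", "fred")
  .option("password", "secret")
  .load()

// Option 2 (Spark 2.4+): hand over a query; Spark wraps it as a subquery.
// Never combine `query` with `dbtable` (or with `partitionColumn`).
val byQuery = spark.read.format("jdbc")
  .option("url", url)
  .option("query", "SELECT id, name FROM employees WHERE active")
  .option("user", "fred")
  .option("password", "secret")
  .load()
```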
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226227872 --- Diff: docs/sql-data-sources.md --- @@ -0,0 +1,42 @@ +--- +layout: global +title: Data Sources +displayTitle: Data Sources +--- + + +Spark SQL supports operating on a variety of data sources through the DataFrame interface. +A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. +Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section +describes the general methods for loading and saving data using the Spark Data Sources and then +goes into specific options that are available for the built-in data sources. + + +* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html) + * [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options) --- End diff -- `sepcifying` -> `specifying`. In other places, too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226226005 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Got it, will change it soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226213393 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- We support a typical JSON file, don't we? > For a regular multi-line JSON file, set the `multiLine` option to `true`. IMO, that notice means we provides more flexibility. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
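The point about "typical" JSON files, in a short Scala sketch: by default each line must hold one complete JSON object, and `multiLine` opts into whole-document parsing (the second path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-read").master("local[*]").getOrCreate()

// Default: JSON Lines input, one self-contained JSON object per line.
val people = spark.read.json("examples/src/main/resources/people.json")
people.printSchema()

// A regular, pretty-printed JSON document spanning many lines needs multiLine.
val prettyDoc = spark.read.option("multiLine", "true").json("/tmp/people_pretty.json")
```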
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226208608 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. --- End diff -- Thanks, done in 27b066d. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226202227 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,81 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations (DataFrame operations) + url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations +- text: Running SQL Queries Programmatically + url: sql-getting-started.html#running-sql-queries-programmatically +- text: Global Temporary View + url: sql-getting-started.html#global-temporary-view +- text: Creating Datasets + url: sql-getting-started.html#creating-datasets +- text: Interoperating with RDDs + url: sql-getting-started.html#interoperating-with-rdds +- text: Aggregations + url: sql-getting-started.html#aggregations +- text: Data Sources + url: sql-data-sources.html + subitems: +- text: "Generic Load/Save Functions" + url: sql-data-sources-load-save-functions.html +- text: Parquet Files + url: sql-data-sources-parquet.html +- text: ORC Files + url: sql-data-sources-other.html#orc-files +- text: JSON Datasets + url: sql-data-sources-other.html#json-datasets +- text: Hive Tables + url: sql-data-sources-hive-tables.html +- text: JDBC To Other Databases + url: sql-data-sources-jdbc.html +- text: Avro Files + url: sql-data-sources-avro.html +- text: Troubleshooting + url: sql-data-sources-other.html#troubleshooting --- End diff -- Make sense, will split into `sql-data-sources-orc`, `sql-data-sources-json` and `sql-data-sources-troubleshooting`(still need sql-data-sources prefix cause [here](https://github.com/apache/spark/pull/22746/files#diff-5075091c2498292f7afcac68bfd63e1eR13) we need "sql-data-sources" as the nav-left tag, otherwise the nav menu will not show the subitems). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226202492 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- Maybe keep `Datasets`? As the below description `Note that the file that is offered as a json file is not a typical JSON file`. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226191366 --- Diff: docs/sql-distributed-sql-engine.md --- @@ -0,0 +1,85 @@ +--- +layout: global +title: Distributed SQL Engine +displayTitle: Distributed SQL Engine +--- + +* Table of contents +{:toc} + +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, +without the need to write any code. + +## Running the Thrift JDBC/ODBC server + +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1. --- End diff -- nit. `1.2.1 You` -> `1.2.1. You` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226190219 --- Diff: docs/sql-data-sources-other.md --- @@ -0,0 +1,114 @@ +--- +layout: global +title: Other Data Sources +displayTitle: Other Data Sources +--- + +* Table of contents +{:toc} + +## ORC Files + +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. +To do that, the following configurations are newly added. The vectorized reader is used for the +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`. + + + Property NameDefaultMeaning + +spark.sql.orc.impl +native +The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. + + +spark.sql.orc.enableVectorizedReader +true +Enables vectorized orc decoding in native implementation. If false, a new non-vectorized ORC reader is used in native implementation. For hive implementation, this is ignored. + + + +## JSON Datasets --- End diff -- For consistency with the other data sources, `Datasets` -> `Files`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226189929 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,81 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations (DataFrame operations) + url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations +- text: Running SQL Queries Programmatically + url: sql-getting-started.html#running-sql-queries-programmatically +- text: Global Temporary View + url: sql-getting-started.html#global-temporary-view +- text: Creating Datasets + url: sql-getting-started.html#creating-datasets +- text: Interoperating with RDDs + url: sql-getting-started.html#interoperating-with-rdds +- text: Aggregations + url: sql-getting-started.html#aggregations +- text: Data Sources + url: sql-data-sources.html + subitems: +- text: "Generic Load/Save Functions" + url: sql-data-sources-load-save-functions.html +- text: Parquet Files + url: sql-data-sources-parquet.html +- text: ORC Files + url: sql-data-sources-other.html#orc-files +- text: JSON Datasets + url: sql-data-sources-other.html#json-datasets +- text: Hive Tables + url: sql-data-sources-hive-tables.html +- text: JDBC To Other Databases + url: sql-data-sources-jdbc.html +- text: Avro Files + url: sql-data-sources-avro.html +- text: Troubleshooting + url: sql-data-sources-other.html#troubleshooting --- End diff -- Hi, @xuanyuanking . Generally, it looks good. Can we split `sql-data-sources-other` into three files? For me, `troubleshooting` looks weird in terms of level of information. Actually, `sql-data-sources-other` has only two files and `troubleshooting` for JDBC. Maybe, `sql-data-sources-orc`, `sql-data-sources-json` and `toubleshooting`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226011439 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types + - `ByteType`: Represents 1-byte signed integer numbers. + The range of numbers is from `-128` to `127`. + - `ShortType`: Represents 2-byte signed integer numbers. + The range of numbers is from `-32768` to `32767`. + - `IntegerType`: Represents 4-byte signed integer numbers. + The range of numbers is from `-2147483648` to `2147483647`. + - `LongType`: Represents 8-byte signed integer numbers. + The range of numbers is from `-9223372036854775808` to `9223372036854775807`. + - `FloatType`: Represents 4-byte single-precision floating point numbers. + - `DoubleType`: Represents 8-byte double-precision floating point numbers. + - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. +* String type + - `StringType`: Represents character string values. +* Binary type + - `BinaryType`: Represents byte sequence values. +* Boolean type + - `BooleanType`: Represents boolean values. +* Datetime type + - `TimestampType`: Represents values comprising values of fields year, month, day, + hour, minute, and second. + - `DateType`: Represents values comprising values of fields year, month, day. +* Complex types + - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of + elements with the type of `elementType`. `containsNull` is used to indicate if + elements in a `ArrayType` value can have `null` values. + - `MapType(keyType, valueType, valueContainsNull)`: + Represents values comprising a set of key-value pairs. The data type of keys are + described by `keyType` and the data type of values are described by `valueType`. + For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` + is used to indicate if values of a `MapType` value can have `null` values. + - `StructType(fields)`: Represents values with the structure described by + a sequence of `StructField`s (`fields`). +* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. +The name of a field is indicated by `name`. The data type of a field is indicated +by `dataType`. `nullable` is used to indicate if values of this fields can have +`null` values. + + + + +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. 
+You can access them by doing + +{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + Data type + Value type in Scala + API to access or create a data type + + ByteType + Byte + + ByteType + + + + ShortType + Short + + ShortType + + + + IntegerType + Int + + IntegerType + + + + LongType + Long + + LongType + + + + FloatType + Float + + FloatType + + + + DoubleType + Double + + DoubleType + + + + DecimalType + java.math.BigDecimal + + DecimalType + + + + StringType + String + + StringType + + + + BinaryType + Array[Byte] + + BinaryType + + + + BooleanType + Boolean + + BooleanType + + + + TimestampType + java.sql.Timestamp + + TimestampType + + + + DateType + java.sql.Date + + DateType + + + + ArrayType + scala.collection.Seq + + ArrayType(elementType, [containsNull]) + Note: The default value of containsNull is true. + + + + MapType + scala.collection.Map + + MapType(keyType, valueType, [valueContainsNull]) + Note: The default value of valueContainsNull is true. + + + + StructType + org.apache.spark.sql.Row + + StructType(fields) + Note: fields is a Seq of StructFields. Also, two fields with the same + name are not allowed. + + + + StructField + The value type in Scala of the data type of this field + (For example, Int for a
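A compact Scala sketch of the type-to-Scala mapping in the quoted table, building a schema with the `org.apache.spark.sql.types` API (column names and values are made up):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("data-types").master("local[*]").getOrCreate()

// StructType/StructField built from the types listed above.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true),
  StructField("scores", ArrayType(DoubleType, containsNull = true), nullable = true),
  StructField("attrs", MapType(StringType, StringType, valueContainsNull = true), nullable = true)
))

// Rows carry the matching Scala value types: String, Long, Seq, Map.
val rows = Seq(Row("alice", 30L, Seq(1.5, 2.5), Map("team" -> "blue")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
```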
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226011057 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,79 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations --- End diff -- Makes sense; kept it the same as `sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations`, done in b3fc39d. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225921716 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types + - `ByteType`: Represents 1-byte signed integer numbers. + The range of numbers is from `-128` to `127`. + - `ShortType`: Represents 2-byte signed integer numbers. + The range of numbers is from `-32768` to `32767`. + - `IntegerType`: Represents 4-byte signed integer numbers. + The range of numbers is from `-2147483648` to `2147483647`. + - `LongType`: Represents 8-byte signed integer numbers. + The range of numbers is from `-9223372036854775808` to `9223372036854775807`. + - `FloatType`: Represents 4-byte single-precision floating point numbers. + - `DoubleType`: Represents 8-byte double-precision floating point numbers. + - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. +* String type + - `StringType`: Represents character string values. +* Binary type + - `BinaryType`: Represents byte sequence values. +* Boolean type + - `BooleanType`: Represents boolean values. +* Datetime type + - `TimestampType`: Represents values comprising values of fields year, month, day, + hour, minute, and second. + - `DateType`: Represents values comprising values of fields year, month, day. +* Complex types + - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of + elements with the type of `elementType`. `containsNull` is used to indicate if + elements in a `ArrayType` value can have `null` values. + - `MapType(keyType, valueType, valueContainsNull)`: + Represents values comprising a set of key-value pairs. The data type of keys are + described by `keyType` and the data type of values are described by `valueType`. + For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` + is used to indicate if values of a `MapType` value can have `null` values. + - `StructType(fields)`: Represents values with the structure described by + a sequence of `StructField`s (`fields`). +* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. +The name of a field is indicated by `name`. The data type of a field is indicated +by `dataType`. `nullable` is used to indicate if values of this fields can have +`null` values. + + + + +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. 
+You can access them by doing
+
+{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
+
+| Data type | Value type in Scala | API to access or create a data type |
+| --- | --- | --- |
+| ByteType | Byte | ByteType |
+| ShortType | Short | ShortType |
+| IntegerType | Int | IntegerType |
+| LongType | Long | LongType |
+| FloatType | Float | FloatType |
+| DoubleType | Double | DoubleType |
+| DecimalType | java.math.BigDecimal | DecimalType |
+| StringType | String | StringType |
+| BinaryType | Array[Byte] | BinaryType |
+| BooleanType | Boolean | BooleanType |
+| TimestampType | java.sql.Timestamp | TimestampType |
+| DateType | java.sql.Date | DateType |
+| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true. |
+| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true. |
+| StructType | org.apache.spark.sql.Row | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
+| StructField | The value type in Scala of the data type of this field (For example, Int for
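Since the same diff hunk is quoted again above, here is a second, complementary sketch (also not from the PR; the field names are invented) focusing on the complex types and the nullability flags the quoted text describes: `containsNull` for array elements, `valueContainsNull` for map values, and per-field `nullable` in a `StructType`:

```scala
import org.apache.spark.sql.types._

// Sketch of the complex-type constructors and their nullability flags.
// Field names are made up for this example; no SparkSession is needed
// just to construct and inspect a schema.
object ComplexTypesSketch extends App {
  // ArrayType: containsNull says whether individual elements may be null.
  val tags = ArrayType(StringType, containsNull = false)

  // MapType: keys may never be null; valueContainsNull governs the values only.
  val attributes = MapType(StringType, IntegerType, valueContainsNull = true)

  // StructType is built from StructFields; nullable applies per field.
  val address = StructType(Seq(
    StructField("street", StringType, nullable = true),
    StructField("zip", StringType, nullable = false)
  ))

  val schema = StructType(Seq(
    StructField("tags", tags, nullable = true),
    StructField("attributes", attributes, nullable = true),
    StructField("address", address, nullable = true)
  ))

  // treeString renders the schema in the same tree format printSchema() uses.
  println(schema.treeString)
}
```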
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225797461 --- Diff: docs/_data/menu-sql.yaml --- @@ -0,0 +1,79 @@ +- text: Getting Started + url: sql-getting-started.html + subitems: +- text: "Starting Point: SparkSession" + url: sql-getting-started.html#starting-point-sparksession +- text: Creating DataFrames + url: sql-getting-started.html#creating-dataframes +- text: Untyped Dataset Operations --- End diff -- how about `Untyped Dataset Operations (DataFrame operations)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225794532 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types +- `ByteType`: Represents 1-byte signed integer numbers. --- End diff -- Thanks, done in 58115e5. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225794477 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- Done in 58115e5; also fixed the links in ml-pipeline.md, sparkr.md, and structured-streaming-programming-guide.md. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225789933 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- Sorry for missing that; I will check all the inner links by `
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225783658 --- Diff: docs/sql-reference.md --- @@ -0,0 +1,641 @@ +--- +layout: global +title: Reference +displayTitle: Reference +--- + +* Table of contents +{:toc} + +## Data Types + +Spark SQL and DataFrames support the following data types: + +* Numeric types +- `ByteType`: Represents 1-byte signed integer numbers. --- End diff -- nit: use 2 space indent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r225780740 --- Diff: docs/sql-getting-started.md --- @@ -0,0 +1,369 @@ +--- +layout: global +title: Getting Started +displayTitle: Getting Started +--- + +* Table of contents +{:toc} + +## Starting Point: SparkSession + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: + +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`: + +{% include_example init_session python/sql/basic.py %} + + + + +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`: + +{% include_example init_session r/RSparkSQLExample.R %} + +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around. + + + +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. +To use these features, you do not need to have an existing Hive setup. + +## Creating DataFrames + + + +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds), +from a Hive table, or from [Spark data sources](#data-sources). --- End diff -- The link `[Spark data sources](#data-sources)` does not work after this change. Could you fix all the similar cases? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org