[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20018
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454282

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454291

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+// Control number of files in each partition by coalesce

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454275

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454240

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454252

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454265

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454218

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158454228

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format

--- End diff --

@HyukjinKwon Thanks for highlighting this; fixed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158408627

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format

--- End diff --

`spark` -> `Spark`
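For readers following the thread, a minimal sketch of the compatibility setting this hunk is about, assuming an existing SparkSession named `spark` as in the surrounding example; `spark.conf.set` is the SparkSession-level equivalent of the `spark.sqlContext.setConf` call in the diff:

```scala
// Sketch only: when set to "true", Spark writes Parquet files in the legacy
// layout (e.g. for decimal values) so that Hive and Impala can read them.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
```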
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407717

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format

--- End diff --

`parquet` -> `Parquet`.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407910

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition

--- End diff --

`reduce` -> `Reduce`.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407739

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format

--- End diff --

`parquet` -> `Parquet`
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407768

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.

--- End diff --

`parquet` -> `Parquet`.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407583

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet

--- End diff --

`parquet` -> `Parquet`
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407620

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format

--- End diff --

`Managed` -> `managed`
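As a self-contained illustration of the managed-table path under review, a minimal sketch assuming the example's own placeholder database name `database_name`:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// A SparkSession with Hive support, as in the surrounding example.
val spark = SparkSession.builder()
  .appName("SparkHiveExample")
  .enableHiveSupport()
  .getOrCreate()
import spark.sql

// Create a Hive managed table stored as Parquet, then save a DataFrame into it.
sql("CREATE TABLE IF NOT EXISTS records(key INT, value STRING) STORED AS PARQUET")
val hiveTableDF = sql("SELECT * FROM records")
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
```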
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407946

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
  +.partitionBy("key").parquet(hiveExternalTableLocation)
+
+// Control number of files in each partition by coalesce

--- End diff --

` Control number of files` -> ` Control the number of files`
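A minimal sketch of the two file-count controls this hunk demonstrates, reusing the example's own `hiveTableDF` and `hiveExternalTableLocation`; the trade-off the comments gesture at is that `repartition` performs a full shuffle while `coalesce` only merges existing partitions:

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._ // for the $"key" column syntax

// Shuffle so rows with the same key land in the same task, typically
// yielding one Parquet file per partition directory.
hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
  .partitionBy("key").parquet(hiveExternalTableLocation)

// Merge down to at most 10 tasks without a full shuffle; each partition
// directory then receives at most 10 files.
hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
  .partitionBy("key").parquet(hiveExternalTableLocation)
```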
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158407873

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning

--- End diff --

`turn` -> `Turn`.
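One subtlety worth hedging here: the two `hive.exec.dynamic.partition*` flags govern inserts into an existing partitioned Hive table (for example via `insertInto`), not plain path-based Parquet writes like the `.parquet(...)` calls in the diff. A minimal sketch under that assumption, with `records_part` as a hypothetical pre-created partitioned table:

```scala
// Sketch only: assumes a partitioned Hive table created beforehand, e.g.
//   CREATE TABLE records_part(value STRING) PARTITIONED BY (key INT) STORED AS PARQUET
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// insertInto matches columns by position, so the partition column goes last.
hiveTableDF.select("value", "key")
  .write.mode(SaveMode.Overwrite)
  .insertInto("records_part")
```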
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370581

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without

--- End diff --

@srowen done.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370509

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)

--- End diff --

@cloud-fan Removed all the comments; as discussed with @srowen, it makes more sense to have this in the docs, with the inconsistencies removed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158370168

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm

--- End diff --

@srowen I totally agree with you. I will rephrase the content for the docs; I have removed it from here for now. Please take a look.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158368719

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;

--- End diff --

@cloud-fan We'll keep all the descriptive comments in the documentation, in user-friendly language. I have also added the location.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158368554

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()

--- End diff --

@srowen Done. cc @cloud-fan: removed toDF().
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158366994

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*

--- End diff --

@srowen Done, changes addressed.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210754

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
+ full data shuffle.
+ */
+// coalesce of 10 could create 10 parquet files under each partitions,
+// if data is huge and make sense to do partitioning.
+hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)

--- End diff --

ditto
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210714

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)

--- End diff --

This is not standard usage; let's not put it in the example.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210666

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;

--- End diff --

It's weird to create an external table without a location. Users may be confused about the difference between a managed table and an external table.
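A minimal sketch of the fix being asked for, with an explicit LOCATION so the managed/external distinction is visible; the table name `records_ext` and the path are illustrative:

```scala
// Sketch only: an external table whose data lives at an explicit HDFS path.
// Dropping the table removes only the metadata, not the files at LOCATION.
sql("""CREATE EXTERNAL TABLE IF NOT EXISTS records_ext(key INT, value STRING)
       STORED AS PARQUET
       LOCATION '/user/hive/warehouse/database_name.db/records_ext'""")
```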
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210425

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()

--- End diff --

actually, I think `spark.table("records")` is a better example.
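For reference, the suggested form: `spark.table` returns the Hive table as a DataFrame directly, making both the SQL string and the trailing `.toDF()` unnecessary.

```scala
// Equivalent to sql("SELECT * FROM records"), but shorter and already a DataFrame.
val hiveTableDF = spark.table("records")
```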
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210374

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()

--- End diff --

`.toDF` is not needed
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158210132

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*

--- End diff --

+1
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158134032

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without

--- End diff --

Sentences need some cleanup here. What do you mean by 'Int' argument? Maybe it's best to point people to the API docs rather than incompletely repeat them.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158133877

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ *    warehouse location will be used to store Hive table Data.
+ *    Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ *    You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ *    location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ *    Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+ hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ *    Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+ hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm

--- End diff --

This is more stuff that should go in the docs, not comments in an example. It kind of duplicates existing documentation. Is this commentary really needed to illustrate usage of the API? That's the only goal here. What are "small-small files"? You also have some inconsistent capitalization; Parquet should be capitalized, but not file, bandwidth, etc.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158133606

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
+/*

--- End diff --

Oh, just noticed this. You're using javadoc-style comments here, but they have no effect. Just use the `//` block style for comments that you see above, for consistency.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158113948

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

@srowen I misunderstood your first comment. I have reverted as suggested; please check now.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r158100765

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

Why do you turn the example listing off and then on again? Just remove those two lines.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157973588

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

@srowen I have updated the DDL for storing data with partitioning in Hive. cc @HyukjinKwon @mgaido91 @markgrover @markhamstra
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157942866

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

@srowen Can you please review this? cc @holdenk @sameeragarwal
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user chetkhatri commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157796580

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

@srowen Thank you for the valuable review feedback. I have added that so it can help other developers.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20018#discussion_r157757263

--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala ---

@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$

--- End diff --

Do you not want the code below to render in the docs as part of the example? Maybe not; just checking whether that's intentional.
[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...
GitHub user chetkhatri opened a pull request: https://github.com/apache/spark/pull/20018

SPARK-22833 [Improvement] in SparkHive Scala Examples

## What changes were proposed in this pull request?

SparkHive Scala examples improvements:
* Writing a DataFrame / Dataset to a Hive managed table and a Hive external table using different storage formats.
* Examples of partitioning, repartition, and coalesce, with appropriate usage.

## How was this patch tested?

* The patch has been tested manually and by running ./dev/run-tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chetkhatri/spark scala-sparkhive-examples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20018

commit 9d9b42bb49997ce7d308fbf50072e5f5e0eccaa2
Author: chetkhatri
Date: 2017-12-19T11:33:47Z

    SPARK-22833 [Improvement] in SparkHive Scala Examples
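Pulling the thread together, a condensed sketch of the flow the example settles on once the review feedback is applied (Parquet capitalized in comments, `spark.table` instead of `sql(...).toDF()`, no repartition-by-key); the database name and warehouse path remain the example's own placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("SparkHiveExample")
  .enableHiveSupport()
  .getOrCreate()
import spark.sql

// Hive managed table, stored as Parquet.
sql("CREATE TABLE IF NOT EXISTS records(key INT, value STRING) STORED AS PARQUET")
val hiveTableDF = spark.table("records")
hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")

// Write Hive-compatible Parquet to an external location, partitioned by key.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
hiveTableDF.write.mode(SaveMode.Overwrite)
  .partitionBy("key").parquet(hiveExternalTableLocation)
```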