[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20018


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454282
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454291
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+// Control number of files in each partition by coalesce
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454275
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454240
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454252
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454265
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454218
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-22 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158454228
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
--- End diff --

@HyukjinKwon Thanks for highlighting this; improved accordingly.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158408627
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
--- End diff --

`spark` -> `Spark`


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407717
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
--- End diff --

`parquet` -> `Parquet`.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407910
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
--- End diff --

`reduce` -> `Reduce`.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407739
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
--- End diff --

`parquet` -> `Parquet`


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407768
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
--- End diff --

`parquet` -> `Parquet`.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407583
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
--- End diff --

`parquet` -> `Parquet`


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407620
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
--- End diff --

`Managed` -> `managed`


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407946
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+
+// reduce number of files for each partition by repartition
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+// Control number of files in each partition by coalesce
--- End diff --

` Control number of files` -> ` Control the number of files`


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158407873
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,41 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+// Create Hive managed table with parquet
+sql("CREATE TABLE records(key int, value string) STORED AS PARQUET")
+// Save DataFrame to Hive Managed table as Parquet format
+val hiveTableDF = sql("SELECT * FROM records")
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+// Create External Hive table with parquet
+sql("CREATE EXTERNAL TABLE records(key int, value string) " +
+  "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
+
+// Save DataFrame to Hive External table as compatible parquet format
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
--- End diff --

`turn` -> `Turn`.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370581
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
 You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
 data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
--- End diff --

@srowen done.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370509
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
--- End diff --

@cloud-fan removed all the comments; as discussed with @srowen, it really makes more sense to have this in the docs, with the inconsistencies removed.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158370168
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
--- End diff --

@srowen I totally agree with you. I will rephrase the content for the docs; for now I have removed it from here. Please check and proceed as needed.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158368719
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
--- End diff --

@cloud-fan we'll keep all the descriptive comments in the documentation, written in user-friendly language. I have also added the location.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158368554
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --

@srowen done. cc @cloud-fan: removed toDF().


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-21 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158366994
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
--- End diff --

@srowen Done, changes addressed


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210754
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
+ full data shuffle.
+ */
+// coalesce of 10 could create 10 parquet files under each partitions,
+// if data is huge and make sense to do partitioning.
+hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
--- End diff --

ditto


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210714
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
--- End diff --

This is not a standard usage, let's not put it in the example.
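For reference, the plainer form already present earlier in this diff, which the example could keep on its own (a sketch only, assuming the same hiveTableDF and hiveExternalTableLocation from the quoted code):

    // Let partitionBy split the output by key directly, without an explicit
    // repartition on the partition column first.
    hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
      .parquet(hiveExternalTableLocation)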


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210666
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
--- End diff --

It's odd to create an external table without a location. Users may be confused about the difference between a managed table and an external table.
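For illustration, a minimal sketch of an external table with an explicit LOCATION clause, which is what the later revision of this PR (quoted near the top of this archive) does; the path is taken from that revision and is illustrative only:

    // An external table keeps its data at the given path; dropping the table
    // removes only the metadata, unlike a managed table.
    sql("CREATE EXTERNAL TABLE records(key int, value string) " +
      "STORED AS PARQUET LOCATION '/user/hive/warehouse/'")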


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210425
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --

Actually, I think `spark.table("records")` is a better example.
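For reference, a minimal sketch of what that suggestion would look like, assuming the same `records` Hive table, an active `SparkSession` named `spark`, and `org.apache.spark.sql.SaveMode` imported as in the surrounding example:

    // spark.table loads the Hive table directly as a DataFrame,
    // equivalent to sql("SELECT * FROM records") without building a query string.
    val hiveTableDF = spark.table("records")
    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")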


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210374
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
--- End diff --

`.toDF` is not needed
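A minimal sketch of the equivalent shorter form, assuming the surrounding example context:

    // sql(...) already returns a DataFrame, so the extra .toDF() call is redundant.
    val hiveTableDF = sql("SELECT * FROM records")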


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158210132
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
--- End diff --

+1


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158134032
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of data under directory given.
+val hiveExternalTableLocation = s"/user/hive/warehouse/database_name.db/records"
+hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
+// You can create partitions in Hive table, so downstream queries run much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many small-small files which may harm
+downstream query performance due to File I/O, Bandwidth I/O, Network I/O, Disk I/O.
+To improve performance you can create single parquet file under each partition directory using 'repartition'
+on partitioned key for Hive table. When you add partition to table, there will be change in table DDL.
+Ex: CREATE TABLE records(value string) PARTITIONED BY(key int) STORED AS PARQUET;
+ */
+hiveTableDF.repartition($"key").write.mode(SaveMode.Overwrite)
+  .partitionBy("key").parquet(hiveExternalTableLocation)
+
+/*
+ You can also do coalesce to control number of files under each partitions, repartition does full shuffle and equal
+ data distribution to all partitions. here coalesce can reduce number of files to given 'Int' argument without
--- End diff --

The sentences need some cleanup here. What do you mean by the 'Int' argument? Maybe it's best to point people to the API docs rather than repeat them incompletely.
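For context, a minimal sketch of the distinction being discussed, assuming the same hiveTableDF and hiveExternalTableLocation from the quoted diff; the partition counts are illustrative only:

    // repartition(n) performs a full shuffle and redistributes rows evenly
    // across n partitions (and therefore roughly n output files per directory).
    hiveTableDF.repartition(10).write.mode(SaveMode.Overwrite)
      .parquet(hiveExternalTableLocation)

    // coalesce(n) only merges existing partitions down to n without a full
    // shuffle; the Dataset API docs describe the exact semantics.
    hiveTableDF.coalesce(10).write.mode(SaveMode.Overwrite)
      .parquet(hiveExternalTableLocation)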


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158133877
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
+ * Save DataFrame to Hive Managed table as Parquet format
+ * 1. Create Hive Database / Schema with location at HDFS if you want to mentioned explicitly else default
+ * warehouse location will be used to store Hive table Data.
+ * Ex: CREATE DATABASE IF NOT EXISTS database_name LOCATION hdfs_path;
+ * You don't have to explicitly give location for each table, every tables under specified schema will be located at
+ * location given while creating schema.
+ * 2. Create Hive Managed table with storage format as 'Parquet'
+ * Ex: CREATE TABLE records(key int, value string) STORED AS PARQUET;
+ */
+val hiveTableDF = sql("SELECT * FROM records").toDF()
+hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")
+
+/*
+ * Save DataFrame to Hive External table as compatible parquet format.
+ * 1. Create Hive External table with storage format as parquet.
+ * Ex: CREATE EXTERNAL TABLE records(key int, value string) STORED AS PARQUET;
+ * Since we are not explicitly providing hive database location, it automatically takes default warehouse location
+ * given to 'spark.sql.warehouse.dir' while creating SparkSession with enableHiveSupport().
+ * For example, we have given '/user/hive/warehouse/' as a Hive Warehouse location. It will create schema directories
+ * under '/user/hive/warehouse/' as '/user/hive/warehouse/database_name.db' and '/user/hive/warehouse/database_name'.
+ */
+
+// to make Hive parquet format compatible with spark parquet format
+spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
+// Multiple parquet files could be created accordingly to volume of 
data under directory given.
+val hiveExternalTableLocation = 
s"/user/hive/warehouse/database_name.db/records"
+
hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)
+
+// turn on flag for Dynamic Partitioning
+spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
+spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", 
"nonstrict")
+// You can create partitions in Hive table, so downstream queries run 
much faster.
+hiveTableDF.write.mode(SaveMode.Overwrite).partitionBy("key")
+  .parquet(hiveExternalTableLocation)
+/*
+If Data volume is very huge, then every partitions would have many 
small-small files which may harm
--- End diff --

This is more stuff that should go in the docs, not in comments in an example. It 
kind of duplicates existing documentation. Is this commentary really needed to 
illustrate usage of the API? That's the only goal right here. 

What are "small-small files"? You also have some inconsistent capitalization; 
Parquet should be capitalized, but not file, bandwidth, etc.



---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158133606
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -102,8 +101,63 @@ object SparkHiveExample {
 // |  4| val_4|  4| val_4|
 // |  5| val_5|  5| val_5|
 // ...
-// $example off:spark_hive$
 
+/*
--- End diff --

Oh, just noticed this. You're using Javadoc-style comments here, but they 
won't have any effect.
Just use the `//` block style for comments that you see above, for 
consistency.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158113948
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen I misunderstood your first comment. I have reverted it as suggested. 
Please check now.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r158100765
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

Why do you turn the example listing off and then on again? Just remove those 
two lines.
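
For context (an editorial note, not part of the PR): as I understand the docs build, the `$example on$` / `$example off$` markers delimit the code that gets pulled into the published example listing, so anything between an `off` and the next `on` is skipped. Roughly (`MarkerSketch` is a hypothetical object name used only for illustration):

```scala
object MarkerSketch {
  def main(args: Array[String]): Unit = {
    // $example on:spark_hive$
    println("code here is extracted into the rendered docs")
    // $example off:spark_hive$
    println("code here stays only in the example source file")
  }
}
```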


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-20 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157973588
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen I have updated the DDL for storing data with partitioning in Hive.
cc\ @HyukjinKwon @mgaido91 @markgrover @markhamstra


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157942866
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen Can you please review this? cc\ @holdenk @sameeragarwal


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
Github user chetkhatri commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157796580
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

@srowen Thank you for the valuable review feedback. I have added that so it 
can help other developers.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20018#discussion_r157757263
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
 ---
@@ -104,6 +103,60 @@ object SparkHiveExample {
 // ...
 // $example off:spark_hive$
--- End diff --

Do you not want the code below to render in the docs as part of the 
example? Maybe not; just checking whether that's intentional.


---




[GitHub] spark pull request #20018: SPARK-22833 [Improvement] in SparkHive Scala Exam...

2017-12-19 Thread chetkhatri
GitHub user chetkhatri opened a pull request:

https://github.com/apache/spark/pull/20018

SPARK-22833 [Improvement] in SparkHive Scala Examples

## What changes were proposed in this pull request?

Improvements made to the SparkHive Scala example:
* Writing a DataFrame / Dataset to Hive managed and Hive external tables using 
different storage formats.
* Examples of partitioning, repartition, and coalesce (a condensed sketch of the 
flow follows below).
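
A condensed, editorial sketch of the flow the example walks through (not verbatim PR code; it assumes the `records` Hive table from the existing example already exists, and the database name and warehouse paths are placeholders):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkHiveExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkHiveExampleSketch")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val hiveTableDF = spark.sql("SELECT * FROM records")

    // 1. Save to a Hive managed table as Parquet
    hiveTableDF.write.mode(SaveMode.Overwrite).saveAsTable("database_name.records")

    // 2. Write Hive-compatible Parquet files to an external table's location
    spark.sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    val hiveExternalTableLocation = "/user/hive/warehouse/database_name.db/records"
    hiveTableDF.write.mode(SaveMode.Overwrite).parquet(hiveExternalTableLocation)

    // 3. Dynamic partitioning plus repartition on the partition key,
    //    so each partition directory gets fewer, larger files
    spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
    spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    hiveTableDF.repartition($"key")
      .write.mode(SaveMode.Overwrite)
      .partitionBy("key")
      .parquet(hiveExternalTableLocation)

    spark.stop()
  }
}
```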

## How was this patch tested?
* Patch has been tested manually and by running ./dev/run-tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chetkhatri/spark scala-sparkhive-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20018


commit 9d9b42bb49997ce7d308fbf50072e5f5e0eccaa2
Author: chetkhatri 
Date:   2017-12-19T11:33:47Z

SPARK-22833 [Improvement] in SparkHive Scala Examples




---
