[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 closed the pull request at:

https://github.com/apache/spark/pull/22350


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215819785
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Sounds reasonable. I'll close it now, thanks.





[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215812113
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

I doubt this is common enough to warrant an alias and an entry in 
`sql-programming-guide.md`. In my experience, other configurations such as 
`parquet.page.size`, `parquet.enable.dictionary`, and `parquet.writer.version` 
are used about as often as this one.

I wouldn't add this for now.





[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215598798
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Yes, we can already set this via `parquet.block.size`. 
I think we should add this parameter to `sql-programming-guide.md`.
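
For reference, a minimal sketch of setting the existing `parquet.block.size` key from a Spark application, with no new SQL alias involved. The local session, the 64 MB value, and the output path `/tmp/parquet-block-size-demo` are assumptions chosen for illustration; none of them come from this PR.

```scala
import org.apache.spark.sql.SparkSession

object ParquetBlockSizeDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("parquet-block-size-demo")
      .master("local[*]")
      .getOrCreate()

    // Illustrative value: 64 MB row groups.
    val blockSizeBytes = 64 * 1024 * 1024

    // parquet.block.size is a Parquet/Hadoop key, so it is set on the
    // Hadoop configuration rather than on the Spark SQL configuration.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", blockSizeBytes)

    // Hypothetical output path; subsequent Parquet writes pick up the setting.
    spark.range(0, 1000000)
      .write
      .mode("overwrite")
      .parquet("/tmp/parquet-block-size-demo")

    spark.stop()
  }
}
```

The same key can also be supplied at submit time as `spark.hadoop.parquet.block.size`, since Spark copies `spark.hadoop.*` properties into the Hadoop configuration.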





[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22350#discussion_r215595058
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
 // Sets compression scheme
 conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
 
+// Sets Parquet block size
+conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

For clarification, we are already able to set this via `parquet.block.size` 
but this PR proposes an alias for it, right?





[GitHub] spark pull request #22350: [SPARK-25356][SQL]Add Parquet block size option t...

2018-09-06 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22350

[SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration

## What changes were proposed in this pull request?


I think the Parquet block size should be configurable through Spark SQL when 
writing in the Parquet format.
Because `dfs.block.size` is configurable for HDFS, we sometimes want the 
Parquet block size to be consistent with it.
It is also an open question whether `spark.sql.files.maxPartitionBytes` should 
be kept consistent with the Parquet block size when using the Parquet format.
In addition, we may want to shrink the Parquet block size in some tests.
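
To make the consistency concern concrete, here is a rough sketch of one way to line the three sizes up using configuration that already exists. It is an illustration under assumptions, not part of this patch: the output path is hypothetical, the block size is taken from whatever file system the path resolves to, and whether `spark.sql.files.maxPartitionBytes` should really match is the open question raised above.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object AlignParquetWithFsBlockSize {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("align-parquet-block-size")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical output location.
    val outputPath = new Path("/tmp/aligned-parquet-demo")
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Ask the target file system for its default block size at this path.
    val fsBlockSize: Long =
      outputPath.getFileSystem(hadoopConf).getDefaultBlockSize(outputPath)

    // Write Parquet row groups no larger than one file system block.
    hadoopConf.setLong("parquet.block.size", fsBlockSize)

    // One possible choice: read partitions of the same size.
    spark.conf.set("spark.sql.files.maxPartitionBytes", fsBlockSize)

    spark.range(0, 1000000)
      .write
      .mode("overwrite")
      .parquet(outputPath.toString)

    spark.stop()
  }
}
```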

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark addblocksize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22350


commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian 
Date:   2018-09-06T10:35:43Z

fix



