[GitHub] spark pull request #22350: [SPARK-25356][SQL] Add Parquet block size option to SparkSQL configuration
Github user 10110346 closed the pull request at: https://github.com/apache/spark/pull/22350

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215819785

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Sounds reasonable. I'll close it now, thanks.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215812113

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

I doubt this is common enough to deserve an alias and documentation in `sql-programming-guide.md`. In my experience, other configurations like `parquet.page.size`, `parquet.enable.dictionary` or `parquet.writer.version` are used about as often as this one. I wouldn't add this for now.
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215598798

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

Yes, we are already able to set this via `parquet.block.size`. I think we should add this parameter to `sql-programming-guide.md`.
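As the comments note, the block size can already be controlled through the existing Hadoop configuration key, with no new Spark SQL option. A minimal sketch of both approaches, assuming an active `SparkSession` named `spark` and an existing DataFrame `df` (both hypothetical names) and an illustrative 64 MB value:

```scala
// Sketch: configuring the Parquet row-group ("block") size via the
// existing Hadoop key `parquet.block.size`.

// Option 1: set it globally on the session's Hadoop configuration;
// every subsequent Parquet write picks it up.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// Option 2: set it for a single write. DataFrameWriter options for file
// sources are merged into the Hadoop configuration used by the write job.
df.write
  .option("parquet.block.size", (64 * 1024 * 1024).toString)
  .parquet("/tmp/output")
```

Option 2 is usually preferable in shared sessions, since it scopes the change to one write instead of mutating session-wide Hadoop configuration.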
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22350#discussion_r215595058

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -123,6 +123,9 @@ class ParquetFileFormat
     // Sets compression scheme
     conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+    // Sets Parquet block size
+    conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
--- End diff --

For clarification, we are already able to set this via `parquet.block.size`, but this PR proposes an alias for it, right?
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/22350

[SPARK-25356][SQL] Add Parquet block size option to SparkSQL configuration

## What changes were proposed in this pull request?

I think we should be able to configure the Parquet block size when using the Parquet format. For HDFS, `dfs.block.size` is configurable, and we sometimes want the Parquet block size to be consistent with it. Likewise, `spark.sql.files.maxPartitionBytes` arguably works best when it is consistent with the Parquet block size. We may also want to shrink the Parquet block size in some tests.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark addblocksize

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22350.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22350

commit 3485b523d54e83ed3388febd06b3ac4914d181ed
Author: liuxian
Date: 2018-09-06T10:35:43Z

    fix
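The alignment the description argues for (HDFS block size, Parquet block size, and Spark's split size all matching) can already be expressed as plain configuration. A hypothetical `spark-submit` sketch, with 128 MB chosen only for illustration and `my_job.py` standing in for any application:

```shell
# Sketch: align HDFS block size, Parquet block (row-group) size, and
# Spark's maximum file-split size, all at 128 MB here (illustrative).
# `spark.hadoop.*` entries are forwarded into the Hadoop configuration.
spark-submit \
  --conf spark.hadoop.dfs.block.size=134217728 \
  --conf spark.hadoop.parquet.block.size=134217728 \
  --conf spark.sql.files.maxPartitionBytes=134217728 \
  my_job.py
```

With all three aligned, a Parquet row group tends to fall within a single HDFS block, and each Spark read task tends to map to roughly one row group, which is the locality argument the PR description sketches.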