[ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

F Méthot updated DRILL-5379:
----------------------------

Hi Padma,

  We finally got to deploy the change to our test HDFS cluster and I am
happy to report the following:

Test 1:
`store.parquet.block-size` = 1GB

A CTAS on a 900 MB source file resulted in a single parquet file using a
SINGLE block on HDFS. (just like we wanted!)

Test 2:
`store.parquet.block-size` = 128MB

A CTAS on a 900 MB source file resulted in 7 parquet files, each using a
SINGLE block on HDFS. (I was not expecting that, but I think it's perfect
like that.)
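
For reference, a minimal sketch of how such a test can be driven through
Drill's JDBC driver (the connection URL and table names are hypothetical;
1073741824 bytes is the 1 GB setting from Test 1):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CtasBlockSizeTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical embedded-mode URL; point zk= at a real cluster instead.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
            // 1 GB Parquet block size, as in Test 1.
            stmt.execute("ALTER SESSION SET `store.parquet.block-size` = 1073741824");
            // Hypothetical table names: CTAS over a ~900 MB source.
            stmt.execute("CREATE TABLE dfs.tmp.`target_parquet` AS "
                    + "SELECT * FROM dfs.tmp.`source_data`");
        }
    }
}
```

Test 2 is the same run with `store.parquet.block-size` = 134217728 (128 MB).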


That was an easy change and it's worth it. We are thinking of putting that
on prod right away. What do you think?

On Wed, Apr 19, 2017 at 8:33 PM, Padma Penumarthy (JIRA) <j...@apache.org>
wrote:

> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
>
>
> It seems there is a way to force Drill to store CTAS-generated parquet files
> as a single block when using HDFS. The Java HDFS API allows it: files can be
> created with the HDFS block size set to the Parquet block size from the
> session or system config. This is ideal, since we want a single parquet file
> per HDFS block. Here is the HDFS API method that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
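
For illustration, a minimal sketch of that create() overload; the output path
is a placeholder, and buffer size and replication are taken from the defaults:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleBlockCreate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.parquet"); // placeholder path
        long parquetBlockSize = 1024L * 1024 * 1024;  // e.g. store.parquet.block-size = 1 GB

        // create(Path, overwrite, bufferSize, replication, blockSize):
        // the last argument sets the HDFS block size for this file only,
        // leaving the cluster-wide default untouched.
        FSDataOutputStream out = fs.create(
                file,
                true,                                    // overwrite
                conf.getInt("io.file.buffer.size", 4096),
                fs.getDefaultReplication(file),
                parquetBlockSize);
        out.close();
    }
}
```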
> Drill uses the Hadoop ParquetFileWriter
> (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so changing it might be tricky.
> However, ParquetRecordWriter.java in Drill
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java)
> creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property on the
> Configuration object before passing it to the ParquetFileWriter constructor?
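
A sketch of what that exploration might look like. It assumes (worth
verifying) that the FileSystem obtained from the Configuration honors
`dfs.blocksize` when no explicit block size is passed to create(); the schema
and path here are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.hadoop.ParquetFileWriter;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class BlockSizeViaConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: HDFS uses dfs.blocksize as the per-file block size
        // when FileSystem.create() is called without an explicit value.
        conf.setLong("dfs.blocksize", 1024L * 1024 * 1024); // 1 GB

        // Placeholder schema and output path for illustration.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; }");
        ParquetFileWriter writer = new ParquetFileWriter(
                conf, schema, new Path("/tmp/example.parquet"));
        // ... then writer.start(), write row groups, writer.end(...) as usual.
    }
}
```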



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
