Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/826#discussion_r116895850
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java ---
    @@ -380,14 +384,21 @@ public void endRecord() throws IOException {
     
           // since ParquetFileWriter will overwrite empty output file (append is not supported)
           // we need to re-apply file permission
    -      parquetFileWriter = new ParquetFileWriter(conf, schema, path, ParquetFileWriter.Mode.OVERWRITE);
    +      if (useConfiguredBlockSize) {
    --- End diff --
    
What we are doing is creating each Parquet file as a single block, without changing the file system's default block size. For example, the default Parquet block size is 512MB; if the file system block size is 128MB, we create a single file spanning 4 filesystem blocks, which can get distributed across different nodes and is not good for performance. If we instead lower the Parquet block size to 128MB (to match the file system block size), then for the same amount of data we end up creating 4 files of one block each, which is not good either.
    
The JIRA asks for a single HDFS block per Parquet file, even when that file is larger than the file system block size, and without changing the file system block size. The reporter has the file system block size configured as 128MB. Lowering the Parquet block size (from its default of 512MB) to match the file system block size creates too many files for them, and for other reasons they are not able to change the file system block size.


