[GitHub] drill pull request #1214: DRILL-6331: Revisit Hive Drill native parquet impl...

arina-ielchiieva Thu, 26 Apr 2018 07:05:24 -0700

Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1214#discussion_r184401600
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/BaseOperatorContext.java 
---
    @@ -158,25 +159,26 @@ public void close() {
         } catch (RuntimeException e) {
           ex = ex == null ? e : ex;
         }
    -    try {
    -      if (fs != null) {
    +
    +    for (DrillFileSystem fs : fileSystems) {
    +      try {
             fs.close();
    -        fs = null;
    -      }
    -    } catch (IOException e) {
    +      } catch (IOException e) {
           throw UserException.resourceError(e)
    -        .addContext("Failed to close the Drill file system for " + 
getName())
    -        .build(logger);
    +          .addContext("Failed to close the Drill file system for " + 
getName())
    +          .build(logger);
    +      }
         }
    +
         if (ex != null) {
           throw ex;
         }
       }
     
       @Override
       public DrillFileSystem newFileSystem(Configuration conf) throws 
IOException {
    -    Preconditions.checkState(fs == null, "Tried to create a second 
FileSystem. Can only be called once per OperatorContext");
    -    fs = new DrillFileSystem(conf, getStats());
    +    DrillFileSystem fs = new DrillFileSystem(conf, getStats());
    --- End diff --
    
    When `AbstractParquetScanBatchCreator.getBatch` method is called, it 
receives one operator context which is used to allow to create only one file 
system. It also receives `AbstractParquetRowGroupScan` which contains several 
row groups. Row groups may belong to different files. For Drill parquet files, 
we create only one fs and use it for to create readers for each row group. 
That's why it was fine when operator context allowed to create only one fs. But 
we needed to adjust it for Hive files. For Hive we need to create fs for each 
file (since config to each file system is different and created using 
projection pusher), that's why I had to change operator context to allow more 
then one file system. I have also introduced `AbstractDrillFileSystemManager` 
which controls number of file systems created. `ParquetDrillFileSystemManager` 
creates only one (as was done before). 
`HiveDrillNativeParquetDrillFileSystemManager` creates fs for each file, so 
when two row groups belong to the same
  file, they will get the same fs.
    
    But I agree that for tracking fs (i.e. 
store.parquet.reader.pagereader.async is set to false) this will create mess in 
calculations. So I suggest the following fix, for Hive we'll always create non 
tracking fs, for Drill depending on store.parquet.reader.pagereader.async 
option. Also I'll add checks in operator context to disallow to create more 
then one tracking fs and to create tracking fs at all when non-tracking is / 
are already created.

---

[GitHub] drill pull request #1214: DRILL-6331: Revisit Hive Drill native parquet impl...

Reply via email to