[ https://issues.apache.org/jira/browse/DRILL-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454257#comment-16454257 ]

ASF GitHub Bot commented on DRILL-6331:
---------------------------------------

Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1214#discussion_r184401600
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/BaseOperatorContext.java ---
    @@ -158,25 +159,26 @@ public void close() {
         } catch (RuntimeException e) {
           ex = ex == null ? e : ex;
         }
    -    try {
    -      if (fs != null) {
    +
    +    for (DrillFileSystem fs : fileSystems) {
    +      try {
             fs.close();
    -        fs = null;
    -      }
    -    } catch (IOException e) {
    +      } catch (IOException e) {
           throw UserException.resourceError(e)
    -        .addContext("Failed to close the Drill file system for " + 
getName())
    -        .build(logger);
    +          .addContext("Failed to close the Drill file system for " + 
getName())
    +          .build(logger);
    +      }
         }
    +
         if (ex != null) {
           throw ex;
         }
       }
     
       @Override
       public DrillFileSystem newFileSystem(Configuration conf) throws IOException {
    -    Preconditions.checkState(fs == null, "Tried to create a second FileSystem. Can only be called once per OperatorContext");
    -    fs = new DrillFileSystem(conf, getStats());
    +    DrillFileSystem fs = new DrillFileSystem(conf, getStats());
    --- End diff ---
    
    When the `AbstractParquetScanBatchCreator.getBatch` method is called, it 
    receives one operator context, which previously allowed creating only one 
    file system. It also receives an `AbstractParquetRowGroupScan`, which 
    contains several row groups, and those row groups may belong to different 
    files. For Drill parquet files we create only one fs and use it to create 
    the readers for each row group, so it was fine for the operator context to 
    allow only one fs. But we needed to adjust this for Hive files: for Hive 
    we must create an fs per file (since the config for each file system is 
    different and is created using the projection pusher), which is why I had 
    to change the operator context to allow more than one file system. I have 
    also introduced `AbstractDrillFileSystemManager`, which controls the 
    number of file systems created. `ParquetDrillFileSystemManager` creates 
    only one (as was done before). 
    `HiveDrillNativeParquetDrillFileSystemManager` creates an fs per file, so 
    when two row groups belong to the same file, they get the same fs.
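
    To make that concrete, here is a minimal sketch of the manager hierarchy 
    described above. The class names come from the PR, but the 
    `get(Configuration, String)` entry point, the constructors, and the 
    per-path cache are simplified assumptions for illustration, not the 
    actual Drill source:

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.drill.exec.ops.OperatorContext;
        import org.apache.drill.exec.store.dfs.DrillFileSystem;
        import org.apache.hadoop.conf.Configuration;

        // Base class: controls how many DrillFileSystem instances are created.
        abstract class AbstractDrillFileSystemManager {

          protected final OperatorContext operatorContext;

          protected AbstractDrillFileSystemManager(OperatorContext operatorContext) {
            this.operatorContext = operatorContext;
          }

          // Returns the fs to use for the row group located at the given file path.
          protected abstract DrillFileSystem get(Configuration config, String path)
              throws IOException;
        }

        // Drill parquet files: a single fs shared by all row groups (old behavior).
        class ParquetDrillFileSystemManager extends AbstractDrillFileSystemManager {

          private DrillFileSystem fs;

          ParquetDrillFileSystemManager(OperatorContext operatorContext) {
            super(operatorContext);
          }

          @Override
          protected DrillFileSystem get(Configuration config, String path) throws IOException {
            if (fs == null) {
              fs = operatorContext.newFileSystem(config); // created once, then reused
            }
            return fs;
          }
        }

        // Hive files: one fs per file, cached by path, so two row groups that
        // belong to the same file get the same fs.
        class HiveDrillNativeParquetDrillFileSystemManager extends AbstractDrillFileSystemManager {

          private final Map<String, DrillFileSystem> fileSystems = new HashMap<>();

          HiveDrillNativeParquetDrillFileSystemManager(OperatorContext operatorContext) {
            super(operatorContext);
          }

          @Override
          protected DrillFileSystem get(Configuration config, String path) throws IOException {
            DrillFileSystem fs = fileSystems.get(path);
            if (fs == null) {
              // each file gets its own config, built upstream via the projection pusher
              fs = operatorContext.newFileSystem(config);
              fileSystems.put(path, fs);
            }
            return fs;
          }
        }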
    
    But I agree that for a tracking fs (i.e. when 
    `store.parquet.reader.pagereader.async` is set to false) this will make a 
    mess of the calculations. So I suggest the following fix: for Hive we'll 
    always create a non-tracking fs; for Drill it will depend on the 
    `store.parquet.reader.pagereader.async` option. I'll also add checks in 
    the operator context to disallow creating more than one tracking fs, and 
    to disallow creating a tracking fs at all when one or more non-tracking 
    ones have already been created. A sketch of these checks follows below.
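
    A rough sketch of those checks (my reading of the fix described above; 
    the `newNonTrackingFileSystem` name and the field layout are assumptions, 
    not the merged code):

        // Inside BaseOperatorContext (sketch). Preconditions is Guava's
        // com.google.common.base.Preconditions, already used in this class.
        private DrillFileSystem trackingFs;
        private final List<DrillFileSystem> nonTrackingFileSystems = new ArrayList<>();

        // Tracking fs: wired to operator stats; at most one per operator context,
        // and forbidden once any non-tracking fs exists.
        public DrillFileSystem newFileSystem(Configuration conf) throws IOException {
          Preconditions.checkState(trackingFs == null,
              "Tried to create a second tracking FileSystem. Only one is allowed per OperatorContext");
          Preconditions.checkState(nonTrackingFileSystems.isEmpty(),
              "Tried to create a tracking FileSystem when non-tracking ones already exist");
          trackingFs = new DrillFileSystem(conf, getStats()); // non-null stats => tracking
          return trackingFs;
        }

        // Non-tracking fs: no stats, so any number can be created (the Hive case).
        public DrillFileSystem newNonTrackingFileSystem(Configuration conf) throws IOException {
          DrillFileSystem fs = new DrillFileSystem(conf, null); // null stats => non-tracking
          nonTrackingFileSystems.add(fs);
          return fs;
        }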


> Parquet filter pushdown does not support the native hive reader
> ---------------------------------------------------------------
>
>                 Key: DRILL-6331
>                 URL: https://issues.apache.org/jira/browse/DRILL-6331
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Hive
>    Affects Versions: 1.13.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Initially HiveDrillNativeParquetGroupScan was based mainly on HiveScan; the
> core difference between them was that HiveDrillNativeParquetScanBatchCreator
> created a ParquetRecordReader instead of a HiveReader.
> This allowed Hive parquet files to be read using the Drill native parquet
> reader, but did not expose Hive data to Drill optimizations such as filter
> push down, limit push down, and the count-to-direct-scan optimization.
> The Hive code had to be refactored to use the same interfaces as
> ParquetGroupScan in order to be exposed to such optimizations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
