[ 
https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bridget Bevens updated DRILL-6744:
----------------------------------
    Labels: doc-complete ready-to-commit  (was: doc-impacting ready-to-commit)

> Support filter push down for varchar / decimal data types
> ---------------------------------------------------------
>
>                 Key: DRILL-6744
>                 URL: https://issues.apache.org/jira/browse/DRILL-6744
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.14.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-complete, ready-to-commit
>             Fix For: 1.15.0
>
>
> Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly 
> stored varchar / decimal min / max statistics is resolved, we should add 
> support for varchar / decimal filter push down. Only files created with 
> Parquet library 1.9.1 (1.10.0) and later will be subject to push down. If the 
> user knows that files created earlier have correct min / max statistics (i.e. 
> the user knows for certain that the data in the binary columns is ASCII, not 
> UTF-8), then parquet.strings.signed-min-max.enabled can be set to true to 
> enable filter push down.
> *Description*
> _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._
> *Varchar Partition Pruning*
> Varchar partition pruning will work for files generated both before and after 
> Parquet 1.10.0, since partition pruning requires the min and max values to be 
> equal, and statistics for binary data are not stored incorrectly when the min 
> and max values are the same. Partition pruning using Drill metadata files will 
> also work, no matter when the metadata file was created (before or after 
> Drill 1.15.0).
> Partition pruning won't work for files where the partition value is null due 
> to PARQUET-1341; this issue will be fixed in Parquet 1.11.0.
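> For example (a minimal sketch; the {{dfs.tmp}} workspace, the {{orders}} 
> table and its {{region}} varchar column are hypothetical), a table 
> partitioned with CTAS can be pruned by the varchar partition column:
> {noformat}
> CREATE TABLE dfs.tmp.`orders` PARTITION BY (region) AS
> SELECT * FROM dfs.tmp.`orders_source`;
> -- the equality filter on the partition column lets whole files be pruned
> SELECT * FROM dfs.tmp.`orders` WHERE region = 'EMEA';
> {noformat}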
> *Varchar Filter Push Down*
> Varchar filter push down will work for parquet files created with Parquet 
> 1.10.0 and later.
> There are two options to enable push down for files generated with earlier 
> Parquet versions, when the user knows for certain that the binary data is in 
> ASCII (not UTF-8):
> 1. Set the configuration property {{enableStringsSignedMinMax}} to true 
> (false by default) for the parquet format plugin: 
> {noformat}
>         "parquet" : {
>           type: "parquet",
>           enableStringsSignedMinMax: true 
>         }
> {noformat}
> This would apply to all parquet files of a given file plugin, including all 
> workspaces.
> 2. To enable / disable reading binary statistics from old parquet files per 
> session, the session option {{store.parquet.reader.strings_signed_min_max}} 
> can be used. By default it has an empty string value. Setting this option 
> takes priority over the config in the parquet format plugin. The option 
> allows three values: 'true', 'false', '' (empty string).
> _Note: store.parquet.reader.strings_signed_min_max can also be set at the 
> system level, in which case it applies to all parquet files in the system._
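> For example, to set the option for the current session or for the whole 
> system (a minimal sketch):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> -- or, at system level:
> ALTER SYSTEM SET `store.parquet.reader.strings_signed_min_max` = 'true';
> {noformat}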
> The same config / session option also controls reading binary statistics 
> from Drill metadata files generated prior to Drill 1.15.0. If the Drill 
> metadata file was created prior to Drill 1.15.0 but for parquet files 
> created with Parquet library 1.10.0 and later, the user has to enable the 
> config / session option or regenerate the Drill metadata file with Drill 
> 1.15.0 or later, because the metadata file alone does not tell us whether 
> the statistics are stored correctly (earlier Drill versions were writing and 
> reading binary statistics by default, though they did not use them).
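> The metadata file can be regenerated with the current Drill version using, 
> for example (the table path is hypothetical):
> {noformat}
> REFRESH TABLE METADATA dfs.tmp.`orders`;
> {noformat}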
> When creating a Drill metadata file with Drill 1.15.0 or later for old 
> parquet files, the user should keep the config / session option in mind. If 
> strings_signed_min_max is enabled, Drill will store binary statistics in the 
> Drill metadata file, but since the metadata file was created with Drill 
> 1.15.0 or later, Drill will read it back disregarding the option (assuming 
> that if statistics are present in the Drill metadata file, they are 
> correct). If the user mistakenly enabled strings_signed_min_max, they need 
> to disable it and regenerate the Drill metadata file, as shown below. The 
> same applies in the opposite direction: if the metadata file was created 
> while strings_signed_min_max was disabled, no min / max values for binary 
> statistics will be written, and therefore none will be read back, even if 
> strings_signed_min_max is enabled when the metadata is read.
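> For example, to correct a metadata file created while the option was 
> mistakenly enabled (a minimal sketch; the table path is hypothetical):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'false';
> REFRESH TABLE METADATA dfs.tmp.`orders`;
> {noformat}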
> *Decimal Partition Pruning*
> Decimal values can be represented in four logical types: int_32, int_64, 
> fixed_len_byte_array and binary.
> Partition pruning will work for all logical types for both old and new 
> decimal files, i.e. files created before and after Parquet 1.10.0. Partition 
> pruning won't work for files with a null partition due to PARQUET-1341, 
> which will be fixed in Parquet 1.11.0.
> Partition pruning with a Drill metadata file will work for old and new 
> decimal files regardless of which Drill version created the metadata file.
> *Decimal Filter Push Down*
> For int_32 / int_64 decimals, push down will work only for new files (i.e. 
> generated by Parquet 1.10.0 and later); for old files push down won't work 
> due to PARQUET-1322.
> For old int_32 / int_64 decimal files, push down will work with an old Drill 
> metadata file, i.e. one created prior to Drill 1.14.0. With a Drill metadata 
> file generated after Drill 1.14.0, push down won't work, since such a file 
> is generated after the upgrade to Parquet 1.10.0 (due to PARQUET-1322). For 
> new int_32 / int_64 decimal files, push down will work with a new Drill 
> metadata file.
> For old fixed_len_byte_array / binary decimal files generated prior to 
> Parquet 1.10.0, filter push down won't work. Push down with an old Drill 
> metadata file works only if the strings_signed_min_max config / session 
> option is set to true. Push down with a new Drill metadata file won't work.
> For new fixed_len_byte_array / binary files, filter push down will work with 
> and without a metadata file (provided the Drill metadata file was generated 
> by Drill 1.15.0 or later). If the Drill metadata file was generated prior to 
> Drill 1.15.0, to enable reading such statistics the user needs to enable the 
> strings_signed_min_max config / session option or re-generate the Drill 
> metadata file.
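> As an illustration (a minimal sketch; the table and its decimal {{amount}} 
> column are hypothetical), a filter of the following form is a candidate for 
> decimal push down when statistics are available per the rules above:
> {noformat}
> SELECT * FROM dfs.tmp.`sales` WHERE amount > 100.50;
> {noformat}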
> *Hive Varchar Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library prior to version 
> 1.10.0, where statistics for binary UTF-8 data can be stored incorrectly. If 
> the user knows for certain that the data in the binary columns is ASCII (not 
> UTF-8), the session option store.parquet.reader.strings_signed_min_max can 
> be set to 'true' to enable varchar filter push down.
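> For example (a minimal sketch; the Hive table and column names are 
> hypothetical, and the Drill native Parquet reader for Hive tables is assumed 
> to be enabled):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> SELECT * FROM hive.`web_logs` WHERE user_agent = 'curl';
> {noformat}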
> *Hive Decimal Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library prior to version 
> 1.10.0; decimal statistics are not available for such files, so push down 
> won't work with Hive parquet decimal files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
