[ 
https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686559#comment-16686559
 ] 

ASF GitHub Bot commented on DRILL-6744:
---------------------------------------

arina-ielchiieva commented on issue #1537: DRILL-6744: Support varchar and 
decimal push down
URL: https://github.com/apache/drill/pull/1537#issuecomment-438675122
 
 
   @vvysotskyi addressed code review comments:
   1. Used VersionUtil from Hadoop lib.
   2. Made ParquetReaderConfig immutable.
   3. Added trace logging instead of removing the default case in the switch.
   4. Made other changes as requested.
   
   Thanks for the code review.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support filter push down for varchar / decimal data types
> ---------------------------------------------------------
>
>                 Key: DRILL-6744
>                 URL: https://issues.apache.org/jira/browse/DRILL-6744
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.14.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-impacting
>             Fix For: 1.15.0
>
>
> Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly 
> stored varchar / decimal min / max statistics is resolved, we should add 
> support for varchar / decimal filter push down. Only files created with 
> parquet lib 1.9.1 / 1.10.0 and later will be subject to push down. In 
> cases where the user knows that previously created files have correct min / max 
> statistics (i.e. the user knows for sure that data in binary columns is ASCII, not 
> UTF-8), parquet.strings.signed-min-max.enabled can be set to true to 
> enable filter push down.
> *Description*
> _Note: Drill is using Parquet 1.10.0 library since 1.13.0 version._
> *Varchar Partition Pruning*
> Varchar pruning will work for files generated both before and after Parquet 
> 1.10.0, since partition pruning requires min and max values to be the same, 
> and there are no issues with incorrectly stored statistics for binary data 
> when min and max values are equal. Partition pruning using Drill 
> metadata files will also work, no matter when the metadata file was created 
> (before or after Drill 1.15.0).
> Partition pruning won't work for files where the partition value is null due to 
> PARQUET-1341; this issue will be fixed in Parquet 1.11.0.
> *Varchar Filter Push Down*
> Varchar filter push down will work for parquet files created with Parquet 
> 1.10.0 and later.
> There are two options to enable push down for files generated with earlier 
> Parquet versions, when the user knows for sure that binary data is ASCII (not 
> UTF-8):
> 1. set configuration {{enableStringsSignedMinMax}} to true (false by default) 
> for parquet format plugin: 
> {noformat}
>         "parquet" : {
>           type: "parquet",
>           enableStringsSignedMinMax: true 
>         }
> {noformat}
> This applies to all parquet files handled by the given file plugin, across all 
> workspaces.
> 2. If the user wants to enable / disable reading binary statistics for 
> old parquet files per session, the session option 
> {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it 
> has an empty string value. Setting this option takes priority over the config in 
> the parquet format plugin. The option allows three values: 'true', 'false', '' (empty 
> string).
> _Note: store.parquet.reader.strings_signed_min_max can also be set at the system 
> level, in which case it applies to all parquet files in the system._
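> For example, the option can be managed per session or system-wide (a sketch; 
> option name and values as described above):
> {noformat}
> -- per session (takes priority over the format plugin config)
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
>
> -- system-wide, applies to all parquet files in the system
> ALTER SYSTEM SET `store.parquet.reader.strings_signed_min_max` = 'true';
>
> -- back to the default empty-string value
> ALTER SESSION RESET `store.parquet.reader.strings_signed_min_max`;
> {noformat}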
> The same config / session option applies to reading binary 
> statistics from Drill metadata files generated prior to Drill 1.15.0. If a 
> Drill metadata file was created prior to Drill 1.15.0, even for parquet files 
> created with Parquet library 1.10.0 and later, the user has to enable the 
> config / session option or regenerate the Drill metadata file with Drill 1.15.0 
> or later, because from the metadata file we don't know whether the statistics 
> are stored correctly (prior to 1.15.0, Drill was writing binary 
> statistics by default, though it did not use them).
> When creating a Drill metadata file with Drill 1.15.0 or later for old parquet 
> files, the user should mind the config / session option. If strings_signed_min_max is 
> enabled, Drill will store binary statistics in the Drill metadata file, and 
> since the metadata file was created with Drill 1.15.0 or later, Drill will read 
> it back disregarding the option (assuming that if statistics are present in 
> the Drill metadata file, they are correct). If the user mistakenly enabled 
> strings_signed_min_max, they need to disable it and regenerate the Drill metadata 
> file. The same applies in the opposite direction: if the user created the metadata 
> file while strings_signed_min_max was disabled, no min / max values for binary 
> columns will be written, and thus none read back, even if strings_signed_min_max 
> is enabled when reading the metadata.
> *Decimal Partition Pruning*
> Decimal values can be represented by four logical types: int_32, int_64, 
> fixed_len_byte_array and binary.
> Partition pruning will work for all logical types for old and new decimal 
> files, i.e. created before and after Parquet 1.10.0. Partition pruning 
> won't work for files with a null partition due to PARQUET-1341, which will be 
> fixed in Parquet 1.11.0.
> Partition pruning with a Drill metadata file will work for old and new decimal 
> files, regardless of which Drill version created the metadata file.
> *Decimal Filter Push Down*
> For int_32 / int_64 decimals, push down will work only for new files (i.e. 
> generated by Parquet 1.10.0 and later); for old files push down won't work 
> due to PARQUET-1322.
> For old int_32 / int_64 decimals, push down will work with an old Drill metadata 
> file, i.e. one created prior to Drill 1.14.0; with a Drill metadata file generated 
> after Drill 1.14.0 push down won't work, since that file was generated after the 
> upgrade to Parquet 1.10.0 (due to PARQUET-1322). For new int_32 / int_64 decimals, 
> push down will work with a new Drill metadata file.
> For old fixed_len_byte_array / binary decimal files generated prior to 
> Parquet 1.10.0, filter push down won't work. Push down with an old Drill metadata 
> file works only if the strings_signed_min_max config / session option is set to true. 
> Push down with a new Drill metadata file won't work.
> For new fixed_len_byte_array / binary files, filter push down will work with 
> and without a metadata file (provided the Drill metadata file was generated by 
> Drill 1.15.0 or later). If the Drill metadata file was generated prior to Drill 
> 1.15.0, to enable reading such statistics the user needs to enable the 
> strings_signed_min_max config / session option or re-generate the Drill metadata 
> file.
> *Hive Varchar Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library prior to version 
> 1.10.0, where statistics for binary UTF-8 data can be stored incorrectly. If the 
> user knows for sure that data in the binary columns is ASCII (not UTF-8), the 
> session option store.parquet.reader.strings_signed_min_max can be set to 
> 'true' to enable varchar filter push down.
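> For example, when the Hive binary columns are known to hold ASCII data only 
> (a sketch using the session option named above):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> {noformat}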
> *Hive Decimal Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library prior to version 
> 1.10.0; decimal statistics for such files are not available, thus push down 
> won't work with Hive parquet decimal files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
