[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720827#comment-16720827 ]
Bridget Bevens commented on DRILL-6744:
---------------------------------------
Hi [~arina] and [~vitalii], I created and shared a first draft of the content [here|https://docs.google.com/document/d/1qWf6eiA-18Xm_vVSBpH9swpBGplB4GPYrxJysKHLMDA/edit?usp=sharing]. Please have a look and let me know what changes I need to make.
Thanks,
Bridget

> Support filter push down for varchar / decimal data types
> ---------------------------------------------------------
>
>                 Key: DRILL-6744
>                 URL: https://issues.apache.org/jira/browse/DRILL-6744
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.14.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.15.0
>
> Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly stored varchar / decimal min / max statistics is resolved, we should add support for varchar / decimal filter push down. Only files created with parquet lib 1.9.1 (1.10.0) and later will be subject to push down. If the user knows that previously created files have correct min / max statistics (i.e. the user knows for certain that the data in binary columns is ASCII, not UTF-8), then parquet.strings.signed-min-max.enabled can be set to true to enable filter push down.
> *Description*
> _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._
> *Varchar Partition Pruning*
> Varchar pruning works for files generated both before and after Parquet 1.10.0: partition pruning requires the min and max values to be equal, and there is no issue with incorrectly stored binary statistics when the min and max values are the same. Partition pruning using Drill metadata files also works, no matter when the metadata file was created (before or after Drill 1.15.0).
> Partition pruning won't work for files where the partition is null due to PARQUET-1341; the issue will be fixed in Parquet 1.11.0.
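> As an illustration (the table and column names below are hypothetical, not taken from this issue), a varchar filter that can benefit from pruning / push down looks like:
> {noformat}
> -- `city` is a varchar column; with usable min / max statistics Drill can
> -- skip row groups whose statistics exclude the value 'London'
> SELECT * FROM dfs.`/data/orders` WHERE city = 'London';
> {noformat}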
> *Varchar Filter Push Down*
> Varchar filter push down works for parquet files created with Parquet 1.10.0 and later.
> There are two ways to enable push down for files generated with earlier Parquet versions, when the user knows for certain that the binary data is ASCII (not UTF-8):
> 1. Set the configuration {{enableStringsSignedMinMax}} to true (false by default) for the parquet format plugin:
> {noformat}
> "parquet" : {
>   type: "parquet",
>   enableStringsSignedMinMax: true
> }
> {noformat}
> This applies to all parquet files of the given file plugin, including all workspaces.
> 2. To enable / disable reading binary statistics for old parquet files per session, the session option {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it has an empty string value. Setting this option takes priority over the config in the parquet format plugin. The option allows three values: 'true', 'false', '' (empty string).
> _Note: store.parquet.reader.strings_signed_min_max can also be set at the system level, in which case it applies to all parquet files in the system._
> The same config / session option controls reading binary statistics from Drill metadata files generated prior to Drill 1.15.0. If the Drill metadata file was created prior to Drill 1.15.0 but for parquet files created with the Parquet library 1.10.0 or later, the user has to enable the config / session option or regenerate the Drill metadata file with Drill 1.15.0 or later, because the metadata file does not indicate whether the statistics are stored correctly (earlier Drill versions wrote and read binary statistics by default, though they did not use them).
> When creating a Drill metadata file with Drill 1.15.0 or later for old parquet files, the user should mind the config / session option.
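> As a sketch, the session option can be set with a standard ALTER SESSION statement, and the Drill metadata file can be regenerated with REFRESH TABLE METADATA (the table path below is hypothetical):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> -- regenerate the Drill metadata file with the current Drill version:
> REFRESH TABLE METADATA dfs.`/path/to/parquet/table`;
> -- revert the option to its default empty-string value when done:
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = '';
> {noformat}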
If strings_signed_min_max is enabled, Drill will store binary statistics in the Drill metadata file; and since the metadata file was created with Drill 1.15.0 or later, Drill will read it back regardless of the option (assuming that if statistics are present in a Drill metadata file, they are correct). If the user enabled strings_signed_min_max by mistake, they need to disable it and regenerate the Drill metadata file. The same applies in the opposite direction: if the user created the metadata file while strings_signed_min_max was disabled, no min / max values for binary statistics will be written, and thus none will be read back, even if strings_signed_min_max is enabled when reading the metadata.
> *Decimal Partition Pruning*
> Decimal values can be represented with four logical types: int_32, int_64, fixed_len_byte_array and binary.
> Partition pruning works for all logical types for old and new decimal files, i.e. those created both before and after Parquet 1.10.0. Partition pruning won't work for files with a null partition due to PARQUET-1341, which will be fixed in Parquet 1.11.0.
> Partition pruning with a Drill metadata file works for old and new decimal files, regardless of which Drill version created the metadata file.
> *Decimal Filter Push Down*
> For int_32 / int_64 decimals, push down works only for new files (i.e. those generated by Parquet 1.10.0 and later); for old files push down won't work due to PARQUET-1322.
> For old int_32 / int_64 decimals, push down works with an old Drill metadata file, i.e. one created prior to Drill 1.14.0; for a Drill metadata file generated after Drill 1.14.0, push down won't work, since such files are generated after the upgrade to Parquet 1.10.0 (due to PARQUET-1322). For new int_32 / int_64 decimals, push down works with a new Drill metadata file.
> For old fixed_len_byte_array / binary decimal files generated prior to Parquet 1.10.0, filter push down won't work.
Push down with an old Drill metadata file works only if the strings_signed_min_max config / session option is set to true. Push down with a new Drill metadata file won't work.
> For new fixed_len_byte_array / binary files, filter push down works with and without a metadata file (but only if the Drill metadata file was generated by Drill 1.15.0 or later). If the Drill metadata file was generated prior to Drill 1.15.0, to enable reading such statistics the user needs to enable the strings_signed_min_max config / session option or re-generate the Drill metadata file.
> *Hive Varchar Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library earlier than 1.10.0, where statistics for binary UTF-8 data can be stored incorrectly. If the user knows for certain that the data in the binary columns is ASCII (not UTF-8), the session option store.parquet.reader.strings_signed_min_max can be set to 'true' to enable varchar filter push down.
> *Hive Decimal Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library earlier than 1.10.0; decimal statistics for such files are not available, thus push down won't work with Hive parquet decimal files.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)