[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199217282

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ##
@@ -417,16 +417,9 @@ public static DateCorruptionStatus checkForCorruptDateValuesInStatistics(Parquet
         // column does not appear in this file, skip it
         continue;
       }
-      Statistics statistics = footer.getBlocks().get(rowGroupIndex).getColumns().get(colIndex).getStatistics();
-      Integer max = (Integer) statistics.genericGetMax();
-      if (statistics.hasNonNullValue()) {
-        if (max > ParquetReaderUtility.DATE_CORRUPTION_THRESHOLD) {
-          return DateCorruptionStatus.META_SHOWS_CORRUPTION;
-        }
-      } else {
-        // no statistics, go check the first page
-        return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
-      }
+      IntStatistics statistics = (IntStatistics)footer.getBlocks().get(rowGroupIndex).getColumns().get(colIndex).getStatistics();

Review comment:
I don't see any specific code style with regard to spacing after a cast (grep -I -R \(\([a-zA-Z0-9_]*\)[a-zA-Z0-9_] * | grep -c java). It seems to be each contributor's preference.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199215493

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicatesHelper.java ##
@@ -39,22 +40,21 @@ static boolean isNullOrEmpty(Statistics stat) {
    *
    * @param stat parquet column statistics
    * @param rowCount number of rows in the parquet file
-   * @return True if all rows are null in the parquet file
-   *          False if at least one row is not null.
+   * @return true if all rows are null in the parquet file and false otherwise
    */
   static boolean isAllNulls(Statistics stat, long rowCount) {
-    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
+    Preconditions.checkArgument(rowCount >= 0, String.format("negative rowCount %d is not valid", rowCount));

Review comment:
It does not matter where it comes from (it actually comes from `RowGroupInfo`). The condition needs to be intercepted prior to calling `isAllNulls()` as `isAllNulls()` would return the wrong result (whether true or false) for a negative row count.
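The argument in this comment can be illustrated with a minimal, dependency-free sketch. It is not the code from the pull request: `ColumnStats` is a hypothetical stand-in for parquet's `Statistics` that exposes only the two accessors visible in the diff, and a plain `IllegalArgumentException` (the exception Guava's `Preconditions.checkArgument` throws) replaces the Guava call. The diff shows only the first line of the new method body, so the comparison after the check is assumed to be carried over from the removed line.

```java
/** Hypothetical stand-in for parquet's Statistics, exposing only the two accessors used in the diff. */
final class ColumnStats {
  private final boolean numNullsSet;
  private final long numNulls;

  ColumnStats(boolean numNullsSet, long numNulls) {
    this.numNullsSet = numNullsSet;
    this.numNulls = numNulls;
  }

  boolean isNumNullsSet() {
    return numNullsSet;
  }

  long getNumNulls() {
    return numNulls;
  }
}

final class IsAllNullsSketch {

  /** Mirrors the removed line: a negative rowCount silently produces an answer. */
  static boolean isAllNullsUnchecked(ColumnStats stat, long rowCount) {
    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
  }

  /** Fail-fast variant: reject invalid input before computing anything. */
  static boolean isAllNulls(ColumnStats stat, long rowCount) {
    if (rowCount < 0) {
      // Plain-Java equivalent of the Preconditions.checkArgument call in the diff.
      throw new IllegalArgumentException(
          String.format("negative rowCount %d is not valid", rowCount));
    }
    // Assumption: this comparison is carried over from the removed line.
    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
  }

  public static void main(String[] args) {
    ColumnStats stats = new ColumnStats(true, 10);
    // Valid input: ten rows, ten nulls.
    System.out.println(isAllNulls(stats, 10));
    // Invalid input: the unchecked variant answers anyway, and the caller cannot tell.
    System.out.println(isAllNullsUnchecked(stats, -1));
    try {
      isAllNulls(stats, -1);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With the check in place, a caller that passes a negative count sees an exception at the call site rather than a boolean that later drives filter pruning.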
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199210267

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicatesHelper.java ##
@@ -39,22 +40,21 @@ static boolean isNullOrEmpty(Statistics stat) {
    *
    * @param stat parquet column statistics
    * @param rowCount number of rows in the parquet file
-   * @return True if all rows are null in the parquet file
-   *          False if at least one row is not null.
+   * @return true if all rows are null in the parquet file and false otherwise
    */
   static boolean isAllNulls(Statistics stat, long rowCount) {
-    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
+    Preconditions.checkArgument(rowCount >= 0, String.format("negative rowCount %d is not valid", rowCount));

Review comment:
I hope for the reverse. A negative row count passed to this method indicates a bug, and a bug means an incorrect result. I'd rather fail a query than give a wrong result.
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199208654

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicatesHelper.java ##
@@ -39,22 +40,21 @@ static boolean isNullOrEmpty(Statistics stat) {
    *
    * @param stat parquet column statistics
    * @param rowCount number of rows in the parquet file
-   * @return True if all rows are null in the parquet file
-   *          False if at least one row is not null.
+   * @return true if all rows are null in the parquet file and false otherwise
    */
   static boolean isAllNulls(Statistics stat, long rowCount) {
-    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
+    Preconditions.checkArgument(rowCount >= 0, String.format("negative rowCount %d is not valid", rowCount));

Review comment:
To validate the input. The method can't give the correct answer if the input is junk (`rowCount` is negative).
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199207853

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ##
@@ -417,16 +417,9 @@ public static DateCorruptionStatus checkForCorruptDateValuesInStatistics(Parquet
         // column does not appear in this file, skip it
         continue;
       }
-      Statistics statistics = footer.getBlocks().get(rowGroupIndex).getColumns().get(colIndex).getStatistics();
-      Integer max = (Integer) statistics.genericGetMax();
-      if (statistics.hasNonNullValue()) {
-        if (max > ParquetReaderUtility.DATE_CORRUPTION_THRESHOLD) {
-          return DateCorruptionStatus.META_SHOWS_CORRUPTION;
-        }
-      } else {
-        // no statistics, go check the first page
-        return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
-      }
+      IntStatistics statistics = (IntStatistics)footer.getBlocks().get(rowGroupIndex).getColumns().get(colIndex).getStatistics();

Review comment:
Please see parquet format spec. `DATE` is always `int32`.
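Because `DATE` is stored as `int32`, the maximum in the column statistics can be read as a primitive `int` rather than unboxed from a generic value. The following is a rough sketch of the check under that assumption, not the code from the pull request (the diff shows only the cast, not the lines that follow it). `IntStats` is a hypothetical stand-in for parquet's `IntStatistics`, the enum lists only the two constants that appear in the diff, an empty `Optional` is an illustrative way to say "nothing conclusive, keep scanning", and the threshold is passed in as a parameter because the real value of `DATE_CORRUPTION_THRESHOLD` is not given here.

```java
import java.util.Optional;

/** Only the two constants that appear in the diff. */
enum DateCorruptionStatus {
  META_SHOWS_CORRUPTION,
  META_UNCLEAR_TEST_VALUES
}

/** Hypothetical stand-in for parquet's IntStatistics. */
final class IntStats {
  private final boolean hasNonNullValue;
  private final int max;

  IntStats(boolean hasNonNullValue, int max) {
    this.hasNonNullValue = hasNonNullValue;
    this.max = max;
  }

  boolean hasNonNullValue() {
    return hasNonNullValue;
  }

  int getMax() {
    return max;
  }
}

final class DateStatsCheckSketch {

  /**
   * Returns a status when the statistics are conclusive or unusable,
   * and an empty Optional when this column shows no sign of corruption.
   */
  static Optional<DateCorruptionStatus> check(IntStats statistics, int threshold) {
    if (!statistics.hasNonNullValue()) {
      // No usable statistics: the caller has to inspect actual values.
      return Optional.of(DateCorruptionStatus.META_UNCLEAR_TEST_VALUES);
    }
    if (statistics.getMax() > threshold) {
      // The largest stored day number exceeds the threshold.
      return Optional.of(DateCorruptionStatus.META_SHOWS_CORRUPTION);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    int illustrativeThreshold = 1_000_000; // placeholder, not Drill's real constant
    System.out.println(check(new IntStats(false, 0), illustrativeThreshold));
    System.out.println(check(new IntStats(true, 2_000_000), illustrativeThreshold));
    System.out.println(check(new IntStats(true, 17_000), illustrativeThreshold));
  }
}
```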
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199205436

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicatesHelper.java ##
@@ -39,22 +40,21 @@ static boolean isNullOrEmpty(Statistics stat) {
    *
    * @param stat parquet column statistics
    * @param rowCount number of rows in the parquet file
-   * @return True if all rows are null in the parquet file
-   *          False if at least one row is not null.
+   * @return true if all rows are null in the parquet file and false otherwise
    */
   static boolean isAllNulls(Statistics stat, long rowCount) {
-    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
+    Preconditions.checkArgument(rowCount >= 0, String.format("negative rowCount %d is not valid", rowCount));

Review comment:
To avoid bugs :). It is invalid to call `isAllNulls()` with a negative `rowCount`.
[GitHub] vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
vrozov commented on a change in pull request #1349: DRILL-6554: Minor code improvements in parquet statistics handling
URL: https://github.com/apache/drill/pull/1349#discussion_r199204147

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicatesHelper.java ##
@@ -39,22 +40,21 @@ static boolean isNullOrEmpty(Statistics stat) {
    *
    * @param stat parquet column statistics
    * @param rowCount number of rows in the parquet file
-   * @return True if all rows are null in the parquet file
-   *          False if at least one row is not null.
+   * @return true if all rows are null in the parquet file and false otherwise
    */
   static boolean isAllNulls(Statistics stat, long rowCount) {
-    return stat.isNumNullsSet() && stat.getNumNulls() == rowCount;
+    Preconditions.checkArgument(rowCount >= 0, String.format("negative rowCount %d is not valid", rowCount));

Review comment:
When can rowCount be negative?