[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319366325 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[jira] [Updated] (PARQUET-2343) Fixes NPE when rewriting file with multiple rowgroups

2023-09-07 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated PARQUET-2343: - Fix Version/s: 1.13.2 > Fixes NPE when rewriting file with multiple rowgroups >

[jira] [Resolved] (PARQUET-2343) Fixes NPE when rewriting file with multiple rowgroups

2023-09-07 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2343. -- Resolution: Fixed > Fixes NPE when rewriting file with multiple rowgroups >

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319252134 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318930925 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319219710 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319212210 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1038,25 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319135598 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319130534 ## src/main/thrift/parquet.thrift: ## @@ -583,7 +659,12 @@ struct DataPageHeaderV2 { If missing it is considered compressed */ 7: optional bool

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319130288 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319059263 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319059263 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1319010418 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318957040 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318957040 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318930925 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318928626 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318848187 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318770227 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318737835 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318576796 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318570484 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318565796 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318413890 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318413084 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318393474 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318386752 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318384010 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318367540 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318362226 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318362226 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
mapleFU commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318347248 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] pitrou commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
pitrou commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318334927 ## src/main/thrift/parquet.thrift: ## @@ -191,6 +191,73 @@ enum FieldRepetitionType { REPEATED = 2; } +/** + * A histogram of repetition and definition

[GitHub] [parquet-format] tustvold commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
tustvold commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318314149 ## src/main/thrift/parquet.thrift: ## @@ -529,7 +596,15 @@ struct DataPageHeader { /** Encoding used for repetition levels **/ 4: required Encoding

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318292827 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318273187 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318273187 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] JFinis commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
JFinis commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318260892 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5:

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-09-07 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318190332 ## src/main/thrift/parquet.thrift: ## @@ -977,6 +1073,15 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: