Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-12-04 Thread via GitHub
shangxinli commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414220467 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java: ## @@ -409,4 +428,14 @@ abstract void writePage( ValuesWriter definit

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-12-04 Thread via GitHub
shangxinli commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414218649 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java: ## @@ -389,7 +400,14 @@ void writePage() { this.rowsWrittenSoFar += pag

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403914499 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403909642 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -316,7 +346,14 @@ private int toIntWithCheck(long size) { retur

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403907724 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java: ## @@ -80,11 +90,22 @@ public void add(int compressedPageSize,

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403906914 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndex.java: ## @@ -57,4 +57,16 @@ public interface ColumnIndex extends Visitor {

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403906654 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
wgtmac commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1825101267 > Hi @wgtmac thanks for this great work. Could this influence the rewriter? How could we rebuild the `SizeStatistics` during rewriting? Thanks for your review! IMO, the rewriter c

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1824649905 Hi @wgtmac thanks for this great work. Could this influence the rewriter? How could we rebuild the `SizeStatistics` during rewriting? -- This is an automated message from the Apache

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403548852 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java: ## @@ -775,6 +848,51 @@ public void writeDataPageV2( uncompressedDataSize,

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403542243 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -316,7 +346,14 @@ private int toIntWithCheck(long size) { ret

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403537889 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -316,7 +346,14 @@ private int toIntWithCheck(long size) { ret

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403536463 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -152,13 +157,26 @@ public void writePage(BytesInput bytesInput, int

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403535802 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -87,6 +90,7 @@ private static final class ColumnChunkPageWriter impl

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403534057 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java: ## @@ -116,11 +137,28 @@ private OffsetIndexBuilder() {

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403533624 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java: ## @@ -80,11 +90,22 @@ public void add(int compressedPageSiz

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403526389 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndex.java: ## @@ -57,4 +57,16 @@ public interface ColumnIndex extends Visitor

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403513092 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403509676 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403507869 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403507171 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1824590527 Thanks @wgtmac for your notification.

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403506285 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-23 Thread via GitHub
ConeyLiu commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1403505946 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-22 Thread via GitHub
wgtmac commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1823898172 cc @ConeyLiu, as I have modified the `mergeColumnStatistics` method, which you've just refactored.

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-22 Thread via GitHub
wgtmac commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1823895468 I have just rebased on the latest master branch and fixed all CI failures. As this PR is getting too large, I will add the print CLI command and rewriter support for `SizeStatistics` in follow-up

[PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-11-22 Thread via GitHub
wgtmac opened a new pull request, #1201: URL: https://github.com/apache/parquet-mr/pull/1201 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-21 Thread via GitHub
wgtmac commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1367696949 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-20 Thread via GitHub
etseidl commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1367498685 ## parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java: ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-20 Thread via GitHub
emkornfield commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1773069981 @wgtmac I took a scan through and this generally seems like what I expected. Thank you for doing it. Agreed that unit tests are needed.

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-19 Thread via GitHub
wgtmac commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1771949119 > Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the `colum

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-19 Thread via GitHub
etseidl commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1771735094 Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the `column

Re: [PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-19 Thread via GitHub
wgtmac commented on PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#issuecomment-1771189247 I have drafted the POC to read/write SizeStatistics. The feature implementation should be complete and associated tests will be added progressively. Please take a look when you have tim

[PR] PARQUET-2261: Implement SizeStatistics [parquet-mr]

2023-10-19 Thread via GitHub
wgtmac opened a new pull request, #1177: URL: https://github.com/apache/parquet-mr/pull/1177 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in