tcrasset commented on issue #38877:
URL: https://github.com/apache/arrow/issues/38877#issuecomment-1825926099

   Thank you for your notes @mapleFU.
   
   I need a clarification though:
   
   > For "accross multiple ColumnChunkMetadata", in fact, the Statistics only 
work for one column-chunk. We cannot regard it as a whole-file distinct-count.
   
   As I understand it from [the 
spec](https://parquet.apache.org/docs/file-format/), a file consists of one or 
more RowGroups, which contain one or more ColumnChunks.
   
   I understand we cannot regard it as a whole-file distinct count (as in the 
distinct count of all the columns combined), but is it a per-column distinct 
count, or a per-column-**chunk** distinct count? You seem to say it's 
per-column-chunk, but I want to be sure I understand correctly.
   
   ```text
   +-------+--------+
   | col_1 | col_2  |
   +-------+--------+
   | a     | b      |
   | x     | b      |
   ================== Row group boundary
   | b     | d      |
   | x     | d      |
   +-------+--------+
   ```
   
   Here we have 4 column chunks (2 columns x 2 row groups).
   
   So the per-chunk statistics would be
   
   ```text
   
   <Column col_1 Chunk 1 + Column Metadata> --> distinct_count = 2 ("a", "x")
   <Column col_2 Chunk 1 + Column Metadata> --> distinct_count = 1 ("b")
   <Column col_1 Chunk 2 + Column Metadata> --> distinct_count = 2 ("b", "x")
   <Column col_2 Chunk 2 + Column Metadata> --> distinct_count = 1 ("d")
   ```
   
   then, right? But the actual distinct count of col_1 is 3 ("a", "b", "x"), 
not 2 + 2 = 4, so we cannot add the per-chunk counts up.
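
To make the question concrete, here is a minimal pure-Python sketch of the example table. It does not use the Parquet or Arrow APIs; `row_groups`, `chunk_distinct_counts` and `column_distinct_count` are hypothetical names that only model the idea that each column chunk sees just its own values.

```python
# Hypothetical model of the example table, split at the row group boundary.
# Each dict is one row group; each list is one column chunk.
row_groups = [
    {"col_1": ["a", "x"], "col_2": ["b", "b"]},  # row group 1
    {"col_1": ["b", "x"], "col_2": ["d", "d"]},  # row group 2
]


def chunk_distinct_counts(row_groups, column):
    """Distinct count of each column chunk, computed in isolation."""
    return [len(set(rg[column])) for rg in row_groups]


def column_distinct_count(row_groups, column):
    """Distinct count of the whole column, which needs the actual values."""
    values = set()
    for rg in row_groups:
        values.update(rg[column])
    return len(values)


for column in ("col_1", "col_2"):
    per_chunk = chunk_distinct_counts(row_groups, column)
    print(column, per_chunk, sum(per_chunk), column_distinct_count(row_groups, column))
```

For col_1 the per-chunk counts sum to 4 while the real distinct count is 3, because "x" appears in both chunks. For col_2 the sum happens to match, but only because the two chunks share no values.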


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
