Hi Yash,
There are a few mechanisms in Parquet that can help with this. Not all of
them will be present in every Parquet file, and not all implementations
populate them or make use of them (e.g. C++ lacks a few):
1. Per-column statistics for each row group and data page [1]. Includes
min/max
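
To make the min/max idea concrete, here is a minimal sketch in plain Python.
It does not use any Parquet library: the RowGroup class, make_row_group and
scan_greater_than are hypothetical names made up for illustration, and the
statistics are simply computed when the group is built. A reader that keeps
min/max per row group can skip any group whose max rules out the predicate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group of one integer column."""
    values: List[int]
    min_value: int  # statistic recorded when the group is written
    max_value: int  # statistic recorded when the group is written


def make_row_group(values: List[int]) -> RowGroup:
    # A writer would compute these statistics once, at write time.
    return RowGroup(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # If even the largest value fails the predicate, skip the whole group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Illustrative data only: three row groups of an "income" column.
row_groups = [
    make_row_group([100, 250, 900]),
    make_row_group([800, 1200, 1500]),
    make_row_group([50, 60, 70]),
]
matches, groups_read = scan_greater_than(row_groups, 1000)
```

With this made-up data only the second group has to be read for
income > 1000. The pruning helps most when the data is sorted or clustered
on the filtered column, since the min/max ranges of the groups then overlap
less.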
[ https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155675#comment-17155675 ]
Eric Gorelik commented on PARQUET-1882:
---
Here's a minimal one.
{code:c++}
#include
#include
Hi,
If I want to query a Parquet file with a criterion such as income > 1000,
does Parquet support indexing of the columns to make it faster to identify
the records that meet the criterion? I know we can partition the file on a
column. But in my case assume it is already partitioned on a single column
[ https://issues.apache.org/jira/browse/PARQUET-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155616#comment-17155616 ]
Natang commented on PARQUET-1580:
-
Can this be backported to 1.10.1?
> Page-level CRC checksum
[ https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155322#comment-17155322 ]
Gabor Szadovszky commented on PARQUET-1883:
---
[~sha...@uber.com], [~satishkotha],
INT96 IS
Hi,
I wasn't aware that jemalloc uses mmap automatically for larger
allocations, and I haven't tested this yet.
The approach could be different in that we would know which parts of the
buffers are going to be used next (the buffers are append-only) and which
parts won't be needed until