Re: Does Parquet format provide indexing for quick retrieval based on column filters?

2020-07-10 Thread Micah Kornfield
Hi Yash, there are a few mechanisms in Parquet that can help with this. Not all of them will be present in every Parquet file, and not all implementations make use of them or populate them (e.g. C++ lacks a few): 1. Per-column statistics per row group and per data page [1]. Includes min/max

[jira] [Commented] (PARQUET-1882) Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-10 Thread Eric Gorelik (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155675#comment-17155675 ] Eric Gorelik commented on PARQUET-1882: --- Here's a minimal one.  {code:c++} #include #include

Does Parquet format provide indexing for quick retrieval based on column filters?

2020-07-10 Thread Yash Ganthe
Hi, if I want to query a Parquet file with a criterion such as income > 1000, does Parquet support indexing of the columns to make it faster to identify the records that match? I know we can partition the file on a column. But in my case assume it is already partitioned on a single column

[jira] [Commented] (PARQUET-1580) Page-level CRC checksum verification for DataPageV1

2020-07-10 Thread Natang (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155616#comment-17155616 ] Natang commented on PARQUET-1580: - Can this be backported to 1.10.1? > Page-level CRC checksum

[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-07-10 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155322#comment-17155322 ] Gabor Szadovszky commented on PARQUET-1883: --- [~sha...@uber.com], [~satishkotha], INT96 IS

Re: Writing very large rowgroups to Apache Parquet

2020-07-10 Thread Roman Karlstetter
Hi, I wasn't aware that jemalloc uses mmap automatically for larger allocations, and I haven't tested this yet. The approach could be different in that we would know which parts of the buffers are going to be used next (the buffers are append-only) and which parts won't be needed until