[jira] [Created] (PARQUET-2250) Expose column descriptor through RecordReader

2023-02-23 Thread fatemah (Jira)
fatemah created PARQUET-2250:


 Summary: Expose column descriptor through RecordReader
 Key: PARQUET-2250
 URL: https://issues.apache.org/jira/browse/PARQUET-2250
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: fatemah


Currently, the RecordReader does not expose the underlying column descriptor. 
Exposing it would be useful in some scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2225) [C++] Allow reading dense with RecordReader

2023-01-10 Thread fatemah (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fatemah updated PARQUET-2225:
-
Summary: [C++] Allow reading dense with RecordReader  (was: Allow reading 
dense with RecordReader)

> [C++] Allow reading dense with RecordReader
> ---
>
> Key: PARQUET-2225
> URL: https://issues.apache.org/jira/browse/PARQUET-2225
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Major
>
> Currently ReadRecords reads spaced by default. Some readers may need to read 
> the values dense, and reading spaced is less efficient than reading dense. We 
> need an option for reading dense.





[jira] [Created] (PARQUET-2225) Allow reading dense with RecordReader

2023-01-10 Thread fatemah (Jira)
fatemah created PARQUET-2225:


 Summary: Allow reading dense with RecordReader
 Key: PARQUET-2225
 URL: https://issues.apache.org/jira/browse/PARQUET-2225
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: fatemah
Assignee: fatemah


Currently ReadRecords reads spaced by default. Some readers may need to read 
the values dense, and reading spaced is less efficient than reading dense. We 
need an option for reading dense.
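
To make the two layouts concrete, here is a minimal standalone sketch. It does not use the parquet-cpp API: `SpacedResult` and `ToSpaced` are hypothetical names, and it assumes a flat optional column where a definition level of 1 means "value present" and 0 means "null". The stored (non-null) values are already the dense layout; the spaced layout expands them so that every level gets a slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; not the parquet-cpp API.
// Assumes a flat optional column: def level 1 = value present, 0 = null.
struct SpacedResult {
  std::vector<int32_t> values;  // one slot per level; null slots hold 0
  std::vector<bool> valid;      // true where the slot holds a real value
};

// Spaced layout: expand the stored (non-null) values so that every level,
// including nulls, occupies a slot. The dense layout is `stored_values`
// itself, with no expansion step.
SpacedResult ToSpaced(const std::vector<int16_t>& def_levels,
                      const std::vector<int32_t>& stored_values) {
  SpacedResult out;
  out.values.reserve(def_levels.size());
  out.valid.reserve(def_levels.size());
  std::size_t next = 0;
  for (int16_t level : def_levels) {
    const bool present = (level == 1);
    if (present) {
      assert(next < stored_values.size());  // one stored value per def == 1
      out.values.push_back(stored_values[next++]);
    } else {
      out.values.push_back(0);  // placeholder slot for a null
    }
    out.valid.push_back(present);
  }
  return out;
}
```

For definition levels {1, 0, 1, 1, 0} and stored values {7, 8, 9}, the spaced output has five slots while the dense output keeps three, which is the extra work a dense read option would avoid.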





[jira] [Updated] (PARQUET-2210) Skip pages based on header metadata using a callback

2022-11-07 Thread fatemah (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fatemah updated PARQUET-2210:
-
Description: Currently, we do not expose the page header metadata, so it 
cannot be used for skipping pages. I propose exposing the metadata through a 
callback that would allow the caller to decide if they want to read or skip the 
page based on the metadata. The signature of the callback would be the 
following: std::function skip_page_callback)  
(was: Currently, we do not use the statistics that are stored in the page 
headers for pruning the rows that we read. Row group pruning is very 
coarse-grained and in many cases does not prune the row group. I propose adding 
a FilteredPageReader that would accept a filter and would not return the pages 
that do not match the filter based on page statistics.

Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.

Also, the FilteredPageReader will keep track of what row ranges matched and not 
matched. We could use this to skip reading rows that do not match from the rest 
of the columns. Note that the SkipRecords API was recently added to the Parquet 
reader (https://issues.apache.org/jira/browse/PARQUET-2188))

> Skip pages based on header metadata using a callback
> 
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>Reporter: fatemah
>Priority: Major
>
> Currently, we do not expose the page header metadata, so it cannot be used 
> for skipping pages. I propose exposing the metadata through a callback that 
> would allow the caller to decide if they want to read or skip the page based 
> on the metadata. The signature of the callback would be the following: 
> std::function skip_page_callback)
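
The template arguments of the std::function in the description above did not survive in this archive, and the sketch below does not try to restore them. It only illustrates the general idea under stated assumptions: `PageMeta`, `SkipPageCallback`, and `PagesToRead` are hypothetical names, the metadata fields are invented for the example, and the callback is assumed to return true when the page should be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for page header metadata; not a parquet-cpp type.
struct PageMeta {
  int64_t num_values;
  int64_t min_value;
  int64_t max_value;
};

// Assumed callback shape: return true to skip the page, false to read it.
using SkipPageCallback = std::function<bool(const PageMeta&)>;

// Returns the indices of the pages that would be read. The caller decides
// per page, looking only at the metadata, never at the page contents.
std::vector<std::size_t> PagesToRead(const std::vector<PageMeta>& pages,
                                     const SkipPageCallback& skip_page) {
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    assert(pages[i].num_values >= 0);
    // An empty callback means "read everything".
    if (skip_page && skip_page(pages[i])) continue;
    kept.push_back(i);
  }
  return kept;
}
```

A caller looking for values of at least 15 could pass a lambda that returns `m.max_value < 15`, so pages whose range cannot match are never decoded.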





[jira] [Updated] (PARQUET-2210) Skip pages based on header metadata using a callback

2022-11-07 Thread fatemah (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fatemah updated PARQUET-2210:
-
Summary: Skip pages based on header metadata using a callback  (was: Add 
FilteredPageReader to filter rows based on page statistics)

> Skip pages based on header metadata using a callback
> 
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>Reporter: fatemah
>Priority: Major
>
> Currently, we do not use the statistics that are stored in the page headers 
> for pruning the rows that we read. Row group pruning is very coarse-grained 
> and in many cases does not prune the row group. I propose adding a 
> FilteredPageReader that would accept a filter and would not return the pages 
> that do not match the filter based on page statistics.
> Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
> Also, the FilteredPageReader will keep track of what row ranges matched and 
> not matched. We could use this to skip reading rows that do not match from 
> the rest of the columns. Note that the SkipRecords API was recently added to 
> the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188)





[jira] [Created] (PARQUET-2210) Add FilteredPageReader to filter rows based on page statistics

2022-10-31 Thread fatemah (Jira)
fatemah created PARQUET-2210:


 Summary: Add FilteredPageReader to filter rows based on page 
statistics
 Key: PARQUET-2210
 URL: https://issues.apache.org/jira/browse/PARQUET-2210
 Project: Parquet
  Issue Type: New Feature
Reporter: fatemah


Currently, we do not use the statistics that are stored in the page headers for 
pruning the rows that we read. Row group pruning is very coarse-grained and in 
many cases does not prune the row group. I propose adding a FilteredPageReader 
that would accept a filter and would not return the pages that do not match the 
filter based on page statistics.

Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.

Also, the FilteredPageReader will keep track of what row ranges matched and not 
matched. We could use this to skip reading rows that do not match from the rest 
of the columns. Note that the SkipRecords API was recently added to the Parquet 
reader (https://issues.apache.org/jira/browse/PARQUET-2188)





[jira] [Created] (PARQUET-2209) Optimize skip for the case that number of values to skip equals page size

2022-10-31 Thread fatemah (Jira)
fatemah created PARQUET-2209:


 Summary: Optimize skip for the case that number of values to skip 
equals page size
 Key: PARQUET-2209
 URL: https://issues.apache.org/jira/browse/PARQUET-2209
 Project: Parquet
  Issue Type: Improvement
Reporter: fatemah


Optimize skip for the case that the number of values to skip equals page size. 
Right now, we end up reading to the end of the page and throwing away the 
rep/defs and values that we have read, which is unnecessary.
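
A minimal sketch of the idea, assuming a flat column where the value count in the page header equals the number of values to skip. `ToyPage` and `ToyReader` are hypothetical and unrelated to the actual parquet-cpp classes; `values_decoded` exists only to show how much decode work each path performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy types; not parquet-cpp classes.
struct ToyPage {
  int64_t num_values;  // value count taken from the page header
};

struct ToyReader {
  std::vector<ToyPage> pages;
  std::size_t page_index = 0;
  int64_t offset_in_page = 0;  // values already consumed in the current page
  int64_t values_decoded = 0;  // counts decode work, to show what is saved

  // Skips up to `num_to_skip` values and returns how many were skipped.
  int64_t Skip(int64_t num_to_skip) {
    assert(num_to_skip >= 0);
    int64_t skipped = 0;
    while (skipped < num_to_skip && page_index < pages.size()) {
      const int64_t remaining = pages[page_index].num_values - offset_in_page;
      const int64_t wanted = num_to_skip - skipped;
      if (offset_in_page == 0 && wanted >= remaining) {
        // The whole untouched page is skipped: drop it without decoding.
        skipped += remaining;
        ++page_index;
        continue;
      }
      // Partial page: fall back to decode-and-discard.
      const int64_t n = std::min(wanted, remaining);
      values_decoded += n;
      offset_in_page += n;
      skipped += n;
      if (offset_in_page == pages[page_index].num_values) {
        ++page_index;
        offset_in_page = 0;
      }
    }
    return skipped;
  }
};
```

With three pages of 10 values each, `Skip(10)` on a fresh reader advances to the second page while `values_decoded` stays at 0.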





[jira] [Created] (PARQUET-2206) Microbenchmark for ColumnReader ReadBatch and Skip

2022-10-26 Thread fatemah (Jira)
fatemah created PARQUET-2206:


 Summary: Microbenchmark for ColumnReader ReadBatch and Skip
 Key: PARQUET-2206
 URL: https://issues.apache.org/jira/browse/PARQUET-2206
 Project: Parquet
  Issue Type: Improvement
Reporter: fatemah


Adding a microbenchmark for the column reader's ReadBatch and Skip. Later, I 
will add benchmarks for RecordReader's ReadRecords and SkipRecords.





[jira] [Created] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-10-25 Thread fatemah (Jira)
fatemah created PARQUET-2204:


 Summary: TypedColumnReaderImpl::Skip should reuse scratch space
 Key: PARQUET-2204
 URL: https://issues.apache.org/jira/browse/PARQUET-2204
 Project: Parquet
  Issue Type: Improvement
Reporter: fatemah


TypedColumnReaderImpl::Skip allocates scratch space on every call. The scratch 
space is used to read rep/def levels and values and throw them away. The memory 
allocation slows down the skip based on microbenchmarks. The scratch space can 
be allocated once and re-used.
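
The allocate-once pattern can be sketched as follows. This is not the TypedColumnReaderImpl code: `ToySkipper`, `kBatchSize`, and `DecodeInto` are hypothetical, and the decoder is a dummy that only fills the buffer. The point is that the scratch buffer is a member that is sized on first use and reused by later calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batch size for the sketch.
constexpr int64_t kBatchSize = 1024;

// Hypothetical toy class; not the parquet-cpp implementation.
class ToySkipper {
 public:
  // Decodes and discards `num_values` values in fixed-size batches,
  // reusing a single scratch buffer across calls.
  int64_t Skip(int64_t num_values) {
    assert(num_values >= 0);
    if (scratch_.empty()) {
      scratch_.resize(static_cast<std::size_t>(kBatchSize));  // first use only
      ++allocations_;
    }
    int64_t skipped = 0;
    while (skipped < num_values) {
      const int64_t n = std::min(kBatchSize, num_values - skipped);
      DecodeInto(scratch_.data(), n);  // decoded values are ignored
      skipped += n;
    }
    return skipped;
  }

  // Number of times the scratch buffer was allocated.
  int allocations() const { return allocations_; }

 private:
  // Stand-in for a real decoder: fills the buffer with dummy values.
  void DecodeInto(int32_t* out, int64_t n) {
    for (int64_t i = 0; i < n; ++i) out[i] = 0;
  }

  std::vector<int32_t> scratch_;
  int allocations_ = 0;
};
```

Calling `Skip` repeatedly on the same `ToySkipper` leaves `allocations()` at 1, whereas a version that declares the buffer inside `Skip` would allocate on every call.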





[jira] [Created] (PARQUET-2201) Add Stress test for RecordReader SkipRecords

2022-10-06 Thread fatemah (Jira)
fatemah created PARQUET-2201:


 Summary: Add Stress test for RecordReader SkipRecords
 Key: PARQUET-2201
 URL: https://issues.apache.org/jira/browse/PARQUET-2201
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: fatemah


Adding a stress test that will call a random sequence of ReadRecords and 
SkipRecords.
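
The shape of such a test might look like the sketch below. It is only an outline under assumptions: `ToyRecordReader` is a hypothetical stand-in over records 0, 1, 2, ... that mirrors the ReadRecords/SkipRecords call shape, and `RunStress` is an invented helper. A real test would drive the actual RecordReader against a file with known contents.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical toy reader over records 0, 1, ..., num_records - 1.
class ToyRecordReader {
 public:
  explicit ToyRecordReader(int64_t num_records) : num_records_(num_records) {
    assert(num_records >= 0);
  }
  // Appends up to `n` records to `out`; returns the number read.
  int64_t ReadRecords(int64_t n, std::vector<int64_t>* out) {
    const int64_t count = std::min(n, num_records_ - position_);
    for (int64_t i = 0; i < count; ++i) out->push_back(position_ + i);
    position_ += count;
    return count;
  }
  // Skips up to `n` records; returns the number skipped.
  int64_t SkipRecords(int64_t n) {
    const int64_t count = std::min(n, num_records_ - position_);
    position_ += count;
    return count;
  }

 private:
  int64_t num_records_;
  int64_t position_ = 0;
};

// Issues a random mix of reads and skips and checks every call against a
// position tracked independently by the test. Returns true if all pass.
bool RunStress(uint32_t seed, int64_t num_records) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int64_t> batch(0, 10);
  std::bernoulli_distribution do_skip(0.5);
  ToyRecordReader reader(num_records);
  int64_t expected_position = 0;
  std::vector<int64_t> out;
  while (expected_position < num_records) {
    const int64_t n = batch(rng);
    const int64_t expected = std::min(n, num_records - expected_position);
    if (do_skip(rng)) {
      if (reader.SkipRecords(n) != expected) return false;
    } else {
      out.clear();
      if (reader.ReadRecords(n, &out) != expected) return false;
      for (int64_t i = 0; i < expected; ++i) {
        if (out[static_cast<std::size_t>(i)] != expected_position + i) return false;
      }
    }
    expected_position += expected;
  }
  // The reader must now be exhausted.
  return reader.SkipRecords(1) == 0;
}
```

Using a fixed seed per run keeps any failure reproducible, which matters more for a randomized test than the particular sequence chosen.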





[jira] [Created] (PARQUET-2200) Add SkipValues() to decoder, Refactor TypedColumnReader::Skip to use it.

2022-10-06 Thread fatemah (Jira)
fatemah created PARQUET-2200:


 Summary: Add SkipValues() to decoder, Refactor 
TypedColumnReader::Skip to use it.
 Key: PARQUET-2200
 URL: https://issues.apache.org/jira/browse/PARQUET-2200
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: fatemah


The proposed SkipValues will read and throw away values. We can then refactor 
TypedColumnReader and RecordReader 
(https://issues.apache.org/jira/browse/PARQUET-2188) Skip methods to use it.
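
As a rough illustration of the read-and-discard behavior described above, the sketch below puts a `SkipValues` method on a toy decoder interface. `ToyDecoder` and `ToyPlainDecoder` are hypothetical and do not reflect the real parquet-cpp decoder hierarchy or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy interface; not the parquet-cpp decoder API.
class ToyDecoder {
 public:
  virtual ~ToyDecoder() = default;

  // Decodes up to `max_values` values into `out`; returns the number decoded.
  virtual int Decode(int32_t* out, int max_values) = 0;

  // Reads and throws away up to `num_values` values; returns the number
  // actually skipped. Subclasses could override this with a cheaper path.
  virtual int SkipValues(int num_values) {
    assert(num_values >= 0);
    std::vector<int32_t> scratch(static_cast<std::size_t>(num_values));
    return Decode(scratch.data(), num_values);
  }
};

// Toy decoder over an in-memory array of already-plain values.
class ToyPlainDecoder : public ToyDecoder {
 public:
  explicit ToyPlainDecoder(std::vector<int32_t> data) : data_(std::move(data)) {}

  int Decode(int32_t* out, int max_values) override {
    const int remaining = static_cast<int>(data_.size() - pos_);
    const int n = std::min(max_values, remaining);
    std::copy(data_.begin() + static_cast<std::ptrdiff_t>(pos_),
              data_.begin() + static_cast<std::ptrdiff_t>(pos_) + n, out);
    pos_ += static_cast<std::size_t>(n);
    return n;
  }

 private:
  std::vector<int32_t> data_;
  std::size_t pos_ = 0;
};
```

A column reader's skip path can then call `decoder.SkipValues(n)` instead of managing its own decode-and-discard loop.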





[jira] [Created] (PARQUET-2188) Add SkipRecords API to RecordReader

2022-09-13 Thread fatemah (Jira)
fatemah created PARQUET-2188:


 Summary: Add SkipRecords API to RecordReader
 Key: PARQUET-2188
 URL: https://issues.apache.org/jira/browse/PARQUET-2188
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: fatemah


The RecordReader is missing an API to skip records. There is a Skip method in 
the ColumnReader, but that skips based on the number of values/levels and not 
records. For repeated fields, this SkipRecords API will detect the record 
boundaries and correctly skip the right number of values for the requested 
number of records.
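
Record boundary detection can be sketched from the repetition levels alone, since a repetition level of 0 marks the first level of a new record. `RecordLengths` is a hypothetical helper written for this illustration, not part of the RecordReader API, and it assumes the level sequence starts at a record boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper; not part of parquet-cpp.
// Returns the number of levels in each record. A repetition level of 0
// marks the first level of a new record; any other level continues the
// current record.
std::vector<int64_t> RecordLengths(const std::vector<int16_t>& rep_levels) {
  std::vector<int64_t> lengths;
  for (int16_t rep : rep_levels) {
    if (rep == 0) {
      lengths.push_back(1);
    } else {
      assert(!lengths.empty());  // input must start at a record boundary
      ++lengths.back();
    }
  }
  return lengths;
}
```

Skipping N records then means consuming the sum of the first N lengths in levels, rather than N levels.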





[jira] [Created] (PARQUET-2179) Add a test for skipping repeated fields

2022-08-30 Thread fatemah (Jira)
fatemah created PARQUET-2179:


 Summary: Add a test for skipping repeated fields
 Key: PARQUET-2179
 URL: https://issues.apache.org/jira/browse/PARQUET-2179
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: fatemah


The existing test only tests non-repeated fields. Adding a test for repeated 
fields to make it clear that it is skipping values and not records.





[jira] [Created] (PARQUET-2175) Skip method skips levels and not rows for repeated fields

2022-08-23 Thread fatemah (Jira)
fatemah created PARQUET-2175:


 Summary: Skip method skips levels and not rows for repeated fields
 Key: PARQUET-2175
 URL: https://issues.apache.org/jira/browse/PARQUET-2175
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: fatemah


The implementation of TypedColumnReader::Skip method with signature:

virtual int64_t Skip(int64_t num_levels_to_skip) = 0;

will skip levels for both repeated fields and non-repeated fields. We want to 
be able to skip rows for repeated fields, and skipping levels is not that 
useful.

For example, for the following rows:

message M { repeated int32 b = 1 }

rows: {}, {[10, 10]}, {[20, 20, 20]}

values = {10, 10, 20, 20, 20};
def_levels = {0, 1, 1, 1, 1, 1};
rep_levels = {0, 0, 1, 0, 1, 1};

We want skip(2) to skip the first two rows, so that the next value that we read 
is 20. However, it will skip the first two levels, and the next value that we 
read is 10.
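
The example above can be checked with a small standalone sketch. `SkipUnit` and `NextValueIndex` are hypothetical names that are not part of TypedColumnReader, and the code assumes a maximum definition level of 1, as in the schema shown, so a level carries a value only when its definition level is 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for this illustration; not parquet-cpp API.
enum class SkipUnit { kLevels, kRows };

// Returns the index into the values array of the next value that would be
// read after skipping `n` units (levels or rows) from the start.
int64_t NextValueIndex(const std::vector<int16_t>& def_levels,
                       const std::vector<int16_t>& rep_levels,
                       int64_t n, SkipUnit unit) {
  assert(def_levels.size() == rep_levels.size());
  assert(n >= 0);
  int64_t units_seen = 0;
  int64_t value_index = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    // Every level is a unit when skipping levels; only row starts
    // (repetition level 0) are units when skipping rows.
    const bool starts_unit =
        (unit == SkipUnit::kLevels) || rep_levels[i] == 0;
    if (starts_unit) {
      if (units_seen == n) break;  // the first unit that is not skipped
      ++units_seen;
    }
    if (def_levels[i] == 1) ++value_index;  // this level carried a value
  }
  return value_index;
}
```

With the levels from the example and n = 2, skipping levels yields index 1 (the second 10), while skipping rows yields index 2 (the first 20), matching the behavior described above.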


