[jira] [Created] (PARQUET-2250) Expose column descriptor through RecordReader
fatemah created PARQUET-2250:
    Summary: Expose column descriptor through RecordReader
    Key: PARQUET-2250
    URL: https://issues.apache.org/jira/browse/PARQUET-2250
    Project: Parquet
    Issue Type: Improvement
    Components: parquet-cpp
    Reporter: fatemah

Currently, the RecordReader does not expose the underlying column descriptor. This would be useful in some scenarios.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (PARQUET-2225) [C++] Allow reading dense with RecordReader
[ https://issues.apache.org/jira/browse/PARQUET-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

fatemah updated PARQUET-2225:
    Summary: [C++] Allow reading dense with RecordReader (was: Allow reading dense with RecordReader)

> [C++] Allow reading dense with RecordReader
> ---
>
> Key: PARQUET-2225
> URL: https://issues.apache.org/jira/browse/PARQUET-2225
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Reporter: fatemah
> Assignee: fatemah
> Priority: Major
>
> Currently ReadRecords reads spaced by default. Some readers may need to read
> the values dense, and reading spaced is less efficient than reading dense. We
> need an option for reading dense.
[jira] [Created] (PARQUET-2225) Allow reading dense with RecordReader
fatemah created PARQUET-2225:
    Summary: Allow reading dense with RecordReader
    Key: PARQUET-2225
    URL: https://issues.apache.org/jira/browse/PARQUET-2225
    Project: Parquet
    Issue Type: New Feature
    Components: parquet-cpp
    Reporter: fatemah
    Assignee: fatemah

Currently ReadRecords reads spaced by default. Some readers may need to read the values dense, and reading spaced is less efficient than reading dense. We need an option for reading dense.
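
To make the spaced/dense distinction concrete, here is a minimal sketch. It is not the parquet-cpp implementation. It assumes "dense" means only the non-null values, packed together, and "spaced" means one output slot per level, with a placeholder wherever the value is null. The function name ToSpaced and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands densely packed non-null values into a "spaced" layout that has one
// slot per definition level. Null slots receive a 0 placeholder; a real reader
// would also carry a validity bitmap (omitted here for brevity).
inline std::vector<int32_t> ToSpaced(const std::vector<int16_t>& def_levels,
                                     int16_t max_def_level,
                                     const std::vector<int32_t>& dense_values) {
  std::vector<int32_t> spaced(def_levels.size(), 0);
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      assert(next < dense_values.size());
      spaced[i] = dense_values[next++];
    }
  }
  assert(next == dense_values.size());  // every dense value was placed
  return spaced;
}
```

The extra pass and the larger output buffer are the cost that a dense read avoids when the caller does not need values aligned with null positions.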
[jira] [Updated] (PARQUET-2210) Skip pages based on header metadata using a callback
[ https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

fatemah updated PARQUET-2210:
    Description: Currently, we do not expose the page header metadata, so it cannot be used for skipping pages. I propose exposing the metadata through a callback that would allow the caller to decide whether to read or skip the page based on the metadata. The signature of the callback would be the following:
std::function skip_page_callback)

  (was: Currently, we do not use the statistics that are stored in the page headers for pruning the rows that we read. Row group pruning is very coarse-grained and in many cases does not prune the row group. I propose adding a FilteredPageReader that would accept a filter and would not return the pages that do not match the filter based on page statistics.
Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
Also, the FilteredPageReader will keep track of which row ranges matched and which did not. We could use this to skip reading rows that do not match from the rest of the columns. Note that the SkipRecords API was recently added to the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188))

> Skip pages based on header metadata using a callback
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
> Issue Type: New Feature
> Reporter: fatemah
> Priority: Major
>
> Currently, we do not expose the page header metadata, so it cannot be used
> for skipping pages. I propose exposing the metadata through a callback that
> would allow the caller to decide whether to read or skip the page based
> on the metadata. The signature of the callback would be the following:
> std::function skip_page_callback)
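
The callback idea can be sketched as follows. The signature quoted in this message has lost its template arguments, so everything below is an assumption made for illustration: ToyPageHeader, its fields, the SkipPageCallback alias, and the convention that returning true means "skip this page" are all hypothetical and do not come from the issue.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical summary of what a page header might expose to the caller.
struct ToyPageHeader {
  int64_t num_values;
  bool has_min_max;
  int32_t min;
  int32_t max;
};

// Assumed convention: the callback returns true when the page should be skipped.
using SkipPageCallback = std::function<bool(const ToyPageHeader&)>;

// Returns the indices of the pages the reader would still have to decode.
inline std::vector<std::size_t> PagesToRead(
    const std::vector<ToyPageHeader>& headers,
    const SkipPageCallback& should_skip) {
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < headers.size(); ++i) {
    // An empty callback means "read everything".
    if (should_skip && should_skip(headers[i])) continue;
    kept.push_back(i);
  }
  return kept;
}
```

The design leaves the filtering policy with the caller: the reader only hands over the metadata and acts on the yes/no answer.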
[jira] [Updated] (PARQUET-2210) Skip pages based on header metadata using a callback
[ https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

fatemah updated PARQUET-2210:
    Summary: Skip pages based on header metadata using a callback (was: Add FilteredPageReader to filter rows based on page statistics)

> Skip pages based on header metadata using a callback
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
> Issue Type: New Feature
> Reporter: fatemah
> Priority: Major
>
> Currently, we do not use the statistics that are stored in the page headers
> for pruning the rows that we read. Row group pruning is very coarse-grained
> and in many cases does not prune the row group. I propose adding a
> FilteredPageReader that would accept a filter and would not return the pages
> that do not match the filter based on page statistics.
> Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
> Also, the FilteredPageReader will keep track of which row ranges matched and
> which did not. We could use this to skip reading rows that do not match from
> the rest of the columns. Note that the SkipRecords API was recently added to
> the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188)
[jira] [Created] (PARQUET-2210) Add FilteredPageReader to filter rows based on page statistics
fatemah created PARQUET-2210:
    Summary: Add FilteredPageReader to filter rows based on page statistics
    Key: PARQUET-2210
    URL: https://issues.apache.org/jira/browse/PARQUET-2210
    Project: Parquet
    Issue Type: New Feature
    Reporter: fatemah

Currently, we do not use the statistics that are stored in the page headers for pruning the rows that we read. Row group pruning is very coarse-grained and in many cases does not prune the row group. I propose adding a FilteredPageReader that would accept a filter and would not return the pages that do not match the filter based on page statistics.
Initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
Also, the FilteredPageReader will keep track of which row ranges matched and which did not. We could use this to skip reading rows that do not match from the rest of the columns. Note that the SkipRecords API was recently added to the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188)
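
As a rough illustration of how the three proposed filters could be evaluated against page statistics, consider the sketch below. It is not the proposed FilteredPageReader. ToyPageStats, FilterOp and PageMayMatch are hypothetical names, and the sketch assumes that min, max and null_count are all present and that num_values counts nulls as well as non-null values.

```cpp
#include <cstdint>

enum class FilterOp { kEquals, kIsNull, kIsNotNull };

// Hypothetical per-page statistics; min/max cover the non-null values only.
struct ToyPageStats {
  int32_t min;
  int32_t max;
  int64_t null_count;
  int64_t num_values;  // assumed to include nulls
};

// Returns false only when the statistics prove that no row in the page can
// match, so the page can be pruned. A true result means "may match".
inline bool PageMayMatch(const ToyPageStats& s, FilterOp op, int32_t literal = 0) {
  const bool has_non_null = s.null_count < s.num_values;
  switch (op) {
    case FilterOp::kEquals:
      return has_non_null && s.min <= literal && literal <= s.max;
    case FilterOp::kIsNull:
      return s.null_count > 0;
    case FilterOp::kIsNotNull:
      return has_non_null;
  }
  return true;  // unknown filter: be conservative and keep the page
}
```

Statistics can only rule a page out, never confirm a match, which is why the function is phrased as "may match".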
[jira] [Created] (PARQUET-2209) Optimize skip for the case that number of values to skip equals page size
fatemah created PARQUET-2209:
    Summary: Optimize skip for the case that number of values to skip equals page size
    Key: PARQUET-2209
    URL: https://issues.apache.org/jira/browse/PARQUET-2209
    Project: Parquet
    Issue Type: Improvement
    Reporter: fatemah

Optimize skip for the case where the number of values to skip equals the page size. Right now, we end up reading to the end of the page and throwing away the rep/def levels and values that we have read, which is unnecessary.
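
The intended fast path can be sketched with a toy reader that only tracks counts. This is a simplified model and not the parquet-cpp code: ToyPageReader and its members are hypothetical, the column is assumed to be flat (one value per level), and "decoded" merely counts the values that the slow path would have read and discarded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class ToyPageReader {
 public:
  explicit ToyPageReader(std::vector<int64_t> page_sizes)
      : page_sizes_(std::move(page_sizes)) {
    for (int64_t s : page_sizes_) assert(s >= 0);
  }

  // Skips up to n values and returns how many were actually skipped.
  int64_t Skip(int64_t n) {
    int64_t skipped = 0;
    while (n > 0 && page_ < page_sizes_.size()) {
      const int64_t remaining = page_sizes_[page_] - pos_;
      const int64_t step = std::min(n, remaining);
      // Fast path: the skip covers a whole page that has not been touched,
      // so the page can be dropped without decoding anything.
      const bool whole_untouched_page = (pos_ == 0 && step == remaining);
      if (!whole_untouched_page) decoded_ += step;  // slow path: decode and discard
      pos_ += step;
      if (pos_ == page_sizes_[page_]) {  // page exhausted, move to the next one
        ++page_;
        pos_ = 0;
      }
      n -= step;
      skipped += step;
    }
    return skipped;
  }

  int64_t decoded() const { return decoded_; }
  std::size_t page() const { return page_; }

 private:
  std::vector<int64_t> page_sizes_;
  std::size_t page_ = 0;  // index of the current page
  int64_t pos_ = 0;       // values already consumed in the current page
  int64_t decoded_ = 0;   // values decoded only to be thrown away
};
```

With three pages of 100 values each, Skip(100) from the start drops the first page and leaves decoded() at 0, whereas a skip that starts or ends mid-page still pays for the partial decode.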
[jira] [Created] (PARQUET-2206) Microbenchmark for ColumnReader ReadBatch and Skip
fatemah created PARQUET-2206:
    Summary: Microbenchmark for ColumnReader ReadBatch and Skip
    Key: PARQUET-2206
    URL: https://issues.apache.org/jira/browse/PARQUET-2206
    Project: Parquet
    Issue Type: Improvement
    Reporter: fatemah

Adding a microbenchmark for the column reader's ReadBatch and Skip. Later, I will add benchmarks for RecordReader's ReadRecords and SkipRecords.
[jira] [Created] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space
fatemah created PARQUET-2204:
    Summary: TypedColumnReaderImpl::Skip should reuse scratch space
    Key: PARQUET-2204
    URL: https://issues.apache.org/jira/browse/PARQUET-2204
    Project: Parquet
    Issue Type: Improvement
    Reporter: fatemah

TypedColumnReaderImpl::Skip allocates scratch space on every call. The scratch space is used to read rep/def levels and values and throw them away. Microbenchmarks show that this memory allocation slows down the skip. The scratch space can be allocated once and reused.
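
The "allocate once and reuse" pattern can be sketched as below. This is a generic illustration and not the actual TypedColumnReaderImpl change: the ScratchHolder class and its members are hypothetical, and a std::vector stands in for whatever buffer type the reader really uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ScratchHolder {
 public:
  // Returns a buffer of at least num_bytes. The cached buffer is grown only
  // when it is too small, so repeated calls of the same size do not allocate.
  uint8_t* GetScratch(std::size_t num_bytes) {
    if (scratch_.size() < num_bytes) {
      scratch_.resize(num_bytes);
      ++grow_count_;
    }
    assert(scratch_.size() >= num_bytes);
    return scratch_.data();
  }

  // Number of times the buffer had to grow (for illustration and testing).
  int grow_count() const { return grow_count_; }

 private:
  std::vector<uint8_t> scratch_;
  int grow_count_ = 0;
};
```

A reader would keep one such holder as a member and call GetScratch inside Skip, so the allocation cost is paid on the first call (or when a larger batch is requested) rather than on every call.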
[jira] [Created] (PARQUET-2201) Add Stress test for RecordReader SkipRecords
fatemah created PARQUET-2201:
    Summary: Add Stress test for RecordReader SkipRecords
    Key: PARQUET-2201
    URL: https://issues.apache.org/jira/browse/PARQUET-2201
    Project: Parquet
    Issue Type: Improvement
    Components: parquet-cpp
    Reporter: fatemah

Adding a stress test that will call a random sequence of ReadRecords and SkipRecords.
[jira] [Created] (PARQUET-2200) Add SkipValues() to decoder, Refactor TypedColumnReader::Skip to use it.
fatemah created PARQUET-2200:
    Summary: Add SkipValues() to decoder, Refactor TypedColumnReader::Skip to use it.
    Key: PARQUET-2200
    URL: https://issues.apache.org/jira/browse/PARQUET-2200
    Project: Parquet
    Issue Type: Improvement
    Components: parquet-cpp
    Reporter: fatemah

The proposed SkipValues will read and throw away values. We can then refactor TypedColumnReader and RecordReader (https://issues.apache.org/jira/browse/PARQUET-2188) Skip methods to use it.
[jira] [Created] (PARQUET-2188) Add SkipRecords API to RecordReader
fatemah created PARQUET-2188:
    Summary: Add SkipRecords API to RecordReader
    Key: PARQUET-2188
    URL: https://issues.apache.org/jira/browse/PARQUET-2188
    Project: Parquet
    Issue Type: New Feature
    Components: parquet-cpp
    Reporter: fatemah

The RecordReader is missing an API to skip records. There is a Skip method in the ColumnReader, but that skips based on the number of values/levels and not records. For repeated fields, this SkipRecords API will detect the record boundaries and correctly skip the right number of values for the requested number of records.
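
Record-boundary detection can be sketched as follows. This is an illustration and not the RecordReader implementation: SkipResult and SkipRecordsSketch are hypothetical names, the column is assumed to have a single repeated level, and the reader is assumed to be positioned at a record boundary. It relies on the Parquet convention that a repetition level of 0 marks the start of a new record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SkipResult {
  int64_t levels_skipped;
  int64_t values_skipped;
  int64_t records_skipped;
};

// Skips up to num_records whole records, starting at a record boundary.
inline SkipResult SkipRecordsSketch(const std::vector<int16_t>& def_levels,
                                    const std::vector<int16_t>& rep_levels,
                                    int16_t max_def_level,
                                    int64_t num_records) {
  assert(def_levels.size() == rep_levels.size());
  SkipResult r = {0, 0, 0};
  std::size_t i = 0;
  while (i < rep_levels.size() && r.records_skipped < num_records) {
    // Consume one record: its first level, then every following level whose
    // repetition level is non-zero (a continuation of the same record).
    do {
      if (def_levels[i] == max_def_level) ++r.values_skipped;
      ++i;
    } while (i < rep_levels.size() && rep_levels[i] != 0);
    ++r.records_skipped;
  }
  r.levels_skipped = static_cast<int64_t>(i);
  return r;
}
```

Using the three-row example from PARQUET-2175 (def_levels {0, 1, 1, 1, 1, 1}, rep_levels {0, 0, 1, 0, 1, 1}), skipping 2 records consumes 3 levels and 2 values, so the next value read is the first 20.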
[jira] [Created] (PARQUET-2179) Add a test for skipping repeated fields
fatemah created PARQUET-2179:
    Summary: Add a test for skipping repeated fields
    Key: PARQUET-2179
    URL: https://issues.apache.org/jira/browse/PARQUET-2179
    Project: Parquet
    Issue Type: Improvement
    Components: parquet-cpp
    Reporter: fatemah

The existing test only tests non-repeated fields. Adding a test for repeated fields to make it clear that it is skipping values and not records.
[jira] [Created] (PARQUET-2175) Skip method skips levels and not rows for repeated fields
fatemah created PARQUET-2175:
    Summary: Skip method skips levels and not rows for repeated fields
    Key: PARQUET-2175
    URL: https://issues.apache.org/jira/browse/PARQUET-2175
    Project: Parquet
    Issue Type: Bug
    Components: parquet-cpp
    Reporter: fatemah

The implementation of TypedColumnReader::Skip method with signature:

virtual int64_t Skip(int64_t num_levels_to_skip) = 0;

will skip levels for both repeated fields and non-repeated fields. We want to be able to skip rows for repeated fields, and skipping levels is not that useful. For example, for the following rows:

message M { repeated int32 b = 1 }
rows: {}, {[10, 10]}, {[20, 20, 20]}
values = {10, 10, 20, 20, 20};
def_levels = {0, 1, 1, 1, 1, 1};
rep_levels = {0, 0, 1, 0, 1, 1};

We want skip(2) to skip the first two rows, so that the next value that we read is 20. However, it will skip the first two levels, and the next value that we read is 10.
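
The behavior described above can be reproduced with a small model of level-based skipping. This is a toy stand-in, not the TypedColumnReader code: ToyColumn, ExampleColumn and NextValueAfterLevelSkip are hypothetical names, and the maximum definition level is taken to be 1, as in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy column chunk for `repeated int32 b`; the maximum definition level is 1.
struct ToyColumn {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<int32_t> values;  // one entry per level whose def_level is 1
};

// The data from the example in this issue.
inline ToyColumn ExampleColumn() {
  return ToyColumn{{0, 1, 1, 1, 1, 1}, {0, 0, 1, 0, 1, 1}, {10, 10, 20, 20, 20}};
}

// Models a level-based skip: consumes num_levels_to_skip levels, ignoring
// record boundaries, and returns the index of the next value to be read.
inline std::size_t NextValueAfterLevelSkip(const ToyColumn& col,
                                           int64_t num_levels_to_skip) {
  assert(col.def_levels.size() == col.rep_levels.size());
  const int64_t num_levels = static_cast<int64_t>(col.def_levels.size());
  std::size_t value_index = 0;
  for (int64_t i = 0; i < num_levels_to_skip && i < num_levels; ++i) {
    if (col.def_levels[static_cast<std::size_t>(i)] == 1) ++value_index;
  }
  return value_index;
}
```

For the example column, skipping 2 levels consumes the empty first row and only the first element of the second row, so the returned index is 1 and the next value read is the second 10 rather than 20.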