wgtmac commented on code in PR #36967:
URL: https://github.com/apache/arrow/pull/36967#discussion_r1280805331
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -779,6 +787,28 @@ Result<std::vector<int>> ParquetFileFragment::FilterRowGroups(
return row_groups;
}
+Result<std::vector<int>> ParquetFileFragment::FilterRangeRowGroups(
+ std::vector<int> filtered_row_groups, int64_t start_offset, int64_t length) {
+ std::vector<int> row_groups;
+ for (int row_group : filtered_row_groups) {
+ auto rg_metadata = metadata_->RowGroup(row_group);
+ std::shared_ptr<parquet::ColumnChunkMetaData> cc0 = rg_metadata->ColumnChunk(0);
Review Comment:
It seems that the Parquet spec does not forbid the case where column chunk 1 is
placed before column chunk 0 in the file. In that case, this result may be wrong.
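For illustration, one way to avoid depending on the physical order is to take the minimum start offset over all column chunks in the row group. A minimal sketch, assuming the same `parquet::RowGroupMetaData` / `ColumnChunkMetaData` accessors used above; the helper name is hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <memory>

#include "parquet/metadata.h"

// Hypothetical helper: compute the physical start offset of a row group as the
// minimum start offset over all of its column chunks, instead of assuming that
// column chunk 0 is the first one written to the file.
int64_t RowGroupStartOffset(const parquet::RowGroupMetaData& rg_metadata) {
  int64_t start = std::numeric_limits<int64_t>::max();
  for (int i = 0; i < rg_metadata.num_columns(); ++i) {
    std::unique_ptr<parquet::ColumnChunkMetaData> cc = rg_metadata.ColumnChunk(i);
    int64_t cc_start = cc->data_page_offset();
    // A dictionary page, if present, precedes the data pages of the chunk.
    if (cc->has_dictionary_page() && cc->dictionary_page_offset() < cc_start) {
      cc_start = cc->dictionary_page_offset();
    }
    start = std::min(start, cc_start);
  }
  return start;
}
```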
##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -66,6 +76,18 @@ public Optional<String[]> getColumns() {
return columns;
}
+ public Optional<String> getFilter() {
+ return filter;
+ }
+
+ public long getStartOffset() {
Review Comment:
Yes, my point is that if we need to support (optional) file splitting, users
should not get surprising results. We may need to implement this for all
supported file formats, or throw if the format does not support splitting. It
would be fine to leave a TODO comment for the other formats for now.
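As a rough illustration of the "throw if not supported" idea (the function name and the default sentinel values are assumptions, not the actual API), a format that cannot split on byte ranges could reject non-default ranges up front:

```cpp
#include <cstdint>
#include <limits>

#include "arrow/status.h"

// Hypothetical guard for a format without byte-range splitting support:
// reject a non-default range instead of silently returning duplicated or
// missing rows. Assumes 0 / INT64_MAX are the "scan everything" defaults.
arrow::Status ValidateSplitOptions(int64_t start_offset, int64_t length) {
  const bool whole_file =
      start_offset == 0 && length == std::numeric_limits<int64_t>::max();
  if (!whole_file) {
    // TODO: implement range-based splitting for this format.
    return arrow::Status::NotImplemented(
        "Byte-range scanning is not supported for this file format");
  }
  return arrow::Status::OK();
}
```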
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -779,6 +787,28 @@ Result<std::vector<int>> ParquetFileFragment::FilterRowGroups(
return row_groups;
}
+Result<std::vector<int>> ParquetFileFragment::FilterRangeRowGroups(
+ int64_t start_offset, int64_t length) {
+ std::vector<int> row_groups;
+ for (int row_group : *row_groups_) {
+ auto rg_metadata = metadata_->RowGroup(row_group);
+ std::shared_ptr<parquet::ColumnChunkMetaData> cc0 = rg_metadata->ColumnChunk(0);
+ int64_t r_start = cc0->data_page_offset();
+ if (cc0->has_dictionary_page() && r_start > cc0->dictionary_page_offset()) {
+ r_start = cc0->dictionary_page_offset();
+ }
+ int64_t r_bytes = 0L;
+ for (int col_id = 0; col_id < rg_metadata->num_columns(); col_id++) {
+ r_bytes += rg_metadata->ColumnChunk(col_id)->total_compressed_size();
Review Comment:
IIRC, the midpoint algorithm is used in Impala. However, it has drawbacks: as
@mapleFU has mentioned, the size of the ColumnMetaData is not included, so the
more columns you have, the more imprecise total_compressed_size becomes. I
prefer using the start offset to match a row group, which is what Iceberg and
ORC do.
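To make the comparison concrete, a sketch of the start-offset matching described above: a row group is assigned to a split iff its start offset falls inside `[start_offset, start_offset + length)`, so every row group is claimed by exactly one split regardless of how compressed sizes add up. `RowGroupStartOffset` is the hypothetical helper sketched earlier; the function name here is likewise illustrative:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

#include "parquet/metadata.h"

// Select the row groups whose start offset falls inside the requested byte
// range, mirroring how Iceberg and ORC assign row groups/stripes to splits.
std::vector<int> SelectRowGroupsByStartOffset(const parquet::FileMetaData& metadata,
                                              int64_t start_offset, int64_t length) {
  std::vector<int> selected;
  for (int i = 0; i < metadata.num_row_groups(); ++i) {
    std::unique_ptr<parquet::RowGroupMetaData> rg = metadata.RowGroup(i);
    const int64_t rg_start = RowGroupStartOffset(*rg);
    if (rg_start >= start_offset && rg_start < start_offset + length) {
      selected.push_back(i);
    }
  }
  return selected;
}
```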
##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -66,6 +76,18 @@ public Optional<String[]> getColumns() {
return columns;
}
+ public Optional<String> getFilter() {
+ return filter;
+ }
+
+ public long getStartOffset() {
Review Comment:
TBH, I am not very familiar with Dataset: does ScanOptions apply to a single
input file or to the whole dataset? My concern is the scope of its use.
##########
cpp/src/arrow/dataset/scanner.h:
##########
@@ -496,6 +500,18 @@ class ARROW_DS_EXPORT ScannerBuilder {
/// Schema.
Status Filter(const compute::Expression& filter);
+ /// \brief set the start offset of the scanner
+ /// \param[in] start_offset start offset for the scan
+ ///
+ /// \return Failure if the start_offset is not greater than 0
+ Status StartOffset(int64_t start_offset);
Review Comment:
For changes in the dataset API, I'd request @westonpace to take a look.