Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689410412


##
cpp/src/parquet/properties.h:
##
@@ -169,6 +169,25 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  // Expected number of distinct values (NDV) in the bloom filter.
+  //
+  // Bloom filters are most effective for high-cardinality columns. A good default
+  // is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  // may not be worthwhile for very small NDVs.
+  //
+  // Increasing ndv (without increasing fpp) increases disk and memory usage.

Review Comment:
   I'm poor at math so I used Gemini to generate the table for common settings and also verified the result by writing a simple test to print the actual size
   
   ```
   ndv=10, fpp=0.1 -> bytes=131072 (128 KiB)
   ndv=10, fpp=0.01 -> bytes=131072 (128 KiB)
   ndv=100, fpp=0.1 -> bytes=1048576 (1024 KiB)
   ndv=100, fpp=0.01 -> bytes=2097152 (2048 KiB)
   ndv=1000, fpp=0.1 -> bytes=8388608 (8192 KiB)
   ndv=1000, fpp=0.01 -> bytes=16777216 (16384 KiB)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689703726


##
cpp/src/parquet/properties.h:
##
@@ -169,6 +169,37 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values (NDV) in the bloom filter.
+  ///
+  /// Bloom filters are most effective for high-cardinality columns. A good default
+  /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  /// may not be worthwhile for very small NDVs.
+  ///
+  /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+  int32_t ndv = 1 << 20;
+
+  /// False-positive probability (FPP) of the bloom filter.
+  ///
+  /// Lower FPP values require more disk and memory space. Recommended values are
+  /// 0.1, 0.05, or 0.001. Very small values are counterproductive as the bitset
+  /// may exceed the size of the actual data. Set ndv appropriately to minimize
+  /// space usage.
+  ///
+  /// Below is a table to demonstrate estimated size using common values.
+  ///
+  /// | ndv     | fpp  | bits/key | theoretical | actual (Po2) |
+  /// |:--------|:-----|:---------|:------------|:-------------|
+  /// | 100,000 | 0.10 | ~6.0     | 75 KB       | **128 KB**   |
+  /// | 100,000 | 0.01 | ~10.5    | 131 KB      | **256 KB**   |

Review Comment:
   It is from the table at https://github.com/apache/parquet-format/blob/4b1c72c837bec5b792b2514f0057533030fcedf8/BloomFilter.md?plain=1#L232






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689680147


##
cpp/src/parquet/properties.h:
##
@@ -169,6 +169,37 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values (NDV) in the bloom filter.
+  ///
+  /// Bloom filters are most effective for high-cardinality columns. A good default
+  /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  /// may not be worthwhile for very small NDVs.
+  ///
+  /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+  int32_t ndv = 1 << 20;
+
+  /// False-positive probability (FPP) of the bloom filter.
+  ///
+  /// Lower FPP values require more disk and memory space. Recommended values are
Review Comment:
   (it's not precisely proportional to that, but it seems to be a reasonable rule of thumb)






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689331390


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+TypedBloomFilterWriter<DType>::TypedBloomFilterWriter(const ColumnDescriptor* descr,
+                                                      BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::Update(const T* values, int64_t num_values) {
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  ARROW_DCHECK(bloom_filter_ != nullptr);
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<T, FLBA>) {
+      bloom_filter_->Hashes(values + i, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values + i, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void TypedBloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                 const uint8_t* valid_bits,
+                                                 int64_t valid_bits_offset) {
+  ARROW_DCHECK(bloom_filter_ != nullptr);
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<T, FLBA>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i + position, batch_size, hashes.data());
+          }
+          bloom_filter_->InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+template <>
+void TypedBloomFilterWriter<BooleanType>::UpdateSpaced(const bool*, int64_t,
+                                                       const uint8_t*, int64_t) {
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::Update(const ::arrow::Array& values) {
+  ParquetException::NYI("Updating bloom filter is not implemented for array of type: " +
+                        values.type()->ToString());
+}
+
+namespace {
+
+template <typename ArrayType>
+void UpdateBinaryBloomFilter(BloomFilter& bloom_filter, const ArrayType& array) {
+  std::array<ByteArray, kHashBatchSize> byte_arrays;
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      array.null_bitmap_data(), array.offset(), array.length(),
+      [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          for (int j = 0; j < batch_size; j++) {
+            byte_arrays[j] = array.GetView(position + i + j);
+          }
+          bloom_filter.Hashes(byte_arrays.data(), batch_size, hashes.data());
+          bloom_filter.InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+}  // namespace
+
+template <>
+void TypedBloomFilterWriter::Update(const ::arrow::Array& 
values) {
+  ARROW_DCHECK(bloom_filter_ != n

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


github-actions[bot] commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3748427078

   Revision: a126e03eb8b519276a5828bd7337b9fe3cdf8f52
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-b54235b7c8](https://github.com/ursacomputing/crossbow/branches/all?query=actions-b54235b7c8)
   
   |Task|Status|
   |----|------|
   |example-cpp-minimal-build-static|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-example-cpp-minimal-build-static)](https://github.com/ursacomputing/crossbow/actions/runs/20987685680/job/60325324466)|
   |example-cpp-minimal-build-static-system-dependency|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-example-cpp-minimal-build-static-system-dependency)](https://github.com/ursacomputing/crossbow/actions/runs/20987684445/job/60325320784)|
   |example-cpp-tutorial|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-example-cpp-tutorial)](https://github.com/ursacomputing/crossbow/actions/runs/20987684884/job/60325321885)|
   |test-build-cpp-fuzz|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/20987684575/job/60325320841)|
   |test-conda-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/20987685218/job/60325322782)|
   |test-conda-cpp-valgrind|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/20987685010/job/60325322258)|
   |test-debian-12-cpp-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/20987685014/job/60325322352)|
   |test-debian-12-cpp-i386|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/20987685287/job/60325322942)|
   |test-debian-experimental-cpp-gcc-15|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-debian-experimental-cpp-gcc-15)](https://github.com/ursacomputing/crossbow/actions/runs/20987684964/job/60325322154)|
   |test-fedora-42-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-fedora-42-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/20987685534/job/60325323696)|
   |test-ubuntu-22.04-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/20987685065/job/60325322457)|
   |test-ubuntu-22.04-cpp-20|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/20987685091/job/60325322492)|
   |test-ubuntu-22.04-cpp-bundled|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-22.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/20987684795/job/60325321610)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/20987684789/job/60325321594)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/20987685402/job/60325323293)|
   |test-ubuntu-24.04-cpp|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b54235b7c8-github-test-ubuntu-24.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/20987684722/job/60325321469)|
   |test-ubuntu-24.04-cpp-bundled-offline|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/cr

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689174831


##
cpp/src/parquet/properties.h:
##
@@ -215,6 +234,15 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(const BloomFilterOptions& bloom_filter_options) {
+    if (bloom_filter_options.fpp >= 1.0 || bloom_filter_options.fpp <= 0.0) {
+      throw ParquetException(
+          "Bloom filter false positive probability must be in (0.0, 1.0), got " +

Review Comment:
   Oh, I had misread the conditions, sorry.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3748418202

   > I think fuzz_internal.cc has already dealt with reading bloom filter, am I missing something?
   
   I mean, perhaps the changed APIs mandate some code changes there? Let's run the CI anyway.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689175121


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+TypedBloomFilterWriter<DType>::TypedBloomFilterWriter(const ColumnDescriptor* descr,
+                                                      BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::Update(const T* values, int64_t num_values) {
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }

Review Comment:
   Good catch! Let me remove it.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689176785


##
cpp/src/parquet/arrow/index_test.cc:
##
@@ -0,0 +1,680 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifdef _MSC_VER
+#  pragma warning(push)
+// Disable forcing value to bool warnings
+#  pragma warning(disable : 4800)
+#endif
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/chunked_array.h"
+#include "arrow/compute/api.h"
+#include "arrow/scalar.h"
+#include "arrow/table.h"
+#include "arrow/testing/builder.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/arrow/reader.h"
+#include "parquet/arrow/reader_internal.h"
+#include "parquet/arrow/schema.h"
+#include "parquet/arrow/test_util.h"
+#include "parquet/arrow/writer.h"
+#include "parquet/bloom_filter.h"
+#include "parquet/bloom_filter_reader.h"
+#include "parquet/file_writer.h"
+#include "parquet/page_index.h"
+#include "parquet/properties.h"
+
+using arrow::Array;
+using arrow::Buffer;
+using arrow::ChunkedArray;
+using arrow::default_memory_pool;
+using arrow::Result;
+using arrow::Status;
+using arrow::Table;
+using arrow::internal::checked_cast;
+using arrow::io::BufferReader;
+
+using ParquetType = parquet::Type;
+
+using parquet::arrow::FromParquetSchema;
+using parquet::schema::GroupNode;
+using parquet::schema::NodePtr;
+using parquet::schema::PrimitiveNode;
+
+namespace parquet::arrow {
+
+namespace {
+
+struct ColumnIndexObject {
+  std::vector<bool> null_pages;
+  std::vector<std::string> min_values;
+  std::vector<std::string> max_values;
+  BoundaryOrder::type boundary_order = BoundaryOrder::Unordered;
+  std::vector<int64_t> null_counts;
+
+  ColumnIndexObject() = default;
+
+  ColumnIndexObject(const std::vector<bool>& null_pages,
+                    const std::vector<std::string>& min_values,
+                    const std::vector<std::string>& max_values,
+                    BoundaryOrder::type boundary_order,
+                    const std::vector<int64_t>& null_counts)
+      : null_pages(null_pages),
+        min_values(min_values),
+        max_values(max_values),
+        boundary_order(boundary_order),
+        null_counts(null_counts) {}
+
+  explicit ColumnIndexObject(const ColumnIndex* column_index) {
+    if (column_index == nullptr) {
+      return;
+    }
+    null_pages = column_index->null_pages();
+    min_values = column_index->encoded_min_values();
+    max_values = column_index->encoded_max_values();
+    boundary_order = column_index->boundary_order();
+    if (column_index->has_null_counts()) {
+      null_counts = column_index->null_counts();
+    }
+  }
+
+  bool operator==(const ColumnIndexObject& b) const {
+    return null_pages == b.null_pages && min_values == b.min_values &&
+           max_values == b.max_values && boundary_order == b.boundary_order &&
+           null_counts == b.null_counts;
+  }
+};
+
+auto encode_int64 = [](int64_t value) {
+  return std::string(reinterpret_cast<const char*>(&value), sizeof(int64_t));
+};
+
+auto encode_double = [](double value) {
+  return std::string(reinterpret_cast<const char*>(&value), sizeof(double));
+};
+
+}  // namespace
+
+class TestingWithPageIndex {
+ public:
+  void WriteFile(const std::shared_ptr<WriterProperties>& writer_properties,
+                 const std::shared_ptr<::arrow::Table>& table) {
+    // Get schema from table.
+    auto schema = table->schema();
+    std::shared_ptr<SchemaDescriptor> parquet_schema;
+    auto arrow_writer_properties = default_arrow_writer_properties();
+    ASSERT_OK_NO_THROW(ToParquetSchema(schema.get(), *writer_properties,
+                                       *arrow_writer_properties, &parquet_schema));
+    auto schema_node = std::static_pointer_cast<GroupNode>(parquet_schema->schema_root());
+
+    // Write table to buffer.
+    auto sink = CreateOutputStream();
+    auto pool = ::arrow::default_memory_pool();
+    auto writer = ParquetFileWriter::Open(sink, schema_node, writer_properties);
+    std::unique_ptr<FileWriter> arrow_writer;
+    ASSERT_OK(FileWriter::Make(pool, std::move(writer), schema, arrow_writer_properties,
+                               &arrow_writer));
+    ASSERT_OK_NO_THROW(arrow_writer->WriteTable(*table));
+    ASSERT_OK_NO_THROW(arrow_writer->

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689156105


##
cpp/src/parquet/properties.h:
##
@@ -215,6 +234,15 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(const BloomFilterOptions& bloom_filter_options) {
+    if (bloom_filter_options.fpp >= 1.0 || bloom_filter_options.fpp <= 0.0) {
+      throw ParquetException(
+          "Bloom filter false positive probability must be in (0.0, 1.0), got " +

Review Comment:
   Per the `parquet-java` impl, the valid range is `(0.0, 1.0)`: https://github.com/apache/parquet-java/blob/aa41aa121ff29e360273675edc899ba1c2640047/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L285






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689667662


##
cpp/src/parquet/properties.h:
##
@@ -169,6 +169,37 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values (NDV) in the bloom filter.
+  ///
+  /// Bloom filters are most effective for high-cardinality columns. A good default
+  /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  /// may not be worthwhile for very small NDVs.
+  ///
+  /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+  int32_t ndv = 1 << 20;
+
+  /// False-positive probability (FPP) of the bloom filter.
+  ///
+  /// Lower FPP values require more disk and memory space. Recommended values are
+  /// 0.1, 0.05, or 0.001. Very small values are counterproductive as the bitset
+  /// may exceed the size of the actual data. Set ndv appropriately to minimize
+  /// space usage.
+  ///
+  /// Below is a table to demonstrate estimated size using common values.
+  ///
+  /// | ndv     | fpp  | bits/key | theoretical | actual (Po2) |
+  /// |:--------|:-----|:---------|:------------|:-------------|
+  /// | 100,000 | 0.10 | ~6.0     | 75 KB       | **128 KB**   |
+  /// | 100,000 | 0.01 | ~10.5    | 131 KB      | **256 KB**   |

Review Comment:
   I'm not sure how you computed those numbers? Using the formula from `OptimalNumOfBits` (`-8.0 / log(1 - pow(fpp, 1.0 / 8))`) I get 5.77 bits/key for fpp = 0.1 (and 9.68 bits/key for fpp = 0.01).



##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,264 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+TypedBloomFilterWriter<DType>::TypedBloomFilterWriter(const ColumnDescriptor* descr,
+                                                      BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::Update(const T* values, int64_t num_values) {
+  ARROW_DCHECK(bloom_filter_ != nullptr);
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<T, FLBA>) {
+      bloom_filter_->Hashes(values + i, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values + i, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void TypedBloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename DType>
+void TypedBloomFilterWriter<DType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                 const uint8_t* valid_bits,
+                                                 int64_t valid_bits_offset) {
+  ARROW_DCHECK(bloom_filter_ != nullptr);
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<T, FLBA>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


pitrou commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3748418606

   @github-actions crossbow submit -g cpp





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-14 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3748338317

   Thanks for your review @pitrou. I think I've addressed all comments and rebased on top of the main branch. I think fuzz_internal.cc has already dealt with reading bloom filter, am I missing something?





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689294027


##
cpp/src/parquet/arrow/index_test.cc:
##
@@ -0,0 +1,680 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifdef _MSC_VER
+#  pragma warning(push)
+// Disable forcing value to bool warnings
+#  pragma warning(disable : 4800)
+#endif
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "arrow/chunked_array.h"
+#include "arrow/compute/api.h"
+#include "arrow/scalar.h"
+#include "arrow/table.h"
+#include "arrow/testing/builder.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/arrow/reader.h"
+#include "parquet/arrow/reader_internal.h"
+#include "parquet/arrow/schema.h"
+#include "parquet/arrow/test_util.h"
+#include "parquet/arrow/writer.h"
+#include "parquet/bloom_filter.h"
+#include "parquet/bloom_filter_reader.h"
+#include "parquet/file_writer.h"
+#include "parquet/page_index.h"
+#include "parquet/properties.h"
+
+using arrow::Array;
+using arrow::Buffer;
+using arrow::ChunkedArray;
+using arrow::default_memory_pool;
+using arrow::Result;
+using arrow::Status;
+using arrow::Table;
+using arrow::internal::checked_cast;
+using arrow::io::BufferReader;
+
+using ParquetType = parquet::Type;
+
+using parquet::arrow::FromParquetSchema;
+using parquet::schema::GroupNode;
+using parquet::schema::NodePtr;
+using parquet::schema::PrimitiveNode;
+
+namespace parquet::arrow {
+
+namespace {
+
+struct ColumnIndexObject {
+  std::vector<bool> null_pages;
+  std::vector<std::string> min_values;
+  std::vector<std::string> max_values;
+  BoundaryOrder::type boundary_order = BoundaryOrder::Unordered;
+  std::vector<int64_t> null_counts;
+
+  ColumnIndexObject() = default;
+
+  ColumnIndexObject(const std::vector<bool>& null_pages,
+                    const std::vector<std::string>& min_values,
+                    const std::vector<std::string>& max_values,
+                    BoundaryOrder::type boundary_order,
+                    const std::vector<int64_t>& null_counts)
+      : null_pages(null_pages),
+        min_values(min_values),
+        max_values(max_values),
+        boundary_order(boundary_order),
+        null_counts(null_counts) {}
+
+  explicit ColumnIndexObject(const ColumnIndex* column_index) {
+if (column_index == nullptr) {
+  return;
+}
+null_pages = column_index->null_pages();
+min_values = column_index->encoded_min_values();
+max_values = column_index->encoded_max_values();
+boundary_order = column_index->boundary_order();
+if (column_index->has_null_counts()) {
+  null_counts = column_index->null_counts();
+}
+  }
+
+  bool operator==(const ColumnIndexObject& b) const {
+return null_pages == b.null_pages && min_values == b.min_values &&
+   max_values == b.max_values && boundary_order == b.boundary_order &&
+   null_counts == b.null_counts;
+  }
+};
+
+auto encode_int64 = [](int64_t value) {
+  return std::string(reinterpret_cast<const char*>(&value), sizeof(int64_t));
+};
+
+auto encode_double = [](double value) {
+  return std::string(reinterpret_cast<const char*>(&value), sizeof(double));
+};
+
+}  // namespace
+
+class TestingWithPageIndex {
+ public:
+  void WriteFile(const std::shared_ptr<WriterProperties>& writer_properties,
+                 const std::shared_ptr<::arrow::Table>& table) {
+    // Get schema from table.
+    auto schema = table->schema();
+    std::shared_ptr<SchemaDescriptor> parquet_schema;
+    auto arrow_writer_properties = default_arrow_writer_properties();
+    ASSERT_OK_NO_THROW(ToParquetSchema(schema.get(), *writer_properties,
+                                       *arrow_writer_properties, &parquet_schema));
+    auto schema_node = std::static_pointer_cast<GroupNode>(parquet_schema->schema_root());
+
+    // Write table to buffer.
+    auto sink = CreateOutputStream();
+    auto pool = ::arrow::default_memory_pool();
+    auto writer = ParquetFileWriter::Open(sink, schema_node, writer_properties);
+    std::unique_ptr<FileWriter> arrow_writer;
+    ASSERT_OK(FileWriter::Make(pool, std::move(writer), schema, arrow_writer_properties,
+                               &arrow_writer));
+ASSERT_OK_NO_THROW(arrow_writer->WriteTable(*table));
+ASSERT_OK_NO_THROW(arrow_writer->

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689171536



Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


pitrou commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3744567223

   This needs rebasing on git main, by the way, also I think you'll need to 
update the fuzzing harness, for example in 
https://github.com/apache/arrow/blob/d54a2051cf9020a0fdf50836420c38ad14787abb/cpp/src/parquet/arrow/fuzz_internal.cc#L160-L175
   





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2686170586


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include 
+#include 
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename ParquetType>
+TypedBloomFilterWriter<ParquetType>::TypedBloomFilterWriter(const ColumnDescriptor* descr,
+                                                            BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename ParquetType>
+void TypedBloomFilterWriter<ParquetType>::Update(const T* values, int64_t num_values) {
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }

Review Comment:
   You're specializing `Update` for `BooleanType` below, is this snippet still 
necessary?
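To make the review point concrete, here is a toy sketch (names are illustrative, not the PR's classes) showing that when an explicit full specialization exists for a type, the primary template is never instantiated for it, so an `if constexpr` guard for that type inside the primary template is dead code:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Primary template: the bool branch below can never run if an explicit
// specialization for bool exists, because the specialization always wins.
template <typename T>
std::string Update() {
  if constexpr (std::is_same_v<T, bool>) {
    return "primary bool branch";  // unreachable given the specialization
  }
  return "primary template";
}

// Explicit specialization for bool: chosen for every Update<bool>() call.
template <>
std::string Update<bool>() {
  return "explicit specialization";
}
```

Calling `Update<bool>()` always dispatches to the specialization, which is why the `if constexpr` throw in the primary template can be dropped.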



##
cpp/src/parquet/file_writer.cc:
##
@@ -456,10 +472,24 @@ class FileSerializer : public ParquetFileWriter::Contents {
 if (page_index_builder_ != nullptr) {
   // Serialize page index after all row groups have been written and report
   // location to the file metadata.
-  PageIndexLocation page_index_location;
   page_index_builder_->Finish();
-  page_index_builder_->WriteTo(sink_.get(), &page_index_location);
-  metadata_->SetPageIndexLocation(page_index_location);
+  auto write_result = page_index_builder_->WriteTo(sink_.get());
+  metadata_->SetIndexLocations(IndexKind::kColumnIndex,
+   write_result.column_index_locations);
+  metadata_->SetIndexLocations(IndexKind::kOffsetIndex,
+   write_result.offset_index_locations);
+}
+  }
+
+  void WriteBloomFilter() {
+if (bloom_filter_builder_ != nullptr) {
+  if (properties_->file_encryption_properties()) {
+ParquetException::NYI("Encryption is not supported with bloom filter");

Review Comment:
   ```suggestion
    ParquetException::NYI("Encryption is currently not supported with bloom filter");
   ```



##
cpp/src/parquet/properties.h:
##
@@ -169,6 +169,25 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  // Expected number of distinct values (NDV) in the bloom filter.
+  //
+  // Bloom filters are most effective for high-cardinality columns. A good default
+  // is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  // may not be worthwhile for very small NDVs.
+  //
+  // Increasing ndv (without increasing fpp) increases disk and memory usage.

Review Comment:
   Let's use docstring syntax (i.e. `///` instead of `//`)?
   
   Also, can we give a rough guideline of disk space per bloom filter depending on ndv and fpp?
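A rough guideline can be derived from the split-block bloom filter sizing formula in the Parquet spec. The sketch below is an approximation for intuition only; `ApproxBloomFilterBytes` is a hypothetical helper, not the library's actual `BlockSplitBloomFilter` code. It assumes k = 8 hash functions per block and rounds the byte count up to a power of two, as the C++ writer does:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate on-disk size of one split-block bloom filter.
// Bits needed for `ndv` distinct values at false-positive rate `fpp`,
// with k = 8 hashes per block (Parquet spec):
//   m = -k * ndv / ln(1 - fpp^(1/k))
// The byte count is then rounded up to a power of two.
int64_t ApproxBloomFilterBytes(int64_t ndv, double fpp) {
  constexpr double k = 8.0;
  const double bits =
      -k * static_cast<double>(ndv) / std::log(1.0 - std::pow(fpp, 1.0 / k));
  const int64_t bytes = static_cast<int64_t>(bits / 8.0);
  int64_t rounded = 32;  // minimum: one 32-byte block
  while (rounded < bytes) rounded <<= 1;
  return rounded;
}
```

For example, this estimates roughly 128 KiB for ndv = 100,000 at fpp = 0.1, and about 2 MiB for ndv = 1,000,000 at fpp = 0.01.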



##
cpp/src/parquet/column_writer.cc:
##
@@ -2660,32 +2689,38 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
     encoding = properties->dictionary_index_encoding();
   }
   switch (descr->physical_type()) {
-    case Type::BOOLEAN:
+    case Type::BOOLEAN: {
+      if (bloom_filter != nullptr) {
+        throw ParquetException("Bloom filter is not supported for boolean type");
+      }
       return std::make_shared<TypedColumnWriterImpl<BooleanType>>(
-          metadata, std::move(pager), use_dictionary, encoding, properties);
+          metadata, std::move(pager), use_dictionary, encoding, properties,
+          /*bloom_filter=*/nullptr);
+    }
     case Type::INT32:
       return std::make_shared<TypedColumnWriterImpl<Int32Type>>(
-          metadata, std::move(pager), use_dictionary, encoding, properties);
+          metadata, std::move(pager), use_dictionary, encoding, properties, bloom_filter);
     c

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


pitrou commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3743166016

   Hmm, I'm really sorry. I will make some time to review it quickly, but I'm 
afraid it's probably too late for this release, as feature freeze is already in 
full force and @raulcd has started producing an RC :(





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2026-01-13 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3743130432

   Gentle ping @pitrou. Is it possible to include it in the next release?





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-16 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3659250459

   Gentle ping @pitrou 





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-06 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593144411


##
cpp/src/parquet/column_writer.cc:
##
@@ -1867,6 +1876,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
 chunk_geospatial_statistics_->Update(values, num_values);
   }
 }
+bloom_filter_writer_->Update(values, num_values);

Review Comment:
   Yes, bloom_filter_writer_ internally checks whether this column has the bloom 
filter enabled and is a no-op if not. I've changed it to let the column writer 
check this.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-06 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2595946370


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include 
+#include 
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename ParquetType>
+BloomFilterWriter<ParquetType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                                  BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename ParquetType>
+bool BloomFilterWriter<ParquetType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<ParquetType, FLBAType>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                  const uint8_t* valid_bits,
+                                                  int64_t valid_bits_offset) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<ParquetType, FLBAType>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i + position, batch_size, hashes.data());
+          }
+          bloom_filter_->InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::UpdateSpaced(const bool*, int64_t, const uint8_t*,
+                                                  int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const ::arrow::Array& values) {
+  ::arrow::Unreachable("Update for non-ByteArray type should be unreachable");
+}
+
+namespace {
+
+template <typename ArrayType>
+void UpdateBinaryBloomFilter(BloomFilter& bloom_filter, const ArrayType& array) {
+  // Using a small batch size because an extra `byte_arrays` is used.
+  constexpr int64_t kBinaryHashBatchSize = 64;
+  std::array<ByteArray, kBinaryHashBatchSize> byte_arrays;
+  std::array<uint64_t, kBinaryHashBatchSize> hashes;
+
+  auto batch_insert_hashes = [&](int count) {
+if (count > 0) {
+  bloom_filter.Hashes(byte_arrays.data(), count, hashes.data());
+  bloom_filter.InsertHashes(hashes.data(), count);
+}
+  };
+
+  int batch_idx = 0;
+  ::arrow::internal::VisitSetBitRunsVoid(
+  array.null_bitmap_data(), array.offset(), array.length(),
+  [&](int64_t position, int64_t run_length) {
+for (int64_t i = 0; i < run_length; ++i) {
+  byte_arrays[batch_idx++] = array.GetView(position + i);
+  if (batch_i

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-06 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3619881409

   @pitrou Thanks for your review! I've addressed all your comments. Please 
take a look again.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-06 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2594642965


##
cpp/src/parquet/page_index.h:
##
@@ -378,9 +378,11 @@ class PARQUET_EXPORT PageIndexBuilder {
   /// are set.
   ///
   /// \param[out] sink The output stream to write the page index.
-  /// \param[out] location The location of all page index to the start of sink.
+  /// \param[out] column_index_location The location of all column indexes.
+  /// \param[out] offset_index_location The location of all offset indexes.

Review Comment:
   Yes, this looks better. I've changed both bloom filter builder and page 
index builder to adopt this.
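The refactor described above can be sketched as returning a small result struct from the write call instead of filling output parameters; the names (`IndexWriteResult`, `WriteIndexes`) are illustrative, not the PR's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returning a result struct keeps call sites single-expression and makes
// the two location lists impossible to mix up, compared with two out-params.
struct IndexWriteResult {
  std::vector<int64_t> column_index_locations;
  std::vector<int64_t> offset_index_locations;
};

IndexWriteResult WriteIndexes() {
  IndexWriteResult result;
  result.column_index_locations = {0, 128};   // pretend byte offsets in the sink
  result.offset_index_locations = {256, 512};
  return result;  // NRVO: no copy in practice
}
```

The caller then reads both lists off one value, e.g. `auto r = WriteIndexes();`.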






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593028729



Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593301025


##
cpp/src/parquet/bloom_filter_writer.h:
##
@@ -0,0 +1,105 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/type_fwd.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/type_fwd.h"
+
+namespace parquet {
+
+/// \brief Writer for updating a bloom filter with values of a specific Parquet type.
+template <typename ParquetType>
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  using T = typename ParquetType::c_type;
+
+  /// \param descr The descriptor of the column to write. Must outlive this writer.
+  /// \param bloom_filter The bloom filter to update. If nullptr, this writer does not
+  /// enable bloom filter. Otherwise, the input bloom filter must outlive this writer.
+  BloomFilterWriter(const ColumnDescriptor* descr, BloomFilter* bloom_filter);
+
+  /// \brief Update the bloom filter with typed values.
+  ///
+  /// \param values The values to update the bloom filter with.
+  /// \param num_values The number of values to update the bloom filter with.
+  void Update(const T* values, int64_t num_values);
+
+  /// \brief Update the bloom filter with typed values that have spaces.
+  ///
+  /// \param values The values to update the bloom filter with.
+  /// \param num_values The number of values to update the bloom filter with.
+  /// \param valid_bits The validity bitmap of the values.
+  /// \param valid_bits_offset The offset of the validity bitmap.
+  void UpdateSpaced(const T* values, int64_t num_values, const uint8_t* valid_bits,
+                    int64_t valid_bits_offset);
+
+  /// \brief Update the bloom filter with an Arrow array.
+  ///
+  /// \param values The Arrow array to update the bloom filter with.
+  void Update(const ::arrow::Array& values);
+
+  /// \brief Check if this writer has enabled the bloom filter.
+  bool bloom_filter_enabled() const;
+
+ private:
+  const ColumnDescriptor* descr_;
+  BloomFilter* bloom_filter_;
+};
+
+/// \brief Interface for building bloom filters of a parquet file.
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  virtual ~BloomFilterBuilder() = default;
+
+  /// \brief API to create a BloomFilterBuilder.
+  ///
+  /// \param schema The schema of the file and it must outlive the created builder.
+  /// \param properties Properties to get bloom filter options. It must outlive the
+  /// created builder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// \brief Start a new row group to write bloom filters, meaning that next calls
+  /// to `GetOrCreateBloomFilter` will create bloom filters for the new row group.
+  ///
+  /// \throws ParquetException if `WriteTo()` has been called already.
+  virtual void AppendRowGroup() = 0;
+
+  /// \brief Get or create a BloomFilter from the column ordinal of the current row group.

Review Comment:
   I've simplified it to only create a new bloom filter and throw if one already exists.
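The "create once, throw on duplicate" policy described here can be sketched with `std::map::try_emplace`; the names (`FilterRegistry`, `CreateFilter`, `DummyFilter`) are illustrative, not the PR's actual BloomFilterBuilder API:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>

struct DummyFilter {};  // stand-in for a real bloom filter object

class FilterRegistry {
 public:
  DummyFilter* CreateFilter(int column_ordinal) {
    auto [it, inserted] =
        filters_.try_emplace(column_ordinal, std::make_unique<DummyFilter>());
    if (!inserted) {
      // A second creation request for the same column is a caller bug.
      throw std::runtime_error("bloom filter already exists for this column");
    }
    return it->second.get();
  }

 private:
  std::map<int, std::unique_ptr<DummyFilter>> filters_;
};
```

Compared with a get-or-create API, this surfaces accidental double-creation immediately instead of silently reusing state.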






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593028729


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename ParquetType>
+BloomFilterWriter<ParquetType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                                  BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename ParquetType>
+bool BloomFilterWriter<ParquetType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<T, FLBA>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                  const uint8_t* valid_bits,
+                                                  int64_t valid_bits_offset) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<T, FLBA>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i + position, batch_size, hashes.data());
+          }
+          bloom_filter_->InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::UpdateSpaced(const bool*, int64_t, const uint8_t*,
+                                                  int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const ::arrow::Array& values) {
+  ::arrow::Unreachable("Update for non-ByteArray type should be unreachable");
+}
+
+namespace {
+
+template <typename ArrayType>
+void UpdateBinaryBloomFilter(BloomFilter& bloom_filter, const ArrayType& array) {
+  // Using a small batch size because an extra `byte_arrays` is used.
+  constexpr int64_t kBinaryHashBatchSize = 64;
+  std::array<ByteArray, kBinaryHashBatchSize> byte_arrays;
+  std::array<uint64_t, kBinaryHashBatchSize> hashes;
+
+  auto batch_insert_hashes = [&](int count) {
+if (count > 0) {
+  bloom_filter.Hashes(byte_arrays.data(), count, hashes.data());
+  bloom_filter.InsertHashes(hashes.data(), count);
+}
+  };
+
+  int batch_idx = 0;
+  ::arrow::internal::VisitSetBitRunsVoid(
+  array.null_bitmap_data(), array.offset(), array.length(),
+  [&](int64_t position, int64_t run_length) {
+for (int64_t i = 0; i < run_length; ++i) {
+  byte_arrays[batch_idx++] = array.GetView(position + i);
+  if (batch_i

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593225569


##
cpp/src/parquet/bloom_filter_writer.h:
##
@@ -0,0 +1,105 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/type_fwd.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/type_fwd.h"
+
+namespace parquet {
+
+/// \brief Writer for updating a bloom filter with values of a specific 
Parquet type.
+template <typename ParquetType>
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  using T = typename ParquetType::c_type;
+
+  /// \param descr The descriptor of the column to write. Must outlive this 
writer.
+  /// \param bloom_filter The bloom filter to update. If nullptr, this writer 
does not
+  /// enable bloom filter. Otherwise, the input bloom filter must outlive this 
writer.
+  BloomFilterWriter(const ColumnDescriptor* descr, BloomFilter* bloom_filter);
+
+  /// \brief Update the bloom filter with typed values.
+  ///
+  /// \param values The values to update the bloom filter with.
+  /// \param num_values The number of values to update the bloom filter with.
+  void Update(const T* values, int64_t num_values);
+
+  /// \brief Update the bloom filter with typed values that have spaces.
+  ///
+  /// \param values The values to update the bloom filter with.
+  /// \param num_values The number of values to update the bloom filter with.
+  /// \param valid_bits The validity bitmap of the values.
+  /// \param valid_bits_offset The offset of the validity bitmap.
+  void UpdateSpaced(const T* values, int64_t num_values, const uint8_t* 
valid_bits,
+int64_t valid_bits_offset);
+
+  /// \brief Update the bloom filter with an Arrow array.
+  ///
+  /// \param values The Arrow array to update the bloom filter with.
+  void Update(const ::arrow::Array& values);
+
+  /// \brief Check if this writer has enabled the bloom filter.
+  bool bloom_filter_enabled() const;
+
+ private:
+  const ColumnDescriptor* descr_;
+  BloomFilter* bloom_filter_;
+};
+
+/// \brief Interface for building bloom filters of a parquet file.
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  virtual ~BloomFilterBuilder() = default;
+
+  /// \brief API to create a BloomFilterBuilder.
+  ///
+  /// \param schema The schema of the file and it must outlive the created 
builder.
+  /// \param properties Properties to get bloom filter options. It must 
outlive the
+  /// created builder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// \brief Start a new row group to write bloom filters, meaning that next 
calls
+  /// to `GetOrCreateBloomFilter` will create bloom filters for the new row 
group.
+  ///
+  /// \throws ParquetException if `WriteTo()` has been called already.
+  virtual void AppendRowGroup() = 0;
+
+  /// \brief Get or create a BloomFilter from the column ordinal of the 
current row group.
+  ///
+  /// \param column_ordinal Column ordinal for the bloom filter.
+  /// \return designated BloomFilter whose ownership belongs to the builder.

Review Comment:
   Makes sense. Let me add it.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593174390


##
cpp/src/parquet/column_writer_test.cc:
##
@@ -2380,5 +2382,86 @@ TYPED_TEST(TestColumnWriterMaxRowsPerPage, 
RequiredLargeChunk) {
   }
 }
 
+template <typename TestType>
+class TestBloomFilterWriter : public TestPrimitiveWriter<TestType> {
+ public:
+  void SetUp() override {
+TestPrimitiveWriter::SetUp();
+builder_ = nullptr;
+bloom_filter_ = nullptr;
+  }
+
+  std::shared_ptr<TypedColumnWriter<TestType>> BuildWriter(
+  int64_t output_size, const ColumnProperties& column_properties) {
+this->sink_ = CreateOutputStream();
+
+WriterProperties::Builder properties_builder;
+if (column_properties.encoding() == Encoding::PLAIN_DICTIONARY ||
+column_properties.encoding() == Encoding::RLE_DICTIONARY) {
+  properties_builder.enable_dictionary();
+} else {
+  properties_builder.disable_dictionary();
+  properties_builder.encoding(column_properties.encoding());
+}
+auto path = this->schema_.Column(0)->path();
+if (column_properties.bloom_filter_enabled()) {
+  
properties_builder.enable_bloom_filter(*column_properties.bloom_filter_options(),
+ path);
+} else {
+  properties_builder.disable_bloom_filter(path);
+}
+this->writer_properties_ = properties_builder.build();
+
+this->metadata_ =
+ColumnChunkMetaDataBuilder::Make(this->writer_properties_, 
this->descr_);
+auto pager = PageWriter::Open(this->sink_, column_properties.compression(),
+  this->metadata_.get());
+
+builder_ = BloomFilterBuilder::Make(&this->schema_, 
this->writer_properties_.get());
+builder_->AppendRowGroup();
+bloom_filter_ = builder_->GetOrCreateBloomFilter(/*column_ordinal=*/0);
+
+    return std::static_pointer_cast<TypedColumnWriter<TestType>>(
+        ColumnWriter::Make(this->metadata_.get(), std::move(pager),
+                           this->writer_properties_.get(), bloom_filter_));
+  }
+
+  std::unique_ptr<BloomFilterBuilder> builder_;
+  BloomFilter* bloom_filter_{nullptr};

Review Comment:
   Yes, let me add a comment.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593144411


##
cpp/src/parquet/column_writer.cc:
##
@@ -1867,6 +1876,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
 chunk_geospatial_statistics_->Update(values, num_values);
   }
 }
+bloom_filter_writer_->Update(values, num_values);

Review Comment:
   Yes, bloom_filter_writer_ internally checks whether this column has enabled the
bloom filter and is a no-op if not. If you prefer to move the internal check out
and check whether bloom_filter_writer_ is nullptr here, I can make such a change.
   
   After double-checking, I think it is better to make this change, with the benefit
that the explicit `template class TypedBloomFilterWriter;` instantiation can be removed.
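The convention under discussion can be sketched with a toy updater (hypothetical names, not the Parquet C++ classes): the updater holds a possibly-null filter and performs the null check internally, so every call site stays unconditional.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-in for the typed bloom-filter updater being discussed.
// Style A: Update() is a no-op when no filter is attached, so callers never
// need to guard the call with `if (updater != nullptr)`.
class NoOpCheckingUpdater {
 public:
  explicit NoOpCheckingUpdater(std::unordered_set<int64_t>* filter)
      : filter_(filter) {}

  void Update(const int64_t* values, int64_t num_values) {
    if (filter_ == nullptr) return;  // internal check: disabled column is a no-op
    for (int64_t i = 0; i < num_values; ++i) filter_->insert(values[i]);
  }

 private:
  std::unordered_set<int64_t>* filter_;  // non-owning; nullptr means disabled
};
```

The alternative (checking `bloom_filter_writer_ != nullptr` at each call site) trades one extra branch per caller for the ability to drop the updater object entirely when the feature is off.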








Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593098315


##
cpp/src/parquet/column_writer.cc:
##
@@ -1976,17 +1990,18 @@ Status 
TypedColumnWriterImpl::WriteArrowDictionary(
  &exec_ctx));
   referenced_dictionary = referenced_dictionary_datum.make_array();
 }
-
-int64_t non_null_count = chunk_indices->length() - 
chunk_indices->null_count();
-page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);
-page_statistics_->IncrementNumValues(non_null_count);
-page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
-
+if (page_statistics_ != nullptr) {

Review Comment:
   I think we should remove the `DCHECK` because it is totally valid to enable the
bloom filter but disable column stats.
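The bookkeeping in the quoted hunk is plain arithmetic: non-null values are the index count minus the index null count, and the nulls credited to the page are the remaining levels. A hedged restatement with assumed names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the quoted dictionary-path stats bookkeeping:
// levels minus non-null indices gives the null count credited to the page.
struct PageCounts {
  int64_t nulls;
  int64_t values;
};

PageCounts CountForDictionaryPage(int64_t num_chunk_levels, int64_t indices_length,
                                  int64_t indices_null_count) {
  const int64_t non_null = indices_length - indices_null_count;
  return PageCounts{num_chunk_levels - non_null, non_null};
}
```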






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593090285


##
cpp/src/parquet/column_writer.cc:
##
@@ -2003,7 +2018,7 @@ Status 
TypedColumnWriterImpl::WriteArrowDictionary(
   AddIfNotNull(rep_levels, offset));
 std::shared_ptr writeable_indices =
 indices->Slice(value_offset, batch_num_spaced_values);
-if (page_statistics_) {
+if (page_statistics_ || bloom_filter_writer_->bloom_filter_enabled()) {

Review Comment:
   That is because `update_stats` is reused for preparing the dictionary values.
Perhaps we should rename `update_stats` to `update_index` if we regard all of them
as indexes?






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593062048


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename ParquetType>
+BloomFilterWriter<ParquetType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                                  BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename ParquetType>
+bool BloomFilterWriter<ParquetType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<T, FLBA>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                  const uint8_t* valid_bits,
+                                                  int64_t valid_bits_offset) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<T, FLBA>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i + position, batch_size, hashes.data());
+          }
+          bloom_filter_->InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::UpdateSpaced(const bool*, int64_t, const uint8_t*,
+                                                  int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const ::arrow::Array& values) {
+  ::arrow::Unreachable("Update for non-ByteArray type should be unreachable");

Review Comment:
   Because this function is only called by `TypedColumnWriterImpl::WriteArrowDense`,
I think it is better to throw a not-implemented exception.
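The quoted implementation hashes values in fixed-size batches into a stack buffer before inserting them. A self-contained sketch of that batching pattern, using a toy filter in place of parquet::BloomFilter (all names below are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy filter with the same batch-oriented surface as the quoted BloomFilter:
// Hashes() fills a caller-provided buffer, InsertHashes() consumes it.
class ToyFilter {
 public:
  void Hashes(const int64_t* values, int num, uint64_t* out) const {
    for (int i = 0; i < num; ++i) {
      out[i] = static_cast<uint64_t>(values[i]) * 0x9E3779B97F4A7C15ULL;
    }
  }
  void InsertHashes(const uint64_t* hashes, int num) {
    for (int i = 0; i < num; ++i) set_.insert(hashes[i]);
  }
  bool MightContain(int64_t v) const {
    return set_.count(static_cast<uint64_t>(v) * 0x9E3779B97F4A7C15ULL) > 0;
  }

 private:
  std::unordered_set<uint64_t> set_;  // exact set: a stand-in, not a real filter
};

constexpr int64_t kHashBatchSize = 256;

// Batched update loop in the style of the quoted Update(): hash at most
// kHashBatchSize values into a stack buffer, then insert, so no per-value
// heap allocation occurs.
void BatchedUpdate(ToyFilter& filter, const std::vector<int64_t>& values) {
  std::array<uint64_t, kHashBatchSize> hashes;
  const int64_t num_values = static_cast<int64_t>(values.size());
  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
    const int batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
    filter.Hashes(values.data() + i, batch_size, hashes.data());
    filter.InsertHashes(hashes.data(), batch_size);
  }
}
```

Note the sketch advances the value pointer by `i` each iteration; the reviewer's "Good catch!" elsewhere in this thread refers to exactly that offset being missing in an earlier revision of the diff.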






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2593041434


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include <algorithm>
+#include <array>
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename ParquetType>
+BloomFilterWriter<ParquetType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                                  BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename ParquetType>
+bool BloomFilterWriter<ParquetType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<T, bool>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<T, FLBA>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());
+    } else {
+      bloom_filter_->Hashes(values, batch_size, hashes.data());
+    }
+    bloom_filter_->InsertHashes(hashes.data(), batch_size);
+  }
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::Update(const bool*, int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::UpdateSpaced(const T* values, int64_t num_values,
+                                                  const uint8_t* valid_bits,
+                                                  int64_t valid_bits_offset) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto batch_size = static_cast<int>(std::min(kHashBatchSize, length - i));
+          if constexpr (std::is_same_v<T, FLBA>) {
+            bloom_filter_->Hashes(values + i + position, descr_->type_length(),
+                                  batch_size, hashes.data());
+          } else {
+            bloom_filter_->Hashes(values + i + position, batch_size, hashes.data());
+          }
+          bloom_filter_->InsertHashes(hashes.data(), batch_size);
+        }
+      });
+}
+
+template <>
+void BloomFilterWriter<BooleanType>::UpdateSpaced(const bool*, int64_t, const uint8_t*,
+                                                  int64_t) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+  throw ParquetException("Bloom filter is not supported for boolean type");
+}
+
+template <typename ParquetType>
+void BloomFilterWriter<ParquetType>::Update(const ::arrow::Array& values) {
+  ::arrow::Unreachable("Update for non-ByteArray type should be unreachable");
+}
+
+namespace {
+
+template <typename ArrayType>
+void UpdateBinaryBloomFilter(BloomFilter& bloom_filter, const ArrayType& array) {
+  // Using a small batch size because an extra `byte_arrays` is used.
+  constexpr int64_t kBinaryHashBatchSize = 64;
+  std::array<ByteArray, kBinaryHashBatchSize> byte_arrays;
+  std::array<uint64_t, kBinaryHashBatchSize> hashes;
+
+  auto batch_insert_hashes = [&](int count) {
+if (count > 0) {
+  bloom_filter.Hashes(byte_arrays.data(), count, hashes.data());
+  bloom_filter.InsertHashes(hashes.data(), count);
+}
+  };
+
+  int batch_idx = 0;
+  ::arrow::internal::VisitSetBitRunsVoid(
+  array.null_bitmap_data(), array.offset(), array.length(),
+  [&](int64_t position, int64_t run_length) {
+for (int64_t i = 0; i < run_length; ++i) {

Review Comment:
   Ah, I didn't notice that difference since I took over this PR half wa

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2592912114


##
cpp/src/parquet/metadata.h:
##
@@ -531,8 +523,8 @@ class PARQUET_EXPORT FileMetaDataBuilder {
   // The prior RowGroupMetaDataBuilder (if any) is destroyed
   RowGroupMetaDataBuilder* AppendRowGroup();
 
-  // Update location to all page indexes in the parquet file
-  void SetPageIndexLocation(const PageIndexLocation& location);

Review Comment:
   It is not easy to keep the old API without significant changes. I think this
function is very purpose-built for the writer internals and not very helpful to
downstream users.






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2592872002


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include 
+#include 
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriter<DType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                            BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriter<DType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriter<DType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<DType, BooleanType>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());

Review Comment:
   Good catch!






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2585255072


##
cpp/src/parquet/metadata.h:
##
@@ -531,8 +523,8 @@ class PARQUET_EXPORT FileMetaDataBuilder {
   // The prior RowGroupMetaDataBuilder (if any) is destroyed
   RowGroupMetaDataBuilder* AppendRowGroup();
 
-  // Update location to all page indexes in the parquet file
-  void SetPageIndexLocation(const PageIndexLocation& location);

Review Comment:
   This is removing an existing API without even deprecating it. Is nobody 
using it?



##
cpp/src/parquet/metadata.h:
##
@@ -505,19 +504,12 @@ class PARQUET_EXPORT RowGroupMetaDataBuilder {
   std::unique_ptr impl_;
 };
 
-/// \brief Public struct for location to all page indexes in a parquet file.
-struct PageIndexLocation {
-  /// Alias type of page index location of a row group. The index location
-  /// is located by column ordinal. If the column does not have the page index,
-  /// its value is set to std::nullopt.
-  using RowGroupIndexLocation = std::vector<std::optional<IndexLocation>>;
-  /// Alias type of page index location of a parquet file. The index location
-  /// is located by the row group ordinal.
-  using FileIndexLocation = std::map<size_t, RowGroupIndexLocation>;
-  /// Row group column index locations which uses row group ordinal as the key.
-  FileIndexLocation column_index_location;
-  /// Row group offset index locations which uses row group ordinal as the key.
-  FileIndexLocation offset_index_location;
+/// \brief Locations of indexes for all row groups and columns.
+struct PARQUET_EXPORT IndexLocations {
+  enum class IndexType : uint8_t { kColumnIndex, kOffsetIndex, kBloomFilter };
+
+  IndexType type;
+  std::map<int32_t, std::map<int32_t, IndexLocation>> locations;

Review Comment:
   Can we please remove the horrible `std::map` from these APIs?
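   Since row-group and column ordinals are dense, one map-free shape for these
locations could be nested vectors with `nullopt` for absent entries — a sketch
only, with illustrative names, not the API the PR finally adopted:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <optional>
   #include <vector>

   // Hypothetical map-free layout: locations addressed by row-group ordinal
   // (outer vector) and column ordinal (inner vector).
   struct IndexLocation {
     int64_t offset;
     int32_t length;
   };

   using RowGroupLocations = std::vector<std::optional<IndexLocation>>;
   using FileLocations = std::vector<RowGroupLocations>;

   int main() {
     FileLocations locations(2, RowGroupLocations(3));  // 2 row groups, 3 columns
     locations[1][0] = IndexLocation{1024, 64};
     assert(!locations[0][2].has_value());  // column without an index
     assert(locations[1][0]->offset == 1024);
     return 0;
   }
   ```

   Ordinal-indexed vectors avoid the per-lookup tree traversal and the sparse
`std::map` API at the cost of allocating slots for absent entries.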



##
cpp/src/parquet/CMakeLists.txt:
##
@@ -408,6 +409,8 @@ add_parquet_test(arrow-reader-writer-test
  arrow/arrow_statistics_test.cc
  arrow/variant_test.cc)
 
+add_parquet_test(arrow-parquet-index-test SOURCES arrow/arrow_parquet_index_test.cc)

Review Comment:
   Can we just call the source file `index_test.cc`? It's annoying to have 
these `arrow` and `parquet` prefixes.



##
cpp/src/parquet/column_writer.cc:
##
@@ -1293,6 +1297,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
 current_dict_encoder_ =
 dynamic_cast<DictEncoder<DType>*>(current_encoder_.get());
 
+bloom_filter_writer_ =
+std::make_unique<BloomFilterWriter<DType>>(metadata->descr(), bloom_filter);

Review Comment:
   Why not pass `descr_` here, is it different from `metadata->descr()`?



##
cpp/src/parquet/column_writer.cc:
##
@@ -1976,17 +1990,18 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
  &exec_ctx));
   referenced_dictionary = referenced_dictionary_datum.make_array();
 }
-
-int64_t non_null_count = chunk_indices->length() - chunk_indices->null_count();
-page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);
-page_statistics_->IncrementNumValues(non_null_count);
-page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
-
+if (page_statistics_ != nullptr) {

Review Comment:
   Why this change? You've just `DCHECK`ed that the pointer was not null.



##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include 
+#include 
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriter<DType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                            BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriter<DType>::bloom_filter_enabled() const

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-12-01 Thread via GitHub


XiangCao1998 commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2579791687


##
cpp/src/parquet/bloom_filter_writer.cc:
##
@@ -0,0 +1,305 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer.h"
+
+#include 
+#include 
+
+#include "arrow/array.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+#include "parquet/schema.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriter<DType>::BloomFilterWriter(const ColumnDescriptor* descr,
+                                            BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriter<DType>::bloom_filter_enabled() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriter<DType>::Update(const T* values, int64_t num_values) {
+  if (!bloom_filter_enabled()) {
+    return;
+  }
+
+  if constexpr (std::is_same_v<DType, BooleanType>) {
+    throw ParquetException("Bloom filter is not supported for boolean type");
+  }
+
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    auto batch_size = static_cast<int>(std::min(kHashBatchSize, num_values - i));
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      bloom_filter_->Hashes(values, descr_->type_length(), batch_size, hashes.data());

Review Comment:
   It should be `values + i`; otherwise, values beyond index 256 (the first hash
batch) won't be inserted into the bloom filter.
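
   The off-by-one-batch bug is easy to demonstrate with a self-contained sketch
(the identity "hasher" and tiny batch size are stand-ins, not Arrow's API):
without advancing the input pointer by `i`, every batch re-hashes the first
batch's values.

   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Stand-in for BloomFilter::Hashes: "hashing" is the identity so we can
   // observe exactly which input values each batch touched.
   static void FakeHashes(const int64_t* values, int n, uint64_t* out) {
     for (int j = 0; j < n; ++j) out[j] = static_cast<uint64_t>(values[j]);
   }

   // Batched update over `values`; `advance` toggles the fixed vs buggy pointer.
   static std::vector<uint64_t> BatchedUpdate(const std::vector<int64_t>& values,
                                              bool advance) {
     constexpr int64_t kHashBatchSize = 4;  // tiny batch to expose the bug
     uint64_t hashes[kHashBatchSize];
     std::vector<uint64_t> inserted;
     for (int64_t i = 0; i < static_cast<int64_t>(values.size());
          i += kHashBatchSize) {
       int n = static_cast<int>(
           std::min<int64_t>(kHashBatchSize, values.size() - i));
       const int64_t* batch = advance ? values.data() + i : values.data();
       FakeHashes(batch, n, hashes);
       inserted.insert(inserted.end(), hashes, hashes + n);
     }
     return inserted;
   }

   int main() {
     std::vector<int64_t> values = {0, 1, 2, 3, 4, 5, 6, 7};
     assert(BatchedUpdate(values, /*advance=*/true).back() == 7);   // fixed
     assert(BatchedUpdate(values, /*advance=*/false).back() == 3);  // buggy
     return 0;
   }
   ```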






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-26 Thread via GitHub


pitrou commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3581897606

   Yes, I should do that. Sorry!





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-26 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3581893164

   As I have made significant changes to this PR, I don't think I have the 
privilege to approve it. Could you please take a look when you have time? 
@pitrou 





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-26 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3581872344

   Thanks for the review, @HuaHuaY. I have addressed all your comments.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-26 Thread via GitHub


HuaHuaY commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3581050265

   #48170 is merged, so this PR also needs to modify
`CopyColumnSpecificProperties` in `cpp/src/parquet/properties.cc`.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-25 Thread via GitHub


HuaHuaY commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2562992719


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -159,6 +159,7 @@ set(PARQUET_SRCS
 arrow/variant_internal.cc
 arrow/writer.cc
 bloom_filter.cc
+bloom_filter_writer.cc

Review Comment:
   We'd better keep the files sorted by filename.



##
cpp/src/parquet/page_index_test.cc:
##
@@ -839,28 +841,38 @@ class PageIndexBuilderTest : public ::testing::Test {
 
  protected:
   std::unique_ptr<ColumnIndex> ReadColumnIndex(int row_group, int column) {
-auto location = page_index_location_.column_index_location[row_group][column];
-if (!location.has_value()) {
+auto row_group_location = column_index_location_.locations.find(row_group);
+if (row_group_location == column_index_location_.locations.end()) {
+  return nullptr;
+}
+auto column_location = row_group_location->second.find(column);
+if (column_location == row_group_location->second.end()) {
   return nullptr;
 }
 auto properties = default_reader_properties();
-return ColumnIndex::Make(*schema_.Column(column), buffer_->data() + location->offset,
- static_cast<uint32_t>(location->length), properties);
+return ColumnIndex::Make(
+*schema_.Column(column), buffer_->data() + column_location->second.offset,
+static_cast<uint32_t>(column_location->second.length), properties);
   }
 
   std::unique_ptr<OffsetIndex> ReadOffsetIndex(int row_group, int column) {
-auto location = page_index_location_.offset_index_location[row_group][column];
-if (!location.has_value()) {
+auto row_group_location = offset_index_location_.locations.find(row_group);
+if (row_group_location == offset_index_location_.locations.end()) {
+  return nullptr;
+}
+auto column_location = row_group_location->second.find(column);
+if (column_location == row_group_location->second.end()) {
   return nullptr;
 }
 auto properties = default_reader_properties();
-return OffsetIndex::Make(buffer_->data() + location->offset,
- static_cast<uint32_t>(location->length), properties);
+return OffsetIndex::Make(buffer_->data() + column_location->second.offset,
+ static_cast<uint32_t>(column_location->second.length),
+ properties);
   }
 
   SchemaDescriptor schema_;
   std::shared_ptr<Buffer> buffer_;
-  PageIndexLocation page_index_location_;
+  IndexLocations column_index_location_, offset_index_location_;

Review Comment:
   I think defining only one variable per line is better style?



##
cpp/src/parquet/CMakeLists.txt:
##
@@ -372,8 +373,8 @@ set_source_files_properties(public_api_test.cc PROPERTIES 
SKIP_UNITY_BUILD_INCLU
 
 add_parquet_test(internals-test
  SOURCES
- bloom_filter_reader_test.cc
  bloom_filter_test.cc
+ bloom_filter_reader_writer_test.cc

Review Comment:
   ditto






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-21 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3562099621

   I've rebased this PR and addressed all previous comments (including my 
posts). Now I think it is ready to review again. @pitrou @mapleFU @adamreeve 
@emkornfield @HuaHuaY 





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-11-20 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3556894757

   This PR has been stale for a long time. Let me revive it.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-09-28 Thread via GitHub


HuaHuaY commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2384812758


##
cpp/src/parquet/xxhasher.h:
##
@@ -34,6 +34,7 @@ class PARQUET_EXPORT XxHasher : public Hasher {
   uint64_t Hash(const Int96* value) const override;
   uint64_t Hash(const ByteArray* value) const override;
   uint64_t Hash(const FLBA* val, uint32_t len) const override;
+  uint64_t Hash(std::string_view value) const override;

Review Comment:
   When will we call this overloaded function? It seems that `std::string_view` 
is not a value type of `TypedColumnWriter`.









Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-09-28 Thread via GitHub


HuaHuaY commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2384813548


##
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##
@@ -69,9 +69,10 @@
 #include "parquet/arrow/schema.h"
 #include "parquet/arrow/test_util.h"
 #include "parquet/arrow/writer.h"
+#include "parquet/bloom_filter.h"
+#include "parquet/bloom_filter_reader.h"

Review Comment:
   ```suggestion
   ```



##
cpp/src/parquet/xxhasher.h:
##
@@ -34,6 +34,7 @@ class PARQUET_EXPORT XxHasher : public Hasher {
   uint64_t Hash(const Int96* value) const override;
   uint64_t Hash(const ByteArray* value) const override;
   uint64_t Hash(const FLBA* val, uint32_t len) const override;
+  uint64_t Hash(std::string_view value) const override;

Review Comment:
   When will we call this overloaded function?






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-08-24 Thread via GitHub


mapleFU commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3217969931

   I'll move forward next week





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-08-23 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3217870214

   Should we revive this?





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-08-07 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2255905490


##
cpp/src/parquet/properties.h:
##
@@ -214,6 +240,17 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(std::optional<BloomFilterOptions> bloom_filter_options) {
+    if (bloom_filter_options) {
+      if (bloom_filter_options->fpp >= 1.0 || bloom_filter_options->fpp <= 0.0) {

Review Comment:
   Should we hide the logic by adding `Status BloomFilterOptions::Validate()` 
and call it here?
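
   A minimal sketch of what such a validation hook could look like (the name
`Validate` and the throwing behavior are assumptions; the real API might return
`arrow::Status` instead):

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <stdexcept>
   #include <string>

   // Hypothetical standalone version of the options struct with the suggested
   // self-validation hook, so every setter can simply delegate to it.
   struct BloomFilterOptions {
     int32_t ndv = 1024 * 1024;
     double fpp = 0.05;

     void Validate() const {
       if (!(fpp > 0.0 && fpp < 1.0)) {
         throw std::invalid_argument(
             "bloom filter fpp must be in (0.0, 1.0), got " + std::to_string(fpp));
       }
       if (ndv <= 0) {
         throw std::invalid_argument("bloom filter ndv must be positive");
       }
     }
   };

   int main() {
     BloomFilterOptions ok;
     ok.Validate();  // default options are valid
     BloomFilterOptions bad;
     bad.fpp = 1.5;
     bool threw = false;
     try {
       bad.Validate();
     } catch (const std::invalid_argument&) {
       threw = true;
     }
     assert(threw);  // out-of-range fpp is rejected
     return 0;
   }
   ```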



##
cpp/src/parquet/properties.h:
##
@@ -167,6 +167,32 @@ static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOM
 static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
+static constexpr int32_t DEFAULT_BLOOM_FILTER_NDV = 1024 * 1024;
+static constexpr double DEFAULT_BLOOM_FILTER_FPP = 0.05;
+
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values to be inserted into the bloom filter.
+  ///
+  /// Usage of bloom filter is most beneficial for columns with large 
cardinality,
+  /// so a good heuristic is to set ndv to number of rows. However, it can 
reduce
+  /// disk size if you know in advance a smaller number of distinct values.
+  /// For very small ndv value it is probably not worth it to use bloom filter 
anyway.
+  ///
+  /// Increasing this value (without increasing fpp) will result in an 
increase in
+  /// disk or memory size.
+  int32_t ndv = DEFAULT_BLOOM_FILTER_NDV;
+  /// False-positive probability of the bloom filter.
+  ///
+  /// The bloom filter data structure is a trade-off between disk and memory 
space
+  /// versus fpp, the smaller the fpp, the more memory and disk space is 
required,
+  /// thus setting it to a reasonable value e.g. 0.1, 0.05, or 0.001 is 
recommended.
+  ///
+  /// Setting to very small number diminishes the value of the filter itself,
+  /// as the bitset size is even larger than just storing the whole value.
+  /// User is also expected to set ndv if it can be known in advance in order
+  /// to largely reduce space usage.
+  double fpp = DEFAULT_BLOOM_FILTER_FPP;
+};

Review Comment:
   ```suggestion
   struct PARQUET_EXPORT BloomFilterOptions {
 /// Expected number of distinct values, default 1M.
 ///
 /// Bloom filters are most effective for high-cardinality columns.
 /// A larger value increases memory and disk usage.
 int32_t ndv = DEFAULT_BLOOM_FILTER_NDV;
 
 /// False-positive probability, default 0.05.
 ///
 /// Lower values require more memory and disk space.
 /// Acceptable values are in the range (0.0, 1.0).
 double fpp = DEFAULT_BLOOM_FILTER_FPP;
   };
   ```
   
   I still think that we don't need too much detail about bloom filter.
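
   For readers wondering how ndv and fpp translate into bitset size, here is a
hedged sketch of the classic sizing formula, `m = -n * ln(p) / (ln 2)^2` bits
rounded up to a power of two. Arrow's actual
`BlockSplitBloomFilter::OptimalNumOfBytes` also clamps to minimum and maximum
bitset sizes, which this sketch omits; the 32-byte floor below is an
illustrative assumption.

   ```cpp
   #include <cassert>
   #include <cmath>
   #include <cstdint>

   // Bytes needed for a bloom filter over `ndv` distinct values with target
   // false-positive probability `fpp`, rounded up to a power of two.
   uint64_t OptimalNumBytes(uint32_t ndv, double fpp) {
     const double bits = -static_cast<double>(ndv) * std::log(fpp) /
                         (std::log(2.0) * std::log(2.0));
     const uint64_t bytes = static_cast<uint64_t>(std::ceil(bits / 8.0));
     uint64_t rounded = 32;  // assumed lower bound, for illustration only
     while (rounded < bytes) rounded <<= 1;
     return rounded;
   }

   int main() {
     assert(OptimalNumBytes(100, 0.01) == 128);              // ~958 bits -> 128 B
     assert(OptimalNumBytes(1024 * 1024, 0.05) == 1 << 20);  // defaults -> 1 MiB
     return 0;
   }
   ```

   Roughly 10 bits per distinct value gives fpp around 1%, which is why the
filter pays off mainly for high-cardinality columns.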



##
cpp/src/parquet/properties.h:
##
@@ -186,6 +200,24 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(std::optional<BloomFilterOptions> bloom_filter_options) {

Review Comment:
   The link seems pointing to a weird place. Could you explain it again?



##
cpp/src/parquet/arrow/arrow_parquet_index_test.cc:
##
@@ -0,0 +1,736 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifdef _MSC_VER
+#  pragma warning(push)
+// Disable forcing value to bool warnings
+#  pragma warning(disable : 4800)
+#endif
+
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/array/builder_binary.h"
+#include "arrow/array/builder_dict.h"
+#include "arrow/array/builder_nested.h"
+#include "arrow/chunked_array.h"
+#include "arrow/compute/api.h"
+#include "arrow/extension/json.h"
+#include "arrow/io/api.h"
+#include "arrow/record_batch.h"
+#include "arrow/scalar.h"
+#include "arrow/table.h"
+#include "arrow/testing/builder.h"
+#include "arrow/testing/extension_type.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/testing/util.h"
+#include "arrow/type_fwd.h"
+#include "a

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-25 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2166994191


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size),
+                                hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-25 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2166991431


##
cpp/src/parquet/bloom_filter_writer_internal.h:
##
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/bloom_filter.h"
+
+namespace arrow {

Review Comment:
   I mean `arrow/type_fwd.h`, sorry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-25 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-3005082517

   > This is a preliminary review before it's too late.
   > 
   > I would generally appreciate if code reviews were a bit more thorough.
   
   @pitrou Thanks for your review! I tried to use some bold words to attract 
attention to this desperate PR :)





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-25 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2167079414


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size),
+                                hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-25 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2167046128


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size),
+                                hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + 

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158235483


##
cpp/src/parquet/bloom_filter_writer_internal.h:
##
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/bloom_filter.h"
+
+namespace arrow {

Review Comment:
   ```c++
   namespace arrow {
   class Array;
   }
   ```
   
   The code here is not in `type_fwd.h`






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158234804


##
cpp/src/parquet/properties.h:
##
@@ -909,6 +993,19 @@ class PARQUET_EXPORT WriterProperties {
 return false;
   }
 
+  bool bloom_filter_enabled() const {

Review Comment:
   If any bloom filter exists, a `bloom_filter_builder_` should be created. It's
a bit like the code below, but there is no `bloom_filter_enabled` by default.
   
   ```c++
   bool page_index_enabled() const {
     if (default_column_properties_.page_index_enabled()) {
       return true;
     }
     for (const auto& item : column_properties_) {
       if (item.second.page_index_enabled()) {
         return true;
       }
     }
     return false;
   }
   ```






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158219252


##
cpp/src/parquet/column_writer_test.cc:
##
@@ -1719,6 +1721,94 @@ TEST(TestColumnWriter, WriteDataPageV2HeaderNullCount) {
   }
 }
 
+template <typename TestType>
+class TestBloomFilterWriter : public TestPrimitiveWriter<TestType> {
+ public:
+  void SetUp() override {
+    TestPrimitiveWriter<TestType>::SetUp();
+    builder_ = nullptr;
+    bloom_filter_ = nullptr;
+  }
+
+  std::shared_ptr<TypedColumnWriter<TestType>> BuildWriterWithBloomFilter(
+      int64_t output_size, const ColumnProperties& column_properties);
+
+  std::unique_ptr<BloomFilterBuilder> builder_;
+  BloomFilter* bloom_filter_{nullptr};
+};
+
+template <typename TestType>
+std::shared_ptr<TypedColumnWriter<TestType>>
+TestBloomFilterWriter<TestType>::BuildWriterWithBloomFilter(
+    int64_t output_size, const ColumnProperties& column_properties) {

Review Comment:
   Emm, the internal stats depend on `ColumnProperties`






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158230057


##
cpp/src/parquet/properties.h:
##
@@ -658,6 +702,43 @@ class PARQUET_EXPORT WriterProperties {
   return this->disable_statistics(path->ToDotString());
 }
 
+/// Disable bloom filter for the column specified by `path`.
+/// Default disabled.
+Builder* disable_bloom_filter(const std::string& path) {
+  bloom_filter_options_[path] = std::nullopt;

Review Comment:
   Nice catch



##
cpp/src/parquet/properties.h:
##
@@ -214,6 +240,17 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(std::optional<BloomFilterOptions> bloom_filter_options) {
+    if (bloom_filter_options) {
+      if (bloom_filter_options->fpp > 1.0 || bloom_filter_options->fpp < 0.0) {

Review Comment:
   You're right, will update this
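For what the updated validation might look like, here is a sketch with illustrative names: the `BloomFilterOptions` fields and the `SetBloomFilterOptions` free function are assumptions, not the PR's exact code. It treats `fpp` as a probability that must lie strictly inside (0, 1), since `fpp == 0` is unachievable for a finite filter and `fpp == 1` makes the filter useless.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Assumed shape of the options struct discussed in the PR; names are illustrative.
struct BloomFilterOptions {
  int64_t ndv = 1024 * 1024;
  double fpp = 0.05;
};

// Validate on assignment, in the spirit of set_bloom_filter_options.
// std::nullopt disables the bloom filter for the column.
inline void SetBloomFilterOptions(std::optional<BloomFilterOptions>& slot,
                                  std::optional<BloomFilterOptions> options) {
  if (options) {
    // Strict bounds: reject fpp <= 0 and fpp >= 1, not just values outside [0, 1].
    if (!(options->fpp > 0.0 && options->fpp < 1.0)) {
      throw std::invalid_argument("fpp must be in (0, 1)");
    }
  }
  slot = options;
}
```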






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158224305


##
cpp/src/parquet/properties.h:
##
@@ -214,6 +240,17 @@ class PARQUET_EXPORT ColumnProperties {
 page_index_enabled_ = page_index_enabled;
   }
 
+  void set_bloom_filter_options(std::optional<BloomFilterOptions> bloom_filter_options) {
+    if (bloom_filter_options) {
+      if (bloom_filter_options->fpp > 1.0 || bloom_filter_options->fpp < 0.0) {

Review Comment:
   You're right, will update this






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158222425


##
cpp/src/parquet/metadata.h:
##
@@ -505,21 +504,40 @@ class PARQUET_EXPORT RowGroupMetaDataBuilder {
   std::unique_ptr impl_;
 };
 
+/// Alias type of page index location of a row group. The index location
+/// is located by column ordinal. If a column does not have a page index,
+/// its value is set to std::nullopt.
+using RowGroupIndexLocation = std::vector<std::optional<IndexLocation>>;
+
+/// Alias type of bloom filter location of a row group. The filter location
+/// is located by column ordinal.
+///
+/// The number of columns with a bloom filter is expected to be relatively
+/// small compared to the overall number of columns, so a map is used.
+using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
+
+/// Alias type of page index and location of a parquet file. The
+/// index location is located by the row group ordinal.
+using FileIndexLocation = std::map<size_t, RowGroupIndexLocation>;
+
+/// Alias type of bloom filter and location of a parquet file. The
+/// index location is located by the row group ordinal.
+using FileBloomFilterLocation = std::map<size_t, RowGroupBloomFilterLocation>;

Review Comment:
   > Why std::map instead of std::unordered_map? 
   
   Both are ok to me; here I just want the bloom filters to be flushed in order
when flushing, but this is not required.
   
   > Also, does the two-level map really make sense?
   
   Maybe some bloom filters have bad quality, e.g. a final row group with only 10
rows would be regarded as "bad quality".
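The two-level layout under discussion can be sketched in a few lines. This is a sketch only: `IndexLocation`'s exact fields are an assumption here, and `TotalBloomFilters` is an illustrative helper, not part of the PR. The outer `std::map` key is the row group ordinal, the inner key the column ordinal, and `std::map` (rather than `std::unordered_map`) keeps both levels ordered so flushing walks the file in row-group/column order.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Simplified stand-in for parquet::IndexLocation (an offset + length in the file).
struct IndexLocation {
  int64_t offset;
  int32_t length;
};

// Row group ordinal -> (column ordinal -> location). Columns without a bloom
// filter simply have no entry, which is why a sparse map beats a dense vector.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
using FileBloomFilterLocation = std::map<size_t, RowGroupBloomFilterLocation>;

// Count bloom filters recorded for a whole file.
inline size_t TotalBloomFilters(const FileBloomFilterLocation& file_location) {
  size_t n = 0;
  for (const auto& [row_group, columns] : file_location) {
    n += columns.size();
  }
  return n;
}
```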






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158203762


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size),
+                                hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}

Review Comment:
   I'll change the logic to below:
   
   ```c++
 if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
   throw ParquetException("BooleanType does not support bloom filters");
 }
   ```






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-20 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158212457


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(),
+                                static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size),
+                                hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + 

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158198523


##
cpp/src/parquet/bloom_filter.h:
##
@@ -106,6 +106,34 @@ class PARQUET_EXPORT BloomFilter {
   /// @return hash result.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  // Variants taking const reference arguments, to facilitate templated code
+
+  /// Compute hash for ByteArray value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const ByteArray& value) const { return Hash(&value); }
+
+  /// Compute hash for fixed byte array value by using its plain encoding 
result.
+  ///
+  /// @param value the value to hash.
+  /// @param type_len the value length.
+  uint64_t Hash(const FLBA& value, uint32_t type_len) const {
+return Hash(&value, type_len);
+  }
+
+  /// Compute hash for Int96 value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const Int96& value) const { return Hash(&value); }
+
+  /// Compute hash for std::string_view value by using its plain encoding 
result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(std::string_view value) const {
+ByteArray ba(value);
+return Hash(&ba);

Review Comment:
   I added a `Hash(std::string_view)`, but the other methods don't change, since
I think they should call the method on the hasher?
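The point of the const-reference variants discussed above is that templated code can call `hasher.Hash(value)` uniformly, whatever the value type. A toy sketch under stated assumptions: the `Hasher` and `ByteArray` types here are stand-ins, not Parquet's real classes, so only the overload-forwarding pattern is the subject.

```cpp
#include <cstdint>
#include <functional>
#include <string_view>

// Toy value type standing in for parquet::ByteArray.
struct ByteArray {
  std::string_view data;
};

struct Hasher {
  // Pointer-based primitives, as on a virtual BloomFilter-style interface.
  uint64_t Hash(const ByteArray* v) const {
    return std::hash<std::string_view>{}(v->data);
  }
  uint64_t Hash(const int64_t* v) const { return std::hash<int64_t>{}(*v); }

  // Const-reference / by-value shims: templated callers can write
  // hasher.Hash(value) without caring whether T is a struct or a primitive.
  uint64_t Hash(const ByteArray& v) const { return Hash(&v); }
  uint64_t Hash(int64_t v) const { return Hash(&v); }
  uint64_t Hash(std::string_view v) const {
    ByteArray ba{v};  // view the bytes as a ByteArray; no copy is made
    return Hash(&ba);
  }
};

// Generic code needs only one spelling; overload resolution picks the shim.
template <typename T>
uint64_t HashValue(const Hasher& h, const T& value) {
  return h.Hash(value);
}
```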






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158193354


##
cpp/src/parquet/bloom_filter_builder_internal.h:
##
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// @brief Interface for collecting bloom filter of a parquet file.
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// @brief API to create a BloomFilterBuilder.
+  ///
+  /// @param schema The schema of the file and it must outlive the created
+  /// BloomFilterBuilder.
+  /// @param properties The properties of the file with a set of 
`BloomFilterOption`s
+  /// for columns enabling bloom filters. It must outlive the created 
BloomFilterBuilder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// @brief Start a new row group to host all bloom filters belong to it.
+  ///
+  /// This method must be called before `GetOrCreateBloomFilter` for columns 
of a new row
+  /// group.
+  ///
+  /// @throws ParquetException if WriteTo() has been called to flush bloom 
filters.
+  virtual void AppendRowGroup() = 0;

Review Comment:
   Here I just use a `BloomFilter*` per row group, per column. And the style is
the same as `parquet::PageIndexBuilder`
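The lifecycle implied by the quoted header (`AppendRowGroup` before touching a new row group, one filter per row group/column, and no further appends once `WriteTo` has flushed) can be sketched as follows. Everything here is a toy: `ToyBloomFilter`, the internal containers, and `WriteTo` returning a count are illustrative assumptions, not the PR's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <stdexcept>
#include <vector>

// Toy filter; the real one is parquet::BloomFilter.
struct ToyBloomFilter {
  int64_t inserted = 0;
};

class ToyBloomFilterBuilder {
 public:
  // Must be called before creating filters for a new row group.
  void AppendRowGroup() {
    if (finished_) throw std::logic_error("builder already flushed");
    row_groups_.emplace_back();
  }

  // One filter per (current row group, column); repeated calls return it.
  ToyBloomFilter* GetOrCreateBloomFilter(int32_t column_ordinal) {
    if (row_groups_.empty()) throw std::logic_error("call AppendRowGroup first");
    auto& slot = row_groups_.back()[column_ordinal];
    if (!slot) slot = std::make_unique<ToyBloomFilter>();
    return slot.get();
  }

  // Stands in for serializing all filters to an output stream; afterwards the
  // builder is sealed and AppendRowGroup throws.
  size_t WriteTo() {
    finished_ = true;
    size_t n = 0;
    for (const auto& rg : row_groups_) n += rg.size();
    return n;
  }

 private:
  std::vector<std::map<int32_t, std::unique_ptr<ToyBloomFilter>>> row_groups_;
  bool finished_ = false;
};
```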






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158190645


##
cpp/src/parquet/bloom_filter.h:
##
@@ -106,6 +106,34 @@ class PARQUET_EXPORT BloomFilter {
   /// @return hash result.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  // Variants taking const reference arguments, to facilitate templated code
+
+  /// Compute hash for ByteArray value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const ByteArray& value) const { return Hash(&value); }
+
+  /// Compute hash for fixed byte array value by using its plain encoding 
result.
+  ///
+  /// @param value the value to hash.
+  /// @param type_len the value length.
+  uint64_t Hash(const FLBA& value, uint32_t type_len) const {
+return Hash(&value, type_len);
+  }
+
+  /// Compute hash for Int96 value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const Int96& value) const { return Hash(&value); }
+
+  /// Compute hash for std::string_view value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(std::string_view value) const {
+    ByteArray ba(value);
+    return Hash(&ba);

Review Comment:
   Nice catch!






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2158122006


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -159,6 +159,8 @@ set(PARQUET_SRCS
 arrow/variant_internal.cc
 arrow/writer.cc
 bloom_filter.cc
+bloom_filter_builder_internal.cc
+bloom_filter_writer_internal.cc

Review Comment:
   > It's confusing to have both "builder" and "writer" APIs
   
   1. `builder` is used in creating the column writer, and is created and held at the file level
   2. `writer` is just a helper for `column_writer.cc`
   
   Do you have a better idea for this? I don't want `writer` exposed to the file-level API






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2157909745


##
cpp/src/parquet/bloom_filter_writer_internal.cc:
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "parquet/bloom_filter_writer_internal.h"
+
+#include "parquet/exception.h"
+#include "parquet/schema.h"
+
+#include "arrow/util/bit_run_reader.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/unreachable.h"
+#include "arrow/visit_data_inline.h"
+
+namespace parquet::internal {
+
+constexpr int64_t kHashBatchSize = 256;
+
+template <typename DType>
+BloomFilterWriterImpl<DType>::BloomFilterWriterImpl(const ColumnDescriptor* descr,
+                                                    BloomFilter* bloom_filter)
+    : descr_(descr), bloom_filter_(bloom_filter) {}
+
+template <typename DType>
+bool BloomFilterWriterImpl<DType>::HasBloomFilter() const {
+  return bloom_filter_ != nullptr;
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
+                          hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(), static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_ == nullptr) {
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+    int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+    bloom_filter_->Hashes(values, descr_->type_length(),
+                          static_cast<int>(current_hash_batch_size), hashes.data());
+    bloom_filter_->InsertHashes(hashes.data(), static_cast<int>(current_hash_batch_size));
+  }
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  if (ARROW_PREDICT_FALSE(bloom_filter_ != nullptr)) {
+    throw ParquetException("BooleanType does not support bloom filters");
+  }
+}
+
+template <typename DType>
+void BloomFilterWriterImpl<DType>::UpdateBloomFilterSpaced(
+    const T* values, int64_t num_values, const uint8_t* valid_bits,
+    int64_t valid_bits_offset) {
+  if (bloom_filter_ == nullptr) {
+    // No bloom filter to update
+    return;
+  }
+  std::array<uint64_t, kHashBatchSize> hashes;
+  ::arrow::internal::VisitSetBitRunsVoid(
+      valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+        for (int64_t i = 0; i < length; i += kHashBatchSize) {
+          auto current_hash_batch_size = std::min(kHashBatchSize, length - i);
+          bloom_filter_->Hashes(values + i + position,
+                                static_cast<int>(current_hash_batch_size), hashes.data());
+          bloom_filter_->InsertHashes(hashes.data(),
+                                      static_cast<int>(current_hash_batch_size));
+        }
+      });
+}
+
+template <>
+void BloomFilterWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {}

Review Comment:
   Boolean doesn't have a bloom filter, and the code has checked this before
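The batching pattern in the quoted `UpdateBloomFilter` above (hash at most `kHashBatchSize` values at a time into a fixed stack buffer, then insert the hashes) can be sketched generically. `ForEachHashBatch` and the multiplicative hash lambda below are illustrative, not Arrow code:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int64_t kHashBatchSize = 256;

// Process values in fixed-size batches so the scratch buffer for hashes stays
// bounded regardless of how many values are passed in.
template <typename T, typename HashFn, typename InsertFn>
void ForEachHashBatch(const T* values, int64_t num_values, HashFn hash, InsertFn insert) {
  std::array<uint64_t, kHashBatchSize> hashes;
  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
    const int64_t n = std::min(kHashBatchSize, num_values - i);
    for (int64_t j = 0; j < n; ++j) hashes[j] = hash(values[i + j]);
    insert(hashes.data(), n);  // e.g. BloomFilter::InsertHashes in the real code
  }
}
```

The fixed 256-entry buffer is the reason the quoted code computes `current_hash_batch_size` with `std::min` on each iteration: the last batch is usually shorter.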






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-19 Thread via GitHub


pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2157348890


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -372,8 +374,9 @@ set_source_files_properties(public_api_test.cc PROPERTIES 
SKIP_UNITY_BUILD_INCLU
 
 add_parquet_test(internals-test
  SOURCES
- bloom_filter_reader_test.cc
  bloom_filter_test.cc
+ bloom_filter_reader_writer_test.cc
+ encoding_test.cc

Review Comment:
   It seems like this line should be reverted.



##
cpp/src/parquet/bloom_filter.h:
##
@@ -106,6 +106,34 @@ class PARQUET_EXPORT BloomFilter {
   /// @return hash result.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  // Variant of const reference argument to facilitate template
+
+  /// Compute hash for ByteArray value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const ByteArray& value) const { return Hash(&value); }
+
+  /// Compute hash for fixed byte array value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  /// @param type_len the value length.
+  uint64_t Hash(const FLBA& value, uint32_t type_len) const {
+    return Hash(&value, type_len);
+  }
+
+  /// Compute hash for Int96 value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(const Int96& value) const { return Hash(&value); }
+
+  /// Compute hash for std::string_view value by using its plain encoding result.
+  ///
+  /// @param value the value to hash.
+  uint64_t Hash(std::string_view value) const {
+    ByteArray ba(value);
+    return Hash(&ba);

Review Comment:
   This will stupidly create a local variable and pass a pointer to it.
   Can we do the reverse:
   1) Have a core method `Hash(std::string_view)`
   2) Have other methods (FLBA, ByteArray...) that delegate to it?
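The inversion pitrou suggests can be sketched like this. It is illustrative only: the stand-in `ByteArray` and the FNV-1a hash below are not Arrow's actual types or its XXH64-based hasher. One core `Hash(std::string_view)` does the work, and the other overloads forward to it instead of materializing a temporary and taking its address:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Stand-in for parquet::ByteArray, to make the sketch self-contained.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Core method: all hashing logic lives here (FNV-1a as a placeholder).
inline uint64_t Hash(std::string_view value) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : value) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Thin overload: delegates to the core method, no local temporary whose
// address has to be passed around.
inline uint64_t Hash(const ByteArray& value) {
  return Hash(std::string_view(reinterpret_cast<const char*>(value.ptr), value.len));
}
```

The same delegation would apply to the FLBA and Int96 overloads: each builds a `std::string_view` over its plain-encoded bytes and calls the core method.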



##
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##
@@ -5807,7 +5809,7 @@ auto encode_double = [](double value) {
 
 }  // namespace
 
-class ParquetPageIndexRoundTripTest : public ::testing::Test {
+class TestingWithPageIndex {

Review Comment:
   Can we move the page index and bloom filter tests to another test file? This 
file is already much too long (and takes ages to compile).



##
cpp/src/parquet/bloom_filter_writer_internal.h:
##
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/bloom_filter.h"
+
+namespace arrow {

Review Comment:
   Can you include the respective `type_fwd.h` headers?



##
cpp/src/parquet/properties.h:
##
@@ -658,6 +702,43 @@ class PARQUET_EXPORT WriterProperties {
   return this->disable_statistics(path->ToDotString());
 }
 
+/// Disable bloom filter for the column specified by `path`.
+/// Default disabled.
+Builder* disable_bloom_filter(const std::string& path) {
+  bloom_filter_options_[path] = std::nullopt;

Review Comment:
   Why not simply remove the entry?
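One possible reason to keep the entry, sketched below with hypothetical names rather than the actual `WriterProperties` code, is that an explicit `std::nullopt` records "disabled for this column", which is distinct from "not configured, fall back to the default". Erasing the entry loses that distinction:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Illustrative options struct; field defaults are assumptions, not Arrow's.
struct BloomFilterOptions {
  int32_t ndv = 1024 * 1024;
  double fpp = 0.05;
};

using OptionMap = std::map<std::string, std::optional<BloomFilterOptions>>;

// Resolve per-column options: an explicit entry (enabled or disabled) wins
// over the file-level default; a missing entry falls back to the default.
std::optional<BloomFilterOptions> Resolve(const OptionMap& per_column,
                                          const std::string& path,
                                          std::optional<BloomFilterOptions> def) {
  auto it = per_column.find(path);
  if (it != per_column.end()) return it->second;
  return def;
}
```

If per-column disabling never needs to shadow a later "enable by default", simply erasing the entry would behave the same.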



##
cpp/src/parquet/bloom_filter_writer_internal.h:
##
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/bloom_filter.h"
+
+namespace arrow {
+class Array;
+}
+
+namespace parquet {
+class ColumnDescriptor;
+}
+
+namespace parquet::internal {
+
+template <typename DType>
+class PARQUET_EXPORT Bloom

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-18 Thread via GitHub


mapleFU commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2986773579

   @github-actions crossbow submit -g cpp





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-18 Thread via GitHub


github-actions[bot] commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2986778566

   Revision: 40c9079de24e38f9a00f9a45b001b1a5c4f6684a
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-0e3082917e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-0e3082917e)
   
   |Task|Status|
   |----|------|
   |example-cpp-minimal-build-static|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-example-cpp-minimal-build-static)](https://github.com/ursacomputing/crossbow/actions/runs/15750995536/job/44395976112)|
   |example-cpp-minimal-build-static-system-dependency|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-example-cpp-minimal-build-static-system-dependency)](https://github.com/ursacomputing/crossbow/actions/runs/15750994804/job/44395974368)|
   |example-cpp-tutorial|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-example-cpp-tutorial)](https://github.com/ursacomputing/crossbow/actions/runs/15750995052/job/44395974918)|
   |test-build-cpp-fuzz|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/15750995033/job/44395974905)|
   |test-conda-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15750995088/job/44395975032)|
   |test-conda-cpp-valgrind|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/15750994689/job/44395974132)|
   |test-cuda-cpp-ubuntu-22.04-cuda-11.7.1|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-cuda-cpp-ubuntu-22.04-cuda-11.7.1)](https://github.com/ursacomputing/crossbow/actions/runs/15750994751/job/44395974233)|
   |test-debian-12-cpp-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/15750995313/job/44395975687)|
   |test-debian-12-cpp-i386|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/15750995122/job/44395975095)|
   |test-fedora-39-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-fedora-39-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15750995127/job/44395975216)|
   |test-ubuntu-22.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15750994775/job/44395974314)|
   |test-ubuntu-22.04-cpp-20|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/15750994953/job/44395974759)|
   |test-ubuntu-22.04-cpp-bundled|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-22.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/15750994466/job/44395973547)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/15750994895/job/44395974559)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/15750994610/job/44395973880)|
   |test-ubuntu-24.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e3082917e-github-test-ubuntu-24.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/15750994970/job/44395974741)|
   |test-ubuntu-24.04-cpp-bundled-offline|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workfl

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-18 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2986768016

   I will merge it by the end of this week if there is no objection.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-18 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2986382250

   @pitrou @emkornfield Do you want to take a look? I believe it is worth adding this long-awaited feature to the upcoming 21.0.0 release.





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-05 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2131503709


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -159,6 +159,8 @@ set(PARQUET_SRCS
 arrow/variant_internal.cc
 arrow/writer.cc
 bloom_filter.cc
+bloom_filter_builder_internal.cc
+bloom_filter_writer_internal.cc

Review Comment:
   Personally, `bloom_filter_writer_internal.cc` is only used in `column_writer.cc`, while `bloom_filter_builder_internal.cc` is more widely used?






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2131431656


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -159,6 +159,8 @@ set(PARQUET_SRCS
 arrow/variant_internal.cc
 arrow/writer.cc
 bloom_filter.cc
+bloom_filter_builder_internal.cc
+bloom_filter_writer_internal.cc

Review Comment:
   nit: should we merge these two files into one?






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-05 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2131428480


##
cpp/src/parquet/CMakeLists.txt:
##
@@ -372,8 +374,11 @@ set_source_files_properties(public_api_test.cc PROPERTIES 
SKIP_UNITY_BUILD_INCLU
 
 add_parquet_test(internals-test
  SOURCES
- bloom_filter_reader_test.cc
  bloom_filter_test.cc
+ bloom_filter_reader_writer_test.cc
+ encoding_test.cc
+ properties_test.cc
+ statistics_test.cc

Review Comment:
   ```suggestion
   ```






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-03 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2124085369


##
cpp/src/parquet/column_writer.cc:
##
@@ -1230,6 +1235,166 @@ Status ConvertDictionaryToDense(const ::arrow::Array& 
array, MemoryPool* pool,
   return Status::OK();
 }
 
+template <typename DType>
+class BloomFilterWriterImpl {

Review Comment:
   I've created a new file `bloom_filter_writer_internal.h/cc`






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-03 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2123517697


##
cpp/src/parquet/bloom_filter_builder_internal.h:
##
@@ -0,0 +1,84 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// @brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);

Review Comment:
   Good idea






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-03 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2123523648


##
cpp/src/parquet/column_writer.cc:
##
@@ -157,6 +160,8 @@ inline const T* AddIfNotNull(const T* base, int64_t offset) 
{
   return nullptr;
 }
 
+constexpr int64_t kHashBatchSize = 256;

Review Comment:
   I moved it into `BloomFilterWriterImpl` now






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-02 Thread via GitHub


wgtmac commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2931379472

   @pitrou Would you mind taking a look as well? Thanks!





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-06-02 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2121578249


##
cpp/src/parquet/properties.h:
##
@@ -167,6 +167,32 @@ static constexpr Compression::type 
DEFAULT_COMPRESSION_TYPE = Compression::UNCOM
 static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
 SizeStatisticsLevel::PageAndColumnChunk;
+static constexpr int32_t DEFAULT_BLOOM_FILTER_NDV = 1024 * 1024;
+static constexpr double DEFAULT_BLOOM_FILTER_FPP = 0.05;
+
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values to be inserted into the bloom filter.
+  ///
+  /// Usage of bloom filter is most beneficial for columns with large 
cardinality,

Review Comment:
   I think there are abundant resources available for bloom filters so we don't 
need to add the introduction about BF with (biased) suggestions. It would be 
good enough to just let readers know what each parameter is about and the 
consequence of it.
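For readers who want the consequence of each parameter made concrete: the split-block bloom filter sizes the filter from `ndv` and `fpp` roughly as below. This is a sketch of the sizing formula from the Parquet split-block bloom filter spec; the real writer additionally clamps the result to its minimum and maximum filter sizes:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate number of bits needed for a split-block bloom filter holding
// `ndv` distinct values at false-positive probability `fpp`: each inserted
// value sets one bit in each of the 8 lanes of a 256-bit block.
inline uint64_t OptimalNumOfBits(uint32_t ndv, double fpp) {
  return static_cast<uint64_t>(-8.0 * ndv / std::log(1.0 - std::pow(fpp, 1.0 / 8.0)));
}
```

Lowering `fpp` or raising `ndv` grows the filter, which is exactly the disk and memory consequence the docstring should state.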



##
cpp/src/parquet/bloom_filter_builder_internal.h:
##
@@ -0,0 +1,84 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// @brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);
+/// for (int i = 0; i < num_row_groups; i++) {
+///   bloom_filter_builder->AppendRowGroup();
+///   auto* bloom_filter =
+///   bloom_filter_builder->GetOrCreateBloomFilter(bloom_filter_column);
+///   // Add bloom filter entries in `bloom_filter`.
+///   // ...
+/// }
+/// bloom_filter_builder->WriteTo(sink, location);
+/// ```
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// @brief API to create a BloomFilterBuilder.
+  ///
+  /// @param schema The schema of the file. The life cycle of the schema must 
be
+  /// longer than the BloomFilterBuilder.
+  /// @param properties The properties of the file. The life cycle of the 
properties
+  /// must be longer than the BloomFilterBuilder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// @brief Append a new row group to host all incoming bloom filters.

Review Comment:
   ```suggestion
  /// @brief Start a new row group to host all bloom filters belonging to it.
   ```



##
cpp/src/parquet/bloom_filter_builder_internal.h:
##
@@ -0,0 +1,84 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+/// @brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);
+/// for (int i = 0; i < num_row_groups; i++) {
+///   bloom_filter_builder->AppendRowGroup();
+///   auto* bloom_filter =
+///   bloom_filter_builder->GetOrCreateBloomFilter(bloom_filter_column);
+///   // Add bloom filter entries in `bloom_filter`.
+///   // ...
+/// }
+/// bloom_filter_builder->WriteTo(sink, location);
+/// ```
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// @brief API to create a BloomFilterBuilder.
+  ///
+  /// @param schema The schema of the file. The life cycle of the schema must 
be
+  /// longer t

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-05-27 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2110820266


##
cpp/src/parquet/column_writer.cc:
##
@@ -1741,15 +1752,22 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
     if (num_values != num_spaced_values) {
       current_value_encoder_->PutSpaced(values, static_cast<int>(num_spaced_values),
                                         valid_bits, valid_bits_offset);
+      UpdateBloomFilterSpaced(values, num_spaced_values, valid_bits, valid_bits_offset);
     } else {
       current_value_encoder_->Put(values, static_cast<int>(num_values));
+      UpdateBloomFilter(values, num_values);
     }
     if (page_statistics_ != nullptr) {
       page_statistics_->UpdateSpaced(values, valid_bits, valid_bits_offset,
                                      num_spaced_values, num_values, num_nulls);
     }
     UpdateUnencodedDataBytes();
   }
+
+  void UpdateBloomFilter(const T* values, int64_t num_values);

Review Comment:
   Seems updating stats is not in statistics like `TypedStatistics::Update/UpdateSpaced`? Both are ok to me






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-05-27 Thread via GitHub


alippai commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2913569514

   Arrow 21 will be the one!
   
   @Fokko I saw Iceberg is interested. Do you have any benchmarks or good 
datasets this impl. could be compared with?





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2058002518


##
cpp/src/parquet/file_writer.cc:
##
@@ -91,7 +92,8 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
  RowGroupMetaDataBuilder* metadata, int16_t row_group_ordinal,
  const WriterProperties* properties, bool buffered_row_group = false,
  InternalFileEncryptor* file_encryptor = nullptr,
- PageIndexBuilder* page_index_builder = nullptr)
+ PageIndexBuilder* page_index_builder = nullptr,
+ internal::BloomFilterBuilder* bloom_filter_builder = nullptr)

Review Comment:
   good idea






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2057984560


##
cpp/src/parquet/column_writer.cc:
##
@@ -2363,12 +2390,153 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    std::array<uint64_t, kHashBatchSize> hashes;
+    for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+      int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+      bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),

Review Comment:
   It's ok for me






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2057969051


##
cpp/src/parquet/bloom_filter_builder.h:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet::internal {
+
+/// \brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);
+/// for (int i = 0; i < num_row_groups; i++) {
+///   bloom_filter_builder->AppendRowGroup();
+///   auto* bloom_filter =
+///   bloom_filter_builder->GetOrCreateBloomFilter(bloom_filter_column);
+///   // Add bloom filter entries in `bloom_filter`.
+///   // ...
+/// }
+/// bloom_filter_builder->WriteTo(sink, location);
+/// ```
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// \brief API to create a BloomFilterBuilder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// Append a new row group to host all incoming bloom filters.
+  ///
+  /// This method must be called before `GetOrCreateBloomFilter` for a new row 
group.
+  ///
+  /// \throws ParquetException if WriteTo() has been called to flush bloom 
filters.
+  virtual void AppendRowGroup() = 0;
+
+  /// \brief Get the BloomFilter from column ordinal.
+  ///
+  /// \param column_ordinal Column ordinal in schema, which is only for leaf 
columns.

Review Comment:
   They're equal. Let me switch to `@param`






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2057953731


##
cpp/src/parquet/bloom_filter_builder.h:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet::internal {
+
+/// \brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);
+/// for (int i = 0; i < num_row_groups; i++) {
+///   bloom_filter_builder->AppendRowGroup();
+///   auto* bloom_filter =
+///   bloom_filter_builder->GetOrCreateBloomFilter(bloom_filter_column);
+///   // Add bloom filter entries in `bloom_filter`.
+///   // ...
+/// }
+/// bloom_filter_builder->WriteTo(sink, location);
+/// ```
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// \brief API to create a BloomFilterBuilder.

Review Comment:
   done






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2057896280


##
cpp/src/parquet/bloom_filter_builder.cc:
##
@@ -0,0 +1,170 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This module defines an abstract interface for iterating through pages in a
+// Parquet column chunk within a row group. It could be extended in the future
+// to iterate through all data pages in all chunks in a file.
+
+#include "parquet/bloom_filter_builder.h"
+
+#include 
+#include 
+#include 
+
+#include "arrow/io/interfaces.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+
+namespace parquet::internal {
+
+namespace {
+/// Column encryption for bloom filter is not implemented yet.
+class BloomFilterBuilderImpl : public BloomFilterBuilder {
+ public:
+  explicit BloomFilterBuilderImpl(const SchemaDescriptor* schema,
+  const WriterProperties* properties)
+  : schema_(schema), properties_(properties) {}
+  BloomFilterBuilderImpl(const BloomFilterBuilderImpl&) = delete;
+  BloomFilterBuilderImpl(BloomFilterBuilderImpl&&) = default;
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup() override;
+
+  BloomFilter* GetOrCreateBloomFilter(int32_t column_ordinal) override;
+
+  /// Serialize all bloom filters with header and bitset in the order of row 
group and
+  /// column id. The side effect is that it deletes all bloom filters after 
they have
+  /// been flushed.
+  void WriteTo(::arrow::io::OutputStream* sink, BloomFilterLocation* location) 
override;
+
+ private:
+  /// Make sure column ordinal is not out of bound and the builder is in good 
state.
+  void CheckState(int32_t column_ordinal) const {
+if (finished_) {
+  throw ParquetException("BloomFilterBuilder is already finished.");
+}
+if (column_ordinal < 0 || column_ordinal >= schema_->num_columns()) {

Review Comment:
   ditto






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-24 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2057882801


##
cpp/src/parquet/bloom_filter_builder.cc:
##
@@ -0,0 +1,170 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This module defines an abstract interface for iterating through pages in a
+// Parquet column chunk within a row group. It could be extended in the future
+// to iterate through all data pages in all chunks in a file.
+
+#include "parquet/bloom_filter_builder.h"
+
+#include 
+#include 
+#include 
+
+#include "arrow/io/interfaces.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/exception.h"
+#include "parquet/metadata.h"
+#include "parquet/properties.h"
+
+namespace parquet::internal {
+
+namespace {
+/// Column encryption for bloom filter is not implemented yet.
+class BloomFilterBuilderImpl : public BloomFilterBuilder {
+ public:
+  explicit BloomFilterBuilderImpl(const SchemaDescriptor* schema,
+  const WriterProperties* properties)
+  : schema_(schema), properties_(properties) {}
+  BloomFilterBuilderImpl(const BloomFilterBuilderImpl&) = delete;
+  BloomFilterBuilderImpl(BloomFilterBuilderImpl&&) = default;
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup() override;
+
+  BloomFilter* GetOrCreateBloomFilter(int32_t column_ordinal) override;
+
+  /// Serialize all bloom filters with header and bitset in the order of row 
group and
+  /// column id. The side effect is that it deletes all bloom filters after 
they have
+  /// been flushed.
+  void WriteTo(::arrow::io::OutputStream* sink, BloomFilterLocation* location) 
override;
+
+ private:
+  /// Make sure column ordinal is not out of bound and the builder is in good 
state.
+  void CheckState(int32_t column_ordinal) const {
+if (finished_) {

Review Comment:
   It's checked once per rowgroup, so I don't think this would be heavy
   
   And I suspect that compiler can already well handle this under -O2: 
https://godbolt.org/z/6qvevr3G1









Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-02 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r1676168880


##
cpp/src/parquet/column_writer.cc:
##
@@ -2363,12 +2390,153 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    std::array<uint64_t, kHashBatchSize> hashes;
+    for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
+      int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
+      bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),

Review Comment:
   I might have forgotten something: where does the `int32` range come from? 
@emkornfield 






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-04-02 Thread via GitHub


wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2015830158


##
cpp/src/parquet/bloom_filter_builder.h:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet::internal {
+
+/// \brief Interface for collecting bloom filter of a parquet file.
+///
+/// ```
+/// auto bloom_filter_builder = BloomFilterBuilder::Make(schema, properties);
+/// for (int i = 0; i < num_row_groups; i++) {
+///   bloom_filter_builder->AppendRowGroup();
+///   auto* bloom_filter =
+///   bloom_filter_builder->GetOrCreateBloomFilter(bloom_filter_column);
+///   // Add bloom filter entries in `bloom_filter`.
+///   // ...
+/// }
+/// bloom_filter_builder->WriteTo(sink, location);
+/// ```
+class PARQUET_EXPORT BloomFilterBuilder {
+ public:
+  /// \brief API to create a BloomFilterBuilder.
+  static std::unique_ptr<BloomFilterBuilder> Make(const SchemaDescriptor* schema,
+                                                  const WriterProperties* properties);
+
+  /// Append a new row group to host all incoming bloom filters.
+  ///
+  /// This method must be called before `GetOrCreateBloomFilter` for a new row 
group.
+  ///
+  /// \throws ParquetException if WriteTo() has been called to flush bloom 
filters.
+  virtual void AppendRowGroup() = 0;
+
+  /// \brief Get the BloomFilter from column ordinal.
+  ///
+  /// \param column_ordinal Column ordinal in schema, which is only for leaf 
columns.
+  ///
+  /// \return BloomFilter for the column and its memory ownership belongs to 
the
+  /// BloomFilterBuilder. It will return nullptr if bloom filter is not 
enabled for the
+  /// column.
+  ///
+  /// \throws ParquetException if any of following conditions applies:
+  /// 1) column_ordinal is out of bound.
+  /// 2) `WriteTo()` has been called already.
+  /// 3) `AppendRowGroup()` is not called before `GetOrCreateBloomFilter()`.
+  virtual BloomFilter* GetOrCreateBloomFilter(int32_t column_ordinal) = 0;
+
+  /// \brief Write the bloom filter to sink.
+  ///
+  /// The bloom filter cannot be modified after this method is called.
+  ///
+  /// \param[out] sink The output stream to write the bloom filter.

Review Comment:
   sink should be `param[in/out]` I guess?



##
cpp/src/parquet/bloom_filter_builder.h:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/io/type_fwd.h"
+#include "parquet/types.h"
+
+namespace parquet::internal {

Review Comment:
   If this is an internal namespace, how about renaming these files to 
`bloom_filter_internal.h/cc` ?



##
cpp/src/parquet/properties.h:
##
@@ -235,6 +272,12 @@ class PARQUET_EXPORT ColumnProperties {
 
   bool page_index_enabled() const { return page_index_enabled_; }
 
+  std::optional<BloomFilterOptions> bloom_filter_options() const {
+return bloom_filter_options_;
+  }
+
+  bool bloom_filter_enabled() const { return bloom_filter_options_ != 
std::nullopt; }

Review Comment:
   ```suggestion
 bool bloom_filter_enabled() const { return 
bloom_filter_options_.has_value(); }
   ```



##
cpp/src/parquet/type_fwd.h:
##
@@ -89,6 +89,10 @@ class EncodedStatistics;
 class Statistics;
 struct SizeStatistics;
 
+class BloomFilter;
+struct BloomFilterOptions;
+struct BloomFilterLocation;

Review Comment:
   ```suggestion
   struct BloomFilterLocation;
   ```

Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-03-26 Thread via GitHub


mapleFU commented on PR #37400:
URL: https://github.com/apache/arrow/pull/37400#issuecomment-2753501622

   I've resolved the comments and fixed the CI, would you mind re-checking? @wgtmac





Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-02-06 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r1946034738


##
cpp/src/parquet/metadata.h:
##
@@ -499,21 +498,43 @@ class PARQUET_EXPORT RowGroupMetaDataBuilder {
   std::unique_ptr impl_;
 };
 
+/// Alias type of page index location of a row group. The index location
+/// is located by column ordinal. If a column does not have a page index,
+/// its value is set to std::nullopt.
+using RowGroupIndexLocation = std::vector<std::optional<IndexLocation>>;
+
+/// Alias type of bloom filter location of a row group. The filter location
+/// is located by column ordinal.
+///
+/// The number of columns with a bloom filter is expected to be relatively small
+/// compared to the overall number of columns, so a map is used.
+using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;

Review Comment:
   I think PageIndexLocation isn't defined here. What about keeping it 
consistent?






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-02-06 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r1946034145


##
cpp/src/parquet/column_writer.cc:
##
@@ -2475,32 +2651,37 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
 encoding = properties->dictionary_index_encoding();
   }
   switch (descr->physical_type()) {
-case Type::BOOLEAN:
+case Type::BOOLEAN: {
+  if (bloom_filter != nullptr) {
+throw ParquetException("Bloom filter is not supported for boolean 
type");

Review Comment:
   Both lgtm






Re: [PR] GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation [arrow]

2025-02-06 Thread via GitHub


mapleFU commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r1946032202


##
cpp/src/parquet/column_writer.cc:
##
@@ -1791,11 +1810,15 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
  &exec_ctx));
   referenced_dictionary = referenced_dictionary_datum.make_array();
 }
-
-    int64_t non_null_count = chunk_indices->length() - chunk_indices->null_count();
-    page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);
-    page_statistics_->IncrementNumValues(non_null_count);
-    page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
+    if (page_statistics_ != nullptr) {
+      int64_t non_null_count = chunk_indices->length() - chunk_indices->null_count();
+      page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);
+      page_statistics_->IncrementNumValues(non_null_count);
+      page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
+    }
+    if (bloom_filter_ != nullptr) {
+      UpdateBloomFilterArray(*referenced_dictionary);

Review Comment:
   > If we can accept the bloom filter to contain more values than it should 
have, data.dictionary() seems to be sufficient.
   
   Hmm, this might be a good way, but it could also make the dictionary huge
   
   > BTW, if we already have dictionary encoding, should we simply disable 
building bloom filter in this case?
   
   AFAIK this is different?





