Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-17 Thread via GitHub


alamb merged PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-17 Thread via GitHub


alamb commented on PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#issuecomment-3667006113

   Since @etseidl has already approved this too, I'll just merge it in
   
   Thanks again!





Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-15 Thread via GitHub


ODukhno commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2621241936


##
parquet/src/bloom_filter/mod.rs:
##
@@ -541,4 +599,49 @@ mod tests {
 assert_eq!(*num_bits, num_of_bits_from_ndv_fpp(*ndv, *fpp) as u64);
 }
 }
+
+    #[test]
+    fn test_sbbf_write_round_trip() {
+        // Create a bloom filter with a 32-byte bitset (minimum size)
+        let bitset_bytes = vec![0u8; 32];
+        let mut original = Sbbf::new(&bitset_bytes);
+
+        // Insert some test values
+        let test_values = ["hello", "world", "rust", "parquet", "bloom", "filter"];
+        for value in &test_values {
+            original.insert(value);
+        }
+
+        // Serialize to bytes
+        let mut output = Vec::new();
+        original.write(&mut output).unwrap();
+
+        // Validate header was written correctly
+        let mut protocol = ThriftSliceInputProtocol::new(&output);
+        let header = BloomFilterHeader::read_thrift(&mut protocol).unwrap();
+        assert_eq!(header.num_bytes, bitset_bytes.len() as i32);
+        assert_eq!(header.algorithm, BloomFilterAlgorithm::BLOCK);
+        assert_eq!(header.hash, BloomFilterHash::XXHASH);
+        assert_eq!(header.compression, BloomFilterCompression::UNCOMPRESSED);
+
+        // Deserialize using from_bytes
+        let reconstructed = Sbbf::from_bytes(&output).unwrap();
+
+        // Most importantly: verify the bloom filter WORKS correctly after round-trip
+        for value in &test_values {
+            assert!(
+                reconstructed.check(value),
+                "Value '{}' should be present after round-trip",
+                value
+            );
+        }
+
+        // Verify false negative check (values not inserted should not be found)
+        let missing_values = ["missing", "absent", "nothere"];
+        for value in &missing_values {
+            // Note: bloom filters can have false positives, but should never have false negatives
+            // So we can't assert !check(), but we should verify inserted values are found
+            let _ = reconstructed.check(value); // Just exercise the code path

Review Comment:
   Just pushed another change where I applied this one along with all other suggestions above.
   
   Thanks!






Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-15 Thread via GitHub


etseidl commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2620481961


##
parquet/src/bloom_filter/mod.rs:
##
@@ -417,6 +427,54 @@ impl Sbbf {
 pub(crate) fn estimated_memory_size(&self) -> usize {
 self.0.capacity() * std::mem::size_of::<Block>()
 }
+
+/// reads a Sbff from thrift encoded bytes

Review Comment:
   ```suggestion
   /// Reads a Sbff from Thrift encoded bytes
   ```



##
parquet/src/bloom_filter/mod.rs:
##
@@ -283,7 +292,8 @@ impl Sbbf {
 /// Write the bloom filter data (header and then bitset) to the output. This doesn't
 /// flush the writer in order to boost performance of bulk writing all blocks. Caller
 /// must remember to flush the writer.
-pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
+/// This method usually is used in conjunction with from_bytes for serialization/deserialization.

Review Comment:
   ```suggestion
   /// This method usually is used in conjunction with [`Self::from_bytes`] for serialization/deserialization.
   ```



##
parquet/src/bloom_filter/mod.rs:
##
@@ -215,9 +215,18 @@ pub(crate) fn chunk_read_bloom_filter_header_and_offset(
 #[inline]
 pub(crate) fn read_bloom_filter_header_and_length(
 buffer: Bytes,
+) -> Result<(BloomFilterHeader, u64), ParquetError> {
+read_bloom_filter_header_and_length_from_bytes(buffer.as_ref())
+}
+
+/// given a byte slice, try to read out a bloom filter header and return both the header and

Review Comment:
   ```suggestion
   /// Given a byte slice, try to read out a bloom filter header and return both the header and
   ```



##
parquet/src/bloom_filter/mod.rs:
##
@@ -541,4 +599,49 @@ mod tests {
 assert_eq!(*num_bits, num_of_bits_from_ndv_fpp(*ndv, *fpp) as u64);
 }
 }
+
+    #[test]
+    fn test_sbbf_write_round_trip() {
+        // Create a bloom filter with a 32-byte bitset (minimum size)
+        let bitset_bytes = vec![0u8; 32];
+        let mut original = Sbbf::new(&bitset_bytes);
+
+        // Insert some test values
+        let test_values = ["hello", "world", "rust", "parquet", "bloom", "filter"];
+        for value in &test_values {
+            original.insert(value);
+        }
+
+        // Serialize to bytes
+        let mut output = Vec::new();
+        original.write(&mut output).unwrap();
+
+        // Validate header was written correctly
+        let mut protocol = ThriftSliceInputProtocol::new(&output);
+        let header = BloomFilterHeader::read_thrift(&mut protocol).unwrap();
+        assert_eq!(header.num_bytes, bitset_bytes.len() as i32);
+        assert_eq!(header.algorithm, BloomFilterAlgorithm::BLOCK);
+        assert_eq!(header.hash, BloomFilterHash::XXHASH);
+        assert_eq!(header.compression, BloomFilterCompression::UNCOMPRESSED);
+
+        // Deserialize using from_bytes
+        let reconstructed = Sbbf::from_bytes(&output).unwrap();
+
+        // Most importantly: verify the bloom filter WORKS correctly after round-trip
+        for value in &test_values {
+            assert!(
+                reconstructed.check(value),
+                "Value '{}' should be present after round-trip",
+                value
+            );
+        }
+
+        // Verify false negative check (values not inserted should not be found)
+        let missing_values = ["missing", "absent", "nothere"];
+        for value in &missing_values {
+            // Note: bloom filters can have false positives, but should never have false negatives
+            // So we can't assert !check(), but we should verify inserted values are found
+            let _ = reconstructed.check(value); // Just exercise the code path

Review Comment:
   Since we can't verify the negative test, why not just move these comments up and skip this loop?


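The no-false-negatives property this review comment relies on can be demonstrated with a small self-contained sketch. This toy single-block filter and its `DefaultHasher`-based hashing are illustrative stand-ins only, not the parquet crate's split-block SBBF:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy Bloom filter over a single 64-bit word: inserted values are always
// found (no false negatives); absent values *may* collide (false positives).
struct ToyFilter(u64);

impl ToyFilter {
    fn bit(value: &str, seed: u64) -> u64 {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        1u64 << (h.finish() % 64)
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..3 {
            self.0 |= Self::bit(value, seed);
        }
    }

    fn check(&self, value: &str) -> bool {
        (0..3).all(|seed| self.0 & Self::bit(value, seed) != 0)
    }
}

fn main() {
    let mut f = ToyFilter(0);
    for v in ["hello", "world", "parquet"] {
        f.insert(v);
    }
    // No false negatives: every inserted value must be found.
    assert!(["hello", "world", "parquet"].iter().all(|v| f.check(v)));
    // Absent values are usually rejected, but a false positive is possible,
    // which is exactly why the test above cannot assert `!check(...)`.
    println!("missing checks: {:?}", ["missing", "absent"].map(|v| f.check(v)));
}
```

Because a lookup only tests whether bits that insertion set are still set, a negative assertion on arbitrary absent values can spuriously fail, which is the reasoning behind dropping the loop.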




Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-11 Thread via GitHub


ODukhno commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2611731221


##
parquet/src/bloom_filter/mod.rs:
##
@@ -283,7 +300,7 @@ impl Sbbf {
 /// Write the bloom filter data (header and then bitset) to the output. This doesn't
 /// flush the writer in order to boost performance of bulk writing all blocks. Caller
 /// must remember to flush the writer.
-pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
+pub fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {

Review Comment:
   @alamb , please take another look. I took over this work as Rose will not be able to work on this task anymore.
   
   Thank you!






Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-11 Thread via GitHub


ODukhno commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2611494866


##
parquet/src/bloom_filter/mod.rs:
##
@@ -283,7 +300,7 @@ impl Sbbf {
 /// Write the bloom filter data (header and then bitset) to the output. This doesn't
 /// flush the writer in order to boost performance of bulk writing all blocks. Caller
 /// must remember to flush the writer.
-pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
+pub fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {

Review Comment:
   @alamb , are you ok if I initiate a new pull request? Rose will not be able to work on this change anymore.






Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-05 Thread via GitHub


alamb commented on PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#issuecomment-3618701438

   > @alamb Checking to see if you have any more comments/questions on this PR?
   
   Hi @RoseZhang123 -- I had a few suggestions on the API here: https://github.com/apache/arrow-rs/pull/8762#pullrequestreview-3537207043
   
   If that doesn't make sense, let me know and I can try and propose some changes





Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-05 Thread via GitHub


RoseZhang123 commented on PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#issuecomment-3618246955

   @alamb Checking to see if you have any more comments/questions on this PR?





Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-12-03 Thread via GitHub


alamb commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2586829463


##
parquet/src/bloom_filter/mod.rs:
##
@@ -283,7 +300,7 @@ impl Sbbf {
 /// Write the bloom filter data (header and then bitset) to the output. This doesn't
 /// flush the writer in order to boost performance of bulk writing all blocks. Caller
 /// must remember to flush the writer.
-pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
+pub fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {

Review Comment:
   This method writes out the bytes using thrift encoding, which is basically
   ```text
   (header)
   (bitset)
   ```
   
   We already have the code to read an `Sbbf` back from the thrift encoding here
   
   https://github.com/apache/arrow-rs/blob/b65d20767c7510570cff0ab0154a268c29a3f712/parquet/src/bloom_filter/mod.rs#L409-L408
   
   Rather than exposing `new()` and relying on the user having to pass in the right bits (and relying on the bloom filter code not to change), what I think we should do:
   
   1. Introduce a `read()` method that is the inverse of `write()` -- specifically, something like
   
   ```rust
   /// reads a Sbff from thrift encoded bytes
   pub fn from_bytes(bytes: &[u8]) -> Result<Self> { 
   ...
   }
   ```
   
   And then a round trip serialization would look like:
   
   ```rust
   let mut serialized = Vec::new();
   original.write(&mut serialized)?;
   // read the serialized bytes back
   let reconstructed = Sbbf::from_bytes(&serialized)?;
   ```
   



##
parquet/src/bloom_filter/mod.rs:
##
@@ -292,6 +309,44 @@ impl Sbbf {
 Ok(())
 }
 
+/// Returns the raw bitset bytes encoded in little-endian order.
+pub fn as_slice(&self) -> Vec<u8> {

Review Comment:
   typically methods named starting with `as_*` would not copy data 


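The `write()`/`from_bytes()` round trip proposed above can be fleshed out with a self-contained sketch. Note this is NOT the parquet API: the real `Sbbf` writes a Thrift-encoded `BloomFilterHeader` before the bitset, while here a plain little-endian `u32` length prefix stands in for that header, and `MiniFilter` is a hypothetical stand-in type:

```rust
use std::io::Write;

// Hypothetical stand-in for Sbbf; holds raw bitset bytes.
struct MiniFilter(Vec<u8>);

impl MiniFilter {
    // Write a stand-in "header" (bitset length as little-endian u32),
    // then the bitset, mirroring the header-then-bitset layout.
    fn write<W: Write>(&self, mut writer: W) -> std::io::Result<()> {
        writer.write_all(&(self.0.len() as u32).to_le_bytes())?;
        writer.write_all(&self.0)
    }

    // Inverse of write(): parse the length prefix, then take the bitset.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 4 {
            return None;
        }
        let len = u32::from_le_bytes(bytes[..4].try_into().ok()?) as usize;
        let bitset = &bytes[4..];
        (bitset.len() == len).then(|| MiniFilter(bitset.to_vec()))
    }
}

fn main() {
    let original = MiniFilter(vec![0xAB; 32]);
    let mut serialized = Vec::new();
    original.write(&mut serialized).unwrap();
    // Read the serialized bytes back and confirm the bitset survives.
    let reconstructed = MiniFilter::from_bytes(&serialized).unwrap();
    assert_eq!(reconstructed.0, original.0);
}
```

The design point from the review carries over: because `from_bytes` consumes exactly what `write` produced, callers never need to know the header length or bit layout themselves.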




Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-11-21 Thread via GitHub


RoseZhang123 commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2550843912


##
parquet/src/bloom_filter/mod.rs:
##
@@ -244,29 +244,8 @@ fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
 }
 
 impl Sbbf {
-    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
-    /// Will return an error if `fpp` is greater than or equal to 1.0 or less than 0.0.
-    pub(crate) fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self> {
-        if !(0.0..1.0).contains(&fpp) {
-            return Err(ParquetError::General(format!(
-                "False positive probability must be between 0.0 and 1.0, got {fpp}"
-            )));
-        }
-        let num_bits = num_of_bits_from_ndv_fpp(ndv, fpp);
-        Ok(Self::new_with_num_of_bytes(num_bits / 8))
-    }
-
-    /// Create a new [Sbbf] with given number of bytes, the exact number of bytes will be adjusted
-    /// to the next power of two bounded by [BITSET_MIN_LENGTH] and [BITSET_MAX_LENGTH].
-    pub(crate) fn new_with_num_of_bytes(num_bytes: usize) -> Self {
-        let num_bytes = optimal_num_of_bytes(num_bytes);
-        assert_eq!(num_bytes % size_of::<Block>(), 0);
-        let num_blocks = num_bytes / size_of::<Block>();
-        let bitset = vec![Block::ZERO; num_blocks];
-        Self(bitset)
-    }
-
-    pub(crate) fn new(bitset: &[u8]) -> Self {
+    /// Create a new [Sbbf] from raw bitset bytes.

Review Comment:
   Sounds good. Decided to add one more function to make things easier. Now bitset_len() returns the size of the bloom filter's bitset in bytes, and as_slice() returns the bitset.


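A minimal sketch of what those two accessors might look like, assuming the filter wraps a vector of fixed-size words. The `Filter` type and the `u32` word layout here are simplifications for illustration, not the crate's actual definitions:

```rust
// Hypothetical stand-in for Sbbf's internal block storage; illustrative only.
struct Filter(Vec<u32>);

impl Filter {
    /// Size of the bloom filter's bitset in bytes.
    fn bitset_len(&self) -> usize {
        self.0.len() * std::mem::size_of::<u32>()
    }

    /// Bitset bytes encoded in little-endian order. Note this copies, which
    /// is why the review elsewhere cautions against an `as_*` name for it.
    fn bitset_bytes(&self) -> Vec<u8> {
        self.0.iter().flat_map(|w| w.to_le_bytes()).collect()
    }
}

fn main() {
    let f = Filter(vec![0x0403_0201, 0x0807_0605]);
    assert_eq!(f.bitset_len(), 8);
    assert_eq!(f.bitset_bytes(), vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```

Exposing the length separately from the bytes lets callers size buffers before copying the bitset out.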




Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-11-14 Thread via GitHub


etseidl commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2527904585


##
parquet/src/bloom_filter/mod.rs:
##
@@ -244,29 +244,8 @@ fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
 }
 
 impl Sbbf {
-    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
-    /// Will return an error if `fpp` is greater than or equal to 1.0 or less than 0.0.
-    pub(crate) fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self> {
-        if !(0.0..1.0).contains(&fpp) {
-            return Err(ParquetError::General(format!(
-                "False positive probability must be between 0.0 and 1.0, got {fpp}"
-            )));
-        }
-        let num_bits = num_of_bits_from_ndv_fpp(ndv, fpp);
-        Ok(Self::new_with_num_of_bytes(num_bits / 8))
-    }
-
-    /// Create a new [Sbbf] with given number of bytes, the exact number of bytes will be adjusted
-    /// to the next power of two bounded by [BITSET_MIN_LENGTH] and [BITSET_MAX_LENGTH].
-    pub(crate) fn new_with_num_of_bytes(num_bytes: usize) -> Self {
-        let num_bytes = optimal_num_of_bytes(num_bytes);
-        assert_eq!(num_bytes % size_of::<Block>(), 0);
-        let num_blocks = num_bytes / size_of::<Block>();
-        let bitset = vec![Block::ZERO; num_blocks];
-        Self(bitset)
-    }
-
-    pub(crate) fn new(bitset: &[u8]) -> Self {
+    /// Create a new [Sbbf] from raw bitset bytes.

Review Comment:
   Would it be better to add a public function `as_slice` to return the bitset? Reading the proposed example I'm not sure how one would get the header length from the serialized `Sbbf`.
   


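For context on the `num_of_bits_from_ndv_fpp` sizing used by the removed constructors, the Parquet format specification gives a formula for split-block bloom filters: with k = 8 bits set per value, the number of bits m for n distinct values at false positive probability f is m = -k·n / ln(1 - f^(1/k)). The sketch below is a from-the-spec reimplementation, not necessarily the crate's exact code:

```rust
// Split-block bloom filter sizing per the Parquet spec:
//   m = -k * n / ln(1 - f^(1/k)), with k = 8.
// Larger ndv or smaller fpp both require more bits.
fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
    let num_bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    num_bits as usize
}

fn main() {
    // Roughly 9-10 bits per distinct value at a 1% false positive rate.
    println!("{} bits for 1M values @ 1% fpp",
             num_of_bits_from_ndv_fpp(1_000_000, 0.01));
}
```

This is the quantity the existing test in the thread (`assert_eq!(*num_bits, num_of_bits_from_ndv_fpp(*ndv, *fpp) as u64)`) checks against precomputed values.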




Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-11-06 Thread via GitHub


RoseZhang123 commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2501261532


##
parquet/src/bloom_filter/mod.rs:
##
@@ -244,29 +244,8 @@ fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
 }
 
 impl Sbbf {
-    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
-    /// Will return an error if `fpp` is greater than or equal to 1.0 or less than 0.0.
-    pub(crate) fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self> {
-        if !(0.0..1.0).contains(&fpp) {
-            return Err(ParquetError::General(format!(
-                "False positive probability must be between 0.0 and 1.0, got {fpp}"
-            )));
-        }
-        let num_bits = num_of_bits_from_ndv_fpp(ndv, fpp);
-        Ok(Self::new_with_num_of_bytes(num_bits / 8))
-    }
-
-    /// Create a new [Sbbf] with given number of bytes, the exact number of bytes will be adjusted
-    /// to the next power of two bounded by [BITSET_MIN_LENGTH] and [BITSET_MAX_LENGTH].
-    pub(crate) fn new_with_num_of_bytes(num_bytes: usize) -> Self {
-        let num_bytes = optimal_num_of_bytes(num_bytes);
-        assert_eq!(num_bytes % size_of::<Block>(), 0);
-        let num_blocks = num_bytes / size_of::<Block>();
-        let bitset = vec![Block::ZERO; num_blocks];
-        Self(bitset)
-    }
-
-    pub(crate) fn new(bitset: &[u8]) -> Self {
+    /// Create a new [Sbbf] from raw bitset bytes.

Review Comment:
   Thanks! That makes sense. I just added a doc example, let me know if you think anything else needs to be added/changed.






Re: [PR] Make Parquet SBBF serialize/deserialize helpers public for external reuse [arrow-rs]

2025-10-31 Thread via GitHub


alamb commented on code in PR #8762:
URL: https://github.com/apache/arrow-rs/pull/8762#discussion_r2482705647


##
parquet/src/bloom_filter/mod.rs:
##
@@ -244,29 +244,8 @@ fn num_of_bits_from_ndv_fpp(ndv: u64, fpp: f64) -> usize {
 }
 
 impl Sbbf {
-    /// Create a new [Sbbf] with given number of distinct values and false positive probability.
-    /// Will return an error if `fpp` is greater than or equal to 1.0 or less than 0.0.
-    pub(crate) fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self> {
-        if !(0.0..1.0).contains(&fpp) {
-            return Err(ParquetError::General(format!(
-                "False positive probability must be between 0.0 and 1.0, got {fpp}"
-            )));
-        }
-        let num_bits = num_of_bits_from_ndv_fpp(ndv, fpp);
-        Ok(Self::new_with_num_of_bytes(num_bits / 8))
-    }
-
-    /// Create a new [Sbbf] with given number of bytes, the exact number of bytes will be adjusted
-    /// to the next power of two bounded by [BITSET_MIN_LENGTH] and [BITSET_MAX_LENGTH].
-    pub(crate) fn new_with_num_of_bytes(num_bytes: usize) -> Self {
-        let num_bytes = optimal_num_of_bytes(num_bytes);
-        assert_eq!(num_bytes % size_of::<Block>(), 0);
-        let num_blocks = num_bytes / size_of::<Block>();
-        let bitset = vec![Block::ZERO; num_blocks];
-        Self(bitset)
-    }
-
-    pub(crate) fn new(bitset: &[u8]) -> Self {
+    /// Create a new [Sbbf] from raw bitset bytes.

Review Comment:
   Can you please add a doc example here showing how to use this API with your intended use case?
   
   That will both help the documentation and ensure we have exposed enough of the API to be useful
   
   For example, if the idea is to save an Sbbf using `write` and then re-create it again, an example that:
   1. made a Sbbf (ideally read it from a file)
   2. wrote it to a `Vec<>`
   3. Created a new Sbbf from that vec
   4. Show that it is the same as the original
   
   It would be nice to mention that a correct argument for creating a Sbbf can be created using `Sbbf::write`
   
   


