etseidl commented on code in PR #9619:
URL: https://github.com/apache/arrow-rs/pull/9619#discussion_r3314257234


##########
parquet/src/column/writer/encoder.rs:
##########
@@ -326,63 +332,52 @@ impl<T: DataType> ColumnValueEncoder for 
ColumnValueEncoderImpl<T> {
     }
 }
 
-fn get_min_max<'a, T, I>(descr: &ColumnDescriptor, mut iter: I) -> Option<(T, 
T)>
+// Get min and max values for all values in `iter`.
+//
+// For floating point we need to compare NaN values until we encounter a 
non-NaN
+// value which then becomes the new min/max. After this, only non-NaN values 
are
+// evaluated. If all values are NaN, then the min/max NaNs as determined by
+// IEEE 754 total order are returned.
+fn get_min_max<'a, T, I>(basic_type_info: &BasicTypeInfo, mut iter: I) -> 
Option<(T, T, u64)>
 where
     T: ParquetValueType + 'a,
     I: Iterator<Item = &'a T>,
 {
-    let first = loop {
-        let next = iter.next()?;
-        if !is_nan(descr, next) {
-            break next;
-        }
-    };
+    let first = iter.next()?;
+    let mut min_max_nan = is_nan(basic_type_info, first);

Review Comment:
   I ran down that rabbit hole, and then came an epiphany. Because `is_nan` is 
parameterized on the physical type, for anything other than `FLOAT`, `DOUBLE`, 
and `FIXED_LEN_BYTE_ARRAY`, the compiler will see an inline function that 
simply returns `false` and it will be inlined to oblivion. Because both uses of 
`is_nan` in those cases are known to be false at compile time, the match can be 
optimized away as well, and instead simply call `compare_greater` to find the 
min and max. I suspect the `nan_count` handling will go away as well since it 
will turn into `nan_count += false as u64`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to