ozankabak commented on PR #3570: URL: https://github.com/apache/arrow-datafusion/pull/3570#issuecomment-1260839211
Hello again! We have resolved almost all the points you mentioned. We also identified some challenging corner cases regarding NULL treatment and updated our code to handle them, along with new unit tests covering these cases. Three main discussion points remain:

*PostgreSQL compatibility of new unit tests*

Expected values for all the new unit tests were sanity-checked against PostgreSQL. To remain in sync with PostgreSQL, we also think it would be a good idea to add psql-parity tests for the NULL-treatment cases. If you agree, let's make this the subject of another PR. We can open an issue to track it (and will be happy to work on a PR to resolve it).

*Changes to `ScalarValue`*

At this point, we are not really adding any new functionality or complexity to this type. We just tidied up and moved some `ScalarValue`-only code that was scattered across multiple modules into `scalar.rs`. For example, the coercions you saw were already part of the code here: https://github.com/apache/arrow-datafusion/blob/87faf868f2276b84e63cad6721ca08bd79ed9cb8/datafusion/physical-expr/src/aggregate/sum.rs#L261

We agree that, in general, there should be no coercions at this stage -- any necessary conversions should be handled earlier in the code path. However, we don't want to extend the scope of this PR to remove already-existing coercions from the codebase. If there are no objections, we are happy to open an issue for this and help resolve it in the near future, along with similar issues like [#3511](https://github.com/apache/arrow-datafusion/issues/3511). A teaser: I have done some preliminary tests, and it seems removing all the coercions will not be a tough task at all.
However, removing this kind of code always calls for more careful testing than a preliminary check, so it deserves a careful study in a separate PR 🙂

*Bisect*

We considered using the [`partition_point`](https://doc.rust-lang.org/std/primitive.slice.html#method.partition_point) function. This requires converting `&[ArrayRef]` to `&[&[ScalarValue]]`, where each inner `&[ScalarValue]` corresponds to a row (because we are searching for the insertion point of a target `&[ScalarValue]` among rows). Unless we are missing something, this conversion would incur unnecessary memory usage and CPU time, since it copies every row up front. In contrast, our implementation keeps the data in its original columnar format and materializes a row from an index only when a comparison is performed. Since a binary search performs *log(n)* comparisons, we need only *log(n)* row conversions instead of *n*. If you have any suggestions for adding bisect functionality without changing the columnar orientation of the data, we can create an issue for it and help resolve it in a subsequent PR.

Thanks again for all the reviews. Excited to see this merged and used!
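To make the trade-off concrete, here is a minimal, self-contained sketch of the columnar bisect idea. It deliberately simplifies: plain `Vec<i64>` columns stand in for Arrow `ArrayRef`s, a `Vec<i64>` stands in for a row of `ScalarValue`s, and the names `row_at` and `bisect_left` are hypothetical, not DataFusion APIs. The point it illustrates is that row materialization happens lazily, once per comparison, so a search over *n* rows does only *log(n)* conversions.

```rust
/// Materialize row `idx` from columnar data. In the real implementation this
/// would build a `Vec<ScalarValue>` from `&[ArrayRef]`; here it just gathers
/// one `i64` per column. Called only inside comparisons, never up front.
fn row_at(columns: &[Vec<i64>], idx: usize) -> Vec<i64> {
    columns.iter().map(|col| col[idx]).collect()
}

/// Find the insertion point of `target` among the rows encoded by `columns`,
/// assuming the rows are sorted lexicographically. Performs O(log n)
/// comparisons, hence O(log n) row materializations instead of O(n).
fn bisect_left(columns: &[Vec<i64>], target: &[i64]) -> usize {
    let n = columns.first().map_or(0, |c| c.len());
    let (mut lo, mut hi) = (0, n);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // Lazy row conversion: only the probed row is materialized.
        if row_at(columns, mid).as_slice() < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

fn main() {
    // Two columns, five rows: (1,10), (1,20), (2,5), (2,15), (3,0),
    // sorted lexicographically by (col0, col1).
    let columns = vec![vec![1, 1, 2, 2, 3], vec![10, 20, 5, 15, 0]];
    assert_eq!(bisect_left(&columns, &[2, 10]), 3);
    assert_eq!(bisect_left(&columns, &[0, 0]), 0);
    assert_eq!(bisect_left(&columns, &[4, 0]), 5);
    println!("ok");
}
```

The contrast with `partition_point` is that the latter needs a `&[T]` of rows to search over, which forces materializing all *n* rows before the search even starts.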
