Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Antoine Pitrou
On 21/02/2021 at 01:05, Wes McKinney wrote: > I agree that we should avoid leaking uninitialized memory in places > where we have control over it. I could imagine a third party project > having UBSAN warnings and then tracing the origin of them to something > in Arrow that they then have to wor

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Wes McKinney
I agree that we should avoid leaking uninitialized memory in places where we have control over it. I could imagine a third party project having UBSAN warnings and then tracing the origin of them to something in Arrow that they then have to work around. As for the potential performance implications,

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Micah Kornfield
Hi Ben and Ben, I think it would be good to have a convention for filling null slots in arrays with a known value by default. I think it might be a mistake to use zero as the value because it can lead to reliance on this behavior. Secure by default is a good approach to take. For kernels in partic

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Antoine Pitrou
I am definitely in the camp that we should not leak past data through uninitialized Arrow memory (for example by transmitting such buffers using Arrow IPC). Regards Antoine. On 20/02/2021 at 21:17, Benjamin Kietzman wrote: > Original discussion at > https://github.com/apache/arrow/pull/9471

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Jorge Cardoso Leitão
I agree. Below are two notes from a similar discussion on the Rust implementation: 1. In SIMD, for performance reasons, operations are performed over the whole buffer irrespective of the bitmap mask, and the bitmap mask is dealt with separately. If a slot contains an arbitrary value, the operation

[DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Benjamin Kietzman
Original discussion at https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR for https://issues.apache.org/jira/browse/ARROW-11595 ) Although the format does not specify what is contained in array slots masked by null bits (for example the first byte in the data buffer of an int8 ar
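To make the question in this thread concrete, here is a minimal sketch, not taken from any of the messages, of what "values masked by null bits" means for an int8 array. It uses only the Python standard library rather than an Arrow API. The helper names `is_valid` and `zero_masked_slots` are hypothetical, and zero is used purely as an illustrative fill value; the thread itself had not settled on which value, if any, to use.

```python
def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i is marked valid (non-null).

    Arrow validity bitmaps are least-significant-bit ordered:
    slot i is bit (i % 8) of byte (i // 8).
    """
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def zero_masked_slots(values: bytearray, bitmap: bytes) -> bytearray:
    """Hypothetical helper: copy `values`, overwriting null slots with 0.

    Without a step like this, whatever bytes happened to be in the
    data buffer under a null slot would travel along with the array.
    """
    out = bytearray(values)
    for i in range(len(out)):
        if not is_valid(bitmap, i):
            out[i] = 0
    return out


# Four int8 slots; slots 1 and 3 are null but still hold leftover bytes.
values = bytearray([7, 0xAB, 9, 0xCD])
bitmap = bytes([0b00000101])  # bits 0 and 2 set: slots 0 and 2 are valid

cleaned = zero_masked_slots(values, bitmap)
print(list(cleaned))
```

A reader that honours the bitmap sees the same logical array before and after the call; only the bytes hidden under null slots differ.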

Re: [Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

2021-02-20 Thread Marc Prud'hommeaux
I understand that GROUP BY ought not imply any particular ordering; it's just that working with other SQL databases, I've come to expect that ordering will be consistent between multiple runs of the same statement, at least within the context of a single transaction on a single connection. I

Re: [Rust] Column names in FFI_ArrowSchema

2021-02-20 Thread Marc Prud'hommeaux
Great! I will start experimenting and see how far I get. While we're at it, should we consider putting something in the metadata field? That would be more involved due to the bespoke format of the property, but it might be a good time to consider any additional information that could be usefu

Re: [Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

2021-02-20 Thread Andy Grove
The SQL standard in general makes no guarantee of the order of resulting data unless there is an explicit ORDER BY clause. I would guess that there are two factors in play here: 1. The use of hash-based data structures, as you mention 2. If you have partitioned data then it is processed on multip

[Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

2021-02-20 Thread Marc Prud'hommeaux
When I group by a column in DataFusion SQL, the order of the results is different every time. For example, "select country from data group by country" against https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv might return "Moldova" first one time, and then "Swed

Re: Intermittent (but frequent) flight integration failures on master

2021-02-20 Thread Andrew Lamb
Thanks David. I have filed https://issues.apache.org/jira/browse/ARROW-11717 to track. On Fri, Feb 19, 2021 at 5:12 PM David Li wrote: > @mrkn submitted a PR to add backtraces which was merged recently: > https://github.com/apache/arrow/pull/9524 > > However I think the abort is a red herring - th

[NIGHTLY] Arrow Build Report for Job nightly-2021-02-20-0

2021-02-20 Thread Crossbow
Arrow Build Report for Job nightly-2021-02-20-0 All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-20-0 Failed Tasks: - conda-linux-gcc-py39-aarch64: URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-20-0-drone-conda-linux