goldmedal commented on issue #15420: URL: https://github.com/apache/datafusion/issues/15420#issuecomment-2752034334
> The array outputs true for each row that has hash % partition == 0 (and false if not).

I don't understand why the formula is `hash % partition == 0`. IMO, `hash % total_partition` is the index of the partition the row belongs to. Maybe the formula should be `hash % total_partition == current_partition`?

Given the following data:
```
col1 | col2 | ... | hash % total_partition
-------------------------
data | data | ... | 2
data | data | ... | 1
data | data | ... | 2
data | data | ... | 0
```

Partition 0 will get
```
col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | false
data | data | ... | false
data | data | ... | true
```

Partition 1 will get
```
col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | true
data | data | ... | false
data | data | ... | false
```

Partition 2 will get
```
col1 | col2 | ... | selection
-------------------------
data | data | ... | true
data | data | ... | false
data | data | ... | true
data | data | ... | false
```

Then the downstream plan can aggregate or join only the records whose `selection` is true. Does that make sense?
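To make the proposed formula concrete, here is a minimal Rust sketch of how each partition could compute its own selection mask. It is only an illustration of `hash % total_partition == current_partition`, not DataFusion's actual implementation: the function name `selection_mask`, the plain `Vec<bool>` return type (rather than an Arrow boolean array), and the sample hash values are all assumptions made for this example.

```rust
/// Hypothetical helper (not a DataFusion API): returns one boolean per row,
/// true when the row's hash maps to `current_partition`.
fn selection_mask(hashes: &[u64], total_partitions: u64, current_partition: u64) -> Vec<bool> {
    assert!(total_partitions > 0, "total_partitions must be non-zero");
    hashes
        .iter()
        // A row belongs to partition `hash % total_partitions`.
        .map(|h| h % total_partitions == current_partition)
        .collect()
}

fn main() {
    // Made-up hash values chosen so that `hash % 3` is 2, 1, 2, 0,
    // matching the rows in the example tables.
    let hashes = [8u64, 4, 5, 9];
    let total_partitions = 3;

    for p in 0..total_partitions {
        let mask = selection_mask(&hashes, total_partitions, p);
        println!("partition {p}: {:?}", mask);
    }
}
```

With these inputs, the three masks match the `selection` columns in the tables, and every row is selected by exactly one partition.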
