Fan Hong created FLINK-31623:
--------------------------------

             Summary: Improvements on DataStreamUtils#sample
                 Key: FLINK-31623
                 URL: https://issues.apache.org/jira/browse/FLINK-31623
             Project: Flink
          Issue Type: Improvement
          Components: Library / Machine Learning
            Reporter: Fan Hong

The current implementation uses a two-level sampling method, which can run into two issues:

# One subtask has a memory footprint twice as large as that of the other subtasks, which could cause an unexpected OOM in some situations.
# When data instances are unevenly distributed among partitions (subtasks), instances in different partitions (subtasks) have different probabilities of being sampled.
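
To make the second issue concrete, here is a minimal sketch in plain Java. It does not use Flink and is not the actual DataStreamUtils#sample code. It assumes that "two-level" means each subtask first keeps a uniform local sample of at most k instances, and a final stage then keeps k of the collected candidates uniformly. The class name TwoLevelSamplingBias and its methods are hypothetical and exist only for this illustration.

```java
public class TwoLevelSamplingBias {

    /**
     * Probability that one given instance of each partition ends up in the
     * final sample under the assumed two-level scheme (illustration only).
     */
    static double[] inclusionProbabilities(long[] partitionSizes, int k) {
        // Level 1 (assumed): subtask i keeps min(k, n_i) of its n_i instances, uniformly.
        long candidates = 0;
        for (long n : partitionSizes) {
            candidates += Math.min(k, n);
        }
        // Level 2 (assumed): k of the collected candidates are kept, uniformly.
        double secondLevel = Math.min(1.0, (double) k / candidates);

        double[] result = new double[partitionSizes.length];
        for (int i = 0; i < partitionSizes.length; i++) {
            long n = partitionSizes[i];
            double firstLevel = (n == 0) ? 0.0 : (double) Math.min(k, n) / n;
            result[i] = firstLevel * secondLevel;
        }
        return result;
    }

    /** Probability every instance would have under a truly uniform sample of size k. */
    static double uniformProbability(long[] partitionSizes, int k) {
        long total = 0;
        for (long n : partitionSizes) {
            total += n;
        }
        return Math.min(1.0, (double) k / total);
    }

    public static void main(String[] args) {
        long[] sizes = {10, 1000}; // two subtasks with imbalanced input
        int k = 5;
        double[] p = inclusionProbabilities(sizes, k);
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("partition " + i + " (n=" + sizes[i] + "): " + p[i]);
        }
        System.out.println("uniform target: " + uniformProbability(sizes, k));
    }
}
```

Under these assumptions, with partitions of 10 and 1000 instances and k = 5, an instance in the small partition is kept with probability 0.25, while one in the large partition is kept with probability 0.0025. A uniform sample would give every instance 5/1010. When all partitions have the same size, the two-level probabilities coincide with the uniform one, so the bias only appears for imbalanced input.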