Fan Hong created FLINK-31623:
--------------------------------
Summary: Improvements on DataStreamUtils#sample
Key: FLINK-31623
URL: https://issues.apache.org/jira/browse/FLINK-31623
Project: Flink
Issue Type: Improvement
Components: Library / Machine Learning
Reporter: Fan Hong
The current implementation employs a two-level sampling method, which can run
into two issues:
# One subtask has a memory footprint twice as large as the other subtasks, which
could cause unexpected OOM in some situations.
# When data instances are unevenly distributed among partitions (subtasks),
the probability of an instance being sampled differs from one partition
(subtask) to another.
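
To illustrate the second issue, here is a minimal sketch that computes per-instance inclusion probabilities. It does not call Flink; it assumes (this is not confirmed by the issue text) that "two-level" means each subtask first keeps up to k instances by uniform sampling, and a single subtask then draws k instances uniformly from the union of those candidates. The class name TwoLevelSamplingBias and the method inclusionProbability are hypothetical and exist only for this example.

```java
public class TwoLevelSamplingBias {

    /**
     * Probability that one given instance in partition p ends up in the final
     * sample of size k, under the ASSUMED two-level scheme:
     *   level 1: each partition keeps min(n_i, k) instances uniformly at random;
     *   level 2: k instances are drawn uniformly from all level-1 candidates.
     */
    public static double inclusionProbability(long[] partitionSizes, int p, int k) {
        // Total number of candidates that survive level 1 across all partitions.
        long candidates = 0;
        for (long n : partitionSizes) {
            candidates += Math.min(n, k);
        }
        // Chance of surviving level 1 inside partition p.
        double firstLevel = Math.min(1.0, (double) k / partitionSizes[p]);
        // Chance of surviving level 2 among all candidates.
        double secondLevel = Math.min(1.0, (double) k / candidates);
        return firstLevel * secondLevel;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 1000}; // hypothetical, deliberately unbalanced partitions
        int k = 10;                // hypothetical sample size
        long total = 0;
        for (long n : sizes) {
            total += n;
        }
        for (int p = 0; p < sizes.length; p++) {
            System.out.println("partition " + p + ": " + inclusionProbability(sizes, p, k));
        }
        // A uniform sample over all instances would give every instance k / total.
        System.out.println("uniform target: " + (double) k / total);
    }
}
```

With partition sizes 10 and 1000 and k = 10, an instance in the small partition is kept with probability 0.5, while an instance in the large partition is kept with probability 0.005. A uniform sample would give every instance 10/1010, roughly 0.0099.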
--
This message was sent by Atlassian Jira
(v8.20.10#820010)