Fan Hong created FLINK-31623:
--------------------------------

             Summary: Improvements on DataStreamUtils#sample
                 Key: FLINK-31623
                 URL: https://issues.apache.org/jira/browse/FLINK-31623
             Project: Flink
          Issue Type: Improvement
          Components: Library / Machine Learning
            Reporter: Fan Hong


The current implementation employs a two-level sampling method, which can run 
into two issues:
 # One subtask has a memory footprint twice as large as that of the other 
subtasks, which could cause an unexpected OOM in some situations.
 # When data instances are unevenly distributed among partitions (subtasks), 
the probability of an instance being sampled differs from one partition 
(subtask) to another.
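
To make the second issue concrete, here is a minimal sketch of the arithmetic. 
It assumes a particular two-level scheme that is not confirmed by this issue: 
each subtask first keeps up to k instances uniformly at random, and one subtask 
then keeps k instances uniformly from the pooled results. The function name 
two_level_inclusion_probability and its parameters are hypothetical, not part 
of DataStreamUtils.

```python
from fractions import Fraction


def two_level_inclusion_probability(partition_sizes, k):
    """Return, per partition, the chance that one instance ends up in the sample.

    Assumed scheme (hypothetical, not taken from the Flink ML source):
      level 1: each partition keeps min(k, n_i) instances uniformly at random;
      level 2: k instances are drawn uniformly from the pooled level-1 output.
    All partition sizes are assumed to be positive.
    """
    kept = [min(k, n) for n in partition_sizes]
    pooled = sum(kept)
    final = min(k, pooled)
    # P(survive level 1) * P(survive level 2), as exact fractions.
    return [
        Fraction(m, n) * Fraction(final, pooled)
        for m, n in zip(kept, partition_sizes)
    ]


# Balanced partitions: every instance has the same chance, 10 / 200.
balanced = two_level_inclusion_probability([100, 100], k=10)

# Imbalanced partitions: instances in the small partition are favoured.
skewed = two_level_inclusion_probability([10, 1000], k=10)
```

Under these assumptions, the balanced case gives 1/20 for both partitions, 
which matches uniform sampling. The skewed case gives 1/2 for the partition 
with 10 instances and 1/200 for the partition with 1000 instances, whereas 
uniform sampling over all 1010 instances would give 10/1010 to each.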



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
