yunfengzhou-hub commented on code in PR #97: URL: https://github.com/apache/flink-ml/pull/97#discussion_r889790107
########## flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java: ########## @@ -94,6 +120,26 @@ public static <T> DataStream<T> reduce(DataStream<T> input, ReduceFunction<T> fu } } + /** + * Takes a randomly sampled subset of elements in a bounded data stream. + * + * <p>If the number of elements in the stream is smaller than expected number of samples, all + * elements will be included in the sample. + * + * @param input The input data stream. + * @param numSamples The number of elements to be sampled. + * @param randomSeed The seed to randomly pick elements as sample. + * @return A data stream containing a list of the sampled elements. + */ + public static <T> DataStream<List<T>> sample( + DataStream<T> input, int numSamples, long randomSeed) { + return input.transform( + "samplingOperator", + Types.LIST(input.getType()), + new SamplingOperator<>(numSamples, randomSeed)) + .setParallelism(1); Review Comment: According to offline discussion, I agree with it that we should not change parallelism to make it more generic. I'll make the change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org