[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #97: [FLINK-27096] Improve DataCache and KMeans Performance

GitBox Sun, 22 May 2022 23:04:30 -0700


yunfengzhou-hub commented on code in PR #97:
URL: https://github.com/apache/flink-ml/pull/97#discussion_r879052166



##########
flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java:
##########
@@ -94,6 +120,26 @@ public static <T> DataStream<T> reduce(DataStream<T> input, 
ReduceFunction<T> fu
         }
     }
 
+    /**
+     * Takes a randomly sampled subset of elements in a bounded data stream.
+     *
+     * <p>If the number of elements in the stream is smaller than expected 
number of samples, all
+     * elements will be included in the sample.
+     *
+     * @param input The input data stream.
+     * @param numSamples The number of elements to be sampled.
+     * @param randomSeed The seed to randomly pick elements as sample.
+     * @return A data stream containing a list of the sampled elements.
+     */
+    public static <T> DataStream<List<T>> sample(
+            DataStream<T> input, int numSamples, long randomSeed) {
+        return input.transform(
+                        "samplingOperator",
+                        Types.LIST(input.getType()),
+                        new SamplingOperator<>(numSamples, randomSeed))
+                .setParallelism(1);

Review Comment:
   The SampleOperator would return only one element, which is a List<T>, so 
retaining the parallelism of upstream operator seems to be meaningless. 
   
   I agree that distributed sampling can bring better performance. I'll make 
the change this way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #97: [FLINK-27096] Improve DataCache and KMeans Performance

Reply via email to