[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

GitBox Thu, 22 Sep 2022 23:20:43 -0700


yunfengzhou-hub commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r978291706



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##########
@@ -21,11 +21,31 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
+
+import java.util.Arrays;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param <T> The class type of this instance.
  */
 public interface VectorAssemblerParams<T>
-        extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> {}
+        extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> {
+    Param<Integer[]> SIZES =

Review Comment:
   In referent to the `splitsArray` parameter in Bucketizer and 
`VectorSizeHint` in Spark VectorAssembler, do you think it would be better to 
rename this parameter to `vectorSizeArray` or `elementSizeArray`?



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##########
@@ -74,38 +74,65 @@ public Table[] transform(Table... inputs) {
         DataStream<Row> output =
                 tEnv.toDataStream(inputs[0])
                         .flatMap(
-                                new AssemblerFunc(getInputCols(), 
getHandleInvalid()),
+                                new AssemblerFunction(
+                                        getInputCols(), getHandleInvalid(), 
getSizes()),
                                 outputTypeInfo);
         Table outputTable = tEnv.fromDataStream(output);
         return new Table[] {outputTable};
     }
 
-    private static class AssemblerFunc implements FlatMapFunction<Row, Row> {
+    private static class AssemblerFunction implements FlatMapFunction<Row, 
Row> {
         private final String[] inputCols;
         private final String handleInvalid;
+        private final int[] sizeArray;
 
-        public AssemblerFunc(String[] inputCols, String handleInvalid) {
+        public AssemblerFunction(String[] inputCols, String handleInvalid, 
int[] sizeArray) {
             this.inputCols = inputCols;
             this.handleInvalid = handleInvalid;
+            this.sizeArray = sizeArray;
         }
 
         @Override
         public void flatMap(Row value, Collector<Row> out) {
             int nnz = 0;
             int vectorSize = 0;
             try {
-                for (String inputCol : inputCols) {
+                for (int i = 0; i < inputCols.length; ++i) {
+                    String inputCol = inputCols[i];
                     Object object = value.getField(inputCol);
                     Preconditions.checkNotNull(object, "Input column value 
should not be null.");
                     if (object instanceof Number) {
+                        Preconditions.checkArgument(

Review Comment:
   There might be performance issues if we perform these checks for each 
record. Can we try to avoid this? For example, can the assembling process fail 
if the input data size does match with expected, so that no check is needed?



##########
flink-ml-examples/src/main/java/org/apache/flink/ml/examples/feature/VectorAssemblerExample.java:
##########
@@ -56,7 +56,8 @@ public static void main(String[] args) {
         VectorAssembler vectorAssembler =
                 new VectorAssembler()
                         .setInputCols("vec", "num", "sparseVec")
-                        .setOutputCol("assembledVec");
+                        .setOutputCol("assembledVec")
+                        .setSizes(2, 1, 5);

Review Comment:
   Let's update VectorAssembler's markdown document accordingly, including its 
parameter list, examples and possibly descriptions.



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##########
@@ -21,11 +21,31 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
+
+import java.util.Arrays;
 
 /**

Review Comment:
   Would it be better to add some document describing how users need to tackle 
the sizes parameter? For example, Spark has some descriptions of the function 
of VectorSizeHint in the JavaDoc of `VectorAssembler.handleInvalid`. Maybe we 
can add similar descriptions to the JavaDoc of `VectorAssembler`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

Reply via email to