[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

GitBox Tue, 11 Oct 2022 18:38:52 -0700


yunfengzhou-hub commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r992901815



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##########
@@ -21,11 +21,29 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param <T> The class type of this instance.
  */
 public interface VectorAssemblerParams<T>
-        extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> {}
+        extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> {
+    Param<Integer[]> INPUT_SIZES =
+            new IntArrayParam(
+                    "inputSizes",
+                    "Sizes of the input elements to be assembled.",
+                    null,
+                    ParamValidators.notNull());

Review Comment:
   In Spark, VectorAssembler can infer the vector sizes in some cases, which 
means VectorSizeHint is not a compulsory prerequisite. Let's also support those 
situations in Flink ML, and remove the `ParamValidators.notNull()`.
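
   A minimal sketch of how optional sizes could be resolved, assuming that a 
null `inputSizes` means "infer from the data". `InputSizeInference` and 
`resolveSizes` are hypothetical names for illustration, not existing Flink ML 
or Spark API, and plain `double[]` stands in for a vector type:

```java
import java.util.Arrays;

public class InputSizeInference {
    /**
     * Returns the declared sizes when they are provided; otherwise infers one size
     * per input from the values of a single row. Hypothetical helper, not Flink ML API.
     */
    public static int[] resolveSizes(Integer[] declaredSizes, Object[] row) {
        int[] sizes = new int[row.length];
        for (int i = 0; i < row.length; i++) {
            if (declaredSizes != null) {
                // Explicit sizes take precedence over anything in the data.
                sizes[i] = declaredSizes[i];
            } else if (row[i] instanceof Number) {
                // A numerical value contributes exactly one element.
                sizes[i] = 1;
            } else if (row[i] instanceof double[]) {
                sizes[i] = ((double[]) row[i]).length;
            } else {
                // A null (or unsupported) value carries no size information.
                throw new IllegalArgumentException(
                        "Cannot infer the size of input " + i + " without inputSizes.");
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        Object[] row = {1.0, new double[] {2.0, 3.0}};
        System.out.println(Arrays.toString(resolveSizes(null, row)));
        System.out.println(Arrays.toString(resolveSizes(new Integer[] {1, 2}, row)));
    }
}
```

   In this sketch, inference only fails when a value is null and no sizes were 
declared, which is the one case where a size hint would actually be needed.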



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##########
@@ -47,10 +47,15 @@
 
 /**
 * A Transformer which combines a given list of input columns into a vector column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in the same column must
+ * have the same size. The operator deals with null values or records with wrong sizes according to
+ * the strategy specified by the {@link HasHandleInvalid} parameter as follows:
  *
- * <p>The `keep` option of {@link HasHandleInvalid} means that we output bad rows with output column
- * set to null.
+ * <p>The `keep` option means that we do the assembling action without checking the vector size.

Review Comment:
   Let's also remove the "we"s used here, making sure that the first-person 
perspective is not used anywhere in the documents.



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##########
@@ -47,10 +47,15 @@
 
 /**
 * A Transformer which combines a given list of input columns into a vector column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in the same column must
+ * have the same size. The operator deals with null values or records with wrong sizes according to

Review Comment:
   The statement that "the elements must have the same size" does not seem 
accurate, since elements can be null or vectors of different sizes when 
`handleInvalid` is set to `keep`.



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##########
@@ -47,10 +47,15 @@
 
 /**
 * A Transformer which combines a given list of input columns into a vector column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in the same column must
+ * have the same size. The operator deals with null values or records with wrong sizes according to
+ * the strategy specified by the {@link HasHandleInvalid} parameter as follows:
  *
- * <p>The `keep` option of {@link HasHandleInvalid} means that we output bad rows with output column
- * set to null.
+ * <p>The `keep` option means that we do the assembling action without checking the vector size.

Review Comment:
   Could you please add a detailed description of the behavior when the "keep" 
option is set? As far as I can tell, any of the following could be a valid 
behavior.
   - assembling `[1,2]`, `[3,4]` with sizes `2,3` may get `[1, 2, 3, 4]`
   - assembling `[1,2]`, `[3,4]` with sizes `2,3` may get `[1, 2, 3, 4, 0]` (padding with zeros)
   - assembling `[1,2]`, `[3,4,5]` with sizes `2,2` may get `[1, 2, 3, 4]` (trim to fit size)
   - assembling `[1,2]`, `[3,4,5]` with sizes `2,2` may get `[1, 2, NaN, NaN]`
   - assembling `[1,2]`, `null` with sizes `2,2` may get `[1, 2, NaN, NaN]`
   - assembling `[1,2]`, `null` with sizes `2,2` may get `[1, 2, 0, 0]`
   
   Let's check which option Spark selects, implement that behavior, and 
document it in the JavaDoc.
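
   To make the candidates above concrete, here is a rough sketch of three 
strategies that together reproduce every listed result. The class and method 
names (`KeepCandidates`, `concatAsIs`, `fitToSizes`, `nanOnMismatch`) are 
hypothetical and plain `double[]` stands in for a vector type; none of this 
is meant to describe what Spark or Flink ML actually does:

```java
import java.util.Arrays;

public class KeepCandidates {
    /** Concatenates the inputs as they are, ignoring declared sizes. Inputs must be non-null. */
    public static double[] concatAsIs(double[][] inputs) {
        int total = 0;
        for (double[] v : inputs) {
            total += v.length;
        }
        double[] out = new double[total];
        int pos = 0;
        for (double[] v : inputs) {
            System.arraycopy(v, 0, out, pos, v.length);
            pos += v.length;
        }
        return out;
    }

    /** Fits each input to its declared size: trims long inputs, pads short or null ones with fill. */
    public static double[] fitToSizes(double[][] inputs, int[] sizes, double fill) {
        double[] out = new double[Arrays.stream(sizes).sum()];
        Arrays.fill(out, fill);
        int pos = 0;
        for (int i = 0; i < inputs.length; i++) {
            if (inputs[i] != null) {
                int n = Math.min(inputs[i].length, sizes[i]);
                System.arraycopy(inputs[i], 0, out, pos, n);
            }
            pos += sizes[i];
        }
        return out;
    }

    /** Replaces any null or wrongly sized input with NaNs of the declared size. */
    public static double[] nanOnMismatch(double[][] inputs, int[] sizes) {
        double[] out = new double[Arrays.stream(sizes).sum()];
        int pos = 0;
        for (int i = 0; i < inputs.length; i++) {
            if (inputs[i] != null && inputs[i].length == sizes[i]) {
                System.arraycopy(inputs[i], 0, out, pos, sizes[i]);
            } else {
                Arrays.fill(out, pos, pos + sizes[i], Double.NaN);
            }
            pos += sizes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] shortInput = {{1, 2}, {3, 4}};
        double[][] longInput = {{1, 2}, {3, 4, 5}};
        double[][] nullInput = {{1, 2}, null};
        System.out.println(Arrays.toString(concatAsIs(shortInput)));
        System.out.println(Arrays.toString(fitToSizes(shortInput, new int[] {2, 3}, 0.0)));
        System.out.println(Arrays.toString(fitToSizes(longInput, new int[] {2, 2}, 0.0)));
        System.out.println(Arrays.toString(nanOnMismatch(longInput, new int[] {2, 2})));
        System.out.println(Arrays.toString(nanOnMismatch(nullInput, new int[] {2, 2})));
        System.out.println(Arrays.toString(fitToSizes(nullInput, new int[] {2, 2}, 0.0)));
    }
}
```

   The six calls in `main` correspond to the six bullets in order. Whichever 
behavior is chosen, the JavaDoc should state it as explicitly as these method 
comments do.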



