yunfengzhou-hub commented on code in PR #156: URL: https://github.com/apache/flink-ml/pull/156#discussion_r992901815
########## flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java: ########## @@ -21,11 +21,29 @@ import org.apache.flink.ml.common.param.HasHandleInvalid; import org.apache.flink.ml.common.param.HasInputCols; import org.apache.flink.ml.common.param.HasOutputCol; +import org.apache.flink.ml.param.IntArrayParam; +import org.apache.flink.ml.param.Param; +import org.apache.flink.ml.param.ParamValidators; /** * Params of {@link VectorAssembler}. * * @param <T> The class type of this instance. */ public interface VectorAssemblerParams<T> - extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> {} + extends HasInputCols<T>, HasOutputCol<T>, HasHandleInvalid<T> { + Param<Integer[]> INPUT_SIZES = + new IntArrayParam( + "inputSizes", + "Sizes of the input elements to be assembled.", + null, + ParamValidators.notNull()); Review Comment: In Spark, VectorAssembler can infer the vector sizes in some cases, which means VectorSizeHint is not a compulsory prerequisite. Let's also support those situations in Flink ML, and remove the `ParamValidators.notNull()`. ########## flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java: ########## @@ -47,10 +47,15 @@ /** * A Transformer which combines a given list of input columns into a vector column. Types of input - * columns must be either vector or numerical value. + * columns must be either vector or numerical types. The elements assembled in the same column must + * have the same size. The operator deals with null values or records with wrong sizes according to + * the strategy specified by the {@link HasHandleInvalid} parameter as follows: * - * <p>The `keep` option of {@link HasHandleInvalid} means that we output bad rows with output column - * set to null. + * <p>The `keep` option means that we do the assembling action without checking the vector size. Review Comment: Let's also remove the "we"s used here, making sure that the first-person perspective is not used anywhere in the documents. ########## flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java: ########## @@ -47,10 +47,15 @@ /** * A Transformer which combines a given list of input columns into a vector column. Types of input - * columns must be either vector or numerical value. + * columns must be either vector or numerical types. The elements assembled in the same column must + * have the same size. The operator deals with null values or records with wrong sizes according to Review Comment: The statement that "the elements must have the same size" seems not accurate, as elements can be null or vectors of different sizes when handleInvalid is set to keep. ########## flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java: ########## @@ -47,10 +47,15 @@ /** * A Transformer which combines a given list of input columns into a vector column. Types of input - * columns must be either vector or numerical value. + * columns must be either vector or numerical types. The elements assembled in the same column must + * have the same size. The operator deals with null values or records with wrong sizes according to + * the strategy specified by the {@link HasHandleInvalid} parameter as follows: * - * <p>The `keep` option of {@link HasHandleInvalid} means that we output bad rows with output column - * set to null. + * <p>The `keep` option means that we do the assembling action without checking the vector size. Review Comment: Could you please add a detailed description of the behavior when "keep" option is set? As I can think of, any of the following might be a correct behavior. - assembling `[1,2]`, `[3,4]` with sizes `2,3` may get `[1, 2, 3, 4]` - assembling `[1,2]`, `[3,4]` with sizes `2,3` may get `[1, 2, 3, 4, 0]` (padding with zeros) - assembling `[1,2]`, `[3,4,5]` with sizes `2,2` may get `[1, 2, 3, 4]` (trim to fit size) - assembling `[1,2]`, `[3,4,5]` with sizes `2,2` may get `[1, 2, NaN, NaN]` - assembling `[1,2]`, `null` with sizes `2,2` may get `[1, 2, NaN, NaN]` - assembling `[1,2]`, `null` with sizes `2,2` may get `[1, 2, 0, 0]` Let's check the option selected by Spark, implement that behavior, and clarify this behavior in JavaDoc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org