zhengruifeng created SPARK-30347:
------------------------------------

             Summary: LibSVMDataSource attach AttributeGroup
                 Key: SPARK-30347
                 URL: https://issues.apache.org/jira/browse/SPARK-30347
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng

LibSVMDataSource attaches special metadata to the features column to indicate numFeatures:
{code:java}
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
{code}
However, the ML implementations all try to obtain the vector size via {{AttributeGroup}}, which cannot read this metadata, so the size is reported as unknown (-1):
{code:java}
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
{code}
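
Until the data source attaches an {{AttributeGroup}} itself, a user-side workaround could look roughly like the sketch below. It is not the proposed fix inside LibSVMDataSource. The helper name {{withAttributeGroup}} is hypothetical, and the sketch assumes the column metadata carries the {{numFeatures}} key as a long, as the output above suggests.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of Spark): copies the "numFeatures" metadata
// written by the libsvm data source into an AttributeGroup on the same column.
def withAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
  val meta = df.schema(colName).metadata
  if (!meta.contains("numFeatures")) {
    // No libsvm-style metadata present; return the DataFrame unchanged.
    df
  } else {
    // Assumption: the value is stored as a long, as shown in the issue output.
    val numFeatures = meta.getLong("numFeatures").toInt
    // This constructor records only the vector size, with no per-feature attributes.
    val group = new AttributeGroup(colName, numFeatures)
    // Passing the existing metadata to toMetadata keeps the "numFeatures" key as well.
    df.withColumn(colName, col(colName).as(colName, group.toMetadata(meta)))
  }
}
```

With the DataFrame from the example above, {{AttributeGroup.fromStructField(withAttributeGroup(data).schema("features")).size}} should then return the feature count instead of -1.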