[ https://issues.apache.org/jira/browse/SPARK-30347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng reassigned SPARK-30347:
------------------------------------

    Assignee: zhengruifeng

> LibSVMDataSource attach AttributeGroup
> --------------------------------------
>
>                 Key: SPARK-30347
>                 URL: https://issues.apache.org/jira/browse/SPARK-30347
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>
> LibSVMDataSource attaches special metadata to the features column to indicate numFeatures.
> {code:java}
> scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified,
> determining the number of features by going though the input. If you know the
> number in advance, please specify it via 'numFeatures' option to avoid the
> extra scan.
> data: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> data.schema("features").metadata
> res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
> {code}
> However, ML implementations all try to obtain the vector size via {{AttributeGroup}},
> which cannot use this metadata:
> {code:java}
> scala> import org.apache.spark.ml.attribute._
> import org.apache.spark.ml.attribute._
>
> scala> AttributeGroup.fromStructField(data.schema("features")).size
> res1: Int = -1
> {code}
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
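
For illustration only, here is a minimal sketch of a user-side workaround for the mismatch described in the issue: read the "numFeatures" value from the column metadata and re-attach it as an AttributeGroup. It is not the change made for SPARK-30347. The object name AttachAttributeGroup and the helper withAttributeGroup are hypothetical, and the small in-memory DataFrame is an assumed stand-in for the libsvm load shown in the issue.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object AttachAttributeGroup {

  // Hypothetical helper (not part of Spark): copy the "numFeatures" value
  // into ML attribute metadata so AttributeGroup can report the vector size.
  def withAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
    val field = df.schema(colName)
    // "numFeatures" is the metadata key shown in the issue's output.
    val numFeatures = field.metadata.getLong("numFeatures").toInt
    val group = new AttributeGroup(colName, numFeatures)
    // toMetadata(existing) keeps the original metadata and adds the ML attributes.
    df.withColumn(colName, col(colName).as(colName, group.toMetadata(field.metadata)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-30347-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the libsvm load: a vector column whose metadata
    // carries only {"numFeatures": 4}, as in the issue.
    val meta = new MetadataBuilder().putLong("numFeatures", 4L).build()
    val raw = Seq((0.0, Vectors.dense(1.0, 2.0, 3.0, 4.0))).toDF("label", "features")
    val data = raw.withColumn("features", col("features").as("features", meta))

    val before = AttributeGroup.fromStructField(data.schema("features")).size
    val after = AttributeGroup.fromStructField(withAttributeGroup(data).schema("features")).size
    println(s"size before: $before, size after: $after")

    spark.stop()
  }
}
```

In spark-shell, the same helper could be applied to the `data` DataFrame from the issue as `AttachAttributeGroup.withAttributeGroup(data)`, after which `AttributeGroup.fromStructField` should see the size instead of -1.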