[ 
https://issues.apache.org/jira/browse/SPARK-30347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30347:
------------------------------------

    Assignee: zhengruifeng

> LibSVMDataSource attach AttributeGroup
> --------------------------------------
>
>                 Key: SPARK-30347
>                 URL: https://issues.apache.org/jira/browse/SPARK-30347
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>
> LibSVMDataSource attaches special metadata to the features column to indicate numFeatures.
> {code:java}
> scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
> data: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> data.schema("features").metadata
> res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
> {code}
> However, the ML implementations all try to obtain the vector size via {{AttributeGroup}}, 
> which cannot use this metadata:
> {code:java}
> scala> import org.apache.spark.ml.attribute._
> import org.apache.spark.ml.attribute._
>
> scala> AttributeGroup.fromStructField(data.schema("features")).size
> res1: Int = -1
>  {code}
>  
>  
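
Until the data source attaches an AttributeGroup itself, the gap described above can be bridged on the user side. The following is a minimal sketch, not the change proposed in this issue: the object AttachAttributeGroup and the method withAttributeGroup are hypothetical names, and the sketch assumes the "numFeatures" entry is stored as a long in the column metadata, as the output above suggests.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark.
object AttachAttributeGroup {

  // Copies the "numFeatures" value found in the column metadata into an
  // AttributeGroup, so that AttributeGroup.fromStructField can see the size.
  def withAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
    val meta = df.schema(colName).metadata
    if (!meta.contains("numFeatures")) {
      // Nothing to translate; return the DataFrame unchanged.
      df
    } else {
      // Assumption: the value is stored as a long.
      val numFeatures = meta.getLong("numFeatures").toInt
      val group = new AttributeGroup(colName, numFeatures)
      // toMetadata(meta) keeps the existing keys and adds the ML attribute entry.
      df.withColumn(colName, col(colName).as(colName, group.toMetadata(meta)))
    }
  }
}
```

With the DataFrame from the example above, AttributeGroup.fromStructField(AttachAttributeGroup.withAttributeGroup(data).schema("features")).size should then report the number of features instead of -1.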



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
