zhengruifeng created SPARK-30347:
------------------------------------

             Summary: LibSVMDataSource attach AttributeGroup
                 Key: SPARK-30347
                 URL: https://issues.apache.org/jira/browse/SPARK-30347
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng


LibSVMDataSource attaches special metadata to the features column to indicate numFeatures.
{code:java}
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
19/12/24 18:40:09 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
{code}
However, the ML implementations all try to obtain the vector size via {{AttributeGroup}}, 
which cannot use this metadata:
{code:java}
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._

scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
 {code}
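Until the data source attaches an {{AttributeGroup}} itself, a user-side workaround might look like the sketch below. It is not the proposed patch: the helper name {{attachAttributeGroup}} is hypothetical, and it assumes the column still carries the {{numFeatures}} metadata shown above. It builds an {{AttributeGroup}} of that size and re-aliases the column with the merged metadata.
{code:java}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: copy the libsvm "numFeatures" metadata into an AttributeGroup.
def attachAttributeGroup(df: DataFrame, colName: String = "features"): DataFrame = {
  // Metadata written by the libsvm source, e.g. {"numFeatures":4}
  val existing = df.schema(colName).metadata
  val numFeatures = existing.getLong("numFeatures").toInt
  // A group that only knows its size, with no per-feature attributes
  val group = new AttributeGroup(colName, numFeatures)
  // toMetadata(existing) keeps the original keys and adds the ML attribute entry
  df.withColumn(colName, col(colName).as(colName, group.toMetadata(existing)))
}

val withAttrs = attachAttributeGroup(data)
val size = AttributeGroup.fromStructField(withAttrs.schema("features")).size
{code}
With the sample data above, {{size}} should then match the {{numFeatures}} value rather than -1.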
 

 


