[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996286#comment-14996286
 ] 

Yanbo Liang commented on SPARK-11478:
-------------------------------------

The "nullable" value is determined by the DataFrame execution workflow, so 
within ML we can only obtain it after calling "transform". (Maybe Spark SQL 
could expose an API to get "nullable" ahead of time?)
{quote}
For now, does it work to change toStructField to set nullable to true? All of 
the UDFs which create Double fields apparently set nullable = true by default 
(because of how ScalaReflection works).
{quote}
Yes, Double fields set nullable = true by default. If we change 
toStructField to set nullable to true, the regression test for 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/attribute/AttributeSuite.scala#L68]
 test case passes. I would like to know whether toStructField sets nullable 
to false on purpose.
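
To make the mismatch concrete, here is a minimal sketch that rebuilds the two schemas quoted in the issue description and compares them. It only assumes Spark SQL's types package on the classpath (for example in spark-shell). The helper name asNullable and the value names are hypothetical, chosen for illustration; this is not the proposed patch to toStructField.

```scala
import org.apache.spark.sql.types._

// Schema printed by transform(df) in the issue description.
val fromTransform = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("label", StringType, nullable = true),
  StructField("labelIndex", DoubleType, nullable = true)))

// Schema printed by transformSchema(df.schema) in the issue description.
val fromTransformSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("label", StringType, nullable = true),
  StructField("labelIndex", DoubleType, nullable = false)))

// Hypothetical helper: force every top-level field to nullable = true,
// so two schemas can be compared while ignoring the nullable flag.
def asNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

// Names of the fields whose nullable flag differs between the two schemas.
val mismatched = fromTransform.fields.zip(fromTransformSchema.fields)
  .collect { case (a, b) if a.nullable != b.nullable => a.name }

// True when the schemas differ only in nullability.
val sameUpToNullable = asNullable(fromTransform) == asNullable(fromTransformSchema)
```

Under these assumptions, only "labelIndex" differs, which is the field that toStructField produces.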

> ML StringIndexer return inconsistent schema
> -------------------------------------------
>
>                 Key: SPARK-11478
>                 URL: https://issues.apache.org/jira/browse/SPARK-11478
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>
> ML StringIndexer transform and transformSchema return inconsistent schemas.
> {code}
> import org.apache.spark.ml.feature.StringIndexer
>
> val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
> "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> // The nullable value of "labelIndex" differs between the two printed schemas:
> // StructType(StructField(id,IntegerType,false), 
> // StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> // StructType(StructField(id,IntegerType,false), 
> // StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}



