[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
William Zhang updated SPARK-22974:
----------------------------------
    Description:
If CountVectorizerModel transforms columns, the output columns will not have attributes attached to them. If those columns are later used in the Interaction transformer, an exception is thrown:
{quote}"org.apache.spark.SparkException: Vector attributes must be defined for interaction."
{quote}
To reproduce it:
{quote}import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c"), Array("1", "2")),
  (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
)).toDF("id", "words", "nums")

// Model fitted from the "nums" column
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("nums")
  .setOutputCol("features2")
  .setVocabSize(4)
  .setMinDF(0)
  .fit(df)

// Model built from a fixed vocabulary for the "words" column
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features1")

val df1 = cvm.transform(df)
val df2 = cvModel.transform(df1)

val interaction = new Interaction()
  .setInputCols(Array("features1", "features2"))
  .setOutputCol("features")

// Throws SparkException: Vector attributes must be defined for interaction.
val df3 = interaction.transform(df2)
{quote}

  was:
If CountVectorModel transforms columns, the output column will not have attributes attached to them. If later on, those columns are used in Interaction transformer, an exception will be thrown:
{quote}"org.apache.spark.SparkException: Vector attributes must be defined for interaction."
{quote}
To reproduce it:
{quote}import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.{SparseVector, Vector}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c"), Array("1", "2")),
  (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
)).toDF("id", "words", "nums")

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("nums")
  .setOutputCol("features2")
  .setVocabSize(4)
  .setMinDF(0)
  .fit(df)

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features1")

val df1 = cvm.transform(df)
val df2 = cvModel.transform(df1)

val interaction = new Interaction()
  .setInputCols(Array("features1", "features2"))
  .setOutputCol("features")
val df3 = interaction.transform(df2)
{quote}

> CountVectorModel does not attach attributes to output column
> ------------------------------------------------------------
>
>                 Key: SPARK-22974
>                 URL: https://issues.apache.org/jira/browse/SPARK-22974
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.1
>            Reporter: William Zhang