Hi,

I am hitting this issue: https://issues.apache.org/jira/browse/SPARK-10835.

The issue is marked as resolved, but it is resurfacing in the 2.0 ML API. Any workaround would be appreciated.

Note: the pipeline has an NGram stage before Word2Vec.
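For context, this is roughly how the pipeline is wired; the input and intermediate column names below ("text", "words") are placeholders, only "wordsGrams" matches the error that follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{NGram, Tokenizer, Word2Vec}

// Placeholder columns up to the n-gram stage.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("wordsGrams")
val word2Vec = new Word2Vec()
  .setInputCol("wordsGrams")
  .setOutputCol("features")
  .setVectorSize(128)
  .setMinCount(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, word2Vec))
// Fitting fails at the Word2Vec stage with the exception below.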

Error:

val word2Vec = new Word2Vec()
  .setInputCol("wordsGrams")
  .setOutputCol("features")
  .setVectorSize(128)
  .setMinCount(10)

scala> word2Vec.fit(grams)
java.lang.IllegalArgumentException: requirement failed: Column wordsGrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false).
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
  at org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
  at org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)


GitHub code for NGram, which declares the output element type as non-nullable:


override protected def validateInputType(inputType: DataType): Unit = {
  require(inputType.sameType(ArrayType(StringType)),
    s"Input type must be ArrayType(StringType) but got $inputType.")
}

override protected def outputDataType: DataType = new ArrayType(StringType, false)
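One possible workaround, as a sketch (assuming the n-gram output DataFrame is the grams shown above): pass the column through an identity UDF. Spark infers the UDF's return type as ArrayType(StringType, containsNull = true), which is what Word2Vec's schema check expects.

import org.apache.spark.sql.functions.{col, udf}

// Identity UDF over Seq[String]; its inferred return schema is
// ArrayType(StringType, containsNull = true), so the Word2Vec
// schema check passes.
val asNullable = udf((xs: Seq[String]) => xs)

val gramsNullable = grams.withColumn("wordsGrams", asNullable(col("wordsGrams")))
word2Vec.fit(gramsNullable)

This is a no-op pass over the column; inside a Pipeline it would have to be applied between the NGram and Word2Vec stages (or wrapped in a custom transformer).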
