Hi, I am hitting this issue: https://issues.apache.org/jira/browse/SPARK-10835.
The issue seems to be resolved, but it is resurfacing in 2.0 ML. Any workaround would be appreciated. Note: the Pipeline has NGram before Word2Vec.

Error:

    val word2Vec = new Word2Vec()
      .setInputCol("wordsGrams")
      .setOutputCol("features")
      .setVectorSize(128)
      .setMinCount(10)

    scala> word2Vec.fit(grams)
    java.lang.IllegalArgumentException: requirement failed: Column wordsGrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false).
      at scala.Predef$.require(Predef.scala:224)
      at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
      at org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
      at org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
      at org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
      at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
      at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)

GitHub code for NGram:

    override protected def validateInputType(inputType: DataType): Unit = {
      require(inputType.sameType(ArrayType(StringType)),
        s"Input type must be ArrayType(StringType) but got $inputType.")
    }

    override protected def outputDataType: DataType = new ArrayType(StringType, false)
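One workaround I am considering, as a rough sketch only (assuming Spark 2.0, a DataFrame `grams` that already holds the NGram output column "wordsGrams", and a hypothetical helper name `toNullableArray`): re-wrap the column through an identity UDF outside the Pipeline, since a UDF returning Seq[String] is inferred as ArrayType(StringType, true), which is what Word2Vec's schema check expects.

    import org.apache.spark.sql.functions.{col, udf}

    // Identity UDF: a Seq[String] return type is inferred as ArrayType(StringType, true),
    // i.e. containsNull = true. (toNullableArray is just a name I made up.)
    val toNullableArray = udf((xs: Seq[String]) => xs)

    // Relax the nullability of the NGram output column, then fit on the adjusted frame.
    val gramsRelaxed = grams.withColumn("wordsGrams", toNullableArray(col("wordsGrams")))
    val model = word2Vec.fit(gramsRelaxed)

If the conversion has to live inside the Pipeline itself, I suppose the same UDF could be wrapped in a small custom Transformer placed between the NGram and Word2Vec stages, but is there a cleaner way?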