[ https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346567#comment-17346567 ]
Alexey Diomin commented on SPARK-35370:
---------------------------------------

Not a bug. This comes from our company-specific patches, which store all metadata in Parquet in lower case to prevent integration errors between Spark/Hive/Impala/etc. Can be closed.

> IllegalArgumentException when loading a PipelineModel with Spark 3
> -------------------------------------------------------------------
>
>                 Key: SPARK-35370
>                 URL: https://issues.apache.org/jira/browse/SPARK-35370
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.0, 3.1.1
>         Environment: spark 3.1.1
>            Reporter: Avenash Kabeera
>            Priority: Minor
>              Labels: V3, decisiontree, scala, treemodels
>
> Hi,
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-33398, which fixed an exception when loading a model in Spark 3 that was trained in Spark 2. After incorporating that fix in my project, I ran into another issue, which was introduced by the fix: [https://github.com/apache/spark/pull/30889/files]
> While loading my random forest model, which was trained in Spark 2.2, I ran into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: nodeData does not exist. Available: treeid, nodedata
>   at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>   at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>   at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>   at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>   at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>   at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>   at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>   at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>   at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>   at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
> When I looked at the data for the model, I see the schema is using "*nodedata*" instead of "*nodeData*". Here is what my model looks like:
> {code:java}
> +------+------------------------------------------------------------------------------------------------------------------+
> |treeid|nodedata                                                                                                          |
> +------+------------------------------------------------------------------------------------------------------------------+
> |12    |{0, 1.0, 0.20578590428109744, [249222.0, 1890856.0], 0.046774779237015784, 1, 128, {1, [0.7468856332819338], -1}}|
> |12    |{1, 1.0, 0.49179982674596906, [173902.0, 224985.0], 0.022860340952237657, 2, 65, {4, [0.6627218934911243], -1}}  |
> |12    |{2, 0.0, 0.4912259578159168, [90905.0, 69638.0], 0.10950848921275999, 3, 34, {9, [0.13666873125270484], -1}}     |
> |12    |{3, 1.0, 0.4308078797704775, [23317.0, 50941.0], 0.04311282777881931, 4, 19, {10, [0.506218002482692], -1}}      |
> {code}
> I'm new to Spark, and the training of this model predates me, so I can't say whether specifying the column as "nodedata" was specific to our code or was internal Spark code. But I suspect it's internal Spark code.
>
> edit:
> cc [~podongfeng], the author of the original PR to support loading Spark 2 models in Spark 3. Maybe you have some insights on "nodedata" vs "nodeData".
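For anyone hitting the same trace: the lookup that fails is {{StructType.apply}}, which matches field names exactly (case-sensitively), so the lower-cased "treeid"/"nodedata" columns written by the patched Spark 2 writer cannot be found under the camelCase name the Spark 3 reader asks for. A minimal sketch of the lookup behaviour (the field types below are placeholders, not the real tree-node struct):

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Stand-in for the saved tree data schema, with the lower-cased names from this issue.
val lowerCased = StructType(Seq(
  StructField("treeid", IntegerType),
  StructField("nodedata", StringType)  // placeholder type; the real column is a struct
))

lowerCased("nodedata")  // resolves to the StructField
lowerCased("nodeData")  // throws java.lang.IllegalArgumentException:
                        //   nodeData does not exist. Available: treeid, nodedata
{code}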
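If retraining is not an option, one possible workaround is to rewrite the tree data Parquet with camelCase column names before calling {{PipelineModel.load}}. This is only a sketch: the stage path below is hypothetical and depends on how the PipelineModel was saved, so check it against your own model directory.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical locations; adjust to the actual stage directory inside your saved model.
val treeDataPath = "<modelPath>/stages/<rf-stage>/data"
val fixedPath    = treeDataPath + "_fixed"

spark.read.parquet(treeDataPath)
  .withColumnRenamed("nodedata", "nodeData")  // the name the Spark 3 reader asks for
  .withColumnRenamed("treeid", "treeID")      // restore this casing too if the loader complains about it
  .write.mode("overwrite").parquet(fixedPath)

// After checking the output, swap the fixed directory in for the original "data"
// directory and retry PipelineModel.load(modelPath).
{code}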