[ https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346567#comment-17346567 ]

Alexey Diomin commented on SPARK-35370:
---------------------------------------

Not a bug.

This was caused by our company-specific patches, which store all Parquet metadata in lower case to prevent integration errors between Spark/Hive/Impala/etc.

This issue can be closed.
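
For illustration (not part of the original report): the failure in the stack trace below comes from looking the column up by its camelCase name via StructType.apply, which is case-sensitive, so a schema whose field names were lower-cased on write fails the lookup. A minimal Scala sketch of that mechanism, using the column names from the reported error:

{code:scala}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Schema as it appears in the saved Spark 2 model after lower-casing on write.
// (StringType stands in for the real nested node-data struct; only the names matter here.)
val stored = StructType(Seq(
  StructField("treeid", LongType),
  StructField("nodedata", StringType)
))

// The Spark 3 reader (EnsembleModelReadWrite.loadImpl in the trace below) asks for the
// camelCase name. StructType.apply(name) is a case-sensitive lookup, so this throws:
//   java.lang.IllegalArgumentException: nodeData does not exist. Available: treeid, nodedata
val field = stored("nodeData")
{code}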

> IllegalArgumentException when loading a PipelineModel with Spark 3
> ------------------------------------------------------------------
>
>                 Key: SPARK-35370
>                 URL: https://issues.apache.org/jira/browse/SPARK-35370
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.0, 3.1.1
>         Environment: spark 3.1.1
>            Reporter: Avenash Kabeera
>            Priority: Minor
>              Labels: V3, decisiontree, scala, treemodels
>
> Hi,
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-33398, which fixed an exception when loading a model in Spark 3 that was trained in Spark 2. After incorporating that fix in my project, I ran into another issue, introduced by the fix itself: https://github.com/apache/spark/pull/30889/files
> While loading my random forest model, which was trained in Spark 2.2, I ran into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: nodeData does not exist. Available: treeid, nodedata
>  at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>  at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>  at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>  at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>  at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
> When I looked at the data for the model, I saw that the schema uses "*nodedata*" instead of "*nodeData*". Here is what my model looks like:
> {code:java}
> +------+-----------------------------------------------------------------------------------------------------------------+
> |treeid|nodedata                                                                                                         |
> +------+-----------------------------------------------------------------------------------------------------------------+
> |12    |{0, 1.0, 0.20578590428109744, [249222.0, 1890856.0], 0.046774779237015784, 1, 128, {1, [0.7468856332819338], -1}}|
> |12    |{1, 1.0, 0.49179982674596906, [173902.0, 224985.0], 0.022860340952237657, 2, 65, {4, [0.6627218934911243], -1}}  |
> |12    |{2, 0.0, 0.4912259578159168, [90905.0, 69638.0], 0.10950848921275999, 3, 34, {9, [0.13666873125270484], -1}}     |
> |12    |{3, 1.0, 0.4308078797704775, [23317.0, 50941.0], 0.04311282777881931, 4, 19, {10, [0.506218002482692], -1}}      | {code}
> I'm new to Spark, and the training of this model predates me, so I can't say whether specifying the column as "nodedata" came from our own code or from Spark internals, but I suspect it was internal Spark code.
>  
> edit:
> cc [~podongfeng], the author of the original PR to support loading Spark 2 models in Spark 3, who may have some insight on "nodedata" vs "nodeData".
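
Not part of the original report: if someone hits the same mismatch, one possible repair is to rename the lower-cased columns in the saved tree data back to the camelCase names the Spark 3 reader looks up. A rough sketch, assuming the tree data sits in the Parquet "data" directory of the saved random forest stage (the directory layout and the treeID spelling are assumptions; only nodeData is confirmed by the error above):

{code:scala}
import org.apache.spark.sql.SparkSession

object RenameTreeColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rename-tree-columns").getOrCreate()

    // Hypothetical path: the Parquet "data" directory of the saved random forest stage,
    // e.g. <pipeline-path>/stages/<n>_RandomForestClassificationModel_<uid>/data
    val treeDataPath = args(0)

    val df = spark.read.parquet(treeDataPath)
    df.printSchema() // confirm whether the columns are treeid/nodedata or treeID/nodeData

    // Rename to the camelCase names the Spark 3 loader expects; withColumnRenamed is a
    // no-op when the old name is absent, so this is safe to run on an already-correct model.
    val fixed = df
      .withColumnRenamed("treeid", "treeID")
      .withColumnRenamed("nodedata", "nodeData")

    // Write next to the original rather than overwriting the directory being read.
    fixed.write.parquet(treeDataPath + "_camelcase")

    spark.stop()
  }
}
{code}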


