It's a bit of a guess, but the class name
$line103090609224.$read$FeatureModder looks like something generated by the
shell, and I think that is the class's 'real' name in this case. If you
redefine the class later in another session and then load the model, the new
definition won't match the name stored in the metadata. Can you declare the
class in a package instead?
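
Roughly what I have in mind is the sketch below; the package name
com.example.features, the jar name, and the Zeppelin settings are just
placeholders, and I've also taken the liberty of making the uid a constructor
parameter, moving the reader onto a companion object, and using defaultCopy,
since that is the pattern the default readers/writers expect:

    // FeatureModder.scala -- built into e.g. feature-modder.jar (placeholder name)
    package com.example.features

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.{Param, ParamMap}
    import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{DataTypes, DoubleType, IntegerType, StructType}

    class FeatureModder(override val uid: String) extends Transformer with DefaultParamsWritable {

      // DefaultParamsReader rebuilds the instance through a constructor that takes only the uid
      def this() = this(Identifiable.randomUID("FeatureModder"))

      final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column")
      final def setInputCol(value: String): this.type = set(inputCol, value)

      final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column")
      final def setOutputCol(value: String): this.type = set(outputCol, value)

      final val size: Param[String] = new Param[String](this, "size", "length of output vector")
      final def setSize(n: Int): this.type = set(size, n.toString)

      override def transform(data: Dataset[_]): DataFrame = {
        val modUDF = udf((n: Int) => n % $(size).toInt)
        data.withColumn($(outputCol), modUDF(col($(inputCol)).cast(IntegerType)))
      }

      override def transformSchema(schema: StructType): StructType = {
        val actualType = schema($(inputCol)).dataType
        require(actualType == IntegerType || actualType == DoubleType,
          "Input column must be of numeric type")
        DataTypes.createStructType(schema.fields :+
          DataTypes.createStructField($(outputCol), IntegerType, false))
      }

      // defaultCopy avoids copy() recursively calling itself
      override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
    }

    // Reader on the companion object, so the loader can resolve FeatureModder.read by class name
    object FeatureModder extends DefaultParamsReadable[FeatureModder]

Compile that into a jar and put it on the Spark classpath in both notebooks
(e.g. via spark.jars or Zeppelin's interpreter dependency settings). The
persisted pipeline metadata should then record
com.example.features.FeatureModder, which can be resolved in any session,
rather than a $line...$read name that only exists in the REPL session that
defined it.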

On Tue, Jun 8, 2021 at 10:50 AM Artemis User <arte...@dtechspace.com> wrote:

> We have a feature engineering transformer defined as a custom class with
> UDF as follows:
>
> class FeatureModder extends Transformer with DefaultParamsWritable
>     with DefaultParamsReadable[FeatureModder] {
>     val uid: String = "FeatureModder" + randomUUID
>
>     final val inputCol: Param[String] =
>         new Param[String](this, "inputCos", "input column")
>     final def setInputCol(col: String) = set(inputCol, col)
>
>     final val outputCol: Param[String] =
>         new Param[String](this, "outputCol", "output column")
>     final def setOutputCol(col: String) = set(outputCol, col)
>
>     final val size: Param[String] =
>         new Param[String](this, "size", "length of output vector")
>     final def setSize = (n: Int) => set(size, n.toString)
>
>     override def transform(data: Dataset[_]) = {
>         val modUDF = udf({ n: Int => n % $(size).toInt })
>         data.withColumn($(outputCol), modUDF(col($(inputCol)).cast(IntegerType)))
>     }
>
>     def transformSchema(schema: org.apache.spark.sql.types.StructType):
>             org.apache.spark.sql.types.StructType = {
>         val actualType = schema($(inputCol)).dataType
>         require(actualType.equals(IntegerType) || actualType.equals(DoubleType),
>             s"Input column must be of numeric type")
>         DataTypes.createStructType(schema.fields :+
>             DataTypes.createStructField($(outputCol), IntegerType, false))
>     }
>
>     override def copy(extra: ParamMap): Transformer = copy(extra)
> }
>
> This was included in an ML pipeline, fitted into a model, and persisted to
> a disk file.  When we try to load the pipeline model in a separate notebook
> (we use Zeppelin), an exception is thrown complaining that the class is not
> found.
>
> java.lang.ClassNotFoundException: $line103090609224.$read$FeatureModder
>   at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:398)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
>   at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:630)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>   at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>   at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>   at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>   at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337)
>   ... 40 elided
> Could someone help explain why?  My guess is that the class definition is
> not on the classpath.  The question is how to include the class definition
> or class metadata as part of the pipeline model serialization, or how to
> include the class definition in a notebook (we did include the class
> definition in the notebook that loads the pipeline model)?
>
> Thanks a lot in advance for your help!
>
> ND
>
