[ https://issues.apache.org/jira/browse/SPARK-26326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-26326:
------------------------------
    Priority: Minor  (was: Major)

Yeah, this means you have a model with about 265M parameters, and when 
serialized as an array of bytes it (barely) exceeds 2GB. I think 
reimplementing this under the hood is possible, but it may call into question 
whether this is a realistic use case for naive Bayes.
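For context, the dimensions in the report below line up with that estimate. A back-of-the-envelope check (the header-overhead comment is an assumption about UnsafeArrayData's internal row layout, not something stated in this ticket):

```scala
// Dimensions from the report below: 48685 features x 5453 labels.
val numFeatures = 48685L
val numLabels = 5453L

val params = numFeatures * numLabels // parameter count of the coefficient matrix
val valueBytes = params * 8L         // each parameter stored as an 8-byte double

println(params)     // 265479305  (~265M parameters)
println(valueBytes) // 2123834440 (~2.12 GB of raw values)

// The raw values alone are just under Int.MaxValue (2147483647) bytes, but
// UnsafeArrayData also prepends a per-row header (length word plus a
// 64-bit-aligned null bitmap -- an assumed detail of the internal layout),
// which pushes the total past the limit, hence the "(barely)" above.
```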

> Cannot save a NaiveBayesModel with 48685 features and 5453 labels
> -----------------------------------------------------------------
>
>                 Key: SPARK-26326
>                 URL: https://issues.apache.org/jira/browse/SPARK-26326
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Markus Paaso
>            Priority: Minor
>
> When executing
> {code:java}
> model.write().overwrite().save("/tmp/mymodel"){code}
> the following error occurs:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot convert this array to unsafe format as it's too big.
> at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(UnsafeArrayData.java:457)
> at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(UnsafeArrayData.java:524)
> at org.apache.spark.ml.linalg.MatrixUDT.serialize(MatrixUDT.scala:66)
> at org.apache.spark.ml.linalg.MatrixUDT.serialize(MatrixUDT.scala:28)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:143)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:258)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:396)
> at org.apache.spark.sql.catalyst.plans.logical.LocalRelation$.$anonfun$fromProduct$1(LocalRelation.scala:43)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
> at scala.collection.immutable.List.foreach(List.scala:388)
> at scala.collection.TraversableLike.map(TraversableLike.scala:233)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
> at scala.collection.immutable.List.map(List.scala:294)
> at org.apache.spark.sql.catalyst.plans.logical.LocalRelation$.fromProduct(LocalRelation.scala:43)
> at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:315)
> at org.apache.spark.ml.classification.NaiveBayesModel$NaiveBayesModelWriter.saveImpl(NaiveBayes.scala:393)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
> {code}
> Data file to reproduce the problem: 
> [https://github.com/make/spark-26326-files/raw/master/data.libsvm]
> Code to reproduce the problem:
> {code:java}
> import org.apache.spark.ml.classification.NaiveBayes
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> // Load the data stored in LIBSVM format as a DataFrame.
> val data = spark.read.format("libsvm").load("/tmp/data.libsvm")
> // Train a NaiveBayes model.
> val model = new NaiveBayes().fit(data)
> model.write().overwrite().save("/tmp/mymodel"){code}
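Until the save path is reimplemented, one possible workaround is to persist the raw coefficients yourself by streaming them to disk, so no single multi-gigabyte byte array (or Catalyst row) is ever materialized. This is a hypothetical sketch, not something supported by MLWriter; `params` below stands in for the model's flattened coefficient matrix (e.g. `model.theta.toArray`):

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

// Stand-in for the flattened NaiveBayesModel coefficients; for the real
// model this would be model.theta.toArray (and similarly model.pi.toArray).
val params: Array[Double] = Array(0.1, 0.2, 0.3)

// Stream the doubles out one at a time: the on-disk file can exceed 2 GB
// even though no 2 GB in-memory buffer is ever allocated.
val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("/tmp/theta.bin")))
out.writeInt(params.length)
params.foreach(out.writeDouble)
out.close()

// Read the coefficients back the same way.
val in = new DataInputStream(new BufferedInputStream(new FileInputStream("/tmp/theta.bin")))
val restored = Array.fill(in.readInt())(in.readDouble())
in.close()
```

The trade-off is that reloading means reconstructing the matrices by hand instead of using `NaiveBayesModel.load`, so this gives up MLWriter compatibility in exchange for sidestepping the size limit.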



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
