Greetings,

I am 50/50 sure the data format is correct, since the classifier works properly if I split the data. The problem only appears when I introduce another dataset, created identically to the one the model is trained on.
However, I am unsure whether I am creating the data itself correctly, and I cannot find any help on this subject for Dataset<Row>.

What I do is create two List<Row>:

    List<Row> dataTraining = new ArrayList<>();
    List<Row> dataTesting = new ArrayList<>();

Fill them:

    dataTraining.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));
    dataTesting.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));

Then construct two Dataset<Row>:

    StructType schemaForFrame = new StructType(new StructField[] {
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty())
    });
    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

So I am not sure whether this is correct, but I am not using RDDs.

Also, can you tell me whether you have had any problems with the mailing list? I have been trying for weeks to get my emails accepted by the list.

Thanks
BR
MK
----------------------------------------
Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 11/07/2017 14:53, Riccardo Ferrari wrote:

Hi,

Are you sure you're feeding the correct data format? I found this conversation that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html

Best,

On Tue, Jul 11, 2017 at 1:42 PM, mckunkel <m.kun...@fz-juelich.de> wrote:

Greetings,

Following the example on the Apache Spark page for Naive Bayes using Dataset<Row>
https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
I want to predict the outcome of another set of data.
So instead of splitting the data into training and testing, I have one set of training data and one set of testing data, i.e.:

    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);
    NaiveBayes nb = new NaiveBayes();
    NaiveBayesModel model = nb.fit(training);
    Dataset<Row> predictions = model.transform(testing);
    predictions.show();

But I get the error:

    17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at NaiveBayes.scala:171, took 3.942413 s
    Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
        at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        ...

How do I perform predictions on other datasets that were not created by a split?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-another-Dataset-after-ML-training-tp28845.html
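The thread does not establish what causes the exception. One possibility worth ruling out, offered here only as an assumption, is that the feature arrays passed to Vectors.dense(v) do not all have the same length in the training and testing lists, since a model fitted on vectors of one size cannot score vectors of another. The following is a minimal plain-Java sketch of such a check, run on the raw double[] arrays before any Row is built. The class and method names (FeatureWidthCheck, commonWidth, sameWidth) are hypothetical and are not part of Spark.

```java
import java.util.Arrays;
import java.util.List;

public class FeatureWidthCheck {

    /**
     * Returns the length shared by every array in the list, or -1 for an
     * empty list. Throws if two arrays in the same list differ in length.
     */
    public static int commonWidth(List<double[]> features) {
        int width = -1;
        for (double[] v : features) {
            if (width == -1) {
                width = v.length; // first array fixes the expected length
            } else if (v.length != width) {
                throw new IllegalArgumentException(
                    "Inconsistent feature length: expected " + width
                        + " but found " + v.length);
            }
        }
        return width;
    }

    /** True when both lists use the same feature length (two empty lists also match). */
    public static boolean sameWidth(List<double[]> training, List<double[]> testing) {
        return commonWidth(training) == commonWidth(testing);
    }

    public static void main(String[] args) {
        // Hypothetical data: training rows have 3 features, testing rows have 2.
        List<double[]> training = Arrays.asList(
            new double[] {1.0, 0.0, 2.0},
            new double[] {0.0, 3.0, 1.0});
        List<double[]> testing = Arrays.asList(
            new double[] {2.0, 1.0},
            new double[] {0.0, 4.0});

        System.out.println("training width = " + commonWidth(training));
        System.out.println("testing width = " + commonWidth(testing));
        System.out.println("compatible = " + sameWidth(training, testing)); // prints "compatible = false"
    }
}
```

If the widths do match, the cause lies elsewhere and this check only removes one candidate. It is kept free of Spark dependencies so it can be run on the parsed arrays alone.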