Re: Testing another Dataset after ML training

Michael C. Kunkel Tue, 11 Jul 2017 06:05:22 -0700

Greetings,

I am 50/50 sure the data format is correct, since the classifier works properly 
if I split the data. It fails only when I introduce another dataset, created 
identically to the one it is trained on.


However, the way I create the data may itself be the problem, and I have not 
found any help on creating a Dataset<Row> this way.

What I do is create two List<Row>

       List<Row> dataTraining = new ArrayList<>();
       List<Row> dataTesting = new ArrayList<>();

Fill them
               dataTraining.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));
               dataTesting.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));

Then construct two Dataset<Row>

       StructType schemaForFrame = new StructType(new StructField[] {
               new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
               new StructField("features", new VectorUDT(), false, Metadata.empty()) });


       Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
       Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);


I am not sure whether this is correct, but note that I am not using RDDs.
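One thing that can be checked without Spark is the feature arrays themselves, before they are passed to Vectors.dense(v). The sketch below is plain Java and only illustrates that idea: the class FeatureCheck and its methods commonDimension and invalidRows are hypothetical names, not part of Spark. It assumes that a difference in feature count between the training and testing rows is one possible cause of a failure at transform time, which the truncated stack trace does not confirm. The check for negative values is included because the default (multinomial) NaiveBayes in Spark requires non-negative feature values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sanity checks for feature arrays, run before Vectors.dense(v). */
final class FeatureCheck {

    private FeatureCheck() {}

    /** Returns the length shared by all arrays, or throws if the lengths differ. */
    static int commonDimension(List<double[]> rows) {
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("no rows");
        }
        int dim = rows.get(0).length;
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != dim) {
                throw new IllegalArgumentException(
                        "row " + i + " has " + rows.get(i).length
                                + " features, expected " + dim);
            }
        }
        return dim;
    }

    /** Returns the indices of rows that hold a negative, NaN, or infinite value. */
    static List<Integer> invalidRows(List<double[]> rows) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            for (double x : rows.get(i)) {
                if (Double.isNaN(x) || Double.isInfinite(x) || x < 0.0) {
                    bad.add(i);
                    break; // one bad value is enough to flag the row
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<double[]> trainingFeatures = Arrays.asList(
                new double[] {1.0, 0.0, 3.0},
                new double[] {0.0, 2.0, 1.0});
        List<double[]> testingFeatures = Arrays.asList(
                new double[] {2.0, 1.0});

        int trainDim = commonDimension(trainingFeatures);
        int testDim = commonDimension(testingFeatures);
        System.out.println("training dimension: " + trainDim);
        System.out.println("testing dimension: " + testDim);
        System.out.println("dimensions match: " + (trainDim == testDim));
        System.out.println("invalid testing rows: " + invalidRows(testingFeatures));
    }
}
```

In the code above, the same arrays v that go into dataTraining and dataTesting would be collected into two lists and compared this way before calling spark.createDataFrame.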

Also, can you tell me if you had any problems with the mailing list? I have 
been trying for weeks to get my emails accepted by the list.

Thanks

BR
MK
----------------------------------------
Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 11/07/2017 14:53, Riccardo Ferrari wrote:
Hi,

Are you sure you're feeding the correct data format? I found this conversation 
that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html

Best,

On Tue, Jul 11, 2017 at 1:42 PM, mckunkel <m.kun...@fz-juelich.de> wrote:
Greetings,

I am following the example on the Apache Spark page for Naive Bayes using Dataset<Row>:
https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes

I want to predict the outcome of another set of data. So instead of
splitting one dataset into training and testing parts, I have one training set
and one separate testing set, i.e.:
               Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
               Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

               NaiveBayes nb = new NaiveBayes();
               NaiveBayesModel model = nb.fit(training);
               Dataset<Row> predictions = model.transform(testing);
               predictions.show();

But I get the following error:

17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at NaiveBayes.scala:171, took 3.942413 s
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
       at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
       at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
       at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
       at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

...
...
...


How do I perform predictions on other datasets that were not created by a
split?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-another-Dataset-after-ML-training-tp28845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






