Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Hi,

Are you sure you're feeding the correct data format? I found this
conversation that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html

Best,

On Tue, Jul 11, 2017 at 1:42 PM, mckunkel  wrote:

> Greetings,
>
> Following the Naive Bayes example for Dataset on the Apache Spark page
> https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
>
> I want to predict the outcome of another set of data. So instead of
> splitting the data into training and testing sets, I have one training set
> and one testing set, i.e.:
> Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
> Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);
>
> NaiveBayes nb = new NaiveBayes();
> NaiveBayesModel model = nb.fit(training);
> Dataset<Row> predictions = model.transform(testing);
> predictions.show();
>
> But I get the following error:
>
> 17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at NaiveBayes.scala:171, took 3.942413 s
> Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
>     at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>     at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
>     at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
>     at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>
> ...
> ...
> ...
>
>
> How do I perform predictions on other datasets that were not created by a
> split?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-another-Dataset-after-ML-training-tp28845.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Testing another Dataset after ML training

2017-07-11 Thread Michael C. Kunkel

Greetings,

I am about 50/50 sure the data format is correct, since the classifier works
properly when I split the data. The error appears only when I introduce another
dataset, created identically to the one the model is trained on.

However, the creation of the data itself may be at fault, and I do not see any
help on this subject for Dataset.

What I do is create two Lists:

   List<Row> dataTraining = new ArrayList<>();
   List<Row> dataTesting = new ArrayList<>();

Fill them:

   dataTraining.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));
   dataTesting.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));

Then construct two Datasets:

   StructType schemaForFrame = new StructType(new StructField[] {
       new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
       new StructField("features", new VectorUDT(), false, Metadata.empty()) });

   Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
   Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

So I am not sure if it is correct, but I am not using RDD.
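
One sanity check that may be worth running before the rows are built: a trained model presumably expects every feature vector, in both lists, to have the same number of elements. Below is a minimal sketch in plain Java, where each `double[]` stands in for the array `v` passed to `Vectors.dense(v)`. The class `FeatureSizeCheck` and its methods are hypothetical helpers made up for illustration; they are not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Spark.
final class FeatureSizeCheck {

    // Returns the length shared by every array in the list.
    // Throws if the list is empty or if any array has a different length.
    static int commonSize(List<double[]> features) {
        if (features.isEmpty()) {
            throw new IllegalArgumentException("no feature arrays given");
        }
        int size = features.get(0).length;
        for (int i = 1; i < features.size(); i++) {
            int actual = features.get(i).length;
            if (actual != size) {
                throw new IllegalArgumentException(
                    "row " + i + " has " + actual + " features, expected " + size);
            }
        }
        return size;
    }

    // True when the training and testing arrays all share one length.
    static boolean sameSize(List<double[]> training, List<double[]> testing) {
        return commonSize(training) == commonSize(testing);
    }

    public static void main(String[] args) {
        List<double[]> training = new ArrayList<>();
        training.add(new double[] {1.0, 0.0, 2.0});
        training.add(new double[] {0.0, 3.0, 1.0});

        List<double[]> testing = new ArrayList<>();
        testing.add(new double[] {1.0, 2.0}); // one feature short

        System.out.println("sizes match: " + sameSize(training, testing));
    }
}
```

If the two sizes differ, the way the feature arrays are produced for each file is the first place to look.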

Also, can you tell me if you had any problems with the mailing list? I have
been trying for weeks to get my emails accepted by the list.

Thanks

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt





Re: Testing another Dataset after ML training

2017-07-11 Thread mckunkel
I'm not sure why I cannot subscribe so that everyone can view the
conversation.
Help?






Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Hmm, to me it feels like there is some data mismatch. Are you sure you're using
the expected Vector (ml vs. mllib)? I am not sure you attached the whole
exception, but you might find some more useful details there.

Best,



Re: Testing another Dataset after ML training

2017-07-11 Thread Michael C. Kunkel

Greetings,

Thanks for the communication.

I attached the entire stack trace that was output to the screen.
I tried using JavaRDD and LabeledPoint and then converting to Dataset, and I
still get the same error as when I used only Datasets.

I am using the expected ml Vector. I tried the mllib one and that also
didn't work.

BR
MK





Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
Hi Michael,

I don't see any attachment; I'm not sure you can attach files to the list, though.



Re: Testing another Dataset after ML training

2017-07-12 Thread Kunkel, Michael C.
Greetings,

The attachment I meant was the stack trace posted in my initial email to the
list.

BR
MK


Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
Hi Michael,

I think I found your posting on Stack Overflow (SO):
https://stackoverflow.com/questions/45041677/java-spark-training-on-new-data-with-datasetrow-from-csv-file

The exception trace there is quite different from what I read here, and
it is indeed self-explanatory:
...
Caused by: java.lang.IllegalArgumentException: requirement failed: The
columns of A don't match the number of elements of x. A: 38611, x: 36179
...
Can it be that you have different 'features' vector sizes from train and
test?
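
For context, a minimal sketch of why mismatched sizes would produce this message. As far as I understand, multinomial Naive Bayes scores a row by multiplying a (classes x features) matrix of log-probabilities by the row's feature vector, so the vector must have as many elements as the matrix has columns. The class `GemvSketch`, its method name, and the toy numbers are made up for illustration; this is plain Java, not Spark's implementation.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Spark's code.
final class GemvSketch {

    // Computes y = A * x + b, where A has one row per class
    // and one column per feature.
    static double[] scores(double[][] a, double[] b, double[] x) {
        int cols = a.length == 0 ? 0 : a[0].length;
        if (cols != x.length) {
            // Mirrors the wording of the exception quoted above.
            throw new IllegalArgumentException(
                "requirement failed: The columns of A don't match "
                    + "the number of elements of x. A: " + cols + ", x: " + x.length);
        }
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double sum = b[i];
            for (int j = 0; j < cols; j++) {
                sum += a[i][j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] theta = { {1.0, 2.0, 3.0}, {0.5, 0.5, 0.5} }; // 2 classes, 3 features
        double[] pi = {0.0, 1.0};

        // Same number of features as the matrix has columns: this works.
        System.out.println(Arrays.toString(scores(theta, pi, new double[] {1.0, 1.0, 1.0})));

        // One feature short: this fails with a size-mismatch message.
        try {
            scores(theta, pi, new double[] {1.0, 1.0});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that reading, the 38611 in the trace would be the feature count seen in training and the 36179 the size of a test vector.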

Best,



Re: Testing another Dataset after ML training

2017-07-12 Thread Michael C. Kunkel

Greetings Riccardo,

That is indeed my post; it is my second attempt at getting this to work.
I am not sure the vector sizes are different, because I know the "unknown"
data is just a blind copy of 3 of the inputs used for the training data.

I will pursue this avenue further.

Thanks for the correspondence.

BR
MK
