[ 
https://issues.apache.org/jira/browse/SPARK-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252744#comment-14252744
 ] 

zhang jun wei commented on SPARK-4872:
--------------------------------------

Ok, so in order to use NaiveBayes in Spark correctly, seems the translation way 
should be:
1) With the real life model:
Weather | Temperature | Humidity | Wind => Decision to play tennis
Sunny | Hot | High | No => No
Sunny | Hot | High | Yes => No
Cloudy | Normal | Normal | No => Yes
Rainy | Cold | Normal | Yes => No

2) (Weather, Temperature, Humidity, Wind) are not features, as Weather and 
Temperature have three values, Humidity and Wind are binary features, instead, 
(Sunny, Cloudy, Rainy, Hot, Normal, Cold, Humidity, Wind) are binary features 
to present a training/test data.

The reason why I am trying to figure out the format here is because in some 
different implementations, the codes can accept the real life data, then it 
internally constructs the feature matrix by listing all the values (e.g. Sunny, 
Cloudy, Rainy) in each non-binary category (e.g. Weather).

> Provide sample format of training/test data in MLlib programming guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4872
>                 URL: https://issues.apache.org/jira/browse/SPARK-4872
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.1.1
>            Reporter: zhang jun wei
>              Labels: documentation
>
> I suggest: in samples of the online programming guide of MLlib, it's better 
> to give examples in the real life data, and list the translated data format 
> for the model to consume. 
> The problem blocking me is how to translate the real life data into the 
> format which MLLib  can understand correctly. 
> Here is one sample, I want to use NaiveBayes to train and predict tennis-play 
> decision, the original data is:
> Weather | Temperature | Humidity | Wind  => Decision to play tennis
> Sunny     | Hot               | High       | No     => No
> Sunny     | Hot               | High       | Yes    => No
> Cloudy    | Normal         | Normal   | No     => Yes
> Rainy      | Cold             | Normal   | Yes    => No
> Now, from my understanding, one potential translation is:
> 1) put every feature value word into a line:
> Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
> 2) map them to numbers:
> 1 2 3 4 5 6 7 8 9 10
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) set the value to 1 if it appears, or 0 if not, for the above example, here 
> is the data format for MLUtils.loadLibSVMFile to use:
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
> 0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
> 1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
> 0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
> ==> Is this a correct understanding?
> And another way I can image is:
> 1) put every feature name into a line:
> Weather  Temperature  Humidity  Wind
> 2) map them to numbers:
> 1 2 3 4 
> 3) map decision labels to numbers:
> 0 - No
> 1 - Yes
> 4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
> Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 
> 1, No to 2) for the above example, here is the data format for 
> MLUtils.loadLibSVMFile to use:
> 0 1:1 2:1 3:1 4:2
> 0 1:1 2:1 3:1 4:1
> 1 1:2 2:2 3:2 4:2
> 0 1:3 2:3 3:2 4:1
> ==> but when I read the source code in NaiveBayes.scala, seems this is not 
> correct, I am not sure though...
> So which data format translation way is correct?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to