[ https://issues.apache.org/jira/browse/SPARK-16840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404006#comment-15404006 ]
Barry Becker commented on SPARK-16840:
--------------------------------------

I tried adding the aggregated data to the model. However, after doing that, I see that it still does not contain what we want. Here is what it contains for the titanic dataset:

(0.0,(370,[56.0,111.0,1689.0,1450.0,934.0,569.0,370.0,495.0])),
(1.0,(244,[165.0,109.0,1006.0,1408.0,468.0,362.0,244.0,379.0]))

which corresponds to

(survived=No, (numNo in training data, [ *** ])),
(survived=Yes, (numYes in training data, [ *** ]))

*** = dense vector with one element per feature. I believe it is the sum of the counts times the number of values for each feature. But what we want is the distribution of class counts for each feature value. It's not the same, and it looks like Spark is not even computing that as an intermediate step. So it may not be easy to add this information to the model if it is not being calculated as an intermediate step.

> Please save the aggregate term frequencies as part of the NaiveBayesModel
> -------------------------------------------------------------------------
>
>                 Key: SPARK-16840
>                 URL: https://issues.apache.org/jira/browse/SPARK-16840
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Barry Becker
>              Labels: ML, MLLib
>
> I would like to visualize the structure of the NaiveBayes model in order to get additional insight into the patterns in the data. In order to do that, I need the frequencies for each feature value per label.
> This exact information is computed in the NaiveBayes.run method (see the "aggregated" variable), but it is then discarded when creating the model. Pi and theta are computed from the aggregated frequency counts, but surprisingly those counts are not needed to apply the model. Including these aggregated counts would not add much to the model size, but could be very useful for some applications of the model.
> {code}
> def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
>   :
>   // Aggregates term frequencies per label.
>   val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
>     createCombiner = (v: Vector) => {
>       :
>     },
>     :
>   new NaiveBayesModel(labels, pi, theta, modelType)  // <- please include "aggregated" here.
> }
> {code}
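For reference, the per-feature-value class distribution described in the comment can be derived directly from the training data while it is still available, even though the model does not retain it. The sketch below is a hypothetical helper (not part of the Spark API); it assumes an RDD[LabeledPoint] like the one passed to NaiveBayes.run and counts how often each (feature index, feature value) pair occurs under each label.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): for every (feature index, feature value, label)
// triple, count how many training rows contain it. This is the "distribution of class
// counts for each feature value" discussed in the comment above.
def featureValueClassCounts(data: RDD[LabeledPoint]): Map[(Int, Double, Double), Long] = {
  data.flatMap { p =>
    p.features.toArray.zipWithIndex.map { case (value, idx) =>
      ((idx, value, p.label), 1L)
    }
  }.reduceByKey(_ + _)  // sum the per-row counts for each (index, value, label) key
   .collect()
   .toMap
}
{code}

The caveat is that this requires a second pass over the raw training data, which is exactly why retaining the aggregated counts in the NaiveBayesModel itself would be convenient.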