[jira] [Commented] (MAHOUT-985) MapBackedArffModel Unable To Parse ARFF Files Containing Instance Weights

Dave Kor (Commented) (JIRA) Sun, 26 Feb 2012 04:19:17 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216709#comment-13216709 ]


Dave Kor commented on MAHOUT-985:
---------------------------------

This is best answered from a machine learning perspective. Instance weights are 
used in a wide variety of ML-related tasks, which is why Weka supports them. 
Examples include:

(A) Resampling Datasets
Sometimes the machine learning algorithm needs to focus on specific parts of 
the dataset. For example, the dataset may be imbalanced, with the class label 
you are interested in swamped by a huge number of uninteresting instances (in 
other words, the proverbial needle-in-a-haystack problem). Most techniques for 
handling such cases involve some form of careful resampling: boosting the 
weight of instances that have the desired class label, down-weighting the 
unwanted instances, or both.
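
As a rough illustration of the reweighting idea above, here is a minimal 
sketch that gives every class the same total weight. The class name 
ClassBalancedWeights and the method balancedWeights are hypothetical and are 
not part of Weka or Mahout; the n / (k * count) formula is just one common 
choice, not the only way to rebalance.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Weka or Mahout class.
public class ClassBalancedWeights {

    /**
     * Returns one weight per instance, chosen so that every class ends up
     * with the same total weight (n / k, where n is the number of instances
     * and k is the number of distinct labels).
     */
    public static double[] balancedWeights(List<String> labels) {
        // Count how many instances carry each label.
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int n = labels.size();
        int k = counts.size();
        double[] weights = new double[n];
        for (int i = 0; i < n; i++) {
            // Rare labels get weights above 1, common labels below 1.
            weights[i] = (double) n / (k * counts.get(labels.get(i)));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("False", "False", "False", "True");
        double[] w = balancedWeights(labels);
        for (int i = 0; i < w.length; i++) {
            System.out.println(labels.get(i) + " -> " + w[i]);
        }
    }
}
```

With three "False" instances and one "True" instance, the single "True" 
instance receives weight 2 and each "False" instance receives weight 2/3, so 
both classes carry a total weight of 2.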

(B) Smoothing or Regularization (See 
http://en.wikipedia.org/wiki/Regularization_(mathematics) )
Some Bayesian learning methods take the prior distribution of labels into 
account when training a model, and one of the simpler ways of introducing a 
prior is to apply it as instance weights. Algorithms that can make use of 
instance weights include Naive Bayes, K-Means, Logistic Regression, 
Expectation Maximization/Gradient Descent/Conjugate Gradient, Nearest Neighbor, 
AdaBoost and many more.
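
To make this concrete, here is a small sketch of how an algorithm such as 
Naive Bayes might consume instance weights when estimating the class prior: 
each instance contributes its weight instead of a count of 1. The class 
WeightedClassPrior is hypothetical, and this is not how Weka or Mahout 
actually implement it (no smoothing is applied here, for instance).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Weka or Mahout class.
public class WeightedClassPrior {

    /** Estimates P(label) as (sum of weights for that label) / (total weight). */
    public static Map<String, Double> estimate(String[] labels, double[] weights) {
        if (labels.length != weights.length) {
            throw new IllegalArgumentException("labels and weights must have the same length");
        }
        Map<String, Double> totals = new LinkedHashMap<>();
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            // An unweighted estimate would add 1 here instead of weights[i].
            totals.merge(labels[i], weights[i], Double::sum);
            sum += weights[i];
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        // Normalize the per-label totals into probabilities.
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            e.setValue(e.getValue() / sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // One "False" instance with weight 1 and one "True" instance with weight 2.
        String[] labels = {"False", "True"};
        double[] weights = {1.0, 2.0};
        System.out.println(estimate(labels, weights));
    }
}
```

With equal weights the estimate reduces to the usual relative frequency; 
doubling the weight of the "True" instance shifts its prior from 1/2 to 2/3.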

These are the two main uses of instance weighting I can remember off the top of 
my head. I'm sure there are a few more that I have missed. How the weights are 
used differs from algorithm to algorithm, and not all algorithms make use of 
instance weights. In Weka, the algorithms that take advantage of instance 
weights all implement weka.core.WeightedInstancesHandler. Weka algorithms that 
do not implement WeightedInstancesHandler simply assume the weights don't 
exist. For your reference, you can see the list of algorithms at 
http://weka.sourceforge.net/doc.dev/weka/core/WeightedInstancesHandler.html

As for Mahout, I am really not in a position to say, as I only started 
evaluating Mahout this week. The easy way out is simply to make sure 
MapBackedArffModel can successfully parse Arff files that contain weights and 
throw those weights away. However, it would be better if the weights could be 
passed on to Mahout's algorithms, giving each algorithm the chance to use them 
if it so desires.
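
To show what either option needs at the parsing level, here is a minimal 
sketch that splits a trailing {weight} off a dense Arff data row before the 
attribute values are read. WeightedArffLine is a hypothetical class, not 
Mahout code and not a proposed patch. It assumes dense rows only, and it does 
not handle sparse rows or quoted values that contain commas or braces.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Mahout.
public class WeightedArffLine {
    public final String[] values; // attribute values, without the weight
    public final double weight;   // 1.0 when the row carries no explicit weight

    private WeightedArffLine(String[] values, double weight) {
        this.values = values;
        this.weight = weight;
    }

    /** Parses a dense data row such as "1,True,{2}" or "0,False". */
    public static WeightedArffLine parse(String line) {
        String body = line.trim();
        double weight = 1.0; // default weight, which Weka does not write out
        int open = body.lastIndexOf('{');
        if (body.endsWith("}") && open > 0) {
            String prefix = body.substring(0, open).trim();
            // Only treat the braces as a weight if they follow a comma.
            if (prefix.endsWith(",")) {
                weight = Double.parseDouble(body.substring(open + 1, body.length() - 1).trim());
                body = prefix.substring(0, prefix.length() - 1);
            }
        }
        String[] values = body.split(",");
        for (int i = 0; i < values.length; i++) {
            values[i] = values[i].trim();
        }
        return new WeightedArffLine(values, weight);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"0,False", "1,True,{2}"}) {
            WeightedArffLine row = parse(line);
            System.out.println(Arrays.toString(row.values) + " weight=" + row.weight);
        }
    }
}
```

A caller that wants the "easy way out" can simply ignore the weight field, 
while a caller that wants to pass weights on has them available separately 
from the attribute values.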

I hope this helps.

                
> MapBackedArffModel Unable To Parse ARFF Files Containing Instance Weights
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-985
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-985
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.5
>            Reporter: Dave Kor
>            Priority: Minor
>              Labels: Arff
>
> When parsing an Arff file that contains instance-specific weights, 
> MapBackedArffModel throws the following NullPointerException. While I have 
> only tested this in 0.5, I suspect this bug also occurs in 0.6.
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.mahout.utils.vectors.arff.MapBackedARFFModel.getValue(MapBackedARFFModel.java:87)
>         at org.apache.mahout.utils.vectors.arff.ARFFIterator.computeNext(ARFFIterator.java:75)
>         at org.apache.mahout.utils.vectors.arff.ARFFIterator.computeNext(ARFFIterator.java:30)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:43)
>         at org.apache.mahout.utils.vectors.arff.Driver.writeFile(Driver.java:159)
>         at org.apache.mahout.utils.vectors.arff.Driver.main(Driver.java:127)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> The code works properly when all instance weights are set to the default 
> value of 1. However, when any instance has a non-default weight, such as in 
> the sample Arff file below, the NPE occurs when MapBackedArffModel attempts 
> to parse line 8. 
> -----
> @relation 'Test Mahout'
> @attribute Attr0 numeric
> @attribute Label {True,False}
> @data
> 0,False
> 1,True,{2}
> -----
> The reason is that in Weka, all data instances are assumed to have a default 
> weight of 1, and this default weight is not saved in the Arff file. However, 
> when a data instance DOES NOT have the default weight of 1, the non-default 
> instance weight is appended at the end of the line, surrounded by curly 
> braces. When the MapBackedArffModel.getValue method tries to parse this 
> weight as an attribute, typeMap.get(idx) returns a null ARFFtype because 
> there is no such attribute, which results in an NPE. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
