[
https://issues.apache.org/jira/browse/MAHOUT-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216709#comment-13216709
]
Dave Kor commented on MAHOUT-985:
---------------------------------
This is best answered from a machine learning perspective. Instance weights are
used for a wide variety of ML-related tasks, which is why Weka supports
them. Examples include:
(A) Resampling Datasets
Sometimes the machine learning algorithm needs to focus on
specific parts of the dataset, for example when the dataset is imbalanced and
the class label you are interested in is swamped by a huge number of
uninteresting instances (in other words, the proverbial needle-in-a-haystack
problem). Most techniques for handling such cases involve some form of careful
resampling: either boosting the weight of instances that have the desired class
label, down-weighting the unwanted instances, or both.
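As a rough illustration of the reweighting idea, here is a minimal sketch that gives each class the same total weight by weighting every instance inversely to its class frequency. The class and method names (ClassBalancer, balancedWeights) are made up for this example, and the formula w(c) = n / (k * count(c)) is just one common choice, not something Weka or Mahout prescribes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: weights instances inversely to their class frequency. */
final class ClassBalancer {

    /**
     * Returns one weight per class label so that every class contributes the
     * same total weight: w(c) = n / (k * count(c)), for n instances, k classes.
     */
    static Map<String, Double> balancedWeights(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int n = labels.size();
        int k = counts.size();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(), (double) n / (k * e.getValue()));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Eight uninteresting instances and two interesting ones.
        List<String> labels = Arrays.asList(
            "False", "False", "False", "False",
            "False", "False", "False", "False",
            "True", "True");
        Map<String, Double> w = balancedWeights(labels);
        System.out.println("True  -> " + w.get("True"));
        System.out.println("False -> " + w.get("False"));
    }
}
```

With these weights the two "True" instances carry the same total weight as the eight "False" instances, which is the effect a weight-aware learner would see.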
(B) Smoothing or Regularization (See
http://en.wikipedia.org/wiki/Regularization_(mathematics) )
Some Bayesian learning methods take the prior distribution of labels into
consideration when training a model, and one of the simpler ways of
introducing a prior is to apply it as instance weights. Algorithms that can make
use of instance weights include Naive Bayes, K-Means, Logistic Regression,
Expectation Maximization/Gradient Descent/Conjugate Gradient, Nearest Neighbor,
AdaBoost and many more.
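To make the connection between weights and priors concrete, here is a small sketch of how a weight-aware learner such as Naive Bayes might estimate class priors: each instance contributes its weight rather than a count of 1. The names (WeightedPrior, classPriors) are hypothetical and this is not taken from Weka's or Mahout's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: class priors estimated from weighted instances. */
final class WeightedPrior {

    /** P(c) = (sum of weights of instances labelled c) / (sum of all weights). */
    static Map<String, Double> classPriors(String[] labels, double[] weights) {
        if (labels.length != weights.length) {
            throw new IllegalArgumentException("labels and weights differ in length");
        }
        Map<String, Double> mass = new LinkedHashMap<>();
        double total = 0.0;
        for (int i = 0; i < labels.length; i++) {
            mass.merge(labels[i], weights[i], Double::sum);
            total += weights[i];
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        // Normalise the accumulated weight per class into a probability.
        for (Map.Entry<String, Double> e : mass.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return mass;
    }

    public static void main(String[] args) {
        String[] labels = {"False", "False", "False", "True"};
        double[] unit = {1, 1, 1, 1};
        double[] boosted = {1, 1, 1, 3}; // the single True instance counts three times
        System.out.println("unweighted: " + classPriors(labels, unit));
        System.out.println("weighted:   " + classPriors(labels, boosted));
    }
}
```

With unit weights the prior for "True" is 0.25; giving that one instance a weight of 3 moves it to 0.5 without touching the data itself.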
These are the two main uses of instance weighting I can remember off the top of
my head. I'm sure there are a few more that I have missed. How the weights are
used differs from algorithm to algorithm, and not all algorithms make use of
instance weights. In Weka, the algorithms that take advantage of instance
weights all implement weka.core.WeightedInstancesHandler. Weka algorithms that
do not implement WeightedInstancesHandler simply behave as if the weights don't
exist. For reference, you can see the list of algorithms at
http://weka.sourceforge.net/doc.dev/weka/core/WeightedInstancesHandler.html
As for Mahout, I am really not in a position to say, as I only started
evaluating Mahout this week. The easy way out is simply to make sure
MapBackedARFFModel can successfully parse Arff files that contain
weights and throw these weights away. However, it would be good if the weights
could be passed on to Mahout's algorithms, giving them a chance to use the
weights where appropriate.
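To sketch what the "easy way out" could look like, the following standalone example splits a trailing {weight} off an Arff data row before the attribute values are read. It is not Mahout code: the class WeightedArffLine and its parse method are hypothetical, and it assumes unquoted values with no embedded commas or braces. The caller can either ignore the weight or pass it on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one Arff data row and its optional instance weight. */
final class WeightedArffLine {

    // Matches "<values>,{<weight>}" where the weight group closes the line.
    private static final Pattern TRAILING_WEIGHT =
        Pattern.compile("^(.*),\\s*\\{\\s*([^{}\\s,]+)\\s*\\}\\s*$");

    final String[] values;
    final double weight;

    private WeightedArffLine(String[] values, double weight) {
        this.values = values;
        this.weight = weight;
    }

    /** Parses a data row; rows without a trailing {weight} get weight 1. */
    static WeightedArffLine parse(String line) {
        String body = line.trim();
        double weight = 1.0; // default weight, not written to the file
        Matcher m = TRAILING_WEIGHT.matcher(body);
        if (m.matches()) {
            body = m.group(1);
            weight = Double.parseDouble(m.group(2));
        }
        // Naive split: assumes values are unquoted and contain no commas.
        String[] values = body.split(",");
        for (int i = 0; i < values.length; i++) {
            values[i] = values[i].trim();
        }
        return new WeightedArffLine(values, weight);
    }

    public static void main(String[] args) {
        WeightedArffLine plain = parse("0,False");
        WeightedArffLine weighted = parse("1,True,{2}");
        System.out.println(plain.values.length + " values, weight " + plain.weight);
        System.out.println(weighted.values.length + " values, weight " + weighted.weight);
    }
}
```

Because the weight is removed before the values are split, the number of values always matches the number of declared attributes, so a lookup by attribute index never runs past the end of the attribute list.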
I hope this helps.
> MapBackedArffModel Unable To Parse ARFF Files Containing Instance Weights
> -------------------------------------------------------------------------
>
> Key: MAHOUT-985
> URL: https://issues.apache.org/jira/browse/MAHOUT-985
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.5
> Reporter: Dave Kor
> Priority: Minor
> Labels: Arff
>
> When parsing an Arff file that contains instance-specific weights,
> MapBackedARFFModel throws the following NullPointerException (NPE). While I
> have only tested this in 0.5, I suspect this bug also occurs in 0.6.
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.mahout.utils.vectors.arff.MapBackedARFFModel.getValue(MapBackedARFFModel.java:87)
> at org.apache.mahout.utils.vectors.arff.ARFFIterator.computeNext(ARFFIterator.java:75)
> at org.apache.mahout.utils.vectors.arff.ARFFIterator.computeNext(ARFFIterator.java:30)
> at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
> at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
> at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:43)
> at org.apache.mahout.utils.vectors.arff.Driver.writeFile(Driver.java:159)
> at org.apache.mahout.utils.vectors.arff.Driver.main(Driver.java:127)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> The code works properly when all instance weights are set to the default
> value of 1. However, when any instance has a non-default weight, such as in
> the sample Arff file below, the NPE occurs when MapBackedARFFModel attempts
> to parse line 8 (the row 1,True,{2}).
> -----
> @relation 'Test Mahout'
> @attribute Attr0 numeric
> @attribute Label {True,False}
> @data
> 0,False
> 1,True,{2}
> -----
> The reason is that in Weka, all data instances are assumed to have a default
> weight of 1, and this default weight is not saved in the Arff file. However,
> when a data instance does not have the default weight of 1, the non-default
> instance weight is appended to the end of the line, surrounded by curly
> braces. When the MapBackedARFFModel.getValue method tries to parse this
> weight as an attribute, typeMap.get(idx) returns a null ARFFType, as there is
> no such attribute, which results in an NPE.