Russell Jurney created DATAFU-149:
-------------------------------------
Summary: Add MultiLabelStratifiedSample to DataFu.spark
Key: DATAFU-149
URL: https://issues.apache.org/jira/browse/DATAFU-149
Project: DataFu
Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Russell Jurney
Assignee: Russell Jurney
Fix For: 1.6.0
I'm working on an implementation of "On the Stratification of Multi-Label
Data" to create a stratified (balanced, in my case) sample of highly skewed
labels for a multi-label, multi-class classification problem. This isn't
straightforward because each record can carry several labels, so adding one
record changes the balance of multiple labels at once. A greedy algorithm that
adds records carrying the least common labels first works, and since I'm
writing it anyway, it would probably make a good feature.
http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf
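
To illustrate the greedy idea described above, here is a minimal single-machine
sketch in Python. It is not the paper's iterative stratification algorithm and
not the proposed DataFu implementation; the function name
greedy_balanced_sample, the target_per_label parameter, and the
(record_id, labels) input shape are all assumptions made for illustration.

```python
from collections import Counter


def greedy_balanced_sample(records, target_per_label):
    """Hypothetical helper: pick records so each label nears target_per_label.

    records is a list of (record_id, set_of_labels) pairs.
    Returns the selected record ids and the per-label counts in the sample.
    """
    # Frequency of each label across the whole dataset.
    label_freq = Counter(label for _, labels in records for label in labels)

    selected = []
    selected_ids = set()
    counts = Counter()  # label counts inside the sample so far

    # Visit labels from rarest to most common (ties broken by label name).
    for label in sorted(label_freq, key=lambda l: (label_freq[l], l)):
        for record_id, labels in records:
            if counts[label] >= target_per_label:
                break  # this label already has enough examples
            if label not in labels or record_id in selected_ids:
                continue
            selected.append(record_id)
            selected_ids.add(record_id)
            # One record moves several label counts at once.
            counts.update(labels)
    return selected, counts


records = [
    (1, {"a", "b"}),
    (2, {"b"}),
    (3, {"b", "c"}),
    (4, {"b"}),
    (5, {"c"}),
    (6, {"b"}),
]
selected, counts = greedy_balanced_sample(records, target_per_label=2)
```

Handling rare labels first means the common label "b" is already covered by
records picked for "a" and "c", so no "b"-only records are needed. A label can
still overshoot the target when it co-occurs with many rare labels, and a rare
label can fall short when the dataset has too few examples of it.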