Russell Jurney created DATAFU-149:
-------------------------------------

Summary: Add MultiLabelStratifiedSample to DataFu.spark
Key: DATAFU-149
URL: https://issues.apache.org/jira/browse/DATAFU-149
Project: DataFu
Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Russell Jurney
Assignee: Russell Jurney
Fix For: 1.6.0

I'm working on an implementation of "On the Stratification of Multi-Label Data" to create a stratified (in my case, balanced) sample from highly skewed labels for a multi-label, multi-class classification problem. This isn't straightforward, because adding one record adds several labels to the balance at once. A greedy algorithm that adds records carrying the least common labels first works, and since I'm writing it anyway, it would probably make a good DataFu feature.

http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf
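
To make the greedy idea concrete, here is a minimal sketch in plain Python (chosen for brevity; the eventual feature would target Spark). It is a simplified rarest-label-first selection, not the paper's iterative stratification algorithm. The function name `greedy_balanced_sample`, its parameters, and the toy records are all hypothetical and assume string labels.

```python
from collections import Counter, defaultdict


def greedy_balanced_sample(records, per_label_target):
    """Pick records so each label gets roughly `per_label_target` examples.

    records: list of (record_id, iterable_of_labels) pairs (hypothetical shape).
    Returns (selected_ids, label_counts_in_sample).
    """
    label_sets = {rid: frozenset(labels) for rid, labels in records}

    # Index record ids by label, preserving input order.
    by_label = defaultdict(list)
    for rid, labels in label_sets.items():
        for label in labels:
            by_label[label].append(rid)

    # How often each label occurs in the full data set.
    totals = Counter(l for labels in label_sets.values() for l in labels)

    counts = Counter()   # label counts within the sample so far
    selected = []
    chosen = set()

    # Visit labels from rarest to most common (ties broken by label name).
    for label in sorted(by_label, key=lambda l: (totals[l], l)):
        for rid in by_label[label]:
            if counts[label] >= per_label_target:
                break
            if rid in chosen:
                continue
            chosen.add(rid)
            selected.append(rid)
            # One record contributes to the balance of every label it carries.
            counts.update(label_sets[rid])

    return selected, dict(counts)


records = [
    (1, {"a", "b"}), (2, {"a"}), (3, {"a"}),
    (4, {"b", "c"}), (5, {"a", "c"}), (6, {"a"}),
]
sample_ids, sample_counts = greedy_balanced_sample(records, per_label_target=2)
print(sample_ids, sample_counts)
```

Because rare labels are filled first, a common label may already have met its target through co-occurrence by the time it is visited. It can also overshoot the target, which this sketch does not correct.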