Dmitriy Lyubimov created MAHOUT-1722:
----------------------------------------

Summary: DRM row sampling api
Key: MAHOUT-1722
URL: https://issues.apache.org/jira/browse/MAHOUT-1722
Project: Mahout
Issue Type: Improvement
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
Fix For: 0.10.2

We will ask engines to support two tiny APIs for row vector sampling. One API is uniform multivariate hypergeometric sampling (the k parameter is given), and the other samples by fraction (a simple map-only probabilistic filter).

A Spark implementation is enclosed. Spark already has an API for both, although its k-sampler does not have a strict mathematical guarantee on the distribution and is meant only for small k.

The challenge here is that the returned rows should be ordinally renumbered.

(I may need to revisit this issue later. This was a pretty hasty API change and might be less than ideal in the general case.)
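
To make the two sampling modes and the renumbering requirement concrete, here is a minimal, engine-independent sketch in plain Python. It is not the Mahout or Spark API: the function names `sample_k_rows` and `sample_rows_by_fraction`, the `(key, vector)` row representation, and the choice to keep sampled rows in their original relative order are all assumptions made for illustration.

```python
import random


def sample_k_rows(rows, k, seed=None):
    """Draw exactly min(k, n) rows uniformly without replacement.

    `rows` is a list of (key, vector) pairs (hypothetical representation).
    The result is renumbered with ordinal keys 0..m-1.
    """
    rng = random.Random(seed)
    n = len(rows)
    # Pick distinct positions, then sort them so that the sampled rows keep
    # their original relative order (an assumption, not stated in the issue).
    picked = sorted(rng.sample(range(n), min(k, n)))
    return [(new_key, rows[pos][1]) for new_key, pos in enumerate(picked)]


def sample_rows_by_fraction(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    This mirrors a map-only probabilistic filter: no row needs to know
    about any other row. The surviving rows are then renumbered 0..m-1.
    """
    rng = random.Random(seed)
    kept = [vec for _, vec in rows if rng.random() < fraction]
    return list(enumerate(kept))
```

Note that in this sketch the filter itself is per-row, while the renumbering step needs a global view of which rows survived. That is presumably why renumbering is the hard part in a distributed engine.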