Dmitriy Lyubimov created MAHOUT-1722:
----------------------------------------

             Summary: DRM row sampling api
                 Key: MAHOUT-1722
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1722
             Project: Mahout
          Issue Type: Improvement
            Reporter: Dmitriy Lyubimov
            Assignee: Dmitriy Lyubimov
             Fix For: 0.10.2


We will ask engines to support two small APIs for sampling DRM row vectors.

One API draws a fixed number of rows (uniform multivariate hypergeometric; 
the k parameter is given), and the other samples by fraction (a simple 
map-only probabilistic filter). A Spark implementation is enclosed. Spark 
already has an API for both, although its k-sampler does not give a strict 
mathematical guarantee of the distribution and is only meant for small k.

The challenge here is that the returned rows should be ordinally renumbered.
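
To make the two sampling modes and the renumbering requirement concrete, here
is a minimal in-memory sketch in plain Python. It is not the Mahout or Spark
API: the function names, the (key, vector) row representation, and the seed
parameter are all assumptions made for illustration, and "ordinally
renumbered" is read here as re-keying the sampled rows 0..n-1.

```python
import random

# Hypothetical stand-in for a DRM: a list of (row_key, row_vector) pairs.

def sample_rows_by_count(rows, k, seed=None):
    """Draw min(k, len(rows)) rows uniformly without replacement."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(rows)), min(k, len(rows)))
    # Keep the source order of the picked rows, then assign new keys 0..n-1.
    return [(new_key, rows[i][1]) for new_key, i in enumerate(sorted(picked))]

def sample_rows_by_fraction(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    kept = [vec for _, vec in rows if rng.random() < fraction]
    # The number of kept rows is random, so keys are assigned after filtering.
    return list(enumerate(kept))
```

In a distributed setting the fraction sampler stays map-only until the
renumbering step, which needs the kept-row counts of the preceding partitions;
that is where the difficulty lies.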

(I may need to revisit this issue later; this was a pretty hasty API change 
and might be less than ideal in the general case.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
