[ 
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022084#comment-13022084
 ] 

Sean Owen commented on MAHOUT-676:
----------------------------------

What's the particular use case this supports? I prefer to avoid adding nice 
code that might get used but, in practice, isn't. Can this replace a lot of the 
current sampling code with one unified approach? That alone would justify it 
right off the bat.

Sure, I'd say you're describing two cases of sampling: you want exactly n 
samples of N (in which case you need 'reservoir'), or you want about n samples 
out of N (sample each with probability n/N). And I think there are also two 
basic contexts: you have all N items at once, or you don't since N is large and 
you have a stream.

(Side-note on Mappers: yes, you have a stream of values, one at a time, but it 
gets tricky to support this case. The biggest problem is that you don't know 
that you'll see all values for one key in one Mapper. The second-biggest 
problem is that you need to detect when you're done with one key so you can 
finish the computation for that key. This also means you need logic in the 
Mapper's close() method to deal with the final key. 

These are solvable problems, but not easily. If the patch doesn't fully 
address the use case above, then the user has to deal with "flushing" and 
such, in which case they're already collecting a List of values in memory 
anyway, and this patch just needs to deal with sampling from a List.)
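
To make the "detect when you're done with one key" point concrete, here is a 
minimal sketch in plain Java with no Hadoop dependency. The class name 
KeyRunBuffer and its methods are hypothetical and are not part of the patch. 
It assumes values for a key arrive contiguously, which, as noted, a Mapper 
does not guarantee:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.BiConsumer;

/** Hypothetical helper: buffers values for the current key, flushes on key change. */
final class KeyRunBuffer<K, V> {

  private final BiConsumer<K, List<V>> onKeyDone;
  private K currentKey;
  private List<V> values = new ArrayList<>();
  private boolean started;

  KeyRunBuffer(BiConsumer<K, List<V>> onKeyDone) {
    this.onKeyDone = onKeyDone;
  }

  /** Called once per record, the way map() would be. */
  void accept(K key, V value) {
    // A different key means the previous key's run is over.
    if (started && !Objects.equals(key, currentKey)) {
      flush();
    }
    currentKey = key;
    started = true;
    values.add(value);
  }

  /** Called once at end of input, the way close() would be; handles the final key. */
  void close() {
    if (started) {
      flush();
      started = false;
    }
  }

  private void flush() {
    onKeyDone.accept(currentKey, values);
    values = new ArrayList<>();
  }
}
```

Note that the caller ends up holding a List of values per key, so the sampling 
itself only ever needs to work on a List.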


Back to the patch -- I am still not clear on why you need flush().

If I cooked this up from scratch, I would imagine a class called "Samplers" 
with four methods that support the two sampling styles and two contexts above. 
So you've got two methods that wrap an Iterator (big N case, sampling from a 
stream), handling reservoir and Bernoulli sampling. And two methods that 
likewise take a List and return a List.
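
A rough sketch of what I mean, in Java. The method names, signatures, and the 
explicit Random parameter are all assumptions for illustration; none of this 
is taken from the patch or from existing Mahout classes. The reservoir method 
uses the standard "Algorithm R" and returns a List, since it has to consume 
the whole stream before it can answer anyway:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

/** Hypothetical sketch of a four-method Samplers utility. */
final class Samplers {

  private Samplers() {}

  /** Stream case, exactly min(n, N) items: reservoir sampling (Algorithm R). */
  static <T> List<T> reservoir(Iterator<T> source, int n, Random random) {
    List<T> reservoir = new ArrayList<>(n);
    long seen = 0;
    while (source.hasNext()) {
      T item = source.next();
      seen++;
      if (reservoir.size() < n) {
        reservoir.add(item);  // fill phase: keep the first n items
      } else {
        // Keep the new item with probability n / seen, evicting a random slot.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < n) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  /** Stream case, about p * N items: keep each item independently with probability p. */
  static <T> Iterator<T> bernoulli(Iterator<T> source, double p, Random random) {
    return new Iterator<T>() {
      private T next;
      private boolean hasNext;

      { advance(); }  // pre-fetch the first accepted item

      private void advance() {
        hasNext = false;
        while (source.hasNext()) {
          T candidate = source.next();
          if (random.nextDouble() < p) {
            next = candidate;
            hasNext = true;
            return;
          }
        }
      }

      @Override
      public boolean hasNext() {
        return hasNext;
      }

      @Override
      public T next() {
        if (!hasNext) {
          throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
      }
    };
  }

  /** List case, exactly min(n, N) items. */
  static <T> List<T> reservoir(List<T> items, int n, Random random) {
    return reservoir(items.iterator(), n, random);
  }

  /** List case, about p * N items. */
  static <T> List<T> bernoulli(List<T> items, double p, Random random) {
    List<T> out = new ArrayList<>();
    Iterator<T> it = bernoulli(items.iterator(), p, random);
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }
}
```

For "about n out of N" from a List, the caller would pass 
p = (double) n / items.size(). In the stream case N is unknown, so the caller 
has to pick p directly.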

It would probably nicely wrap up and augment the iterator-based samplers we 
have today, answer your use cases, and probably mean a lot of sampling-like 
code around the code base can be simplified. 

What do you say to taking it that way?

> Random samplers in a modular library
> ------------------------------------
>
>                 Key: MAHOUT-676
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-676
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: Sampler.patch
>
>
> This is a modular suite of samplers.

