[
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022084#comment-13022084
]
Sean Owen commented on MAHOUT-676:
----------------------------------
What's the particular use case this supports? I sort of prefer to avoid adding
nice code that might get used but isn't. Can this replace a lot of current
sampling code with one nice unified approach? Then that would justify itself
off the bat.

Sure, I'd say you're describing two styles of sampling: you want exactly n
samples out of N (in which case you need 'reservoir' sampling), or you want
about n samples out of N (sample each item independently with probability n/N,
i.e. Bernoulli sampling). And I think there are also two basic contexts: you
have all N items at once, or you don't, since N is large and you have a stream.

(Side-note on Mappers: yes, you have a stream of values, one at a time, but it
gets tricky to support this case. The biggest problem is that you aren't
guaranteed to see all values for one key in one Mapper. The second-biggest
problem is that you need to detect when you're done with one key so you can
finish the computation for that key. This also means you need logic in the
Mapper's close() method to deal with the final key.

These are solvable problems, but not easily. If the patch isn't fully
addressing the use case above, then you're having the user deal with "flushing"
and such, in which case they're already collecting a List of values in memory
anyway, and this patch just needs to deal with sampling from a List.)
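
To make the key-boundary and close() issue concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class PerKeyCollector and its Handler callback are hypothetical names used only for illustration; they are not part of the patch. The sketch assumes that all values for a key arrive contiguously, which, as noted above, a Mapper does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration only; not part of the patch or of Hadoop.
// Assumes values for the same key arrive one after another.
class PerKeyCollector<K, V> {

  // Callback invoked once per completed key with all of its buffered values.
  interface Handler<K, V> {
    void handle(K key, List<V> values);
  }

  private final Handler<K, V> handler;
  private K currentKey;
  private boolean hasKey;
  private List<V> buffer = new ArrayList<>();

  PerKeyCollector(Handler<K, V> handler) {
    this.handler = handler;
  }

  // Call once per incoming (key, value) pair, e.g. from a map() method.
  void add(K key, V value) {
    // A change of key is the only signal that the previous key is finished.
    if (hasKey && !Objects.equals(currentKey, key)) {
      flush();
    }
    currentKey = key;
    hasKey = true;
    buffer.add(value);
  }

  // Call from the enclosing close() method: no later key will ever arrive
  // to trigger the flush of the final key, so it must be done here.
  void close() {
    if (hasKey) {
      flush();
    }
  }

  private void flush() {
    handler.handle(currentKey, buffer);
    buffer = new ArrayList<>();
    hasKey = false;
  }
}
```

Note that the values for each key end up in an in-memory List before anything can be done with them, which is the point made above: at that stage a sampler only needs to handle a List.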

Back to the patch -- I am still not clear on why you need flush().

If I cooked this up from scratch, I would imagine a class called "Samplers"
with four methods that support the two sampling styles and two contexts above.
So you've got two methods that wrap an Iterator (big N case, sampling from a
stream), handling reservoir and Bernoulli sampling. And two methods that
likewise take a List and return a List.

It would probably nicely wrap up and augment today's iterator-based samplers,
answer your use cases, and probably mean a lot of sampling-like code around the
code base can be simplified.
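
As a rough illustration of the shape described above, a "Samplers" class with four methods might look something like the following. This is only a sketch: the method names, signatures, and the choice to pass in a java.util.Random are assumptions made for the example, not existing Mahout API and not what the patch does. The reservoir method uses the classic one-pass reservoir scheme (often called Algorithm R).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

// Hypothetical sketch; names and signatures are illustrative, not Mahout API.
final class Samplers {

  private Samplers() {}

  // Stream case, exact size: returns min(n, N) items from a stream of unknown length N.
  static <T> List<T> reservoir(Iterator<T> items, int n, Random rng) {
    List<T> reservoir = new ArrayList<>(n);
    long seen = 0;
    while (items.hasNext()) {
      T item = items.next();
      seen++;
      if (reservoir.size() < n) {
        reservoir.add(item);                        // fill the reservoir first
      } else {
        long j = (long) (rng.nextDouble() * seen);  // roughly uniform in [0, seen)
        if (j < n) {
          reservoir.set((int) j, item);             // keep item with probability n/seen
        }
      }
    }
    return reservoir;
  }

  // Stream case, approximate size: lazily passes each item through with probability p.
  static <T> Iterator<T> bernoulli(Iterator<T> items, double p, Random rng) {
    return new Iterator<T>() {
      private T pending;
      private boolean ready;

      @Override
      public boolean hasNext() {
        while (!ready && items.hasNext()) {
          T candidate = items.next();
          if (rng.nextDouble() < p) {
            pending = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        T result = pending;
        pending = null;
        return result;
      }
    };
  }

  // List case, exact size: delegates to the stream version.
  static <T> List<T> reservoir(List<T> items, int n, Random rng) {
    return reservoir(items.iterator(), n, rng);
  }

  // List case, approximate size: N is known, so the probability is n/N.
  static <T> List<T> bernoulli(List<T> items, int n, Random rng) {
    List<T> out = new ArrayList<>();
    if (items.isEmpty()) {
      return out;
    }
    double p = (double) n / items.size();
    Iterator<T> it = bernoulli(items.iterator(), p, rng);
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }
}
```

A caller with a List in hand would use something like Samplers.reservoir(values, 10, new Random()) to get exactly 10 items (or all of them if there are fewer), and Samplers.bernoulli(values, 10, new Random()) to get about 10.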

What do you say to taking it that way?

> Random samplers in a modular library
> ------------------------------------
>
> Key: MAHOUT-676
> URL: https://issues.apache.org/jira/browse/MAHOUT-676
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: Sampler.patch
>
>
> This is a modular suite of samplers.