[
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021950#comment-13021950
]
Lance Norskog commented on MAHOUT-676:
--------------------------------------
There's a big wide world of sampling algorithms out there.
Time-based sampling:
[Sampling Time-Based Sliding Windows in Bounded Space|http://www.gemulla.de/rg/publications/gemulla08streamsampling.pdf]
Plain Bernoulli sampling is not good at maintaining the ratios between repeated items:
[Maintaining Bernoulli Samples over Evolving Multisets|http://www.gemulla.de/rg/publications/gemulla07multisetsampling.pdf]
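For anyone who wants to see the ratio problem concretely, here is a rough sketch of a plain Bernoulli sampler. It is not the code in Sampler.patch and not the algorithm from the paper; the class name BernoulliSamplerSketch and the method sample are made up for illustration. Each item is kept independently with probability p, so the counts of repeated items in the sample only match the input ratios in expectation, and on small samples they can drift quite a bit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Sampler.patch or Mahout.
final class BernoulliSamplerSketch {
    private BernoulliSamplerSketch() {}

    /** Keeps each item independently with probability p (plain Bernoulli sampling). */
    static <T> List<T> sample(Iterable<T> items, double p, Random rng) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1]");
        }
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            // nextDouble() is uniform on [0, 1), so p = 0 keeps nothing and p = 1 keeps everything.
            if (rng.nextDouble() < p) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A multiset with a 9:1 ratio of "a" to "b".
        List<String> stream = new ArrayList<>();
        for (int i = 0; i < 900; i++) stream.add("a");
        for (int i = 0; i < 100; i++) stream.add("b");

        List<String> kept = sample(stream, 0.1, new Random());
        long a = kept.stream().filter("a"::equals).count();
        // The a:b ratio in the sample varies from run to run around 9:1.
        System.out.println("kept a=" + a + " b=" + (kept.size() - a));
    }
}
```

Running main a few times shows the sampled a:b ratio wandering around 9:1 rather than holding it, which is the behaviour the paper above is concerned with.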
And, if you really can't go to sleep:
[Rainer Gemulla's 281-page PhD thesis on sampling|http://www.gemulla.de/rg/publications/gemulla08thesis.pdf]
> Random samplers in a modular library
> ------------------------------------
>
> Key: MAHOUT-676
> URL: https://issues.apache.org/jira/browse/MAHOUT-676
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: Sampler.patch
>
>
> This is a modular suite of samplers.