[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705739#action_12705739 ]
Olga Natkovich edited comment on PIG-795 at 5/4/09 1:58 PM:
------------------------------------------------------------

Patch committed. Thanks, Eric, for contributing.

      was (Author: olgan):
    Patch committed. Thanks, Eric, for sontributing.

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff, sample3.diff
>
>
> When working with very large data sets (imagine that!), running a Pig script
> can take time. It may be useful to run on a small subset of the data in some
> situations (e.g. debugging / testing, or to get fast results even if less
> accurate).
> The command "LIMIT N" selects the first N rows of the data, but these are not
> necessarily randomized. A command "SAMPLE X" would retain each row
> independently with probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but
> the same is true of LIMIT, and LIMIT is built in.
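
The difference the description draws between "LIMIT N" and the proposed "SAMPLE X" can be illustrated outside Pig. The following is a minimal Python sketch, not the code in the attached patches: the function names `limit_rows` and `sample_rows`, the `percent` parameter, and the optional `seed` are all assumptions made for illustration. It keeps each row independently with probability X%, which is the behavior the issue asks for.

```python
import random
from itertools import islice


def limit_rows(rows, n):
    """Keep the first n rows, as LIMIT N does: cheap, but not a random subset."""
    return list(islice(rows, n))


def sample_rows(rows, percent, seed=None):
    """Keep each row independently with probability percent/100.

    Hypothetical stand-in for the proposed SAMPLE X; the output size
    is random and only approximately percent% of the input.
    """
    rng = random.Random(seed)      # seed is an illustrative extra, for repeatable runs
    threshold = percent / 100.0
    # rng.random() is uniform on [0.0, 1.0), so 0 keeps nothing and 100 keeps everything.
    return [row for row in rows if rng.random() < threshold]


if __name__ == "__main__":
    data = range(100_000)
    head = limit_rows(data, 5)         # always the first five rows
    subset = sample_rows(data, 1)      # roughly 1% of the rows, spread across the input
    print(head, len(subset))
```

Unlike `limit_rows`, the sampled subset preserves input order but draws from the whole data set, so it is a better fit for quick approximate results; the trade-off is that every row still has to be read.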