[ https://issues.apache.org/jira/browse/CRUNCH-57?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rahul Sharma updated CRUNCH-57:
-------------------------------
Attachment: minver2.patch
@Gabriel, yes, you are right that the approach outside the MR context will be
faster, but in MR a few other things come into play. For example, when we use
a reducer after groupByKey, MR puts sorting in place; whether we use it or not
is secondary, but the output at the reducer will always be sorted.
I have created a version of the min function that follows the same principle
and tries to take advantage of what MR already does. I tested it against the
Avro data in the aggregate test. It is a bit faster than the current min
function: the best result clocked 9% faster, and the worst result was the same.
Another important aspect is that it doesn't rely on user classes being
comparable.
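
To illustrate the idea (this is not the contents of minver2.patch, and it uses
no Crunch or Hadoop classes), here is a minimal plain-Java sketch. It assumes
the elements are emitted as keys and that the framework orders them with its
own comparator, so the first key the reducer sees is the minimum. The names
SortedShuffleMin, Reading, and minViaSortedShuffle are hypothetical, and the
TreeMap only stands in for the sorted shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

class SortedShuffleMin {

    // Hypothetical user record; note that it does NOT implement Comparable.
    static final class Reading {
        final String sensor;
        final int value;

        Reading(String sensor, int value) {
            this.sensor = sensor;
            this.value = value;
        }
    }

    // Simulates "emit each element as a key, group by key, read the first key".
    // shuffleOrder is an assumption standing in for the framework's sort order.
    static <T> Optional<T> minViaSortedShuffle(Collection<T> input,
                                               Comparator<? super T> shuffleOrder) {
        // The TreeMap plays the role of the sorted shuffle: key -> grouped values.
        TreeMap<T, List<T>> shuffled = new TreeMap<>(shuffleOrder);
        for (T element : input) {
            shuffled.computeIfAbsent(element, k -> new ArrayList<>()).add(element);
        }
        // "Reducer": keys arrive in sorted order, so the first one is the minimum.
        if (shuffled.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(shuffled.firstKey());
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("a", 7), new Reading("b", 2), new Reading("c", 5));
        // The ordering is supplied from outside the record class.
        Comparator<Reading> byValue = Comparator.comparingInt(r -> r.value);
        Reading min = minViaSortedShuffle(readings, byValue).get();
        System.out.println(min.sensor + " " + min.value); // prints "b 2"
    }
}
```

The ordering comes from a comparator supplied alongside the data rather than
from the record class itself, which is the sense in which the user class does
not need to be comparable in this sketch.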
> Add a length function to PCollection
> ------------------------------------
>
> Key: CRUNCH-57
> URL: https://issues.apache.org/jira/browse/CRUNCH-57
> Project: Crunch
> Issue Type: New Feature
> Components: Core
> Affects Versions: 0.3.0
> Reporter: Kiyan Ahmadizadeh
> Assignee: Josh Wills
> Attachments: CRUNCH-57.patch, minver2.patch
>
>
> Sometimes it's useful and interesting to compute the number of elements in a
> PCollection.
>
> For example, suppose there was an initial PCollection that was then filtered
> into another. If I'm interested in how many elements of the original
> PCollection matched the filter, I'll have to write extra code to compute this.
> PCollections should have a length method that, when called, computes the
> number of elements in the PCollection and returns the result.