[jira] Commented: (HADOOP-908) Hadoop Abacus, a package for performing simple counting/aggregation

Doug Judd (JIRA) Thu, 18 Jan 2007 15:18:51 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465914
 ]


Doug Judd commented on HADOOP-908:
----------------------------------

One issue (or at least I assume it is an issue) that I'd like to see taken 
care of in this toolkit is the following.  You do a big crawl of a bunch of 
pages and want to perform a link count computation and then do a (reverse) 
sort by count.  The problem is that the link counts follow a Zipfian 
distribution with a long tail of links of count 1 or 2.  Conceptually, you 
can imagine situations where you literally have 1 billion links of count 1.  
If the count alone is the sort key, all of those links are grouped under one 
key, making it infeasible to pass them into a single reduce call.

To get around this situation, I've created a TaggedLongWritable class.  It 
contains a Long and a string tag (the tag in the above case would be the 
link/URL).  The comparison function first compares the Long and then if they 
match, compares the tag.  This way, you get a numeric comparison, but two keys 
don't match if their tags are different.
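
A minimal sketch of what such a key might look like follows.  It is not the 
actual class: the name TaggedLong, the field names, and the wire format are 
assumptions for illustration, and it uses only java.io so it compiles 
without Hadoop on the classpath.  In a real job the class would presumably 
implement org.apache.hadoop.io.WritableComparable, whose write/readFields 
methods have the same shape as the ones below.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

// Hypothetical stand-in for the TaggedLongWritable described above.
final class TaggedLong implements Comparable<TaggedLong> {
    private long value;   // e.g. the link count
    private String tag;   // e.g. the link/URL

    TaggedLong() {
        this(0L, "");
    }

    TaggedLong(long value, String tag) {
        this.value = value;
        this.tag = Objects.requireNonNull(tag);
    }

    long getValue() { return value; }

    String getTag() { return tag; }

    // Serialize as the long followed by the tag.
    // Note: writeUTF limits the encoded tag to 65535 bytes.
    void write(DataOutput out) throws IOException {
        out.writeLong(value);
        out.writeUTF(tag);
    }

    void readFields(DataInput in) throws IOException {
        value = in.readLong();
        tag = in.readUTF();
    }

    // Numeric comparison first; the tag only breaks ties, so two keys
    // with the same count but different tags never compare as equal.
    @Override
    public int compareTo(TaggedLong other) {
        int byValue = Long.compare(value, other.value);
        return byValue != 0 ? byValue : tag.compareTo(other.tag);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TaggedLong)) {
            return false;
        }
        TaggedLong other = (TaggedLong) o;
        return value == other.value && tag.equals(other.tag);
    }

    @Override
    public int hashCode() {
        return Objects.hash(value, tag);
    }
}
```

Because every (count, tag) pair is a distinct key, each reduce call sees 
only the values for one link rather than every link that shares a count.  
For the reverse sort, the same comparison could be inverted in a custom 
comparator.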


> Hadoop Abacus, a package for performing simple counting/aggregation
> -------------------------------------------------------------------
>
>                 Key: HADOOP-908
>                 URL: https://issues.apache.org/jira/browse/HADOOP-908
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/streaming
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: abacus.patch
>
>
> The Hadoop Abacus package is a specialization of the map/reduce framework 
> for performing various counting and aggregation tasks. 
> It offers functionality similar to Google's Sawzall. 
> Generally speaking, in order to implement an application using the 
> Map/Reduce model, 
> the developer needs to implement Map and Reduce functions (and possibly a 
> Combine function). 
> However, for a lot of applications related to counting and statistics 
> computing, 
> these functions have very similar characteristics. 
> Abacus abstracts out the general patterns and provides a package implementing 
> those patterns. 
> In particular, the package provides a generic mapper class, a reducer class 
> and a combiner class, 
> and a set of built-in value aggregators. It also provides a generic utility 
> class, ValueAggregatorJob
> for creating Abacus jobs.
> To create an Abacus job, the user just needs to implement one plugin class 
> that 
> is responsible for specifying which aggregators to use and which values go 
> to which aggregators. 
> The mapper will call this class at runtime to generate aggregation ids 
> and values.
> The generic combiner and reducer will aggregate the values associated with 
> the same 
> aggregation ids accordingly. Thus, it is much easier to create and run an 
> Abacus job than 
> a normal map/reduce job. Since a built-in generic combiner is always used, 
> the execution is very efficient.

