[ https://issues.apache.org/jira/browse/HADOOP-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536054 ]

Owen O'Malley commented on HADOOP-2046:
---------------------------------------

I agree that this is good overall. More items:
* In Configuration, the proper way to get unsubstituted values is getRaw, 
not getObject, which is deprecated. 
* I'd add a better discussion of the set/getOutputValueGroupingComparator. 
Something like my message to hadoop-user on the topic:
{quote}
There is no guarantee that the reduce sort is stable in any sense. (With 
the non-deterministic order in which map outputs become available to the 
reduce, that wouldn't make much sense.)

There certainly isn't enough documentation about what is allowed for sorting. 
I've filed HADOOP-1981 to expand the Reducer javadoc to mention the 
JobConf methods that can control the sort order. In particular, the methods are:

setOutputKeyComparatorClass
setOutputValueGroupingComparator

The first comparator controls the sort order of the keys. The second controls 
which keys are grouped together into a single call to the reduce method. The 
combination of these two allows you to set up jobs that act like you've defined 
an order on the values.

For example, say that you want to find duplicate web pages and tag them all 
with the url of the "best" known example. You would set up the job like:

Map Input Key: url
Map Input Value: document
Map Output Key: document checksum, url pagerank
Map Output Value: url
Partitioner: by checksum
OutputKeyComparator: by checksum and then decreasing pagerank
OutputValueGroupingComparator: by checksum

With this setup, the reduce function will be called exactly once per 
checksum, but the first value from the iterator will be the one with the 
highest pagerank, which can then be used to tag the other entries in the 
checksum family.
{quote}
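The interplay of the two comparators can be sketched in plain Java, with no Hadoop dependencies. This is only an illustration of the mechanism, not Hadoop API: the Key record, the SORT and GROUP comparators, and the bestPerChecksum helper below are hypothetical names, and the grouping comparator is simulated by collecting the first key per checksum after sorting.

```java
import java.util.*;

public class SecondarySortSketch {
    // Key as emitted by the map: (checksum, pagerank); the value would be the url.
    record Key(String checksum, int pagerank) {}

    // Plays the role of the OutputKeyComparator:
    // checksum ascending, then pagerank descending.
    static final Comparator<Key> SORT =
        Comparator.comparing(Key::checksum)
                  .thenComparing(Comparator.comparingInt(Key::pagerank).reversed());

    // Plays the role of the OutputValueGroupingComparator:
    // checksum only, so every pagerank for a checksum lands in one reduce call.
    static final Comparator<Key> GROUP = Comparator.comparing(Key::checksum);

    // Simulate the shuffle: sort all keys, then walk them in order, keeping
    // the first pagerank seen for each checksum group. Because SORT puts the
    // highest pagerank first within a group, that first entry is the "best".
    static Map<String, Integer> bestPerChecksum(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, Integer> best = new LinkedHashMap<>();
        for (Key k : sorted) {
            best.putIfAbsent(k.checksum(), k.pagerank());
        }
        return best;
    }
}
```

Running bestPerChecksum over keys ("c1", 3), ("c1", 9), ("c2", 5), ("c1", 1) yields one entry per checksum, with "c1" mapped to 9, mirroring how the real reduce sees the highest-pagerank url first.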

> Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2046
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.14.2
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-2046_1_20071018.patch
>
>
> I'd like to put forward some thoughts on how to structure reasonably detailed 
> documentation for hadoop.
> Essentially I think of at least 3 different profiles to target:
> * hadoop-dev, folks who are actively involved improving/fixing hadoop.
> * hadoop-user
> ** mapred application writers and/or folks who directly use hdfs
> ** hadoop cluster administrators
> For this issue, I'd like to first target the latter category (admin and 
> hdfs/mapred user), which is, arguably, the biggest bang for the buck right 
> now. 
> There is a crying need to get user-level stuff documented, judging by the 
> sheer number of emails we get on the hadoop lists...
> ----
> *1. Installing/Configuration Guides*
> This set of documents caters to folks ranging from someone just playing with 
> hadoop on a single-node to operations teams who administer hadoop on several 
> nodes (thousands). To ensure we cover all bases I'm thinking along the lines 
> of:
> * _Download, install and configure hadoop_ on a single-node cluster: 
> including a few comments on how to run examples (word-count) etc.
> * *Admin Guide*: Install and configure a real, distributed cluster. 
> * *Tune Hadoop*: Separate sections on how to tune hdfs and map-reduce, 
> targeting power admins/users.
> I reckon most of this would be done via forrest, with appropriate links to 
> javadoc.
> ---
> *2. User Manual*
> This set is geared toward people who use hdfs and/or map-reduce per se. Stuff to 
> document:
> * Write a really simple mapred application, just fitting the blocks together, 
> e.g. a walk-through of a couple of examples like word-count, sort, etc.
> * Detailed information on important map-reduce user-interfaces:
> *- JobConf
> *- JobClient
> *- Tool & ToolRunner
> *- InputFormat 
> *-- InputSplit
> *-- RecordReader
> *- Mapper
> *- Reducer
> *- Reporter
> *- OutputCollector
> *- Writable
> *- WritableComparable
> *- OutputFormat
> *- DistributedCache
> * SequenceFile
> *- Compression types: NONE, RECORD, BLOCK
> * Hadoop Streaming
> * Hadoop Pipes
> I reckon most of this would end up in the javadocs, specifically 
> package.html, and some via forrest.
> ----
> Also, as discussed in HADOOP-1881, it would be quite useful to maintain 
> documentation per-release, even on the hadoop website i.e. we could have a 
> main documentation page link to documentation per-release and to the trunk.
> ----
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
