[jira] Commented: (HADOOP-1327) Doc on Streaming

arkady borkovsky (JIRA) Fri, 14 Dec 2007 17:22:04 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552003 ]


arkady borkovsky commented on HADOOP-1327:
------------------------------------------


0. Assume that this is the only tutorial a new Hadoop user needs to read (once 
she knows hdfs -ls and -cat, and knows the URL of the Job tracker).

1. In the FAQ section: "Can I use UNIX pipes? For example, will -mapper "cut -f1 | 
sed s/foo/bar/g" work?"
  Suggest using a script for this (i.e. put the command into a file, and use 
that file as the streaming command).

2. In the FAQ section: "How do I process files, one per map?"
does not really give a solution for streaming, but rather refers to some Java class.

3.    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
Does this work?
Last time I tried, it was inserting unnecessary keys.
If it is not fixed yet, it should not be in the example.

4.  Describe the task env variables -- many of them are extremely useful

5. FAQ to answer:
how can I make sure that each input file goes to a single mapper, without 
splitting?
(the answer is to set the maxsplit parameter)

6. FAQ to answer
how can I make sure that my mapper gets the input exactly as it is stored in DFS?
and
in case of compressed input, how can I make sure that my mapper gets as input 
exactly what comes out of the decompression?

7.  Describe the use of compression in more detail
-- for the input
-- different compression formats for output
-- if the input is a compressed representation of multiple files, how can I get 
each uncompressed file to go to a separate mapper? (assuming that the files 
are large enough)

8.  "Field selection" and "Aggregate package", although very nice and useful, may 
be put into a separate page, as they are not fundamental for Streaming
(while "secondary sort, the -partitioner" is fundamental -- I'd recommend to 
use "split by field 1, sort by field 2" as default, for most users).

9. It would be very convenient to have section numbers.
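
As a rough illustration of items 1 and 4 above (not taken from the tutorial): 
the pipeline from the FAQ can be put into a small script and passed as the 
mapper, and the same script can read task settings from its environment. The 
sketch assumes Python is available on the task nodes. The file name mapper.py 
and the function names are made up for the example. It also assumes that 
streaming exports job configuration values as environment variables with the 
dots replaced by underscores, so that map.input.file shows up as map_input_file.

```python
#!/usr/bin/env python
"""Hypothetical mapper.py: does the work of `cut -f1 | sed s/foo/bar/g`."""
import os
import sys


def transform(line):
    """Keep the first tab-separated field and replace every 'foo' with 'bar'."""
    first_field = line.rstrip("\n").split("\t", 1)[0]
    return first_field.replace("foo", "bar")


def input_file(environ=os.environ):
    """Name of the file this map task reads, if the framework exported it.

    Assumption: map.input.file is visible as the variable map_input_file.
    """
    return environ.get("map_input_file", "unknown")


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    # Diagnostics go to stderr so they do not mix with the map output.
    stderr.write("processing %s\n" % input_file())
    for line in stdin:
        stdout.write(transform(line) + "\n")


if __name__ == "__main__":
    main()
```

With the script saved as mapper.py, the job would be started with something 
like -mapper mapper.py -file mapper.py, where -file ships the script to the 
task nodes.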


> Doc on Streaming
> ----------------
>
>                 Key: HADOOP-1327
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1327
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Runping Qi
>            Assignee: Rob Weltman
>             Fix For: 0.15.2
>
>         Attachments: HADOOP-1327.patch, site.xml, streaming-doc.patch, 
> streaming.html, streaming.xml
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
