[ https://issues.apache.org/jira/browse/HADOOP-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579409#action_12579409 ]
Hudson commented on HADOOP-2806:
--------------------------------
Integrated in Hadoop-trunk #431 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/431/)
> Streaming has no way to force entire record (or null) as key
> ------------------------------------------------------------
>
> Key: HADOOP-2806
> URL: https://issues.apache.org/jira/browse/HADOOP-2806
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/streaming
> Reporter: Marco Nicosia
> Assignee: Amareshwari Sriramadasu
> Priority: Minor
> Fix For: 0.17.0
>
> Attachments: patch-2806.txt
>
>
> I think perhaps streaming needs a "-allkey" or "-nullkey" option? Otherwise,
> I'm concerned there is a subtle streaming documentation problem.
> These two docs:
> http://hadoop.apache.org/core/docs/current/streaming.html
> http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)
> ... seem to gloss over the fact that streaming, by default, splits key/value
> on TAB. Sure, they mention it, but none of the simple (no separator) examples
> take into account that streaming decides, record by record, whether the key
> is the whole line or only the text up to the first tab, should one occur.
> This means that records containing a tab may be sorted on a different key
> than records without one.
> Here's a very simple pair of examples that, to the naive, should produce the
> same output but do not:
> > [hod] (marco) >> run dfs -fs local -cat str-tabs
> > a 1
> > b 3
> > a 4
> >
> > [hod] (marco) >> run dfs -put str-tabs str-tabs
> >
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output
> > str-tabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> >
> > [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
> > a 4
> > a 1
> > b 3
> Compare to this negative test:
> > [hod] (marco) >> run dfs -fs local -cat str-notabs
> > a 1
> > b 3
> > a 4
> >
> > [hod] (marco) >> run dfs -put str-notabs str-notabs
> >
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output
> > str-notabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> >
> > [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
> > a 1
> > a 4
> > b 3
> >
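
The key/value split described in the report can be sketched in a few lines of Python. This is only an illustration of the rule, not Hadoop's actual code: `split_record` and `sorted_keys` are hypothetical names, the empty string stands in for a null value, and the sample data assumes that str-tabs separates its fields with TAB while str-notabs uses a single space.

```python
def split_record(line, sep="\t"):
    """Split one line into (key, value) at the first separator.

    If the separator is absent, the whole line becomes the key and the
    value is empty (a stand-in for the null value streaming would use).
    """
    key, _found, value = line.partition(sep)
    return key, value


def sorted_keys(lines):
    """Return the keys the sort phase would compare, in sorted order."""
    return sorted(split_record(line)[0] for line in lines)


# Assumed contents of the two input files from the report.
with_tabs = ["a\t1", "b\t3", "a\t4"]
no_tabs = ["a 1", "b 3", "a 4"]

print(sorted_keys(with_tabs))  # ['a', 'a', 'b']
print(sorted_keys(no_tabs))    # ['a 1', 'a 4', 'b 3']
```

With tabs, both "a" records compare equal, so their relative order in the reducer output is not determined by the key; without tabs, the whole line is compared and the order is fully determined. That matches the difference between the two part-00000 listings above.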