[
https://issues.apache.org/jira/browse/MAPREDUCE-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alok Singh updated MAPREDUCE-1589:
----------------------------------
Fix Version/s: 0.20.1
0.20.2
0.20.3
Release Note: adding the directory src/examples/streaming/ and adding a
bigrams test under it
Status: Patch Available (was: Open)
Adding the bigram streaming test.
here is the info
1) Bigrams Description
Bigrams give the conditional probability of a word given the preceding
word, by applying the definition of conditional probability:
P(W_n|W_{n-1}) = P(W_{n-1},W_n) / P(W_{n-1})
That is, the probability of a word W_n given the preceding word W_{n-1}
is equal to the probability of their bigram, i.e. the co-occurrence of
the two words P(W_{n-1},W_n), divided by the probability of the
preceding word.
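As a quick worked illustration of the formula above, here is a minimal
Python sketch (the toy corpus and word choices are made up for
illustration; they are not part of the patch):

```python
from collections import Counter

# Hypothetical toy corpus (not from the patch).
tokens = "the cat sat on the mat the cat ran".split()

# Count unigrams and adjacent-word bigrams.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def cond_prob(prev, word):
    # P(W_n|W_{n-1}) = P(W_{n-1},W_n) / P(W_{n-1}); with maximum-likelihood
    # estimates this reduces to count(W_{n-1},W_n) / count(W_{n-1}).
    return bigrams[(prev, word)] / unigrams[prev]

# "the cat" occurs 2 of the 3 times "the" appears, so P(cat|the) = 2/3.
print(cond_prob("the", "cat"))
```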
2) Algorithms
Using Hadoop streaming one can calculate bigrams very easily.
It consists of two map/reduce jobs.
Job1) first map/reduce job
mapper) pairs.pl
line -> (mapper) -> {W_1:W_2}, {W_2:W_3} ...
where W_1 and W_2 are adjacent words in the line.
reducer) count.pl
On the reducer side all the mapper output is sorted by the
framework, so we are guaranteed that all pairs with the same W_i
arrive together; hence in the reducer we can count them and know
that W_i is paired with W_j Count many times.
Since we are using these Hadoop options
-D stream.map.output.field.separator=: \
-D stream.num.map.output.key.fields=2 \
the reducer receives W_i:W_j as the key:
W_i:W_j ... -> reducer -> {W_i:Count-MAX_INT:W_j}, ...
where W_i is any word, and
W_j is a word which appears after W_i Count many times.
Note: we emit Count-MAX_INT instead of Count, so that the framework's
ascending string sort yields the counts in descending numeric order.
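The two steps of Job1 can be sketched in Python as follows (a rough
equivalent of pairs.pl and count.pl; the actual scripts in the patch are
Perl, and the MAX_INT value here is an assumption):

```python
MAX_INT = 2**31 - 1  # assumed 32-bit max; must match whatever count.pl uses

def job1_map(lines):
    """pairs.pl logic: emit one W_i:W_j record per adjacent word pair."""
    for line in lines:
        words = line.split()
        for w1, w2 in zip(words, words[1:]):
            yield f"{w1}:{w2}"

def job1_reduce(sorted_pairs):
    """count.pl logic: the framework hands the reducer its input sorted,
    so identical W_i:W_j records are adjacent and can be counted in one
    pass; emit W_i:Count-MAX_INT:W_j so a later ascending string sort is
    a numeric reverse sort on Count."""
    prev, count = None, 0
    for pair in sorted_pairs:
        if pair == prev:
            count += 1
            continue
        if prev is not None:
            w1, w2 = prev.split(":")
            yield f"{w1}:{count - MAX_INT}:{w2}"
        prev, count = pair, 1
    if prev is not None:
        w1, w2 = prev.split(":")
        yield f"{w1}:{count - MAX_INT}:{w2}"

# sorted() stands in for the shuffle/sort Hadoop performs between phases:
records = sorted(job1_map(["the cat sat", "the cat ran"]))
print(list(job1_reduce(records)))
```

Here the sorted() call simulates what the streaming framework does for
free between the map and reduce phases.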
Job2) second map/reduce job
mapper) identity mapper, i.e. /bin/cat
reducer) firstN.pl
Since we just want the input delivered to the reducer already sorted,
so that we can take the first N entries, we use the options
-D stream.map.output.field.separator=: \
-D stream.num.map.output.key.fields=2 \
so that the reducer sees the inputs as
{W_i:Count-MAX_INT, W_j} ...
In the reducer we can then extract Count and print the top N words
from the sorted list of keys.
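A rough Python sketch of the firstN.pl step (the Count-MAX_INT
convention and the MAX_INT value are assumptions carried over from the
Job1 description; firstN.pl itself is Perl in the attached patch):

```python
MAX_INT = 2**31 - 1  # assumed; must match the shift applied in Job1

def first_n(sorted_lines, n):
    """firstN.pl logic, sketched: the input arrives already sorted by the
    framework (most frequent bigrams first, thanks to the Count-MAX_INT
    shift), so just take the first n records and undo the shift."""
    results = []
    for line in sorted_lines[:n]:
        w_i, shifted, w_j = line.split(":")
        count = int(shifted) + MAX_INT  # recover the real Count
        results.append((w_i, w_j, count))
    return results

# Hypothetical already-sorted Job1 output:
job1_out = [f"the:{2 - MAX_INT}:cat", f"the:{1 - MAX_INT}:mat"]
print(first_n(job1_out, 1))  # -> [('the', 'cat', 2)]
```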
3) Files
README.txt : this README
cmd : contains run instructions/cmd
input.txt : input, which is just the Wikipedia article on Hadoop
pairs.pl : mapper for job1
count.pl : reducer for job1
firstN.pl : reducer for job2
part-00000.golden : golden result
4) Running/testing it
vim the cmd file and change the required information as per the
instructions in it, then run:
source cmd
> Need streaming examples in mapred/src/examples/streaming
> --------------------------------------------------------
>
> Key: MAPREDUCE-1589
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1589
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: examples
> Reporter: Alok Singh
> Priority: Minor
> Fix For: 0.20.3, 0.20.2, 0.20.1
>
> Attachments: streaming_example_bigrams.patch
>
>
> Hi,
> The examples directory contains examples for pipes and Java mapred, but not
> for streaming.
> We are planning to add test cases for streaming in the examples
> repository.
> Alok
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.