[
https://issues.apache.org/jira/browse/MAPREDUCE-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alok Singh updated MAPREDUCE-1589:
----------------------------------
Fix Version/s: 0.20.1
0.20.2
0.20.3
Release Note: adding the directory src/examples/streaming/ and adding a
bigrams test under it
Status: Patch Available (was: Open)
Adding the bigram streaming test.
here is the info
1) Bigrams Description
Bigrams give the conditional probability of a word given the preceding
word, by applying the definition of conditional probability:
P(W_n|W_{n-1}) = P(W_{n-1},W_n) / P(W_{n-1})
That is, the probability of a word W_n given the preceding word W_{n-1}
is equal to the probability of their bigram, i.e. the co-occurrence of
the two words P(W_{n-1},W_n), divided by the probability of the
preceding word.
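As a quick worked illustration of the formula above, here is a minimal
Python sketch (the toy corpus and word choices are made up for
illustration; they are not part of the patch):

```python
from collections import Counter

# Hypothetical toy corpus (not from the patch).
tokens = "the cat sat on the mat the cat ran".split()

# Count unigrams and adjacent-word bigrams.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def cond_prob(prev, word):
    # P(W_n|W_{n-1}) = P(W_{n-1},W_n) / P(W_{n-1}); with maximum-likelihood
    # estimates this reduces to count(W_{n-1},W_n) / count(W_{n-1}).
    return bigrams[(prev, word)] / unigrams[prev]

# "the cat" occurs 2 of the 3 times "the" appears, so P(cat|the) = 2/3.
print(cond_prob("the", "cat"))
```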
2) Algorithms
Using Hadoop streaming one can calculate bigrams very easily.
It consists of two map/reduce jobs.
Job1) first map/reduce job
mapper) pairs.pl
line -> (mapper) -> {W_1:W_2}, {W_2:W_3} ...
where W_1 and W_2 are adjacent words in the line.
reducer) count.pl
On the reducer side all the mapper output is sorted by the
framework, so we are guaranteed that all pairs with the same W_i
arrive together; hence in the reducer we can count them and know
that W_i is paired with W_j Count many times.
Since we are using these Hadoop options
-D stream.map.output.field.separator=: \
-D stream.num.map.output.key.fields=2 \
the reducer receives W_i:W_j as the key:
W_i:W_j ... -> reducer -> {W_i:Count-MAX_INT:W_j}, ...
where W_i is any word, and
W_j is a word which appears after W_i Count many times.
Note: we emit Count-MAX_INT instead of Count, so that the framework's
ascending string sort yields the counts in descending numeric order.
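The two steps of Job1 can be sketched in Python as follows (a rough
equivalent of pairs.pl and count.pl; the actual scripts in the patch are
Perl, and the MAX_INT value here is an assumption):

```python
MAX_INT = 2**31 - 1  # assumed 32-bit max; must match whatever count.pl uses

def job1_map(lines):
    """pairs.pl logic: emit one W_i:W_j record per adjacent word pair."""
    for line in lines:
        words = line.split()
        for w1, w2 in zip(words, words[1:]):
            yield f"{w1}:{w2}"

def job1_reduce(sorted_pairs):
    """count.pl logic: the framework hands the reducer its input sorted,
    so identical W_i:W_j records are adjacent and can be counted in one
    pass; emit W_i:Count-MAX_INT:W_j so a later ascending string sort is
    a numeric reverse sort on Count."""
    prev, count = None, 0
    for pair in sorted_pairs:
        if pair == prev:
            count += 1
            continue
        if prev is not None:
            w1, w2 = prev.split(":")
            yield f"{w1}:{count - MAX_INT}:{w2}"
        prev, count = pair, 1
    if prev is not None:
        w1, w2 = prev.split(":")
        yield f"{w1}:{count - MAX_INT}:{w2}"

# sorted() stands in for the shuffle/sort Hadoop performs between phases:
records = sorted(job1_map(["the cat sat", "the cat ran"]))
print(list(job1_reduce(records)))
```

Here the sorted() call simulates what the streaming framework does for
free between the map and reduce phases.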
Job2) second map/reduce job
mapper) identity mapper, i.e. /bin/cat
reducer) firstN.pl
Since we just want the input delivered to the reducer already sorted,
so that we can take the first N entries, we use the options
-D stream.map.output.field.separator=: \
-D stream.num.map.output.key.fields=2 \
so that the reducer sees the inputs as
{W_i:Count-MAX_INT, W_j} ...
In the reducer we can then extract Count and print the top N words
from the sorted list of keys.
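A rough Python sketch of the firstN.pl step (the Count-MAX_INT
convention and the MAX_INT value are assumptions carried over from the
Job1 description; firstN.pl itself is Perl in the attached patch):

```python
MAX_INT = 2**31 - 1  # assumed; must match the shift applied in Job1

def first_n(sorted_lines, n):
    """firstN.pl logic, sketched: the input arrives already sorted by the
    framework (most frequent bigrams first, thanks to the Count-MAX_INT
    shift), so just take the first n records and undo the shift."""
    results = []
    for line in sorted_lines[:n]:
        w_i, shifted, w_j = line.split(":")
        count = int(shifted) + MAX_INT  # recover the real Count
        results.append((w_i, w_j, count))
    return results

# Hypothetical already-sorted Job1 output:
job1_out = [f"the:{2 - MAX_INT}:cat", f"the:{1 - MAX_INT}:mat"]
print(first_n(job1_out, 1))  # -> [('the', 'cat', 2)]
```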
3) Files
README.txt : this README
cmd : contains run instructions/cmd
input.txt : input, which is just the Wikipedia article on Hadoop
pairs.pl : mapper for job1
count.pl : reducer for job1
firstN.pl : reducer for job2
part-00000.golden : golden result
4) Running/testing it
vim the cmd file and change the required information as per the
instructions in it, then run:
source cmd
> Need streaming examples in mapred/src/examples/streaming
> --------------------------------------------------------
>
> Key: MAPREDUCE-1589
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1589
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: examples
> Reporter: Alok Singh
> Priority: Minor
> Fix For: 0.20.3, 0.20.2, 0.20.1
>
> Attachments: streaming_example_bigrams.patch
>
>
> Hi,
> The examples directory contains examples for pipes and Java mapred, but not
> for streaming.
> We are planning to add test cases for streaming in the examples
> repository.
> Alok
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.