Seriously consider Pig (free answer, 4 LOC):

my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
words = FOREACH my_data GENERATE $0#'author' AS author, FLATTEN(TOKENIZE($0#'text')) AS word;
word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
STORE word_counts INTO '/tmp/word_counts.txt';
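One caveat: JsonLoader comes from Elephant Bird, so you'll need a REGISTER statement at the top of the script pointing at your elephant-bird jar (and its dependencies, if they aren't already on the classpath).

If you'd rather stay in Java, here is a rough sketch of what the mapper could look like. It assumes the Jackson that ships with Hadoop (org.codehaus.jackson) for JSON parsing; the class name and field handling are illustrative, not a canonical implementation:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;

public class JsonWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final ObjectMapper jsonMapper = new ObjectMapper();
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input line is one JSON record, e.g. {"author":"foo", "text":"hello"}
    JsonNode record = jsonMapper.readTree(value.toString());
    JsonNode text = record.get("text");
    if (text == null || !text.isTextual()) {
      return; // skip malformed records or ones missing a "text" field
    }
    // Tokenize on whitespace and emit (word, 1), same as stock wordcount
    for (String token : text.getTextValue().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

The reducer and driver stay exactly the same as in plain wordcount; only the mapper changes.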
Even so, the Pig version will be faster than the Java you'll likely write.

On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalsha...@gmail.com> wrote:
> Hi,
> I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a
> slight difference: the data is in JSON format, so each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> I want to do wordcount on the "text" part. I understand that in the mapper
> I just have to parse each line as JSON and extract "text", and the rest of
> the code stays the same, but I am trying to switch from Python to Java for
> Hadoop. How do I do this?
> Thanks

--
Russell Jurney
twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com