Whatever you are trying to do should work.
Here is the modified WordCount map method:


    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        JSONObject line_as_json = new JSONObject(line);
        String text = line_as_json.getString("text");
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
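For reference, the extract-and-tokenize logic can also be checked outside Hadoop. The sketch below is my own illustration, not part of the original WordCount: the class name JsonTextWordCount and the extractText helper are hypothetical, and extractText is a deliberately naive parse that assumes the "text" value contains no escaped quotes (which holds for the sample records in this thread).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class JsonTextWordCount {

    // Naive extraction of the "text" value from a one-line JSON record.
    // Assumes the value contains no escaped quotes; for real data,
    // use org.json as in the mapper above.
    static String extractText(String line) {
        int keyIdx = line.indexOf("\"text\"");
        if (keyIdx < 0) {
            return "";
        }
        int start = line.indexOf('"', line.indexOf(':', keyIdx) + 1) + 1;
        int end = line.indexOf('"', start);
        return line.substring(start, end);
    }

    // Same tokenize-and-count step the mapper/reducer pair performs,
    // collapsed into a single in-memory map for local testing.
    static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(extractText(line));
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        System.out.println(countWords(sample));
    }
}
```

Running main on the three sample records from this thread should count "hello" three times, "world" twice, and "this" once.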





Pramod N <http://atmachinelearner.blogspot.in>
Bruce Wayne of web
@machinelearner <https://twitter.com/machinelearner>

--


On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
rahul.rec....@gmail.com> wrote:

> Whatever you have mentioned, Jamal, should work. You can debug this.
>
> Thanks,
> Rahul
>
>
> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <jamalsha...@gmail.com> wrote:
>
>> Hi,
>>   For some reason, this has to be in Java :(
>> I am trying to use org.json library, something like (in mapper)
>> JSONObject jsn = new JSONObject(value.toString());
>>
>> String text = (String) jsn.get("text");
>> StringTokenizer itr = new StringTokenizer(text);
>>
>> But it's not working :(
>> It would be better to get this thing done properly, but I wouldn't mind
>> using a hack as well :)
>>
>>
>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <michael_se...@hotmail.com
>> > wrote:
>>
>>> Yeah,
>>> I have to agree with Russell. Pig is definitely the way to go on this.
>>>
>>> If you want to do it as a Java program you will have to do some work on
>>> the input string but it too should be trivial.
>>> How formal do you want to go?
>>> Do you want to strip it down or just find the quote after the text part?
>>>
>>>
>>> On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jur...@gmail.com>
>>> wrote:
>>>
>>> Seriously consider Pig (free answer, 4 LOC):
>>>
>>> my_data = LOAD 'my_data.json' USING
>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>> COUNT_STAR(words) AS word_count;
>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>
>>> It will be faster than the Java you'll likely write.
>>>
>>>
>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>    I am stuck again. :(
>>>> My input data is in hdfs. I am again trying to do wordcount but there
>>>> is slight difference.
>>>> The data is in json format.
>>>> So each line of data is:
>>>>
>>>> {"author":"foo", "text": "hello"}
>>>> {"author":"foo123", "text": "hello world"}
>>>> {"author":"foo234", "text": "hello this world"}
>>>>
>>>> So I want to do wordcount for text part.
>>>> I understand that in the mapper I just have to parse this data as JSON
>>>> and extract "text", and the rest of the code is just the same, but I am
>>>> trying to switch from Python to Java Hadoop.
>>>> How do I do this?
>>>> Thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome
>>> .com
>>>
>>>
>>>
>>
>
