In the mapper you already have the entire line as a string.
If you tokenize it on commas ...

Starting with:
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}

For the first line you end up with two tokens:
{"author":"foo"    and    "text": "hello"}

So you can ignore the first token, then split the second token on the colon (':').

This gives you "text" and "hello"}

You can again ignore the first token and you now have "hello"}

And now you can parse out the stuff within the quotes. 
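
The steps above can be sketched roughly as follows. This is only a minimal sketch of the hack, not a real JSON parser: the class name TextFieldExtractor and the helpers extractText and words are made up for illustration, and it assumes every line has exactly the shape shown above ("author" first, "text" second, no commas or colons in the author value, no escaped quotes in the text). It splits with a limit of 2 so that commas or colons inside the text itself do not get cut.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class; the names are illustrative only.
class TextFieldExtractor {

    // Pulls the "text" value out of a line shaped like
    // {"author":"foo", "text": "hello"} using plain string splitting.
    // Returns an empty string when the line does not have that shape.
    static String extractText(String line) {
        // Step 1: split on the first comma and ignore the first token.
        String[] fields = line.split(",", 2);
        if (fields.length < 2) {
            return "";
        }
        // Step 2: split the remainder on the first colon and ignore the key.
        String[] keyValue = fields[1].split(":", 2);
        if (keyValue.length < 2) {
            return "";
        }
        // Step 3: keep whatever sits between the first and the last quote.
        String value = keyValue[1];
        int open = value.indexOf('"');
        int close = value.lastIndexOf('"');
        if (open < 0 || close <= open) {
            return "";
        }
        return value.substring(open + 1, close);
    }

    // Tokenizes the extracted text on whitespace, as a wordcount mapper would.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(extractText(line));
        while (itr.hasMoreTokens()) {
            result.add(itr.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        for (String line : lines) {
            System.out.println(words(line));
        }
    }
}
```

In a mapper you would call something like words(value.toString()) and emit each word with a count of 1. If the field order can change or the text can contain escaped quotes, a real JSON library is the safer route.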

HTH


On May 29, 2013, at 6:44 PM, jamal sasha <jamalsha...@gmail.com> wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use org.json library, something like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
> 
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
> 
> But it's not working :(
> It would be better to do this properly, but I wouldn't mind using a hack 
> as well :)
> 
> 
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> Yeah, 
> I have to agree w Russell. Pig is definitely the way to go on this. 
> 
> If you want to do it as a Java program you will have to do some work on the 
> input string but it too should be trivial. 
> How formal do you want to go? 
> Do you want to strip it down or just find the quote after the text part? 
> 
> 
> On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
> 
>> Seriously consider Pig (free answer, 4 LOC):
>> 
>> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>> 
>> It will be faster than the Java you'll likely write.
>> 
>> 
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalsha...@gmail.com> wrote:
>> Hi,
>>    I am stuck again. :(
>> My input data is in HDFS. I am again trying to do wordcount, but there is 
>> a slight difference.
>> The data is in json format.
>> So each line of data is:
>> 
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>> 
>> So I want to do wordcount for text part.
>> I understand that in mapper, I just have to pass this data as json and 
>> extract "text" and rest of the code is just the same but I am trying to 
>> switch from python to java hadoop. 
>> How do I do this?
>> Thanks
>> 
>> 
>> 
>> -- 
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
> 
> 
