Reading json format input

2013-05-29 Thread jamal sasha
Hi, I am stuck again. :( My input data is in HDFS. I am again trying to do wordcount, but there is a slight difference: the data is in JSON format. So each line of data is:

    {"author":"foo", "text": "hello"}
    {"author":"foo123", "text": "hello world"}
    {"author":"foo234", "text": "hello this world"}

Re: Reading json format input

2013-05-29 Thread Russell Jurney
Seriously consider Pig (free answer, 4 LOC):

    my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
    words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
    word_counts = FOREACH (GROUP words BY word) GENERATE group AS

Re: Reading json format input

2013-05-29 Thread Michael Segel
Yeah, I have to agree with Russell. Pig is definitely the way to go on this. If you want to do it as a Java program, you will have to do some work on the input string, but that too should be trivial. How formal do you want to go? Do you want to strip it down or just find the quote after the text par

Re: Reading json format input

2013-05-29 Thread Rishi Yadav
Hi Jamal, I took your input and put it in a sample wordcount program, and it's working just fine and giving this output:

    author 3
    foo234 1
    text 3
    foo 1
    foo123 1
    hello 3
    this 1
    world 2

When we split using String[] words = input.split("\\W+"); it takes care of all non-alphanumeric characters. Tha
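To make it concrete why splitting on "\\W+" also counts the JSON keys, here is a minimal stand-alone sketch in plain Java (no Hadoop involved). The class name SplitWordCount and the helper count are hypothetical and are not taken from the sample program mentioned above; only the split expression and the three input lines come from the thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone class; it only imitates what a wordcount mapper
// plus reducer would compute for the three sample lines.
class SplitWordCount {

    // Count every token obtained by splitting each line on runs of
    // non-word characters (anything other than letters, digits, underscore).
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                // Each line starts with {" so the first token is an empty string.
                if (word.isEmpty()) {
                    continue;
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}");
        count(lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

Because the keys "author" and "text" and the author values are ordinary word tokens after the split, they are counted alongside the words of the text field, which is why the output above contains author 3 and text 3.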

Re: Reading json format input

2013-05-29 Thread jamal sasha
Hi, For some reason, this has to be in Java :( I am trying to use the org.json library, something like (in mapper):

    JSONObject jsn = new JSONObject(value.toString());
    String text = (String) jsn.get("text");
    StringTokenizer itr = new StringTokenizer(text);

But it's not working :( It would be better t

Re: Reading json format input

2013-05-29 Thread jamal sasha
Hi Rishi, But I don't want the wordcount of all the words. In the JSON, there is a field "text", and those are the words I wish to count.

Re: Reading json format input

2013-05-29 Thread Rishi Yadav
For that, you have to write intermediate data only if word = "text":

    String[] words = line.split("\\W+");
    for (String word : words) {
        if (word.equals("text"))
            context.write(new Text(word), new IntWritable(1));
    }

I am assuming you have a huge volume of data for it, otherwise Map

Re: Reading json format input

2013-05-29 Thread Michael Segel
You have the entire string. If you tokenize on commas ... Starting with:

    {"author":"foo", "text": "hello"}
    {"author":"foo123", "text": "hello world"}
    {"author":"foo234", "text": "hello this world"}

You end up with {"author":"foo", and "text":"hello"} So you can ignore the first t
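One possible reading of the comma-tokenizing idea above, sketched in plain Java without Hadoop or a JSON library. The message is cut off, so this is an interpretation rather than the poster's actual code: the class CommaFieldWordCount and its methods are hypothetical, it splits on the first comma only, and it assumes every line is a flat two-field object with "text" second and no commas or escaped quotes in the author value. A real JSON parser is the safer choice for anything less regular.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical sketch of hand-parsing the second field of each line.
class CommaFieldWordCount {

    // Return the value of the second field, assuming the line looks like
    // {"author":"...", "text": "..."} (an assumption, not validated here).
    static String textField(String line) {
        String[] parts = line.split(",", 2);   // ignore everything before the first comma
        if (parts.length < 2) {
            return "";
        }
        String field = parts[1];
        int colon = field.indexOf(':');        // separator between key and value
        if (colon < 0) {
            return "";
        }
        String value = field.substring(colon + 1).trim();
        if (value.endsWith("}")) {             // drop the closing brace
            value = value.substring(0, value.length() - 1).trim();
        }
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);  // drop surrounding quotes
        }
        return value;
    }

    // Count whitespace-separated words found in the text field of each line.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(textField(line));
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}");
        count(lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

In a mapper, the body of count would become the per-line logic, with each token written to the context instead of being merged into a map.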

Re: Reading json format input

2013-05-29 Thread Rahul Bhattacharjee
Whatever you have mentioned, Jamal, should work. You can debug this. Thanks, Rahul

Re: Reading json format input

2013-05-30 Thread Pramod N
Whatever you are trying to do should work. Here is the modified WordCount Map:

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        JSONObject line_as_json = new JSONObject(line);
        String

Re: Reading json format input

2013-05-30 Thread jamal sasha
Hi, thanks guys. I figured out the issue, and now I have another question. I am using a third-party library, and I thought that once I had created the jar file I wouldn't need to specify the dependencies, but apparently that's not the case. (error below) Very, very naive question... probably stupid. How do i

Re: Reading json format input

2013-05-30 Thread Shahab Yunus
For starters, you can specify them through the -libjars parameter when you kick off your M/R job. This way the jars will be copied to all TaskTrackers (TTs). Regards, Shahab

Re: Reading json format input

2013-05-30 Thread jamal sasha
Hi, I did that but still get the same exception. I did:

    export HADOOP_CLASSPATH=/path/to/external.jar

And then added -libjars /path/to/external.jar to my command, but I still get the same error.

Re: Reading json format input

2013-05-30 Thread jamal sasha
OK, got this thing working. It turns out that -libjars should be specified before the HDFS input and output paths rather than after them. :-/ Thanks, everyone.