Re: Using the Stanford NLP with hadoop
Greetings, There's a way you can distribute files along with your MR job as part of a "payload", or you could save the file in the same spot on every machine of your cluster with some rsyncing and hard-code the path when loading it. This may be of some help:
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/filecache/DistributedCache.html

On Sat, Apr 18, 2009 at 5:18 AM, hari939 wrote:
>
> My project of parsing through material for a semantic search engine
> requires me to use the Stanford NLP parser
> (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.
>
> To use the Stanford NLP parser, one must create a lexical parser object
> using an englishPCFG.ser.gz file as a constructor parameter.
> I have tried loading the file onto the Hadoop DFS in the /user/root/
> folder and have also tried packing the file along with the jar of the
> Java program.
>
> I am new to the Hadoop platform and am not very familiar with some of
> its salient features.
>
> Looking forward to any form of help.
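For the DistributedCache route, a minimal sketch under the 0.18 mapred API. The HDFS path comes from the /user/root attempt quoted above, while the ParseJob class name and the LexicalizedParser(String) constructor are my assumptions, so check the Stanford parser's javadoc:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Driver side (e.g. in a main() that throws Exception):
// register the HDFS file so it is copied to every task's local disk.
JobConf conf = new JobConf(ParseJob.class);
DistributedCache.addCacheFile(new URI("/user/root/englishPCFG.ser.gz"), conf);

// Mapper side, inside configure(JobConf job): load from the node-local copy
// once per task, not once per record.
Path[] cached = DistributedCache.getLocalCacheFiles(job);
// Constructing the parser straight from a file path is an assumption --
// verify the exact constructor against the Stanford lex-parser docs.
LexicalizedParser parser = new LexicalizedParser(cached[0].toString());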
Help with Map Reduce
Hello all: I am new to Hadoop and MapReduce. I am writing a program to analyze some census data. I have a general question about MapReduce: in the reducer, how can I separate keys so as to do separate calculations based on the key? In my case, I am trying to use if statements to separate the keys out, but for some reason it is not doing so. Here is a segment of my code. My mapper seems to work; I think it is my reducer that is messing up:

public static class MapClass extends MapReduceBase implements Mapper {
    //private final static IntWritable one = new IntWritable(1);
    Text sex = new Text("sex");

    public void map(LongWritable key, Text value,
                    OutputCollector output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, ",");
        :
        :
        word.set(itr.nextToken());
        IntWritable sexVal = new IntWritable(Integer.parseInt(word.toString())); // map gender
        output.collect(sex, sexVal);
    } // end map() method
} // end class MapClass

public static class Reduce extends MapReduceBase implements Reducer {
    public void reduce(Text key, Iterator values,
                       OutputCollector output,
                       Reporter reporter) throws IOException {
        int totalPop   = 0; // count: total population
        int numMales   = 0; // count: males
        int numFemales = 0; // count: females
        Text popOut = new Text("The total population is: ");
        if (key.equals("sex")) {
            while (values.hasNext()) {
                if (values.next().get() == 0)
                    numMales++;   // number of males
                else
                    numFemales++; // number of females
            } // end while
            totalPop = numMales + numFemales; // count of population
        } // end if
        output.collect(popOut, new IntWritable(totalPop));
    } // end reduce method
} // end class Reduce

key is of type Text. I also used:

if (key.toString() == "sex") {

but that wasn't working either. There are of course other keys and values, but I would think gathering all the "sex" keys would be done the way I show it above. However, my output reads as 0. I tried changing the numMales value to see whether the if statement is executed, but it is not. Am I doing something wrong?
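For what it's worth, the usual culprit with this pattern is that key is an org.apache.hadoop.io.Text while "sex" is a java.lang.String: Text.equals(String) is always false, and == on strings compares object references, so neither test ever fires. A minimal sketch of the reducer with a content-based comparison (generics and types filled in from the snippet above; otherwise a drop-in replacement):

// imports needed at the top of the enclosing job class
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int numMales = 0, numFemales = 0;
        // compare the *contents* of the Text key, not the object itself
        if ("sex".equals(key.toString())) {  // or: key.equals(new Text("sex"))
            while (values.hasNext()) {
                if (values.next().get() == 0) numMales++;
                else numFemales++;
            }
            output.collect(new Text("The total population is: "),
                           new IntWritable(numMales + numFemales));
        }
    }
}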
Re: Ec2 instability
Hi, I have 6 instances allocated. I haven't tried adding more instances because I have a maximum of 30,000 rows in my HBase tables. What do you recommend? I have at most 4-5 concurrent map/reduce tasks on one node. How do we characterize the memory usage of mappers and reducers? I'm running Spinn3r in addition to the regular Hadoop/HBase processes, but Spinn3r is being called from one of my map tasks. I am not running Ganglia or any other program to characterize resource usage over time.
Thanks,
Raakhi

On Sat, Apr 18, 2009 at 7:09 PM, Andrew Purtell wrote:
>
> Hi,
>
> This is an OS-level exception. Your node is out of memory
> even to fork a process.
>
> How many instances do you currently have allocated? Have
> you increased the number of instances over time to try to
> spread the load of your application around? How many
> concurrent mapper and/or reducer processes do you execute
> on a node? Can you characterize the memory usage of your
> mappers and reducers? Are you running other processes
> external to hadoop/hbase which consume a lot of memory? Are
> you running Ganglia or similar to track and characterize
> resource usage over time?
>
> You may find you are trying to solve a 100-node problem
> with 10.
>
> - Andy
>
> > From: Rakhi Khatwani
> > Subject: Re: Ec2 instability
> > To: hbase-u...@hadoop.apache.org, core-user@hadoop.apache.org
> > Date: Friday, April 17, 2009, 9:44 AM
> > Hi,
> > this is the exception I have been getting from the MapReduce job:
> >
> > java.io.IOException: Cannot run program "bash": java.io.IOException:
> > error=12, Cannot allocate memory
> >     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> >     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
> >     at org.apache.hadoop.util.Shell.run(Shell.java:134)
> >     at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
> >     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
> >     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> >     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
> >     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > Caused by: java.io.IOException: java.io.IOException:
> > error=12, Cannot allocate memory
> >     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
> >     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> >     at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> >     ... 10 more
Re: max value for a dataset
Thanks for the explanation. I forgot about the close method.

On Sat, Apr 18, 2009 at 11:57 AM, jason hadoop wrote:
> The traditional approach would be a Mapper class that maintains a member
> variable holding the max-value record; in the close method of your mapper
> you output a single record containing that value.
>
> The map method of course compares the current record against the max and
> stores current in max when current is larger than max.
>
> Then each map output is a single record, and the reduce behaves very
> similarly, in that the close method outputs the final max record. A single
> reduce would be the simplest.
>
> On your question: a Mapper and a Reducer each define three entry points:
> configure, called once on task start; map/reduce, called once for each
> record; and close, called once after the last call to map/reduce.
> At least through 0.19, close is not provided with the output collector
> or the reporter, so you need to save them in the map/reduce method.
>
> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain wrote:
>
> > How do you identify that the map task is ending from within the map
> > method? Is it possible to know which is the last call to the map method?
> >
> > On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:
> >
> > > I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> > > support max(). I am writing my own max() over a simple one-column
> > > dataset.
> > >
> > > The best solution I came up with was using MapRunner. With MapRunner I
> > > can store the highest value in a private member variable. I can read
> > > through the entire dataset and only have to emit one value per mapper
> > > upon completion of the map data. Then I can specify one reducer and
> > > carry out the same operation.
> > >
> > > Does anyone have a better tactic? I thought counters could do this,
> > > but are they atomic?
>
> --
> Alpha Chapters of my book on Hadoop are available:
> http://www.apress.com/book/view/9781430219422
Re: max value for a dataset
The traditional approach would be a Mapper class that maintains a member variable holding the max-value record; in the close method of your mapper you output a single record containing that value.

The map method of course compares the current record against the max and stores current in max when current is larger than max.

Then each map output is a single record, and the reduce behaves very similarly, in that the close method outputs the final max record. A single reduce would be the simplest.

On your question: a Mapper and a Reducer each define three entry points: configure, called once on task start; map/reduce, called once for each record; and close, called once after the last call to map/reduce. At least through 0.19, close is not provided with the output collector or the reporter, so you need to save them in the map/reduce method.

On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain wrote:

> How do you identify that the map task is ending from within the map
> method? Is it possible to know which is the last call to the map method?
>
> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:
>
> > I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> > support max(). I am writing my own max() over a simple one-column
> > dataset.
> >
> > The best solution I came up with was using MapRunner. With MapRunner I
> > can store the highest value in a private member variable. I can read
> > through the entire dataset and only have to emit one value per mapper
> > upon completion of the map data. Then I can specify one reducer and
> > carry out the same operation.
> >
> > Does anyone have a better tactic? I thought counters could do this,
> > but are they atomic?

--
Alpha Chapters of my book on Hadoop are available:
http://www.apress.com/book/view/9781430219422
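To make the pattern concrete, a minimal sketch of the mapper half under the old (0.18/0.19) mapred API. The LongWritable value type and one-number-per-line input are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

    private long max;
    private boolean sawAny = false;
    // close() receives no collector, so stash the one handed to map()
    private OutputCollector<NullWritable, LongWritable> collector;

    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, LongWritable> output,
                    Reporter reporter) throws IOException {
        collector = output;
        long current = Long.parseLong(line.toString().trim());
        if (!sawAny || current > max) {
            max = current;      // keep the running maximum for this task
            sawAny = true;
        }
    }

    public void close() throws IOException {
        if (sawAny) {           // emit exactly one record per map task
            collector.collect(NullWritable.get(), new LongWritable(max));
        }
    }
}

A reducer built the same way (save the collector in reduce, emit in close), run with conf.setNumReduceTasks(1), then produces the single global maximum.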
Re: Seattle / PNW Hadoop + Lucene User Group?
I would love to come, but I'm afraid I'm stuck in rainy old England :(

Amin

On 18 Apr 2009, at 01:08, Bradford Stephens wrote:

OK, we've got 3 people... that's enough for a party? :) Surely there must be dozens more of you guys out there... c'mon, accelerate your knowledge! Join us in Seattle!

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens wrote:

Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :)

Cheers, Bradford
Re: max value for a dataset
How do you identify that the map task is ending from within the map method? Is it possible to know which is the last call to the map method?

On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:

> I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> support max(). I am writing my own max() over a simple one-column
> dataset.
>
> The best solution I came up with was using MapRunner. With MapRunner I
> can store the highest value in a private member variable. I can read
> through the entire dataset and only have to emit one value per mapper
> upon completion of the map data. Then I can specify one reducer and
> carry out the same operation.
>
> Does anyone have a better tactic? I thought counters could do this,
> but are they atomic?
max value for a dataset
I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase support max(). I am writing my own max() over a simple one-column dataset.

The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable. I can read through the entire dataset and only have to emit one value per mapper upon completion of the map data. Then I can specify one reducer and carry out the same operation.

Does anyone have a better tactic? I thought counters could do this, but are they atomic?
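For concreteness, a minimal sketch of that MapRunner approach under the old mapred API; the class name, LongWritable types, and one-number-per-line input are my assumptions for illustration. Note that inside run() the collector stays in scope for the whole split, so no close()-time bookkeeping is needed:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapRunner
        implements MapRunnable<LongWritable, Text, NullWritable, LongWritable> {

    public void configure(JobConf job) { }

    public void run(RecordReader<LongWritable, Text> input,
                    OutputCollector<NullWritable, LongWritable> output,
                    Reporter reporter) throws IOException {
        LongWritable offset = input.createKey();
        Text line = input.createValue();
        long max = 0;
        boolean sawAny = false;
        // scan the whole split, tracking the running max in a local variable
        while (input.next(offset, line)) {
            long current = Long.parseLong(line.toString().trim());
            if (!sawAny || current > max) {
                max = current;
                sawAny = true;
            }
        }
        if (sawAny) {   // one output record per map task
            output.collect(NullWritable.get(), new LongWritable(max));
        }
    }
}

Wire it in with conf.setMapRunnerClass(MaxMapRunner.class) and conf.setNumReduceTasks(1); the single reducer then repeats the comparison over the per-mapper maxima.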
Re: Ec2 instability
Hi,

This is an OS-level exception. Your node is out of memory even to fork a process.

How many instances do you currently have allocated? Have you increased the number of instances over time to try to spread the load of your application around? How many concurrent mapper and/or reducer processes do you execute on a node? Can you characterize the memory usage of your mappers and reducers? Are you running other processes external to hadoop/hbase which consume a lot of memory? Are you running Ganglia or similar to track and characterize resource usage over time?

You may find you are trying to solve a 100-node problem with 10.

- Andy

> From: Rakhi Khatwani
> Subject: Re: Ec2 instability
> To: hbase-u...@hadoop.apache.org, core-user@hadoop.apache.org
> Date: Friday, April 17, 2009, 9:44 AM
> Hi,
> this is the exception I have been getting from the MapReduce job:
>
> java.io.IOException: Cannot run program "bash": java.io.IOException:
> error=12, Cannot allocate memory
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>     at org.apache.hadoop.util.Shell.run(Shell.java:134)
>     at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
>     at org.apache.hadoop.mapred.Child.main(Child.java:155)
> Caused by: java.io.IOException: java.io.IOException:
> error=12, Cannot allocate memory
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>     ... 10 more
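As a starting point for taming per-node memory, task concurrency and per-task heap are controlled by TaskTracker settings like the following in hadoop-site.xml. The values here are only illustrative, and the first two are daemon settings that take effect on TaskTracker restart:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>        <!-- fewer concurrent map tasks per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>        <!-- fewer concurrent reduce tasks per node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value> <!-- cap each task JVM's heap -->
</property>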
Using the Stanford NLP with hadoop
My project of parsing through material for a semantic search engine requires me to use the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.

To use the Stanford NLP parser, one must create a lexical parser object using an englishPCFG.ser.gz file as a constructor parameter. I have tried loading the file onto the Hadoop DFS in the /user/root/ folder and have also tried packing the file along with the jar of the Java program.

I am new to the Hadoop platform and am not very familiar with some of its salient features.

Looking forward to any form of help.