Re: Using the Stanford NLP with hadoop

2009-04-18 Thread Bradford Stephens
Greetings,

There's a way to distribute files along with your MR job as part of a
"payload", or you could save the file in the same spot on every
machine of your cluster with some rsyncing and hard-code the path when loading it.

This may be of some help:
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/filecache/DistributedCache.html
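
A minimal sketch of the DistributedCache route (old "mapred" API, circa 0.18;
the class, method, and field names are illustrative, and the commented-out
parser construction is only a placeholder, not the exact Stanford API):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParserCacheExample {

  // Job setup: register the grammar file (already sitting in HDFS) with the cache.
  public static void addParserModel(JobConf conf) throws URISyntaxException {
    DistributedCache.addCacheFile(new URI("/user/root/englishPCFG.ser.gz"), conf);
  }

  public static class ParserMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String modelPath;

    // configure() runs once per task, before any map() calls.
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        // cached[0] is a node-local copy of englishPCFG.ser.gz; hand that path
        // to the parser's constructor, e.g.
        //   LexicalizedParser parser = new LexicalizedParser(modelPath);
        modelPath = cached[0].toString();
      } catch (IOException e) {
        throw new RuntimeException("Could not locate cached parser model", e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // parse value.toString() with the loaded parser and emit results here
    }
  }
}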

On Sat, Apr 18, 2009 at 5:18 AM, hari939  wrote:
>
> My project of parsing through material for a semantic search engine requires
> me to use the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml)
> on a Hadoop cluster.
>
> To use the Stanford NLP parser, one must create a lexical parser object
> using an englishPCFG.ser.gz file as a constructor's parameter.
> I have tried loading the file onto the Hadoop DFS in the /user/root/ folder
> and have also tried packing the file along with the jar of the Java program.
>
> I am new to the Hadoop platform and am not very familiar with some of the
> salient features of Hadoop.
>
> Looking forward to any form of help.
>


Help with Map Reduce

2009-04-18 Thread Reza
Hello all:

I am new to Hadoop and Map Reduce. I am writing a program to analyze some
census data.

I have a general question about MapReduce:

In the Reducer, how can I separate keys to do separate calculations based
on the key? In my case, I am trying to use if statements to separate the
keys out, but for some reason it is not working.

Here is a segment of my code. My mapper seems to work; I think it is my
Reducer that is messing up:

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    //private final static IntWritable one = new IntWritable(1);
    Text sex = new Text("sex");

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line, ",");

      // ... (lines elided in original) ...
      word.set(itr.nextToken());
      IntWritable sexVal = new IntWritable(Integer.parseInt(word.toString())); //map gender
      output.collect(sex, sexVal);

    }//end map() method
  }//end class MapClass


  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {

      int totalPop   = 0; //count: total population
      int numMales   = 0; //count: males
      int numFemales = 0; //count: females
      Text popOut = new Text("The total population is: ");
      if (key.equals("sex")) {
        while (values.hasNext()) {
          if (values.next().get() == 0)
            numMales++;   //number of males
          else
            numFemales++; //number of females
        }//end while
        totalPop = numMales + numFemales; //count of population
      }//end if
      output.collect(popOut, new IntWritable(totalPop));

    }//end reduce method
  }//end class Reduce

key is of type Text; I also used: if (key.toString() == "sex") {
but that wasn't working either.

There are of course other keys and values, but I would think gathering all
the "sex" keys would be done the way I show it above. However, my
output reads as 0. I tried changing the numMales value to see whether the if
statement is executed, but it is not. Am I doing something wrong?
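
For reference, a minimal standalone sketch (not from the original thread; the
class name is just illustrative) of how a Text key compares against a string
literal:

import org.apache.hadoop.io.Text;

public class TextKeyComparison {
  public static void main(String[] args) {
    Text key = new Text("sex");
    System.out.println(key.equals("sex"));            // false: a Text never equals a String
    System.out.println(key.toString() == "sex");      // false: == compares object references
    System.out.println(key.toString().equals("sex")); // true: compares character contents
    System.out.println(key.equals(new Text("sex")));  // true: Text compared with Text
  }
}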




Re: Ec2 instability

2009-04-18 Thread Rakhi Khatwani
Hi,

I have 6 instances allocated.
I haven't tried adding more instances because I have a maximum of 30,000 rows in
my HBase tables. What do you recommend?
I have at most 4-5 concurrent map/reduce tasks on one node.
How do we characterize the memory usage of mappers and reducers?
I am running Spinn3r in addition to the regular Hadoop/HBase processes, but Spinn3r is
being called from one of my map tasks.
I am not running Ganglia or any other program to characterize resource usage
over time.

Thanks,
Raakhi

On Sat, Apr 18, 2009 at 7:09 PM, Andrew Purtell  wrote:

>
> Hi,
>
> This is an OS level exception. Your node is out of memory
> even to fork a process.
>
> How many instances do you currently have allocated? Have
> you increased the number of instances over time to try and
> spread the load of your application around? How many
> concurrent mapper and/or reducer processes do you execute
> on a node? Can you characterize the memory usage of your
> mappers and reducers? Are you running other processes
> external to hadoop/hbase which consume a lot of memory? Are
> you running Ganglia or similar to track and characterize
> resource usage over time?
>
> You may find you are trying to solve a 100 node problem
> with 10.
>
>   - Andy
>
> > From: Rakhi Khatwani
> > Subject: Re: Ec2 instability
> > To: hbase-u...@hadoop.apache.org, core-user@hadoop.apache.org
> > Date: Friday, April 17, 2009, 9:44 AM
> > Hi,
> > this is the exception I have been getting from the MapReduce job:
> >
> > java.io.IOException: Cannot run program "bash":
> > java.io.IOException:
> > error=12, Cannot allocate memory
> >   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> >   at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
> >   at org.apache.hadoop.util.Shell.run(Shell.java:134)
> >   at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
> >   at
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
> >   at
> >
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >   at
> >
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> >   at
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
> >   at
> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
> >   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > Caused by: java.io.IOException: java.io.IOException:
> > error=12, Cannot
> > allocate memory
> >   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
> >   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
> >   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> >   ... 10 more
>
>
>
>
>


Re: max value for a dataset

2009-04-18 Thread Farhan Husain
Thanks for the explanation. I had forgotten about the close method.

On Sat, Apr 18, 2009 at 11:57 AM, jason hadoop wrote:

> The traditional approach would be a Mapper class that maintains a member
> variable holding the max-value record, and in the close method of your
> mapper you output a single record containing that value.
>
> The map method of course compares the current record against the max and
> stores current in max when current is larger than max.
>
> Then each map output is a single record and the reduce behaves very
> similarly, in that the close method outputs the final max record. A single
> reduce would be the simplest.
>
> On your question: a Mapper and a Reducer define 3 entry points: configure,
> called once on task start; map/reduce, called once for each record;
> and close, called once after the last call to map/reduce.
> At least through 0.19, close is not provided with the output collector
> or the reporter, so you need to save them in the map/reduce method.
>
> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain  wrote:
>
> > How do you identify that the map task is ending from within the map method? Is it
> > possible to know which is the last call to the map method?
> >
> > On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:
> >
> > > I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> > > support max(). I am writing my own max() over a simple
> > > one-column dataset.
> > >
> > > The best solution I came up with was using MapRunner. With MapRunner I
> > > can store the highest value in a private member variable. I can read
> > > through the entire data set and only have to emit one value per mapper
> > > upon completion of the map data. Then I can specify one reducer and
> > > carry out the same operation.
> > >
> > > Does anyone have a better tactic? I thought a counter could do this,
> > > but are counters atomic?
> > >
> >
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>


Re: max value for a dataset

2009-04-18 Thread jason hadoop
The traditional approach would be a Mapper class that maintains a member
variable holding the max-value record, and in the close method of your
mapper you output a single record containing that value.

The map method of course compares the current record against the max and
stores current in max when current is larger than max.

Then each map output is a single record and the reduce behaves very
similarly, in that the close method outputs the final max record. A single
reduce would be the simplest.

On your question: a Mapper and a Reducer define 3 entry points: configure,
called once on task start; map/reduce, called once for each record;
and close, called once after the last call to map/reduce.
At least through 0.19, close is not provided with the output collector
or the reporter, so you need to save them in the map/reduce method.
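
A minimal sketch of that pattern (old "mapred" API; the class name is
hypothetical and it assumes one long value per input line):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean seen = false;
  // close() gets neither the collector nor the reporter, so remember the one
  // passed to map().
  private OutputCollector<Text, LongWritable> collector;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    collector = output;
    long current = Long.parseLong(value.toString().trim());
    if (!seen || current > max) {
      max = current;
      seen = true;
    }
  }

  public void close() throws IOException {
    // Called once after the last map() call: emit this mapper's single maximum.
    if (seen && collector != null) {
      collector.collect(new Text("max"), new LongWritable(max));
    }
  }
}

A single reducer written the same way (track the running max over its input,
emit it in close()) then produces the global maximum.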

On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain  wrote:

> How do you identify that the map task is ending from within the map method? Is it
> possible to know which is the last call to the map method?
>
> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:
>
> > I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> > support max(). I am writing my own max() over a simple
> > one-column dataset.
> >
> > The best solution I came up with was using MapRunner. With MapRunner I
> > can store the highest value in a private member variable. I can read
> > through the entire data set and only have to emit one value per mapper
> > upon completion of the map data. Then I can specify one reducer and
> > carry out the same operation.
> >
> > Does anyone have a better tactic? I thought a counter could do this,
> > but are counters atomic?
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-18 Thread Amin Mohammed-Coleman

I would love to come but I'm afraid I'm stuck in rainy old England :(

Amin

On 18 Apr 2009, at 01:08, Bradford Stephens wrote:



OK, we've got 3 people... that's enough for a party? :)

Surely there must be dozens more of you guys out there... c'mon,
accelerate your knowledge! Join us in Seattle!



On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens wrote:

Greetings,

Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
with me in the Seattle area? I can donate some facilities, etc. -- I
also always have topics to speak about :)

Cheers,
Bradford






Re: max value for a dataset

2009-04-18 Thread Farhan Husain
How do you identify that the map task is ending from within the map method? Is it
possible to know which is the last call to the map method?

On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo wrote:

> I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
> support max(). I am writing my own max() over a simple
> one-column dataset.
>
> The best solution I came up with was using MapRunner. With MapRunner I
> can store the highest value in a private member variable. I can read
> through the entire data set and only have to emit one value per mapper
> upon completion of the map data. Then I can specify one reducer and
> carry out the same operation.
>
> Does anyone have a better tactic? I thought a counter could do this,
> but are counters atomic?
>


max value for a dataset

2009-04-18 Thread Edward Capriolo
I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
support max(). I am writing my own max() over a simple
one-column dataset.

The best solution I came up with was using MapRunner. With MapRunner I
can store the highest value in a private member variable. I can read
through the entire data set and only have to emit one value per mapper
upon completion of the map data. Then I can specify one reducer and
carry out the same operation.

Does anyone have a better tactic? I thought a counter could do this,
but are counters atomic?
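
A minimal sketch of the MapRunner route (old "mapred" API; the class name is
hypothetical and it assumes one long value per input line); it would be
registered on the job with JobConf.setMapRunnerClass():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
    // no per-task setup needed for this sketch
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    long max = Long.MIN_VALUE;
    boolean seen = false;

    // Drive the whole split ourselves, keeping only the running maximum.
    while (input.next(key, value)) {
      long current = Long.parseLong(value.toString().trim());
      if (!seen || current > max) {
        max = current;
        seen = true;
      }
      reporter.progress();
    }

    // One output record per mapper; a single reducer then picks the global max.
    if (seen) {
      output.collect(new Text("max"), new LongWritable(max));
    }
  }
}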


Re: Ec2 instability

2009-04-18 Thread Andrew Purtell

Hi,

This is an OS level exception. Your node is out of memory
even to fork a process. 

How many instances do you currently have allocated? Have
you increased the number of instances over time to try and
spread the load of your application around? How many
concurrent mapper and/or reducer processes do you execute
on a node? Can you characterize the memory usage of your
mappers and reducers? Are you running other processes
external to hadoop/hbase which consume a lot of memory? Are
you running Ganglia or similar to track and characterize
resource usage over time? 

You may find you are trying to solve a 100 node problem
with 10.
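
For what it's worth, the per-node knobs that bound task memory here are the
concurrent task slots and the child JVM heap in hadoop-site.xml; the values
below are only illustrative, not a recommendation:

<!-- hadoop-site.xml on each node (illustrative values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- concurrent map tasks per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>  <!-- concurrent reduce tasks per node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>  <!-- heap for each spawned task JVM -->
</property>

Roughly, task JVM memory per node is (map slots + reduce slots) times the
child heap, on top of the DataNode, TaskTracker, and HBase daemons.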

   - Andy

> From: Rakhi Khatwani
> Subject: Re: Ec2 instability
> To: hbase-u...@hadoop.apache.org, core-user@hadoop.apache.org
> Date: Friday, April 17, 2009, 9:44 AM
> Hi,
> this is the exception I have been getting from the MapReduce job:
> 
> java.io.IOException: Cannot run program "bash":
> java.io.IOException:
> error=12, Cannot allocate memory
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>   at org.apache.hadoop.util.Shell.run(Shell.java:134)
>   at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
>   at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:321)
>   at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>   at
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>   at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1199)
>   at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
>   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> Caused by: java.io.IOException: java.io.IOException:
> error=12, Cannot
> allocate memory
>   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>   ... 10 more



  


Using the Stanford NLP with hadoop

2009-04-18 Thread hari939

My project of parsing through material for a semantic search engine requires
me to use the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml)
on a Hadoop cluster.

To use the Stanford NLP parser, one must create a lexical parser object
using an englishPCFG.ser.gz file as a constructor's parameter.
I have tried loading the file onto the Hadoop DFS in the /user/root/ folder
and have also tried packing the file along with the jar of the Java program.

I am new to the Hadoop platform and am not very familiar with some of the
salient features of Hadoop.

Looking forward to any form of help.