Can I startup 2 datanodes on 1 machine?

2008-10-07 Thread Zhou, Yunqing
Here I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a second
disk to every machine.
Can I start up multiple datanodes on one machine? Or do I have to set up each
machine with software RAID configured? (There is no RAID support on the mainboards.)

Thanks


Re: Can I startup 2 datanodes on 1 machine?

2008-10-07 Thread Zhou, Yunqing
Thanks, I will try it then.

On Tue, Oct 7, 2008 at 4:40 PM, Miles Osborne [EMAIL PROTECTED] wrote:

 you can specify multiple data directories in your conf file

 
 dfs.data.dir: Comma-separated list of paths on the local filesystem
 of a DataNode where it should store its blocks.  If this is a
 comma-delimited list of directories, then data will be stored in all
 named directories, typically on different devices.
 

 Miles

 2008/10/7 Zhou, Yunqing [EMAIL PROTECTED]:
   Here I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a
  second disk to every machine.
  Can I start up multiple datanodes on one machine? Or do I have to set up
  each machine with software RAID configured? (There is no RAID support on
  the mainboards.)
 
  Thanks
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



Gets sum of all integers between map tasks

2008-10-07 Thread Edward J. Yoon
I would like to compute the spam probability P(word|category) of the words
from the files of a category (bad/good e-mails) as described below. To
compute it in the reduce step, I need the sum of spamTotal across all map
tasks. How can I get it?

Map:

/**
 * Counts word frequency
 */
public void map(LongWritable key, Text value,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);

  // For every word token
  for (int i = 0; i < tokens.length; i++) {
String word = tokens[i].toLowerCase();
Matcher m = wordregex.matcher(word);
if (m.matches()) {
  spamTotal++;
  output.collect(new Text(word), count);
}
  }
}
  }

Reduce:

  /**
   * Computes bad count / total bad words
   */
  public static class Reduce extends MapReduceBase implements
  Reducer<Text, FloatWritable, Text, FloatWritable> {

public void reduce(Text key, Iterator<FloatWritable> values,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  int sum = 0;
  while (values.hasNext()) {
sum += (int) values.next().get();
  }

  FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
  output.collect(key, badProb);
}
  }


-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Connect to a virtual cluster

2008-10-07 Thread Adrian Fdz.
Hi!

First of all, sorry for my English.

I've been working with Hadoop for the last few weeks, and I wonder whether
there is any virtual cluster I can connect to, from a single machine, in
order to submit jobs.

Thanks


Re: Gets sum of all integers between map tasks

2008-10-07 Thread Miles Osborne
This is a well-known problem: basically, you want to aggregate values
computed at some previous step.

Emit <category, probability> pairs and have the reducer simply sum up
the probabilities for a given category.

(It is the same task as summing up the word counts.)
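
(For reference, a minimal sketch of that idea using the old mapred API seen
elsewhere in this thread; the class name and key/value types here are only
illustrative assumptions, not Edward's actual code:)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The map side would emit <category, count> pairs (e.g. <"spam", 1.0f> per
// word); this reducer then sums the counts emitted by all map tasks.
public class CategorySumReducer extends MapReduceBase
    implements Reducer<Text, FloatWritable, Text, FloatWritable> {
  public void reduce(Text category, Iterator<FloatWritable> values,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    float sum = 0f;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(category, new FloatWritable(sum));
  }
}

(For a single global total such as spamTotal, a Hadoop counter incremented
via Reporter.incrCounter in the map tasks and read from the completed job
is another option.)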

Miles

2008/10/7 Edward J. Yoon [EMAIL PROTECTED]:
 I would like to compute the spam probability P(word|category) of the words
 from the files of a category (bad/good e-mails) as described below. To
 compute it in the reduce step, I need the sum of spamTotal across all map
 tasks. How can I get it?

 Map:

/**
 * Counts word frequency
 */
public void map(LongWritable key, Text value,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);

  // For every word token
for (int i = 0; i < tokens.length; i++) {
String word = tokens[i].toLowerCase();
Matcher m = wordregex.matcher(word);
if (m.matches()) {
  spamTotal++;
  output.collect(new Text(word), count);
}
  }
}
  }

 Reduce:

  /**
   * Computes bad count / total bad words
   */
  public static class Reduce extends MapReduceBase implements
  Reducer<Text, FloatWritable, Text, FloatWritable> {

public void reduce(Text key, Iterator<FloatWritable> values,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  int sum = 0;
  while (values.hasNext()) {
sum += (int) values.next().get();
  }

  FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
  output.collect(key, badProb);
}
  }


 --
 Best regards, Edward J. Yoon
 [EMAIL PROTECTED]
 http://blog.udanax.org




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


graphics in hadoop

2008-10-07 Thread chandra


hi

does hadoop support graphics packages for displaying some images..?


-- 
Best Regards
S.Chandravadana



Re: Can I startup 2 datanodes on 1 machine?

2008-10-07 Thread Miles Osborne
you can specify multiple data directories in your conf file


dfs.data.dir: Comma-separated list of paths on the local filesystem
of a DataNode where it should store its blocks.  If this is a
comma-delimited list of directories, then data will be stored in all
named directories, typically on different devices.
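
(For example, a hadoop-site.xml entry pointing each DataNode at both disks
could look like the sketch below; the /disk1 and /disk2 mount points are
placeholders for your actual paths:)

<property>
  <name>dfs.data.dir</name>
  <!-- blocks are spread across all listed directories -->
  <value>/disk1/hadoop/dfs/data,/disk2/hadoop/dfs/data</value>
</property>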


Miles

2008/10/7 Zhou, Yunqing [EMAIL PROTECTED]:
 Here I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a second
 disk to every machine.
 Can I start up multiple datanodes on one machine? Or do I have to set up each
 machine with software RAID configured? (There is no RAID support on the mainboards.)

 Thanks




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: graphics in hadoop

2008-10-07 Thread Lukáš Vlček
Hi,

Hadoop is a platform for distributed computing. Typically it runs on a
cluster of dedicated servers (though expensive hardware is not required);
as far as I know it is not meant to be a platform for applications running
on a client.
Hadoop is very general and not limited by the nature of the data, which
means you should also be able to process image data.

Regards,
Lukas

On Tue, Oct 7, 2008 at 10:51 AM, chandra 
[EMAIL PROTECTED] wrote:



 hi

 does hadoop support graphics packages for displaying some images..?


 --
 Best Regards
 S.Chandravadana




Re: graphics in hadoop

2008-10-07 Thread chandravadana

And is there any method for creating an image file in Hadoop?


chandravadana wrote:
 
 
 
 hi
 
 does hadoop support graphics packages for displaying some images..?
 
 
 -- 
 Best Regards
 S.Chandravadana
 
 
 




Re: Gets sum of all integers between map tasks

2008-10-07 Thread Edward J. Yoon
Oh-ha, that's simple. :)

/Edward J. Yoon

On Tue, Oct 7, 2008 at 7:14 PM, Miles Osborne [EMAIL PROTECTED] wrote:
 This is a well-known problem: basically, you want to aggregate values
 computed at some previous step.

 Emit <category, probability> pairs and have the reducer simply sum up
 the probabilities for a given category.

 (It is the same task as summing up the word counts.)

 Miles

 2008/10/7 Edward J. Yoon [EMAIL PROTECTED]:
 I would like to compute the spam probability P(word|category) of the words
 from the files of a category (bad/good e-mails) as described below. To
 compute it in the reduce step, I need the sum of spamTotal across all map
 tasks. How can I get it?

 Map:

/**
 * Counts word frequency
 */
public void map(LongWritable key, Text value,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  String line = value.toString();
  String[] tokens = line.split(splitregex);

  // For every word token
for (int i = 0; i < tokens.length; i++) {
String word = tokens[i].toLowerCase();
Matcher m = wordregex.matcher(word);
if (m.matches()) {
  spamTotal++;
  output.collect(new Text(word), count);
}
  }
}
  }

 Reduce:

  /**
   * Computes bad count / total bad words
   */
  public static class Reduce extends MapReduceBase implements
  Reducer<Text, FloatWritable, Text, FloatWritable> {

public void reduce(Text key, Iterator<FloatWritable> values,
OutputCollector<Text, FloatWritable> output, Reporter reporter)
throws IOException {
  int sum = 0;
  while (values.hasNext()) {
sum += (int) values.next().get();
  }

  FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
  output.collect(key, badProb);
}
  }


 --
 Best regards, Edward J. Yoon
 [EMAIL PROTECTED]
 http://blog.udanax.org




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: Weird problem running wordcount example from within Eclipse

2008-10-07 Thread Ski Gh3
I figured out the input directory part: I just needed to add the
$HADOOP_HOME/conf directory to the classpath in Eclipse.

However, I've now run into a new problem: the program complains that it
cannot find the class files for my mapper and reducer.  The error message is
as follows:

08/10/07 10:21:52 WARN mapred.JobClient: No job jar file set.  User classes
may not be found. See JobConf(Class) or JobConf#setJar(String).
08/10/07 10:21:52 INFO mapred.FileInputFormat: Total input paths to process
: 1
08/10/07 10:21:54 INFO mapred.JobClient: Running job: job_200810071020_0001
08/10/07 10:21:55 INFO mapred.JobClient:  map 0% reduce 0%
08/10/07 10:22:01 INFO mapred.JobClient: Task Id :
task_200810071020_0001_m_00_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException:
org.apache.hadoop.examples.WordCount$Reduce
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:633)
at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:768)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:383)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:185)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.apache.hadoop.examples.WordCount$Reduce
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:601)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:625)
... 4 more
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.examples.WordCount$Reduce
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:581)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:599)
... 5 more

I don't know how to fix this.  It seems everyone uses the hadoop jar
command line to run their programs; however, I don't understand why this
won't work, since I am using the JobClient interface for interacting with
Hadoop...

Would really appreciate if anybody can share some experience on this!

Thank you!
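
(For reference, that warning usually means the job was submitted without a
jar for the cluster to ship to the task trackers. A minimal sketch of
setting the jar explicitly from an IDE-launched driver is below; the jar
path is hypothetical and would be an exported jar containing the
mapper/reducer classes:)

import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ToolRunner;

public class EclipseWordCountDriver {
  public static void main(String[] args) throws Exception {
    // JobConf(Class) can only infer the job jar if the class was loaded
    // from a jar; when launching from an IDE's class folder, set it by hand.
    JobConf conf = new JobConf(WordCount.class);
    conf.setJar("/path/to/wordcount-job.jar");  // hypothetical exported jar
    int res = ToolRunner.run(conf, new WordCount(), args);
    System.exit(res);
  }
}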

On Mon, Oct 6, 2008 at 10:48 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

 Hi all,

 I have a weird problem regarding running the wordcount example from
 eclipse.

 I was able to run the wordcount example from the command line like:
 $ ...MyHadoop/bin/hadoop jar ../MyHadoop/hadoop-xx-examples.jar wordcount
 myinputdir myoutputdir

 However, if I try to run the wordcount program from Eclipse (supplying the
 same two args: myinputdir myoutputdir),
 I get the following error message:

 Exception in thread "main" java.lang.RuntimeException: java.io.IOException:
 No FileSystem for scheme: file
 at
 org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
 at
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
 at
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
 at org.apache.hadoop.examples.WordCount.run(WordCount.java:149)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.examples.WordCount.main(WordCount.java:161)
 Caused by: java.io.IOException: No FileSystem for scheme: file
 at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1277)
 at org.apache.hadoop.fs.FileSystem.access$1(FileSystem.java:1273)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
 at
 org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
 ... 5 more

 It seems that, from within Eclipse, the program does not know how to
 interpret myinputdir as a Hadoop path?

 Can someone please tell me how I can fix this?

 Thanks a lot!!!



Re: architecture diagram

2008-10-07 Thread Alex Loddengaard
Thanks for the clarification, Samuel.  I wasn't aware that parts of a line
might be emitted depending on the split while using TextInputFormat.
Terrence, this means that you'll have to take the approach of collecting key
= column_number, value = column_contents in your map step.
Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo [EMAIL PROTECTED] wrote:

 I think the 'split' Alex talked about is the MapReduce system's action;
 the 'split' you described is your mapper's action.

 I guess that your map/reduce application uses *TextInputFormat* to read
 your input file.

 Your input file will first be split into a few splits. Each split looks
 like <filename, offset, length>. What Alex said about 'The location of
 these splits is semi-arbitrary' means that a file split's offset in your
 input file is semi-arbitrary. Am I right, Alex?
 Then *TextInputFormat* will translate these file splits into a sequence of
 <offset, line> pairs, where the offset is treated as the key and the line
 is treated as the value.

 Because these file splits are made by offset, some lines in your file may
 be split across different file splits. The *LineRecordReader* used by
 *TextInputFormat* will skip the partial line at the start of a split, so
 that every mapper gets whole lines, one by one.

 For example, consider a file like this:

 AAA BBB CCC DDD
 EEE FFF GGG HHH
 AAA BBB CCC DDD

 It may be split into two file splits (assume there are two mappers).
 Split one:

 AAA BBB CCC

 Split two:

 DDD
 EEE FFF GGG HHH
 AAA BBB CCC DDD

 Take split two as an example: TextInputFormat will use LineRecordReader to
 translate split two into a sequence of <offset, line> pairs, and it will
 skip the partial first line DDD, so the sequence will be:

 <offset1, EEE FFF GGG HHH>
 <offset2, AAA BBB CCC DDD>

 Then what to do with the lines depends on your job.


 On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi 
 [EMAIL PROTECTED]
  wrote:

  So looking at the following mapper...
 
 
 
 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
 
  On line 32, you can see the row split via a delimiter. On line 43, you
 can
  see that the field index (the column index) is the map key, and the map
  value is the field contents. How is this incorrect? I think this follows
  your earlier suggestion of:
 
  You may want to play with the following idea: collect key =
 column_number
  and value = column_contents in your map step.
 
  Terrence A. Pietrondi
 
 
  --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Monday, October 6, 2008, 12:55 PM
   As far as I know, splits will never be made within a line,
   only between
   rows.  To answer your question about ways to control the
   splits, see below:
  
   http://wiki.apache.org/hadoop/HowManyMapsAndReduces
   
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
   
  
   Alex
  
   On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
   [EMAIL PROTECTED]
wrote:
  
Can you explain The location of these splits is
   semi-arbitrary? What if
the example was...
   
AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH
   
   
Does this mean the split might be between CCC such
   that it results in
AAA|BBB|C and C|DDD for the first line? Is there a way
   to control this
behavior to split on my delimiter?
   
   
Terrence A. Pietrondi
   
   
--- On Sun, 10/5/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Sunday, October 5, 2008, 9:26 PM
 Let's say you have one very large input file
   of the
 form:

 A|B|C|D
 E|F|G|H
 ...
 |1|2|3|4

 This input file will be broken up into N pieces,
   where N is
 the number of
 mappers that run.  The location of these splits
   is
 semi-arbitrary.  This
 means that unless you have one mapper, you
   won't be
 able to see the entire
 contents of a column in your mapper.  Given that
   you would
 need one mapper
 to be able to see the entirety of a column,
   you've now
 essentially reduced
 your problem to a single machine.

 You may want to play with the following idea:
   collect key
 = column_number
 and value = column_contents in your map step.
This
 means that you would be
 able to see the entirety of a column in your
   reduce step,
 though you're
 still faced with the tasks of shuffling and
   re-pivoting.

 Does this clear up your confusion?  Let me know
   if
 you'd like me to clarify
 more.

 Alex

 On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
   Pietrondi
 [EMAIL PROTECTED]
  

Re: graphics in hadoop

2008-10-07 Thread Alex Loddengaard
Hadoop runs Java code, so you can do anything that Java can do.  This
means that you can create and/or analyze images.  However, as Lukas has
said, Hadoop runs on a cluster of computers and is used for data storage and
processing.

If you need to display images, you'd have to take those images off HDFS
(the Hadoop Distributed File System) and onto your local desktop.
Alex

On Tue, Oct 7, 2008 at 3:50 AM, Lukáš Vlček [EMAIL PROTECTED] wrote:

 Hi,

 Hadoop is a platform for distributed computing. Typically it runs on a
 cluster of dedicated servers (though expensive hardware is not required);
 as far as I know it is not meant to be a platform for applications running
 on a client.
 Hadoop is very general and not limited by the nature of the data, which
 means you should also be able to process image data.

 Regards,
 Lukas

 On Tue, Oct 7, 2008 at 10:51 AM, chandra 
 [EMAIL PROTECTED] wrote:

 
 
  hi
 
  does hadoop support graphics packages for displaying some images..?
 
 
  --
  Best Regards
  S.Chandravadana
 
 



Re: Connect to a virtual cluster

2008-10-07 Thread Alex Loddengaard
Amazon EC2 and S3 are probably the easiest way for someone without a cluster
to get jobs running.  Take a look:

EC2:
http://aws.amazon.com/ec2/
http://wiki.apache.org/hadoop/AmazonEC2

S3:
http://aws.amazon.com/s3/
http://wiki.apache.org/hadoop/AmazonS3

Hope this helps.

Alex

On Tue, Oct 7, 2008 at 6:51 AM, Adrian Fdz. [EMAIL PROTECTED] wrote:

 Hi!

 First of all, sorry for my English.

 I've been working with Hadoop for the last few weeks, and I wonder whether
 there is any virtual cluster I can connect to, from a single machine, in
 order to submit jobs.

 Thanks



NoSuchMethodException when running Map Task

2008-10-07 Thread Dan Benjamin

I've got a simple hadoop job running on an EC2 cluster using the scripts
under src/contrib/ec2.  The map tasks all fail with the following error:

08/10/07 15:11:00 INFO mapred.JobClient: Task Id :
attempt_200810071501_0001_m_31_0, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException:
ManifestRetriever$Map.<init>()
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
Caused by: java.lang.NoSuchMethodException:
com.amazon.ec2.ebs.billing.ManifestRetriever$Map.<init>()
at java.lang.Class.getConstructor0(Class.java:2706)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)

I tried adding an explicit (public) no-arg constructor to the
ManifestRetriever.Map class but this gives me the same error.  Has anyone
encountered this problem before?
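
(For reference, this exception means the framework cannot reflectively
construct the Map class with a no-argument constructor. A common cause is
declaring Map as a non-static inner class, whose implicit constructor takes
the enclosing instance. A sketch of the expected shape, with the mapper
types guessed for illustration:)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ManifestRetriever {
  // Must be a static nested class (and public) so that ReflectionUtils can
  // invoke a no-arg constructor on it.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... actual map logic goes here ...
    }
  }
}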




Re: NoSuchMethodException when running Map Task

2008-10-07 Thread Dan Benjamin

Sorry, I should have mentioned that I'm using Hadoop 0.18.1 and Java 1.6.


Dan Benjamin wrote:
 
 I've got a simple hadoop job running on an EC2 cluster using the scripts
 under src/contrib/ec2.  The map tasks all fail with the following error:
 
 08/10/07 15:11:00 INFO mapred.JobClient: Task Id :
 attempt_200810071501_0001_m_31_0, Status : FAILED
 java.lang.RuntimeException: java.lang.NoSuchMethodException:
 ManifestRetriever$Map.<init>()
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
 at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 Caused by: java.lang.NoSuchMethodException:
 com.amazon.ec2.ebs.billing.ManifestRetriever$Map.<init>()
 at java.lang.Class.getConstructor0(Class.java:2706)
 at java.lang.Class.getDeclaredConstructor(Class.java:1985)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
 
 I tried adding an explicit (public) no-arg constructor to the
 ManifestRetriever.Map class but this gives me the same error.  Has anyone
 encountered this problem before?
 
 




dual core configuration

2008-10-07 Thread Elia Mazzawi

hello,

I have some dual-core nodes, and I've noticed Hadoop is only running one
task instance per node, and so is only using one of the CPUs on each node.

Is there a configuration option to tell it to run more than one?
Or do I need to turn each machine into two nodes?

Thanks.


Questions regarding Hive metadata schema

2008-10-07 Thread Alan Gates

Hi,

I've been looking over the db schema that Hive uses to store its
metadata (package.jdo) and I had some questions:


  1.  What do the field names in the TYPES table mean? TYPE1, TYPE2, 
and TYPE_FIELDS are all unclear to me.

  2. In the TBLS (tables) table, what is sd?
  3. What does the SERDES table store?
  4. What does the SORT_ORDER table store? It appears to describe the 
ordering within a storage descriptor, which in turn appears to be 
related to a partition. Do you envision having a table where different 
partitions have different orders?
  5. SDS (storage descriptor) table has a list of columns. Does this 
imply that columnar storage is supported?
  6. What is the relationship between a storage descriptor and a 
partition? 1-1, 1-n?


Thanks.

Alan.


Re: Questions regarding Hive metadata schema

2008-10-07 Thread Prasad Chakka
Hi Alan,

The objects are very closely associated with the Thrift API objects defined
in src/contrib/hive/metastore/if/hive_metastore.thrift. It contains
descriptions of what each field is and should answer most of your questions.
The ORM for this is at s/c/h/metastore/src/java/model/package.jdo.

2) SD is storage descriptor (look at SDS table)
3) SERDES contains information for Hive serializers and deserializers
5) Tables and Partitions have Storage Descriptors. Storage Descriptors
contain physical storage info and how to read the data (serde info). The
Storage Descriptor object actually contains the columns. This means that
different partitions can have different column sets.
6) 1-1

Thanks,
Prasad

From: Alan Gates [EMAIL PROTECTED]
Reply-To: core-user@hadoop.apache.org
Date: Tue, 7 Oct 2008 15:28:50 -0700
To: core-user@hadoop.apache.org
Subject: Questions regarding Hive metadata schema

Hi,

I've been looking over the db schema that Hive uses to store its
metadata (package.jdo) and I had some questions:

   1.  What do the field names in the TYPES table mean? TYPE1, TYPE2,
and TYPE_FIELDS are all unclear to me.
   2. In the TBLS (tables) table, what is sd?
   3. What does the SERDES table store?
   4. What does the SORT_ORDER table store? It appears to describe the
ordering within a storage descriptor, which in turn appears to be
related to a partition. Do you envision having a table where different
partitions have different orders?
   5. SDS (storage descriptor) table has a list of columns. Does this
imply that columnar storage is supported?
   6. What is the relationship between a storage descriptor and a
partition? 1-1, 1-n?

Thanks.

Alan.




Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Stefan Groschupf

Try JMX. There should also be a JMX-to-SNMP bridge available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:


Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

If any of you have experience here, do you have any approach or advice I
could use?

So far I've only been playing with the JSP pages that Hadoop has built in,
so I'm not sure whether it would be a good idea for Nagios to request
monitoring info from those JSPs.


Thanks in advance!


-- Gerardo




Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Brian Bockelman

Hey Stefan,

Is there any documentation for making JMX working in Hadoop?

Brian

On Oct 7, 2008, at 7:03 PM, Stefan Groschupf wrote:


try jmx. There should be also jmx to snmp available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:


Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

If any of you have experience here, do you have any approach or advice I
could use?

So far I've only been playing with the JSP pages that Hadoop has built in,
so I'm not sure whether it would be a good idea for Nagios to request
monitoring info from those JSPs.


Thanks in advance!


-- Gerardo




Re: Questions regarding Hive metadata schema

2008-10-07 Thread Jeff Hammerbacher
For translation purposes, SerDes in Hive correspond to
StoreFunc/LoadFunc pairs in Pig and Producer/Extractor pairs in
SCOPE.

I claim SCOPE's terminology is the most elegant and we should all
standardize on their terminology, in this case at least. Joy claims
that SerDe is a common term in the hardware community. Since Hive was
mainly intended for hardware developers, ...wait a second, that's not
right.

(seriously though, we need some way to keep these things straight, and
being able to reuse serialization/deserialization libraries would be
nice).

On Tue, Oct 7, 2008 at 3:49 PM, Prasad Chakka [EMAIL PROTECTED] wrote:
 Hi Alan,

 The objects are very closely associated with the Thrift API objects defined
 in src/contrib/hive/metastore/if/hive_metastore.thrift. It contains
 descriptions of what each field is and should answer most of your questions.
 The ORM for this is at s/c/h/metastore/src/java/model/package.jdo.

 2) SD is storage descriptor (look at SDS table)
 3) SERDES contains information for Hive serializers and deserializers
 5) Tables and Partitions have Storage Descriptors. Storage Descriptors
 contain physical storage info and how to read the data (serde info). Storage
 Description object actually contains the columns. This means that different
 partitions can have different column sets
 6) 1-1

 Thanks,
 Prasad

 From: Alan Gates [EMAIL PROTECTED]
 Reply-To: core-user@hadoop.apache.org
 Date: Tue, 7 Oct 2008 15:28:50 -0700
 To: core-user@hadoop.apache.org
 Subject: Questions regarding Hive metadata schema

 Hi,

 I've been looking over the db schema that Hive uses to store its
 metadata (package.jdo) and I had some questions:

   1.  What do the field names in the TYPES table mean? TYPE1, TYPE2,
 and TYPE_FIELDS are all unclear to me.
   2. In the TBLS (tables) table, what is sd?
   3. What does the SERDES table store?
   4. What does the SORT_ORDER table store? It appears to describe the
 ordering within a storage descriptor, which in turn appears to be
 related to a partition. Do you envision having a table where different
 partitions have different orders?
   5. SDS (storage descriptor) table has a list of columns. Does this
 imply that columnar storage is supported?
   6. What is the relationship between a storage descriptor and a
 partition? 1-1, 1-n?

 Thanks.

 Alan.





Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread 何永强
Hadoop already has JMX integrated; you can extend it to implement whatever
you want to monitor, which requires modifying some code to add counters or
the like.
One thing to be aware of: Hadoop does not include a JMXConnectorServer, so
you need to start one JMXConnectorServer for every Hadoop process you want
to monitor.
This is what we have done on Hadoop to monitor it. We have not checked out
Nagios for Hadoop, so I can't speak to Nagios.

Hope it helps.
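
(A minimal sketch of starting such a connector server inside a JVM, using
the standard javax.management.remote API; the port is a placeholder:)

import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class JmxConnectorBootstrap {
  public static void start(int port) throws Exception {
    LocateRegistry.createRegistry(port);  // RMI registry for the connector
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:" + port + "/jmxrmi");
    JMXConnectorServer server =
        JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);
    server.start();  // now reachable by jconsole or a Nagios JMX check
  }
}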
On 2008-10-08, at 8:34 AM, Brian Bockelman wrote:


Hey Stefan,

Is there any documentation for making JMX working in Hadoop?

Brian

On Oct 7, 2008, at 7:03 PM, Stefan Groschupf wrote:


try jmx. There should be also jmx to snmp available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:


Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

If any of you have experience here, do you have any approach or advice I
could use?

So far I've only been playing with the JSP pages that Hadoop has built in,
so I'm not sure whether it would be a good idea for Nagios to request
monitoring info from those JSPs.


Thanks in advance!


-- Gerardo








Re: IPC Client error | Too many files open

2008-10-07 Thread 何永强

Try updating the JDK to 1.6; there is a bug in JDK 1.5 related to NIO.
On 2008-09-26, at 7:29 PM, Goel, Ankur wrote:


Hi Folks,

We have developed a simple log writer in Java that is plugged into the
Apache custom log and writes log entries directly to our Hadoop cluster
(50 machines, quad core, each with 16 GB RAM and an 800 GB hard disk; 1
machine as a dedicated Namenode and another machine as JobTracker +
TaskTracker + DataNode).

There are around 8 Apache servers dumping logs into HDFS via our  
writer.

Everything was working fine and we were getting around 15 - 20 MB log
data per hour from each server.



Recently we have been experiencing problems with 2-3 of our Apache
servers, where a file is opened by the log writer in HDFS for writing but
never receives any data.

Looking at the Apache error logs shows the following errors:

08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
    at sun.nio.ch.IOUtil.initPipe(Native Method)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)


...

...

This is followed by connection errors saying:

Retrying to connect to server: hadoop-server.com:9000. Already tried
'n' times.

(same as above) ...

and it keeps retrying constantly (the log writer is set up so that it
waits and retries).



Doing an lsof on the log writer Java process shows that it got stuck on a
lot of pipe/eventpoll descriptors and eventually ran out of file handles.

Below is part of the lsof output:



lsof -p 2171
COMMAND  PID   USER  FD   TYPE  DEVICE  SIZE  NODE      NAME
java     2171  root  20r  FIFO  0,7           24090207  pipe
java     2171  root  21w  FIFO  0,7           24090207  pipe
java     2171  root  22r         0,8     0     24090208  eventpoll
java     2171  root  23r  FIFO  0,7           23323281  pipe
java     2171  root  24r  FIFO  0,7           23331536  pipe
java     2171  root  25w  FIFO  0,7           23306764  pipe
java     2171  root  26r         0,8     0     23306765  eventpoll
java     2171  root  27r  FIFO  0,7           23262160  pipe
java     2171  root  28w  FIFO  0,7           23262160  pipe
java     2171  root  29r         0,8     0     23262161  eventpoll
java     2171  root  30w  FIFO  0,7           23299329  pipe
java     2171  root  31r         0,8     0     23299330  eventpoll
java     2171  root  32w  FIFO  0,7           23331536  pipe
java     2171  root  33r  FIFO  0,7           23268961  pipe
java     2171  root  34w  FIFO  0,7           23268961  pipe
java     2171  root  35r         0,8     0     23268962  eventpoll
java     2171  root  36w  FIFO  0,7           23314889  pipe

...

...

...

What in the DFS client (if anything) could have caused this? Could it be
something else?

Is it not ideal to use an HDFS writer to write logs directly from Apache
into HDFS?

Is Chukwa (the Hadoop log collection and analysis framework contributed by
Yahoo) a better fit for our case?



I would highly appreciate help on any or all of the above questions.



Thanks and Regards

-Ankur





Re: dual core configuration

2008-10-07 Thread Taeho Kang
You can have each node (tasktracker) run more than one task
simultaneously.
Set the mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum properties in the
hadoop-site.xml file. You should adjust hadoop-site.xml on all your
slave nodes depending on how many cores each slave has; for example, you
don't really want to have 8 tasks running at once on a 2-core machine.
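
(For example, on a dual-core slave the hadoop-site.xml entries might look
like the sketch below; the values are just one reasonable choice:)

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- one map slot per core on a dual-core box -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>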

/Taeho

On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi
[EMAIL PROTECTED]wrote:

 hello,

 I have some dual-core nodes, and I've noticed Hadoop is only running one
 task instance per node, and so is only using one of the CPUs on each node.
 Is there a configuration option to tell it to run more than one?
 Or do I need to turn each machine into two nodes?

 Thanks.



Re: dual core configuration

2008-10-07 Thread Alex Loddengaard
Taeho, I was going to suggest this change as well, but it's documented that
mapred.tasktracker.map.tasks.maximum defaults to 2.  Can you explain why
Elia is only seeing one core utilized when this config option is set to 2?

Here is the documentation I'm referring to:
http://hadoop.apache.org/core/docs/r0.18.1/cluster_setup.html

Alex

On Tue, Oct 7, 2008 at 8:27 PM, Taeho Kang [EMAIL PROTECTED] wrote:

 You can have each node (tasktracker) run more than one task
 simultaneously.
 Set the mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum properties in the
 hadoop-site.xml file. You should adjust hadoop-site.xml on all your
 slave nodes depending on how many cores each slave has; for example, you
 don't really want to have 8 tasks running at once on a 2-core machine.

 /Taeho

 On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi
 [EMAIL PROTECTED]wrote:

  hello,
 
  I have some dual-core nodes, and I've noticed Hadoop is only running one
  task instance per node, and so is only using one of the CPUs on each node.
  Is there a configuration option to tell it to run more than one?
  Or do I need to turn each machine into two nodes?
 
  Thanks.