Can I start up 2 datanodes on 1 machine?
Here I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a second disk to every machine. Can I start up multiple datanodes on one machine, or do I have to set up each machine with software RAID (there is no RAID support on the motherboards)? Thanks
Re: Can I start up 2 datanodes on 1 machine?
Thanks, I will try it then.

On Tue, Oct 7, 2008 at 4:40 PM, Miles Osborne [EMAIL PROTECTED] wrote: [...]
Getting the sum of all integers across map tasks
I would like to get the spam probability P(word|category) of the words from files of each category (bad/good e-mails), as described below. BTW, to compute it in the reduce step, I need the sum of spamTotal across map tasks. How can I get it?

Map:

    /**
     * Counts word frequency
     */
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);
      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          spamTotal++;
          output.collect(new Text(word), count);
        }
      }
    }

Reduce:

    /**
     * Computes bad count / total bad words
     */
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, FloatWritable, Text, FloatWritable> {
      public void reduce(Text key, Iterator<FloatWritable> values,
                         OutputCollector<Text, FloatWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += (int) values.next().get();
        }
        FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
        output.collect(key, badProb);
      }
    }

--
Best regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org
Connect to a virtual cluster
Hi! First of all, sorry for my English. I've been working with Hadoop for the last few weeks, and I wonder whether there is any virtual cluster I can connect to from a single machine in order to submit jobs. Thanks
Re: Getting the sum of all integers across map tasks
This is a well-known problem: basically, you want to aggregate values computed at some previous step. Emit <category, probability> pairs and have the reducer simply sum up the probabilities for a given category (it is the same task as summing up the word counts).

Miles

2008/10/7 Edward J. Yoon [EMAIL PROTECTED]: [...]
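A minimal sketch of that aggregation idea, written against the old mapred API used elsewhere in this thread. The "__total__" marker key and the whitespace tokenizing are assumptions for illustration: every word occurrence is also counted under the marker key, so one reducer group ends up holding the grand total (a second pass, or reading that single record, is still needed to divide per-word counts by it):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class SpamCounts {
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private static final Text TOTAL = new Text("__total__"); // hypothetical marker key

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
          for (String token : value.toString().split("\\s+")) {
            output.collect(new Text(token.toLowerCase()), ONE);
            output.collect(TOTAL, ONE); // each occurrence also feeds the grand total
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
          long sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          // Emits per-word counts, plus one record holding the total under "__total__".
          output.collect(key, new LongWritable(sum));
        }
      }
    }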
graphics in hadoop
Hi, does Hadoop support graphics packages for displaying images?

--
Best Regards, S. Chandravadana
Re: Can I start up 2 datanodes on 1 machine?
You can specify multiple data directories in your conf file via dfs.data.dir:

    Comma-separated list of paths on the local filesystem of a DataNode
    where it should store its blocks. If this is a comma-delimited list
    of directories, then data will be stored in all named directories,
    typically on different devices.

Miles

2008/10/7 Zhou, Yunqing [EMAIL PROTECTED]: [...]
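For example, a minimal hadoop-site.xml entry, assuming the second disk is mounted at /mnt/disk2 (both paths are placeholders):

    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data</value>
    </property>

The single DataNode then spreads blocks across both directories, so no second datanode process and no software RAID is needed.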
Re: graphics in hadoop
Hi, Hadoop is a platform for distributed computing. Typically it runs on a cluster of dedicated servers (though expensive hardware is not required); as far as I know, it is not meant to be a platform for applications running on a client. Hadoop is very general and not limited by the nature of the data, which means you should be able to process image data as well.

Regards, Lukas

On Tue, Oct 7, 2008 at 10:51 AM, chandra [EMAIL PROTECTED] wrote: [...]
Re: graphics in hadoop
And is there any method for creating an image file in Hadoop?

--
Best Regards, S. Chandravadana
Re: Getting the sum of all integers across map tasks
Oh-ha, that's simple. :)

/Edward J. Yoon

On Tue, Oct 7, 2008 at 7:14 PM, Miles Osborne [EMAIL PROTECTED] wrote: [...]
Re: Weird problem running wordcount example from within Eclipse
I figured out the input directory part: I just needed to add the $HADOOP_HOME/conf directory to the classpath in Eclipse. However, I've now run into a new problem: the program complains that it cannot find the class files for my mapper and reducer! The error message is as follows:

    08/10/07 10:21:52 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    08/10/07 10:21:52 INFO mapred.FileInputFormat: Total input paths to process : 1
    08/10/07 10:21:54 INFO mapred.JobClient: Running job: job_200810071020_0001
    08/10/07 10:21:55 INFO mapred.JobClient: map 0% reduce 0%
    08/10/07 10:22:01 INFO mapred.JobClient: Task Id : task_200810071020_0001_m_00_0, Status : FAILED
    java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.examples.WordCount$Reduce
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:633)
        at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:768)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:383)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:185)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
    Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.examples.WordCount$Reduce
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:601)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:625)
        ... 4 more
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.examples.WordCount$Reduce
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:581)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:599)
        ... 5 more

I don't know how to fix this. It seems everyone uses the hadoop jar command line for running the program; however, I don't understand why this won't work, since I am using the JobClient interface for interacting with Hadoop... I would really appreciate it if anybody could share some experience on this! Thank you!

On Mon, Oct 6, 2008 at 10:48 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

Hi all, I have a weird problem regarding running the wordcount example from Eclipse. I was able to run the wordcount example from the command line like:

    $ ...MyHadoop/bin/hadoop jar ../MyHadoop/hadoop-xx-examples.jar wordcount myinputdir myoutputdir

However, if I try to run the wordcount program from Eclipse (supplying the same two args: myinputdir myoutputdir), I get the following error message:

    Exception in thread "main" java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: file
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
        at org.apache.hadoop.examples.WordCount.run(WordCount.java:149)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:161)
    Caused by: java.io.IOException: No FileSystem for scheme: file
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1277)
        at org.apache.hadoop.fs.FileSystem.access$1(FileSystem.java:1273)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
        ... 5 more

It seems that from within Eclipse, the program does not know how to interpret myinputdir as a Hadoop path? Can someone please tell me how I can fix this? Thanks a lot!!!
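One common fix, sketched here under the assumption that the job classes are packaged into a jar you have built (the jar path is a placeholder): the "No job jar file set" warning means the tasks running on the cluster have no jar from which to load your classes, so tell the JobConf where to find them before submitting, as the warning itself suggests:

    JobConf conf = new JobConf(WordCount.class); // Hadoop locates the jar containing this class
    // ...or point at a pre-built jar explicitly:
    conf.setJar("/path/to/wordcount.jar"); // hypothetical path; build this jar first

When launching from Eclipse, the classes are loose .class files on the local classpath, which the remote TaskTrackers cannot see; packaging them into a jar and setting it as above is exactly what "hadoop jar" does for you on the command line.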
Re: architecture diagram
Thanks for the clarification, Samuel. I wasn't aware that a line might be cut across splits when using TextInputFormat. Terrence, this means that you'll have to take the approach of collecting key = column_number, value = column_contents in your map step.

Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo [EMAIL PROTECTED] wrote:

I think what Alex meant by 'split' is the MapReduce system's action; what you meant by 'split' is your mapper's action. I guess your map/reduce application uses *TextInputFormat* to read your input file. Your input file will first be split into a few splits, each of which looks like <filename, offset, length>. What Alex said about 'the location of these splits is semi-arbitrary' means that a file split's offset into your input file is semi-arbitrary. Am I right, Alex?

Then *TextInputFormat* translates these file splits into a sequence of lines, where the offset is treated as the key and the line is treated as the value. Because file splits are cut by offset, some lines in your file may be cut across different file splits. The *LineRecordReader* used by *TextInputFormat* skips the partial line at the start of a split, to make sure that every mapper gets whole lines, one by one.

For example, take a file like this:

    AAA BBB CCC DDD
    EEE FFF GGG HHH
    AAA BBB CCC DDD

It may be split into two file splits (assuming there are two mappers):

    split one: "AAA BBB CCC "
    split two: "DDD\nEEE FFF GGG HHH\nAAA BBB CCC DDD"

Take split two as an example: TextInputFormat will use LineRecordReader to translate it into a sequence of <offset, line> pairs, skipping the partial first line "DDD". So the sequence will be:

    <offset1, "EEE FFF GGG HHH">
    <offset2, "AAA BBB CCC DDD">

What you then do with the lines depends on your job.

On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of: "You may want to play with the following idea: collect key = column_number and value = column_contents in your map step."

Terrence A. Pietrondi

--- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html

Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

Can you explain "the location of these splits is semi-arbitrary"? What if the example was...

    AAA|BBB|CCC|DDD
    EEE|FFF|GGG|HHH

Does this mean the split might fall inside CCC, so that the first line ends up as "AAA|BBB|C" and "C|DDD"? Is there a way to control this behavior to split on my delimiter?

Terrence A. Pietrondi

--- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

Let's say you have one very large input file of the form:

    A|B|C|D
    E|F|G|H
    ...
    |1|2|3|4

This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine.

You may want to play with the following idea: collect key = column_number and value = column_contents in your map step (see the sketch after this message). This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting.

Does this clear up your confusion? Let me know if you'd like me to clarify more.

Alex
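A minimal sketch of that map step, assuming pipe-delimited rows like the example above (the class name, delimiter, and types are illustrative, not Terrence's actual PivotMapper):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ColumnMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<IntWritable, Text> output, Reporter reporter)
          throws IOException {
        // TextInputFormat guarantees 'value' is one complete row,
        // so splitting on the delimiter here is safe.
        String[] fields = value.toString().split("\\|");
        for (int i = 0; i < fields.length; i++) {
          output.collect(new IntWritable(i), new Text(fields[i])); // key = column number
        }
      }
    }

Each reducer then receives all the values of one column (in arbitrary row order), which is where the shuffling and re-pivoting work Alex mentions begins.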
Re: graphics in hadoop
Hadoop runs Java code, so you can do anything that Java can do; this means that you can create and/or analyze images. However, as Lukas said, Hadoop runs on a cluster of computers and is used for data storage and processing. If you need to display images, then you'd have to take those images off HDFS (the Hadoop distributed filesystem) and onto your local desktop.

Alex

On Tue, Oct 7, 2008 at 3:50 AM, Lukáš Vlček [EMAIL PROTECTED] wrote: [...]
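For example, pulling an image out of HDFS to the local machine with the stock filesystem shell (both paths are placeholders):

    bin/hadoop fs -get /user/hadoop/images/photo.png ./photo.png

from where any ordinary desktop viewer can display it.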
Re: Connect to a virtual cluster
Amazon EC2 and S3 are probably the easiest way for someone without a cluster to get jobs running. Take a look:

EC2: http://aws.amazon.com/ec2/ and http://wiki.apache.org/hadoop/AmazonEC2
S3: http://aws.amazon.com/s3/ and http://wiki.apache.org/hadoop/AmazonS3

Hope this helps.

Alex

On Tue, Oct 7, 2008 at 6:51 AM, Adrian Fdz. [EMAIL PROTECTED] wrote: [...]
NoSuchMethodException when running Map Task
I've got a simple Hadoop job running on an EC2 cluster using the scripts under src/contrib/ec2. The map tasks all fail with the following error:

    08/10/07 15:11:00 INFO mapred.JobClient: Task Id : attempt_200810071501_0001_m_31_0, Status : FAILED
    java.lang.RuntimeException: java.lang.NoSuchMethodException: ManifestRetriever$Map.<init>()
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
    Caused by: java.lang.NoSuchMethodException: com.amazon.ec2.ebs.billing.ManifestRetriever$Map.<init>()
        at java.lang.Class.getConstructor0(Class.java:2706)
        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)

I tried adding an explicit (public) no-arg constructor to the ManifestRetriever.Map class, but this gives me the same error. Has anyone encountered this problem before?
Re: NoSuchMethodException when running Map Task
Sorry, I should have mentioned I'm using Hadoop version 0.18.1 and Java 1.6.

Dan Benjamin wrote: [...]
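One common cause of this error, offered as an assumption since the class definition isn't shown: if Map is an inner class of ManifestRetriever, it must be declared static, because reflection cannot invoke the no-arg constructor of a non-static inner class without an enclosing instance. A minimal sketch (the Mapper type parameters and body are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ManifestRetriever {
      // 'static' is the key detail: without it, Map has no constructor
      // that ReflectionUtils.newInstance can invoke, producing exactly
      // the NoSuchMethodException: ...$Map.<init>() seen above.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          output.collect(new Text("line"), value); // placeholder logic
        }
      }
    }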
dual core configuration
Hello, I have some dual-core nodes, and I've noticed Hadoop is only running one task instance, and so is only using one of the CPUs on each node. Is there a configuration to tell it to run more than one, or do I need to turn each machine into two nodes? Thanks.
Questions regarding Hive metadata schema
Hi, I've been looking over the DB schema that Hive uses to store its metadata (package.jdo), and I have some questions:

1. What do the field names in the TYPES table mean? TYPE1, TYPE2, and TYPE_FIELDS are all unclear to me.
2. In the TBLS (tables) table, what is "sd"?
3. What does the SERDES table store?
4. What does the SORT_ORDER table store? It appears to describe the ordering within a storage descriptor, which in turn appears to be related to a partition. Do you envision having a table where different partitions have different orders?
5. The SDS (storage descriptor) table has a list of columns. Does this imply that columnar storage is supported?
6. What is the relationship between a storage descriptor and a partition? 1-1, 1-n?

Thanks. Alan.
Re: Questions regarding Hive metadata schema
Hi Alan, the objects are very closely associated with the Thrift API objects defined in src/contrib/hive/metastore/if/hive_metastore.thrift. It contains descriptions of what each field is, and it should answer most of your questions. The ORM for this is at s/c/h/metastore/src/java/model/package.jdo.

2) SD is storage descriptor (look at the SDS table).
3) SERDES contains information for Hive serializers and deserializers.
5) Tables and partitions have storage descriptors. Storage descriptors contain physical storage info and how to read the data (serde info). The storage descriptor object actually contains the columns, which means that different partitions can have different column sets.
6) 1-1.

Thanks, Prasad

From: Alan Gates [EMAIL PROTECTED], Tue, 7 Oct 2008 15:28:50 -0700, Subject: Questions regarding Hive metadata schema: [...]
Re: nagios to monitor hadoop datanodes!
Try JMX. There should also be a JMX-to-SNMP bridge available somewhere: http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp

~~~ 101tec Inc., Menlo Park, California
web: http://www.101tec.com
blog: http://www.find23.net

On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:

Hi everyone! I would like to implement Nagios health monitoring of a Hadoop grid. If some of you have experience here, do you have any approach or advice I could use? So far I've only been playing with the JSPs that Hadoop has built in, so I'm not sure whether it's a good idea for Nagios to request monitoring info from those JSPs. Thanks in advance! -- Gerardo
Re: nagios to monitor hadoop datanodes!
Hey Stefan, is there any documentation on getting JMX working in Hadoop?

Brian

On Oct 7, 2008, at 7:03 PM, Stefan Groschupf wrote: [...]
Re: Questions regarding Hive metadata schema
For translation purposes, SerDes in Hive correspond to StoreFunc/LoadFunc pairs in Pig and Producer/Extractor pairs in SCOPE. I claim SCOPE's terminology is the most elegant and we should all standardize on their terminology, in this case at least. Joy claims that SerDe is a common term in the hardware community. Since Hive was mainly intended for hardware developers, ...wait a second, that's not right. (Seriously though, we need some way to keep these things straight, and being able to reuse serialization/deserialization libraries would be nice.)

On Tue, Oct 7, 2008 at 3:49 PM, Prasad Chakka [EMAIL PROTECTED] wrote: [...]
Re: nagios to monitor hadoop datanodes!
Hadoop already has JMX integrated; you can extend it to implement what you want to monitor, which requires modifying some code to add counters or the like. One thing you need to be aware of is that Hadoop does not include any JMXConnectorServer, so you need to start one JMXConnectorServer for every Hadoop process you want to monitor. This is what we have done on Hadoop to monitor it. We have not checked out Nagios for Hadoop, so no word on Nagios. Hope it helps.

On Oct 8, 2008, at 8:34 AM, Brian Bockelman wrote: [...]
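A minimal hadoop-env.sh sketch for exposing a daemon's JMX beans remotely, using the standard JVM management flags; the port number is a placeholder, and disabling authentication/SSL is for testing only:

    # Expose JMX on the DataNode JVM (test setup: no auth, no SSL)
    export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=8004 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false $HADOOP_DATANODE_OPTS"

With that in place, a JMX client (jconsole, or a Nagios JMX check) can connect to datanode-host:8004 and read the exposed beans.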
Re: IPC Client error | Too many files open
Try updating the JDK to 1.6; there is a bug in JDK 1.5's NIO.

On Sep 26, 2008, at 7:29 PM, Goel, Ankur wrote:

Hi folks, we have developed a simple log writer in Java that is plugged into Apache custom log and writes log entries directly to our Hadoop cluster (50 machines, quad core, each with 16 GB RAM and 800 GB hard disk; 1 machine as dedicated NameNode, another machine as JobTracker, TaskTracker + DataNode). There are around 8 Apache servers dumping logs into HDFS via our writer. Everything was working fine and we were getting around 15-20 MB of log data per hour from each server.

Recently we have been experiencing problems with 2-3 of our Apache servers, where a file is opened by the log writer in HDFS for writing but never receives any data. Looking at the Apache error logs shows the following errors:

    08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
    ...

followed by connection errors saying "Retrying to connect to server: hadoop-server.com:9000. Already tried 'n' times," retrying constantly (the log writer is set up so that it waits and retries). Doing an lsof on the log-writer Java process shows that it got stuck in a lot of pipe/event polls and eventually ran out of file handles. Below is part of the lsof output:

    lsof -p 2171
    COMMAND  PID   USER  FD   TYPE  DEVICE  SIZE  NODE      NAME
    java     2171  root  20r  FIFO  0,7           24090207  pipe
    java     2171  root  21w  FIFO  0,7           24090207  pipe
    java     2171  root  22r         0,8     0    24090208  eventpoll
    java     2171  root  23r  FIFO  0,7           23323281  pipe
    java     2171  root  24r  FIFO  0,7           23331536  pipe
    java     2171  root  25w  FIFO  0,7           23306764  pipe
    java     2171  root  26r         0,8     0    23306765  eventpoll
    java     2171  root  27r  FIFO  0,7           23262160  pipe
    java     2171  root  28w  FIFO  0,7           23262160  pipe
    java     2171  root  29r         0,8     0    23262161  eventpoll
    java     2171  root  30w  FIFO  0,7           23299329  pipe
    java     2171  root  31r         0,8     0    23299330  eventpoll
    java     2171  root  32w  FIFO  0,7           23331536  pipe
    java     2171  root  33r  FIFO  0,7           23268961  pipe
    java     2171  root  34w  FIFO  0,7           23268961  pipe
    java     2171  root  35r         0,8     0    23268962  eventpoll
    java     2171  root  36w  FIFO  0,7           23314889  pipe
    ...

What in the DFS client (if anything) could have caused this? Could it be something else? Is it not ideal to use an HDFS writer to write logs directly from Apache into HDFS? Is Chukwa (the Hadoop log collection and analysis framework contributed by Yahoo) a better fit for our case? I would highly appreciate help on any or all of the above questions.

Thanks and regards, -Ankur
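Independent of the JDK bug, it is worth checking the per-process file-descriptor limit on the writer machines; the numbers below are illustrative:

    ulimit -n          # show the current open-file limit for this shell
    ulimit -n 65536    # raise it for the current session (if permitted)
    # To make it persistent, add lines like these to /etc/security/limits.conf
    # (assuming the writer runs as user 'apache'):
    #   apache  soft  nofile  65536
    #   apache  hard  nofile  65536

A leak like the one in the lsof output will still exhaust any limit eventually, but a higher ceiling buys time and rules out an overly low default.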
Re: dual core configuration
You can have your node (TaskTracker) running more than one task simultaneously by setting the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties found in the hadoop-site.xml file. You should change hadoop-site.xml on all your slave nodes depending on how many cores each slave has; for example, you don't really want 8 tasks running at once on a 2-core machine.

/Taeho

On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi [EMAIL PROTECTED] wrote: [...]
Re: dual core configuration
Taeho, I was going to suggest this change as well, but it's documented that mapred.tasktracker.map.tasks.maximum defaults to 2. Can you explain why Elia is only having one core utilized when this config option is set to 2? Here is the documentation I'm referring to:

http://hadoop.apache.org/core/docs/r0.18.1/cluster_setup.html

Alex

On Tue, Oct 7, 2008 at 8:27 PM, Taeho Kang [EMAIL PROTECTED] wrote: [...]
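For reference, a minimal hadoop-site.xml sketch for a dual-core node; the values are illustrative and, per the discussion above, 2 is already the documented default:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

Note that these are TaskTracker settings, so they take effect after the slave's TaskTracker is restarted; a job also needs enough map/reduce tasks overall for both slots to be occupied at once.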