Problem with LibHDFS
Hi, I am able to get Hadoop running and can also compile libhdfs. But when I run the hdfs_test program it gives a segmentation fault. Even a small program like this: #include <hdfs.h> int main() { return(0); } compiled using the command gcc -ggdb -m32 -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include/ hdfs_test.c -L/garl/garl-alpha1/home1/raghu/Desktop/hadoop-0.15.3/libhdfs -lhdfs -L/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/lib/i386/server -ljvm -shared -m32 -Wl,-x -o hdfs_test crashes: running hdfs_test gives a segmentation fault. Please tell me how to fix it. -- Regards, Raghavendra K
Re: Problem with LibHDFS
Since you are compiling a C(++) program, why not add the -g switch and run it within gdb: that will tell people which line it crashes at (etc etc) Miles On 21/02/2008, Raghavendra K [EMAIL PROTECTED] wrote: Hi, I am able to get Hadoop running and also able to compile the libhdfs. But when I run the hdfs_test program it is giving Segmentation Fault. Just a small program like this #include <hdfs.h> int main() { return(0); } and compiled using the command gcc -ggdb -m32 -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include/ hdfs_test.c -L/garl/garl-alpha1/home1/raghu/Desktop/hadoop-0.15.3/libhdfs -lhdfs -L/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/lib/i386/server -ljvm -shared -m32 -Wl,-x -o hdfs_test running hdfs_test gives segmentation fault. please tell me as to how to fix it. -- Regards, Raghavendra K -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Add your project or company to the powered by page?
The New York Times / nytimes.com -large scale image conversions -http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ On Thu, Feb 21, 2008 at 1:26 AM, Eric Baldeschwieler [EMAIL PROTECTED] wrote: Hi Folks, Let's get the word out that Hadoop is being used and is useful in your organizations, ok? Please add yourselves to the Hadoop powered by page, or reply to this email with what details you would like to add and I'll do it. http://wiki.apache.org/hadoop/PoweredBy Thanks! E14 --- eric14 a.k.a. Eric Baldeschwieler senior director, grid computing Yahoo! Inc.
java error
Hello, As per my earlier mails I could not deploy Nutch on Linux. Now I am attempting the same using cygwin as per the tutorial by Peter Wang. Can someone from the list help me resolve the attached error? At least on Linux I could run the crawl. java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) Exception in thread "main" P.S. I have sent this mail to nutch-users as well but so far no response from their end. I am a writer and not technical enough to debug these errors. Regards, Jaya
Question on metrics via ganglia
We have modified our metrics file, distributed it, and restarted our cluster. We have gmond running on the nodes, and a machine on the VLAN with gmetad running. We have statistics for the machines in the web UI, and our statistics reported by the gmetric program are present. We don't see any Hadoop reporting. Clearly we have something basic wrong in our understanding of how to set this up.

# Configuration of the dfs context for null
# dfs.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the dfs context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log
# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649
# Configuration of the mapred context for null
# mapred.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the mapred context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log
# Configuration of the mapred context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649
# Configuration of the jvm context for null
# jvm.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the jvm context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the jvm context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=localhost:8649

-- Jason Venner Attributor - Publish with Confidence http://www.attributor.com/ Attributor is hiring Hadoop Wranglers, contact if interested
Re: Hadoop summit / workshop at Yahoo!
On Wed, 20 Feb 2008 12:10:09 PST, Ajay Anand wrote: The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/ ... Agenda: Ajay, when we talked about the summit on the phone, you were considering having a poster session. I don't see that listed. Should I assume it's no longer planned? Thanks, -John
Re: Questions about namenode and JobTracker configuration.
Zhang, jian wrote: Hi, All I have a small question about configuration. On the Hadoop documentation page, it says: Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and TaskTracker and are referred to as slaves. Does that mean the JobTracker, like the NameNode, is not a slave? JobTracker and NameNode are daemons on a machine (frequently called masters). The master node can also act as a slave node. JobTracker and NameNode basically do the book-keeping/scheduling work. On a large cluster the load on the JobTracker/NameNode is usually high. Hence it's recommended to run these daemons on a separate machine, but this is not mandatory. NameNode and DataNode form the HDFS. Since the JobTracker needs to interact with the TaskTracker, which resides in HDFS, TaskTracker and DataNodes are processes on the slave nodes. TaskTracker communicates with the JobTracker while DataNode communicates with the NameNode. The DFS is designed in such a way that it can function without mapreduce, just for distributed storage. The TaskTracker never communicates with the NameNode. It's the JobTracker that does. Mostly the TaskTracker concentrates on doing the work locally, i.e. spawning JVMs for doing the maps. Amar to make the communication easier, I think it should be at least part of the HDFS. Best Regards Jian Zhang
Re: Add your project or company to the powered by page?
* [http://alpha.search.wikia.com Search Wikia] * A project to help develop open source social search tools. We run a 125 node hadoop cluster. Derek Gottfrid wrote: The New York Times / nytimes.com -large scale image conversions -http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ On Thu, Feb 21, 2008 at 1:26 AM, Eric Baldeschwieler [EMAIL PROTECTED] wrote: Hi Folks, Let's get the word out that Hadoop is being used and is useful in your organizations, ok? Please add yourselves to the Hadoop powered by page, or reply to this email with what details you would like to add and I'll do it. http://wiki.apache.org/hadoop/PoweredBy Thanks! E14 --- eric14 a.k.a. Eric Baldeschwieler senior director, grid computing Yahoo! Inc.
Re: Question on metrics via ganglia
Well, with the metrics file changed to perform file based logging, metrics do appear. On digging into the GangliaContext source, it looks like it is using udp for reporting, and we modified the gmond.conf to receive via udp as well as tcp. netstat -a -p shows gmond monitoring 8649 for both tcp and udp. Still nothing visible via the ganglia ui and no rrd file for anything hadoop related. Jason Venner wrote: We have modified my metrics file, distributed it and restarted our cluster. We have gmond running on the nodes, and a machine on the vlan with gmetad running. We have statistics for the machines in the web ui, and our statistics reported by the gmetric program are present. We don't see any hadoop reporting. Clearly we have something basic wrong in our understanding of how to set this up. # Configuration of the dfs context for null # dfs.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the dfs context for file #dfs.class=org.apache.hadoop.metrics.file.FileContext #dfs.period=10 #dfs.fileName=/tmp/dfsmetrics.log # Configuration of the dfs context for ganglia dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=10 dfs.servers=localhost:8649 # Configuration of the mapred context for null # mapred.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the mapred context for file #mapred.class=org.apache.hadoop.metrics.file.FileContext #mapred.period=10 #mapred.fileName=/tmp/mrmetrics.log # Configuration of the mapred context for ganglia mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext mapred.period=10 mapred.servers=localhost:8649 # Configuration of the jvm context for null # jvm.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the jvm context for file #jvm.class=org.apache.hadoop.metrics.file.FileContext #jvm.period=10 #jvm.fileName=/tmp/jvmmetrics.log # Configuration of the jvm context for ganglia jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext jvm.period=10 jvm.servers=localhost:8649
Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?
Output of every mapreduce job in Hadoop gets stored in the DFS, i.e. made visible. You can run back-to-back jobs (i.e. job chaining) but the output won't be temporary. Look at Grep.java, as Hairong suggested, for more details on job chaining. As of now there is no built-in support for job chaining in Hadoop. Pig [http://incubator.apache.org/pig/] on the other hand implicitly does job pipelining. But for smaller and simpler pipelines you can do manual chaining, as sketched below. It depends on the kind of pipelining one requires. Amar ma qiang wrote: Hi all: Here I have two mapreduce programs. I need to use the result of the first mapreduce program to compute other values that are generated in the second mapreduce program, and this intermediate result does not need to be saved, so I want to run the second mapreduce program automatically, using the output of the first mapreduce program as the input of the second mapreduce program. Who can tell me how? Thanks! Best Wishes! Qiang
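[Editor's sketch] A minimal sketch of the manual chaining described above, written against the 0.15/0.16-era API (Grep.java in the Hadoop examples follows the same pattern). FirstMapper, FirstReducer, SecondMapper and SecondReducer are placeholders for your own classes, and the intermediate path is an assumption you would choose yourself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // lands in DFS; delete it afterwards if unwanted
    Path output = new Path(args[2]);

    // First job: reads the original input, writes the intermediate result.
    JobConf first = new JobConf(ChainedJobs.class);
    first.setJobName("first-pass");
    first.setMapperClass(FirstMapper.class);      // placeholder
    first.setReducerClass(FirstReducer.class);    // placeholder
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(Text.class);
    first.setInputPath(input);
    first.setOutputPath(intermediate);
    JobClient.runJob(first);                      // blocks until the first job finishes

    // Second job: consumes the first job's output as its input.
    JobConf second = new JobConf(ChainedJobs.class);
    second.setJobName("second-pass");
    second.setMapperClass(SecondMapper.class);    // placeholder
    second.setReducerClass(SecondReducer.class);  // placeholder
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(Text.class);
    second.setInputPath(intermediate);
    second.setOutputPath(output);
    JobClient.runJob(second);
  }
}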
Re: changes to compression interfaces in 0.15?
Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seems to be set using: sequenceFileOutputFormat.setOutputCompressionType() Yes, we added SequenceFileOutputFormat.setOutputCompressionType and deprecated the old api. (HADOOP-1851) The change is for the better - but would it be possible to: - remove old/dead interfaces. That would have been a straightforward hint for applications to look for new interfaces. (hadoop-default.xml also still has setting for old conf variable: io.seqfile.compression.type) To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. - if possible - document changed interfaces in the release notes (there's no way we can find this out by looking at the long list of Jiras). Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt, HADOOP-1851 is listed there. Admittedly we can do better, but that is a good place to look for when upgrading to newer releases. i am not sure how updated the wiki is on the compression stuff (my responsibility to update it) - but please do consider the impact of Please use the forrest-based docs (on the hadoop website - e.g. mapred_tutorial.html) rather than the wiki as the gold-standard. The reason we moved away from the wiki is precisely this - harder to maintain docs per release etc. changing interfaces on existing applications. (maybe we should have a JIRA tag to mark out bugs that change interfaces). Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now. Arun As always - thanks for all the fish (err .. working code), Joydeep
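[Editor's sketch] For readers hitting the same migration, a minimal sketch of the two styles discussed in this thread, assuming a job whose output format is SequenceFileOutputFormat; the wrapper class name is made up for illustration:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CompressionTypeExample {
  public static void configure(JobConf conf) {
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    // Pre-0.15 style, now the deprecated path discussed above:
    // SequenceFile.setCompressionType(conf, SequenceFile.CompressionType.BLOCK);
    // 0.15 style, per HADOOP-1851:
    SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);
  }
}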
Re: Add your project or company to the powered by page?
On 2/21/08 11:34 AM, Jeff Hammerbacher [EMAIL PROTECTED] wrote: yeah, i've heard those facebook groups can be a great way to get the word out... anyways, just got approval yesterday for a 320 node cluster. each node has 8 cores and 4 TB of raw storage so this guy is gonna be pretty powerful. can we claim largest cluster outside of yahoo? I guess it depends upon how you define outside. *Technically*, M45 is outside of a Yahoo! building, given that it is in one of those shipping-container-data-center-thingies ...
Re: Add your project or company to the powered by page?
More on the subject of outreach, not specific uses at companies, but... A couple things might help get the word out: - Add a community group in LinkedIn (shows up on profile searches) http://www.linkedin.com/static?key=groups_faq - Add a link on the wiki to the Facebook group about Hadoop http://www.facebook.com/pages/Hadoop/9887781514 There's also a small but growing network of local user groups for Amazon AWS, and much interest there for presentations and discussions about Hadoop: http://www.amazon.com/Upcoming-Events-AWS-home-page/b/ref=sc_fe_c_0_371080011_1/103-5668663-1566203?ie=UTF8node=16284451no=371080011me=A36L942TSJ2AJA I'd be happy to help with any of those. Paco On Wed, Feb 20, 2008 at 10:26 PM, Eric Baldeschwieler [EMAIL PROTECTED] wrote: Hi Folks, Let's get the word out that Hadoop is being used and is useful in your organizations, ok? Please add yourselves to the Hadoop powered by page, or reply to this email with what details you would like to add and I'll do it. http://wiki.apache.org/hadoop/PoweredBy Thanks! E14 --- eric14 a.k.a. Eric Baldeschwieler senior director, grid computing Yahoo! Inc.
Re: Question on metrics via ganglia solved
Instead of localhost, in the servers block, we now put the machine that has gmetad running. dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=10 dfs.servers=GMETAD_HOST:8649 Jason Venner wrote: Well, with the metrics file changed to perform file based logging, metrics do appear. On digging into the GangliaContext source, it looks like it is using udp for reporting, and we modified the gmond.conf to receive via udp as well as tcp. netstat -a -p shows gmond monitoring 8649 for both tcp and udp. Still nothing visible via the ganglia ui and no rrd file for anything hadoop related. Jason Venner wrote: We have modified my metrics file, distributed it and restarted our cluster. We have gmond running on the nodes, and a machine on the vlan with gmetad running. We have statistics for the machines in the web ui, and our statistics reported by the gmetric program are present. We don't see any hadoop reporting. Clearly we have something basic wrong in our understanding of how to set this up. # Configuration of the dfs context for null # dfs.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the dfs context for file #dfs.class=org.apache.hadoop.metrics.file.FileContext #dfs.period=10 #dfs.fileName=/tmp/dfsmetrics.log # Configuration of the dfs context for ganglia dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=10 dfs.servers=localhost:8649 # Configuration of the mapred context for null # mapred.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the mapred context for file #mapred.class=org.apache.hadoop.metrics.file.FileContext #mapred.period=10 #mapred.fileName=/tmp/mrmetrics.log # Configuration of the mapred context for ganglia mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext mapred.period=10 mapred.servers=localhost:8649 # Configuration of the jvm context for null # jvm.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the jvm context for file #jvm.class=org.apache.hadoop.metrics.file.FileContext #jvm.period=10 #jvm.fileName=/tmp/jvmmetrics.log # Configuration of the jvm context for ganglia jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext jvm.period=10 jvm.servers=localhost:8649 -- Jason Venner Attributor - Publish with Confidence http://www.attributor.com/ Attributor is hiring Hadoop Wranglers, contact if interested
RE: changes to compression interfaces in 0.15?
To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. You've got to be kidding. We didn't maintain backwards compatibility. My app broke. Simple and straightforward. And the old interfaces are not deprecated (to quote 0.15.3 on a 'deprecated' interface: /** * Set the compression type for sequence files. * @param job the configuration to modify * @param val the new compression type (none, block, record) */ static public void setCompressionType(Configuration job, CompressionType val) { ) I (and I would suspect any average user willing to recompile code) would much, much rather that we broke backwards compatibility immediately rather than maintain carried-over defunct apis that insidiously break application behavior. And of course - this does not address the point that the option strings themselves are deprecated. (Remember - people set options explicitly from xml files and streaming. Not everyone goes through java apis.) -- As one of my dear professors once said - put yourself in the other person's shoes. Consider that you were in my position and that a production app suddenly went from consuming 100G to 1TB. And everything slowed down drastically. And it did not give any sign that anything was amiss. Everything looked golden on the outside. What would your reaction be if you found out after a week that the system was full and numerous processes had to be re-run? How would you have figured that was going to happen by looking at the INCOMPATIBLE section (which, btw, I did read carefully before sending my mail)? (Fortunately I escaped the worst case - but I think this is a real call to action.) -Original Message- From: Arun C Murthy [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 11:21 AM To: core-user@hadoop.apache.org Subject: Re: changes to compression interfaces in 0.15? Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seems to be set using: sequenceFileOutputFormat.setOutputCompressionType() Yes, we added SequenceFileOutputFormat.setOutputCompressionType and deprecated the old api. (HADOOP-1851) The change is for the better - but would it be possible to: - remove old/dead interfaces. That would have been a straightforward hint for applications to look for new interfaces. (hadoop-default.xml also still has setting for old conf variable: io.seqfile.compression.type) To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. - if possible - document changed interfaces in the release notes (there's no way we can find this out by looking at the long list of Jiras). Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt, HADOOP-1851 is listed there. Admittedly we can do better, but that is a good place to look for when upgrading to newer releases. i am not sure how updated the wiki is on the compression stuff (my responsibility to update it) - but please do consider the impact of Please use the forrest-based docs (on the hadoop website - e.g. mapred_tutorial.html) rather than the wiki as the gold-standard. The reason we moved away from the wiki is precisely this - harder to maintain docs per release etc. changing interfaces on existing applications. 
(maybe we should have a JIRA tag to mark out bugs that change interfaces). Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now. Arun As always - thanks for all the fish (err .. working code), Joydeep
Re: Hadoop summit / workshop at Yahoo!
I would certainly appreciate being able to watch them online too, and they would help spread the word about hadoop - think of all the people who watch Google's Techtalks (am I allowed to say the G word around here?). On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote: Online webcast/recorded video would be really appreciated by lot of people. Please post the content online! (not only you can target much greater audience but you can significantly save on break/lunch/beer food budget :-). Lukas On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED] wrote: The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/ Space is limited, so please sign up early if you are interested in attending. About the summit: Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform. Agenda: 8:30-8:55 Breakfast 8:55-9:00 Welcome to Yahoo! Logistics - Ajay Anand, Yahoo! 9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo! 9:30-10:00 Pig - Chris Olston, Yahoo! 10:00-10:30 JAQL - Kevin Beyer, IBM 10:30-10:45 Break 10:45-11:15 DryadLINQ - Michael Isard, Microsoft 11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei Zaharia, UC Berkeley 11:45-12:15 Zookeeper - Ben Reed, Yahoo! 12:15-1:15 Lunch 1:15-1:45 Hbase - Michael Stack, Powerset 1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf 2:15-2:45 Hive - Joydeep Sen Sarma, Facebook 2:45-3:00 Break 3:00-3:20 Building Ground Models of Southern California - Steve Schossler, David O'Hallaron, Intel / CMU 3:20-3:40 Online search for engineering design content - Mike Haley, Autodesk 3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo! 4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland / Christophe Bisciglia, Google 4:30-4:45 Break 4:45-5:30 Panel on future directions 5:30-7:00 Happy hour Look forward to seeing you there! Ajay -Original Message- From: Bradford Stephens [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 20, 2008 9:17 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! Hrm yes, I'd like to make a visit as well :) On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote: Hey All: Is this going forward? I'd like to make plans to attend and the sooner I can get plane tickets the happier the bean counters will be :-). Thx, C G Ajay Anand wrote: Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like to cover topics in the areas of extensions being developed for Hadoop, innovative applications being built and deployed on Hadoop, and future extensions to the platform. Some of the speakers who have already committed to present are from organizations such as IBM, Intel, Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and we are actively recruiting other leaders in the space. If you have an innovative application you would like to talk about, please let us know. 
Although there are limitations on the amount of time we have, we would love to hear from you. You can contact me at [EMAIL PROTECTED] Thanks and looking forward to hearing about your cool apps, Ajay -- View this message in context: http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15393386.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com. - Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now.
Re: changes to compression interfaces in 0.15?
On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote: To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. you've got to be kidding. we didn't maintain backwards compatibility. my app broke. Simple and straightforward. and the old interfaces are not deprecated (to quote 0.15.3 on a 'deprecated' interface: You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869. We do need to be more diligent about listing config changes in CHANGES.txt for starters, and that point is taken. However, we can't start pulling out apis without deprecating them first. Arun /** * Set the compression type for sequence files. * @param job the configuration to modify * @param val the new compression type (none, block, record) */ static public void setCompressionType(Configuration job, CompressionType val) { ) I (and i would suspect any average user willing to recompile code) would much much rather that we broke backwards compatibility immediately rather than maintain carry over defunct apis that insidiously break application behavior. and of course - this does not address the point that the option strings themselves are depcreated. (remember - people set options explicitly from xml files and streaming. not everyone goes through java apis)). -- as one of my dear professors once said - put ur self in the other person's shoe. consider that u were in my position and that a production app suddenly went from consuming 100G to 1TB. and everything slowed down drastically. and it did not give any sign that anything was amiss. everything looked golden on the ourside. what would be ur reaction if u find out after a week that the system was full and numerous processes had to be re-run? how would you have figured that was going to happen by looking at the INCOMPATIBLE section (which btw - i did carefully before sending my mail). (fortunately i escaped the worst case - but i think this is a real call to action) -Original Message- From: Arun C Murthy [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 11:21 AM To: core-user@hadoop.apache.org Subject: Re: changes to compression interfaces in 0.15? Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seems to be set using: sequenceFileOutputFormat.setOutputCompressionType() Yes, we added SequenceFileOutputFormat.setOutputCompressionType and deprecated the old api. (HADOOP-1851) The change is for the better - but would it be possible to: - remove old/dead interfaces. That would have been a straightforward hint for applications to look for new interfaces. (hadoop-default.xml also still has setting for old conf variable: io.seqfile.compression.type) To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. - if possible - document changed interfaces in the release notes (there's no way we can find this out by looking at the long list of Jiras). Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt, HADOOP-1851 is listed there. Admittedly we can do better, but that is a good place to look for when upgrading to newer releases. 
i am not sure how updated the wiki is on the compression stuff (my responsibility to update it) - but please do consider the impact of Please use the forrest-based docs (on the hadoop website - e.g. mapred_tutorial.html) rather than the wiki as the gold-standard. The reason we moved away from the wiki is precisely this - harder to maintain docs per release etc. changing interfaces on existing applications. (maybe we should have a JIRA tag to mark out bugs that change interfaces). Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now. Arun As always - thanks for all the fish (err .. working code), Joydeep
Re: changes to compression interfaces in 0.15?
If the API semantics are changing under you, you have to change your code whether or not the API is pulled or deprecated. Pulling it makes it more obvious that the user has to change his/her code. -- pete On 2/21/08 12:41 PM, Arun C Murthy [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote: To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. you've got to be kidding. we didn't maintain backwards compatibility. my app broke. Simple and straightforward. and the old interfaces are not deprecated (to quote 0.15.3 on a 'deprecated' interface: You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869. We do need to be more diligent about listing config changes in CHANGES.txt for starters, and that point is taken. However, we can't start pulling out apis without deprecating them first. Arun /** * Set the compression type for sequence files. * @param job the configuration to modify * @param val the new compression type (none, block, record) */ static public void setCompressionType(Configuration job, CompressionType val) { ) I (and i would suspect any average user willing to recompile code) would much much rather that we broke backwards compatibility immediately rather than maintain carry over defunct apis that insidiously break application behavior. and of course - this does not address the point that the option strings themselves are depcreated. (remember - people set options explicitly from xml files and streaming. not everyone goes through java apis)). -- as one of my dear professors once said - put ur self in the other person's shoe. consider that u were in my position and that a production app suddenly went from consuming 100G to 1TB. and everything slowed down drastically. and it did not give any sign that anything was amiss. everything looked golden on the ourside. what would be ur reaction if u find out after a week that the system was full and numerous processes had to be re-run? how would you have figured that was going to happen by looking at the INCOMPATIBLE section (which btw - i did carefully before sending my mail). (fortunately i escaped the worst case - but i think this is a real call to action) -Original Message- From: Arun C Murthy [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 11:21 AM To: core-user@hadoop.apache.org Subject: Re: changes to compression interfaces in 0.15? Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seems to be set using: sequenceFileOutputFormat.setOutputCompressionType() Yes, we added SequenceFileOutputFormat.setOutputCompressionType and deprecated the old api. (HADOOP-1851) The change is for the better - but would it be possible to: - remove old/dead interfaces. That would have been a straightforward hint for applications to look for new interfaces. (hadoop-default.xml also still has setting for old conf variable: io.seqfile.compression.type) To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. - if possible - document changed interfaces in the release notes (there's no way we can find this out by looking at the long list of Jiras). 
Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt, HADOOP-1851 is listed there. Admittedly we can do better, but that is a good place to look for when upgrading to newer releases. i am not sure how updated the wiki is on the compression stuff (my responsibility to update it) - but please do consider the impact of Please use the forrest-based docs (on the hadoop website - e.g. mapred_tutorial.html) rather than the wiki as the gold-standard. The reason we moved away from the wiki is precisely this - harder to maintain docs per release etc. changing interfaces on existing applications. (maybe we should have a JIRA tag to mark out bugs that change interfaces). Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now. Arun As always - thanks for all the fish (err .. working code), Joydeep
RE: Hadoop summit / workshop at Yahoo!
We do plan to make the video available online after the event. Ajay -Original Message- From: Tim Wintle [mailto:[EMAIL PROTECTED] Sent: Thursday, February 21, 2008 12:22 PM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! I would certainly appreciate being able to watch them online too, and they would help spread the word about hadoop - think of all the people who watch Google's Techtalks (am I allowed to say the G word around here?). On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote: Online webcast/recorded video would be really appreciated by lot of people. Please post the content online! (not only you can target much greater audience but you can significantly save on break/lunch/beer food budget :-). Lukas On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED] wrote: The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/ Space is limited, so please sign up early if you are interested in attending. About the summit: Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform. Agenda: 8:30-8:55 Breakfast 8:55-9:00 Welcome to Yahoo! Logistics - Ajay Anand, Yahoo! 9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo! 9:30-10:00 Pig - Chris Olston, Yahoo! 10:00-10:30 JAQL - Kevin Beyer, IBM 10:30-10:45 Break 10:45-11:15 DryadLINQ - Michael Isard, Microsoft 11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei Zaharia, UC Berkeley 11:45-12:15 Zookeeper - Ben Reed, Yahoo! 12:15-1:15 Lunch 1:15-1:45 Hbase - Michael Stack, Powerset 1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf 2:15-2:45 Hive - Joydeep Sen Sarma, Facebook 2:45-3:00 Break 3:00-3:20 Building Ground Models of Southern California - Steve Schossler, David O'Hallaron, Intel / CMU 3:20-3:40 Online search for engineering design content - Mike Haley, Autodesk 3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo! 4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland / Christophe Bisciglia, Google 4:30-4:45 Break 4:45-5:30 Panel on future directions 5:30-7:00 Happy hour Look forward to seeing you there! Ajay -Original Message- From: Bradford Stephens [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 20, 2008 9:17 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! Hrm yes, I'd like to make a visit as well :) On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote: Hey All: Is this going forward? I'd like to make plans to attend and the sooner I can get plane tickets the happier the bean counters will be :-). Thx, C G Ajay Anand wrote: Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like to cover topics in the areas of extensions being developed for Hadoop, innovative applications being built and deployed on Hadoop, and future extensions to the platform. 
Some of the speakers who have already committed to present are from organizations such as IBM, Intel, Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and we are actively recruiting other leaders in the space. If you have an innovative application you would like to talk about, please let us know. Although there are limitations on the amount of time we have, we would love to hear from you. You can contact me at [EMAIL PROTECTED] Thanks and looking forward to hearing about your cool apps, Ajay -- View this message in context: http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15393386.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com. - Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now.
define backwards compatibility (was: changes to compression interfaces in 0.15?)
Arun - if you can't pull the api - then you must redirect the api to the new call so that it preserves its semantics. In this case - had we re-implemented SequenceFile.setCompressionType in 0.15 to call SequenceFileOutputFormat.setOutputCompressionType() - then it would have been a backwards compatible change, and deprecation would have served fair warning for eventual pullout. I find the confusion over what backwards compatibility means scary - and I am really hoping that the outcome of this thread is a clear definition from the committers/hadoop-board of what to reasonably expect (or not!) going forward. -Original Message- From: Pete Wyckoff [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 12:47 PM To: core-user@hadoop.apache.org Subject: Re: changes to compression interfaces in 0.15? If the API semantics are changing under you, you have to change your code whether or not the API is pulled or deprecated. Pulling it makes it more obvious that the user has to change his/her code. -- pete On 2/21/08 12:41 PM, Arun C Murthy [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote: To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. you've got to be kidding. we didn't maintain backwards compatibility. my app broke. Simple and straightforward. and the old interfaces are not deprecated (to quote 0.15.3 on a 'deprecated' interface: You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869. We do need to be more diligent about listing config changes in CHANGES.txt for starters, and that point is taken. However, we can't start pulling out apis without deprecating them first. Arun /** * Set the compression type for sequence files. * @param job the configuration to modify * @param val the new compression type (none, block, record) */ static public void setCompressionType(Configuration job, CompressionType val) { ) I (and i would suspect any average user willing to recompile code) would much much rather that we broke backwards compatibility immediately rather than maintain carry over defunct apis that insidiously break application behavior. and of course - this does not address the point that the option strings themselves are deprecated. (remember - people set options explicitly from xml files and streaming. not everyone goes through java apis). -- as one of my dear professors once said - put yourself in the other person's shoe. consider that you were in my position and that a production app suddenly went from consuming 100G to 1TB. and everything slowed down drastically. and it did not give any sign that anything was amiss. everything looked golden on the outside. what would be your reaction if you found out after a week that the system was full and numerous processes had to be re-run? how would you have figured that was going to happen by looking at the INCOMPATIBLE section (which btw - i did read carefully before sending my mail). (fortunately i escaped the worst case - but i think this is a real call to action) -Original Message- From: Arun C Murthy [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 11:21 AM To: core-user@hadoop.apache.org Subject: Re: changes to compression interfaces in 0.15?
Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seems to be set using: sequenceFileOutputFormat.setOutputCompressionType() Yes, we added SequenceFileOutputFormat.setOutputCompressionType and deprecated the old api. (HADOOP-1851) The change is for the better - but would it be possible to: - remove old/dead interfaces. That would have been a straightforward hint for applications to look for new interfaces. (hadoop-default.xml also still has setting for old conf variable: io.seqfile.compression.type) To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. - if possible - document changed interfaces in the release notes (there's no way we can find this out by looking at the long list of Jiras). Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt, HADOOP-1851 is listed there. Admittedly we can do better, but that is a good place to look for when upgrading to newer releases. i am not sure how updated the wiki is on the compression stuff (my responsibility to update it) - but please do consider the impact of Please use the forrest-based docs (on the hadoop website - e.g. mapred_tutorial.html) rather than the wiki
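[Editor's sketch] A sketch of the delegation being asked for above, purely illustrative of the pattern rather than the actual 0.15 source (the wrapper class name is made up): the old entry point stays, is marked deprecated, and forwards to the new call so existing programs keep their behavior for a release cycle.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileCompat {
  /**
   * Old-style entry point, kept for one release cycle.
   * @deprecated use SequenceFileOutputFormat.setOutputCompressionType(JobConf, CompressionType)
   */
  @Deprecated
  public static void setCompressionType(JobConf job, SequenceFile.CompressionType val) {
    // Forward to the new call so the old one keeps its effect on job output.
    SequenceFileOutputFormat.setOutputCompressionType(job, val);
  }
}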
Python access to HDFS
Are there any existing HDFS access packages out there for Python? I've had some success using SWIG and the C HDFS code, as documented here: http://www.stat.purdue.edu/~sguha/code.html (halfway down the page) but it's slow adding support for some of the more complex functions. If there's anything out there I missed, I'd like to hear about it. -- Steve Sapovits Invite Media - http://www.invitemedia.com [EMAIL PROTECTED]
Re: define backwards compatibility
Joydeep Sen Sarma wrote: i find the confusion over what backwards compatibility means scary - and i am really hoping that the outcome of this thread is a clear definition from the committers/hadoop-board of what to reasonably expect (or not!) going forward. The goal is clear: code that compiles and runs warning-free in one release should not have to be altered to try the next release. It may generate warnings, and these should be addressed before another upgrade is attempted. Sometimes it is not possible to achieve this. In these cases applications should fail with a clear error message, either at compilation or runtime. In both cases, incompatible changes should be well documented in the release notes. This is described (in part) in http://wiki.apache.org/hadoop/Roadmap That's the goal. Implementing and enforcing it is another story. For that we depend on developer and user vigilance. The current issue seems a case of failure to implement the policy rather than a lack of policy. Doug
Re: Questions regarding configuration parameters...
Try these two parameters to utilize all the cores per node/host:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

The default values are 2, so you might only see 2 cores used by Hadoop per node/host. If each system/machine has 4 cores (dual dual-core), then you can change them to 3. Hope this works for you. -Andy On Wed, Feb 20, 2008 at 9:30 AM, C G [EMAIL PROTECTED] wrote: Hi All: The documentation for the configuration parameters mapred.map.tasks and mapred.reduce.tasks discusses these values in terms of the number of available hosts in the grid. This description strikes me as a bit odd given that a host could be anything from a uniprocessor to an N-way box, where values for N could vary from 2..16 or more. The documentation is also vague about computing the actual value. For example, for mapred.map.tasks the doc says "…a prime number several times greater…". I'm curious about how people are interpreting the descriptions and what values people are using. Specifically, I'm wondering if I should be using core count instead of host count to set these values. In the specific case of my system, we have 24 hosts where each host is a 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the value 173, as that is a prime number which is near 7*24. For mapred.reduce.tasks I chose 23 since that is a prime number close to 24. Is this what was intended? Beyond curiosity, I'm concerned about setting these values and other configuration parameters correctly because I am pursuing some performance issues where it is taking a very long time to process small amounts of data. I am hoping that some amount of tuning will resolve the problems. Any thoughts and insights most appreciated. Thanks, C G - Never miss a thing. Make Yahoo your homepage.
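[Editor's sketch] For the per-job side of this, a minimal sketch against the 0.15/0.16-era JobConf API. The per-tracker maximums above live in hadoop-site.xml on the slave nodes, not in job code; the numbers below simply mirror the 24-host, 4-core example from the question and are assumptions, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class TaskCountExample {
  public static void configure(JobConf conf) {
    // mapred.map.tasks is only a hint; the InputFormat's split count can override it.
    conf.setNumMapTasks(173);
    // mapred.reduce.tasks is honored exactly; one output file is produced per reduce task.
    conf.setNumReduceTasks(23);
  }
}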
Re: Sorting output data on value
On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote: It may be sorted within the output for a single reducer and, indeed, you can even guarantee that it is sorted but *only* by the reduce key. The order that values appear will not be deterministic. Actually, there is a better answer for this. If you put both the primary and secondary key into the key, you can use JobConf.setOutputValueGroupingComparator to set a comparator that only compares the primary key. Reduce will be called once per primary key, but all of the values will be sorted by the secondary key. See http://tinyurl.com/32gld4 -- Owen
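[Editor's sketch] A compact sketch of the pattern described above, not the canonical implementation: the mapper is assumed to emit a composite Text key of the form primary<TAB>secondary, and the grouping comparator below compares only the primary part. The class name is made up; a partitioner that also hashes only the primary part is needed so that all records for one primary key reach the same reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;

public class PrimaryKeyGroupingComparator extends WritableComparator {
  public PrimaryKeyGroupingComparator() {
    super(Text.class);  // register the key class so keys are deserialized for compare()
  }

  public int compare(WritableComparable a, WritableComparable b) {
    // Group on the portion of the composite key before the first tab (the primary key).
    String left = a.toString();
    String right = b.toString();
    int li = left.indexOf('\t');
    int ri = right.indexOf('\t');
    String leftPrimary = (li < 0) ? left : left.substring(0, li);
    String rightPrimary = (ri < 0) ? right : right.substring(0, ri);
    return leftPrimary.compareTo(rightPrimary);
  }

  public static void configure(JobConf conf) {
    conf.setOutputKeyClass(Text.class);
    // The sort still uses the full composite key, so within each reduce() call the
    // values arrive ordered by the secondary part.
    conf.setOutputValueGroupingComparator(PrimaryKeyGroupingComparator.class);
  }
}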
Problems running a HOD test cluster
Hello everyone, I've been trying to run HOD on a sample cluster with three nodes that already have Torque installed and (hopefully?) properly working. I also prepared a configuration file for hod, that I'm gonna paste at the end of this email. A few questions: - is Java6 ok for HOD? - I have an externally running HDFS cluster, as specified in [gridservice-hdfs]: how do I find out the fs_port of my cluster? IS it something specified in the hadoop-site.xml file? - what should I expect at the end of an allocate command? Currently what I get is the output above, but should I in theory return back to the shell prompt, to issue an hadoop command? [2008-02-21 19:45:34,349] DEBUG/10 hod:144 - ('server.com', 10029) [2008-02-21 19:45:34,350] INFO/20 hod:216 - Service Registry Started. [2008-02-21 19:45:34,353] DEBUG/10 hadoop:425 - allocate /mnt/scratch/grid/test 3 3 [2008-02-21 19:45:34,357] DEBUG/10 torque:72 - ringmaster cmd: /mnt/scratch/grid/hod/bin/ringmaster --hodring.tarball-retry-initial-time 1.0 --hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 1-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs --hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop --hodring.java-home /usr/java/jdk1.6.0_04 --hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 --hodring.xrs-port-range 1-11000 --hodring.debug 4 --resource_manager.queue hadoop --resource_manager.env-vars HOD_PYTHON_HOME=/usr/bin/python2.5 --resource_manager.id torque --resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 --gridservice-hdfs.host localhost --gridservice-hdfs.pkgs /mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 --gridservice-hdfs.external --ringmaster.http-port-range 1-11000 --ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz --ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 --ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir /mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 --ringmaster.xrs-port-range 1-11000 --ringmaster.jt-poll-interval 120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 --gridservice-mapred.tracker_port 10003 --gridservice-mapred.host localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current --gridservice-mapred.info_port 10008 [2008-02-21 19:45:34,361] DEBUG/10 torque:44 - qsub - /usr/bin/qsub -l nodes=3 -W x= -l nodes=3 -W x= -N HOD -r n -d /tmp/ -q hadoop -v HOD_PYTHON_HOME=/usr/bin/python2.5 [2008-02-21 19:45:34,373] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh [2008-02-21 19:45:34,374] DEBUG/10 torque:54 - qsub stdin: /mnt/scratch/grid/hod/bin/ringmaster --hodring.tarball-retry-initial-time 1.0 --hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 1-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs --hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop --hodring.java-home /usr/java/jdk1.6.0_04 --hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 --hodring.xrs-port-range 1-11000 --hodring.debug 4 --resource_manager.queue hadoop --resource_manager.env-vars HOD_PYTHON_HOME=/usr/bin/python2.5 --resource_manager.id torque --resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 --gridservice-hdfs.host localhost --gridservice-hdfs.pkgs /mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 --gridservice-hdfs.external --ringmaster.http-port-range 1-11000 --ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz 
--ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 --ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir /mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 --ringmaster.xrs-port-range 1-11000 --ringmaster.jt-poll-interval 120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 --gridservice-mapred.tracker_port 10003 --gridservice-mapred.host localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current --gridservice-mapred.info_port 10008 [2008-02-21 19:45:36,385] DEBUG/10 torque:76 - qsub jobid: 207.server.com [2008-02-21 19:45:36,389] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:38,952] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:41,524] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:44,066] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:46,612] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:49,155] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:51,696] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:54,236] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 207.server.com [2008-02-21 19:45:56,797] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
RE: Questions regarding configuration parameters...
My performance problems fall into 2 categories: 1. Extremely slow reduce phases - our map phases march along at impressive speed, but during reduce phases most nodes go idle...the active machines mostly clunk along at 10-30% CPU. Compare this to the map phase where I get all grid nodes cranking away at 100% CPU. This is a vague explanation I realize. 2. Pregnant pauses during dfs -copyToLocal and -cat operations. Frequently I'll be iterating over a list of HDFS files cat-ing them into one file to bulk load into a database. Many times I'll see one of the copies/cats sit for anywhere from 2-5 minutes. During that time no data is transferred, all nodes are idle, and absolutely nothing is written to any of the logs. The file sizes being copied are relatively small...less than 1G each in most cases. Both of these issues persist in 0.16.0 and definitely have me puzzled. I'm sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but the long pauses during a dfs command line operation seems like a bug to me. Unfortunately I've not seen anybody else report this. Any thoughts/ideas most welcome... Thanks, C G Joydeep Sen Sarma [EMAIL PROTECTED] wrote: The default value are 2 so you might only see 2 cores used by Hadoop per node/host. that's 2 each for map and reduce. so theoretically - one could fully utilize a 4 core box with this setting. in practice - a little bit of oversubscription (3 each on a 4 core) seems to be working out well for us (maybe overlapping some compute and io - but mostly we are trading off for higher # concurrent jobs against per job latency). unlikely that these settings are causing slowness in processing small amounts of data. send more details - what's slow (map/shuffle/reduce)? check cpu consumption when map task is running .. etc. -Original Message- From: Andy Li [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 2:36 PM To: core-user@hadoop.apache.org Subject: Re: Questions regarding configuration parameters... Try the 2 parameters to utilize all the cores per node/host. mapred.tasktracker.map.tasks.maximum 7 The maximum number of map tasks that will be run simultaneously by a task tracker. mapred.tasktracker.reduce.tasks.maximum 7 The maximum number of reduce tasks that will be run simultaneously by a task tracker. The default value are 2 so you might only see 2 cores used by Hadoop per node/host. If each system/machine has 4 cores (dual dual core), then you can change them to 3. Hope this works for you. -Andy On Wed, Feb 20, 2008 at 9:30 AM, C G wrote: Hi All: The documentation for the configuration parameters mapred.map.tasks and mapred.reduce.tasks discuss these values in terms of number of available hosts in the grid. This description strikes me as a bit odd given that a host could be anything from a uniprocessor to an N-way box, where values for N could vary from 2..16 or more. The documentation is also vague about computing the actual value. For example, for mapred.map.tasks the doc says .a prime number several times greater.. I'm curious about how people are interpreting the descriptions and what values people are using. Specifically, I'm wondering if I should be using core count instead of host count to set these values. In the specific case of my system, we have 24 hosts where each host is a 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the value 173, as that is a prime number which is near 7*24. For mapred.reduce.tasks I chose 23 since that is a prime number close to 24. Is this what was intended? 
Beyond curiousity, I'm concerned about setting these values and other configuration parameters correctly because I am pursuing some performance issues where it is taking a very long time to process small amounts of data. I am hoping that some amount of tuning will resolve the problems. Any thoughts and insights most appreciated. Thanks, C G - Never miss a thing. Make Yahoo your homepage. - Looking for last minute shopping deals? Find them fast with Yahoo! Search.
Re: Sorting output data on value
But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducer's output value. On 2/21/08 8:40 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote: It may be sorted within the output for a single reducer and, indeed, you can even guarantee that it is sorted but *only* by the reduce key. The order that values appear will not be deterministic. Actually, there is a better answer for this. If you put both the primary and secondary key into the key, you can use JobConf.setOutputValueGroupingComparator to set a comparator that only compares the primary key. Reduce will be called once per primary key, but all of the values will be sorted by the secondary key. See http://tinyurl.com/32gld4 -- Owen