Re: When does Reduce job start
That is what I was looking for. Thanks a million, Harsh. Cool, now that I have a starting point, I will check it in Hadoop 0.18.

-Sagar

On Tue, Jan 4, 2011 at 7:23 PM, Harsh J wrote:
> Hello Sagar,
>
> On Wed, Jan 5, 2011 at 6:44 AM, sagar naik wrote:
>> What is the configuration param to change this behavior?
>
> mapred.reduce.slowstart.completed.maps is a property (0.20.x) that
> controls when the ReduceTasks start getting scheduled. Your
> job would still need free reduce slots for it to begin.
>
> --
> Harsh J
> www.harshj.com
Re: monit? daemontools? jsvc? something else?
Ah, more manual work! :( You guys never have a JVM die "just because"? I just had a DN's JVM die the other day "just because", with no obvious cause. Restarting it brought it back to life and everything recovered smoothly. Had some automated tool done the restart for me, I'd be even happier. But I'll have to take your advice. :(

Does anyone else have a different opinion? Actually, is anyone out there using any such tools and *not* seeing problems when they kick in and do their job of restarting dead processes?

Thanks,
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

- Original Message
> From: Brian Bockelman
> To: common-user@hadoop.apache.org
> Sent: Tue, January 4, 2011 8:43:46 AM
> Subject: Re: monit? daemontools? jsvc? something else?
>
> I'll second this opinion. Although there are some tools in life that need to
> be actively managed like this (and even then, sometimes management tools can
> be set to be too aggressive, making a bad situation terrible), HDFS is not one.
>
> If the JVM dies, you likely need a human brain to log in and figure out what's
> wrong - or just keep that node dead.
>
> Brian
>
> On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:
>
>> On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
>>> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
>>> tools like monit and daemontools (and a few other ones) to revive their
>>> Hadoop processes when they die.
>>
>> I'm not a fan of doing this for Hadoop processes, even TaskTrackers and
>> DataNodes. The processes generally die for a reason, usually indicating that
>> something is wrong with the box. Restarting those processes may potentially
>> hide issues.
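That said, for anyone who does experiment despite the advice quoted above, the rule such tools apply is simple. A rough monit sketch for a DataNode, where the pid-file and script paths are assumptions about the install layout, not values taken from this thread:

# illustrative monit stanza; paths depend on your installation
check process datanode with pidfile /tmp/hadoop-hadoop-datanode.pid
  start program = "/usr/lib/hadoop/bin/hadoop-daemon.sh start datanode"
  stop program  = "/usr/lib/hadoop/bin/hadoop-daemon.sh stop datanode"
  if 5 restarts within 5 cycles then timeout

The last line is the safeguard Brian alludes to: if the process keeps dying, stop restarting it and leave it for a human to inspect.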
How to create hadoop-0.21.0-core.jar ?
How do I create hadoop-0.21.0-core.jar from the source code? Right now, when I compile the code, I need three or more jar files: common, hdfs and mapred. I want to build a single hadoop-0.21.0-core.jar to run a Hadoop program. Can anyone help?

--
View this message in context: http://old.nabble.com/How-to-create-hadoop-0.21.0-core.jar---tp30593204p30593204.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
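For what it's worth, 0.21 no longer produces a single core jar: the project was split into Common, HDFS and MapReduce, each building its own jar, so programs are usually compiled against all three. A rough sketch of building them from the split source trees (Ant target names, directory names and jar names are from memory and may differ for your checkout):

# run in each subproject checkout; assumes Ant and Ivy are available
cd common       && ant jar
cd ../hdfs      && ant jar
cd ../mapreduce && ant jar

# compile against all three jars instead of a single core jar
# (MyJob.java is a placeholder for your program)
javac -classpath hadoop-common-0.21.0.jar:hadoop-hdfs-0.21.0.jar:hadoop-mapred-0.21.0.jar MyJob.java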
Re: Data for Testing in Hadoop
Also, Amazon offers free public data sets at: http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

On Tue, Jan 4, 2011 at 7:28 PM, Lance Norskog wrote:
> https://cwiki.apache.org/confluence/display/MAHOUT/Collections
>
> All the collections you can imagine.
>
> On Tue, Jan 4, 2011 at 12:28 AM, Harsh J wrote:
>> You can use MR to generate the data itself. Check out GridMix in
>> Hadoop, or PigMix from Pig, for examples of general load tests.
>>
>> On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma wrote:
>>> Dear all,
>>>
>>> Designing the architecture is very important for Hadoop in Production
>>> Clusters.
>>>
>>> We are researching running Hadoop on individual nodes and in a cloud
>>> environment (VMs).
>>>
>>> For this, I require some data for testing. Would anyone send me some links
>>> for data of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
>>> I shall be grateful for this kindness.
>>>
>>> Thanks & Regards
>>>
>>> Adarsh Sharma
>
> --
> Lance Norskog
> goks...@gmail.com
Re: Data for Testing in Hadoop
https://cwiki.apache.org/confluence/display/MAHOUT/Collections

All the collections you can imagine.

On Tue, Jan 4, 2011 at 12:28 AM, Harsh J wrote:
> You can use MR to generate the data itself. Check out GridMix in
> Hadoop, or PigMix from Pig, for examples of general load tests.
>
> On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma wrote:
>> Dear all,
>>
>> Designing the architecture is very important for Hadoop in Production
>> Clusters.
>>
>> We are researching running Hadoop on individual nodes and in a cloud
>> environment (VMs).
>>
>> For this, I require some data for testing. Would anyone send me some links
>> for data of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
>> I shall be grateful for this kindness.
>>
>> Thanks & Regards
>>
>> Adarsh Sharma
>
> --
> Harsh J
> www.harshj.com

--
Lance Norskog
goks...@gmail.com
Re: When does Reduce job start
Hello Sagar,

On Wed, Jan 5, 2011 at 6:44 AM, sagar naik wrote:
>>> What is the configuration param to change this behavior?

mapred.reduce.slowstart.completed.maps is a property (0.20.x) that controls when the ReduceTasks start getting scheduled. Your job would still need free reduce slots for it to begin.

--
Harsh J
www.harshj.com
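For reference, the value is the fraction of map tasks that must complete before reducers are scheduled (0.05 is the usual default in 0.20, 1.0 waits for all maps). A minimal mapred-site.xml sketch, with an illustrative value:

  <!-- mapred-site.xml: schedule reducers almost immediately (illustrative value) -->
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.05</value>
  </property>

Jobs that use ToolRunner/GenericOptionsParser can also override it per run with -Dmapred.reduce.slowstart.completed.maps=... on the command line.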
Re: When does Reduce job start
As the other gentleman said. The reduce task kinda needs to know all the data is available before doing its work. By design.

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-01-04, at 6:14 PM, sagar naik wrote:
> Hi Jeff,
>
> To be clear on my end, I am not talking about the reduce() function call but
> the spawning of the reduce process/task itself.
> To rephrase: the reduce process/task is not started until 90% of the map tasks are done.
>
> -Sagar
>
> On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean wrote:
>> It's part of the design that reduce() does not get called until the map
>> phase is complete. You're seeing reduce report as started when map is at 90%
>> complete because Hadoop is shuffling data from the mappers that have
>> completed. As currently designed, you can't prematurely start reduce()
>> because there is no way to guarantee you have all the values for any key
>> until all the mappers are done. reduce() requires a key and all the values
>> for that key in order to execute.
>>
>> Jeff
>>
>> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik wrote:
>>> Hi All,
>>>
>>> number of map tasks: 1000s
>>> number of reduce tasks: single digit
>>>
>>> In such cases the reduce tasks won't start even when a few map tasks are
>>> completed.
>>> Example: in my observation of a sample run of bin/hadoop jar
>>> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
>>> of the map tasks were complete.
>>>
>>> The only reason I can think of for not starting a reduce task is to
>>> avoid the unnecessary transfer of map output data in case of failures.
>>>
>>> Is there a way to quickly start the reduce task in such a case?
>>> What is the configuration param to change this behavior?
>>>
>>> Thanks,
>>> Sagar
Re: When does Reduce job start
Hi Jeff,

To be clear on my end, I am not talking about the reduce() function call but the spawning of the reduce process/task itself.
To rephrase: the reduce process/task is not started until 90% of the map tasks are done.

-Sagar

On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean wrote:
> It's part of the design that reduce() does not get called until the map
> phase is complete. You're seeing reduce report as started when map is at 90%
> complete because Hadoop is shuffling data from the mappers that have
> completed. As currently designed, you can't prematurely start reduce()
> because there is no way to guarantee you have all the values for any key
> until all the mappers are done. reduce() requires a key and all the values
> for that key in order to execute.
>
> Jeff
>
> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik wrote:
>> Hi All,
>>
>> number of map tasks: 1000s
>> number of reduce tasks: single digit
>>
>> In such cases the reduce tasks won't start even when a few map tasks are
>> completed.
>> Example: in my observation of a sample run of bin/hadoop jar
>> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
>> of the map tasks were complete.
>>
>> The only reason I can think of for not starting a reduce task is to
>> avoid the unnecessary transfer of map output data in case of failures.
>>
>> Is there a way to quickly start the reduce task in such a case?
>> What is the configuration param to change this behavior?
>>
>> Thanks,
>> Sagar
Re: When does Reduce job start
It's part of the design that reduce() does not get called until the map phase is complete. You're seeing reduce report as started when map is at 90% complete because Hadoop is shuffling data from the mappers that have completed. As currently designed, you can't prematurely start reduce() because there is no way to guarantee you have all the values for any key until all the mappers are done. reduce() requires a key and all the values for that key in order to execute.

Jeff

On Tue, Jan 4, 2011 at 10:53 AM, sagar naik wrote:
> Hi All,
>
> number of map tasks: 1000s
> number of reduce tasks: single digit
>
> In such cases the reduce tasks won't start even when a few map tasks are
> completed.
> Example: in my observation of a sample run of bin/hadoop jar
> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
> of the map tasks were complete.
>
> The only reason I can think of for not starting a reduce task is to
> avoid the unnecessary transfer of map output data in case of failures.
>
> Is there a way to quickly start the reduce task in such a case?
> What is the configuration param to change this behavior?
>
> Thanks,
> Sagar
Re: Rngd
As it normally stands, rngd will only help (it appears) if you have a hardware RNG. You need to cheat and use entropy you don't really have. If you don't mind hacking your system, you could even do this:

  # mv /dev/random /dev/random.orig
  # ln /dev/urandom /dev/random

This makes /dev/random behave as if it were /dev/urandom (which it, strictly speaking, is after you do this). Don't let your sysadmin see you do this, of course.

On Tue, Jan 4, 2011 at 12:00 PM, Jon Lederman wrote:
> Hi,
>
> I am trying to locate the source for rngd to build on my embedded processor,
> in order to test whether my Hadoop setup is stalled due to low entropy. Do
> you know where I can find it? I thought it was part of rng-tools, but it's not.
>
> Thanks
>
> Jon
>
> Sent from my iPhone
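A less invasive variant of the same idea, if the stall really is the JVM blocking on /dev/random via SecureRandom, is to point just the Hadoop JVMs at urandom; the extra "/./" works around the JDK special-casing the plain /dev/urandom path. A sketch for hadoop-env.sh (placing it there is an assumption about your setup):

  # hadoop-env.sh: have the JVM seed SecureRandom from the non-blocking pool
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"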
Re: When does Reduce job start
On Jan 4, 2011, at 10:53 AM, sagar naik wrote:
>
> The only reason I can think of for not starting a reduce task is to
> avoid the unnecessary transfer of map output data in case of failures.

Reduce tasks also eat slots while they copy map output. On shared grids, this can be extremely bad behavior.

> Is there a way to quickly start the reduce task in such a case?
> What is the configuration param to change this behavior?

mapred.reduce.slowstart.completed.maps

See http://wiki.apache.org/hadoop/LimitingTaskSlotUsage (from the FAQ 2.12/2.13 questions).
RE: Rngd
http://sourceforge.net/projects/gkernel/files/rng-tools

rngd is in there.

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems

From: Jon Lederman [mailto:jon2...@mac.com]
Sent: Tue 1/4/2011 2:00 PM
To: common-user@hadoop.apache.org
Subject: EXTERNAL: Rngd

Hi,

I am trying to locate the source for rngd to build on my embedded processor, in order to test whether my Hadoop setup is stalled due to low entropy. Do you know where I can find it? I thought it was part of rng-tools, but it's not.

Thanks

Jon

Sent from my iPhone
Rngd
Hi,

I am trying to locate the source for rngd to build on my embedded processor, in order to test whether my Hadoop setup is stalled due to low entropy. Do you know where I can find it? I thought it was part of rng-tools, but it's not.

Thanks

Jon

Sent from my iPhone
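As a quick sanity check before hunting for rngd: the kernel reports how much entropy is available, and if the number sits near zero while the daemons appear hung, blocking reads from /dev/random are the likely culprit.

  # available entropy in bits; values near 0 suggest the pool is exhausted
  cat /proc/sys/kernel/random/entropy_avail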
When does Reduce job start
Hi All,

number of map tasks: 1000s
number of reduce tasks: single digit

In such cases the reduce tasks won't start even when a few map tasks are completed.
Example: in my observation of a sample run of bin/hadoop jar hadoop-*examples*.jar pi 1 10, the reduce did not start until 90% of the map tasks were complete.

The only reason I can think of for not starting a reduce task is to avoid the unnecessary transfer of map output data in case of failures.

Is there a way to quickly start the reduce task in such a case?
What is the configuration param to change this behavior?

Thanks,
Sagar
Re: SequenceFiles and streaming or hdfs thrift api
On Tue, Jan 4, 2011 at 10:02 AM, Marc Sturlese wrote:
> The thing is I want this file to be a SequenceFile, where the key should be
> a Text and the value a Thrift-serialized object. Is it possible to reach
> that goal?

I've done the work to support that in Java. See my patch in HADOOP-6685. It also adds seamless support for ProtocolBuffers and Avro in SequenceFiles, with arbitrary combinations of keys and values using different serializations.

-- Owen
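Until that patch is in a release you can use, a common workaround (not the HADOOP-6685 approach) is to serialize the Thrift object to bytes yourself and store the result in a BytesWritable value. A rough Java sketch, where MyRecord stands in for a generated Thrift class and the output path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class ThriftSeqFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/records.seq"); // illustrative output path

    // Serialize the Thrift object to a byte[] and wrap it in a BytesWritable
    TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
    MyRecord record = new MyRecord(); // placeholder for your generated Thrift class
    byte[] bytes = serializer.serialize(record);

    // Text key, BytesWritable value holding the Thrift payload
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class);
    try {
      writer.append(new Text("some-key"), new BytesWritable(bytes));
    } finally {
      writer.close();
    }
  }
}

Readers then deserialize the value bytes back into the Thrift class on the way out; it is not as seamless as native serializer support, but it keeps the Text-key/Thrift-value layout you describe.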
SequenceFiles and streaming or hdfs thrift api
Hey there,

I need to write a file to an HDFS cluster from PHP. I know I can do that with the HDFS Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs

The thing is, I want this file to be a SequenceFile, where the key should be a Text and the value a Thrift-serialized object. Is it possible to reach that goal?

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/SequenceFiles-and-streaming-or-hdfs-thrift-api-tp2193101p2193101.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: monit? daemontools? jsvc? something else?
I'll second this opinion. Although there are some tools in life that need to be actively managed like this (and even then, sometimes management tools can be set to be too aggressive, making a bad situation terrible), HDFS is not one.

If the JVM dies, you likely need a human brain to log in and figure out what's wrong - or just keep that node dead.

Brian

On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:
>
> On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
>> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
>> tools like monit and daemontools (and a few other ones) to revive their
>> Hadoop processes when they die.
>
> I'm not a fan of doing this for Hadoop processes, even TaskTrackers and
> DataNodes. The processes generally die for a reason, usually indicating that
> something is wrong with the box. Restarting those processes may potentially
> hide issues.
Output is null why?
My output is null. Why? Here is my Java code:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstHBaseClientRead {

  public static void main(String[] args) throws IOException {
    HBaseConfiguration config = new HBaseConfiguration();
    HTable table = new HTable(config, "Table");

    // read the row "FirstRowKey" and print the value of column F1:FirstColumn
    Get get = new Get(Bytes.toBytes("FirstRowKey"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("F1"), Bytes.toBytes("FirstColumn"));
    System.out.println(Bytes.toString(value));
  }
}

This is my test table:

hbase(main):013:0> scan 'Table'
ROW            COLUMN+CELL
 FirstRowKey    column=F1:Firstcolumn, timestamp=1294132718775, value=First Value
 FirstRowKey1   column=F1:Firstcolumn, timestamp=1294134178724, value=First Value1
 FirstRowKey1   column=F1:Firstcolumn1, timestamp=1294134197574, value=First Value1
2 row(s) in 0.1030 seconds
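One thing worth checking, going only by the scan output above: the table stores the qualifier as "Firstcolumn" (lower-case c), while the code asks for "FirstColumn" (capital C), and Result.getValue() returns null when the requested family/qualifier does not exist in the row. If that mismatch is the cause, the lookup would be:

byte[] value = result.getValue(Bytes.toBytes("F1"), Bytes.toBytes("Firstcolumn"));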
Re: Hadoop example
Hi,

It seems that you need to add your hostname/IP pair to /etc/hosts on both nodes. It also looks like you need to set up your configuration files correctly. These guides can be helpful:

http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

cheers,
esteban.

On Tue, Jan 4, 2011 at 02:38, haiyan wrote:
> I have two nodes as a Hadoop test. When I set fs.default.name to
> hdfs://hostname:54310/ in core-site.xml and mapred.job.tracker to
> hdfs://hostname:54311 in mapred-site.xml,
> I received the following error when I started it with start-all.sh.
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /home/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0
> nodes, instead of 1
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> ...
>
> Then I changed hdfs://hostname:54310/ to hdfs://ipAddress:54310/ and
> hdfs://hostname:54311 to hdfs://ipAddress:54311, and it started fine with start-all.sh.
> However, when I ran the wordcount example, I got the following error message.
>
> java.lang.IllegalArgumentException: Wrong FS:
> hdfs://ipAddress:54310/home/hadoop/tmp/mapred/system/job_201101041628_0005/job.xml,
> expected: hdfs://hostname:54310
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>   at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:745)
>   at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
>   at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
>   at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)
>
> From the messages above, it seems hdfs://hostname:port is not suitable for the example run? What should I do?
>
> Note: ipAddress means the IP address I used, hostname means the host name I used
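To make that concrete, the usual pattern is to give both nodes identical /etc/hosts entries and then use only that hostname in the configs, so the filesystem URI the clients build matches the one the cluster advertises (the "Wrong FS ... expected" error is typically exactly that mismatch). A sketch with placeholder names and addresses:

# /etc/hosts on both nodes (illustrative addresses and names)
192.168.1.10   master
192.168.1.11   slave1

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- mapred-site.xml: host:port, not an hdfs:// URI -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>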
Hadoop example
I have two nodes as a Hadoop test. When I set fs.default.name to hdfs://hostname:54310/ in core-site.xml and mapred.job.tracker to hdfs://hostname:54311 in mapred-site.xml, I received the following error when I started it with start-all.sh.

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /home/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
  at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
...

Then I changed hdfs://hostname:54310/ to hdfs://ipAddress:54310/ and hdfs://hostname:54311 to hdfs://ipAddress:54311, and it started fine with start-all.sh. However, when I ran the wordcount example, I got the following error message.

java.lang.IllegalArgumentException: Wrong FS: hdfs://ipAddress:54310/home/hadoop/tmp/mapred/system/job_201101041628_0005/job.xml, expected: hdfs://hostname:54310
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
  at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
  at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:745)
  at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
  at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
  at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)

From the messages above, it seems hdfs://hostname:port is not suitable for the example run? What should I do?

Note: ipAddress means the IP address I used, hostname means the host name I used
Re: Data for Testing in Hadoop
You can use MR to generate the data itself. Check out GridMix in Hadoop, or PigMix from Pig, for examples of general load tests.

On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma wrote:
> Dear all,
>
> Designing the architecture is very important for Hadoop in Production
> Clusters.
>
> We are researching running Hadoop on individual nodes and in a cloud
> environment (VMs).
>
> For this, I require some data for testing. Would anyone send me some links
> for data of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> I shall be grateful for this kindness.
>
> Thanks & Regards
>
> Adarsh Sharma

--
Harsh J
www.harshj.com
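To make that concrete, the example jobs bundled with Hadoop can synthesize data of roughly the sizes mentioned. A sketch, where the output paths are placeholders and the exact examples jar name varies by release:

# TeraGen writes 100-byte rows, so 100,000,000 rows is roughly 10 GB
bin/hadoop jar hadoop-*examples*.jar teragen 100000000 /benchmarks/tera-10gb

# RandomWriter writes roughly 1 GB per map and 10 maps per node by default (both configurable)
bin/hadoop jar hadoop-*examples*.jar randomwriter /benchmarks/random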