Re: input file order
Hi, Mappers run in parallel, so without a reducer it is not possible to ensure the sequence. On Fri, Jan 20, 2012 at 2:32 AM, Mapred Learn mapred.le...@gmail.com wrote: This is my question too. What if I want the output to be in the same order as the input without using reducers? Thanks, JJ Sent from my iPhone On Jan 19, 2012, at 12:19 PM, Ronald Petty ronald.pe...@gmail.com wrote: Daniel, Can you provide a concrete example of what you mean by output to be in an orderly manner? Also, what are the file sizes and types? Ron On Thu, Jan 19, 2012 at 11:19 AM, Daniel Yehdego dtyehd...@miners.utep.edu wrote: Hi, I have 100 .txt input files and I want my mapper output to be in an orderly manner. I am not using any reducer. Any idea? Regards, -- https://github.com/zinnia-phatak-dev/Nectar
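A minimal sketch of the usual workaround, assuming the 0.20+ mapreduce API: key each record by the byte offset that TextInputFormat already supplies and run a single reducer (job.setNumReduceTasks(1)), so the shuffle sort restores the original line order. The class names are illustrative, and with 100 input files you would also need the file name (or a file index) in a composite key, since offsets repeat across files.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderPreservingJob {

  public static class OffsetMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // The offset is the line's position within its input file; emitting it
      // as the key lets the shuffle/sort restore the original order.
      context.write(offset, line);
    }
  }

  public static class OrderedReducer
      extends Reducer<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(LongWritable offset, Iterable<Text> lines, Context context)
        throws IOException, InterruptedException {
      // With a single reduce task, keys arrive sorted, so output follows input order.
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }
}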
Re: Multiple linear Regression on Hadoop
Hi , Nectar already implemented Multiple Linear Regression. You can look into the code here https://github.com/zinnia-phatak-dev/Nectar . On Fri, Jan 13, 2012 at 11:24 AM, Saurabh Bajaj saurabh.ba...@mu-sigma.comwrote: Hi All, Could someone guide me how we can do a multiple linear regression on Hadoop. Mahout doesn't yet support Multiple Linear Regression. Saurabh Bajaj | Senior Business Analyst | +91 9986588089 | www.mu-sigma.comhttp://www.mu-sigma.com/ | ---Your problem isn't motivation, but execution - Peter Bregman--- This email message may contain proprietary, private and confidential information. The information transmitted is intended only for the person(s) or entities to which it is addressed. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited and may be illegal. If you received this in error, please contact the sender and delete the message from your system. Mu Sigma takes all reasonable steps to ensure that its electronic communications are free from viruses. However, given Internet accessibility, the Company cannot accept liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software. -- https://github.com/zinnia-phatak-dev/Nectar
Re: 0 tasktrackers in jobtracker but all datanodes present
Hi, 1. Stop the job tracker and task trackers. - bin/stop-mapred.sh 2. Disable namenode safemode - bin/hadoop dfsadmin -safemode leave 3. Start the job tracker and tasktrackers again - bin/start-mapred.sh On Fri, Jan 13, 2012 at 5:20 AM, Ravi Prakash ravihad...@gmail.com wrote: Courtesy Kihwal and Bobby Have you tried increasing the max heap size with -Xmx? and make sure that you have swap enabled. On Wed, Jan 11, 2012 at 6:59 PM, Gaurav Bagga gbagg...@gmail.com wrote: Hi hadoop-0.19 I have a working hadoop cluster which has been running perfectly for months. But today after restarting the cluster, at jobtracker UI its showing state INITIALIZING for a long time and is staying on the same state. The nodes in jobtracker are zero whereas all the nodes are present on the dfs. It says Safe mode is on. grep'ed on slaves and I see the tasktrackers running. In namenode logs i get the following error 2012-01-11 16:50:57,195 WARN ipc.Server - Out of Memory in server select java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:39) at java.nio.ByteBuffer.allocate(ByteBuffer.java:312) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:804) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:400) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:309) Not sure why the cluster is not coming up -G -- https://github.com/zinnia-phatak-dev/Nectar
Re: 0 tasktrackers in jobtracker but all datanodes present
Gaurav, NN memory might have hit its upper bound. As a benchmark, for every 1 million files/blocks/directories, 1GB of memory is required on the NN. The number of files in your cluster might have grown beyond this threshold. So the options left for you would be: - If there are a large number of small files, use HAR or SequenceFile to group them - Increase the NN heap Regards Bejoy KS On Mon, Apr 2, 2012 at 12:08 PM, madhu phatak phatak@gmail.com wrote: Hi, 1. Stop the job tracker and task trackers. - bin/stop-mapred.sh 2. Disable namenode safemode - bin/hadoop dfsadmin -safemode leave 3. Start the job tracker and tasktrackers again - bin/start-mapred.sh On Fri, Jan 13, 2012 at 5:20 AM, Ravi Prakash ravihad...@gmail.com wrote: Courtesy Kihwal and Bobby Have you tried increasing the max heap size with -Xmx? and make sure that you have swap enabled. On Wed, Jan 11, 2012 at 6:59 PM, Gaurav Bagga gbagg...@gmail.com wrote: Hi hadoop-0.19 I have a working hadoop cluster which has been running perfectly for months. But today after restarting the cluster, at jobtracker UI its showing state INITIALIZING for a long time and is staying on the same state. The nodes in jobtracker are zero whereas all the nodes are present on the dfs. It says Safe mode is on. grep'ed on slaves and I see the tasktrackers running. In namenode logs i get the following error 2012-01-11 16:50:57,195 WARN ipc.Server - Out of Memory in server select java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:39) at java.nio.ByteBuffer.allocate(ByteBuffer.java:312) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:804) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:400) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:309) Not sure why the cluster is not coming up -G -- https://github.com/zinnia-phatak-dev/Nectar
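A rough sketch of the SequenceFile grouping suggested above, assuming the files are small enough to read into memory one at a time; the paths and class name are examples, not from the thread. Each small file becomes one record (file name as key, raw bytes as value), so the NameNode tracks a single packed file instead of thousands of small ones.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path("/user/hduser/small-files"); // example input directory
    Path dst = new Path("/user/hduser/packed.seq");  // example packed output

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, dst, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(src)) {
        if (status.isDir()) continue;
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        // One record per small file: key = original name, value = raw bytes.
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}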
mapred.child.java.opts and mapreduce.reduce.java.opts
Hello, I have a job that requires a bit more memory than the default for the reducer (not for the mapper). So for this I have this property in my configuration file: mapreduce.reduce.java.opts=-Xmx4000m When I run the job, I can see its configuration in the web interface and I see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m but I also have mapred.child.java.opts set to -Xmx200m and when I ps -ef the java process, it is using -Xmx200m. So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in my configuration file. However I don't need that much memory for the mapper. How can I set more memory only for the mapper ? Is the only solution to set mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to -Xmx4000m and mapreduce.map.java.opts to -Xmx200m ? I am using hadoop 1.0.1. Thank you very much, Juan
Image Processing in Hadoop
Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
Re: Working with MapFiles
Hi Ondrej, Pe 30.03.2012 14:30, Ondřej Klimpera a scris: And one more question, is it even possible to add a MapFile (as it consits of index and data file) to Distributed cache? Thanks Should be no problem, they are just two files. On 03/30/2012 01:15 PM, Ondřej Klimpera wrote: Hello, I'm not sure what you mean by using map reduce setup()? If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. Can you please explain little bit more? Check the javadocs[1]: setup is called once per task so you can read the file from HDFS then or perform other initializations. [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html Reading 20 MB in ram should not be a problem and is preferred if you need to make many requests against that data. It really depends on your use case so think carefully or just go ahead and test it. Thanks On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote: Hello Ondrej, Pe 29.03.2012 18:05, Ondřej Klimpera a scris: Hello, I have a MapFile as a product of MapReduce job, and what I need to do is: 1. If MapReduce produced more spilts as Output, merge them to single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles according to their nature and use them as Distributed cache file. A MapFile is actually two files [1]: one SequanceFile (with sorted keys) and a small index for that file. The map file does a version of binary search to find your key and performs seek() to go to the byte offset in the file. What I'm trying to achieve is repeatedly fast search in this file during another MapReduce job. If my idea is absolute wrong, can you give me any tip how to do it? The file is supposed to be 20MB large. I'm using Hadoop 0.20.203. If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits. Thanks for your reply:) Ondrej Klimpera [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html -- Ioan Eugen Stan http://ieugen.blogspot.com
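A minimal sketch of the setup() approach described above, assuming Text keys and values and an example MapFile path: the MapFile.Reader is opened once per task in setup() and reused for lookups from map().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapFileLookupMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private MapFile.Reader reader;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Directory holding the MapFile's "data" and "index" parts (example path).
    reader = new MapFile.Reader(fs, "/user/hduser/lookup.map", conf);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Text found = new Text();
    // get() seeks via the in-memory index, then reads the record from the data file.
    if (reader.get(new Text(value.toString()), found) != null) {
      context.write(value, found);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (reader != null) {
      reader.close();
    }
  }
}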
Re: Working with MapFiles
Ok, thanks. I missed setup() method because of using older version of hadoop, so I suppose that method configure() does the same in hadoop 0.20.203. Now I'm able to load a map file inside configure() method to MapFile.Reader instance as a class private variable, all works fine, just wondering if the MapFile is replicated on HDFS and data are read locally, or if reading from this file will increase the network bandwidth because of getting it's data from another computer node in the hadoop cluster. Hopefully last question to bother you is, if reading files from DistributedCache (normal text file) is limited to particular job. Before running a job I add a file to DistCache. When getting the file in Reducer implementation, can it access DistCache files from another jobs? In another words what will list this command: //Reducer impl. public void configure(JobConf job) { URI[] distCacheFileUris = DistributedCache.getCacheFiles(job); } will the distCacheFileUris variable contain only URIs for this job, or for any job running on Hadoop cluster? Hope it's understandable. Thanks. On 04/02/2012 11:34 AM, Ioan Eugen Stan wrote: Hi Ondrej, Pe 30.03.2012 14:30, Ondřej Klimpera a scris: And one more question, is it even possible to add a MapFile (as it consits of index and data file) to Distributed cache? Thanks Should be no problem, they are just two files. On 03/30/2012 01:15 PM, Ondřej Klimpera wrote: Hello, I'm not sure what you mean by using map reduce setup()? If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. Can you please explain little bit more? Check the javadocs[1]: setup is called once per task so you can read the file from HDFS then or perform other initializations. [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html Reading 20 MB in ram should not be a problem and is preferred if you need to make many requests against that data. It really depends on your use case so think carefully or just go ahead and test it. Thanks On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote: Hello Ondrej, Pe 29.03.2012 18:05, Ondřej Klimpera a scris: Hello, I have a MapFile as a product of MapReduce job, and what I need to do is: 1. If MapReduce produced more spilts as Output, merge them to single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles according to their nature and use them as Distributed cache file. A MapFile is actually two files [1]: one SequanceFile (with sorted keys) and a small index for that file. The map file does a version of binary search to find your key and performs seek() to go to the byte offset in the file. What I'm trying to achieve is repeatedly fast search in this file during another MapReduce job. If my idea is absolute wrong, can you give me any tip how to do it? The file is supposed to be 20MB large. I'm using Hadoop 0.20.203. If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits. Thanks for your reply:) Ondrej Klimpera [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Re: Image Processing in Hadoop
Shreya can u please Explain your scenario . On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote: Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
Yuan Jin is out of the office.
I will be out of the office starting 04/02/2012 and will not return until 04/05/2012. I am out of the office and will reply to you when I am back.
Re: Image Processing in Hadoop
Hi Shreya, Image files are binary files. Use the SequenceFile format to store the images in HDFS and SequenceFileInputFormat to read the bytes. You can use TwoDArrayWritable to store a matrix for an image. On Mon, Apr 2, 2012 at 3:36 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Shreya can u please Explain your scenario . On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote: Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. -- https://github.com/zinnia-phatak-dev/Nectar
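A hedged sketch of reading such a SequenceFile in a map task, assuming the images were stored as (file name, raw bytes) records; the job wiring and record layout are assumptions, not something stated in the thread.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable imageBytes, Context context)
      throws IOException, InterruptedException {
    // getBytes() may return a padded buffer; getLength() gives the real size.
    byte[] image = new byte[imageBytes.getLength()];
    System.arraycopy(imageBytes.getBytes(), 0, image, 0, imageBytes.getLength());
    // ... run image-processing code on 'image' and emit whatever is needed ...
    context.write(fileName, new Text("processed " + image.length + " bytes"));
  }

  public static void configureInput(Job job) {
    // Read SequenceFile records instead of text lines.
    job.setInputFormatClass(SequenceFileInputFormat.class);
  }
}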
RE: Image Processing in Hadoop
Hi, My scenario is: There are some images of Structures (Building plans etc) that have to be stored in HDFS, If the user click on a door of that building, I want to use mapreduce to display the corresponding door image stored in HDFS and all the information related to it. In a nut shell an image has to be displayed and based on user click, need to drill down into the image Thanks and Regards, Shreya Pal -Original Message- From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com] Sent: Monday, April 02, 2012 3:36 PM To: common-user@hadoop.apache.org Subject: Re: Image Processing in Hadoop Shreya can u please Explain your scenario . On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote: Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
Re: Working with MapFiles
Hi Ondrej, Pe 02.04.2012 13:00, Ondřej Klimpera a scris: Ok, thanks. I missed setup() method because of using older version of hadoop, so I suppose that method configure() does the same in hadoop 0.20.203. Aha, if it's possible, try upgrading. I don't know how support is for versions older then hadoop 0.20 branch. Now I'm able to load a map file inside configure() method to MapFile.Reader instance as a class private variable, all works fine, just wondering if the MapFile is replicated on HDFS and data are read locally, or if reading from this file will increase the network bandwidth because of getting it's data from another computer node in the hadoop cluster. You could use a method variable instead of a class private if you load the file. If the MapFile is wrote to HDFS then yes it is replicated, and you can configure the replication factor at file creation (and later maybe). If you use DistributedCache then the files are not written in HDFS, but in mapred.local.dir [1] folder on every node. The folder size is configurable so it's possible that the data will be available there for the next MR job but don't rely on this. Please read the docs, I may get things wrong. RTFM will save you life ;). [1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata [2] https://forums.aws.amazon.com/message.jspa?messageID=152538 Hopefully last question to bother you is, if reading files from DistributedCache (normal text file) is limited to particular job. Before running a job I add a file to DistCache. When getting the file in Reducer implementation, can it access DistCache files from another jobs? In another words what will list this command: //Reducer impl. public void configure(JobConf job) { URI[] distCacheFileUris = DistributedCache.getCacheFiles(job); } will the distCacheFileUris variable contain only URIs for this job, or for any job running on Hadoop cluster? Hope it's understandable. Thanks. It's -- Ioan Eugen Stan http://ieugen.blogspot.com
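A small sketch of the DistributedCache flow discussed above, using the old-API configure() style from the question; the HDFS path is an example. The driver registers the file in the job configuration, and each task reads the node-local copy that the framework places under mapred.local.dir. Since getLocalCacheFiles() reads the submitted job's own configuration, it only returns files registered for that job.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheAwareTask extends MapReduceBase {

  // In the job driver, before submitting the job:
  public static void registerCacheFile(JobConf job) throws IOException {
    DistributedCache.addCacheFile(new Path("/user/hduser/lookup.txt").toUri(), job);
  }

  @Override
  public void configure(JobConf job) {
    try {
      // Local paths under mapred.local.dir on the node running this task.
      Path[] localCopies = DistributedCache.getLocalCacheFiles(job);
      if (localCopies != null && localCopies.length > 0) {
        BufferedReader in = new BufferedReader(new FileReader(localCopies[0].toString()));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            // load each line into an in-memory structure here
          }
        } finally {
          in.close();
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to read distributed cache file", e);
    }
  }
}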
RE: Image Processing in Hadoop
This doesn't sound like a mapreduce[1] sort of problem. Now, of course, you can store files in HDFS and retrieve them. But its up to your application to interpret them. MapReduce cannot display the corresponding door image, it is a computation scheme and performs calculations that you provide. [1] http://en.wikipedia.org/wiki/MapReduce On Mon, 2012-04-02 at 15:52 +0530, shreya@cognizant.com wrote: Hi, My scenario is: There are some images of Structures (Building plans etc) that have to be stored in HDFS, If the user click on a door of that building, I want to use mapreduce to display the corresponding door image stored in HDFS and all the information related to it. In a nut shell an image has to be displayed and based on user click, need to drill down into the image Thanks and Regards, Shreya Pal -Original Message- From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com] Sent: Monday, April 02, 2012 3:36 PM To: common-user@hadoop.apache.org Subject: Re: Image Processing in Hadoop Shreya can u please Explain your scenario . On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote: Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
RE: Image Processing in Hadoop
Ya I understand that we need to write the processing logic, what I want to know is are there any kind of APIs that can be used for image processing, Was reading about HIPI, is this the right API or webGL should be used? Any other suggestions are welcome. Thanks and Regards, Shreya -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Monday, April 02, 2012 4:47 PM To: common-user@hadoop.apache.org Subject: RE: Image Processing in Hadoop This doesn't sound like a mapreduce[1] sort of problem. Now, of course, you can store files in HDFS and retrieve them. But its up to your application to interpret them. MapReduce cannot display the corresponding door image, it is a computation scheme and performs calculations that you provide. [1] http://en.wikipedia.org/wiki/MapReduce On Mon, 2012-04-02 at 15:52 +0530, shreya@cognizant.com wrote: Hi, My scenario is: There are some images of Structures (Building plans etc) that have to be stored in HDFS, If the user click on a door of that building, I want to use mapreduce to display the corresponding door image stored in HDFS and all the information related to it. In a nut shell an image has to be displayed and based on user click, need to drill down into the image Thanks and Regards, Shreya Pal -Original Message- From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com] Sent: Monday, April 02, 2012 3:36 PM To: common-user@hadoop.apache.org Subject: Re: Image Processing in Hadoop Shreya can u please Explain your scenario . On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote: Hi, Can someone point me to some info on Image processing using Hadoop? Regards, Shreya This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
Re: mapred.child.java.opts and mapreduce.reduce.java.opts
For 1.0, the right property is mapred.reduce.child.java.opts. The mapreduce.* style would apply to MR in 2.0 and above. On Mon, Apr 2, 2012 at 3:00 PM, Juan Pino juancitomiguel...@gmail.com wrote: Hello, I have a job that requires a bit more memory than the default for the reducer (not for the mapper). So for this I have this property in my configuration file: mapreduce.reduce.java.opts=-Xmx4000m When I run the job, I can see its configuration in the web interface and I see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m but I also have mapred.child.java.opts set to -Xmx200m and when I ps -ef the java process, it is using -Xmx200m. So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in my configuration file. However I don't need that much memory for the mapper. How can I set more memory only for the mapper ? Is the only solution to set mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to -Xmx4000m and mapreduce.map.java.opts to -Xmx200m ? I am using hadoop 1.0.1. Thank you very much, Juan -- Harsh J
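A driver-side sketch of what this looks like for a single job on 1.0, using mapred.reduce.child.java.opts from the answer above and the analogous map-side property name (assumed here), so the mappers keep the smaller heap while the reducers get more.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceHeavyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.map.child.java.opts", "-Xmx200m");      // mappers stay small
    conf.set("mapred.reduce.child.java.opts", "-Xmx4000m");  // reducers get more heap
    Job job = new Job(conf, "reduce-heavy-job");
    // ... set mapper, reducer, input and output paths here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}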
Re: mapred.child.java.opts and mapreduce.reduce.java.opts
Thank you that worked! Juan On Mon, Apr 2, 2012 at 12:55 PM, Harsh J ha...@cloudera.com wrote: For 1.0, the right property is mapred.reduce.child.java.opts. The mapreduce.* style would apply to MR in 2.0 and above. On Mon, Apr 2, 2012 at 3:00 PM, Juan Pino juancitomiguel...@gmail.com wrote: Hello, I have a job that requires a bit more memory than the default for the reducer (not for the mapper). So for this I have this property in my configuration file: mapreduce.reduce.java.opts=-Xmx4000m When I run the job, I can see its configuration in the web interface and I see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m but I also have mapred.child.java.opts set to -Xmx200m and when I ps -ef the java process, it is using -Xmx200m. So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in my configuration file. However I don't need that much memory for the mapper. How can I set more memory only for the mapper ? Is the only solution to set mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to -Xmx4000m and mapreduce.map.java.opts to -Xmx200m ? I am using hadoop 1.0.1. Thank you very much, Juan -- Harsh J
Re: How can I configure oozie to submit different workflows from different users ?
Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.comwrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. But when I am able to submit the same work flow from a different user, who can submit the simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated asat org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: How can I configure oozie to submit different workflows from different users ?
How can I specify multiple users /groups for proxy user setting ? Can I give comma separated values in these settings ? Thanks, Praveenesh On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.comwrote: Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. But when I am able to submit the same work flow from a different user, who can submit the simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated asat org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: How can I configure oozie to submit different workflows from different users ?
multiple value are comma separated. keep in mind that valid values for proxyuser groups, as the property name states are GROUPS, not USERS. thxs. Alejandro On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.comwrote: How can I specify multiple users /groups for proxy user setting ? Can I give comma separated values in these settings ? Thanks, Praveenesh On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com wrote: Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. But when I am able to submit the same work flow from a different user, who can submit the simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated asat org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: How can I configure oozie to submit different workflows from different users ?
Is this a problem of proxy setting ? because after specifying the group name also, I am not able to run it. Its still giving me the same error. Thanks, Praveenesh On Mon, Apr 2, 2012 at 6:05 PM, Alejandro Abdelnur t...@cloudera.comwrote: multiple value are comma separated. keep in mind that valid values for proxyuser groups, as the property name states are GROUPS, not USERS. thxs. Alejandro On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.com wrote: How can I specify multiple users /groups for proxy user setting ? Can I give comma separated values in these settings ? Thanks, Praveenesh On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com wrote: Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. But when I am able to submit the same work flow from a different user, who can submit the simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated asat org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'
Jay, What does your job do? Create files directly on HDFS? If so, do you follow this method?: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F A local filesystem may not complain if you re-create an existent file. HDFS' behavior here is different. This simple Python test is what I mean: a = open('a', 'w') a.write('f') b = open('a', 'w') b.write('s') a.close(), b.close() open('a').read() 's' Hence it is best to use the FileOutputCommitter framework as detailed in the mentioned link. On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I have a map reduce job that runs normally on local file system from eclipse, *but* it fails on HDFS running in psuedo distributed mode. The exception I see is *org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:* Any thoughts on why this might occur in psuedo distributed mode, but not in regular file system ? -- Harsh J
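A sketch of the FAQ recommendation above, written against the old mapred API; the side-file name and record types are illustrative. Extra files are created under the task attempt's work directory, so a retried or speculative attempt cannot collide with another attempt on the same HDFS path, and the output committer promotes only the successful attempt's files.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SideFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;

  @Override
  public void configure(JobConf job) {
    this.conf = job;
  }

  @Override
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Work directory unique to this task attempt; safe from attempt collisions.
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);
    Path sideFile = new Path(workDir, "side-" + key.toString());
    FileSystem fs = sideFile.getFileSystem(conf);
    FSDataOutputStream out = fs.create(sideFile, false); // fail instead of overwrite
    try {
      while (values.hasNext()) {
        out.writeBytes(values.next().toString() + "\n");
      }
    } finally {
      out.close();
    }
  }
}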
Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'
No, my job does not write files directly to disk. It simply goes to some web pages , reads data (in the reducer phase), and parses jsons into thrift objects which are emitted via the standard MultipleOutputs API to hdfs files. Any idea why hadoop would throw the AlreadyBeingCreatedException ? On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote: Jay, What does your job do? Create files directly on HDFS? If so, do you follow this method?: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F A local filesystem may not complain if you re-create an existent file. HDFS' behavior here is different. This simple Python test is what I mean: a = open('a', 'w') a.write('f') b = open('a', 'w') b.write('s') a.close(), b.close() open('a').read() 's' Hence it is best to use the FileOutputCommitter framework as detailed in the mentioned link. On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I have a map reduce job that runs normally on local file system from eclipse, *but* it fails on HDFS running in psuedo distributed mode. The exception I see is *org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:* Any thoughts on why this might occur in psuedo distributed mode, but not in regular file system ? -- Harsh J -- Jay Vyas MMSB/UCHC
HADOOP_OPTS to tasks
hi all, is it normal that HADOOP_OPTS are not passed to the actual tasks (ie the java processes running as child of tasktracker)? the tasktracker process uses them correctly. is there a way to set general java options for each started task? many thanks, stijn
Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'
Jay, Without seeing the whole stack trace all I can say as cause for that exception from a job is: 1. You're using threads and the API components you are using isn't thread safe in your version of Hadoop. 2. Files are being written out to HDFS directories without following the OC rules. (This is negated, per your response). On Mon, Apr 2, 2012 at 7:35 PM, Jay Vyas jayunit...@gmail.com wrote: No, my job does not write files directly to disk. It simply goes to some web pages , reads data (in the reducer phase), and parses jsons into thrift objects which are emitted via the standard MultipleOutputs API to hdfs files. Any idea why hadoop would throw the AlreadyBeingCreatedException ? On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote: Jay, What does your job do? Create files directly on HDFS? If so, do you follow this method?: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F A local filesystem may not complain if you re-create an existent file. HDFS' behavior here is different. This simple Python test is what I mean: a = open('a', 'w') a.write('f') b = open('a', 'w') b.write('s') a.close(), b.close() open('a').read() 's' Hence it is best to use the FileOutputCommitter framework as detailed in the mentioned link. On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I have a map reduce job that runs normally on local file system from eclipse, *but* it fails on HDFS running in psuedo distributed mode. The exception I see is *org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:* Any thoughts on why this might occur in psuedo distributed mode, but not in regular file system ? -- Harsh J -- Jay Vyas MMSB/UCHC -- Harsh J
Re: HADOOP_OPTS to tasks
HADOOP_OPTS isn't applied for Task JVMs. For Task JVMs, set mapred.child.java.opts in mapred-site.xml (Or via Configuration for per-job tuning), to the opts string you want it to have. For example -Xmx200m -Dsomesysprop=abc. On Mon, Apr 2, 2012 at 7:47 PM, Stijn De Weirdt stijn.dewei...@ugent.be wrote: hi all, is it normal that HADOOP_OPTS are not passed to the actual tasks (ie the java processes running as child of tasktracker)? the tasktracker process uses them correctly. is there a way to set general java options for each started task? many thanks, stijn -- Harsh J
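For completeness, the per-job equivalent of that mapred-site.xml setting; the opts string is only an example.

import org.apache.hadoop.conf.Configuration;

public class TaskOptsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Applied to every map and reduce task JVM launched for this job.
    conf.set("mapred.child.java.opts", "-Xmx512m -Dsome.sys.prop=abc");
    // ... build and submit the Job with this conf ...
  }
}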
Re: getting NullPointerException while running Word count example
Can some one please look in to below issue ?? Thanks in Advance On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale sujitdhamal...@gmail.comwrote: Hadoop version : hadoop-0.20.203.0rc1.tar Operaring Syatem : ubuntu 11.10 On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote: Hi Sujit, Please also tell us which version/distribution of Hadoop is this? On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, I am new to Hadoop., i install Hadoop as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluste while running Word cont example i am getting *NullPointerException *can some one please look in to this issue ?* *Thanks in Advance* !!! * duser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data Found 3 items -rw-r--r-- 1 hduser supergroup 674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt -rw-r--r-- 1 hduser supergroup1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt -rw-r--r-- 1 hduser supergroup1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/data /user/hduser/gutenberg-outputd 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002 12/03/06 23:14:34 INFO mapred.JobClient: map 0% reduce 0% 12/03/06 23:14:49 INFO mapred.JobClient: map 66% reduce 0% 12/03/06 23:14:55 INFO mapred.JobClient: map 100% reduce 0% 12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED Error: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820) 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20 12/03/06 23:15:31 INFO mapred.JobClient: Job Counters 12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3 12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3 12/03/06 23:15:31 INFO 
mapred.JobClient: Failed reduce tasks=1 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799 12/03/06 23:15:31 INFO mapred.JobClient: FileSystemCounters 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520 12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287 12/03/06 23:15:31 INFO mapred.JobClient: File Input Format Counters 12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517 12/03/06 23:15:31 INFO mapred.JobClient: Map-Reduce Framework 12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized bytes=1474341 12/03/06 23:15:31 INFO mapred.JobClient: Combine output records=102322 12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932 12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640 12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095 12/03/06 23:15:31 INFO mapred.JobClient: Combine input records=629172 12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172 12/03/06 23:15:31 INFO mapred.JobClient:
Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'
Thanks J : just curious about how you came to hypothesize (1) (i.e. regarding the fact that threads and the API componentns arent thread safe in my hadoop version). I think thats a really good guess, and I would like to be able to make those sorts of intelligent hypotheses myself. Any reading you can point me to for further enlightement ? On Mon, Apr 2, 2012 at 3:16 PM, Harsh J ha...@cloudera.com wrote: Jay, Without seeing the whole stack trace all I can say as cause for that exception from a job is: 1. You're using threads and the API components you are using isn't thread safe in your version of Hadoop. 2. Files are being written out to HDFS directories without following the OC rules. (This is negated, per your response). On Mon, Apr 2, 2012 at 7:35 PM, Jay Vyas jayunit...@gmail.com wrote: No, my job does not write files directly to disk. It simply goes to some web pages , reads data (in the reducer phase), and parses jsons into thrift objects which are emitted via the standard MultipleOutputs API to hdfs files. Any idea why hadoop would throw the AlreadyBeingCreatedException ? On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote: Jay, What does your job do? Create files directly on HDFS? If so, do you follow this method?: http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F A local filesystem may not complain if you re-create an existent file. HDFS' behavior here is different. This simple Python test is what I mean: a = open('a', 'w') a.write('f') b = open('a', 'w') b.write('s') a.close(), b.close() open('a').read() 's' Hence it is best to use the FileOutputCommitter framework as detailed in the mentioned link. On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I have a map reduce job that runs normally on local file system from eclipse, *but* it fails on HDFS running in psuedo distributed mode. The exception I see is *org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:* Any thoughts on why this might occur in psuedo distributed mode, but not in regular file system ? -- Harsh J -- Jay Vyas MMSB/UCHC -- Harsh J -- Jay Vyas MMSB/UCHC
Re: HBase bulk loader doing speculative execution when it set to false in mapred-site.xml
+common-user@hadoop.apache.org Hi Harsh, Thanks for the information. Is there any way to differentiate between a client side property and server-side property?or a Document which enlists whether a property is server or client-side? Many times i have to speculate over this and try out test runs. Thanks, Anil On Fri, Mar 30, 2012 at 9:54 PM, Harsh J ha...@cloudera.com wrote: Anil, You can also disable speculative execution on a per-job basis. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setMapSpeculativeExecution(boolean) (Which is why it is called a client-sided property - it applies per-job). If HBase strongly recommends turning it off, HBase should also, by default, turn it off for its own offered jobs? On Sat, Mar 31, 2012 at 4:02 AM, anil gupta anilg...@buffalo.edu wrote: Hi Doug, Yes, that's why i had set that property as false in my mapred-site.xml. But, to my surprise i didnt know that setting that property would be useless for Hadoop jobs unless the mapred-site.xml is in classpath. The idea of client side property is a little confusing to me at present since there is no proper nomenclature for client side properties at present. Thanks for your reply. ~Anil On Fri, Mar 30, 2012 at 3:26 PM, Doug Meil doug.m...@explorysmedical.comwrote: Speculative execution is on by default in Hadoop. One of the Performance recommendations in the Hbase RefGuide is to turn it off. On 3/30/12 6:12 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Well that's not an HBase configuration, that's Hadoop. I'm not sure if this is listed anywhere, maybe in the book. BTW usually HBase has a client somewhere in the same to indicate it's client side. J-D On Fri, Mar 30, 2012 at 3:08 PM, anil gupta anilg...@buffalo.edu wrote: Thanks for the quick reply, Jean. Is there any link where i can find the name of all client-side configuration for HBase? ~Anil On Fri, Mar 30, 2012 at 3:01 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: This is a client-side configuration so if your mapred-site.xml is _not_ on your classpath when you start the bulk load, it's not going to pick it up. So either have that file on your classpath, or put it in whatever other configuration file you have. J-D On Fri, Mar 30, 2012 at 2:52 PM, anil gupta anilgupt...@gmail.com wrote: Hi All, I am using cdh3u2. I ran HBase bulk loading with property mapred.reduce.tasks.speculative.execution set to false in mapred-site.xml. Still, i can see 6 killed task in Bulk Loading job and after short analysis i realized that these jobs are killed because another worker node completed the task, hence it means that speculative execution is still on. Why the HBase Bulk loader is doing speculative execution when i have set it to false in mapred-site.xml? Please let me know if i am missing something over here. -- Thanks Regards, Anil Gupta -- Thanks Regards, Anil Gupta -- Thanks Regards, Anil Gupta -- Harsh J -- Thanks Regards, Anil Gupta
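A client-side sketch of the per-job switch discussed above, which avoids depending on whichever mapred-site.xml happens to be on the submitting client's classpath; the job name and HBase-specific wiring are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same properties as in mapred-site.xml, but set per job on the client side.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = new Job(conf, "hbase-bulk-load");
    // ... HFileOutputFormat / mapper configuration would go here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}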
Re: HADOOP_OPTS to tasks
On 04/02/2012 04:18 PM, Harsh J wrote: HADOOP_OPTS isn't applied for Task JVMs. For Task JVMs, set mapred.child.java.opts in mapred-site.xml (Or via Configuration for per-job tuning), to the opts string you want it to have. For example -Xmx200m -Dsomesysprop=abc. thanks! stijn On Mon, Apr 2, 2012 at 7:47 PM, Stijn De Weirdtstijn.dewei...@ugent.be wrote: hi all, is it normal that HADOOP_OPTS are not passed to the actual tasks (ie the java processes running as child of tasktracker)? the tasktracker process uses them correctly. is there a way to set general java options for each started task? many thanks, stijn
data distribution in HDFS
hi all, I've just started to play around with HDFS+mapred, and I'm currently running teragen/sort/validate to see if I understand it all. The test setup involves 5 nodes that are all tasktracker and datanode, one of which is also jobtracker and namenode on top of that (this one node runs both the namenode process and a datanode process). When I do the teragen run, the data is not distributed equally over all nodes: the node that is also the namenode gets a bigger portion of the data (as seen by df on the nodes and by dfsadmin -report). I also got this distribution when I ran the TestDFSIO write test (50 files of 1GB). I use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so I expect 100M * 0.1kB = 10GB of data (if I add up the volumes in use by HDFS, it's actually quite a bit more). 4 datanodes are using 4.2-4.8GB each, and the data+namenode has 9.4GB in use, so this one datanode behaves like 2 nodes. When I do an ls on the filesystem, I see that teragen created 250MB files; the current HDFS blocksize is 64MB. Is there a reason why one datanode is preferred over the others? It is annoying since the terasort output behaves the same way, and I can't use the full HDFS space for testing. Also, since more IO goes to this one node, the performance isn't really balanced. many thanks, stijn
Re: data distribution in HDFS
Stijn, The first block of the data , is always stored in the local node. Assuming that you had a replication factor of 3, the node that generates the data will get about 10GB of data and the other 20GB will be distributed among other nodes. Raj From: Stijn De Weirdt stijn.dewei...@ugent.be To: common-user@hadoop.apache.org Sent: Monday, April 2, 2012 9:54 AM Subject: data distribution in HDFS hi all, i'm just started to play around with hdfs+mapred. i'm currently playing with teragen/sort/validate to see if i understand all. the test setup involves 5 nodes that all are tasktracker and datanode (and one node that is also jobtracker and namenode on top of that. (this one node is running both the namenode hadoop process as the datanode process) when i do the in teragen run, the data is not distributed equally over all nodes. the node that is also namenode, get's a bigger portion of all the data. (as seen by df on the nodes and by using dsfadmin -report) i also get this distribution when i ran the TestDFSIO write test (50 files of 1GB) i use basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's actually quite a bit more.) 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so this one datanode is seen as 2 nodes. when i do ls on the filesystem, i see that teragen created 250MB files, the current hdfs blocksize is 64MB. is there a reason why one datanode is preferred over the others. it is annoying since the terasort output behaves the same, and i can't use the full hdfs space for testing that way. also, since more IO comes to this one node, the performance isn't really balanced. many thanks, stijn
Re: Getting RemoteException: while copying data from Local machine to HDFS
Per your jps, you don't have a DataNode running. hduser@sujit:~/Desktop/data$ jps 6022 NameNode 7100 Jps 6569 JobTracker 6798 TaskTracker 6491 SecondaryNameNode Please read http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo to solve this. You most likely need to also read: http://search-hadoop.com/m/l4JWggvLE2 On Mon, Apr 2, 2012 at 10:58 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Getting RemoteException: while copying data from Local machine to HDFS Hadoop version : hadoop-0.20.203.0rc1.tar Operating System : Ubuntu 11.10 hduser@sujit:~/Desktop/data$ jps 6022 NameNode 7100 Jps 6569 JobTracker 6798 TaskTracker 6491 SecondaryNameNode hduser@sujit:~/Desktop/data$ hduser@sujit:~/Desktop/data$ ls pg20417.txt pg4300.txt pg5000.txt hduser@sujit:~/Desktop/hadoop/bin$ hadoop dfs -copyFromLocal /home/hduser/Desktop/data /user/hduser/data 12/04/02 22:51:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377) at org.apache.hadoop.ipc.Client.call(Client.java:1030) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3104) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2975) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446) 12/04/02 22:51:37 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 12/04/02 22:51:37 WARN hdfs.DFSClient: Could not get block locations. Source file /user/hduser/data/pg20417.txt - Aborting...
copyFromLocal: java.io.IOException: File /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead of 1 12/04/02 22:51:37 ERROR hdfs.DFSClient: Exception closing file /user/hduser/data/pg20417.txt : org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377) org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
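For reference, a minimal sequence for bringing the missing DataNode back and re-checking (run from the Hadoop install directory; the log file name depends on the user and hostname, so the one below is only indicative):

bin/hadoop-daemon.sh start datanode
jps                                                  # DataNode should now appear in the list
tail -n 100 logs/hadoop-hduser-datanode-sujit.log    # if it dies again, the reason is usually in here

A frequent culprit, especially after reformatting the namenode, is an "Incompatible namespaceIDs" error in that datanode log.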
Re: data distribution in HDFS
hi raj, what is a local node? is it relative to the tasks that are started? stijn
Re: data distribution in HDFS
thanks serge. is there a way to disable this feature (i.e. place the first block always on the local node)? and is this because the local node is a datanode, or is there always a local node for data transfers? many thanks, stijn Local node is the node you are copying the data from, if, let's say, you are using the -copyFromLocal option. Regards Serge
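To make Serge's description concrete (a sketch only; the gateway host and file name are hypothetical): if the same copy is issued from a machine that runs no DataNode, the write-local shortcut cannot apply and the namenode itself picks the datanode for the first replica:

# run from a client/gateway machine that is NOT a datanode
hadoop fs -copyFromLocal ./testdata.dat /benchmarks/testdata.dat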
Re: data distribution in HDFS
AFAIK there is no way to disable this feature. It is an optimization, and it happens because, in your case, the node generating the data is also a data node. Raj
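If the aim is just to even out the existing data rather than change the placement policy, the stock balancer can be run afterwards (a sketch; the threshold is the allowed deviation from average utilization, in percent):

bin/start-balancer.sh -threshold 5
# or in the foreground, to watch the progress:
bin/hadoop balancer -threshold 5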
Compression codec org.apache.hadoop.io.compress.DeflateCodec not found.
Hi Folks, A coworker of mine recently set up a new CDH3 cluster with 4 machines (3 data nodes, one namenode that doubles as a jobtracker). I started looking through it using hadoop fs -ls, and that went fine with everything displaying alright. Next, I decided to test out some simple pig jobs. Each of these worked fine on my development pseudo cluster, but failed on the new CDH3 cluster with the exact same error: *java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DeflateCodec not found.* This also only happened when trying to process .gz files, and it happened even when I just tried to load and dump one. I figured this could be a problem with compression configs being manually overwritten in core-site.xml, but that file didn't have any mention of compression on any of the boxes in the CDH3 cluster. I looked at each box individually, and all the proper jars seem to be there, so now I'm at a bit of a loss. Any ideas what the problem could be? Eli
Re: Compression codec org.apache.hadoop.io.compress.DeflateCodec not found.
Hi Eli, Moving this to cdh-u...@cloudera.org as it's a CDH-specific question. You'll get better answers from the community there. You are CC'd, but to subscribe to the CDH users community, head to https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user. I've bcc'd common-user@ here. What you may be hitting here is caused by a version mismatch between client and server. See https://ccp.cloudera.com/display/CDHDOC/Known+Issues+and+Work+Arounds+in+CDH3#KnownIssuesandWorkAroundsinCDH3-Pig (Point #2, but it may not be just Pig/Hive-specific). -- Harsh J
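If it turns out to be a codec registration issue rather than pure version skew, one thing worth comparing between the pseudo cluster and the new cluster is the io.compression.codecs setting (typically in core-site.xml). A sketch of an entry that includes the codec from the error message, assuming the standard Hadoop codec classes are on the classpath:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec</value>
</property>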