Re: Reg HttpFS
Manoj, please look at http://hadoop.apache.org/docs/r2.4.0/hadoop-hdfs-httpfs/httpfs-default.html and check the 'httpfs.authentication.*' properties. Thanks. On Sun, May 4, 2014 at 5:27 AM, Manoj Babu manoj...@gmail.com wrote: Hi, How do I access files in HDFS using HttpFS when it is protected by Kerberos? Kerberos authentication works only where it is configured, e.g., on the edge node. If I am triggering a request from another system, how do I authenticate? Kindly advise. Cheers! Manoj. -- Alejandro
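One common answer to Manoj's question (an assumption here, not stated in the thread) is to fetch a delegation token once from a Kerberized node and then present that token from other systems. A minimal sketch of the URLs involved; the host, port, renewer, and token values are hypothetical, and only the URL construction is exercised:

```java
// Sketch: HttpFS/WebHDFS delegation-token flow for hosts without Kerberos.
// Step 1 (from a Kerberized node, via SPNEGO): fetch a delegation token.
// Step 2 (from any host): pass the token in the 'delegation' query parameter.
public class HttpFsUrls {
    static String getTokenUrl(String host, int port, String renewer) {
        return String.format("http://%s:%d/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=%s",
                host, port, renewer);
    }

    static String openWithToken(String host, int port, String path, String token) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=OPEN&delegation=%s",
                host, port, path, token);
    }

    public static void main(String[] args) {
        // 14000 is the default HttpFS port; host, user, and token are made up.
        System.out.println(getTokenUrl("httpfs.example.com", 14000, "manoj"));
        System.out.println(openWithToken("httpfs.example.com", 14000,
                "/user/manoj/file.txt", "ABC123"));
    }
}
```

Requests carrying a valid `delegation` parameter do not need Kerberos credentials on the calling host, which is why the token only has to be obtained once from the edge node.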
Re: hadoop 2.0 upgrade to 2.4
Motty, https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/CDH5-Installation-Guide.html provides instructions to upgrade from CDH4 to CDH5 (which bundles Hadoop 2.3.0). If your intention is to use CDH5, that should help you. If you have further questions about it, the right alias to use is cdh-u...@cloudera.org. If your intention is to use Apache Hadoop 2.4.0, some of the CDH documentation above may still be relevant. Thanks. On Thu, Apr 10, 2014 at 12:20 PM, motty cruz motty.c...@gmail.com wrote: Hi All, I currently have a Hadoop 2.0 cluster in production and I want to upgrade to the latest release. Current version: [root@doop1 ~]# hadoop version Hadoop 2.0.0-cdh4.6.0 The cluster has the following services: hbase, hive, hue, impala, mapreduce, oozie, sqoop, zookeeper. Can someone point me to a howto for upgrading from Hadoop 2.0 to Hadoop 2.4.0? Thanks in advance, -- Alejandro
Re: Data Locality and WebHDFS
I may have expressed myself wrong. You don't need to run any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks. On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote: Thank you, Mingjiang and Alejandro. This is interesting. Since we will use the data locality information for scheduling, we could hack this to get the data locality information, at least for the first block. As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block? Interesting food for thought! I see some experiments in my future! Thanks! On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com wrote: well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. for small files (one block) you'll get locality, for large files only the first block, and by chance if other blocks are local to that datanode. Alejandro (phone typing) On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote: According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/ *Data Locality*: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data. *An HDFS Built-in Component*: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore it can use all HDFS functionalities. It is a part of HDFS - there are no additional servers to install. So it looks like data locality is built into webhdfs; the client will be redirected to the data node automatically. 
On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework. We're interested in using WebHDFS. I have two questions: 1) Does WebHDFS allow querying data locality information? 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS? Or do all Web HDFS requests have to go through a single server? Thanks, RJ -- em rnowl...@gmail.com c 954.496.2314 -- Cheers -MJ -- em rnowl...@gmail.com c 954.496.2314 -- Alejandro
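RJ's question about offsets can be made concrete with a little block arithmetic. A sketch; the host, port, and path are hypothetical, and 128 MB is the Hadoop 2.x default dfs.blocksize:

```java
// Sketch: which block a WebHDFS byte-range request starts in, and the OPEN
// URL that carries the offset. Per the thread, locality is only assured for
// the block the namenode's redirect targets.
public class WebHdfsOffsets {
    static long blockIndex(long offset, long blockSize) {
        return offset / blockSize; // block containing the first requested byte
    }

    static String openUrl(String host, int port, String path, long offset, long length) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=OPEN&offset=%d&length=%d",
                host, port, path, offset, length);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        System.out.println(blockIndex(0, blockSize));              // 0: first block
        System.out.println(blockIndex(blockSize + 1, blockSize));  // 1: starts in second block
        System.out.println(openUrl("nn.example.com", 50070, "/data/f.bin",
                blockSize + 1, 4096));
    }
}
```

The experiment the thread proposes is exactly this: issue an OPEN whose offset falls in the second block and observe which datanode the namenode redirects to.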
Re: Data Locality and WebHDFS
don't recall how skips are handled in webhdfs, but I would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open time that you'll skip). Alejandro (phone typing) On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote: Hi Alejandro, The WebHDFS API allows specifying an offset and length for the request. If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing? If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality. Thank you, RJ On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com wrote: I may have expressed myself wrong. You don't need to run any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks. -- Alejandro
Re: Need YARN - Test job.
Gaurav, [BCCing user@h.a.o to move the thread out of it] It seems you are missing some step when reconfiguring your cluster in Cloudera Manager; you should not have to modify things by hand in your setup. Adding the Cloudera Manager user alias. Thanks. On Fri, Jan 3, 2014 at 7:11 AM, Gaurav Shankhdhar shankhdhar.gau...@gmail.com wrote: Yes, I am setting it but it still hangs there. Also there are no failure logs; it just hangs without erroring out. Any log location you want me to look at? On Friday, January 3, 2014 8:38:16 PM UTC+5:30, Gunnar Tapper wrote: Hi Gaurav, What are you setting HADOOP_CONF_DIR to? IME, you get the hang if you don't set it as: export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn1 Gunnar On Fri, Jan 3, 2014 at 6:47 AM, Gaurav Shankhdhar shankhdh...@gmail.com wrote: Folks, I am trying to run the teragen program using the YARN framework in CDH 4.5, but no luck. What I have figured out till now: 1. By default the Demo VM runs MR in MRv1, using the JobTracker and TaskTracker. 2. You can run either MRv1 or MRv2, but not both. 3. I disabled MRv1 and configured YARN by increasing the alternatives priority for YARN, and deployed the Client Configuration (using Cloudera Manager). 4. Submitted the teragen program and can see the YARN framework in action from the Web UIs. 5. But it hangs at Map 0% and Reduce 0%. Tried multiple options to no avail. Any step-by-step guide or doc to execute MR in MRv2 (YARN) mode in CDH 4.5? Attached doc with screenshots. Reference: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Managing-Clusters/cmmc_adding_YARN_MRv2.html Regards Gaurav On Friday, January 3, 2014 3:33:44 AM UTC+5:30, Dhanasekaran Anbalagan wrote: Hi Guys, we recently installed CDH5 on our test cluster. We need a test job for the YARN framework. We are able to run a mapreduce job successfully on YARN without code changes, but we need to test YARN job functionality. Can you please guide me? 
We tried https://github.com/hortonworks/simple-yarn-app but it didn't help us. tech@dvcloudlab231:~$ hadoop jar simple-yarn-app-1.0-SNAPSHOT.jar com.hortonworks.simpleyarnapp.Client /bin/date 2 /apps/simple/simple-yarn-app-1.0-SNAPSHOT.jar 14/01/02 16:49:05 INFO client.RMProxy: Connecting to ResourceManager at dvcloudlab231/192.168.70.231:8032 Submitting application application_1388687890867_0007 14/01/02 16:49:05 INFO impl.YarnClientImpl: Submitted application application_1388687890867_0007 to ResourceManager at dvcloudlab231/192.168.70.231:8032 Application application_1388687890867_0007 finished with state FAILED at 1388699348344 Note: On the Resource Manager node I don't see any container logs. -Dhanasekaran. Did I learn something today? If not, I wasted it. -- --- You received this message because you are subscribed to the Google Groups CDH Users group. To unsubscribe from this group and stop receiving emails from it, send an email to cdh-user+u...@cloudera.org. For more options, visit https://groups.google.com/a/cloudera.org/groups/opt_out. -- Thanks, Gunnar *If you think you can you can, if you think you can't you're right.* -- Alejandro
Re: Time taken for starting AMRMClientAsync
Hi Krishna, Are you starting all AMs from the same JVM? Mind sharing the code you are using for your time testing? Thx On Thu, Nov 21, 2013 at 6:11 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I have modified the code in hadoop-2.2.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java to submit multiple application masters one after another, and I am still seeing 800 to 900 ms being taken by the start() call on AMRMClientAsync in all of those applications. Please suggest if you think I am missing something else. Thanks, Kishore On Tue, Nov 19, 2013 at 6:07 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I don't know what managed and unmanaged AMs are; can you please explain the difference and how each of them is launched? I tried to google these terms and came across hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.jar, is it related to that? Thanks, Kishore On Tue, Nov 19, 2013 at 12:15 AM, Alejandro Abdelnur t...@cloudera.com wrote: Kishore, Also, please specify if you are using managed or unmanaged AMs (the numbers I've mentioned before are using unmanaged AMs). thx On Sun, Nov 17, 2013 at 11:16 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: It is just creating a connection to the RM and shouldn't take that long. Can you please file a ticket so that we can look at it? JVM class loading overhead is one possibility but 1 sec is a bit too much. Thanks, +Vinod On Oct 21, 2013, at 7:16 AM, Krishna Kishore Bonagiri wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? 
I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. -- Alejandro
Re: Time taken for starting AMRMClientAsync
Krishna, Well, it all depends on your use case. In the case of Llama, Llama is a server that hosts multiple unmanaged AMs, thus all AMs run in the same process. Thanks. On Mon, Nov 25, 2013 at 6:40 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I don't start all the AMs from the same JVM. How can I do that? Also, would that save me the time taken to get an AM started? That is also something in which I would like to see an improvement. And, would this also save me the time taken for connecting from the AM to the Resource Manager? Thanks, Kishore On Tue, Nov 26, 2013 at 3:45 AM, Alejandro Abdelnur t...@cloudera.com wrote: Hi Krishna, Are you starting all AMs from the same JVM? Mind sharing the code you are using for your time testing? Thx -- Alejandro
Re: Time taken for starting AMRMClientAsync
Kishore, Also, please specify if you are using managed or unmanaged AMs (the numbers I've mentioned before are using unmanaged AMs). thx On Sun, Nov 17, 2013 at 11:16 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: It is just creating a connection to the RM and shouldn't take that long. Can you please file a ticket so that we can look at it? JVM class loading overhead is one possibility but 1 sec is a bit too much. Thanks, +Vinod On Oct 21, 2013, at 7:16 AM, Krishna Kishore Bonagiri wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore -- Alejandro
Re: Time taken for starting AMRMClientAsync
Hi Krishna, Those 900 ms seem consistent with the numbers we found while doing some benchmarks in the context of Llama: http://cloudera.github.io/llama/ We found that the first application master created from a client process takes around 900 ms to be ready to submit resource requests. Subsequent application masters created from the same client process take a mean of 20 ms. The application master submission throughput (discarding the first submission) tops out at approximately 100 application masters per second. I believe there is room for improvement there. Cheers On Mon, Oct 21, 2013 at 7:16 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore -- Alejandro
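A sketch of the kind of timing harness being asked about in this thread; the AMRMClientAsync calls are commented out since they need a live ResourceManager, so only the measurement helper runs here:

```java
import java.util.concurrent.TimeUnit;

// Sketch: measure AMRMClientAsync.start() latency across successive AMs in
// one JVM. The commented YARN calls mirror the snippet quoted in the thread;
// per the Llama numbers above, only the first start() should be slow.
public class StartTimer {
    static long measureMs(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
    }

    public static void main(String[] args) {
        long ms = measureMs(() -> {
            // amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
            // amRMClient.init(conf);
            // amRMClient.start();
        });
        System.out.println("start() took " + ms + " ms");
    }
}
```

Wrapping each AM's create/init/start in measureMs and printing the per-AM numbers would show whether the 800-900 ms cost repeats or only hits the first AM in the process.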
Re: Cloudera links and Document
Satish, the right alias for Cloudera Manager questions is scm-us...@cloudera.org. Thanks On Thu, Jul 11, 2013 at 9:20 AM, Suresh Srinivas sur...@hortonworks.com wrote: Sathish, this mailing list is for Apache Hadoop related questions. Please post questions related to other distributions to the appropriate vendor's mailing list. On Thu, Jul 11, 2013 at 6:28 AM, Sathish Kumar sa848...@gmail.com wrote: Hi All, Can anyone point me to a link or document that explains the below? How does Cloudera Manager work and handle the clusters (Agent and Master Server)? How does the Cloudera Manager process flow work? Where can I locate the Cloudera configuration files, with a brief explanation? Regards Sathish -- http://hortonworks.com/download/ -- Alejandro
Re: Requesting containers on a specific host
John, This feature will be available in the upcoming 2.1.0-beta release. The first release candidate (RC) has been cut, but it seems a new RC will be needed. The exact release date is still not known, but it should be soon. thanks. On Thu, Jul 4, 2013 at 2:43 PM, John Lilley john.lil...@redpoint.net wrote: Arun, I don't know how to interpret the release schedule from the JIRA. It says that the patch targets 2.1.0 and is checked into the trunk; does that mean it is likely to be rolled into the first Hadoop 2 GA, or will it have to wait for another cycle? Thanks, John From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Thursday, July 04, 2013 6:28 AM To: user@hadoop.apache.org Subject: Re: Requesting containers on a specific host To guarantee containers on a specific node you need to use the whitelist feature we added recently: https://issues.apache.org/jira/browse/YARN-398 Arun On Jul 4, 2013, at 3:14 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: I could get containers on specific nodes using addContainerRequest() on AMRMClient. But there are issues with it. I have two nodes, node1 and node2, in my cluster. And my Application Master is trying to get 3 containers on node1, and 3 containers on node2, in that order. While trying to request on node1, it sometimes gives me those on node2, and vice versa. When I get a container on a different node than the one I need, I release it and make a fresh request. I am having to do that forever to get a container on the node I need. Though the node I am requesting has enough resources, why does it keep giving me containers on the other node? How can I make sure I get a container on the node I want? Note: I am using the default scheduler, i.e. the Capacity Scheduler. Thanks, Kishore On Fri, Jun 21, 2013 at 7:25 PM, Arun C Murthy a...@hortonworks.com wrote: Check if the hostname you are setting is the same as in the RM logs... On Jun 21, 2013, at 2:15 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am trying to get a container on a specific host, using the setHostName() call on ResourceRequest, but I could not get anything allocated, whereas it works fine when I change the node name to *. I am working on a single node cluster, and I am giving the name of the single node I have in my cluster. Is there any specific format that I need to use for setHostName()? Why is it not working? Thanks, Kishore -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ -- Alejandro
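The whitelist feature referenced in this thread (YARN-398, in 2.1.0-beta and later) is exposed through the relaxLocality flag on ContainerRequest. A cluster-dependent sketch, not runnable standalone; the node name, container sizes, and the surrounding amRMClient instance are assumptions for illustration:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// With relaxLocality=false the scheduler may only place the container on the
// listed nodes, instead of falling back to other nodes as Kishore observed.
Resource capability = Resource.newInstance(1024, 1);  // 1 GB, 1 vcore (assumed sizes)
Priority priority = Priority.newInstance(0);
ContainerRequest req = new ContainerRequest(
        capability,
        new String[] { "node1" },  // node whitelist (hypothetical hostname)
        null,                      // no rack preference
        priority,
        false);                    // relaxLocality: do not fall back off-node
amRMClient.addContainerRequest(req);
```

This avoids the release-and-re-request loop described above: rather than rejecting misplaced containers, the request itself forbids off-node placement.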
Re: WebHDFS - Jetty
John, you should look at the AuthenticationHandler interface in the hadoop-auth module; there are 2 implementations, Pseudo and Kerberos. Hope this helps On Mon, Jul 1, 2013 at 9:15 PM, John Strange j...@strangeness.org wrote: We have an existing java module that integrates our authentication system and works with Jetty, but I can't find the configuration files for Jetty, so I'm assuming the configuration is being passed and built on the fly by the data/name nodes when they start. The goal is to provide a way to authenticate users via the webHDFS interface. If anyone has any docs or a good place to start, it would be of help. Thanks, - John -- Alejandro
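A skeleton of a custom handler against the hadoop-auth interface mentioned above. The method names follow the Hadoop 2.x AuthenticationHandler contract as I recall it (verify against your Hadoop version's sources), and the actual credential check is left as a stub:

```java
import java.io.IOException;
import java.util.Properties;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.security.authentication.client.AuthenticationException;
import org.apache.hadoop.security.authentication.server.AuthenticationHandler;
import org.apache.hadoop.security.authentication.server.AuthenticationToken;

// Sketch of a custom hadoop-auth handler; the class name is then referenced
// from the HTTP authentication 'type' property instead of simple/kerberos.
public class MyAuthHandler implements AuthenticationHandler {
    public static final String TYPE = "myauth";

    @Override public String getType() { return TYPE; }
    @Override public void init(Properties config) throws ServletException { }
    @Override public void destroy() { }

    @Override
    public boolean managementOperation(AuthenticationToken token,
            HttpServletRequest request, HttpServletResponse response)
            throws IOException, AuthenticationException {
        return true; // no special management operations in this sketch
    }

    @Override
    public AuthenticationToken authenticate(HttpServletRequest request,
            HttpServletResponse response) throws IOException, AuthenticationException {
        String user = request.getParameter("user"); // stub: call your auth system here
        if (user == null) {
            response.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            return null; // challenge the client; no token issued
        }
        return new AuthenticationToken(user, user, TYPE);
    }
}
```

Returning a non-null AuthenticationToken is what establishes the authenticated user for subsequent WebHDFS calls, so an integration like John's mostly lives inside authenticate().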
Re: Job end notification does not always work (Hadoop 2.x)
Devaraj, If you don't run the HS, once your jobs finish you cannot retrieve status/counters from them, from the Java API or Web UI. So I'd say for any practical usage, you need it. thx On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k devara...@huawei.com wrote: It is not mandatory to have a running HS in the cluster. The user can still submit a job without the HS in the cluster, and the user may expect the Job/App End Notification. Thanks Devaraj k From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: 24 June 2013 21:42 To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific; we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in the AM for Job End Notification, we may miss cases like an abrupt kill of the AM when it is in its last retry. If we choose the NM to give the notification, the RM again needs to identify which NM should give the end-notification, as we don't have any direct protocol between the AM and NM. I feel it would be better to move the End-Notification responsibility to the RM as a Yarn service, because it ensures 100% notification and is also useful for other types of applications. Thanks Devaraj K From: Ravi Prakash [mailto:ravi...@ymail.com] Sent: 23 June 2013 19:01 To: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi -- From: Alejandro Abdelnur t...@cloudera.com To: common-u...@hadoop.apache.org user@hadoop.apache.org Sent: Saturday, June 22, 2013 7:37 PM Subject: Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution and for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case its a no-effort :) A failure of AM init should be considered a failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end notification is only a best-effort mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree with you that we can do more. If you feel strongly about this, please create a JIRA and possibly upload a patch. Thanks Ravi -- From: Prashant Kommireddi prash1...@gmail.com To: user@hadoop.apache.org user@hadoop.apache.org Sent: Thursday, June 20, 2013 9:45 PM Subject: Job end notification does not always work (Hadoop 2.x) Hello, I came across an issue that occurs with the job notification callbacks
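For reference, the notification URL discussed throughout this thread is carried per job in the configuration. A typical Hadoop 2.x setup looks like this; the callback host and the retry value are hypothetical:

```xml
<!-- mapred-site.xml or per-job configuration -->
<property>
  <name>mapreduce.job.end-notification.url</name>
  <!-- $jobId and $jobStatus are substituted by the framework -->
  <value>http://callback.example.com/notify?jobId=$jobId&amp;status=$jobStatus</value>
</property>
<property>
  <name>mapreduce.job.end-notification.max.attempts</name>
  <value>5</value>
</property>
```

Because this URL lives in the job conf, Alejandro's point follows: the RM never sees it, which is what makes the RM-as-notifier proposal require a new registration-time callback mechanism.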
Re: Job end notification does not always work (Hadoop 2.x)
Devaraj, if a job can finish but you cannot determine it status after it ended, then the system is not usable. Thus, HS is a required component. thx On Tue, Jun 25, 2013 at 6:11 AM, Devaraj k devara...@huawei.com wrote: I agree, for getting status/counters we need HS. I mean Job can finish without HS also. ** ** Thanks Devaraj k ** ** *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 25 June 2013 18:05 *To:* common-u...@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) ** ** Devaraj, ** ** If you don't run the HS, once your jobs finished you cannot retrieve status/counters from it, from Java AP or Web UI. So I'd for any practical usage, you need it. ** ** thx ** ** On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k devara...@huawei.com wrote:** ** It is not mandatory to have running HS in the cluster. Still the user can submit the job without HS in the cluster, and user may expect the Job/App End Notification. Thanks Devaraj k *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 24 June 2013 21:42 *To:* user@hadoop.apache.org *Cc:* user@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS, would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific, we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in AM for Job End Notification, we may miss cases like abrupt kill of AM when it is in last retry. If we choose NM to give the notification, again RM needs to identify which NM should give the end-notification as we don't have any direct protocol between AM and NM. 
I feel it would be better to move the End-Notification responsibility to the RM as a Yarn Service because it ensures 100% notification and is also useful for other types of applications as well. Thanks Devaraj K *From:* Ravi Prakash [mailto:ravi...@ymail.com] *Sent:* 23 June 2013 19:01 *To:* user@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi -- *From:* Alejandro Abdelnur t...@cloudera.com *To:* common-u...@hadoop.apache.org user@hadoop.apache.org *Sent:* Saturday, June 22, 2013 7:37 PM *Subject:* Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution, for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. 
Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case it's a no-effort :) A failure of AM init should be considered as failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end
Re: Job end notification does not always work (Hadoop 2.x)
if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific; we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in the AM for Job End Notification, we may miss cases like an abrupt kill of the AM when it is in its last retry. If we choose the NM to give the notification, again the RM needs to identify which NM should give the end-notification, as we don't have any direct protocol between the AM and NM. I feel it would be better to move the End-Notification responsibility to the RM as a Yarn Service because it ensures 100% notification and is also useful for other types of applications as well. Thanks Devaraj K From: Ravi Prakash [mailto:ravi...@ymail.com] Sent: 23 June 2013 19:01 To: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi From: Alejandro Abdelnur t...@cloudera.com To: common-u...@hadoop.apache.org user@hadoop.apache.org Sent: Saturday, June 22, 2013 7:37 PM Subject: Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution, for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. 
The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case it's a no-effort :) A failure of AM init should be considered as failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end notification is only a best-effort mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree with you that we can do more. If you feel strongly about this, please create a JIRA and possibly upload a patch. Thanks Ravi From: Prashant Kommireddi prash1...@gmail.com To: user@hadoop.apache.org user@hadoop.apache.org Sent: Thursday, June 20, 2013 9:45 PM Subject: Job end notification does not always work (Hadoop 2.x) Hello, I came across an issue that occurs with the job notification callbacks in MR2. It works fine if the Application Master has started, but does not send a callback if the initializing of the AM fails. Here is the code from MRAppMaster.java: ... 
// set job classloader if configured
MRApps.setJobClassLoader(conf);
initAndStartAppMaster(appMaster, conf, jobUserName);
} catch (Throwable t) {
  LOG.fatal("Error starting MRAppMaster", t);
  System.exit(1);
}
}

protected static void initAndStartAppMaster(final MRAppMaster appMaster,
    final YarnConfiguration conf, String jobUserName) throws IOException,
    InterruptedException {
  UserGroupInformation.setConfiguration(conf);
  UserGroupInformation appMasterUgi = UserGroupInformation
      .createRemoteUser(jobUserName);
  appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
    @Override
    public Object run() throws Exception {
      appMaster.init(conf);
      appMaster.start();
      if (appMaster.errorHappenedShutDown
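For reference, the notification this thread discusses is driven by the job configuration property mapreduce.job.end-notification.url (job.end.notification.url in older releases): before issuing an HTTP GET, the AM substitutes the $jobId and $jobStatus placeholders in that URL. A minimal sketch of that substitution and best-effort callback; the callback server URL is purely illustrative:

```python
from urllib.request import urlopen

def build_notification_url(template, job_id, job_status):
    """Substitute the placeholders the MR AM replaces before the GET."""
    return (template.replace("$jobId", job_id)
                    .replace("$jobStatus", job_status))

def notify(template, job_id, job_status):
    # Best-effort, like the AM: a failure here must not fail the job.
    url = build_notification_url(template, job_id, job_status)
    try:
        urlopen(url, timeout=5).read()
    except OSError:
        pass  # notification is best-effort only

# Hypothetical callback endpoint, just to show the substitution:
template = "http://callback.example.com/notify?id=$jobId&status=$jobStatus"
print(build_notification_url(template, "job_201306200945_0001", "SUCCEEDED"))
```

The proposals in the thread (moving this into the RM or HS) would change who performs the GET, not the URL contract itself.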
Re: How to use WebHDFS?
please try adding the user.name parameter to the querystring. use your hdfs username as the value. thanks Alejandro (phone typing) On May 18, 2013, at 3:42, Mohammad Mustaqeem 3m.mustaq...@gmail.com wrote: Can anyone help me to explore WebHDFS? The link for it is here. I am unable to access it. When I run this command curl -i "http://127.0.0.1:50070/webhdfs/v1/input/listOfFiles?op=GETFILESTATUS" I get the following error - Sorry, you are not currently allowed to request http://127.0.0.1:50070/webhdfs/v1/input/listOfFiles? from this cache until you have authenticated yourself. Please help me. -- With regards --- Mohammad Mustaqeem, M.Tech (CSE) MNNIT Allahabad 9026604270
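Alejandro's suggestion, sketched: on a cluster without Kerberos, WebHDFS identifies the caller through the user.name query parameter. A small helper that builds the request URL (the host, port, and path are from the question above; the helper itself and the username are illustrative):

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user=None, **params):
    """Build a WebHDFS v1 REST URL, adding user.name for pseudo auth."""
    query = {"op": op}
    if user:
        query["user.name"] = user  # hdfs-side username, not OS login
    query.update(params)
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, urlencode(query))

# Hypothetical username "mustaqeem":
print(webhdfs_url("127.0.0.1", 50070, "/input/listOfFiles",
                  "GETFILESTATUS", user="mustaqeem"))
```

The printed URL can then be passed to curl exactly as in the original command.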
Re: Put file with WebHDFS, and can I put file by chunk with curl?
you could use WebHDFS/HttpFS and the APPEND operation. thx On Wed, Mar 27, 2013 at 1:25 AM, 小学园PHP xxy-...@qq.com wrote: I want to put a file to HDFS with curl, and moreover I need to put it in chunks. So, does somebody know if curl can upload a file by chunks? Or, who has worked with WebHDFS via httplib? TIA Levi -- Alejandro
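For context on the APPEND suggestion: WebHDFS append is a two-step protocol. A POST with op=APPEND to the NameNode returns a 307 redirect whose Location header points at a DataNode, and the data is then POSTed to that URL. A sketch of the chunked-upload loop, with the HTTP call abstracted behind a callable so the flow is testable without a cluster (all URLs are illustrative):

```python
def append_in_chunks(post, namenode_url, path, data, chunk_size=4096):
    """post(url, body) -> (status, location_header_or_None).
    WebHDFS append: POST op=APPEND to the NN, follow the 307 redirect
    to a DataNode, then send the chunk to the redirected URL."""
    sent = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        status, location = post(
            "%s/webhdfs/v1%s?op=APPEND" % (namenode_url, path), None)
        assert status == 307 and location, "expected NN redirect to a DN"
        status, _ = post(location, chunk)
        assert status == 200, "DN append failed"
        sent += len(chunk)
    return sent

# Usage with a stub standing in for a real cluster:
def fake_post(url, body):
    if body is None:  # step 1: NameNode issues the redirect
        return 307, "http://dn1:50075/webhdfs/v1/f?op=APPEND"
    return 200, None  # step 2: DataNode accepts the chunk

print(append_in_chunks(fake_post, "http://nn:50070", "/f", b"x" * 10000))
```

With curl the same two steps are a `curl -X POST` against the NameNode followed by a `curl -X POST -T chunk` against the redirect target.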
Re: Submit RHadoop job using Oozie in Cloudera Manager
[moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o] The Oozie web UI is read only, it does not do job submissions. If you want to do that you should look at Hue. Thx On Fri, Mar 8, 2013 at 2:53 AM, rohit sarewar rohitsare...@gmail.com wrote: Hi, I have R and RHadoop packages installed on all the nodes. I can submit RMR jobs manually from the terminal. I just want to know how to submit RMR jobs from the Oozie web interface? -Rohit On Fri, Mar 8, 2013 at 4:18 PM, Jagat Singh jagatsi...@gmail.com wrote: Hi, Do you have the rmr and rhdfs packages installed on all nodes? For Hadoop it doesn't matter what type of job it is, as long as you have the libraries it needs to run in the cluster. Submitting any job would be fine. Thanks On Fri, Mar 8, 2013 at 9:46 PM, rohit sarewar rohitsare...@gmail.com wrote: Hi All, I am using Cloudera Manager 4.5. As of now I can submit MR jobs using Oozie. Can we submit RHadoop jobs using Oozie in Cloudera Manager? -- Alejandro
Re: Hadoop Distcp [ Incomplete HDFS URI ]
it seems you have an extra : before the first / in your URIs. thx Alejandro (phone typing) On Feb 15, 2013, at 8:22 AM, Dhanasekaran Anbalagan bugcy...@gmail.com wrote: Hi guys, we have two clusters running CDH4.0.1. I am trying to copy data from one cluster to another cluster. It says Incomplete HDFS URI:
tech@dvcliftonhera227:~$ hadoop distcp hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0 hdfs://172.16.30.227:/user/tech
13/02/15 11:13:15 INFO tools.DistCp: srcPaths=[hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0]
13/02/15 11:13:15 INFO tools.DistCp: destPath=hdfs://172.16.30.227:/user/tech
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Incomplete HDFS URI, no host: hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:118)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2150)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2184)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2166)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:302)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
I tried to debug; the ports are listening. How do I resolve this? Please guide me.
$ telnet 172.16.30.122 8020
Trying 172.16.30.122...
Connected to 172.16.30.122.
Escape character is '^]'.
^] telnet quit
$ telnet 172.16.30.227 8020
Trying 172.16.30.227...
Connected to 172.16.30.227.
Escape character is '^]'.
^] telnet quit
Connection closed. 
# core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://dvcliftonhera227:8020</value>
</property>
-Dhanasekaran Did I learn something today? If not, I wasted it. --
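The malformed URIs in the command above are exactly what DistCp rejects: hdfs://172.16.30.122:8020:/user/... has a second colon after the port, and hdfs://172.16.30.227:/user/tech has a colon with an empty port. A quick sanity check one could run over URIs before launching the copy (an illustrative helper, not part of Hadoop):

```python
import re

# host, optional numeric port, then the path
HDFS_URI = re.compile(r"^hdfs://(?P<host>[^:/]+)(?::(?P<port>\d+))?(?P<path>/.*)?$")

def valid_hdfs_uri(uri):
    """True when the URI has a host, an optional numeric port, and a path."""
    return HDFS_URI.match(uri) is not None

print(valid_hdfs_uri("hdfs://172.16.30.122:8020/user/thirumal/test2/part-0"))   # True
print(valid_hdfs_uri("hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0"))  # False: colon after port
print(valid_hdfs_uri("hdfs://172.16.30.227:/user/tech"))                        # False: empty port
```

Dropping the stray colons (hdfs://172.16.30.122:8020/user/... and hdfs://172.16.30.227:8020/user/tech, or no port at all to use fs.defaultFS) makes the distcp command parse.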
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Tony, I think the first step would be to verify if the S3 filesystem implementation's rename works as expected. Thx On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton tbur...@sportingindex.com wrote: Thanks for the reply Alejandro. Using a temp output directory was my first guess as well. What’s the best way to proceed? I’ve come across FileSystem.rename but it’s consistently returning false for whatever Paths I provide. Specifically, I need to copy the following:
s3://path to data/tmp folder/object type 1/part-0 … s3://path to data/tmp folder/object type 1/part-n
s3://path to data/tmp folder/object type 2/part-0 … s3://path to data/tmp folder/object type 2/part-n
… s3://path to data/tmp folder/object type m/part-n
to
s3://path to data/object type 1/part-0 … s3://path to data/object type 1/part-n
s3://path to data/object type 2/part-0 … s3://path to data/object type 2/part-n
… s3://path to data/object type m/part-n
without doing a copyToLocal. Any tips? Are there any better alternatives to FileSystem.rename? Or would using the AWS Java SDK be a better solution? Thanks! Tony *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 31 January 2013 18:45 *To:* common-u...@hadoop.apache.org *Subject:* Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Hi Tony, from what I understand your problem is not with MTOF but with you wanting to run 2 jobs using the same output directory; the second job will fail because the output dir already exists. My take would be tweaking your jobs to use a temp output dir, and moving the output to the required (final) location upon completion. thx On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton tbur...@sportingindex.com wrote: Hi everyone, Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. 
Despite the initial success of the discovery, I had to shelve the approach as I ended up using a different solution (for reasons I forget!) for the implementation that was ultimately used for that particular project. I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through. I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format: s3://inputbucket/data/2013/1/13/list of xml data files.bz2 I want to extract items from the XML and write out as follows: s3://outputbucket/path/xml node name/20130113/output from MR job For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example FileInputFormat.setInputPaths(job, s3://inputbucket/data); FileOutputFormat.setOutputPath(job, s3://outputbucket/path); Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically. However, for the next day's run, the job gets to FileOutputFormat.setOutputPath and sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist. Is there any way around this? 
I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten* (my emphasis). Is it possible to deconfigure the overwrite protection? If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a temp directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit hacky to me. Thanks for any feedback! Tony From: Harsh J [ha...@cloudera.com] Sent: 31 August 2012 10:47 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Hi Tony, from what I understand your problem is not with MTOF but with you wanting to run 2 jobs using the same output directory; the second job will fail because the output dir already exists. My take would be tweaking your jobs to use a temp output dir, and moving the output to the required (final) location upon completion. thx On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton tbur...@sportingindex.com wrote: Hi everyone, Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. Despite the initial success of the discovery, I had to shelve the approach as I ended up using a different solution (for reasons I forget!) for the implementation that was ultimately used for that particular project. I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through. I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format: s3://inputbucket/data/2013/1/13/list of xml data files.bz2 I want to extract items from the XML and write out as follows: s3://outputbucket/path/xml node name/20130113/output from MR job For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example FileInputFormat.setInputPaths(job, s3://inputbucket/data); FileOutputFormat.setOutputPath(job, s3://outputbucket/path); Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically. 
However, for the next day's run, the job gets to FileOutputFormat.setOutputPath and sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs does not already exist. Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten* (my emphasis). Is it possible to deconfigure the overwrite protection? If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a temp directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit hacky to me. Thanks for any feedback! Tony From: Harsh J [ha...@cloudera.com] Sent: 31 August 2012 10:47 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use the LazyOutputFormat for the job-level config. Please file a JIRA for this under MAPREDUCE project on the Apache JIRA? On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton tbur...@sportingindex.com wrote: Hi Harsh, I tried using NullOutputFormat as you suggested, however simply using job.setOutputFormatClass(NullOutputFormat.class); resulted in no output at all. Although I've not tried overriding getOutputCommitter in NullOutputFormat as you suggested, I discovered LazyOutputFormat which only writes when it has to, the output file is created only when the first record is emitted for a given partition (from Hadoop: The Definitive Guide). 
Instead of job.setOutputFormatClass(TextOutputFormat.class); use LazyOutputFormat like this: LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); So now my unnamed MultipleOutputs are handling the segmented results, and LazyOutputFormat is suppressing the default output. Good job! Tony From: Harsh J [ha...@cloudera.com] Sent: 29 August 2012 17:05 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Hi Tony, On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton tbur...@sportingindex.com wrote: Success so far! I followed the example given by Tom on the link to the MultipleOutputs.html API you suggested. I implemented a WordCount MR job using hadoop 1.0.3 and segmented the output depending on word length: output to directory sml for less than 10 characters, med for between 10 and 20 characters, lrg otherwise.
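The temp-output-then-move approach suggested earlier in this thread can be sketched on a local filesystem (the Hadoop equivalent would use FileSystem.rename on HDFS paths; note that on S3 a "rename" is a copy-and-delete under the hood, which is why verifying its behaviour is the suggested first step):

```python
import os
import shutil
import tempfile

def run_job_with_temp_output(write_output, final_dir):
    """Write job output into a scratch dir, then promote it to the
    final location only after the job completed successfully."""
    tmp = tempfile.mkdtemp(prefix="job-tmp-")
    try:
        write_output(tmp)                 # the "job" writes part files here
        os.makedirs(os.path.dirname(final_dir) or ".", exist_ok=True)
        shutil.move(tmp, final_dir)       # move into the final location
    except Exception:
        shutil.rmtree(tmp, ignore_errors=True)  # never leave half-written output
        raise
    return final_dir

def fake_job(out_dir):
    # stand-in for the MR job's reducer output
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("k\tv\n")

dest = os.path.join(tempfile.mkdtemp(), "output", "20130113")
run_job_with_temp_output(fake_job, dest)
print(os.listdir(dest))  # ['part-00000']
```

Because the final path only appears once the job succeeds, the next day's run never trips over a pre-existing output directory.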
Re: Same Map/Reduce works on command line BUT it hangs through Oozie without any error msg
Yaotian,
* Oozie version?
* More details on what exactly your workflow action is (mapred, java, shell, etc)?
* What is in the task log of the oozie launcher job for that action?
Thx On Fri, Jan 25, 2013 at 10:43 PM, yaotian yaot...@gmail.com wrote: I manually ran it in Hadoop. It works. But when I run it as a job through Oozie, the map/reduce hangs without any error msg. I checked the master and datanode. === From the master, I saw:
2013-01-26 06:32:38,517 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201301251528_0014_r_04_0' to tip task_201301251528_0014_r_04, for tracker 'tracker_datanode1:localhost/127.0.0.1:45695'
2013-01-26 06:32:44,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201301251528_0014_r_04_0' has completed task_201301251528_0014_r_04 successfully.
2013-01-26 06:33:40,640 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0089. Cache size is 0
2013-01-26 06:33:40,640 WARN org.mortbay.log: /jobtaskshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0089_1357872144366_hadoop_sorting+locations+per+user does not exist.
2013-01-26 06:35:17,008 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0051. Cache size is 0
2013-01-26 06:35:17,009 WARN org.mortbay.log: /jobtaskshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0051_1357870007893_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dsort%5Fuser%3A does not exist.
2013-01-26 06:40:04,251 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0026. 
Cache size is 0 2013-01-26 06:40:04,251 WARN org.mortbay.log: /taskstatshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/09/00/job_201301090834_0026_1357722497582_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dreport%5Fsta does not exist. === Form the datanode. Just saw 2013-01-26 06:39:11,143 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:39:41,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:40:11,371 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:40:41,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:41:11,604 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:41:41,724 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:42:11,845 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:42:41,959 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% 
hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 -- Alejandro
Re: Filesystem closed exception
Hemanth, Is FS caching enabled or not in your cluster? A simple solution would be to modify your mapper code not to close the FS. It will go away when the task ends anyway. Thx On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code. Debugging this we found that the mapper is getting a handle to the filesystem object and itself calling a close on it. Because filesystem objects are cached, I believe the behaviour is as expected in terms of the exception. I just wanted to confirm that: - if we do have a requirement to use a filesystem object in a mapper or reducer, we should either not close it ourselves - or (seems better to me) ask for a new version of the filesystem instance by setting the fs.hdfs.impl.disable.cache property to true in job configuration. Also, does anyone know if this behaviour was any different in Hadoop 0.20 ? For some context, this behaviour is actually seen in Oozie, which runs a launcher mapper for a simple java action. Hence, the java action could very well interact with a file system. I know this is probably better addressed in Oozie context, but wanted to get the map reduce view of things. Thanks, Hemanth -- Alejandro
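The reason closing the FS inside a mapper breaks the task, as discussed above, is the JVM-wide FileSystem cache: FileSystem.get() hands every caller the same cached instance per (scheme, authority, user), so one caller's close() invalidates everyone else's handle. A toy model of that sharing, illustrative rather than Hadoop's actual code:

```python
class CachedFS:
    """Toy stand-in for Hadoop's FileSystem cache."""
    _cache = {}

    def __init__(self):
        self.closed = False

    @classmethod
    def get(cls, uri):
        # Same URI -> same shared instance, like FileSystem.get(conf)
        if uri not in cls._cache:
            cls._cache[uri] = cls()
        return cls._cache[uri]

    def close(self):
        self.closed = True

    def read(self):
        if self.closed:
            raise IOError("Filesystem closed")  # what the map task sees
        return "data"

mapper_fs = CachedFS.get("hdfs://nn:8020")
framework_fs = CachedFS.get("hdfs://nn:8020")
mapper_fs.close()            # user code closes "its" handle...
try:
    framework_fs.read()      # ...and the framework's handle dies too
except IOError as e:
    print(e)
```

This is why the two remedies in the thread work: either never call close() in user code, or set fs.hdfs.impl.disable.cache to true so each caller gets a private, uncached instance.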
Re: Oozie workflow error - renewing token issue
Corbett, [Moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o]
* What version of Oozie are you using?
* Is the cluster a secure setup (Kerberos enabled)?
* Would you mind posting the complete launcher logs?
Thx On Wed, Jan 30, 2013 at 6:14 AM, Corbett Martin comar...@nhin.com wrote: Thanks for the tip. The sqoop command listed in the stdout log file is: sqoop import --driver org.apache.derby.jdbc.ClientDriver --connect jdbc:derby://test-server:1527/mondb --username monuser --password x --table MONITOR --split-by request_id --target-dir /mon/import --append --incremental append --check-column request_id --last-value 200 The following information is from the stderr and stdout log files. /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0 # more stderr No such sqoop tool: sqoop. See 'sqoop help'. Intercepting System.exit(1) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1] /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0 # more stdout ... ... ... Invoking Sqoop command line now 1598 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration. Intercepting System.exit(1) Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1] Oozie Launcher failed, finishing Hadoop job gracefully Oozie Launcher ends ~Corbett Martin Software Architect AbsoluteAR Accounts Receivable Services - An NHIN Solution -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Tuesday, January 29, 2013 10:11 PM To: user@hadoop.apache.org Subject: Re: Oozie workflow error - renewing token issue The job that Oozie launches for your action, which you are observing is failing - do its own logs (task logs) show any errors? 
On Wed, Jan 30, 2013 at 4:59 AM, Corbett Martin comar...@nhin.com wrote: Oozie question I'm trying to run an Oozie workflow (sqoop action) from the Hue console and it fails every time. No exception in the oozie log but I see this in the Job Tracker log file. Two primary issues seem to be 1. Client mapred tries to renew a token with renewer specified as mr token And 2. Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN Any ideas how to get past this? Full Stacktrace: 2013-01-29 17:11:28,860 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088860, maxDate=136010560, sequenceNumber=75, masterKeyId=8 2013-01-29 17:11:28,871 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,202 INFO org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal: registering token for renewal for service =10.204.12.62:8021 and jobID = job_201301231648_0029 2013-01-29 17:11:29,211 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Token renewal requested for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,211 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as mr token 2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token: Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN 2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8021, call renewDelegationToken(Kind: 
MAPREDUCE_DELEGATION_TOKEN, Service: 10.204.12.62:8021, Ident: 00 04 68 64 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 7a 69 65 8a 01 3c 88 94 58 67 8a 01 3c ac a0 dc 67 4c 08), rpc version=2, client version=28, methodsFingerPrint=1830206421 from 10.204.12.62:9706: error: org.apache.hadoop.security.AccessControlException: Client mapred tries to
Re: Complex MapReduce applications with the streaming API
Using Oozie seems to be overkill for this application; besides, it doesn't support loops so the recursion can't really be implemented. Correct, Oozie does not support loops; this is a restriction by design (early prototypes supported loops). The idea was that you didn't want never-ending workflows. To this end, Coordinator Jobs address the recurrent run of workflow jobs. Still, if you want to do recursion in Oozie, you certainly can: a workflow invoking itself as a sub-workflow. Just make sure you define your exit condition properly. If you have additional questions, please move this thread to the u...@oozie.apache.org alias. Thx On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra zoltan.tothczi...@softonic.com wrote: Hi everyone, Thanks in advance for the support. My problem is the following: I'm trying to develop a fairly complex MapReduce application using the streaming API (for demonstration purposes, so unfortunately the use Java answer doesn't work :-( ). I can get one single MapReduce phase running from the command line with no problem. The problem is when I want to add more MapReduce phases which use each other's output, and I maybe even want to do a recursion (feed its output to the same phase again) conditioned by a counter. The solution in Java MapReduce is trivial (i.e. creating multiple Job instances and monitoring counters) but with the streaming API not quite. What is the correct way to manage my application with its native code? (Python, PHP, Perl...) Calling shell commands from a controller script? How should I obtain counters?... Using Oozie seems to be overkill for this application; besides, it doesn't support loops so the recursion can't really be implemented. Thanks a lot! Zoltan -- Alejandro
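A self-calling workflow of the kind Alejandro describes would look roughly like this. This is a sketch against the Oozie workflow schema; the names, paths, and the EL expression in the decision node are placeholders, and the decision node is the exit condition that stops the recursion:

```xml
<workflow-app name="recursive-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="do-work"/>
  <action name="do-work">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- ... the real MR/streaming action configuration goes here ... -->
    </map-reduce>
    <ok to="check-done"/>
    <error to="fail"/>
  </action>
  <decision name="check-done">
    <switch>
      <!-- exit condition: stop recursing once the counter/flag says so -->
      <case to="recurse">${wf:conf('iterationsLeft') gt 0}</case>
      <default to="end"/>
    </switch>
  </decision>
  <action name="recurse">
    <sub-workflow>
      <!-- the workflow invokes itself as a sub-workflow -->
      <app-path>${wf:appPath()}</app-path>
      <propagate-configuration/>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The calling side must decrement the counter it passes down (here the hypothetical iterationsLeft property) on each invocation, otherwise the sub-workflow chain never terminates.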
Re: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError Java Heap Space
Eduard, Would you try using the following properties in your job invocation? -D mapreduce.map.java.opts=-Xmx768m -D mapreduce.reduce.java.opts=-Xmx768m -D mapreduce.map.memory.mb=2000 -D mapreduce.reduce.memory.mb=3000 Thx On Mon, Nov 5, 2012 at 7:43 AM, Kartashov, Andy andy.kartas...@mpac.ca wrote: Your error takes place during the reduce task, when temporary files are written to memory/disk. You are clearly running low on resources. Check your memory “$ free –m” and disk space “$ df –H” as well as “$ hadoop fs -df”. I remember it took me a couple of days to figure out why I was getting the heap size error and nothing worked! Because I tried to write a 7GB output file onto a disk (in pseudo-distributed mode) that only had 4GB of free space. p.s. Always test your jobs on small input first (a few lines of input). p.p.s. Follow your job execution through the web: http://fully-qualified-hostname-of-your-job-tracker:50030 From: Eduard Skaley [mailto:e.v.ska...@gmail.com] Sent: Monday, November 05, 2012 4:10 AM To: user@hadoop.apache.org Cc: Nitin Pawar Subject: Re: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError Java Heap Space By the way it happens on Yarn, not on MRv1; each container gets 1GB at the moment. can you try increasing memory per reducer? 
On Wed, Oct 31, 2012 at 9:15 PM, Eduard Skaley e.v.ska...@gmail.com wrote: Hello, I'm getting this Error through job execution: 16:20:26 INFO [main] Job - map 100% reduce 46% 16:20:27 INFO [main] Job - map 100% reduce 51% 16:20:29 INFO [main] Job - map 100% reduce 62% 16:20:30 INFO [main] Job - map 100% reduce 64% 16:20:32 INFO [main] Job - Task Id : attempt_1351680008718_0018_r_06_0, Status : FAILED Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:123) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:371) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147) Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:58) at org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:45) at org.apache.hadoop.mapreduce.task.reduce.MapOutput.init(MapOutput.java:97) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:286) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:276) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:384) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:319) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:179) 16:20:33 INFO [main] Job - map 100% reduce 65% 16:20:36 INFO [main] Job - map 100% reduce 67% 16:20:39 INFO [main] Job - map 100% reduce 69% 16:20:41 INFO [main] Job - map 100% reduce 70% 16:20:43 INFO [main] Job - map 100% reduce 71% I have no clue what the issue could be for 
this. I googled this issue and checked several sources of possible solutions but nothing fits. I saw this jira entry which could fit: https://issues.apache.org/jira/browse/MAPREDUCE-4655. Here somebody recommends increasing the value of the property dfs.datanode.max.xcievers / dfs.datanode.max.receiver.threads to 4096, but this is already the value for our cluster. http://yaseminavcular.blogspot.de/2011/04/common-hadoop-hdfs-exceptions-with.html The too-small-input-files issue doesn't fit, I think, because the map phase reads 137 files of ~130MB each. Block size is 128MB. The cluster uses version 2.0.0-cdh4.1.1, 581959ba23e4af85afd8db98b7687662fe9c5f20. Thx -- Nitin Pawar
Re: Loading Data to HDFS
I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values I believe Sumit's setup is a cluster within a firewall and hadoop client machines also within the firewall; the only way to access the cluster is to ssh from outside to one of the hadoop client machines and then submit your jobs. These hadoop client machines are often referred to as gateway machines. On Tue, Oct 30, 2012 at 4:10 AM, Bertrand Dechoux decho...@gmail.com wrote: I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values * amount of data you want to put on hadoop * hadoop bandwidth with regards to local storage (read/write) * bandwidth between where your data are stored and where the hadoop cluster is For the latter, for big volumes, physically moving the volumes is a viable solution. It will depend on your constraints of course: budget, speed... Bertrand On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh sumi...@yahoo.com wrote: Hi Bertrand, By physically moving the data do you mean that the data volume is connected to the gateway machine and the data is loaded from the local copy using copyFromLocal? Thanks, Sumit From: Bertrand Dechoux decho...@gmail.com To: common-user@hadoop.apache.org; sumit ghosh sumi...@yahoo.com Sent: Tuesday, 30 October 2012 3:46 PM Subject: Re: Loading Data to HDFS It might sound like a deprecated way but can't you move the data physically? From what I understand, it is one shot and not streaming, so it could be a good method if you have the access of course. Regards Bertrand On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh sumi...@yahoo.com wrote: Hi, I have data on a remote machine accessible over ssh. I have Hadoop CDH4 installed on RHEL. I am planning to load quite a few petabytes of data onto HDFS. Which will be the fastest method to use and are there any projects around Hadoop which can be used as well? I cannot install Hadoop-Client on the remote machine. 
Have a great Day Ahead! Sumit. --- Here I am attaching my previous discussion on CDH-user to avoid duplication. --- On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com wrote: in addition to jarcec's suggestions, you could use httpfs. then you'd only need to poke a single host:port in your firewall as all the traffic goes thru it. thx Alejandro On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com wrote: Hi Sumit, there are plenty of ways to achieve that. Please find my feedback below: Does Sqoop support loading flat files to HDFS? No, Sqoop only supports moving data from external database and warehouse systems. Copying files is not supported at the moment. Can I use distcp? No. Distcp can be used only to copy data between HDFS filesystems. How do we use the core-site.xml file on the remote machine to use copyFromLocal? Yes, you can install the hadoop binaries on your machine (with no hadoop services running) and use the hadoop binary to upload data. The installation procedure is described in the CDH4 installation guide [1] (follow the client installation). Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]? Jarcec Links: 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote: Hi, I have data on a remote machine accessible over ssh. What is the fastest way to load data onto HDFS? Does Sqoop support loading flat files to HDFS? Can I use distcp? How do we use the core-site.xml file on the remote machine to use copyFromLocal? Which will be the best to use and are there any other open source projects around Hadoop which can be used as well? Have a great Day Ahead! Sumit -- Bertrand Dechoux -- Alejandro
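The WebHDFS/HttpFS option works over plain HTTP, so no Hadoop client install is needed on the remote machine. A file upload is a two-step protocol: a PUT with op=CREATE sent to the NameNode (or HttpFS gateway) returns a 307 redirect whose Location header names the target datanode, and the client then PUTs the file bytes to that URL. A minimal sketch of the URL construction, stdlib only; the host, port, path, and user below are hypothetical (HttpFS conventionally listens on 14000):

```python
# Build WebHDFS/HttpFS REST URLs (host/port/user are made-up examples).
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user, **params):
    """Compose a WebHDFS v1 REST URL for the given operation."""
    query = {"op": op, "user.name": user, **params}
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, urlencode(query))

create = webhdfs_url("httpfs-gw.example.com", 14000, "/user/sumit/data.csv",
                     "CREATE", "sumit", overwrite="true")
print(create)
# A client would then:
#   1) PUT this URL with an empty body, without following redirects;
#   2) PUT the file bytes to the Location header of the 307 response.
```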
Re: Detect when file is not being written by another process
AFAIK there is no way to determine if a file has been fully written or not. Oozie uses a feature of Hadoop which writes a _SUCCESS flag file in the output directory of a job. This _SUCCESS file is written at job completion time, thus ensuring all the output of the job is ready. This means that when Oozie is configured to look for a directory FOO/, in practice it looks for the existence of the FOO/_SUCCESS file. You can configure Oozie to look for the existence of FOO/ but this means you'll have to use a temp dir, i.e. FOO_TMP/, while writing data and do a rename to FOO/ once you've finished writing the data. Thx On Wed, Sep 26, 2012 at 1:52 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Agree with Bejoy. The problem you've mentioned sounds like building something like a workflow, which is what Oozie is supposed to do. Thanks hemanth On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Peter AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as soon as the files are written to a certain hdfs directory. On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan psheri...@millennialmedia.com wrote: These are log files being deposited by other processes, which we may not have control over. We don't want multiple processes to write to the same files — we just don't want to start our jobs until they have been completely written. Sorry for the lack of clarity; thanks for the response. --Pete From: Bertrand Dechoux decho...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Tuesday, September 25, 2012 12:33 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: Detect when file is not being written by another process Hi, Multiple files and aggregation or something like hbase? Could you tell us more about your context? What are the volumes? Why do you want multiple processes to write to the same file? Regards Bertrand On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan psheri...@millennialmedia.com wrote: Hi all. 
We're using Hadoop 1.0.3. We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this. We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file. However: when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things. What's the right way to solve this problem? Thanks. --Pete -- Bertrand Dechoux -- Alejandro
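The _SUCCESS-flag convention described above avoids the append() probing altogether: the consumer simply waits for the completion marker before reading. A minimal local sketch of that pattern (plain Python on a local filesystem; against HDFS you would use FileSystem.exists() on the same path instead of os.path.exists):

```python
# Consumer-side sketch of the _SUCCESS-flag convention: the producer writes
# its output files, then creates the _SUCCESS marker last (MapReduce does
# this automatically at job completion); the consumer only starts once the
# marker exists.
import os, tempfile, time

def output_ready(directory, flag="_SUCCESS"):
    """True once the job-completion marker exists in the output directory."""
    return os.path.exists(os.path.join(directory, flag))

def wait_for_output(directory, timeout=5.0, poll=0.1):
    """Poll until the output is ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if output_ready(directory):
            return True
        time.sleep(poll)
    return False

# Local simulation of a producer finishing a job:
out = tempfile.mkdtemp()
assert not output_ready(out)                        # still being written
with open(os.path.join(out, "part-00000"), "w") as f:
    f.write("records\n")
open(os.path.join(out, "_SUCCESS"), "w").close()    # job completed
print(output_ready(out))                            # now safe to consume
```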
Re: Native not compiling in OS X Mountain Lion
Mohamed, Currently Hadoop native code does not compile/run in any flavor of OS X. Thanks. Alejandro On Mon, Aug 13, 2012 at 2:59 AM, J Mohamed Zahoor jmo...@gmail.com wrote: Hi I have problems compiling native's in OS X 10.8 for trunk. Especially in Yarn projects. Anyone faced similar problems? [exec] /usr/bin/gcc -g -Wall -O2 -D_GNU_SOURCE -D_REENTRANT -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl -o CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o -c /Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c [exec] Undefined symbols for architecture x86_64: [exec] _fcloseall, referenced from: [exec] _launch_container_as_user in libcontainer.a(container-executor.c.o) [exec] _mkdirat, referenced from: [exec] _mkdirs in libcontainer.a(container-executor.c.o) [exec] _openat, referenced from: [exec] _mkdirs in libcontainer.a(container-executor.c.o) [exec] ld: symbol(s) not found for architecture x86_64 [exec] collect2: ld returned 1 exit status [exec] make[2]: *** [target/usr/local/bin/container-executor] Error 1 [exec] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2 [exec] make: *** [all] Error 2 ./Zahoor -- Alejandro
Re: MRv1 with CDH4
Antoine, This is the Apache Hadoop alias for mapreduce; your question is specific to the CDH4 distribution of Apache Hadoop. You should use the Cloudera alias for such questions. I'll follow up with you on the Cloudera alias where you cross-posted this message. Thanks. On Mon, Jul 30, 2012 at 6:20 AM, Antoine Boudot abou...@cyres.fr wrote: Hi everyone I don’t understand the steps to follow to implement MRv1 with CDH4: install service? run service? config service? mapred-site.xml? must I have the script retrieve and recreate the CDH3 mapred-site.xml and master? thanks Antoine -- Alejandro
Re: HBase mulit-user security
Tony, Sorry, missed this email earlier. This seems more appropriate for the HBase aliases. Thx. On Wed, Jul 11, 2012 at 8:41 AM, Tony Dean tony.d...@sas.com wrote: Hi, Looking at this further, it appears that when HBaseRPC is creating a proxy (e.g., SecureRpcEngine), it injects the current user: User.getCurrent(), which by default is the cached Kerberos TGT (kinit'ed user - using the hadoop-user-kerberos JAAS context). Since the server proxy always uses User.getCurrent(), how can an application inject the user it wants to use for authorization checks on the peer (region server)? And since SecureHadoopUser is a static class, how can you have more than 1 active user in the same application? What you have works for a single-user application like the hbase shell, but what about a multi-user application? Am I missing something? Thanks! -Tony -Original Message- From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: Monday, July 02, 2012 11:40 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). * Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use UGI.doAs() to create JobClient/Filesystem instances as the user you want to act on behalf of * Keep in mind that all the users you need to act on behalf of should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to also be defined in the KDC user database. Hope this helps. 
Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmatically. What I have found so far may help, even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, the hadoop library will locate it automatically if the name of the ticket cache corresponds to the default location. On Linux it is located at /tmp/krb5cc_uid-number. For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan', meaning you can impersonate ivan from the hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid starting / Expires / Service principal: 02/07/2012 13:59 / 02/07/2012 23:59 / krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. 
In my tests, I only use HDFS, and here is a snippet of code to access a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos") conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true") conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handle to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab file you want to use via the UserGroupInformation singleton API: UserGroupInformation.loginUserFromKeytab(user, keytabFile) If you want to understand the auth process and the different options to login, I guess you need to have a look
Re: hadoop security API (repost)
Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). * Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use UGI.doAs() to create JobClient/Filesystem instances as the user you want to act on behalf of * Keep in mind that all the users you need to act on behalf of should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to also be defined in the KDC user database. Hope this helps. Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmatically. What I have found so far may help, even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, the hadoop library will locate it automatically if the name of the ticket cache corresponds to the default location. On Linux it is located at /tmp/krb5cc_uid-number. 
For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan', meaning you can impersonate ivan from the hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid starting / Expires / Service principal: 02/07/2012 13:59 / 02/07/2012 23:59 / krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. In my tests, I only use HDFS, and here is a snippet of code to access a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos") conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true") conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handle to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab file you want to use via the UserGroupInformation singleton API: UserGroupInformation.loginUserFromKeytab(user, keytabFile) If you want to understand the auth process and the different options to login, I guess you need to have a look at the UserGroupInformation.java source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class HadoopConfiguration at line 347 is of major interest in our case. Another point is that I did not find any easy way to prompt the user for a password at runtime using the actual hadoop API. It appears to be somehow hardcoded in the UserGroupInformation singleton. 
I guess it could be nice to have a new function to give to the UserGroupInformation an authenticated 'Subject' which could override all default configurations. If someone have better ideas it could be nice to discuss on it as well. BR, Ivan 2012/7/1 Tony Dean tony.d...@sas.com Hi, The security documentation specifies how to test a secure cluster by using kinit and thus adding the Kerberos principal TGT to the ticket cache in which the hadoop client code uses to acquire service tickets for use in the cluster. What if I created an application that used the hadoop API to communicate with hdfs and/or mapred protocols, is there a programmatic way to inform hadoop to use a particular Kerberos principal name with a keytab that contains its password
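The proxyuser setup from Alejandro's bullets above boils down to two properties in the cluster's core-site.xml. A sketch of what they would look like; the user, host, and group values below are placeholders, not from the original thread:

```xml
<!-- core-site.xml on the cluster: allow the server app's user to impersonate -->
<property>
  <name>hadoop.proxyuser.myserveruser.hosts</name>
  <!-- hosts the server app is allowed to impersonate from -->
  <value>myserverhost.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.myserveruser.groups</name>
  <!-- groups whose members may be impersonated -->
  <value>users</value>
</property>
```

With these in place, the server app logs in from its keytab once and wraps per-user work in UGI.doAs(), as the bullets describe.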
Re: hadoop security API (repost)
On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote: Alejandro, Thanks for the reply. My intent is to also be able to scan/get/put hbase tables under a specified identity as well. What options do I have to perform the same multi-tenant authorization for these operations? I have posted this to the hbase users distribution list as well, but thought you might have insight. Since hbase security authentication is so dependent upon hadoop, it would be nice if your suggestion worked for hbase as well. Getting back to your suggestion... when configuring hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the ugi.doAs() privileged call and host2 is the hadoop namenode? host1 in that case. Also, as another option, is there not a way for an application to pass hadoop/hbase authentication the name of a Kerberos principal to use? In this case, no proxy, just execute as the designated user. You could do that, but that means your app will have to have keytabs for all the users it wants to act as. Proxyuser will be much easier to manage. Maybe look into getting proxyuser support into hbase if it is not there yet. Thanks. -Tony -Original Message- From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: Monday, July 02, 2012 11:40 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). 
* Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use the UGI.doAs() to create JobClient/Filesystem instances using the user you want to do something on behalf * Keep in mind that all the users you need to do something on behalf should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to be also defined in in the KDC user database. Hope this helps. Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmaticaly. What I have found so far may help even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, hadoop library will locate it automatically if the name of the ticket cache corresponds to default location. On Linux it is located /tmp/krb5cc_uid-number. 
For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan' meaning you can impersonate ivan from hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid startingExpires Service principal 02/07/2012 13:59 02/07/2012 23:59 krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. In my tests, I only use HDFS and here a snippet of code to have access to a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, kerberos) conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, true) conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab
Re: kerberos mapreduce question
If you provision your user/group information via LDAP to all your nodes it is not a nightmare. On Thu, Jun 7, 2012 at 7:49 AM, Koert Kuipers ko...@tresata.com wrote: thanks for your answer. so at a large place like say yahoo, or facebook, assuming they use kerberos, every analyst that uses hive has an account on every node of their large cluster? sounds like an admin nightmare to me On Thu, Jun 7, 2012 at 10:46 AM, Mapred Learn mapred.le...@gmail.com wrote: Yes, User submitting a job needs to have an account on all the nodes. Sent from my iPhone On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote: with kerberos enabled a mapreduce job runs as the user that submitted it. does this mean the user that submitted the job needs to have linux accounts on all machines on the cluster? how does mapreduce do this (run jobs as the user)? do the tasktrackers use secure impersonation to run-as the user? thanks! koert -- Alejandro
Re: How can I configure oozie to submit different workflows from different users ?
Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3. I have a hadoop user as an oozie-user. I have set the following things:

conf/oozie-site.xml:

<property>
  <name>oozie.services.ext</name>
  <value>org.apache.oozie.service.HadoopAccessorService</value>
  <description>To add/replace services defined in 'oozie.services' with custom implementations. Class names must be separated by commas.</description>
</property>

conf/core-site.xml:

<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>

When I am submitting jobs as the hadoop user, I am able to run them properly. 
But when I submit the same workflow from a different user, who can submit simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: How can I configure oozie to submit different workflows from different users ?
multiple value are comma separated. keep in mind that valid values for proxyuser groups, as the property name states are GROUPS, not USERS. thxs. Alejandro On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.comwrote: How can I specify multiple users /groups for proxy user setting ? Can I give comma separated values in these settings ? Thanks, Praveenesh On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com wrote: Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. 
But when I submit the same workflow from a different user, who can submit simple MR jobs to my hadoop cluster, I am getting the following error:

JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
    at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
    ... 11 more
Re: How do I synchronize Hadoop jobs?
You can use Oozie for that: you can write a workflow job that forks A and B and then joins before C. Thanks. Alejandro On Wed, Feb 15, 2012 at 11:23 AM, W.P. McNeill bill...@gmail.com wrote: Say I have two Hadoop jobs, A and B, that can be run in parallel. I have another job, C, that takes the output of both A and B as input. I want to run A and B at the same time, wait until both have finished, and then launch C. What is the best way to do this? I know the answer if I've got a single Java client program that launches A, B, and C. But what if I don't have the option to launch all of them from a single Java program? (Say I've got a much more complicated system with many steps happening between A-B and C.) How do I synchronize between jobs and make sure there are no race conditions, etc.? Is this what Zookeeper is for?
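The fork/join workflow Alejandro describes might be sketched as below. This is a minimal sketch only; the workflow and action names (ab-then-c, job-a, job-b, job-c) are invented for illustration, and the map-reduce action bodies are elided.

```xml
<workflow-app name="ab-then-c" xmlns="uri:oozie:workflow:0.1">
  <start to="fork-ab"/>
  <!-- fork runs A and B concurrently -->
  <fork name="fork-ab">
    <path start="job-a"/>
    <path start="job-b"/>
  </fork>
  <action name="job-a">
    <map-reduce><!-- job A config --></map-reduce>
    <ok to="join-ab"/>
    <error to="fail"/>
  </action>
  <action name="job-b">
    <map-reduce><!-- job B config --></map-reduce>
    <ok to="join-ab"/>
    <error to="fail"/>
  </action>
  <!-- join waits for both A and B before moving on to C -->
  <join name="join-ab" to="job-c"/>
  <action name="job-c">
    <map-reduce><!-- job C config --></map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>A, B or C failed</message></kill>
  <end name="end"/>
</workflow-app>
```

The join node is what provides the synchronization the question asks about: Oozie will not transition to job-c until every forked path has reached the join.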
Re: Hybrid Hadoop with fork/join ?
Rob, Hadoop has a way to run Map tasks in multithreaded mode; look for MultithreadedMapRunner / MultithreadedMapper. Thanks. Alejandro. On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart robstewar...@gmail.com wrote: Hi, I'm investigating the feasibility of a hybrid approach to parallel programming, by fusing together the concurrent Java fork/join libraries with Hadoop... MapReduce, a paradigm suited for scalable execution over distributed memory, plus fork/join, a paradigm for optimal multi-threaded shared-memory execution. I am aware that, to a degree, Hadoop can take advantage of multiple cores on a compute node, by setting mapred.tasktracker.map/reduce.tasks.maximum to more than one. As it states in the Hadoop: The Definitive Guide book, each task runs in a separate JVM. So setting maximum map tasks to 3 will allow the possibility of 3 JVMs running on the operating system, right? As an alternative to this approach, I am looking to introduce fork/join into map tasks, and perhaps also throttle down the maximum number of map tasks per node to 1. I would implement a RecordReader that produces map tasks of coarser granularity, and fork/join would subdivide each map task using multiple threads - where the output of the join would be the output of the map task. The motivation for this is that threads are a lot more lightweight than initializing JVMs, which, as the Hadoop book points out, takes a second or so each time, unless mapred.job.reuse.jvm.num.tasks is set higher than 1 (the default is 1). So for example, opting for small granular maps for a particular job for a given input generates 1,000 map tasks, which will mean that 1,000 JVMs will be created on the Hadoop cluster within the execution of the program. Suppose instead I write a bespoke RecordReader to increase the granularity of each map, so much so that only 100 map tasks are created.
The embedded fork/join code would further split each map into 10 threads, to evaluate within one JVM concurrently. I would expect this latter approach of multiple threads to have better performance than the clunkier multiple-JVM approach. Has such a hybrid approach combining MapReduce with ForkJoin been investigated for feasibility, or have similar studies been published? Are there any important technical limitations that I should consider? Any thoughts on the proposed multi-threaded distributed-shared-memory architecture are much appreciated from the Hadoop community! -- Rob Stewart
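The JDK side of the proposal can be exercised on its own. The sketch below uses plain java.util.concurrent with no Hadoop involved; the class name and the summing workload are invented for illustration, standing in for "split one coarse map input across N threads, and the join's result is the map task's output".

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a range of an array by recursively forking sub-tasks, the same
// divide-and-conquer shape the thread proposes embedding in a map task.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    public ForkJoinSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // small enough: compute inline
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, lo, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, hi);
        left.fork();                          // run the left half concurrently
        return right.compute() + left.join(); // joined result = "map output"
    }

    public static long sum(long[] data) {
        return new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 1 + 2 + ... + 10000 = 50005000
    }
}
```

All threads here share one JVM heap, which is exactly the lightweight-threads-versus-many-JVMs trade-off the message argues for.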
Re: Any samples of how to write a custom FileSystem
Steven, You could also look at HttpFSFileSystem in the hadoop-httpfs module; it is quite simple and self-contained. Cheers. Alejandro On Tue, Jan 31, 2012 at 8:37 PM, Harsh J ha...@cloudera.com wrote: To write a custom filesystem, extend the FileSystem class. Depending on the scheme it is supposed to serve, creating an entry fs.scheme.impl in core-site.xml and then loading it via the FileSystem.get(URI, conf) API will auto-load it for you, provided the URI you pass has the right scheme. So supposing I have a FS scheme foo, I'd register it in core-site.xml as:

<name>fs.foo.impl</name>
<value>com.myorg.FooFileSystem</value>

And then, with a URI object from a path that goes foo:///mypath/, I'd do FileSystem.get(URI, new Configuration()) to get a FooFileSystem instance. Similarly, if you want to overload the local filesystem with your class, override the fs.file.impl config with your derivative class, and that'd be used in your configuration-loaded programs in future. A good, not-so-complex impl. example to look at generally would be the S3FileSystem. On Wed, Feb 1, 2012 at 9:41 AM, Steve Lewis lordjoe2...@gmail.com wrote: Specifically how do I register a Custom FileSystem - any sample code? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
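Putting Harsh's steps together, the client-side wiring might look like the sketch below. FooFileSystem and the foo scheme are the hypothetical names from his reply, not a real filesystem, and this fragment assumes Hadoop's client libraries are on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// programmatic equivalent of the core-site.xml entry shown above
conf.set("fs.foo.impl", "com.myorg.FooFileSystem");

// the URI scheme ("foo") selects the implementation registered above
FileSystem fs = FileSystem.get(URI.create("foo:///mypath/"), conf);
```

Setting the property in core-site.xml instead of in code has the advantage that every configuration-loading program on the cluster picks up the registration automatically.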
Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised
Bill, In addition you must call DistributedCache.createSymlink(configuration); that should do it. Thxs. Alejandro On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote: I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes, with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client, in the run() function, while I'm creating the job I first call: fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip); As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call: DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app, configuration); I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error: Exception in thread main java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app. It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html . The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
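Alejandro's fix, placed in context: a sketch of the client-side calls, using the paths from the original message and assuming Hadoop's old (org.apache.hadoop.filecache) API is on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;

// archive URI with a '#' fragment naming the symlink to create in the task CWD
DistributedCache.addCacheArchive(
    new URI("/tmp/gate-app.zip#gate-app"), configuration);
// without this call the framework ignores the fragment and the '#' is
// treated as part of the literal path, producing the FileNotFoundException
DistributedCache.createSymlink(configuration);
```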
Re: Stop chained mapreduce.
Ilyal, The MR output file names follow the part-* pattern, and you'll have as many as your job had reducers. Since you know the output directory, you can do an fs.listStatus() of the output directory and check all the part-* files. Hope this helps. Thanks. Alejandro On Sun, Sep 11, 2011 at 4:52 AM, ilyal levin nipponil...@gmail.com wrote: Hi, I created a chained mapreduce program where each job creates a SequenceFile output. My stopping condition is simply to check if the last output file (type: SequenceFile) is empty. In order to do that I need to use the SequenceFile.Reader, and for it to read the data I need the path of the output file. The problem is that I don't know the name of the file; it usually depends on the number of the reducer. What can I do to solve this? Thanks.
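A sketch of the listing Alejandro suggests, assuming a FileSystem fs and a Path outputDir are already in scope (both names are placeholders):

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// list every reducer output file (part-*) in the job's output directory
FileStatus[] parts = fs.listStatus(outputDir, new PathFilter() {
    public boolean accept(Path p) {
        return p.getName().startsWith("part-");
    }
});
for (FileStatus part : parts) {
    // part.getPath() is the path SequenceFile.Reader needs. Note that an
    // "empty" SequenceFile still contains a header, so test for records
    // (reader.next(...) returning false) rather than for zero length.
}
```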
Re: Timer jobs
[moving common-user@ to BCC] Oozie is not HA yet, but it would be relatively easy to make it so. It was designed with that in mind; we even did a prototype. Oozie consists of 2 services: a SQL database to store the Oozie jobs' state, and a servlet container where the Oozie app proper runs. The solution for HA for the database, well, is left to the database. This means you'll have to get an HA DB. The solution for HA for the Oozie app is deploying the servlet container with the Oozie app in more than one box (2 or 3) and fronting them with an HTTP load-balancer. The missing part is that the current Oozie lock-service is an in-memory implementation. This should be replaced with a Zookeeper implementation. Zookeeper could run externally or internally in all Oozie servers. This is what was prototyped long ago. Thanks. Alejandro On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote: If I get you right you are asking about installing Oozie as a distributed and/or HA cluster?! In that case I am not familiar with an out-of-the-box solution by Oozie. But I think you can make up a solution of your own, for example: installing Oozie on two servers on the same partition, synchronized by DRBD. You can trigger a failover using Linux Heartbeat and that way maintain a virtual IP. On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote: Hi, Thanks a lot for pointing me to Oozie. I have looked a little bit into Oozie and it seems like the component triggering jobs is called the Coordinator Application. But I really see nowhere that this Coordinator Application doesn't just run on a single machine, and that it will therefore not trigger anything if this machine is down. Can you confirm that the Coordinator Application role is distributed in a distributed Oozie setup, so that jobs get triggered even if one or two machines are down? Regards, Per Steffensen Ronen Itkin wrote: Hi, Try to use Oozie for job coordination and work flows.
On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote: Hi, I use hadoop for a MapReduce job in my system. I would like to have the job run every 5th minute. Is there any distributed timer job stuff in hadoop? Of course I could set up a timer in an external timer framework (CRON or something like that) that invokes the MapReduce job. But CRON is only running on one particular machine, so if that machine goes down my job will not be triggered. Then I could set up the timer on all or many machines, but I would not like the job to be run in more than one instance every 5th minute, so then the timer jobs would need to coordinate who is actually starting the job this time and all the rest would just do nothing. I guess I could come up with a solution to that - e.g. writing some lock stuff using HDFS files or by using ZooKeeper. But I would really like it if someone had already solved the problem and provided some kind of distributed timer framework running in a cluster, so that I could just register a timer job with the cluster and then be sure that it is invoked every 5th minute, no matter if one or two particular machines in the cluster are down. Any suggestions are very welcome. Regards, Per Steffensen -- * Ronen Itkin* Taykey | www.taykey.com
Re: Oozie monitoring
Avi, For Oozie related questions, please subscribe and use the oozie-...@incubator.apache.org alias. Thanks. Alejandro On Tue, Aug 30, 2011 at 2:28 AM, Avi Vaknin avivakni...@gmail.com wrote: Hi All, First, I really enjoy writing you and I'm thankful for your help. I have Oozie installed on dedicated server and I want to monitor it using my Nagios server. Can you please suggest some important parameters to monitor or other tips regarding this issue. Thanks. Avi
Re: Oozie on the namenode server
[Moving thread to Oozie aliases and hadoop's alias to BCC] Avi, Currently you can have a cold-standby solution. An Oozie setup consists of 2 systems: a SQL DB (storing all Oozie jobs' state) and a servlet container (running Oozie proper). You need your DB to be highly available. You need to have a second installation of Oozie, configured identically to the running one, on standby. If the running Oozie fails, you start the second one. The idea (we once did a prototype using Zookeeper) is to eventually support a multithreaded Oozie server, which would give both high availability and horizontal scalability. Still, you'll need a highly available DB. Thanks. Alejandro On Mon, Aug 29, 2011 at 4:57 AM, Avi Vaknin avivakni...@gmail.com wrote: Hi, I have one more question about Oozie: is there any suggested high-availability solution for Oozie? Something like an active/passive or active/active solution. Thanks. Avi -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Monday, August 29, 2011 2:20 PM To: common-user@hadoop.apache.org Subject: Re: Oozie on the namenode server Avi, Should be OK to do so at this stage. However, keep monitoring loads on the machines to determine when to move things out to their own dedicated boxes. On Mon, Aug 29, 2011 at 3:20 PM, Avi Vaknin avivakni...@gmail.com wrote: Hi All, I want to install Oozie and I wonder if it is OK to install it on the namenode or whether I need a dedicated server for it. I have a very small Hadoop cluster (4 datanodes + namenode + secondary namenode). Thanks for your help. Avi -- Harsh J
Hoop into 0.23 release
Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://issues.apache.org/jira/browse/MAPREDUCE-2863. Most if not all of the remaining work is not development but Java package renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and final packaging. The current blocker is deciding on the Maven sub-modules organization, https://issues.apache.org/jira/browse/HADOOP-7560. I'll drive the discussion for HADOOP-7560 and as soon as we have an agreement there I'll refactor Hoop accordingly. Does this sound reasonable? Thanks. Alejandro
Re: Hoop into 0.23 release
Nicholas, Thanks. In order to move forward with the integration of Hoop I need consensus on: * https://issues.apache.org/jira/browse/HDFS-2178?focusedCommentId=13089106page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13089106 * https://issues.apache.org/jira/browse/HADOOP-7560 Thanks. Alejandro On Mon, Aug 22, 2011 at 3:42 PM, Tsz Wo Sze szets...@yahoo.com wrote: +1 I believe HDFS-2178 is very close to being committed. Great work Alejandro! Nicholas From: Alejandro Abdelnur t...@cloudera.com To: common-user@hadoop.apache.org; hdfs-...@hadoop.apache.org Sent: Monday, August 22, 2011 2:16 PM Subject: Hoop into 0.23 release Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://issues.apache.org/jira/browse/MAPREDUCE-2863. Most if not all of the remaining work is not development but Java package renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and final packaging. The current blocker is deciding on the Maven sub-modules organization, https://issues.apache.org/jira/browse/HADOOP-7560. I'll drive the discussion for HADOOP-7560 and as soon as we have an agreement there I'll refactor Hoop accordingly. Does this sound reasonable? Thanks. Alejandro
Re: Can you access Distributed cache in custom output format ?
Mmmh, I've never used the -files option (I don't know if it will copy the files to HDFS for you, or whether you have to put them there first). My usage pattern with the DC is copying the files to HDFS, then using the DC API to add those files to the jobconf. Alejandro On Fri, Jul 29, 2011 at 10:56 AM, Mapred Learn mapred.le...@gmail.com wrote: I'm trying to access a file that I sent via the -files option in my hadoop jar command. In my outputformat, I am doing something like:

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
String file1 = "";
String file2 = "";
Path pt = null;
for (Path p : cacheFiles) {
  if (p != null) {
    if (p.getName().endsWith(".ryp")) {
      file1 = p.getName();
    } else if (p.getName().endsWith(".cpt")) {
      file2 = p.getName();
      pt = p;
    }
  }
}

// then read the file, which gives a "file does not exist" exception:
Path pat = new Path(file2);
BufferedReader reader = null;
try {
  FileSystem fs = FileSystem.get(conf);
  reader = new BufferedReader(new InputStreamReader(fs.open(pat)));
  String line = null;
  while ((line = reader.readLine()) != null) {
    System.out.println("Now parsing the line: " + line);
  }
} catch (Exception e) {
  System.out.println("exception " + e.getMessage());
}

On Fri, Jul 29, 2011 at 10:50 AM, Alejandro Abdelnur t...@cloudera.com wrote: Where are you getting the error, in the client submitting the job or in the MR tasks? Are you trying to access a file or trying to set a JAR in the DistributedCache? How/when are you adding the file/JAR to the DC? How are you retrieving the file/JAR from your outputformat code? Thxs. Alejandro On Fri, Jul 29, 2011 at 10:43 AM, Mapred Learn mapred.le...@gmail.com wrote: I am trying to create a custom text outputformat where I want to access a distributed cache file. On Fri, Jul 29, 2011 at 10:42 AM, Harsh J ha...@cloudera.com wrote: Mapred, By outputformat, do you mean the frontend, submit-time run of OutputFormat?
Then no, it cannot access the distributed cache because it's not really set up at that point, and the front end doesn't need the distributed cache really when it can access those files directly. Could you describe in slightly more depth what you're attempting to do? On Fri, Jul 29, 2011 at 10:57 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I am trying to access the distributed cache in my custom output format but it does not work, and the file open in the custom output format fails with file does not exist, even though it physically does. Looks like the distributed cache only works for Mappers and Reducers? Is there a way I can read the Distributed Cache in my custom output format? Thanks, -JJ -- Harsh J
Re: Can you access Distributed cache in custom output format ?
So, -files uses the '#' symlink feature, correct? If so, from the MR task JVM, doing a new File(FILENAME) would work, where FILENAME does not include the path of the file, correct? Thxs. Alejandro On Fri, Jul 29, 2011 at 11:20 AM, Brock Noland br...@cloudera.com wrote: With -files the file will be placed in the CWD of the map/reduce tasks. You should be able to open the file with FileInputStream. Brock On Fri, Jul 29, 2011 at 1:18 PM, Mapred Learn mapred.le...@gmail.com wrote: When you use the -files option, it copies into a .staging directory and all mappers can access it, but for the output format I see it is not able to access it. -files copies the cache file under: /user/id/.staging/job name/files/filename On Fri, Jul 29, 2011 at 11:14 AM, Alejandro Abdelnur t...@cloudera.com wrote: Mmmh, I've never used the -files option (I don't know if it will copy the files to HDFS for you, or whether you have to put them there first). My usage pattern with the DC is copying the files to HDFS, then using the DC API to add those files to the jobconf. Alejandro On Fri, Jul 29, 2011 at 10:56 AM, Mapred Learn mapred.le...@gmail.com wrote: I'm trying to access a file that I sent via the -files option in my hadoop jar command.
In my outputformat, I am doing something like:

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
String file1 = "";
String file2 = "";
Path pt = null;
for (Path p : cacheFiles) {
  if (p != null) {
    if (p.getName().endsWith(".ryp")) {
      file1 = p.getName();
    } else if (p.getName().endsWith(".cpt")) {
      file2 = p.getName();
      pt = p;
    }
  }
}

// then read the file, which gives a "file does not exist" exception:
Path pat = new Path(file2);
BufferedReader reader = null;
try {
  FileSystem fs = FileSystem.get(conf);
  reader = new BufferedReader(new InputStreamReader(fs.open(pat)));
  String line = null;
  while ((line = reader.readLine()) != null) {
    System.out.println("Now parsing the line: " + line);
  }
} catch (Exception e) {
  System.out.println("exception " + e.getMessage());
}

On Fri, Jul 29, 2011 at 10:50 AM, Alejandro Abdelnur t...@cloudera.com wrote: Where are you getting the error, in the client submitting the job or in the MR tasks? Are you trying to access a file or trying to set a JAR in the DistributedCache? How/when are you adding the file/JAR to the DC? How are you retrieving the file/JAR from your outputformat code? Thxs. Alejandro On Fri, Jul 29, 2011 at 10:43 AM, Mapred Learn mapred.le...@gmail.com wrote: I am trying to create a custom text outputformat where I want to access a distributed cache file. On Fri, Jul 29, 2011 at 10:42 AM, Harsh J ha...@cloudera.com wrote: Mapred, By outputformat, do you mean the frontend, submit-time run of OutputFormat? Then no, it cannot access the distributed cache because it's not really set up at that point, and the front end doesn't need the distributed cache really when it can access those files directly. Could you describe in slightly more depth what you're attempting to do? On Fri, Jul 29, 2011 at 10:57 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I am trying to access the distributed cache in my custom output format but it does not work, and the file open in the custom output format fails with file does not exist, even though it physically does.
Looks like distributed cache only works for Mappers and Reducers ? Is there a way I can read Distributed Cache in my custom output format ? Thanks, -JJ -- Harsh J
Re: Multiple Output Formats
Roger, Or you can take a look at Hadoop's MultipleOutputs class. Thanks. Alejandro On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu pire...@crs4.it wrote: On July 26, 2011 06:11:33 PM Roger Chen wrote: Hi all, I am attempting to implement MultipleOutputFormat to write data to multiple files depending on the output keys and values. Can somebody provide a working example of how to implement this in Hadoop 0.20.2? Thanks! Hello, I have a working sample here: http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java It extends FileOutputFormat. -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: Programming Multiple rounds of mapreduce
Thanks Matt. Arko, if you plan to use Oozie, you can have a simple coordinator job that does this, for example (the following schedules a WF every 5 mins that consumes the output produced by the previous run; you just have to have the initial data). Thxs. Alejandro

<coordinator-app name="coord-1" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <concurrency>1</concurrency>
  </controls>
  <datasets>
    <dataset name="data" frequency="${coord:minutes(5)}"
             initial-instance="${start}" timezone="UTC">
      <uri-template>${nameNode}/user/${coord:user()}/examples/${dataRoot}/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="data">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <output-events>
    <data-out name="output" dataset="data">
      <instance>${coord:current(1)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/${coord:user()}/examples/apps/subwf-1</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>examplesRoot</name>
          <value>${examplesRoot}</value>
        </property>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

-- On Mon, Jun 13, 2011 at 3:01 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: If you know for certain that it needs to be split into multiple work units I would suggest looking into Oozie. Easy to install, light weight, low learning curve... for my purposes it's been very helpful so far. I am also fairly certain you can chain multiple job confs into the same run, but I have not actually tried that, so I can't promise it is easy or possible.
http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/ If you are not running CDH3u0 then you can also get the tarball and documentation directly here: https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs Matt -----Original Message----- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Monday, June 13, 2011 4:57 PM To: mapreduce-user@hadoop.apache.org Cc: Arko Provo Mukherjee Subject: Re: Programming Multiple rounds of mapreduce Well, you can define a job for each round and then define the running workflow based on your implementation, chaining your jobs. On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote: Hello, I am trying to write a program where I need to write multiple rounds of map and reduce. The output of the last round of map-reduce must be fed into the input of the next round. Can anyone please guide me to any link / material that can teach me how I can achieve this. Thanks a lot in advance! Thanks & regards, Arko -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
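Marcos's alternative, chaining the rounds from a single driver, might look like the sketch below: each round's output directory becomes the next round's input. This assumes Hadoop's new (org.apache.hadoop.mapreduce) API on the classpath; the class names Driver, RoundMapper and RoundReducer, the paths and numRounds are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Path input = new Path("/data/initial");           // initial data, as above
for (int round = 1; round <= numRounds; round++) {
    Path output = new Path("/data/round-" + round);
    Job job = new Job(conf, "round-" + round);
    job.setJarByClass(Driver.class);
    job.setMapperClass(RoundMapper.class);
    job.setReducerClass(RoundReducer.class);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    if (!job.waitForCompletion(true)) {
        break;                    // stop the chain if a round fails
    }
    input = output;               // feed this round's output forward
}
```

The Oozie coordinator shown earlier in the thread does the same chaining declaratively and survives driver-machine failures, which is why it was the first suggestion.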
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John, Do you have all the JARs used by your classes in Needed.jar in the DC classpath as well? Are you propagating the delegation token? Thxs. Alejandro On Wed, Jun 1, 2011 at 12:38 PM, John Armstrong john.armstr...@ccri.com wrote: On Tue, 31 May 2011 15:09:28 -0400, John Armstrong john.armstr...@ccri.com wrote: On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur t...@cloudera.com wrote: What is exactly that does not work? In the hopes that more information can help, I've dug into the local filesystems on each of my four nodes and retrieved the job.xml and the locations of the files to show that everything shows up where it should. In this example I have one regular file (hdfs://node1:hdfsport/hdfs/path/to/file1.foo) added with DistributedCache.addCacheFile(). I also have a JAR (hdfs://node1:hdfsport/hdfs/path/to/needed.jar) added with DistributedCache.addFileToClassPath(). The needed JAR is also part of the classpath Oozie provides to my Java task. As you can see, both files (with correct filesizes and timestamps) are listed as cache files in job.xml, and the JAR is listed as a classpath file. Both files show up on each node; the JAR shows up twice on node 1 since that's where Oozie ran the Java task, and thus where Oozie placed the JAR with its own use of the distributed cache. And yet, when mapreduce actually tries to run the job my Java task launches, it immediately hits a ClassNotFoundException, claiming it can't find the class my.class.package.Needed which is contained in needed.jar.

JOB.XML
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.job.classpath.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/file1.foo,hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.filesizes</name>
  <value>61175,2257057</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.timestamps</name>
  <value>1306949104866,1306949371660</value>
</property>
...

NODE 1 LOCAL FILESYSTEM
/data/4/mapred/local/taskTracker/distcache/5181540010607464671_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/1/mapred/local/taskTracker/distcache/6423795395825083633_-1942178119_1279314284/node1/hdfs/path/to/needed.jar
/data/3/mapred/local/taskTracker/distcache/2424191142954514770_1281905983_1269665052/node1/hdfs/path/to/needed.jar

NODE 2 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-1458632814086969626_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/4434671176913378591_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 3 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-6763452370915390695_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/683838159704655_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 4 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-1759547009148985681_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/1998811135309473771_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

SAMPLE MAPPER ATTEMPT LOG
2011-06-01 14:21:41,442 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-06-01 14:21:41,557 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/job.jar - /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_09_0/work/./job.jar
2011-06-01 14:21:41,560 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/.job.jar.crc -
/data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_09_0/work/./.job.jar.crc
2011-06-01 14:21:41,563 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-06-01 14:21:41,660 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:973)
    at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:236)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:484)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:298)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John, Now I get what you are trying to do. My recommendation would be: * Use a Java action to do all the stuff prior to starting your MR job * Use a mapreduce action to start your MR job * If you need to propagate properties from the Java action to the MR action you can use the capture-output flag. If you still want to start your MR job from your Java action, then your Java action should do all the setup the MapReduceMain class does before starting the MR job (this will ensure the delegation tokens and the distributed cache are available to your MR job). Thanks. Alejandro On Mon, May 30, 2011 at 6:34 AM, John Armstrong john.armstr...@ccri.com wrote: On Fri, 27 May 2011 15:47:23 -0700, Alejandro Abdelnur t...@cloudera.com wrote: John, If you are using Oozie, dropping all the JARs your MR jobs need in the Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs are in the distributed cache. That doesn't seem to work. I have this JAR in the WF /lib/ directory because the Java job that launches the MR job needs it. And yes, it's in the distributed cache for the wrapper MR job that Oozie uses to remotely run the Java job. The problem is it's not available for the MR job that the Java job launches. Thanks, though, for the suggestion.
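The layout recommended above can be sketched as an Oozie workflow. This is a hedged illustration, not code from the thread: the node names, main class, and the propagated property key are invented.

```xml
<!-- Sketch: a Java action does the setup and hands one property to a
     map-reduce action via capture-output. All names here are illustrative. -->
<workflow-app name="setup-then-mr" xmlns="uri:oozie:workflow:0.2">
  <start to="setup"/>
  <action name="setup">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.example.SetupMain</main-class>
      <capture-output/>
    </java>
    <ok to="mr-job"/>
    <error to="fail"/>
  </action>
  <action name="mr-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>some.prop</name>
          <!-- value produced by the Java action above -->
          <value>${wf:actionData('setup')['some.prop']}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Job failed</message></kill>
  <end name="end"/>
</workflow-app>
```

With capture-output, the Java action writes the properties it wants to hand off to the file named by the oozie.action.output.properties system property, and wf:actionData() exposes them to later actions.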
Re: MultipleOutputFormat
Dmitriy, Have you checked MultipleOutputs instead? It provides similar functionality. Alejandro On Wed, Mar 30, 2011 at 11:39 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I can't seem to be able to find either a jira or an implementation of MultipleOutputFormat in the new api in either the 0.21 or 0.22 branches. Are there any plans to port that to the new api as well? thanks in advance. -Dmitriy
Re: MultipleOutputFormat
You should be able to create partitions on the fly. Check the last example in the javadocs: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html ... mos.write(key, new Text(value), generateFileName(key, new Text(value))); Hope this helps. Alejandro On Wed, Mar 30, 2011 at 12:02 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: yes.. but in my old code the file names are created on the fly (it basically creates partitions based on a time field). I don't think MultipleOutputs is suitable to create partitions on the fly. On Tue, Mar 29, 2011 at 8:56 PM, Alejandro Abdelnur t...@cloudera.com wrote: Dmitriy, Have you checked MultipleOutputs instead? It provides similar functionality. Alejandro On Wed, Mar 30, 2011 at 11:39 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I can't seem to be able to find either a jira or an implementation of MultipleOutputFormat in the new api in either the 0.21 or 0.22 branches. Are there any plans to port that to the new api as well? thanks in advance. -Dmitriy
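For the time-based case described above, the third argument to mos.write() can be computed per record. A minimal sketch of such a helper; the partition scheme, class and method names are invented for illustration:

```java
// Hypothetical helper for MultipleOutputs-based partitioning by a time field.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionNames {

    // Map a record's timestamp (millis since the epoch) to a relative
    // output path such as "1970/01/01/part"; UTC keeps it deterministic.
    static String generateFileName(long timestampMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/'part'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) {
        // epoch 0 falls on 1970-01-01 UTC
        System.out.println(generateFileName(0L)); // prints 1970/01/01/part
    }
}
```

In the reducer you would then call mos.write(key, value, generateFileName(ts)); MultipleOutputs creates each nested output file on demand, which is exactly the on-the-fly behavior asked about.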
Re: how to use hadoop apis with cloudera distribution ?
If you write your code within a Maven project (which you can open from Eclipse) then you should add the following to your pom.xml:

* Define the Cloudera repository:

<repository>
  <id>cdh.repo</id>
  <url>https://repository.cloudera.com/content/groups/cloudera-repos/</url>
  <name>Cloudera Repositories</name>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

* Add the Hadoop dependency:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2-cdh3B4</version>
</dependency>

You can browse the Cloudera Maven repositories (for the available JARs and versions) at: https://repository.cloudera.com/ You'll find Hadoop/Pig/Hive/Sqoop/Zookeeper/Hbase/Flume/Oozie JARs (with all the *cdh3B4 versions of them working with each other). Then your app should run with the Cloudera releases of Hadoop/...etc. Hope this helps. Thanks. Alejandro On Wed, Mar 9, 2011 at 6:03 AM, Marcos Ortiz Valmaseda mlor...@uci.cu wrote: You can check the Cloudera Training Videos, where there is a screencast explaining how to develop for Hadoop using Eclipse. http://www.cloudera.com/presentations http://vimeo.com/cloudera Now, for working with the Hadoop APIs using Eclipse, for developing applications based on Hadoop, you can use the Karmasphere Plugin for Hadoop Development, or if you are a NetBeans user, they have a module for that. Regards. ----- Original Message ----- From: Mapred Learn mapred.le...@gmail.com To: Marcos Ortiz mlor...@uci.cu CC: mapreduce-user@hadoop.apache.org Sent: Tuesday, March 8, 2011 12:26:00 (GMT-0500) Auto-Detected Subject: Re: how to use hadoop apis with cloudera distribution ? Thanks Marcos! I was trying to use CDH3 with Eclipse and could not figure out why Eclipse complains about the import statements for the hadoop apis when cloudera already includes them. I did not understand how CDH3 works with Eclipse; does it download the hadoop apis when we add svn urls?
On Tue, Mar 8, 2011 at 7:22 AM, Marcos Ortiz mlor...@uci.cu wrote: On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: Hi, I downloaded the CDH3 VM for hadoop but if I want to use something like: import org.apache.hadoop.conf.Configuration; in my java code, what else do I need to do ? You can see all the tutorials that Cloudera has on its site: http://www.cloudera.com/presentations http://www.cloudera.com/info/training http://www.cloudera.com/developers/learn-hadoop/ You can also check the CDH3 Official Documentation and the latest news about the new release: http://docs.cloudera.com http://www.cloudera.com/blog/category/cdh/ Do i need to download hadoop from apache ? No, CDH3 beta comes with all the required tools to work with Hadoop, and even more applications like HUE, Oozie, Zookeeper, Pig, Hive, Chukwa, HBase, Flume, etc. if yes, then what does cdh3 do ? The Cloudera colleagues have done excellent work packaging the most used applications with Hadoop on a single virtual machine for testing, making for an easier approach to using Hadoop. They have Red Hat and Ubuntu/Debian compatible packages to make the installation, configuration and use of Hadoop on these operating systems easier. Please read http://docs.cloudera.com if not, then where can i find hadoop code on cdh VM ? I am using above line in my java code in eclipse and eclipse is not able to find it. Did you set JAVA_HOME and HADOOP_HOME on your system? If you have any doubt with this, you can check the excellent DZone refcards, Getting Started with Hadoop and Deploying Hadoop, written by Eugene Ciurana (http://eugeneciurana.eu), VP of Technology at Badoo.com Regards, and I hope that this information could be useful for you.
-- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: EXT :Re: Problem running a Hadoop program with external libraries
Why don't you put your native library in HDFS and use the DistributedCache to make it available to the tasks? For example: copy 'foo.so' to 'hdfs://localhost:8020/tmp/foo.so', then add it to the job's distributed cache: DistributedCache.addCacheFile(new URI("hdfs://localhost:8020/tmp/foo.so#foo.so"), jobConf); DistributedCache.createSymlink(jobConf); Note that the '#foo.so' will create a softlink in the task running dir. And the task running dir is in the LD_LIBRARY_PATH of your task. Alejandro On Sat, Mar 5, 2011 at 7:19 AM, Lance Norskog goks...@gmail.com wrote: I have never heard of putting a native code shared library in a Java jar. I doubt that it works. But it's a cool idea! A Unix binary program loads shared libraries from the paths given in the environment variable LD_LIBRARY_PATH. This has to be set to the directory with the OpenCV .so file when you start Java. Lance On Mar 4, 2011, at 2:13 PM, Brian Bockelman wrote: Hi, Check your kernel's overcommit settings. This will prevent the JVM from allocating memory even when there's free RAM. Brian On Mar 4, 2011, at 3:55 PM, Ratner, Alan S (IS) wrote: Aaron, Thanks for the rapid responses. * ulimit -u unlimited is in .bashrc. * HADOOP_HEAPSIZE is set to 4000 MB in hadoop-env.sh * mapred.child.ulimit is set to 2048000 in mapred-site.xml * mapred.child.java.opts is set to -Xmx1536m in mapred-site.xml I take it you are suggesting that I change the java.opts setting to: mapred.child.java.opts = <value>-Xmx1536m -Djava.library.path=/path/to/native/libs</value> Alan Ratner Northrop Grumman Information Systems Manager of Large-Scale Computing 9020 Junction Drive Annapolis Junction, MD 20701 (410) 707-8605 (cell) From: Aaron Kimball [mailto:akimbal...@gmail.com] Sent: Friday, March 04, 2011 4:30 PM To: common-user@hadoop.apache.org Cc: Ratner, Alan S (IS) Subject: EXT :Re: Problem running a Hadoop program with external libraries Actually, I just misread your email and missed the difference between your 2nd and 3rd attempts.
Are you enforcing min/max JVM heap sizes on your tasks? Are you enforcing a ulimit (either through your shell configuration, or through Hadoop itself)? I don't know where these cannot allocate memory errors are coming from. If they're from the OS, could it be because it needs to fork() and momentarily exceed the ulimit before loading the native libs? - Aaron On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com wrote: I don't know if putting native-code .so files inside a jar works. A native-code .so is not classloaded in the same way .class files are. So the correct .so files probably need to exist in some physical directory on the worker machines. You may want to double-check that the correct directory on the worker machines is identified in the JVM property 'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can be manipulated in the Hadoop configuration setting mapred.child.java.opts (include '-Djava.library.path=/path/to/native/libs' in the string there.) Also, if you added your .so files to a directory that is already used by the tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may need to restart the tasktracker instance for it to take effect. (This is true of .jar files in the $HADOOP_HOME/lib directory; I don't know if it is true for native libs as well.) - Aaron On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) alan.rat...@ngc.com wrote: We are having difficulties running a Hadoop program making calls to external libraries - but this occurs only when we run the program on our cluster and not from within Eclipse where we are apparently running in Hadoop's standalone mode. This program invokes the Open Computer Vision libraries (OpenCV and JavaCV). (I don't think there is a problem with our cluster - we've run many Hadoop jobs on it without difficulty.) 1.
I normally use Eclipse to create jar files for our Hadoop programs but I inadvertently hit the run as Java application button and the program ran fine, reading the input file from the eclipse workspace rather than HDFS and writing the output file to the same place. Hadoop's output appears below. (This occurred on the master Hadoop server.) 2. I then exported from Eclipse a runnable jar which extracted required libraries into the generated jar - presumably producing a jar file that incorporated all the required library functions. (The plain jar file for this program is 17 kB while the runnable jar is 30MB.) When I try to run this on my Hadoop cluster (including my master and slave servers) the program reports that it is unable to locate libopencv_highgui.so.2.2: cannot open shared object file: No such file or directory. Now, in addition to this library being incorporated inside
Re: use DistributedCache to add many files to class path
Lei Liu, You have a cut-and-paste error: the second addition should use 'tairJarPath' but it is using 'jeJarPath'. Hope this helps. Alejandro On Thu, Feb 17, 2011 at 11:50 AM, lei liu liulei...@gmail.com wrote: I use DistributedCache to add two files to the class path, for example the code below: String jeJarPath = "/group/aladdin/lib/je-4.1.7.jar"; DistributedCache.addFileToClassPath(new Path(jeJarPath), conf); String tairJarPath = "/group/aladdin/lib/tair-aladdin-2.3.1.jar"; DistributedCache.addFileToClassPath(new Path(jeJarPath), conf); when the map/reduce is executing, the /group/aladdin/lib/tair-aladdin-2.3.1.jar file is added to the class path, but the /group/aladdin/lib/je-4.1.7.jar file is not added to the class path. How can I add many files to the class path? Thanks, LiuLei
Re: Accessing Hadoop using Kerberos
Hadoop's UserGroupInformation class looks into the OS Kerberos cache for the user to use. I guess you'd have to do the Kerberos authentication making sure the Kerberos cache is used and then do a UserGroupInformation.loginUser(). And after doing that the Hadoop libraries should work. Alejandro On Thu, Jan 13, 2011 at 2:19 PM, Muruga Prabu M murugapra...@gmail.com wrote: Yeah, kinit is working fine. My scenario is as follows. I have a java program written by me to upload and download files to the HDFS cluster. I want to perform Kerberos authentication from the Java code itself. I have acquired the TicketGrantingTicket from the Authentication Server and a Service Ticket from the Ticket Granting Server. Now how do I authenticate myself with hadoop by sending the service ticket received from the Ticket Granting Server? Regards, Pikini On Wed, Jan 12, 2011 at 10:13 PM, Alejandro Abdelnur t...@cloudera.com wrote: If you kinit-ed successfully you are done. The hadoop libraries will do the trick of authenticating the user against Hadoop. Alejandro On Thu, Jan 13, 2011 at 12:46 PM, Muruga Prabu M murugapra...@gmail.com wrote: Hi, I have a Java program to upload and download files from the HDFS. I am using Hadoop with Kerberos. I am able to get a TGT (from the Authentication Server) and a service ticket (from the Ticket Granting Server) using kerberos successfully. But after that I don't know how to authenticate myself with Hadoop. Can anyone please help me with how to do it? Regards, Pikini
Re: how to run jobs every 30 minutes?
Ed, Actually Oozie is quite different from Cascading. * Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs. * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG (workflow jobs) and has timer+data dependency triggers (coordinator jobs). Regards. Alejandro On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used the Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
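The coordinator-job route mentioned above can be sketched like this; the app name, paths, and dates are invented for illustration, and frequency is expressed in minutes:

```xml
<!-- Sketch of an Oozie coordinator that triggers a workflow every 30 minutes.
     All names, paths and dates are hypothetical. -->
<coordinator-app name="crawl-every-30min" frequency="30"
                 start="2010-12-15T00:00Z" end="2011-12-15T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://namenode:8020/user/ed/apps/crawl-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The workflow at app-path would contain the actual crawl job; Oozie then handles the every-30-minutes scheduling, retries, and catch-up instead of a hand-rolled sleep loop.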
Re: how to run jobs every 30 minutes?
Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
Re: HDFS Rsync process??
The other approach, if the DR cluster is idle or has enough excess capacity, would be running all the jobs on the input data in both clusters and performing checksums on the outputs to ensure everything is consistent. And you could take advantage of this and distribute ad hoc queries between the 2 clusters. Alejandro On Tue, Nov 30, 2010 at 6:51 PM, Steve Loughran ste...@apache.org wrote: On 30/11/10 03:59, hadoopman wrote: We have two Hadoop clusters in two separate buildings. Both clusters are loading the same data from the same sources (the second cluster is for DR). We're looking at how we can recover the primary cluster and catch it back up again as new data will continue to feed into the DR cluster. It's been suggested we use rsync across the network, however my concern is that the amount of data we would have to copy over would take several days (at a minimum) to sync them even with our dual bonded 1 gig network cards. I'm curious if anyone has come up with a solution short of just loading the source logs into HDFS. Is there a way to even rsync two clusters and get them in sync? Been googling around. Haven't found anything of substance yet. you don't need all the files in the cluster in sync as a lot of them are intermediate and transient files. Instead use distcp to copy source files to the two clusters; this runs across the machines in the cluster and is also designed to work across hadoop versions, with some limitations.
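The output-checksum idea boils down to hashing both clusters' outputs and comparing. A local sketch with plain files standing in for the two clusters' job outputs (on a real cluster you would first stream the outputs out, e.g. with hadoop fs -cat, before hashing; the file names here are invented):

```shell
# Two stand-in output files, one per cluster (hypothetical data).
printf 'a\nb\nc\n' > out_primary.txt
printf 'a\nb\nc\n' > out_dr.txt

# Hash each output and compare the digests.
sum_primary=$(md5sum out_primary.txt | cut -d' ' -f1)
sum_dr=$(md5sum out_dr.txt | cut -d' ' -f1)
if [ "$sum_primary" = "$sum_dr" ]; then
  echo "outputs consistent"
else
  echo "outputs differ"
fi
```

A mismatch flags a run to investigate, which is cheaper than keeping every HDFS file rsynced between the buildings.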
Re: Too large class path for map reduce jobs
well, if the issue is a too-long classpath, the softlink thingy will give some room to breathe as the total CP length will be much smaller. A On Thu, Oct 7, 2010 at 3:43 PM, Henning Blohm henning.bl...@zfabrik.de wrote: So that's actually another issue, right? Besides splitting the classpath into those three groups, you want the TT to create soft-links on demand to simplify the computation of the classpath string. Is that right? But it's the TT that actually starts the job VM. Why does it matter what the string actually looks like, as long as it has the right content? Thanks, Henning On Thu, 2010-10-07 at 13:22 +0800, Alejandro Abdelnur wrote: [sent too soon] The first CP shown is how the CP of a task is today. If we change it to pick up all the job JARs from the current dir, then the classpath will be much shorter (second CP shown). We can easily achieve this by soft-linking the job JARs in the work dir of the task. Alejandro On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur t...@cloudera.com wrote: Fragmentation of Hadoop classpaths is another issue: hadoop should differentiate the CP in 3: 1* client CP: what is needed to submit a job (only the nachos) 2* server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole enchilada) 3* job CP: what is needed to run a job (some of the enchilada) But I'm not trying to get into that here. What I'm suggesting is: - # Hadoop JARs: /Users/tucu/dev-apps/hadoop/conf /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar /Users/tucu/dev-apps/hadoop/bin/.. /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar .
(about 30 jars from hadoop lib/ ) /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar # Job JARs (for a job with only 2 JARs): /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_00_0/work - What I'm suggesting is that the latter group, the job JARs, be soft-linked (by the TT) into the working directory; then their classpath is just: - java-launcher.jar oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar . - Alejandro On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Alejandro, yes, it can of course be done right (sorry if my wording seemed to imply otherwise). Just saying that I think that Hadoop M/R should not go into that class loader / module separation business. It's one Job, one VM, right? So the problem is to assign just the stuff needed to let the Job do its business without becoming an obstacle. Must admit I didn't understand your proposal 2. How would that remove (e.g.) jetty libs from the job's classpath? Thanks, Henning Am Mittwoch, den 06.10.2010, 18:28 +0800 schrieb Alejandro Abdelnur: 1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child.
Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent. Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course
Re: how to set different VM parameters for mappers and reducers?
Never used those myself, always used the global one, but I knew they were there. Which Hadoop API are you using, the old one or the new one? Alejandro On Fri, Oct 8, 2010 at 3:50 AM, Vitaliy Semochkin vitaliy...@gmail.com wrote: Hi, I tried using mapred.map.child.java.opts and mapred.reduce.child.java.opts but it looks like hadoop-0.20.2 ignores them. On which version have you seen them working? Regards, Vitaliy S On Tue, Oct 5, 2010 at 5:14 PM, Alejandro Abdelnur t...@cloudera.com wrote: The following 2 properties should work: mapred.map.child.java.opts mapred.reduce.child.java.opts Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike Date: Tue, 5 Oct 2010 16:59:35 +0400 Subject: how to set different VM parameters for mappers and reducers? From: vitaliy...@gmail.com To: common-user@hadoop.apache.org Hello, I have mappers that do not need much ram but combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS Often I face an interesting problem: on the same set of data I receive a java.lang.OutOfMemoryError: Java heap space in the combiner, but it happens not all the time. What could be the cause of such behavior? My personal opinion is that I have mapred.job.reuse.jvm.num.tasks=-1 and the jvm GC doesn't always start when it should. Thanks in Advance, Vitaliy S
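On releases that honor the per-type properties discussed above, the configuration would look something like this (heap sizes are illustrative only):

```xml
<!-- mapred-site.xml (or a per-job configuration); values are made up -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

The per-type settings take the place of the single global mapred.child.java.opts, which is the right fit for the light-mappers/heavy-reducers case in the question.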
Re: Too large class path for map reduce jobs
1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child. Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent. Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course and it will not be feasible to reach the absolute minimal set of types there - but something reasonable, i.e. the hadoop core that suffices to run the job. Certainly jetty & co are not required for job execution (btw. I hacked 0.20.2 to remove anything in server/ from the classpath before setting the job class path). I would suggest to a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the core classpath (as described above).
If not set, for compatibility, preserve today's behavior. b) not getting into custom child loaders for jobs as part of hadoop M/R. It's non-trivial to get it right and feels to be beyond scope. I wouldn't mind helping btw. Thanks, Henning On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote: Hi Henning, I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938 and https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have discussion about this issue. Cheers Tom On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm henning.bl...@zfabrik.de wrote: Short update on the issue: I tried to find a way to separate class path configurations by modifying the scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the class path setting from the parent process when starting a local task, so that I do not see a way of having less on a job's classpath without modifying Hadoop. As that will present a real issue when running our jobs on Hadoop, I would like to propose to change TaskRunner so that it sets a class path specifically for M/R tasks. That class path could be defined in the scripts (as for the other processes) using a particular environment variable (e.g. HADOOP_JOB_CLASSPATH). It could default to the current VM's class path, preserving today's behavior. Is it ok to enter this as an issue? Thanks, Henning Am Freitag, den 17.09.2010, 16:01 +0000 schrieb Allen Wittenauer: On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote: When running map reduce tasks in Hadoop I run into classpath issues. Contrary to previous posts, my problem is not that I am missing classes on the Task's class path (we have a perfect solution for that) but rather that I find too many (e.g. ECJ classes or jetty). The fact that you mention: The libs in HADOOP_HOME/lib seem to contain everything needed to run anything in Hadoop which is, I assume, much more than is needed to run a map reduce task. hints that your perfect solution is to throw all your custom stuff in lib.
If so, that's a huge mistake. Use distributed cache instead.
Re: Too large class path for map reduce jobs
[sent too soon] The first CP shown is how the CP of a task is today. If we change it to pick up all the job JARs from the current dir, then the classpath will be much shorter (second CP shown). We can easily achieve this by soft-linking the job JARs in the work dir of the task. Alejandro On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur t...@cloudera.com wrote: Fragmentation of Hadoop classpaths is another issue: hadoop should differentiate the CP in 3: 1* client CP: what is needed to submit a job (only the nachos) 2* server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole enchilada) 3* job CP: what is needed to run a job (some of the enchilada) But I'm not trying to get into that here. What I'm suggesting is: - # Hadoop JARs: /Users/tucu/dev-apps/hadoop/conf /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar /Users/tucu/dev-apps/hadoop/bin/.. /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar . (about 30 jars from hadoop lib/ ) /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar # Job JARs (for a job with only 2 JARs): /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_00_0/work - What I'm suggesting is that the latter group, the job JARs, be soft-linked (by the TT) into the working directory; then their classpath is just: - java-launcher.jar oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar .
- Alejandro On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Alejandro, yes, it can of course be done right (sorry if my wording seemed to imply otherwise). Just saying that I think that Hadoop M/R should not go into that class loader / module separation business. It's one Job, one VM, right? So the problem is to assign just the stuff needed to let the Job do its business without becoming an obstacle. Must admit I didn't understand your proposal 2. How would that remove (e.g.) jetty libs from the job's classpath? Thanks, Henning Am Mittwoch, den 06.10.2010, 18:28 +0800 schrieb Alejandro Abdelnur: 1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child. Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent.
Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course and it will not be feasible to reach the absolute minimal set of types there - but something reasonable, i.e. the hadoop core that suffices to run the job. Certainly jetty & co are not required for job execution (btw. I hacked 0.20.2 to remove anything in server/ from the classpath before setting the job class path). I would suggest to a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the core classpath (as described above). If not set, for compatibility, preserve today's behavior. b) not getting into custom child loaders for jobs as part of hadoop M/R. It's non-trivial to get it right and feels to be beyond scope. I wouldn't mind helping btw. Thanks, Henning On Tue, 2010-10-05 at 15:59 -0700, Tom White
Re: is there no streaming.jar file in hadoop-0.21.0??
Edward, Yep, you should use the one from contrib/ Alejandro On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote: Thanks, Tom. Didn't expect the author of THE BOOK would answer my question. Very surprised and honored :-) Just one more question if you don't mind. I read on the Internet that in order to use Hadoop Streaming in Hadoop-0.21.0 you should go $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args (Of course I don't see any hadoop-streaming.jar in $HADOOP_HOME) But according to your reply I should go $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args I suppose the latter one is the way to go? Ed. 2010/10/5 Tom White t...@cloudera.com Hi Ed, The directory structure moved around as a result of the project splitting into three subprojects (Common, HDFS, MapReduce). The streaming jar is in mapred/contrib/streaming in the distribution. Cheers, Tom On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote: Hi, I've recently downloaded Hadoop-0.21.0. After the installation, I've noticed that there is no contrib directory that used to exist in Hadoop-0.20.2. So I was wondering if there is no hadoop-0.21.0-streaming.jar file in Hadoop-0.21.0. Anyone had any luck finding it? If the way to use streaming has changed in Hadoop-0.21.0, then please tell me how. Appreciate the help, thx. Ed.
Re: Re: Help!! The problem about Hadoop
Or you could try using MultiFileInputFormat for your MR job. http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote: 500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and trying that; a mapper is supposed to run for at least a minute optimally. And small files don't make good use of the HDFS block feature. Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ 2010/10/5 Jander 442950...@163.com: Hi Jeff, Thank you very much for your reply. I know hadoop has overhead, but is it too large in my problem? The 1GB text input has about 500 map tasks because the input is composed of little text files. And the time each map takes is from 8 seconds to 20 seconds. I use compression like conf.setCompressMapOutput(true). Thanks, Jander At 2010-10-05 16:28:55, Jeff Zhang zjf...@gmail.com wrote: Hi Jander, Hadoop has overhead compared to a single-machine solution. How many tasks did you get when you ran your hadoop job? And how much time is consumed by each map and reduce task? There are lots of tips for performance tuning of hadoop, such as compression and JVM reuse. 2010/10/5 Jander 442950...@163.com: Hi, all I am doing an application using hadoop. I take 1GB of text data as input; the results are as follows: (1) the cluster of 3 PCs: the time consumed is 1020 seconds. (2) the cluster of 4 PCs: the time is about 680 seconds. But the application before I used Hadoop took about 280 seconds, so at the speed above, I must use 8 PCs in order to have the same speed as before. Now the problem: is this correct? Jander, Thanks. -- Best Regards Jeff Zhang -- Harsh J www.harshj.com
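The numbers in the thread make Harsh's point concrete. Rough arithmetic, assuming an even spread of 1 GB over 500 files and a 64 MB block size:

```java
// Back-of-envelope math behind the small-files advice: 1 GB over ~500
// files means ~2 MB per file, far below a 64 MB HDFS block, so each map
// task gets a tiny input and per-task startup overhead dominates.
class SmallFiles {
    static long avgFileSizeMb(long totalMb, long files) {
        return totalMb / files;
    }

    // Map tasks if the input were concatenated into one file (ceil division).
    static long mapTasksIfConcatenated(long totalMb, long blockMb) {
        return (totalMb + blockMb - 1) / blockMb;
    }
}
```

So concatenation would drop the job from ~500 map tasks of a few seconds each to 16 tasks of a reasonable duration.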
Re: how to set different VM parameters for mappers and reducers?
The following 2 properties should work: mapred.map.child.java.opts mapred.reduce.child.java.opts Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera, there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike Date: Tue, 5 Oct 2010 16:59:35 +0400 Subject: how to set different VM parameters for mappers and reducers? From: vitaliy...@gmail.com To: common-user@hadoop.apache.org Hello, I have mappers that do not need much RAM but combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS Often I face an interesting problem: on the same set of data I receive java.lang.OutOfMemoryError: Java heap space in the combiner, but it doesn't happen all the time. What could be the cause of such behavior? My personal opinion is that I have mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when it should. Thanks in Advance, Vitaliy S
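For reference, the two properties Alejandro names would be set along these lines in the job configuration (the -Xmx values are examples only, and availability depends on the Hadoop version - older releases only have the combined mapred.child.java.opts):

```xml
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx256m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note that combiners run inside the map task's JVM, so the combiner heap issue in the question is governed by the map-side setting.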
Re: Classpath
Yes, you can do #1, but I wouldn't say it is practical. You can do #2 as well, as you suggest. But, IMO, the best way is copying the JARs into HDFS and using DistributedCache. A On Sun, Aug 29, 2010 at 1:29 PM, Mark static.void@gmail.com wrote: How can I add jars to Hadoop's classpath when running MapReduce jobs for the following situations? 1) Assuming that the jars are local to the nodes running the job. 2) The jars are only local to the client submitting the job. I'm assuming I can just jar up all required jars into the main job jar being submitted, but I was wondering if there was some other way. Thanks
Re: REST web service on top of Hadoop
In Oozie we are working on MR/Pig job submission over HTTP. On Thu, Jul 29, 2010 at 5:09 PM, Steve Loughran ste...@apache.org wrote: S. Venkatesh wrote: HDFS Proxy in contrib provides an HTTP interface over HDFS. It's not very RESTful but we are working on a new version which will have a REST API. AFAIK, Oozie will provide a REST API for launching MR jobs. I've done a webapp to do this (and some trickier things like creating the cluster when needed), some slides on it: http://www.slideshare.net/steve_l/long-haul-hadoop there are links on there to the various JIRA issues which have discussed the topic
Re: Hadoop multiple output files
Adam, Yes, you have to do some configuration in the JobConf before submitting the job. If the javadocs are not clear enough, check the testcase. A On Mon, Jun 28, 2010 at 8:42 PM, Adam Silberstein silbe...@yahoo-inc.com wrote: Hi Alejandro, Thanks for the tip. I tried the class, but got the following error: java.lang.IllegalArgumentException: Undefined named output 'text' at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:496) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476) at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source) at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at org.apache.hadoop.mapred.Child.main(Child.java:170) I am trying to write to an output file named 'text.' Do I need to initialize that file in some way? I tried making a directory with the name, but that didn't do anything. Thanks, Adam On 6/28/10 6:17 PM, Alejandro Abdelnur tuc...@gmail.com wrote: Check the MultipleOutputs class http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html On Mon, Jun 28, 2010 at 5:31 PM, Adam Silberstein silbe...@yahoo-inc.com wrote: Hi, I would like to run a hadoop job that writes to multiple output files. I see a class called MultipleOutputFormat that looks like what I want, but I have not been able to find any sample code showing how to use it. I see discussion of it in a JIRA, where the idea is to choose the output file based on key. If someone could point me to a sample, that would be great. Thanks, Adam
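The error in the thread is MultipleOutputs reporting that the name 'text' was never registered via addNamedOutput before job submission; creating a file or directory of that name does nothing. The registration rule can be modeled with a toy stand-in (these are illustrative classes, not Hadoop's implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of MultipleOutputs' contract: a named output must be
// declared up front in the driver; getCollector() on an undeclared
// name fails exactly like the 'Undefined named output' error above.
class NamedOutputsModel {
    private final Map<String, StringBuilder> declared = new HashMap<>();

    // ~ MultipleOutputs.addNamedOutput(conf, name, format, keyClass, valueClass)
    void addNamedOutput(String name) {
        declared.put(name, new StringBuilder());
    }

    // ~ mos.getCollector(name, reporter)
    StringBuilder getCollector(String name) {
        if (!declared.containsKey(name)) {
            throw new IllegalArgumentException("Undefined named output '" + name + "'");
        }
        return declared.get(name);
    }
}
```

So the fix is to register the 'text' output in the JobConf in the driver code, before the job runs, and only then call getCollector from the reducer.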
Re: preserve JobTracker information
Also you can configure the JobTracker to keep the RunningJob information for completed jobs (available via the Hadoop Java API). There is a config property that enables this, another that specifies the location (it can be HDFS or local), and another that specifies for how many hours you want to keep that information. HTH A On Wed, May 19, 2010 at 1:36 AM, Harsh J qwertyman...@gmail.com wrote: Preserved JobTracker history is already available at /jobhistory.jsp There is a link at the end of the /jobtracker.jsp page that leads to this. There's also free analysis to go with that! :) On Tue, May 18, 2010 at 11:00 PM, Alan Miller someb...@squareplanet.de wrote: Hi, Is there a way to preserve previous job information (Completed Jobs, Failed Jobs) when the hadoop cluster is restarted? Every time I start up my cluster (start-dfs.sh, start-mapred.sh) the JobTracker interface at http://myhost:50020/jobtracker.jsp is always empty. Thanks, Alan -- Harsh J www.harshj.com
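For reference, the three properties Alejandro describes look roughly like this (names as I recall them from 0.20-era releases; verify against your version's default configuration file, and treat the values as examples):

```xml
<property>
  <name>mapred.job.tracker.persist.jobstatus.active</name>
  <value>true</value>
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.hours</name>
  <value>24</value>
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.dir</name>
  <value>/jobtracker/jobsInfo</value>
</property>
```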
Re: MultipleTextOutputFormat splitting output into different directories.
Using MultipleOutputs ( http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html ) you can split data into different files in the output dir. After your job finishes you can move the files to different directories. The benefit of doing this is that task failures and speculative execution will also take into account all these files and your data will be consistent. If you are writing to different directories then you have to handle this by hand. A On Tue, Sep 15, 2009 at 4:22 PM, Aviad sela sela.s...@gmail.com wrote: Is anybody interested in, or has addressed, such a problem? Or does it seem to be an esoteric usage? On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela sela.s...@gmail.com wrote: I am using Hadoop 0.19.1 I attempt to split an input into multiple directories. I don't know in advance how many directories exist. I don't know in advance what the directory depth is. I expect that under each such directory a file exists with all available records having the same key permutation found in the job. If currently each reducer produces a single output, i.e. PART-0001, I would like to create as many directories as necessary, taking the following pattern: key1/key2/.../keyN/PART-0001 where the key? may have different values for each input record. A different record may result in a different path being requested: key1a/key2b/PART-0001 key1c/key2d/key3e/PART-0001 To keep it simple, during each job we may expect the same depth from each record. I assume that the input records imply that each reduce will produce several hundreds of such directories. (Indeed this strongly depends on the input record semantics). The MAP part reads a record and, following some logic, assigns a key like: KEY_A, KEY_B The MAP value is the original input line. For the reducer part I assign the IdentityReducer. However I have set:

    jobConf.setReducerClass(IdentityReducer.class);
    jobConf.setOutputFormat(MyTextOutputFormat.class);

Where MyTextOutputFormat extends MultipleTextOutputFormat, and implements:

    protected String generateFileNameForKeyValue(K key, V value, String name) {
        String[] keyParts = key.toString().split(",");
        Path finalPath = null;
        // Build the directory structure comprised of the key parts
        for (int i = 0; i < keyParts.length; i++) {
            String part = keyParts[i].trim();
            if (false == "".equals(part)) {
                if (null == finalPath) {
                    finalPath = new Path(part);
                } else {
                    finalPath = new Path(finalPath, part);
                }
            }
        } // end of for
        String fileName = generateLeafFileName(name);
        finalPath = new Path(finalPath, fileName);
        return finalPath.toString();
    } // generateFileNameForKeyValue

During execution I have seen the reduce attempts do create the following path under the output path: /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_00_0/KEY_A/KEY_B/part-0 However, the file was empty. The job fails at the end with the following exceptions found in the task log: 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block. at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations.
Source file /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_02_0/KEY_A/KEY_B/part-2 - Aborting... 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for
Re: Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem
Mikhail, You are right, please open a Jira on this. Alejandro On Wed, Jan 28, 2009 at 9:23 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Hi, We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious problem I've run into is that we're extensively using MultipleOutputs in our jobs dealing with sequence files that store Cascading's Tuples. Since Cascading 0.9, Tuples stopped being WritableComparable and implemented the generic Hadoop serialization interface and framework. However, in Hadoop 0.19, MultipleOutputs requires use of the older WritableComparable interface. Thus, trying to do something like: MultipleOutputs.addNamedOutput(conf, "output-name", MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class); mos = new MultipleOutputs(conf); ... mos.getCollector("output-name", reporter).collect(tuple1, tuple2); yields an error: java.lang.RuntimeException: java.lang.RuntimeException: class cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752) at org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252) at org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556) at org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476) at my.namespace.MyReducer.reduce(MyReducer.java:xxx) Is there any known workaround for that? Any progress going on to make MultipleOutputs use generic Hadoop serialization? -- WBR, Mikhail Yakshin
easy way to create IntelliJ project for Hadoop trunk?
With Ivy fetching JARs as part of the build, and several src dirs, creating an IDE project (IntelliJ in my case) becomes a pain. Any easy way of doing it? Thxs. Alejandro
Re: MultipleOutputFormat versus MultipleOutputs
Besides the usage pattern the key differences are: Different MultipleOutputs outputs can have different outputformats and key/value classes. With MultipleOutputFormat you can't. (and if I'm not mistaken) If using MultipleOutputFormat in a map you can't have a reduce phase. With MultipleOutputs you can. A On Thu, Aug 28, 2008 at 3:36 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I would like the reducer to output to different files based upon the value of the key. I understand that both MultipleOutputs and MultipleOutputFormat can do this. Is that correct? However, I don't understand the differences between these two classes. Can someone explain the differences and provide an example to illustrate these differences? I found a snippet of code on how to use MultipleOutputs in the documentation, but could not find an example for using MultipleOutputFormat. Thanks in advance, Shirley
Re: MultipleOutputFormat versus MultipleOutputs
Shirley, Check the javadocs or the corresponding testcases. A On Thu, Aug 28, 2008 at 6:02 PM, Shirley Cohen [EMAIL PROTECTED] wrote: Thanks, Alejandro! Do you have an example that shows how to use MultipleOutputFormat? Shirley On Aug 28, 2008, at 2:41 AM, Alejandro Abdelnur wrote: Besides the usage pattern the key differences are: Different MultipleOutputs outputs can have different outputformats and key/value classes. With MultipleOutputFormat you can't. (and if I'm not mistaken) If using MultipleOutputFormat in a map you can't have a reduce phase. With MultipleOutputs you can. A On Thu, Aug 28, 2008 at 3:36 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I would like the reducer to output to different files based upon the value of the key. I understand that both MultipleOutputs and MultipleOutputFormat can do this. Is that correct? However, I don't understand the differences between these two classes. Can someone explain the differences and provide an example to illustrate these differences? I found a snippet of code on how to use MultipleOutputs in the documentation, but could not find an example for using MultipleOutputFormat. Thanks in advance, Shirley
Re: How can I control Number of Mappers of a job?
When done, HADOOP-3387 would allow you to do that. In our implementation we can tell Hadoop the exact # of maps and it will group splits if necessary. On Fri, Aug 1, 2008 at 5:25 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Well, the only way to reliably fix the number of map tasks that I've found is by using compressed input files, as that forces hadoop to assign one and only one file to a map task ;) Andreas On Thursday 31 July 2008 21:30:33 Gopal Gandhi wrote: Thank you, finally someone has interest in my questions =) My cluster contains more than one machine. Please don't get me wrong :-). I don't want to limit the total mappers in one node (by mapred.map.tasks). What I want is to limit the total mappers for one job. The motivation is that I have 2 jobs to run at the same time; they have the same input data in Hadoop. I found that one job has to wait until the other finishes its mapping. Because the 2 jobs are submitted by 2 different people, I don't want one job to starve. So I want to limit the first job's total mappers so that the 2 jobs will be launched simultaneously. - Original Message From: Goel, Ankur [EMAIL PROTECTED] To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Wednesday, July 30, 2008 10:17:53 PM Subject: RE: How can I control Number of Mappers of a job? How big is your cluster? Assuming you are running a single node cluster, hadoop-default.xml has a parameter 'mapred.map.tasks' that is set to 2. So by default, no matter how many map tasks are calculated by the framework, only 2 map tasks will execute on a single node cluster. -Original Message- From: Gopal Gandhi [mailto:[EMAIL PROTECTED] Sent: Thursday, July 31, 2008 4:38 AM To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Subject: How can I control Number of Mappers of a job? The motivation is to control the max # of mappers of a job. For example, the input data is 246MB, divided by 64M is 4. By default there will be 4 mappers launched on the 4 blocks. What I want is to set its max # of mappers to 2, so that 2 mappers are launched first, and when they complete on the first 2 blocks, another 2 mappers start on the remaining 2 blocks. Does Hadoop provide a way?
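The block arithmetic in the question checks out; by default the number of map tasks is driven by the number of input splits (ceiling division, assuming a 64 MB split size):

```java
// 246 MB of input with 64 MB splits -> ceil(246/64) = 4 map tasks,
// which is why 4 mappers are launched by default in the example above.
// mapred.map.tasks is only a hint; the split count is what decides.
class SplitCount {
    static long defaultMapTasks(long inputMb, long splitMb) {
        return (inputMb + splitMb - 1) / splitMb;   // ceiling division
    }
}
```

Capping the *concurrent* mappers of a job below the split count is exactly what HADOOP-3387-style split grouping (or a scheduler-level limit) is for, which plain split math cannot express.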
Re: multiple Output Collectors ?
Are you seeing them in the local FS or in HDFS? In the local FS you'll see them. In HDFS you should not see any (via hadoop dfs -ls). As far as I understand, check-sum files are empty when the corresponding files are empty. A On Mon, Jul 21, 2008 at 3:11 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is there any reason that the check-sum file for a multipleoutput's collectors is empty? -k On Wed, Jul 16, 2008 at 4:26 AM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: multiple mappers mean multiple jobs, which means you'll have to run 2 jobs on the same data; with MultipleOutputs and MultipleOutputFormat you can do that in one pass from a single Mapper. On Wed, Jul 16, 2008 at 3:26 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Thank you very much. Someone suggested I could just use multiple mappers. Would that work better (easier?)? -k On Mon, Jul 14, 2008 at 11:59 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
Re: How to write one file per key as mapreduce output
On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Alejandro said: Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) I'm muddling through both http://issues.apache.org/jira/browse/HADOOP-2906 and http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense of these. I'm a little confused by the way this works but it looks like I can define a number of named outputs which looks like it enables different output formats and I can also define some of these as multi, meaning that I can write to different targets (like files). Is this correct? Exactly. A couple of questions: - I needed to pass 'null' to the collect method so as to not write the key to the file. These files are meant to be consumable chunks of content so I want to control exactly what goes into them. Does this seem normal or am I missing something? Is there a downside to passing null here? Not sure what happens if you write NULL as key or value. - What is the 'part-0' file for? I have seen this in other places in the dfs. But it seems extraneous here. It's not super critical but if I can make it go away that would be great. This is the standard output of the M/R job: whatever is written to the OutputCollector you get in the reduce() call (or in the map() call when the number of reduces is 0) - What is the purpose of the '-r-0' suffix? Perhaps it is to help with collisions? Yes, files written from a map have '-m-', files written from a reduce have '-r-'. I guess it seems strange that I can't just say the output file should be called X and have an output file called X appear. Well, you need the map/reduce mask and the task number mask to avoid collisions.
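The naming convention discussed above (a part prefix, an m/r mask for the phase, and a zero-padded task number) can be reproduced with plain string formatting. This is a sketch of the convention as described in the thread, not Hadoop's actual code:

```java
// Output files follow "part-<m|r>-<zero-padded task id>": the m/r mask
// records which phase wrote the file, and the task number guarantees
// that concurrent tasks writing into the same output directory never
// collide on a filename.
class PartName {
    static String partName(boolean fromReduce, int taskId) {
        return String.format("part-%s-%05d", fromReduce ? "r" : "m", taskId);
    }
}
```

This is why a fixed user-chosen name X cannot work directly: two reduce tasks would both try to create X.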
Re: How to write one file per key as mapreduce output
Then you may want to look at MultipleOutputFormat; it can do what you need. On Tue, Jul 29, 2008 at 10:11 PM, Lincoln Ritter [EMAIL PROTECTED] wrote: Thanks for the info! Not sure what happens if you write NULL as key or value. Looking at the code, it doesn't seem to really make a difference, and the function in question (basically 'collect') looks to be robust to null - but I may be missing something! In my case, I basically want the key to be the output filename, and the data in the files to be directly consumable by my app. Having the key show up in the file complicates things on the app side so I'm trying to avoid this. Passing null seems to work for now. -lincoln -- lincolnritter.com On Tue, Jul 29, 2008 at 9:27 AM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Alejandro said: Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) I'm muddling through both http://issues.apache.org/jira/browse/HADOOP-2906 and http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense of these. I'm a little confused by the way this works but it looks like I can define a number of named outputs which looks like it enables different output formats and I can also define some of these as multi, meaning that I can write to different targets (like files). Is this correct? Exactly. A couple of questions: - I needed to pass 'null' to the collect method so as to not write the key to the file. These files are meant to be consumable chunks of content so I want to control exactly what goes into them. Does this seem normal or am I missing something? Is there a downside to passing null here? Not sure what happens if you write NULL as key or value. - What is the 'part-0' file for? I have seen this in other places in the dfs. But it seems extraneous here. It's not super critical but if I can make it go away that would be great. This is the standard output of the M/R job: whatever is written to the OutputCollector you get in the reduce() call (or in the map() call when the number of reduces is 0) - What is the purpose of the '-r-0' suffix? Perhaps it is to help with collisions? Yes, files written from a map have '-m-', files written from a reduce have '-r-'. I guess it seems strange that I can't just say the output file should be called X and have an output file called X appear. Well, you need the map/reduce mask and the task number mask to avoid collisions.
Re: How to write one file per key as mapreduce output
Lincoln, Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) A On Wed, Jul 23, 2008 at 5:34 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Greetings, I have what I think is a pretty straight-forward, noobie question. I would like to write one file per key in the reduce (or map) phase of a mapreduce job. I have looked at the documentation for FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on how to use it/them. Can anybody give me a quick pointer? Thanks very much! -lincoln -- lincolnritter.com
[email recall] ROME 1.0 RC1 disted
Apologies for my previous email; somehow my email aliases got screwed up, this should have gone to the ROME alias. Alejandro On Wed, Jul 16, 2008 at 10:43 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: I've updated the ROME wiki, uploaded the javadocs and dists, and am tagging CVS at the moment. Would somebody double check everything is OK, links and downloads? Also we should get the JAR in a maven repo. Who has the necessary permissions to do that? We can do a final 1.0 release in a couple of weeks, if no issues arise, retagging RC1 as 1.0. Also it would give time for any subproject (Fetcher?) that wants to go 1.0 in tandem. Cheers. A
Re: multiple Output Collectors ?
multiple mappers mean multiple jobs, which means you'll have to run 2 jobs on the same data; with MultipleOutputs and MultipleOutputFormat you can do that in one pass from a single Mapper. On Wed, Jul 16, 2008 at 3:26 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Thank you very much. Someone suggested I could just use multiple mappers. Would that work better (easier?)? -k On Mon, Jul 14, 2008 at 11:59 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
ROME 1.0 RC1 disted
I've updated the ROME wiki, uploaded the javadocs and dists, and am tagging CVS at the moment. Would somebody double check everything is OK, links and downloads? Also we should get the JAR in a maven repo. Who has the necessary permissions to do that? We can do a final 1.0 release in a couple of weeks, if no issues arise, retagging RC1 as 1.0. Also it would give time for any subproject (Fetcher?) that wants to go 1.0 in tandem. Cheers. A
Re: Outputting to different paths from the same input file
You can use MultipleOutputFormat or MultipleOutputs (it has been committed to SVN a few days ago) for this. Then you can use a filter on your input dir for the next jobs so only files matching a given name/pattern are used. A On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner [EMAIL PROTECTED] wrote: We open side effect files in our map and reduce jobs to 'tee' off additional data streams. We open them in the /configure/ method and close them in the /close/ method. The /configure/ method provides access to the /JobConf/. We create our files relative to the value of conf.get("mapred.output.dir"), in the map/reduce object instances. The files end up in the conf.getOutputPath() directory, and we move them out based on knowing the shape of the file names, after the job finishes. Then after the job is finished we move all of the files to another location using a file-name-based filter to select the files to move (from the job schnitzi wrote: Okay, I've found some similar discussions in the archive, but I'm still not clear on this. I'm new to Hadoop, so 'scuse my ignorance... I'm writing a Hadoop tool to read in an event log, and I want to produce two separate outputs as a result -- one for statistics, and one for budgeting. Because the event log I'm reading in can be massive, I would like to only process it once. But the outputs will each be read by further M/R processes, and will be significantly different from each other. I've looked at MultipleOutputFormat, but it seems to just want to partition data that looks basically the same into this file or that. What's the proper way to do this? Ideally, whatever solution I implement should be atomic, in that if any one of the writes fails, neither output will be produced. AdTHANKSvance, Mark -- Jason Venner Attributor - Program the Web http://www.attributor.com/ Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
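The 'filter on the next job's input' step can be sketched as a plain filename predicate (in Hadoop the equivalent is a PathFilter set on the input format; the stats-/budget- naming below is a hypothetical example for the two streams in the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Keep only the output files matching a naming pattern, so the
// downstream job reads e.g. the "stats-*" stream and ignores the
// "budget-*" files that live in the same output directory.
class NameFilter {
    static List<String> accept(List<String> names, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String n : names) {
            if (p.matcher(n).matches()) {
                kept.add(n);
            }
        }
        return kept;
    }
}
```

Both named-output streams are written by the same task attempt, which is what preserves the atomicity schnitzi asks for: a failed attempt's files are discarded together.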
Re: Parameterized InputFormats
If your InputFormat implements Configurable you'll get access to the JobConf via the setConf(Configuration) method when Hadoop creates an instance of your class. On Mon, Jun 30, 2008 at 11:20 PM, Nathan Marz [EMAIL PROTECTED] wrote: Hello, Are there any plans to change the JobConf API so that it takes an instance of an InputFormat rather than the InputFormat class? I am finding the inability to properly parameterize my InputFormats to be very restricting. What's the reasoning behind having the class as a parameter rather than an instance? -Nathan Marz
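The injection pattern Alejandro describes - the framework instantiates the class by name, then hands it the configuration via setConf() - can be mimicked with stand-in types (illustrative interfaces, not Hadoop's actual classes):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hadoop's Configurable contract: if the reflectively
// created object implements it, the framework calls setConf() with the
// job configuration before anything else, which is how a class-based
// API still lets you parameterize instances.
interface ConfigurableLike {
    void setConf(Map<String, String> conf);
}

class MyInputFormat implements ConfigurableLike {
    Map<String, String> conf;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;   // parameters arrive here instead of a constructor
    }
}

class FrameworkSketch {
    static Object newInstance(Class<?> cls, Map<String, String> conf) {
        try {
            Object o = cls.getDeclaredConstructor().newInstance();
            if (o instanceof ConfigurableLike) {
                ((ConfigurableLike) o).setConf(conf);
            }
            return o;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

So the InputFormat reads its parameters from the JobConf it receives, rather than from constructor arguments.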
Re: multiple Output Collectors ?
check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
Re: Realtime Map Reduce = Supercomputing for the Masses?
Yes, you would have to do it with classloaders (not 'hello world' but not 'rocket science' either). You'll be limited in using native libraries, even if you use classloaders properly, as native libs can be loaded only once. You will have to ensure you get rid of the task classloader once the task is over (thus removing all singleton stuff that may be in it). You will have to put in place a security manager for the code running out of the task classloader. You'll end up doing something similar to servlet containers' webapp classloading model, with the extra burden of hot-loading for each task run. Which in the end may have a similar overhead to bootstrapping a JVM for the task; this should be measured to see what the time delta is and whether it is worth the effort. A On Mon, Jun 2, 2008 at 3:53 PM, Steve Loughran [EMAIL PROTECTED] wrote: Christophe Taton wrote: Actually Hadoop could be made more friendly to such realtime Map/Reduce jobs. For instance, we could consider running all tasks inside the task tracker JVM as separate threads, which could be implemented as another personality of the TaskRunner. I have been looking into this a couple of weeks ago... Would you be interested in such a feature? Why does that have benefits? So that you can share stuff via local data structures? Because you'd better be sharing classloaders if you are going to play that game. And that is very hard to get right (to the extent that I don't think any Apache project other than Felix does it well)
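The per-task classloader lifecycle Alejandro outlines (create per task, drop afterwards so singletons go with it) looks roughly like this with a plain URLClassLoader. This is a minimal sketch only; a real implementation would add child-first delegation, a security manager, and the native-library caveats noted above:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Minimal per-task classloader lifecycle: build a loader over the
// task's JARs, install it as the thread context loader while the task
// runs, then restore and close it so the task's classes (and any
// statics/singletons they hold) become eligible for unloading.
class TaskRunnerSketch {
    static void runTask(URL[] taskJars, Runnable task) {
        try (URLClassLoader taskLoader =
                 new URLClassLoader(taskJars, TaskRunnerSketch.class.getClassLoader())) {
            Thread t = Thread.currentThread();
            ClassLoader previous = t.getContextClassLoader();
            t.setContextClassLoader(taskLoader);
            try {
                task.run();   // the task resolves classes via taskLoader
            } finally {
                t.setContextClassLoader(previous);
            }
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
        // no references to taskLoader survive this method
    }
}
```

Whether this beats forking a child JVM per task is exactly the overhead question raised above, and would need measuring.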
Re: How do people keep their client configurations in sync with the remote cluster(s)
I'll keep an eye on that issue. I think a key problem right now is that clients take their config from the configuration file in the core jar and from their own settings. You need to keep the settings in sync somehow, and have to take what the core jar provides. Yes, exactly, that is the problem: there is no way to have default values set by the cluster unless you redistribute a hadoop-site.xml to all your clients. Regarding HADOOP-3287, there is some resistance to such a fix, so if you feel it makes sense please comment on it. Alejandro
Re: How do people keep their client configurations in sync with the remote cluster(s)
That would be an option too. On Mon, May 19, 2008 at 10:26 PM, Ted Dunning [EMAIL PROTECTED] wrote: I think it would be better to have the client retrieve the default configuration. Not all configuration settings are simple overrides. Some are read-modify-write operations. This also fits the current code better. On 5/19/08 6:38 AM, Steve Loughran [EMAIL PROTECTED] wrote: Alejandro Abdelnur wrote: A while ago I opened an issue related to this topic: https://issues.apache.org/jira/browse/HADOOP-3287 My take is a little different: when submitting a job, the clients should only send to the jobtracker the configuration they explicitly set; the jobtracker would then apply the defaults for all the other configuration. By doing this the cluster admin can modify things at any time, and changes to default values take effect for all clients without having to distribute a new configuration to all of them. IMO this approach was the intended behavior at some point, according to the Configuration.write(OutputStream) javadocs: 'Writes non-default properties in this configuration.' But as the write method also writes default properties, this is not happening. I'll keep an eye on that issue. I think a key problem right now is that clients take their config from the configuration file in the core jar and from their own settings. You need to keep the settings in sync somehow, and have to take what the core jar provides. This approach would also get rid of the separate mechanism (zookeeper, svn, etc.) to keep clients synchronized, as there would be no need to do so. zookeeper and similar are there to keep the cluster alive; they shouldn't be needed for clients, which should only need the URL of a jobtracker to talk to.
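To make the HADOOP-3287 proposal concrete, here is a minimal plain-Java sketch (not Hadoop's actual Configuration class; property names are illustrative): the client submits only its explicitly-set properties, and the jobtracker overlays them on the cluster's defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed behavior: defaults live on the cluster side,
// the client ships only its explicit overrides.
public class ConfigMergeDemo {

    // Overlay client overrides on top of cluster defaults; explicit settings win.
    public static Map<String, String> merge(Map<String, String> clusterDefaults,
                                            Map<String, String> clientOverrides) {
        Map<String, String> effective = new HashMap<>(clusterDefaults);
        effective.putAll(clientOverrides);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("mapred.reduce.tasks", "10");
        defaults.put("io.sort.mb", "100");

        Map<String, String> client = new HashMap<>();
        client.put("mapred.reduce.tasks", "2");   // the only explicit setting

        Map<String, String> effective = merge(defaults, client);
        System.out.println(effective.get("mapred.reduce.tasks")); // 2
        System.out.println(effective.get("io.sort.mb"));          // 100
    }
}
```

With this scheme an admin can change a cluster default at any time and every client picks it up on the next job submission, with no hadoop-site.xml redistribution.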
Re: Can reducer output multiple files?
Look at the MultipleOutputFormat class under mapred/lib. Or you can look at the HADOOP-3149 patch, which introduces a MultipleOutputs class; you could get this working easily on released Hadoop versions. HTH Alejandro On Thu, May 8, 2008 at 2:05 PM, Jeremy Chow [EMAIL PROTECTED] wrote: Hi list, I want to output my reduced results into several files according to the types the results belong to. How can I implement this? Thx, Jeremy -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. http://coderplay.javaeye.com
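For reference, a minimal sketch of the MultipleOutputFormat route with the org.apache.hadoop.mapred.lib classes (using the reduce key as the output file name is an assumption for the example; any function of key/value works):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: route each reduce output record to a file named after its type,
// assuming the reducer emits the record's "type" as the key (illustrative).
public class TypePartitionedOutput
        extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
        // one output directory per type, e.g. typeA/part-00000, typeB/part-00000
        return key.toString() + "/" + name;
    }
}
```

The driver then just sets `conf.setOutputFormat(TypePartitionedOutput.class)` on the JobConf.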
Re: small sized files - how to use MultiInputFileFormat
In principle I agree with you, Ted. However, in many cases we have multiple large jobs generating outputs that are not that big, and as a result the number of small files (significantly smaller than a Hadoop block) is large; the default splitting logic there triggers jobs with a large number of tasks that inefficiently clog the cluster. The MultipleFileInputFormat helps to avoid that, but it has a problem: if the file set is a mix of small and large files, the splits are uneven, and it does not split single large files. To address this we've written our own InputFormat (for Text and SequenceFiles) that collapses small files into splits of up to the block size and cuts big files at the block size. It has a twist: you can specify either the max number of maps you want or the block size to use for the splits. When a particular split contains multiple small files, the suggested hosts for the split are ordered by which host has most of the data for those files. We still have to do some cleanup on the code and then we'll submit it to Hadoop. A On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning [EMAIL PROTECTED] wrote: Small files are a bad idea for high throughput no matter what technology you use. The issue is that you need a larger file in order to avoid disk seeks. On 3/28/08 7:34 PM, Jason Curtes [EMAIL PROTECTED] wrote: Hello, I have been trying to run Hadoop on a set of small text files, not larger than 10k each. The total input size is 15MB. If I try to run the example word count application, it takes about 2000 seconds, more than half an hour, to complete. However, if I merge all the files into one large file, it takes much less than a minute. I think using MultiInputFileFormat can be helpful at this point. However, the API documentation is not really helpful.
I wonder if MultiInputFileFormat can really solve my problem, and if so, can you suggest a reference on how to use it, or a few lines to be added to the word count example to make things clearer? Thanks in advance. Regards, Jason Curtes
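The splitting policy Alejandro describes (pack small files together up to the block size, cut large files at block boundaries) can be sketched in plain Java; this is not the actual InputFormat code from the thread, just the packing logic, and the 100-unit block size in the usage is an arbitrary example value:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-packing policy: each split is a list of
// (fileIndex, offset, length) triples totaling at most blockSize bytes.
public class SplitPacker {

    public static List<List<long[]>> pack(long[] fileLengths, long blockSize) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentSize = 0;
        for (int f = 0; f < fileLengths.length; f++) {
            long offset = 0;
            long remaining = fileLengths[f];
            while (remaining > 0) {
                // Take as much of this file as fits in the current split.
                long chunk = Math.min(remaining, blockSize - currentSize);
                current.add(new long[] { f, offset, chunk });
                currentSize += chunk;
                offset += chunk;
                remaining -= chunk;
                if (currentSize == blockSize) {   // split is full, start a new one
                    splits.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
            }
        }
        if (!current.isEmpty()) splits.add(current);   // last, partial split
        return splits;
    }

    public static void main(String[] args) {
        // Two small files and one large file, block size 100:
        // -> splits of 100, 100, and 30 bytes instead of one split per file.
        List<List<long[]>> splits = pack(new long[] { 10, 20, 200 }, 100);
        System.out.println(splits.size());   // 3
    }
}
```

A real InputFormat would additionally compute the suggested hosts per split from the block locations of the files it packed, as described in the thread.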