Re: Loading Data to HDFS
I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values

I believe Sumit's setup is a cluster within a firewall, with the hadoop client machines also within the firewall; the only way to access the cluster is to ssh from outside to one of the hadoop client machines and then submit your jobs. These hadoop client machines are often referred to as gateway machines.

On Tue, Oct 30, 2012 at 4:10 AM, Bertrand Dechoux decho...@gmail.com wrote:

I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values:
* the amount of data you want to put on hadoop
* hadoop bandwidth with regards to local storage (read/write)
* bandwidth between where your data are stored and where the hadoop cluster is

For the latter, for big volumes, physically moving the volumes is a viable solution. It will depend on your constraints of course: budget, speed... Bertrand

On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh sumi...@yahoo.com wrote:

Hi Bertrand, By physically moving the data do you mean that the data volume is connected to the gateway machine and the data is loaded from the local copy using copyFromLocal? Thanks, Sumit

From: Bertrand Dechoux decho...@gmail.com To: common-user@hadoop.apache.org; sumit ghosh sumi...@yahoo.com Sent: Tuesday, 30 October 2012 3:46 PM Subject: Re: Loading Data to HDFS

It might sound like a deprecated way but can't you move the data physically? From what I understand, it is one shot and not streaming, so it could be a good method if you have the access, of course. Regards Bertrand

On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh sumi...@yahoo.com wrote:

Hi, I have data on a remote machine accessible over ssh. I have Hadoop CDH4 installed on RHEL. I am planning to load quite a few petabytes of data onto HDFS. Which will be the fastest method to use, and are there any projects around Hadoop which can be used as well? I cannot install Hadoop-Client on the remote machine. Have a great day ahead! Sumit.

--- Here I am attaching my previous discussion on CDH-user to avoid duplication. ---

On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com wrote:

in addition to jarcec's suggestions, you could use httpfs. then you'd only need to poke a single host:port in your firewall as all the traffic goes thru it. thx Alejandro

On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com wrote:

Hi Sumit, there are plenty of ways to achieve that. Please find my feedback below:

Does Sqoop support loading flat files to HDFS? No, Sqoop supports only moving data from external database and warehouse systems. Copying files is not supported at the moment.

Can I use distcp? No. Distcp can be used only to copy data between HDFS filesystems.

How do we use the core-site.xml file on the remote machine to use copyFromLocal? You can install the hadoop binaries on your machine (with no hadoop services running) and use the hadoop binary to upload data. The installation procedure is described in the CDH4 installation guide [1] (follow the client installation).

Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]? Jarcec

Links: 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS

On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:

Hi, I have data on a remote machine accessible over ssh. What is the fastest way to load data onto HDFS? Does Sqoop support loading flat files to HDFS? Can I use distcp? How do we use the core-site.xml file on the remote machine to use copyFromLocal? Which will be the best to use, and are there any other open source projects around Hadoop which can be used as well? Have a great day ahead! Sumit
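For the copyFromLocal route Jarcec describes, here is a minimal Java sketch of the same operation; the namenode address and paths are placeholders, and the client only needs the cluster's core-site.xml on its classpath:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml settings from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // Equivalent of 'hadoop fs -copyFromLocal <src> <dst>'
            fs.copyFromLocalFile(new Path("/local/data/part-0001"),
                                 new Path("/user/sumit/incoming/part-0001"));
            fs.close();
        }
    }

Looping this over many local files amortizes the connection setup; for petabyte volumes the network link remains the bottleneck, as Bertrand notes.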
Re: HBase multi-user security
Tony, Sorry, missed this email earlier. This seems more appropriate for the HBase aliases. Thx.

On Wed, Jul 11, 2012 at 8:41 AM, Tony Dean tony.d...@sas.com wrote:

Hi, Looking at this further, it appears that when HBaseRPC is creating a proxy (e.g., SecureRpcEngine), it injects the current user: User.getCurrent(), which by default is the cached Kerberos TGT (the kinit'ed user, using the hadoop-user-kerberos JAAS context). Since the server proxy always uses User.getCurrent(), how can an application inject the user it wants to use for authorization checks on the peer (region server)? And since SecureHadoopUser is a static class, how can you have more than 1 active user in the same application? What you have works for a single-user application like the hbase shell, but what about a multi-user application? Am I missing something? Thanks! -Tony
Re: hadoop security API (repost)
Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop.

* Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS).
* Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST
* Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab
* Use UGI.doAs() to create JobClient/Filesystem instances as the user you want to do something on behalf of
* Keep in mind that all the users you do something on behalf of should be valid Unix users in the cluster
* If those users need direct access to the cluster, they'll also have to be defined in the KDC user database.

Hope this helps. Thx

On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:

Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony

-----Original Message----- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost)

Hi Tony, I am currently working on this to access HDFS securely and programmatically. What I have found so far may help, even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, the hadoop library will locate it automatically if the name of the ticket cache corresponds to the default location. On Linux it is /tmp/krb5cc_<uid-number>. For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan', meaning you can impersonate ivan from the hdfs linux user:

    hdfs@mitkdc:~$ klist
    Ticket cache: FILE:/tmp/krb5cc_10003
    Default principal: i...@hadoop.lan
    Valid starting       Expires              Service principal
    02/07/2012 13:59     02/07/2012 23:59     krbtgt/hadoop@hadoop.lan
        renew until 03/07/2012 13:59

Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. In my tests, I only use HDFS, and here is a snippet of code to access a secure hdfs cluster assuming the previous TGT (ivan's impersonation):

    val conf: HdfsConfiguration = new HdfsConfiguration()
    conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos")
    conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true")
    conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)
    UserGroupInformation.setConfiguration(conf)
    val fs = FileSystem.get(new URI(hdfsUri), conf)

This 'fs' is a handle to access hdfs securely as user 'ivan' even though ivan does not appear in the hadoop client code.

Anyway, I also see two other options:
* Setting the KRB5CCNAME environment variable to point to the right ticket cache file
* Specifying the keytab file you want to use via the UserGroupInformation singleton API: UserGroupInformation.loginUserFromKeytab(user, keytabFile)

If you want to understand the auth process and the different options to login, I guess you need to have a look at the UserGroupInformation.java source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class HadoopConfiguration at line 347 is of major interest in our case. Another point is that I did not find any easy way to prompt the user for a password at runtime using the actual hadoop API. It appears to be somehow hardcoded in the UserGroupInformation singleton. I guess it would be nice to have a new function to give the UserGroupInformation an authenticated 'Subject', which could override all default configurations. If someone has better ideas it would be nice to discuss them as well. BR, Ivan

2012/7/1 Tony Dean tony.d...@sas.com

Hi, The security documentation specifies how to test a secure cluster by using kinit, thus adding the Kerberos principal TGT to the ticket cache which the hadoop client code uses to acquire service tickets for use in the cluster. What if I created an application that used the hadoop API to communicate with hdfs and/or mapred protocols; is there a programmatic way to inform hadoop to use a particular Kerberos principal name with a keytab that contains its password
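Putting Alejandro's recipe and Ivan's keytab option together, a minimal sketch might look like the following; the principal, keytab path, and end-user name are placeholders, and the hadoop.proxyuser.* settings must already be in place on the cluster:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            UserGroupInformation.setConfiguration(conf);
            // Log the server app in from its keytab
            UserGroupInformation.loginUserFromKeytab(
                    "myserveruser/myserverhost@EXAMPLE.COM",
                    "/etc/security/keytabs/myserveruser.keytab");
            // Impersonate the end user on top of the server's own login
            UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                    "enduser", UserGroupInformation.getLoginUser());
            FileSystem fs = proxy.doAs(new PrivilegedExceptionAction<FileSystem>() {
                public FileSystem run() throws Exception {
                    // Created inside doAs, so it acts as 'enduser'
                    return FileSystem.get(new Configuration());
                }
            });
            System.out.println(fs.getHomeDirectory());
        }
    }

Creating one proxy UGI per request is what makes this work for a multi-user server app: the single keytab login is shared, while each doAs() block carries its own identity.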
Re: hadoop security API (repost)
On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote:

Alejandro, Thanks for the reply. My intent is to also be able to scan/get/put hbase tables under a specified identity as well. What options do I have to perform the same multi-tenant authorization for these operations? I have posted this to the hbase users distribution list as well, but thought you might have insight. Since hbase security authentication is so dependent upon hadoop, it would be nice if your suggestion worked for hbase as well. Getting back to your suggestion... when configuring hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the ugi.doAs() privileged call and host2 is the hadoop namenode?

host1 in that case.

Also, as another option, is there not a way for an application to pass hadoop/hbase authentication the name of a Kerberos principal to use? In this case, no proxy, just execute as the designated user.

You could do that, but that means your app will have to have keytabs for all the users it wants to act as. Proxyuser will be much easier to manage. Maybe getting proxyuser support into hbase if it is not there yet.

Thanks. -Tony
Re: kerberos mapreduce question
If you provision your user/group information via LDAP to all your nodes it is not a nightmare. On Thu, Jun 7, 2012 at 7:49 AM, Koert Kuipers ko...@tresata.com wrote: thanks for your answer. so at a large place like say yahoo, or facebook, assuming they use kerberos, every analyst that uses hive has an account on every node of their large cluster? sounds like an admin nightmare to me On Thu, Jun 7, 2012 at 10:46 AM, Mapred Learn mapred.le...@gmail.com wrote: Yes, User submitting a job needs to have an account on all the nodes. Sent from my iPhone On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote: with kerberos enabled a mapreduce job runs as the user that submitted it. does this mean the user that submitted the job needs to have linux accounts on all machines on the cluster? how does mapreduce do this (run jobs as the user)? do the tasktrackers use secure impersonation to run-as the user? thanks! koert -- Alejandro
Re: How can I configure oozie to submit different workflows from different users ?
Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro

PS: please follow up this thread on oozie-us...@incubator.apache.org

On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote:

Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3. I have the hadoop user as the oozie user. I have set the following things:

conf/oozie-site.xml:

    <property>
      <name>oozie.services.ext</name>
      <value>org.apache.oozie.service.HadoopAccessorService</value>
      <description>To add/replace services defined in 'oozie.services' with custom implementations. Class names must be separated by commas.</description>
    </property>

conf/core-site.xml:

    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
    </property>

When I submit jobs as the hadoop user, I am able to run them properly. But when I submit the same workflow as a different user (who can submit plain MR jobs to my hadoop cluster), I get the following error:

    JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
    Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
        at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
        ... 11 more
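Concretely, Alejandro's fix on 0.20.205 might look like the following core-site.xml entries; the hostname and group names below are illustrative placeholders — the hosts value is where the Oozie server runs, and the groups value must contain a group that users like kumar belong to:

    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>oozie-host.mydomain.com</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>users,analysts</value>
    </property>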
Re: How can I configure oozie to submit different workflows from different users ?
Multiple values are comma separated. Keep in mind that valid values for proxyuser groups, as the property name states, are GROUPS, not USERS. thxs. Alejandro

On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.com wrote:

How can I specify multiple users/groups for the proxyuser setting? Can I give comma separated values in these settings? Thanks, Praveenesh
Re: How do I synchronize Hadoop jobs?
You can use Oozie for that: you can write a workflow job that forks A and B and then joins before C. Thanks. Alejandro

On Wed, Feb 15, 2012 at 11:23 AM, W.P. McNeill bill...@gmail.com wrote:

Say I have two Hadoop jobs, A and B, that can be run in parallel. I have another job, C, that takes the output of both A and B as input. I want to run A and B at the same time, wait until both have finished, and then launch C. What is the best way to do this? I know the answer if I've got a single Java client program that launches A, B, and C. But what if I don't have the option to launch all of them from a single Java program? (Say I've got a much more complicated system with many steps happening between A-B and C.) How do I synchronize between jobs and make sure there are no race conditions, etc.? Is this what Zookeeper is for?
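A hedged sketch of such a fork/join workflow definition; all names are placeholders and the map-reduce action bodies (job-tracker, name-node, configuration) are elided:

    <workflow-app name="fork-join-demo" xmlns="uri:oozie:workflow:0.1">
      <start to="fork-ab"/>
      <fork name="fork-ab">
        <path start="job-a"/>
        <path start="job-b"/>
      </fork>
      <action name="job-a">
        <map-reduce>
          <!-- job-tracker, name-node and configuration elements go here -->
        </map-reduce>
        <ok to="join-ab"/>
        <error to="fail"/>
      </action>
      <action name="job-b">
        <map-reduce>
          <!-- job-tracker, name-node and configuration elements go here -->
        </map-reduce>
        <ok to="join-ab"/>
        <error to="fail"/>
      </action>
      <join name="join-ab" to="job-c"/>
      <action name="job-c">
        <map-reduce>
          <!-- job C consumes the output of A and B -->
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Workflow failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>

The join node is what guarantees C starts only after both A and B have completed, with no client-side coordination code.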
Re: Hybrid Hadoop with fork/join ?
Rob, Hadoop has a way to run map tasks in multithreaded mode; look for MultithreadedMapRunner (old API) / MultithreadedMapper (new API). Thanks. Alejandro.

On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart robstewar...@gmail.com wrote:

Hi, I'm investigating the feasibility of a hybrid approach to parallel programming, by fusing together the concurrent Java fork/join libraries with Hadoop: MapReduce, a paradigm suited for scalable execution over distributed memory, plus fork/join, a paradigm for optimal multi-threaded shared-memory execution. I am aware that, to a degree, Hadoop can take advantage of multiple cores on a compute node, by setting mapred.tasktracker.map/reduce.tasks.maximum to be more than one. As the Hadoop: The Definitive Guide book states, each task runs in a separate JVM. So setting maximum map tasks to 3 will allow the possibility of 3 JVMs running on the operating system, right?

As an alternative to this approach, I am looking to introduce fork/join into map tasks, and perhaps also throttle down the maximum number of map tasks per node to 1. I would implement a RecordReader that produces map tasks of more coarse granularity, and fork/join would subdivide each map task using multiple threads, where the output of the join would be the output of the map task. The motivation for this is that threads are a lot more lightweight than initializing JVMs, which as the Hadoop book points out takes a second or so each time, unless mapred.job.reuse.jvm.num.tasks is set higher than 1 (the default is 1). So for example, opting for small granular maps for a particular job and a given input generates 1,000 map tasks, which will mean that 1,000 JVMs will be created on the Hadoop cluster within the execution of the program. If I were to write a bespoke RecordReader to increase the granularity of each map so that only 100 map tasks are created, the embedded fork/join code would further split each map across 10 threads, to evaluate within one JVM concurrently. I would expect this latter approach of multiple threads to have better performance than the clunkier multiple-JVM approach.

Has such a hybrid approach combining MapReduce with fork/join been investigated for feasibility, or similar studies published? Are there any important technical limitations that I should consider? Any thoughts on the proposed multi-threaded distributed-shared-memory architecture are much appreciated from the Hadoop community! -- Rob Stewart
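A hedged sketch of the new-API variant Alejandro mentions; WordMapper is a hypothetical stand-in for the real map logic:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedExample {
        // Placeholder mapper; it must be thread-safe, since multiple threads run it concurrently
        public static class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(value, new LongWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "threaded-maps");
            job.setMapperClass(MultithreadedMapper.class);             // the wrapper
            MultithreadedMapper.setMapperClass(job, WordMapper.class); // the real work
            MultithreadedMapper.setNumberOfThreads(job, 10);           // 10 threads per map task
            // ... input/output paths and formats as usual, then job.waitForCompletion(true)
        }
    }

This yields one JVM per map task with an internal thread pool, which is close to the coarse-map-plus-threads design Rob describes, without a custom RecordReader.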
Re: Any samples of how to write a custom FileSystem
Steven, You could also look at HttpFSFileSystem in the hadoop-httpfs module; it is quite simple and self-contained. Cheers. Alejandro

On Tue, Jan 31, 2012 at 8:37 PM, Harsh J ha...@cloudera.com wrote:

To write a custom filesystem, extend the FileSystem class. Depending on the scheme it is supposed to serve, creating an entry fs.scheme.impl in core-site.xml and then loading it via the FileSystem.get(URI, conf) API will auto-load it for you, provided the URI you pass has the right scheme. So supposing I have an FS scheme foo, I'd register it in core-site.xml as:

    <name>fs.foo.impl</name>
    <value>com.myorg.FooFileSystem</value>

And then with a URI object from a path that goes foo:///mypath/, I'd do FileSystem.get(URI, new Configuration()) to get a FooFileSystem instance. Similarly, if you want to overload the local filesystem with your class, override the fs.file.impl config with your derivative class, and that'd be used in your configuration-loaded programs in future. A good, not-so-complex impl. example to look at generally would be the S3FileSystem.

On Wed, Feb 1, 2012 at 9:41 AM, Steve Lewis lordjoe2...@gmail.com wrote:

Specifically how do I register a custom FileSystem - any sample code? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com

-- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
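Harsh's registration step, as a runnable sketch; com.myorg.FooFileSystem is his hypothetical class name, and setting the property programmatically is equivalent to the core-site.xml entry:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FooFsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Equivalent to the fs.foo.impl entry in core-site.xml
            conf.set("fs.foo.impl", "com.myorg.FooFileSystem");
            // The "foo" scheme in the URI selects FooFileSystem
            FileSystem fs = FileSystem.get(URI.create("foo:///mypath/"), conf);
            System.out.println(fs.getUri());
        }
    }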
Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised
Bill, In addition you must call DistributedCache.createSymlink(configuration); that should do it. Thxs. Alejandro

On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote:

I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes, with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client, in the run() function, while I'm creating the job I first call:

    fs.copyFromLocalFile(new Path("gate-app.zip"), new Path("/tmp/gate-app.zip"));

As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call:

    DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"), configuration);

I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error:

    Exception in thread "main" java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app

It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html. The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
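Putting Alejandro's fix together with Bill's code, a minimal sketch of the client-side setup (URI and paths as in Bill's mail):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheArchiveSetup {
        public static void configure(Configuration conf) throws Exception {
            // Without this call the #gate-app fragment is treated as part of the file name
            DistributedCache.createSymlink(conf);
            // The archive is unpacked on each node; a 'gate-app' symlink appears
            // in every task's working directory
            DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"), conf);
        }
    }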
Re: Timer jobs
[moving common-user@ to BCC]

Oozie is not HA yet, but it would be relatively easy to make it so; it was designed with that in mind, and we even did a prototype. Oozie consists of 2 services: a SQL database to store the Oozie jobs' state, and a servlet container where the Oozie app proper runs. The solution for HA for the database is left to the database; this means you'll have to get an HA DB. The solution for HA for the Oozie app is deploying the servlet container with the Oozie app on more than one box (2 or 3) and fronting them with an HTTP load-balancer. The missing part is that the Oozie lock-service is currently an in-memory implementation. This should be replaced with a Zookeeper implementation. Zookeeper could run externally or internally in all Oozie servers. This is what was prototyped long ago. Thanks. Alejandro

On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote:

If I get you right you are asking about installing Oozie as a distributed and/or HA cluster?! In that case I am not familiar with an out-of-the-box solution by Oozie. But I think you can make up a solution of your own, for example: install Oozie on two servers on the same partition, synchronized by DRBD. You can trigger a failover using Linux Heartbeat and that way maintain a virtual IP.

On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote:

Hi. Thanks a lot for pointing me to Oozie. I have looked a little bit into Oozie and it seems like the component triggering jobs is called the Coordinator Application. But I see nowhere that this Coordinator Application doesn't just run on a single machine, and that it will therefore not trigger anything if this machine is down. Can you confirm that the Coordinator Application role is distributed in a distributed Oozie setup, so that jobs get triggered even if one or two machines are down? Regards, Per Steffensen

Ronen Itkin skrev:

Hi. Try to use Oozie for job coordination and work flows.

On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote:

Hi. I use hadoop for a MapReduce job in my system. I would like to have the job run every 5th minute. Is there any distributed timer job facility in hadoop? Of course I could set up a timer in an external timer framework (CRON or something like that) that invokes the MapReduce job. But CRON only runs on one particular machine, so if that machine goes down my job will not be triggered. Then I could set up the timer on all or many machines, but I would not like the job to be run in more than one instance every 5th minute, so then the timer jobs would need to coordinate who is actually starting the job this time and all the rest would just do nothing. I guess I could come up with a solution to that, e.g. writing some lock stuff using HDFS files or by using ZooKeeper. But I would really like it if someone had already solved the problem and provided some kind of distributed timer framework running in a cluster, so that I could just register a timer job with the cluster and then be sure that it is invoked every 5th minute, no matter if one or two particular machines in the cluster are down. Any suggestions are very welcome. Regards, Per Steffensen

-- Ronen Itkin, Taykey | www.taykey.com
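For the every-5-minutes trigger itself, a hedged sketch of an Oozie coordinator definition; the app name, dates, and workflow path are placeholders, and the referenced workflow would wrap the actual MapReduce job:

    <coordinator-app name="every-5-minutes" frequency="${coord:minutes(5)}"
                     start="2011-09-01T00:00Z" end="2012-09-01T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.1">
      <action>
        <workflow>
          <!-- path to the workflow that wraps the MapReduce job -->
          <app-path>hdfs://namenode:8020/apps/my-workflow</app-path>
        </workflow>
      </action>
    </coordinator-app>

The coordinator runs inside the Oozie server, so its availability is exactly the Oozie HA question discussed above, not a per-machine cron dependency.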
Re: Oozie monitoring
Avi, For Oozie related questions, please subscribe and use the oozie-...@incubator.apache.org alias. Thanks. Alejandro On Tue, Aug 30, 2011 at 2:28 AM, Avi Vaknin avivakni...@gmail.com wrote: Hi All, First, I really enjoy writing you and I'm thankful for your help. I have Oozie installed on dedicated server and I want to monitor it using my Nagios server. Can you please suggest some important parameters to monitor or other tips regarding this issue. Thanks. Avi
Re: Oozie on the namenode server
[Moving thread to Oozie aliases and hadoop's alias to BCC]

Avi, Currently you can have a cold-standby solution. An Oozie setup consists of 2 systems: a SQL DB (storing all Oozie jobs' state) and a servlet container (running Oozie proper). You need your DB to be highly available. You need to have a second installation of Oozie, configured identically to the running one, on standby. If the running Oozie fails, you start the second one. The idea (we once did a prototype using Zookeeper) is to eventually support a multithreaded Oozie server, which would give both high availability and horizontal scalability. Still, you'll need a highly available DB. Thanks. Alejandro

On Mon, Aug 29, 2011 at 4:57 AM, Avi Vaknin avivakni...@gmail.com wrote:

Hi, I have one more question about Oozie: is there any suggested high-availability solution for Oozie? Something like an active/passive or active/active solution. Thanks. Avi

-----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Monday, August 29, 2011 2:20 PM To: common-user@hadoop.apache.org Subject: Re: Oozie on the namenode server

Avi, Should be OK to do so at this stage. However, keep monitoring loads on the machines to determine when to move things out to their own dedicated boxes.

On Mon, Aug 29, 2011 at 3:20 PM, Avi Vaknin avivakni...@gmail.com wrote:

Hi All, I want to install Oozie and I wonder if it is OK to install it on the namenode, or maybe I need to install a dedicated server for it. I have a very small Hadoop cluster (4 datanodes + namenode + secondary namenode). Thanks for your help. Avi

-- Harsh J
Hoop into 0.23 release
Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://issues.apache.org/jira/browse/MAPREDUCE-2863. Most if not all of the remaining work is not development but Java package renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and final packaging. The current blocker is deciding on the Maven sub-modules organization, https://issues.apache.org/jira/browse/HADOOP-7560. I'll drive the discussion for HADOOP-7560 and as soon as we have an agreement there I'll refactor Hoop accordingly. Does this sound reasonable? Thanks. Alejandro
Re: Hoop into 0.23 release
Nicholas, Thanks. In order to move forward with the integration of Hoop I need consensus on:
* https://issues.apache.org/jira/browse/HDFS-2178?focusedCommentId=13089106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13089106
* https://issues.apache.org/jira/browse/HADOOP-7560

Thanks. Alejandro

On Mon, Aug 22, 2011 at 3:42 PM, Tsz Wo Sze szets...@yahoo.com wrote:

+1. I believe HDFS-2178 is very close to being committed. Great work Alejandro! Nicholas
Re: Multiple Output Formats
Roger, Or you can take a look at Hadoop's MultipleOutputs class. Thanks. Alejandro

On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu pire...@crs4.it wrote:

On July 26, 2011 06:11:33 PM Roger Chen wrote:

Hi all, I am attempting to implement MultipleOutputFormat to write data to multiple files dependent on the output keys and values. Can somebody provide a working example of how to implement this in Hadoop 0.20.2? Thanks!

Hello, I have a working sample here: http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java It extends FileOutputFormat. -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: EXT :Re: Problem running a Hadoop program with external libraries
Why don't you put your native library in HDFS and use the DistributedCache to make it available to the tasks? For example: copy 'foo.so' to 'hdfs://localhost:8020/tmp/foo.so', then add it to the job's distributed cache:

    DistributedCache.addCacheFile(new URI("hdfs://localhost:8020/tmp/foo.so#foo.so"), jobConf);
    DistributedCache.createSymlink(jobConf);

Note that the #foo.so will create a softlink in the task running dir, and the task running dir is on the library load path (LD_LIBRARY_PATH) of your task. Alejandro

On Sat, Mar 5, 2011 at 7:19 AM, Lance Norskog goks...@gmail.com wrote:

I have never heard of putting a native-code shared library in a Java jar. I doubt that it works. But it's a cool idea! A Unix binary program loads shared libraries from the paths given in the environment variable LD_LIBRARY_PATH. This has to be set to the directory with the OpenCV .so file when you start Java. Lance

On Mar 4, 2011, at 2:13 PM, Brian Bockelman wrote:

Hi, Check your kernel's overcommit settings. This will prevent the JVM from allocating memory even when there's free RAM. Brian

On Mar 4, 2011, at 3:55 PM, Ratner, Alan S (IS) wrote:

Aaron, Thanks for the rapid responses.
* ulimit -u unlimited is in .bashrc.
* HADOOP_HEAPSIZE is set to 4000 MB in hadoop-env.sh
* mapred.child.ulimit is set to 2048000 in mapred-site.xml
* mapred.child.java.opts is set to -Xmx1536m in mapred-site.xml

I take it you are suggesting that I change the java.opts setting to:

    <name>mapred.child.java.opts</name>
    <value>-Xmx1536m -Djava.library.path=/path/to/native/libs</value>

Alan Ratner Northrop Grumman Information Systems Manager of Large-Scale Computing 9020 Junction Drive Annapolis Junction, MD 20701 (410) 707-8605 (cell)

From: Aaron Kimball [mailto:akimbal...@gmail.com] Sent: Friday, March 04, 2011 4:30 PM To: common-user@hadoop.apache.org Cc: Ratner, Alan S (IS) Subject: EXT :Re: Problem running a Hadoop program with external libraries

Actually, I just misread your email and missed the difference between your 2nd and 3rd attempts. Are you enforcing min/max JVM heap sizes on your tasks? Are you enforcing a ulimit (either through your shell configuration, or through Hadoop itself)? I don't know where these cannot allocate memory errors are coming from. If they're from the OS, could it be because it needs to fork() and momentarily exceed the ulimit before loading the native libs? - Aaron

On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com wrote:

I don't know if putting native-code .so files inside a jar works. A native-code .so is not classloaded in the same way .class files are. So the correct .so files probably need to exist in some physical directory on the worker machines. You may want to double-check that the correct directory on the worker machines is identified in the JVM property 'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can be manipulated in the Hadoop configuration setting mapred.child.java.opts (include '-Djava.library.path=/path/to/native/libs' in the string there.) Also, if you added your .so files to a directory that is already used by the tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may need to restart the tasktracker instance for it to take effect. (This is true of .jar files in the $HADOOP_HOME/lib directory; I don't know if it is true for native libs as well.)
- Aaron On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) alan.rat...@ngc.commailto:alan.rat...@ngc.com wrote: We are having difficulties running a Hadoop program making calls to external libraries - but this occurs only when we run the program on our cluster and not from within Eclipse where we are apparently running in Hadoop's standalone mode. This program invokes the Open Computer Vision libraries (OpenCV and JavaCV). (I don't think there is a problem with our cluster - we've run many Hadoop jobs on it without difficulty.) 1. I normally use Eclipse to create jar files for our Hadoop programs but I inadvertently hit the run as Java application button and the program ran fine, reading the input file from the eclipse workspace rather than HDFS and writing the output file to the same place. Hadoop's output appears below. (This occurred on the master Hadoop server.) 2. I then exported from Eclipse a runnable jar which extracted required libraries into the generated jar - presumably producing a jar file that incorporated all the required library functions. (The plain jar file for this program is 17 kB while the runnable jar is 30MB.) When I try to run this on my Hadoop cluster (including my master and slave servers) the program reports that it is unable to locate libopencv_highgui.so.2.2: cannot open shared object file: No such file or directory. Now, in addition to this library being incorporated inside
Re: Accessing Hadoop using Kerberos
The Hadoop UserGroupInformation class looks into the OS Kerberos cache for the user to use. I guess you'd have to do the Kerberos authentication making sure the Kerberos cache is used, and then do a UserGroupInformation.loginUser(). After doing that, the Hadoop libraries should work. Alejandro

On Thu, Jan 13, 2011 at 2:19 PM, Muruga Prabu M murugapra...@gmail.com wrote:

Yeah kinit is working fine. My scenario is as follows. I have a java program written by me to upload and download files to the HDFS cluster. I want to perform Kerberos authentication from the Java code itself. I have acquired the TicketGrantingTicket from the Authentication Server and a Service Ticket from the Ticket Granting Server. Now how do I authenticate myself with hadoop by sending the service ticket received from the Ticket Granting Server? Regards, Pikini

On Wed, Jan 12, 2011 at 10:13 PM, Alejandro Abdelnur t...@cloudera.com wrote:

If you kinit-ed successfully you are done. The hadoop libraries will do the trick of authenticating the user against Hadoop. Alejandro

On Thu, Jan 13, 2011 at 12:46 PM, Muruga Prabu M murugapra...@gmail.com wrote:

Hi, I have a Java program to upload and download files from the HDFS. I am using Hadoop with Kerberos. I am able to get a TGT (from the Authentication Server) and a service ticket (from the Ticket Granting Server) using Kerberos successfully. But after that I don't know how to authenticate myself with Hadoop. Can anyone please help me with how to do it? Regards, Pikini
Re: how to run jobs every 30 minutes?
Ed, Actually Oozie is quite different from Cascading.

* Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs.
* Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG (workflow jobs) and has timer+data dependency triggers (coordinator jobs).

Regards. Alejandro

On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote:

Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed
Re: how to run jobs every 30 minutes?
Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
Re: HDFS Rsync process??
The other approach, if the DR cluster is idle or has enough excess capacity, would be running all the jobs on the input data in both clusters and performing checksums on the outputs to ensure everything is consistent. And you could take advantage of this to distribute ad hoc queries between the 2 clusters. Alejandro

On Tue, Nov 30, 2010 at 6:51 PM, Steve Loughran ste...@apache.org wrote:

On 30/11/10 03:59, hadoopman wrote:

We have two Hadoop clusters in two separate buildings. Both clusters are loading the same data from the same sources (the second cluster is for DR). We're looking at how we can recover the primary cluster and catch it back up again, as new data will continue to feed into the DR cluster. It's been suggested we use rsync across the network; however, my concern is that the amount of data we would have to copy over would take several days (at a minimum) to sync them, even with our dual bonded 1-gig network cards. I'm curious if anyone has come up with a solution short of just loading the source logs into HDFS. Is there a way to even rsync two clusters and get them in sync? Been googling around. Haven't found anything of substance yet.

You don't need all the files in the cluster in sync, as a lot of them are intermediate and transient files. Instead use distcp to copy source files to the two clusters; this runs across the machines in the cluster and is also designed to work across hadoop versions, with some limitations.
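For reference, a distcp invocation of the kind Steve describes might look like this; the namenode addresses are placeholders, and copying between different Hadoop versions is typically done by reading over hftp from the source cluster's HTTP port:

    hadoop distcp hdfs://nn-primary:8020/incoming/logs hdfs://nn-dr:8020/incoming/logs

    # across versions: run on the destination cluster, read via hftp
    hadoop distcp hftp://nn-primary:50070/incoming/logs hdfs://nn-dr:8020/incoming/logs

Because distcp is itself a MapReduce job, the copy parallelizes across the cluster instead of funneling through a single rsync host.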
Re: how to set different VM parameters for mappers and reducers?
Never used those myself, always used the global one, but I knew they were there. Which Hadoop API are you using, the old one or the new one? Alejandro

On Fri, Oct 8, 2010 at 3:50 AM, Vitaliy Semochkin vitaliy...@gmail.com wrote:

Hi, I tried using mapred.map.child.java.opts and mapred.reduce.child.java.opts but it looks like hadoop-0.20.2 ignores them. On which version have you seen it working? Regards, Vitaliy S
Re: is there no streaming.jar file in hadoop-0.21.0??
Edward, Yep, you should use the one from contrib/. Alejandro

On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote:

Thanks, Tom. Didn't expect the author of THE BOOK would answer my question. Very surprised and honored :-) Just one more question if you don't mind. I read on the Internet that in order to use Hadoop Streaming in Hadoop-0.21.0 you should go

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args

(Of course I don't see any hadoop-streaming.jar in $HADOOP_HOME.) But according to your reply I should go

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args

I suppose the latter one is the way to go? Ed.

2010/10/5 Tom White t...@cloudera.com

Hi Ed, The directory structure moved around as a result of the project splitting into three subprojects (Common, HDFS, MapReduce). The streaming jar is in mapred/contrib/streaming in the distribution. Cheers, Tom

On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote:

Hi, I've recently downloaded Hadoop-0.21.0. After the installation, I noticed that there is no contrib directory that used to exist in Hadoop-0.20.2. So I was wondering if there is no hadoop-0.21.0-streaming.jar file in Hadoop-0.21.0. Anyone had any luck finding it? If the way to use streaming has changed in Hadoop-0.21.0, then please tell me how. Appreciate the help, thx. Ed.
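A complete invocation, for reference; the input/output paths and the identity-style commands are placeholders, and the jar path follows Tom's note for 0.21.0:

    $HADOOP_HOME/bin/hadoop jar \
        $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar \
        -input /user/ed/input \
        -output /user/ed/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc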
Re: Re: Help!!The problem about Hadoop
Or you could try using MultiFileInputFormat for your MR job. http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro

On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote:

500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and retrying, as a mapper is supposed to run for at least a minute, optimally. And small files don't make good use of the HDFS block feature. Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

2010/10/5 Jander 442950...@163.com:

Hi Jeff, Thank you very much for your reply. I know hadoop has overhead, but is it too large in my problem? The 1GB text input has about 500 map tasks because the input is composed of small text files. And the time each map takes is from 8 seconds to 20 seconds. I use compression like conf.setCompressMapOutput(true). Thanks, Jander

At 2010-10-05 16:28:55, Jeff Zhang zjf...@gmail.com wrote:

Hi Jander, Hadoop has overhead compared to a single-machine solution. How many tasks did you get when you ran your hadoop job? And what is the time consumed by each map and reduce task? There are lots of tips for performance tuning of hadoop, such as compression and JVM reuse.

2010/10/5 Jander 442950...@163.com:

Hi all, I am doing an application using hadoop. I take 1GB of text data as input, and the results are as follows: (1) a cluster of 3 PCs: the time consumed is 1020 seconds. (2) a cluster of 4 PCs: the time is about 680 seconds. But the application before I used Hadoop took about 280 seconds, so at the speed above I must use 8 PCs in order to have the same speed as before. Now the question: is this correct? Jander, Thanks.

-- Best Regards Jeff Zhang

-- Harsh J www.harshj.com
Re: how to set different VM parameters for mappers and reducers?
The following 2 properties should work:

    mapred.map.child.java.opts
    mapred.reduce.child.java.opts

Alejandro

On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote:

Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike

Date: Tue, 5 Oct 2010 16:59:35 +0400 Subject: how to set different VM parameters for mappers and reducers? From: vitaliy...@gmail.com To: common-user@hadoop.apache.org

Hello, I have mappers that do not need much RAM, but combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS: Often I face an interesting problem: on the same set of data I receive java.lang.OutOfMemoryError: Java heap space in the combiner, but it happens not all the time. What could be the cause of such behavior? My personal opinion is that I have mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when it should. Thanks in Advance, Vitaliy S
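In releases where the split properties are supported (as noted later in the thread, 0.20.2 ignores them), the mapred-site.xml entries might look like this; the heap sizes are illustrative:

    <property>
      <name>mapred.map.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>
    <property>
      <name>mapred.reduce.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>

This matches Vitaliy's workload: small heaps for the many mappers, a large heap only where the combiners and reducers run.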
Re: Classpath
Yes, you can do #1, but I wouldn't say it is practical. You can do #2 as well, as you suggest. But, IMO, the best way is copying the JARs into HDFS and using the DistributedCache. A

On Sun, Aug 29, 2010 at 1:29 PM, Mark static.void@gmail.com wrote:

How can I add jars to Hadoop's classpath when running MapReduce jobs for the following situations?

1) Assuming that the jars are local to the nodes that are running the job.
2) The jars are only local to the client submitting the job.

I'm assuming I can just jar up all required jars into the main job jar being submitted, but I was wondering if there was some other way. Thanks
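A hedged sketch of the DistributedCache route Alejandro recommends; the HDFS jar path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class ClasspathSetup {
        public static void configure(Configuration conf) throws Exception {
            // The jar must already be in HDFS; it is then added to every task's classpath
            DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);
        }
    }

Compared with bundling everything into the job jar, this lets many jobs share one copy of a dependency and avoids re-uploading it on every submission.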
Re: REST web service on top of Hadoop
In Oozie we are working on MR/Pig job submission over HTTP.

On Thu, Jul 29, 2010 at 5:09 PM, Steve Loughran ste...@apache.org wrote:

S. Venkatesh wrote:

HDFS Proxy in contrib provides an HTTP interface over HDFS. It's not very RESTful, but we are working on a new version which will have a REST API. AFAIK, Oozie will provide a REST API for launching MR jobs.

I've done a webapp to do this (and some trickier things like creating the cluster when needed); some slides on it: http://www.slideshare.net/steve_l/long-haul-hadoop There are links on there to the various JIRA issues which have discussed the topic.
Re: Hadoop multiple output files
Adam, Yes, you have to do some configuration in the jobconf before submitting the job. If the javadocs are not clear enough, check the testcase. A

On Mon, Jun 28, 2010 at 8:42 PM, Adam Silberstein silbe...@yahoo-inc.com wrote:

Hi Alejandro, Thanks for the tip. I tried the class, but got the following error:

    java.lang.IllegalArgumentException: Undefined named output 'text'
        at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:496)
        at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476)
        at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source)
        at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

I am trying to write to an output file named 'text'. Do I need to initialize that file in some way? I tried making a directory with the name, but that didn't do anything. Thanks, Adam

On 6/28/10 6:17 PM, Alejandro Abdelnur tuc...@gmail.com wrote:

Check the MultipleOutputs class http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

On Mon, Jun 28, 2010 at 5:31 PM, Adam Silberstein silbe...@yahoo-inc.com wrote:

Hi, I would like to run a hadoop job that writes to multiple output files. I see a class called MultipleOutputFormat that looks like what I want, but I have not been able to find any sample code showing how to use it. I see discussion of it in a JIRA, where the idea is to choose the output file based on key. If someone could point me to a sample, that would be great. Thanks, Adam
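A sketch of the missing configuration plus the reducer side; the key/value types are illustrative, and declaring the named output before submission is exactly what the "Undefined named output 'text'" error is complaining about:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class NamedOutputExample {
        public static class SortReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            private MultipleOutputs mos;

            public void configure(JobConf conf) {
                mos = new MultipleOutputs(conf);
            }

            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, Text> out, Reporter reporter)
                    throws IOException {
                while (values.hasNext()) {
                    // Writes to files named text-r-* in the job output directory
                    mos.getCollector("text", reporter).collect(key, values.next());
                }
            }

            public void close() throws IOException {
                mos.close(); // flushes the named-output writers
            }
        }

        public static void configureJob(JobConf conf) {
            // This is the step that was missing: declare the named output up front
            MultipleOutputs.addNamedOutput(conf, "text",
                    TextOutputFormat.class, Text.class, Text.class);
        }
    }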
Re: preserve JobTracker information
Also, you can configure the JobTracker to keep the RunningJob information for completed jobs (available via the Hadoop Java API). There is a config property that enables this, another that specifies the location (it can be HDFS or local), and another that specifies for how many hours you want to keep that information. HTH A On Wed, May 19, 2010 at 1:36 AM, Harsh J qwertyman...@gmail.com wrote: Preserved JobTracker history is already available at /jobhistory.jsp There is a link at the end of the /jobtracker.jsp page that leads to this. There's also free analysis to go with that! :) On Tue, May 18, 2010 at 11:00 PM, Alan Miller someb...@squareplanet.de wrote: Hi, Is there a way to preserve previous job information (Completed Jobs, Failed Jobs) when the Hadoop cluster is restarted? Every time I start up my cluster (start-dfs.sh, start-mapred.sh) the JobTracker interface at http://myhost:50020/jobtracker.jsp is always empty. Thanks, Alan -- Harsh J www.harshj.com
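If memory serves (property names from the 0.20.x mapred-default.xml; verify against your release), these are the three job-status persistence settings being described. They belong in the JobTracker's mapred-site.xml; the Configuration calls below are only a compact way to document the names and value types:

import org.apache.hadoop.conf.Configuration;

public class JobStatusPersistence {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Turn on persisting job status beyond JobTracker restarts...
    conf.setBoolean("mapred.job.tracker.persist.jobstatus.active", true);
    // ...keep it for this many hours...
    conf.setInt("mapred.job.tracker.persist.jobstatus.hours", 24);
    // ...under this directory (an HDFS or local path).
    conf.set("mapred.job.tracker.persist.jobstatus.dir", "/jobtracker/jobsInfo");
  }
}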
Re: MultipleTextOutputFormat splitting output into different directories.
Using MultipleOutputs ( http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html ) you can split data into different files in the output dir. After your job finishes you can move the files to different directories. The benefit of doing this is that task failures and speculative execution will also take all these files into account, and your data will be consistent. If you are writing to different directories then you have to handle this by hand. A On Tue, Sep 15, 2009 at 4:22 PM, Aviad sela sela.s...@gmail.com wrote: Is anybody interested, or has anybody addressed such a problem? Or does it seem to be an esoteric usage? On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela sela.s...@gmail.com wrote: I am using Hadoop 0.19.1. I attempt to split an input into multiple directories. I don't know in advance how many directories exist. I don't know in advance what the directory depth is. I expect that under each such directory a file exists with all available records having the same key permutation found in the job. If currently each reducer produces a single output, i.e. PART-0001, I would like to create as many directories as necessary, following the pattern key1/key2/.../keyN/PART-0001, where the keys may have different values for each input record. Different records may result in different requested paths: key1a/key2b/PART-0001, key1c/key2d/key3e/PART-0001. To keep it simple, during each job we may expect the same depth from each record. I assume that the input records imply that each reducer will produce several hundreds of such directories. (Indeed, this strongly depends on the input record semantics.) The MAP part reads a record and, following some logic, assigns a key like KEY_A, KEY_B. The MAP value is the original input line. For the reducer part I assign the IdentityReducer. I have set:

jobConf.setReducerClass(IdentityReducer.class);
jobConf.setOutputFormat(MyTextOutputFormat.class);

where MyTextOutputFormat extends MultipleTextOutputFormat and implements:

protected String generateFileNameForKeyValue(K key, V value, String name) {
  String keyParts[] = key.toString().split(",");
  Path finalPath = null;
  // Build the directory structure comprised of the key parts
  for (int i = 0; i < keyParts.length; i++) {
    String part = keyParts[i].trim();
    if (false == "".equals(part)) {
      if (null == finalPath) {
        finalPath = new Path(part);
      } else {
        finalPath = new Path(finalPath, part);
      }
    }
  } // end of for
  String fileName = generateLeafFileName(name);
  finalPath = new Path(finalPath, fileName);
  return finalPath.toString();
} // generateFileNameForKeyValue

During execution I have seen the reduce attempts create the following path under the output path: /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_00_0/KEY_A/KEY_B/part-0 However, the file was empty. The job fails at the end with the following exceptions found in the task log:

2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383
2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null
2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_02_0/KEY_A/KEY_B/part-2 - Aborting...
2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for
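Closing the loop on Alejandro's suggestion, a hedged sketch of the write-then-move pattern: the job writes per-key named outputs into its normal output directory, and only after it succeeds does the client relocate them, so speculative or failed task attempts never leave partial files in the final directories. The paths and the keyA* naming convention are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RelocateNamedOutputs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outputDir = new Path("/user/hadoop/jobout"); // hypothetical job output dir
    // MultipleOutputs writes files named <namedOutput>-r-<part> here.
    FileStatus[] parts = fs.globStatus(new Path(outputDir, "keyA*"));
    if (parts != null) {
      for (FileStatus part : parts) {
        Path target = new Path("/data/keyA", part.getPath().getName()); // hypothetical
        fs.mkdirs(target.getParent());
        fs.rename(part.getPath(), target);
      }
    }
  }
}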