Re: Loading Data to HDFS

2012-10-30 Thread Alejandro Abdelnur
 I don't know what you mean by gateway but in order to have a rough idea of
 the time needed you need 3 values

I believe Sumit's setup is a cluster within a firewall, with the Hadoop
client machines also inside the firewall; the only way to reach the
cluster is to ssh from outside to one of the Hadoop client machines and
then submit your jobs from there. These Hadoop client machines are
often referred to as gateway machines.


On Tue, Oct 30, 2012 at 4:10 AM, Bertrand Dechoux decho...@gmail.com wrote:
 I don't know what you mean by gateway but in order to have a rough idea of
 the time needed you need 3 values
 * amount of data you want to put on hadoop
 * hadoop bandwidth with regards to local storage (read/write)
 * bandwidth between where your data are stored and where the hadoop cluster
 is

 For the latter, for big volumes, physically moving the volumes is a viable
 solution.
 It will depend on your constraints of course: budget, speed...

 Bertrand

 On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh sumi...@yahoo.com wrote:

 Hi Bertrand,

 By physically moving the data, do you mean that the data volume is
 connected to the gateway machine and the data is loaded from the local copy
 using copyFromLocal?

 Thanks,
 Sumit


 
 From: Bertrand Dechoux decho...@gmail.com
 To: common-user@hadoop.apache.org; sumit ghosh sumi...@yahoo.com
 Sent: Tuesday, 30 October 2012 3:46 PM
 Subject: Re: Loading Data to HDFS

 It might sound like a deprecated way, but can't you move the data
 physically?
 From what I understand, it is one shot and not streaming, so it could be a
 good method if you have the access, of course.

 Regards

 Bertrand

 On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh sumi...@yahoo.com wrote:

  Hi,
 
  I have data on a remote machine accessible over ssh. I have Hadoop CDH4
  installed on RHEL. I am planning to load quite a few petabytes of data
  onto HDFS.
 
  Which will be the fastest method to use and are there any projects around
  Hadoop which can be used as well?
 
 
  I cannot install Hadoop-Client on the remote machine.
 
  Have a great Day Ahead!
  Sumit.
 
 
  ---
  Here I am attaching my previous discussion on CDH-user to avoid
  duplication.
  ---
  On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com
  wrote:
  in addition to jarcec's suggestions, you could use httpfs. then you'd
 only
  need to poke a single host:port in your firewall as all the traffic goes
  thru it.
  thx
  Alejandro
 
  On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com
  wrote:
   Hi Sumit,
   there are plenty of ways to achieve that. Please find my feedback
  below:
  
   Does Sqoop support loading flat files to HDFS?
  
   No, Sqoop only supports moving data from external database and
  warehouse systems. Copying plain files is not supported at the moment.
  
   Can use distcp?
  
   No. Distcp can be used only to copy data between HDFS filesystems.
  
   How do we use the core-site.xml file on the remote machine to use
   copyFromLocal?
  
   Yes, you can install the Hadoop binaries on your machine (with no Hadoop
  services running) and use the hadoop binary to upload data. The installation
  procedure is described in the CDH4 installation guide [1] (follow the client
  installation).
  
   Another way that I can think of is leveraging WebHDFS [2] or maybe
  hdfs-fuse [3]?
  
   Jarcec
  
   Links:
   1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
   2:
 
 https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
   3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
  
   On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
  
  
   Hi,
  
   I have data on a remote machine accessible over ssh. What is the
   fastest way to load data onto HDFS?
  
   Does Sqoop support loading flat files to HDFS?
   Can use distcp?
   How do we use the core-site.xml file on the remote machine to use
   copyFromLocal?
  
   Which will be the best to use and are there any other open source
  projects
   around Hadoop which can be used as well?
   Have a great Day Ahead!
   Sumit




 --
 Bertrand Dechoux




 --
 Bertrand Dechoux



-- 
Alejandro


Re: HBase multi-user security

2012-07-25 Thread Alejandro Abdelnur
Tony,

Sorry, missed this email earlier. This seems more appropriate for the Hbase
aliases.

Thx.

On Wed, Jul 11, 2012 at 8:41 AM, Tony Dean tony.d...@sas.com wrote:

 Hi,

 Looking at this further, it appears that when HBaseRPC is creating a proxy
 (e.g., SecureRpcEngine), it injects the current user:
 User.getCurrent() which by default is the cached Kerberos TGT (kinit'ed
 user - using the hadoop-user-kerberos JAAS context).

 Since the server proxy always uses User.getCurrent(), how can an
 application inject the user it wants to use for authorization checks on the
 peer (region server)?

 And since SecureHadoopUser is a static class, how can you have more than 1
 active user in the same application?

 What you have works for a single user application like the hbase shell,
 but what about a multi-user application?

 Am I missing something?

 Thanks!

 -Tony

 -Original Message-
 From: Alejandro Abdelnur [mailto:t...@cloudera.com]
 Sent: Monday, July 02, 2012 11:40 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Tony,

  If you are doing a server app that interacts with the cluster on behalf of
  different users (like Oozie, as you mentioned in your email), then you
  should use the proxyuser capabilities of Hadoop.

  * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this
  requires 2 property settings, HOSTS and GROUPS).
  * Run your server app as MYSERVERUSER and have a Kerberos principal
  MYSERVERUSER/MYSERVERHOST
  * Initialize your server app by loading the MYSERVERUSER/MYSERVERHOST keytab
  * Use UGI.doAs() to create JobClient/FileSystem instances as the user
  you want to act on behalf of
  * Keep in mind that all the users you act on behalf of must be valid
  Unix users in the cluster
  * If those users need direct access to the cluster, they'll have to be
  defined in the KDC user database as well.

 Hope this helps.

 Thx

 On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
  Yes, but this will not work in a multi-tenant environment.  I need to be
 able to create a Kerberos TGT per execution thread.
 
  I was hoping through JAAS that I could inject the name of the current
 principal and authenticate against it.  I'm sure there is a best practice
 for hadoop/hbase client API authentication, just not sure what it is.
 
  Thank you for your comment.  The solution may well be associated with
 the UserGroupInformation class.  Hopefully, other ideas will come from this
 thread.
 
  Thanks.
 
  -Tony
 
  -Original Message-
  From: Ivan Frain [mailto:ivan.fr...@gmail.com]
  Sent: Monday, July 02, 2012 8:14 AM
  To: common-user@hadoop.apache.org
  Subject: Re: hadoop security API (repost)
 
  Hi Tony,
 
   I am currently working on this to access HDFS securely and
  programmatically.
  What I have found so far may help even if I am not 100% sure this is the
 right way to proceed.
 
   If you have already obtained a TGT from the kinit command, the hadoop
  library will locate it automatically if the name of the ticket cache
  corresponds to the default location. On Linux it is located at
  /tmp/krb5cc_uid-number.
 
  For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
  meaning you can impersonate ivan from hdfs linux user:
  --
  hdfs@mitkdc:~$ klist
  Ticket cache: FILE:/tmp/krb5cc_10003
  Default principal: i...@hadoop.lan
 
   Valid starting     Expires            Service principal
  02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew
  until 03/07/2012 13:59
  ---
 
   Then, you just have to set the right security options in your hadoop
  client in java and the identity will be i...@hadoop.lan for our example.
  In my tests, I only use HDFS; here is a snippet of code to get access to a
  secure hdfs cluster assuming the previous TGT (ivan's impersonation):
 
  
    val conf: HdfsConfiguration = new HdfsConfiguration()
    conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
      "kerberos")
    conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION,
      "true")
    conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

    UserGroupInformation.setConfiguration(conf)

    val fs = FileSystem.get(new URI(hdfsUri), conf)
  
 
   Using this 'fs' handle you can access hdfs securely as user 'ivan', even
  though ivan does not appear anywhere in the hadoop client code.
 
  Anyway, I also see two other options:
* Setting the KRB5CCNAME environment variable to point to the right
 ticketCache file
* Specifying the keytab file you want to use from the
 UserGroupInformation singleton API:
  UserGroupInformation.loginUserFromKeytab(user, keytabFile)
 
  If you want to understand the auth process and the different options to
 login, I guess you need to have a look

Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
Tony,

If you are doing a server app that interacts with the cluster on
behalf of different users (like Oozie, as you mentioned in your
email), then you should use the proxyuser capabilities of Hadoop.

* Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml
(this requires 2 property settings, HOSTS and GROUPS).
* Run your server app as MYSERVERUSER and have a Kerberos principal
MYSERVERUSER/MYSERVERHOST
* Initialize your server app by loading the MYSERVERUSER/MYSERVERHOST keytab
* Use UGI.doAs() to create JobClient/FileSystem instances as the user
you want to act on behalf of (see the sketch below)
* Keep in mind that all the users you act on behalf of must be valid
Unix users in the cluster
* If those users need direct access to the cluster, they'll have to be
defined in the KDC user database as well.
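
A minimal sketch of the doAs pattern (the principal, keytab path and
end-user name below are made-up examples, not taken from your setup):

  import java.security.PrivilegedExceptionAction;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.security.UserGroupInformation;

  public class ProxyUserSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      UserGroupInformation.setConfiguration(conf);
      // Log in the server principal from its keytab (third bullet above).
      UserGroupInformation.loginUserFromKeytab(
          "myserveruser/myserverhost@EXAMPLE.COM", "/etc/security/myserver.keytab");
      // Impersonate the end user and do the HDFS work as that user (fourth bullet).
      UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
          "enduser", UserGroupInformation.getLoginUser());
      boolean exists = proxyUgi.doAs(new PrivilegedExceptionAction<Boolean>() {
        public Boolean run() throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          return fs.exists(new Path("/user/enduser"));
        }
      });
      System.out.println(exists);
    }
  }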

Hope this helps.

Thx

On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be able 
 to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice for 
 hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

  I am currently working on this to access HDFS securely and programmatically.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

  If you have already obtained a TGT from the kinit command, the hadoop library
  will locate it automatically if the name of the ticket cache corresponds to
  the default location. On Linux it is located at /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

  Valid starting     Expires            Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew until 
 03/07/2012 13:59
 ---

  Then, you just have to set the right security options in your hadoop client
  in java and the identity will be i...@hadoop.lan for our example. In my
  tests, I only use HDFS; here is a snippet of code to get access to a secure
  hdfs cluster assuming the previous TGT (ivan's impersonation):

 
   val conf: HdfsConfiguration = new HdfsConfiguration()
   conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
     "kerberos")
   conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION,
     "true")
   conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

   UserGroupInformation.setConfiguration(conf)

   val fs = FileSystem.get(new URI(hdfsUri), conf)
 

  Using this 'fs' handle you can access hdfs securely as user 'ivan', even if
  ivan does not appear in the hadoop client code.

 Anyway, I also see two other options:
   * Setting the KRB5CCNAME environment variable to point to the right 
 ticketCache file
   * Specifying the keytab file you want to use from the UserGroupInformation 
 singleton API:
 UserGroupInformation.loginUserFromKeytab(user, keytabFile)

  If you want to understand the auth process and the different options to
  log in, I guess you need to have a look at the UserGroupInformation.java
  source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class
  HadoopConfiguration at line 347 is of major interest in our case.

  Another point is that I did not find any easy way to prompt the user for a
  password at runtime using the current hadoop API. It appears to be somehow
  hardcoded in the UserGroupInformation singleton. I guess it would be nice to
  have a new function that gives the UserGroupInformation an authenticated
  'Subject' which could override all default configurations. If someone has
  better ideas it would be nice to discuss them as well.


 BR,
 Ivan

 2012/7/1 Tony Dean tony.d...@sas.com

 Hi,

  The security documentation specifies how to test a secure cluster by
  using kinit and thus adding the Kerberos principal TGT to the ticket
  cache, which the hadoop client code uses to acquire service tickets
  for use in the cluster.
  If I created an application that used the hadoop API to
  communicate with the hdfs and/or mapred protocols, is there a programmatic
  way to inform hadoop to use a particular Kerberos principal name with
  a keytab that contains its password 

Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote:
 Alejandro,

 Thanks for the reply.  My intent is to also be able to scan/get/put hbase 
 tables under a specified identity as well.  What options do I have to perform 
 the same multi-tenant  authorization for these operations?  I have posted 
 this to hbase users distribution list as well, but thought you might have 
 insight.  Since hbase security authentication is so dependent upon hadoop, it 
 would be nice if your suggestion worked for hbase as well.

 Getting back to your suggestion... when configuring 
 hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the 
 ugi.doAs() privileged call and host2 is the hadoop namenode?


host1 in that case.

  Also, as another option, is there a way for an application to pass
  hadoop/hbase authentication the name of a Kerberos principal to use?  In this
  case, no proxy, just execute as the designated user.

You could do that, but that means your app will have to have keytabs
for all the users it wants to act as. Proxyuser will be much easier to
manage. Maybe look at getting proxyuser support into hbase if it is not
there yet.


 Thanks.

 -Tony

 -Original Message-
 From: Alejandro Abdelnur [mailto:t...@cloudera.com]
 Sent: Monday, July 02, 2012 11:40 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Tony,

  If you are doing a server app that interacts with the cluster on behalf of
  different users (like Oozie, as you mentioned in your email), then you should
  use the proxyuser capabilities of Hadoop.

  * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this
  requires 2 property settings, HOSTS and GROUPS).
  * Run your server app as MYSERVERUSER and have a Kerberos principal
  MYSERVERUSER/MYSERVERHOST
  * Initialize your server app by loading the MYSERVERUSER/MYSERVERHOST keytab
  * Use UGI.doAs() to create JobClient/FileSystem instances as the user
  you want to act on behalf of
  * Keep in mind that all the users you act on behalf of must be valid
  Unix users in the cluster
  * If those users need direct access to the cluster, they'll have to be
  defined in the KDC user database as well.

 Hope this helps.

 Thx

 On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote:
 Yes, but this will not work in a multi-tenant environment.  I need to be 
 able to create a Kerberos TGT per execution thread.

 I was hoping through JAAS that I could inject the name of the current 
 principal and authenticate against it.  I'm sure there is a best practice 
 for hadoop/hbase client API authentication, just not sure what it is.

 Thank you for your comment.  The solution may well be associated with the 
 UserGroupInformation class.  Hopefully, other ideas will come from this 
 thread.

 Thanks.

 -Tony

 -Original Message-
 From: Ivan Frain [mailto:ivan.fr...@gmail.com]
 Sent: Monday, July 02, 2012 8:14 AM
 To: common-user@hadoop.apache.org
 Subject: Re: hadoop security API (repost)

 Hi Tony,

  I am currently working on this to access HDFS securely and programmatically.
 What I have found so far may help even if I am not 100% sure this is the 
 right way to proceed.

  If you have already obtained a TGT from the kinit command, the hadoop library
  will locate it automatically if the name of the ticket cache corresponds
  to the default location. On Linux it is located at /tmp/krb5cc_uid-number.

 For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan'
 meaning you can impersonate ivan from hdfs linux user:
 --
 hdfs@mitkdc:~$ klist
 Ticket cache: FILE:/tmp/krb5cc_10003
 Default principal: i...@hadoop.lan

  Valid starting     Expires            Service principal
 02/07/2012 13:59  02/07/2012 23:59  krbtgt/hadoop@hadoop.lan renew
 until 03/07/2012 13:59
 ---

  Then, you just have to set the right security options in your hadoop client
  in java and the identity will be i...@hadoop.lan for our example. In my
  tests, I only use HDFS; here is a snippet of code to get access to a secure
  hdfs cluster assuming the previous TGT (ivan's impersonation):

 
   val conf: HdfsConfiguration = new HdfsConfiguration()
   conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
     "kerberos")
   conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION,
     "true")
   conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal)

   UserGroupInformation.setConfiguration(conf)

   val fs = FileSystem.get(new URI(hdfsUri), conf)
 

  Using this 'fs' handle you can access hdfs securely as user 'ivan', even if
  ivan does not appear in the hadoop client code.

 Anyway, I also see two other options:
   * Setting the KRB5CCNAME environment variable to point to the right 
 ticketCache file
   * Specifying the keytab

Re: kerberos mapreduce question

2012-06-07 Thread Alejandro Abdelnur
If you provision your user/group information via LDAP to all your nodes it
is not a nightmare.

On Thu, Jun 7, 2012 at 7:49 AM, Koert Kuipers ko...@tresata.com wrote:

 thanks for your answer.

 so at a large place like say yahoo, or facebook, assuming they use
 kerberos, every analyst that uses hive has an account on every node of
 their large cluster? sounds like an admin nightmare to me

 On Thu, Jun 7, 2012 at 10:46 AM, Mapred Learn mapred.le...@gmail.com
 wrote:

  Yes, User submitting a job needs to have an account on all the nodes.
 
  Sent from my iPhone
 
  On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote:
 
   with kerberos enabled a mapreduce job runs as the user that submitted
  it.
   does this mean the user that submitted the job needs to have linux
  accounts
   on all machines on the cluster?
  
   how does mapreduce do this (run jobs as the user)? do the tasktrackers
  use
   secure impersonation to run-as the user?
  
   thanks! koert
 




-- 
Alejandro


Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
Praveenesh,

If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser
(hosts/groups) settings. You have to use explicit hosts/groups.

Thxs.

Alejandro
PS: please follow up this thread in the oozie-us...@incubator.apache.org

On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.comwrote:

 Hi all,

 I want to use oozie to submit different workflows from different users.
 These users are able to submit hadoop jobs.
 I am using hadoop 0.20.205 and oozie 3.1.3
 I have a hadoop user as a oozie-user

 I have set the following things :

 conf/oozie-site.xml :

  <property>
    <name>oozie.services.ext</name>
    <value>org.apache.oozie.service.HadoopAccessorService</value>
    <description>
      To add/replace services defined in 'oozie.services' with custom
      implementations. Class names must be separated by commas.
    </description>
  </property>

 conf/core-site.xml
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>

 When I submit jobs as the hadoop user, I am able to run them properly.
 But when I submit the same workflow from a different user, who
 can submit simple MR jobs to my hadoop cluster, I am getting the
 following error:

 JA009: java.io.IOException: java.io.IOException: The username kumar
 obtained from the conf doesn't match the username hadoop the user
 authenticated as
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)

 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

 Caused by: java.io.IOException: The username kumar obtained from the conf
 doesn't match the username hadoop the user authenticated as
 at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
 ... 11 more



Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
Multiple values are comma separated. Keep in mind that valid values for
proxyuser groups are, as the property name states, GROUPS, not USERS.
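
For example, explicit (non-wildcard) settings in core-site.xml would look
roughly like this (the host and group names are made-up placeholders):

  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>oozie-host1.mydomain.com,oozie-host2.mydomain.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>analysts,etl-users</value>
  </property>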

thxs.

Alejandro

On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.comwrote:

 How can I specify multiple users /groups for proxy user setting ?
 Can I give comma separated values in these settings ?

 Thanks,
 Praveenesh

 On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:

  Praveenesh,
 
  If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser
  (hosts/groups) settings. You have to use explicit hosts/groups.
 
  Thxs.
 
  Alejandro
  PS: please follow up this thread in the oozie-us...@incubator.apache.org
 
  On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com
  wrote:
 
   Hi all,
  
   I want to use oozie to submit different workflows from different users.
   These users are able to submit hadoop jobs.
   I am using hadoop 0.20.205 and oozie 3.1.3
   I have a hadoop user as a oozie-user
  
   I have set the following things :
  
   conf/oozie-site.xml :
  
     <property>
       <name>oozie.services.ext</name>
       <value>org.apache.oozie.service.HadoopAccessorService</value>
       <description>
         To add/replace services defined in 'oozie.services' with custom
         implementations. Class names must be separated by commas.
       </description>
     </property>

    conf/core-site.xml
     <property>
       <name>hadoop.proxyuser.hadoop.hosts</name>
       <value>*</value>
     </property>
     <property>
       <name>hadoop.proxyuser.hadoop.groups</name>
       <value>*</value>
     </property>
  
    When I submit jobs as the hadoop user, I am able to run them properly.
    But when I submit the same workflow from a different user, who
    can submit simple MR jobs to my hadoop cluster, I am getting the
    following error:
  
    JA009: java.io.IOException: java.io.IOException: The username kumar
    obtained from the conf doesn't match the username hadoop the user
    authenticated as
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
  
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  
   at
  
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at
  
  
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
  
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
  
   Caused by: java.io.IOException: The username kumar obtained from the
 conf
   doesn't match the username hadoop the user authenticated as
   at
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
   at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
   ... 11 more
  
 



Re: How do I synchronize Hadoop jobs?

2012-02-15 Thread Alejandro Abdelnur
You can use Oozie for that: you can write a workflow job that forks A
and B and then joins before C, roughly like the sketch below.
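
A rough outline of such a workflow definition (node and path names, and the
${jobTracker}/${nameNode} parameters, are placeholders; the map-reduce action
bodies are trimmed down to the minimum):

  <workflow-app name="fork-join-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="fork-ab"/>

    <fork name="fork-ab">
      <path start="job-a"/>
      <path start="job-b"/>
    </fork>

    <action name="job-a">
      <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- mapper/reducer classes and input/output dirs go in a <configuration> block -->
      </map-reduce>
      <ok to="join-ab"/>
      <error to="fail"/>
    </action>

    <action name="job-b">
      <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
      </map-reduce>
      <ok to="join-ab"/>
      <error to="fail"/>
    </action>

    <!-- C starts only after both A and B have completed successfully -->
    <join name="join-ab" to="job-c"/>

    <action name="job-c">
      <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
      </map-reduce>
      <ok to="end"/>
      <error to="fail"/>
    </action>

    <kill name="fail">
      <message>Workflow failed</message>
    </kill>
    <end name="end"/>
  </workflow-app>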

Thanks.

Alejandro

On Wed, Feb 15, 2012 at 11:23 AM, W.P. McNeill bill...@gmail.com wrote:
 Say I have two Hadoop jobs, A and B, that can be run in parallel. I have
 another job, C, that takes the output of both A and B as input. I want to
 run A and B at the same time, wait until both have finished, and then
 launch C. What is the best way to do this?

 I know the answer if I've got a single Java client program that launches A,
 B, and C. But what if I don't have the option to launch all of them from a
 single Java program? (Say I've got a much more complicated system with many
 steps happening between A-B and C.) How do I synchronize between jobs, make
 sure there's no race conditions etc. Is this what Zookeeper is for?


Re: Hybrid Hadoop with fork/join ?

2012-01-31 Thread Alejandro Abdelnur
Rob,

Hadoop has a way to run map tasks in multithreaded mode; look at the
MultithreadedMapRunner and MultithreadedMapper classes.
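
A rough sketch with the new-API MultithreadedMapper (the mapper logic, thread
count and paths are arbitrary examples, not a recommendation):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MultithreadedMapJob {

    // The real work; several instances of this run concurrently in one task JVM.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "multithreaded-map");
      job.setJarByClass(MultithreadedMapJob.class);
      // The task's mapper is MultithreadedMapper; it fans out to MyMapper threads.
      job.setMapperClass(MultithreadedMapper.class);
      MultithreadedMapper.setMapperClass(job, MyMapper.class);
      MultithreadedMapper.setNumberOfThreads(job, 10);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }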

Thanks.

Alejandro.

On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart robstewar...@gmail.com wrote:

 Hi,

 I'm investigating the feasibility of a hybrid approach to parallel
 programming, by fusing together the concurrent Java fork/join
 libraries with Hadoop... MapReduce, a paradigm suited for scalable
 execution over distributed memory + fork/join, a paradigm for optimal
 multi-threaded shared memory execution.

  I am aware that, to a degree, Hadoop can take advantage of multiple
  cores on a compute node, by setting
  mapred.tasktracker.map/reduce.tasks.maximum to be more than one. As
 it states in the Hadoop: The definitive guide book, each task runs
 in a separate JVM. So setting maximum map tasks to 3, will allow the
 possibility of 3 JVMs running on the Operating System, right?

 As an alternative to this approach, I am looking to introduce
 fork/join into map tasks, and perhaps maybe too, throttling down the
 maximum number of map tasks per node to 1. I would implement a
 RecordReader that produces map tasks for more coarse granularity, and
 that fork/join would subdivide to map task using multiple threads -
 where the output of the join would be the output of the map task.

 The motivation for this is that threads are a lot more lightweight
  than initializing JVMs, which, as the Hadoop book points out, takes a
  second or so each time, unless mapred.job.reuse.jvm.num.tasks is set
  higher than 1 (the default is 1). So for example, opting for small
 granular maps for a particular job for a given input generates 1,000
 map tasks, which will mean that 1,000 JVMs will be created on the
 Hadoop cluster within the execution of the program. If I were to write
 a bespoke RecordReader to increase the granularity of each map, so
 much so that only 100 map tasks are created. The embedded fork/join
 code would further split each map into 10 threads, to evaluate within
 one JVM concurrently. I would expect this latter approach of multiple
 threads to have better performance, than the clunkier multple-JVM
 approach.

 Has such a hybrid approach combining MapReduce with ForkJoin been
 investigated for feasibility, or similar studies published? Are there
 any important technical limitations that I should consider? Any
 thoughts on the proposed multi-threaded distributed-shared memory
 architecture are much appreciated from the Hadoop community!

 --
 Rob Stewart



Re: Any samples of how to write a custom FileSystem

2012-01-31 Thread Alejandro Abdelnur
Steven,

You could also look at HttpFSFileSystem in the hadoop-httpfs module; it is
quite simple and self-contained.

Cheers.

Alejandro

On Tue, Jan 31, 2012 at 8:37 PM, Harsh J ha...@cloudera.com wrote:

 To write a custom filesystem, extend on the FileSystem class.

 Depending on the scheme it is supposed to serve, creating an entry
 fs.scheme.impl in core-site.xml, and then loading it via the
 FileSystem.get(URI, conf) API will auto load it for you, provided the
 URI you pass has the right scheme.

 So supposing I have a FS scheme foo, I'd register it in core-site.xml as:

  <name>fs.foo.impl</name>
  <value>com.myorg.FooFileSystem</value>

 And then with a URI object from a path that goes foo:///mypath/, I'd
 do: FileSystem.get(URI, new Configuration()) to get a FooFileSystem
 instance.
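
  A small usage sketch of the lookup described above (com.myorg.FooFileSystem
  and the foo scheme are the hypothetical names from this example):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FooFsDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to the fs.foo.impl entry in core-site.xml above.
        conf.set("fs.foo.impl", "com.myorg.FooFileSystem");
        // FileSystem.get() looks up fs.<scheme>.impl and instantiates FooFileSystem.
        FileSystem fs = FileSystem.get(new URI("foo:///mypath/"), conf);
        System.out.println(fs.getUri());
      }
    }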

 Similarly, if you want to overload the local filesystem with your
 class, override the fs.file.impl config with your derivative class,
 and that'd be used in your configuration loaded programs in future.

 A good, not-so-complex impl. example to look at generally would be the
 S3FileSystem.

 On Wed, Feb 1, 2012 at 9:41 AM, Steve Lewis lordjoe2...@gmail.com wrote:
  Specifically how do I register a Custom FileSystem - any sample code
 
  --
  Steven M. Lewis PhD
  4221 105th Ave NE
  Kirkland, WA 98033
  206-384-1340 (cell)
  Skype lordjoe_com



 --
 Harsh J
 Customer Ops. Engineer
 Cloudera | http://tiny.cloudera.com/about



Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised

2012-01-09 Thread Alejandro Abdelnur
Bill,

In addition you must call DistributedCache.createSymlink(configuration);
that should do it.
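
Putting the pieces together, a minimal sketch (using the paths from your
mail) would be something like:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CacheArchiveSetup {
    public static void setup(Configuration configuration) throws Exception {
      FileSystem fs = FileSystem.get(configuration);
      // Copy the archive into HDFS first.
      fs.copyFromLocalFile(new Path("gate-app.zip"), new Path("/tmp/gate-app.zip"));
      // Register it in the cache; the #gate-app fragment names the symlink.
      DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"), configuration);
      // Without this call the fragment is not turned into a symlink.
      DistributedCache.createSymlink(configuration);
    }
  }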

Thxs.

Alejandro

On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote:

 I am trying to add a zip file to the distributed cache and have it unzipped
 on the task nodes with a softlink to the unzipped directory placed in the
 working directory of my mapper process. I think I'm doing everything the
 way the documentation tells me to, but it's not working.

 On the client in the run() function while I'm creating the job I first
 call:

  fs.copyFromLocalFile(new Path("gate-app.zip"), new Path("/tmp/gate-app.zip"));

 As expected, this copies the archive file gate-app.zip to the HDFS
 directory /tmp.

 Then I call

  DistributedCache.addCacheArchive(new URI("/tmp/gate-app.zip#gate-app"),
  configuration);

 I expect this to add /tmp/gate-app.zip to the distributed cache and put a
 softlink to it called gate-app in the working directory of each task.
 However, when I call job.waitForCompletion(), I see the following error:

  Exception in thread "main" java.io.FileNotFoundException: File does not
  exist: /tmp/gate-app.zip#gate-app.

 It appears that the distributed cache mechanism is interpreting the entire
 URI as the literal name of the file, instead of treating the fragment as
 the name of the softlink.

 As far as I can tell, I'm doing this correctly according to the API
 documentation:

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
 .

 The full project in which I'm doing this is up on github:
 https://github.com/wpm/Hadoop-GATE.

 Can someone tell me what I'm doing wrong?



Re: Timer jobs

2011-09-01 Thread Alejandro Abdelnur
[moving common-user@ to BCC]

Oozie is not HA yet, but it would be relatively easy to make it so. It was
designed with that in mind; we even did a prototype.

Oozie consists of 2 services: a SQL database to store the Oozie jobs' state
and a servlet container where the Oozie app proper runs.

The solution for HA for the database, well, it is left to the database. This
means, you'll have to get an HA DB.

The solution for HA for the Oozie app is deploying the servlet container
with the Oozie app in more than one box (2 or 3); and front them by a HTTP
load-balancer.

The missing part is that the current Oozie lock-service is an
in-memory implementation. This should be replaced with a Zookeeper
implementation. Zookeeper could run externally or internally in all Oozie
servers. This is what was prototyped long ago.

Thanks.

Alejandro


On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote:

 If I get you right you are asking about Installing Oozie as Distributed
 and/or HA cluster?!
  In that case I am not familiar with an out of the box solution by Oozie.
  But I think you can make up a solution of your own, for example:
 Installing Oozie on two servers on the same partition which will be
 synchronized by DRBD.
 You can trigger a failover using linux Heartbeat and that way maintain a
 virtual IP.





 On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk
 wrote:

  Hi
 
  Thanks a lot for pointing me to Oozie. I have looked a little bit into
  Oozie and it seems like the component triggering jobs is called
   Coordinator Application. But I really see nowhere that this Coordinator
   Application doesn't just run on a single machine, and that it will
  therefore
   not trigger anything if this machine is down. Can you confirm that the
   Coordinator Application role is distributed in a distributed Oozie
  setup,
   so that jobs get triggered even if one or two machines are down?
 
  Regards, Per Steffensen
 
  Ronen Itkin skrev:
 
   Hi
 
  Try to use Oozie for job coordination and work flows.
 
 
 
  On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk
  wrote:
 
 
 
  Hi
 
   I use hadoop for a MapReduce job in my system. I would like to have the
   job
   run every 5th minute. Is there any distributed timer job facility in
   hadoop?
  Of course I could setup a timer in an external timer framework (CRON or
  something like that) that invokes the MapReduce job. But CRON is only
  running on one particular machine, so if that machine goes down my job
  will
  not be triggered. Then I could setup the timer on all or many machines,
  but
  I would not like the job to be run in more than one instance every 5th
  minute, so then the timer jobs would need to coordinate who is actually
  starting the job this time and all the rest would just have to do
  nothing.
  Guess I could come up with a solution to that - e.g. writing some
 lock
  stuff using HDFS files or by using ZooKeeper. But I would really like
 if
  someone had already solved the problem, and provided some kind of a
  distributed timer framework running in a cluster, so that I could
  just
  register a timer job with the cluster, and then be sure that it is
  invoked
  every 5th minute, no matter if one or two particular machines in the
  cluster
  is down.
 
  Any suggestions are very welcome.
 
  Regards, Per Steffensen
 
 
 
 
 
 
 
 
 
 


 --
 *
 Ronen Itkin*
 Taykey | www.taykey.com



Re: Oozie monitoring

2011-08-30 Thread Alejandro Abdelnur
Avi,

For Oozie related questions, please subscribe and use the
oozie-...@incubator.apache.org alias.

Thanks.

Alejandro

On Tue, Aug 30, 2011 at 2:28 AM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi All,

 First, I really enjoy writing you and I'm thankful for your help.

 I have Oozie installed on dedicated server and I want to monitor it using
 my
 Nagios server.

 Can you please suggest some important parameters to monitor or other tips
 regarding this issue.



 Thanks.



 Avi




Re: Oozie on the namenode server

2011-08-29 Thread Alejandro Abdelnur
[Moving thread to Oozie aliases and hadoop's alias to BCC]

Avi,

Currently you can have a cold standby solution.

An Oozie setup consists of 2 systems: a SQL DB (storing all Oozie jobs'
state) and a servlet container (running Oozie proper). You need your DB to be
highly available. You need to have a second installation of Oozie, configured
identically to the running one, on standby. If the running Oozie fails, you
start the second one.

The idea (we once did a prototype using Zookeeper) is to eventually support
a multithreaded Oozie server, which would give both high availability and
horizontal scalability. Still, you'll need a highly available DB.

Thanks.

Alejandro


On Mon, Aug 29, 2011 at 4:57 AM, Avi Vaknin avivakni...@gmail.com wrote:

 Hi,
 I have one more question about Oozie,
 Is there any suggested high availability solution for Oozie ?
 Something like active/passive or active/active solution.
 Thanks.

 Avi

 -Original Message-
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Monday, August 29, 2011 2:20 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Oozie on the namenode server

 Avi,

 Should be OK to do so at this stage. However, keep monitoring loads on
 the machines to determine when to move things out to their own
 dedicated boxes.

 On Mon, Aug 29, 2011 at 3:20 PM, Avi Vaknin avivakni...@gmail.com wrote:
  Hi All,
 
  I want to install Oozie and I wonder if it is OK to install it on the
 name
  node or whether I need to set up a dedicated server for it.
 
  I have a very small Hadoop cluster (4 datanodes + namenode + secondary
  namenode).
 
  Thanks for your help.
 
 
 
  Avi
 
 



 --
 Harsh J
 -
 No virus found in this message.
 Checked by AVG - www.avg.com
 Version: 10.0.1392 / Virus Database: 1520/3864 - Release Date: 08/28/11




Hoop into 0.23 release

2011-08-22 Thread Alejandro Abdelnur
Hadoop developers,

Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a
successful build.

I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part
of 0.23 (Nicholas already looked at the code).

In addition, the Jersey utils in Hoop will be handy for
https://issues.apache.org/jira/browse/MAPREDUCE-2863.

Most if not all of remaining work is not development but Java package
renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and
final packaging.

The current blocker is deciding on the Maven sub-modules organization,
https://issues.apache.org/jira/browse/HADOOP-7560

I'll drive the discussion for HADOOP-7560 and as soon as we have an
agreement there I'll refactor Hoop accordingly.

Does this sound reasonable ?

Thanks.

Alejandro


Re: Hoop into 0.23 release

2011-08-22 Thread Alejandro Abdelnur
Nicholas,

Thanks. In order to move forward with the integration of Hoop I need
consensus on:

*
https://issues.apache.org/jira/browse/HDFS-2178?focusedCommentId=13089106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13089106

* https://issues.apache.org/jira/browse/HADOOP-7560

Thanks.

Alejandro

On Mon, Aug 22, 2011 at 3:42 PM, Tsz Wo Sze szets...@yahoo.com wrote:

 +1
 I believe HDFS-2178 is very close to being committed.  Great work
 Alejandro!

 Nicholas



 
 From: Alejandro Abdelnur t...@cloudera.com
 To: common-user@hadoop.apache.org; hdfs-...@hadoop.apache.org
 Sent: Monday, August 22, 2011 2:16 PM
 Subject: Hoop into 0.23 release

 Hadoop developers,

  Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a
  successful build.

 I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part
 of 0.23 (Nicholas already looked at the code).

 In addition, the Jersey utils in Hoop will be handy for
 https://issues.apache.org/jira/browse/MAPREDUCE-2863.

 Most if not all of remaining work is not development but Java package
 renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and
 final packaging.

  The current blocker is deciding on the Maven sub-modules organization,
  https://issues.apache.org/jira/browse/HADOOP-7560

 I'll drive the discussion for HADOOP-7560 and as soon as we have an
 agreement there I'll refactor Hoop accordingly.

 Does this sound reasonable ?

 Thanks.

 Alejandro



Re: Multiple Output Formats

2011-07-27 Thread Alejandro Abdelnur
Roger,

Or you can take a look at Hadoop's MultipleOutputs class.
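
A rough sketch of how MultipleOutputs is wired in with the old (mapred) API
that ships with 0.20.2; the named output "text" and the key/value types are
just examples:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  public class MultiOutReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    // In the driver, declare the named output once:
    //   MultipleOutputs.addNamedOutput(conf, "text",
    //       TextOutputFormat.class, Text.class, Text.class);

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
      mos = new MultipleOutputs(conf);
    }

    @SuppressWarnings("unchecked")
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      while (values.hasNext()) {
        // Route records to the "text" named output instead of the default one.
        mos.getCollector("text", reporter).collect(key, values.next());
      }
    }

    public void close() throws IOException {
      mos.close();
    }
  }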

Thanks.

Alejandro

On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu pire...@crs4.it wrote:

 On July 26, 2011 06:11:33 PM Roger Chen wrote:
  Hi all,
 
  I am attempting to implement MultipleOutputFormat to write data to
 multiple
  files dependent on the output keys and values. Can somebody provide a
  working example with how to implement this in Hadoop 0.20.2?
 
  Thanks!

 Hello,

 I have a working sample here:

 http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-
 seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java

 It extends FileOutputFormat.

 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452



Re: EXT :Re: Problem running a Hadoop program with external libraries

2011-03-05 Thread Alejandro Abdelnur
Why don't you put your native library in HDFS and use the DistributedCache
to make it available to the tasks? For example:

Copy 'foo.so' to 'hdfs://localhost:8020/tmp/foo.so', then add it to the job's
distributed cache:

  DistributedCache.addCacheFile(new URI("hdfs://localhost:8020/tmp/foo.so#foo.so"),
      jobConf);
  DistributedCache.createSymlink(jobConf);

Note that the #foo.so will create a softlink in the task running dir. And the
task running dir is in the library path (LD_LIBRARY_PATH) of your task.

Alejandro

On Sat, Mar 5, 2011 at 7:19 AM, Lance Norskog goks...@gmail.com wrote:

 I have never heard of putting a native code shared library in a Java jar. I
 doubt that it works. But it's a cool idea!

 A Unix binary program loads shared libraries from the paths given in the
 environment variable LD_LIBRARY_PATH. This has to be set to the directory
 with the OpenCV .so file when you start Java.

 Lance

 On Mar 4, 2011, at 2:13 PM, Brian Bockelman wrote:

  Hi,
 
  Check your kernel's overcommit settings.  This will prevent the JVM from
 allocating memory even when there's free RAM.
 
  Brian
 
  On Mar 4, 2011, at 3:55 PM, Ratner, Alan S (IS) wrote:
 
  Aaron,
 
   Thanks for the rapid responses.
 
 
  * ulimit -u unlimited is in .bashrc.
 
 
  * HADOOP_HEAPSIZE is set to 4000 MB in hadoop-env.sh
 
 
  * Mapred.child.ulimit is set to 2048000 in mapred-site.xml
 
 
  * Mapred.child.java.opts is set to -Xmx1536m in mapred-site.xml
 
   I take it you are suggesting that I change the java.opts command to:
 
  mapred.child.java.opts is <value>-Xmx1536m
 -Djava.library.path=/path/to/native/libs</value>
 
 
  Alan Ratner
  Northrop Grumman Information Systems
  Manager of Large-Scale Computing
  9020 Junction Drive
  Annapolis Junction, MD 20701
  (410) 707-8605 (cell)
 
  From: Aaron Kimball [mailto:akimbal...@gmail.com]
  Sent: Friday, March 04, 2011 4:30 PM
  To: common-user@hadoop.apache.org
  Cc: Ratner, Alan S (IS)
  Subject: EXT :Re: Problem running a Hadoop program with external
 libraries
 
  Actually, I just misread your email and missed the difference between
 your 2nd and 3rd attempts.
 
  Are you enforcing min/max JVM heap sizes on your tasks? Are you
 enforcing a ulimit (either through your shell configuration, or through
 Hadoop itself)? I don't know where these cannot allocate memory errors are
 coming from. If they're from the OS, could it be because it needs to fork()
 and momentarily exceed the ulimit before loading the native libs?
 
  - Aaron
 
  On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com
 mailto:akimbal...@gmail.com wrote:
  I don't know if putting native-code .so files inside a jar works. A
 native-code .so is not classloaded in the same way .class files are.
 
  So the correct .so files probably need to exist in some physical
 directory on the worker machines. You may want to doublecheck that the
 correct directory on the worker machines is identified in the JVM property
 'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can
 be manipulated in the Hadoop configuration setting mapred.child.java.opts
 (include '-Djava.library.path=/path/to/native/libs' in the string there.)
 
  Also, if you added your .so files to a directory that is already used by
 the tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may
 need to restart the tasktracker instance for it to take effect. (This is
 true of .jar files in the $HADOOP_HOME/lib directory; I don't know if it is
 true for native libs as well.)
 
  - Aaron
 
  On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) 
 alan.rat...@ngc.commailto:alan.rat...@ngc.com wrote:
  We are having difficulties running a Hadoop program making calls to
 external libraries - but this occurs only when we run the program on our
 cluster and not from within Eclipse where we are apparently running in
 Hadoop's standalone mode.  This program invokes the Open Computer Vision
 libraries (OpenCV and JavaCV).  (I don't think there is a problem with our
 cluster - we've run many Hadoop jobs on it without difficulty.)
 
  1.  I normally use Eclipse to create jar files for our Hadoop
 programs but I inadvertently hit the run as Java application button and
 the program ran fine, reading the input file from the eclipse workspace
 rather than HDFS and writing the output file to the same place.  Hadoop's
 output appears below.  (This occurred on the master Hadoop server.)
 
  2.  I then exported from Eclipse a runnable jar which extracted
 required libraries into the generated jar - presumably producing a jar file
 that incorporated all the required library functions. (The plain jar file
 for this program is 17 kB while the runnable jar is 30MB.)  When I try to
 run this on my Hadoop cluster (including my master and slave servers) the
 program reports that it is unable to locate libopencv_highgui.so.2.2:
 cannot open shared object file: No such file or directory.  Now, in
 addition to this library being incorporated inside 

Re: Accessing Hadoop using Kerberos

2011-01-12 Thread Alejandro Abdelnur
Hadoop's UserGroupInformation class looks into the OS Kerberos ticket cache
for the user to use.

I guess you'd have to do the Kerberos authentication making sure the Kerberos
ticket cache is used and then do a UserGroupInformation.loginUser()

And after doing that the Hadoop libraries should work.
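
For reference, a minimal sketch of logging in from code and then using HDFS
(the principal and keytab path are made-up examples):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.security.UserGroupInformation;

  public class SecureHdfsClient {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("hadoop.security.authentication", "kerberos");
      UserGroupInformation.setConfiguration(conf);
      // Either rely on the kinit ticket cache, or log in explicitly from a keytab:
      UserGroupInformation.loginUserFromKeytab("pikini@EXAMPLE.COM",
          "/etc/security/pikini.keytab");
      FileSystem fs = FileSystem.get(conf);
      System.out.println(fs.exists(new Path("/user/pikini")));
    }
  }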

Alejandro

On Thu, Jan 13, 2011 at 2:19 PM, Muruga Prabu M murugapra...@gmail.comwrote:

 Yeah kinit is working fine. My scenario is as follows. I have a java
 program
 written by me to upload and download files to the HDFS cluster. I want to
 perform Kerberos authentication from the Java code itself. I have acquired
 the TicketGrantingTicket from the Authentication Server and Service Ticket
 from the Ticket Granting Server. Now how to authenticate myself with hadoop
 by sending the service ticket received from the Ticket Granting Server.

 Regards,
 Pikini

 On Wed, Jan 12, 2011 at 10:13 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:

  If you kinit-ed successfully you are done.
 
  The hadoop libraries will do the trick of authenticating the user against
  Hadoop.
 
  Alejandro
 
 
  On Thu, Jan 13, 2011 at 12:46 PM, Muruga Prabu M murugapra...@gmail.com
  wrote:
 
   Hi,
  
   I have a Java program to upload and download files from the HDFS. I am
   using
   Hadoop with Kerberos. I am able to get a TGT(From the Authentication
  Server
   ) and a service Ticket(From Ticket Granting server) using kerberos
   successfully. But after that I don't know how to authenticate myself
 with
   Hadoop. Can anyone please help me how to do it ?
  
   Regards,
   Pikini
  
 



Re: how to run jobs every 30 minutes?

2010-12-14 Thread Alejandro Abdelnur
Ed,

Actually Oozie is quite different from Cascading.

* Cascading allows you to write 'queries' using a Java API and they get
translated into MR jobs.
* Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG
(workflow jobs) and has timer+data dependency triggers (coordinator jobs).

Regards.

Alejandro

On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote:

 Thanks for the tip. I took a look at it.
 Looks similar to Cascading I guess...?
 Anyway thanks for the info!!

 Ed

 2010/12/8 Alejandro Abdelnur t...@cloudera.com

  Or, if you want to do it in a reliable way you could use an Oozie
  coordinator job.
 
  On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
   My mistake. Come to think about it, you are right, I can just make an
   infinite loop inside the Hadoop application.
   Thanks for the reply.
  
   2010/12/7 Harsh J qwertyman...@gmail.com
  
   Hi,
  
   On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote:
Hi,
   
I'm planning to crawl a certain web site every 30 minutes.
How would I get it done in Hadoop?
   
In pure Java, I used Thread.sleep() method, but I guess this won't
  work
   in
Hadoop.
  
   Why wouldn't it? You need to manage your post-job logic mostly, but
   sleep and resubmission should work just fine.
  
Or if it could work, could anyone show me an example?
   
Ed.
   
  
  
  
   --
   Harsh J
   www.harshj.com
  
  
 



Re: how to run jobs every 30 minutes?

2010-12-08 Thread Alejandro Abdelnur
Or, if you want to do it in a reliable way you could use an Oozie
coordinator job.
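
For reference, a bare-bones coordinator definition that triggers a workflow
every 30 minutes could look roughly like this (dates, the namenode host and
the application path are placeholders):

  <coordinator-app name="crawl-every-30-min" frequency="${coord:minutes(30)}"
                   start="2010-12-08T00:00Z" end="2011-12-08T00:00Z"
                   timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
      <workflow>
        <!-- HDFS directory containing the workflow.xml that runs the crawl job -->
        <app-path>hdfs://namenode:8020/user/ed/crawl-wf</app-path>
      </workflow>
    </action>
  </coordinator-app>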

On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
 My mistake. Come to think about it, you are right, I can just make an
 infinite loop inside the Hadoop application.
 Thanks for the reply.

 2010/12/7 Harsh J qwertyman...@gmail.com

 Hi,

 On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote:
  Hi,
 
  I'm planning to crawl a certain web site every 30 minutes.
  How would I get it done in Hadoop?
 
  In pure Java, I used Thread.sleep() method, but I guess this won't work
 in
  Hadoop.

 Why wouldn't it? You need to manage your post-job logic mostly, but
 sleep and resubmission should work just fine.

  Or if it could work, could anyone show me an example?
 
  Ed.
 



 --
 Harsh J
 www.harshj.com




Re: HDFS Rsync process??

2010-11-30 Thread Alejandro Abdelnur
The other approach, if the DR cluster is idle or has enough excess capacity,
would be running all the jobs on the input data in both clusters and performing
checksums on the outputs to ensure everything is consistent. And you could
take advantage of this to distribute ad hoc queries between the 2 clusters.

Alejandro

On Tue, Nov 30, 2010 at 6:51 PM, Steve Loughran ste...@apache.org wrote:

 On 30/11/10 03:59, hadoopman wrote:

 We have two Hadoop clusters in two separate buildings.  Both clusters
 are loading the same data from the same sources (the second cluster is
 for DR).

 We're looking at how we can recover the primary cluster and catch it
 back up again as new data will continue to feed into the DR cluster.
 It's been suggested we use rsync across the network however my concern
 is the amount of data we would have to copy over would take several days
 (at a minimum) to sync them even with our dual bonded 1 gig network cards.

 I'm curious if anyone has come up with a solution short of just loading
 the source logs into HDFS. Is there a way to even rsync two clusters and
 get them in sync? Been googling around. Haven't found anything of
 substances yet.



  you don't need all the files in the cluster to be in sync, as a lot of them
  are intermediate and transient files.

  Instead use distcp to copy source files to the two clusters; this runs
  across the machines in the cluster and is also designed to work across
  hadoop versions, with some limitations.





Re: how to set diffent VM parameters for mappers and reducers?

2010-10-07 Thread Alejandro Abdelnur
Never used those myself, always used the global one, but I knew they are
there.

Which Hadoop API are you using, the old one or the new one?

Alejandro

On Fri, Oct 8, 2010 at 3:50 AM, Vitaliy Semochkin vitaliy...@gmail.comwrote:

 Hi,

  I tried using mapred.map.child.java.opts and mapred.reduce.child.java.opts
  but it looks like hadoop-0.20.2 ignores them.

 On which version have you seen it working?

 Regards,
 Vitaliy S

 On Tue, Oct 5, 2010 at 5:14 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:
  The following 2 properties should work:
 
  mapred.map.child.java.opts
  mapred.reduce.child.java.opts
 
  Alejandro
 
 
  On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
  Hi,
 
  You don't say which version of Hadoop you are using.
  Going from memory, I believe in the CDH3 release from Cloudera, there
 are some specific OPTs you can set in hadoop-env.sh.
 
  HTH
 
  -Mike
 
 
  Date: Tue, 5 Oct 2010 16:59:35 +0400
  Subject: how to set diffent VM parameters for mappers and reducers?
  From: vitaliy...@gmail.com
  To: common-user@hadoop.apache.org
 
  Hello,
 
 
  I have mappers that do not need much ram but combiners and reducers
 need a lot.
  Is it possible to set different VM parameters for mappers and reducers?
 
 
 
   PS: Often I face an interesting problem: on the same set of data I
   receive, I get java.lang.OutOfMemoryError: Java heap space in the combiner,
   but it does not happen all the time.
   What could be the cause of such behavior?
   My personal opinion is that I have

   mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when
   it should.
 
  Thanks in Advance,
  Vitaliy S
 
 



Re: is there no streaming.jar file in hadoop-0.21.0??

2010-10-05 Thread Alejandro Abdelnur
Edward,

Yep, you should use the one from contrib/

Alejandro

On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote:
 Thanks, Tom.
 Didn't expect the author of THE BOOK would answer my question. Very
 surprised and honored :-)
 Just one more question if you don't mind.
  I read on the Internet that in order to use Hadoop Streaming in
  Hadoop-0.21.0 you should run
 $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args (Of
 course I don't see any hadoop-streaming.jar in $HADOOP_HOME)
 But according to your reply I should go
 $HADOOP_HOME/bin/hadoop jar
 $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args
 I suppose the latter one is the way to go?

 Ed.

 2010/10/5 Tom White t...@cloudera.com

 Hi Ed,

 The directory structure moved around as a result of the project
 splitting into three subprojects (Common, HDFS, MapReduce). The
 streaming jar is in mapred/contrib/streaming in the distribution.

 Cheers,
 Tom

 On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote:
  Hi,
  I've recently downloaded Hadoop-0.21.0.
  After the installation, I've noticed that there is no contrib directory
  that used to exist in Hadoop-0.20.2.
  So I was wondering if there is no hadoop-0.21.0-streaming.jar file in
  Hadoop-0.21.0.
  Anyone had any luck finding it?
  If the way to use streaming has changed in Hadoop-0.21.0, then please
 tell
  me how.
  Appreciate the help, thx.
 
  Ed.
 




Re: Re: Help!!The problem about Hadoop

2010-10-05 Thread Alejandro Abdelnur
Or you could try using MultiFileInputFormat for your MR job.

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html

Alejandro

On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote:
 500 small files comprising one gigabyte? Perhaps you should try
 concatenating them all into one big file and try; as a mapper is
 supposed to run at least for a minute optimally. And small files don't
 make good use of the HDFS block feature.

 Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

 2010/10/5 Jander 442950...@163.com:
 Hi Jeff,

 Thank you very much for your reply sincerely.

  I know hadoop has overhead, but is it this large in my case?

  The 1GB text input has about 500 map tasks because the input is composed of
  small text files. And the time each map takes is from 8 to 20
  seconds. I use compression, e.g. conf.setCompressMapOutput(true).

 Thanks,
 Jander




 At 2010-10-05 16:28:55,Jeff Zhang zjf...@gmail.com wrote:

Hi Jander,

Hadoop has overhead compared to a single-machine solution. How many tasks
did you get when you ran your hadoop job? And how much time did each map
and reduce task take?

There's lots of tips for performance tuning of hadoop. Such as
compression and jvm reuse.
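
As a rough illustration of those two tips with the old mapred API (values
are only examples, not recommendations):

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Compress the intermediate map output to reduce shuffle I/O.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(DefaultCodec.class);

        // Reuse task JVMs so many short tasks don't each pay the JVM start-up
        // cost; -1 means unlimited reuse within a job.
        conf.setNumTasksToExecutePerJvm(-1);
      }
    }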


2010/10/5 Jander 442950...@163.com:
 Hi, all
 I built an application using Hadoop.
 I take 1GB of text data as input, with the following results:
    (1) a cluster of 3 PCs: the time consumed is 1020 seconds.
    (2) a cluster of 4 PCs: the time is about 680 seconds.
 But the pre-Hadoop version of the application takes about 280 seconds, so at the
 speed above I would need 8 PCs just to match the old speed.
 Now the question: is this to be expected?

 Jander,
 Thanks.






--
Best Regards

Jeff Zhang




 --
 Harsh J
 www.harshj.com



Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Alejandro Abdelnur
The following 2 properties should work:

mapred.map.child.java.opts
mapred.reduce.child.java.opts

Alejandro
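
For illustration, this is how the two properties could be set on a JobConf
(heap sizes are just examples; note that the combiner runs inside the map
task JVM, so its memory needs count against the map setting, and older
releases only honour the single mapred.child.java.opts property):

    import org.apache.hadoop.mapred.JobConf;

    public class PerTaskTypeHeap {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.set("mapred.map.child.java.opts", "-Xmx512m");     // mappers (and combiners)
        conf.set("mapred.reduce.child.java.opts", "-Xmx1536m"); // reducers
      }
    }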


On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote:

 Hi,

 You don't say which version of Hadoop you are using.
 Going from memory, I believe in the CDH3 release from Cloudera, there are 
 some specific OPTs you can set in hadoop-env.sh.

 HTH

 -Mike


 Date: Tue, 5 Oct 2010 16:59:35 +0400
 Subject: how to set different VM parameters for mappers and reducers?
 From: vitaliy...@gmail.com
 To: common-user@hadoop.apache.org

 Hello,


 I have mappers that do not need much ram but combiners and reducers need a 
 lot.
 Is it possible to set different VM parameters for mappers and reducers?



 PS: I often face an interesting problem: on the same set of data I
 sometimes get java.lang.OutOfMemoryError: Java heap space in the combiner,
 but it does not happen every time.
 What could be the cause of such behavior?
 My personal opinion is that I have

 mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when
 it should.

 Thanks in Advance,
 Vitaliy S



Re: Classpath

2010-08-28 Thread Alejandro Abdelnur
Yes, you can do #1, but I wouldn't say it is practical. You can do #2
as well, as you suggest.

But, IMO, the best way is copying the JARs to HDFS and using the DistributedCache.

A
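
A small sketch of that approach (paths are placeholders): copy the jar to
HDFS once and add it to the task classpath through the DistributedCache.

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class AddJarToClasspath {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);

        // Copy the dependency into HDFS once (or reuse one already there)...
        Path local = new Path("lib/my-dependency.jar");
        Path inHdfs = new Path("/apps/libs/my-dependency.jar");
        fs.copyFromLocalFile(local, inHdfs);

        // ...and put it on the classpath of every task of this job.
        DistributedCache.addFileToClassPath(inHdfs, conf);
      }
    }

For case 2, the generic -libjars option of the hadoop jar command does much
the same thing from the client side, provided the job's driver uses
ToolRunner/GenericOptionsParser.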

On Sun, Aug 29, 2010 at 1:29 PM, Mark static.void@gmail.com wrote:
  How can I add jars to Hadoop's classpath when running MapReduce jobs, for the
 following situations?

 1) The jars are local to the nodes that are running the job.
 2) The jars are only local to the client submitting the job.

 I'm assuming I can just bundle all required jars into the main job jar being
 submitted, but I was wondering if there was some other way. Thanks



Re: REST web service on top of Hadoop

2010-07-29 Thread Alejandro Abdelnur
In Oozie we are working on MR/Pig job submission over HTTP.

On Thu, Jul 29, 2010 at 5:09 PM, Steve Loughran ste...@apache.org wrote:
 S. Venkatesh wrote:

 HDFS Proxy in contrib provides an HTTP interface over HDFS. It's not very
 RESTful, but we are working on a new version which will have a REST
 API.

 AFAIK, Oozie will provide REST API for launching MR jobs.



  I've done a webapp to do this (and some trickier things like creating the
 cluster when needed); some slides on it:
 http://www.slideshare.net/steve_l/long-haul-hadoop

 There are links in there to the various JIRA issues which have discussed the
 topic.



Re: Hadoop multiple output files

2010-06-28 Thread Alejandro Abdelnur
Adam,

Yes, you have to do some configuration on the JobConf before submitting
the job. If the javadocs are not clear enough, check the test case.

A
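
In case the test case is hard to find, a minimal sketch of the configuration
step (key/value types are placeholders):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class NamedOutputSetup {
      public static void configure(JobConf conf) {
        // Declare the named output before submitting the job; without this,
        // getCollector("text", reporter) fails with
        // "Undefined named output 'text'".
        MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
            Text.class, Text.class);
      }
    }

In the reducer you would then create a MultipleOutputs instance in
configure(), write with mos.getCollector("text", reporter).collect(key, value),
and call mos.close() in close().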

On Mon, Jun 28, 2010 at 8:42 PM, Adam Silberstein
silbe...@yahoo-inc.com wrote:
 Hi Alejandro,
 Thanks for the tip.  I tried the class, but got the following error:

 java.lang.IllegalArgumentException: Undefined named output 'text'
  at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:496)
  at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476)
  at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source)
  at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)


 I am trying to write to an output file named 'text.'  Do I need to
 initialize that file in some way?  I tried making a directory with the name,
 but that didn't do anything.

 Thanks,
 Adam


 On 6/28/10 6:17 PM, Alejandro Abdelnur tuc...@gmail.com wrote:

 Check the MultipleOutputs class

 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

 On Mon, Jun 28, 2010 at 5:31 PM, Adam Silberstein
 silbe...@yahoo-inc.com wrote:

 Hi,
 I would like to run a Hadoop job that writes to multiple output files.  I see
 a class called MultipleOutputFormat that looks like what I want, but I have
 not been able to find any sample code showing how to use it.  I see
 discussion of it in a JIRA, where the idea is to choose the output file based on
 the key.  If someone could point me to a sample, that would be great.

 Thanks,
 Adam






Re: preserve JobTracker information

2010-05-19 Thread Alejandro Abdelnur
Also, you can configure the JobTracker to keep the RunningJob
information for completed jobs (available via the Hadoop Java API). There
is a config property that enables this, another that specifies the
location (it can be HDFS or local), and another that specifies for how
many hours you want to keep that information.

HTH

A
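
From memory, the properties involved look like the ones below (shown on a
JobConf only to spell out the names; in practice they belong in
mapred-site.xml on the JobTracker node and need a JobTracker restart, and
the exact names and defaults should be checked against your release's
mapred-default.xml):

    import org.apache.hadoop.mapred.JobConf;

    public class JobStatusPersistence {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setBoolean("mapred.job.tracker.persist.jobstatus.active", true);
        conf.setInt("mapred.job.tracker.persist.jobstatus.hours", 24);
        conf.set("mapred.job.tracker.persist.jobstatus.dir", "/jobtracker/jobsInfo");
      }
    }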

On Wed, May 19, 2010 at 1:36 AM, Harsh J qwertyman...@gmail.com wrote:
 Preserved JobTracker history is already available at /jobhistory.jsp

 There is a link at the end of the /jobtracker.jsp page that leads to
 this. There's also free analysis to go with that! :)

 On Tue, May 18, 2010 at 11:00 PM, Alan Miller someb...@squareplanet.de 
 wrote:
 Hi,

 Is there a way to preserve previous job information (Completed Jobs, Failed
 Jobs) when the Hadoop cluster is restarted?

 Every time I start up my cluster (start-dfs.sh, start-mapred.sh), the
 JobTracker interface at http://myhost:50020/jobtracker.jsp is empty.

 Thanks,
 Alan






 --
 Harsh J
 www.harshj.com



Re: MultipleTextOutputFormat splitting output into different directories.

2009-09-15 Thread Alejandro Abdelnur
Using MultipleOutputs (
http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
) you can split data into different files in the output dir.

After your job finishes you can move the files to different directories.

The benefit of doing this is that task failures and speculative
execution are also taken into account for all these files, so your data
will stay consistent. If you write to different directories directly, then
you have to handle this by hand.

A
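
A rough sketch of the post-job move (the file naming convention and paths
below are assumptions; MultipleOutputs typically writes files named
<namedOutput>-r-<part> into the single job output directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveNamedOutputs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outputDir = new Path(args[0]);

        for (FileStatus status : fs.listStatus(outputDir)) {
          String name = status.getPath().getName();
          if (name.startsWith("_") || name.startsWith("part-")) {
            continue;                          // skip _logs and the default part files
          }
          String prefix = name.split("-")[0];  // e.g. "KEYA" from "KEYA-r-00000"
          Path targetDir = new Path(outputDir, prefix);
          fs.mkdirs(targetDir);
          fs.rename(status.getPath(), new Path(targetDir, name));
        }
      }
    }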

On Tue, Sep 15, 2009 at 4:22 PM, Aviad sela sela.s...@gmail.com wrote:
 Is anybody interested in, or has anybody addressed, such a problem?
 Or does it seem to be an esoteric usage?




 On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela sela.s...@gmail.com wrote:

  I am using Hadoop 0.19.1

 I am attempting to split an input into multiple directories.
 I don't know in advance how many directories exist.
 I don't know in advance what the directory depth is.
 I expect that under each such directory a file exists with all available
 records having the same key permutation
 found in the job.

 Where currently each reducer produces a single output, i.e. PART-0001,
 I would like to create as many directories as necessary, following the
 pattern:

                key1/key2/.../keyN/PART-0001

 where each key may have a different value for each input record.
 Different records may result in different paths being requested:
               key1a/key2b/PART-0001
               key1c/key2d/key3e/PART-0001
 To keep it simple, during each job we may expect the same depth from each
 record.

 I assume that the input records imply that each reducer will produce several
 hundred such directories.
 (Indeed, this strongly depends on the semantics of the input records.)


 The map part reads a record and, following some logic, assigns a key like:
 KEY_A, KEY_B
 The map value is the original input line.


 For the reducer part I assign the IdentityReducer.
 However, I have set:

     jobConf.setReducerClass(IdentityReducer.class);

     jobConf.setOutputFormat(MyTextOutputFormat.class);



 Where MyTextOutputFormat extends MultipleTextOutputFormat and
 implements:

     protected String generateFileNameForKeyValue(K key, V value, String name)
     {
         String keyParts[] = key.toString().split(",");
         Path finalPath = null;
         // Build the directory structure comprised of the key parts
         for (int i = 0; i < keyParts.length; i++)
         {
             String part = keyParts[i].trim();
             if (false == "".equals(part))
             {
                 if (null == finalPath)
                     finalPath = new Path(part);
                 else
                     finalPath = new Path(finalPath, part);
             }
         } // end of for

         String fileName = generateLeafFileName(name);
         finalPath = new Path(finalPath, fileName);

         return finalPath.toString();
     } // generateFileNameForKeyValue
 During execution I have seen that the reduce attempts do create the following
 path under the output path:

 /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_00_0/KEY_A/KEY_B/part-0
    However, the file was empty.



 The job fails at the end with the following exceptions found in the
 task log:

 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
 createBlockOutputStream java.io.IOException: Bad connect ack with
 firstBadLink 9.148.30.71:50010
 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
 block blk_-6138647338595590910_39383
 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
 Exception: java.io.IOException: Unable to create new block.
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error
 Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes ==
 null
 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not
 get block locations. Source file
 /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_02_0/KEY_A/KEY_B/part-2
 - Aborting...
 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error
 running child
 java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
 cleanup for