Re: Loading Data to HDFS

2012-10-30 Thread Alejandro Abdelnur
> ... installed on RHEL. I am planning to load quite a few Petabytes of Data onto HDFS. Which will be the fastest method to use, and are there any projects around Hadoop which can be used as well?

Re: HBase mulit-user security

2012-07-25 Thread Alejandro Abdelnur
> Am I missing something? > Thanks! > -Tony > -Original Message- > From: Alejandro Abdelnur [mailto:t...@cloudera.com] > Sent: Monday, July 02, 2012 11:40 AM > To: common-user@hadoop.apache.org > Subject: Re: hadoop security API (repost) > Ton...

Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
...r support in HBase if it is not there yet. > Thanks. > -Tony > -Original Message- > From: Alejandro Abdelnur [mailto:t...@cloudera.com] > Sent: Monday, July 02, 2012 11:40 AM > To: common-user@hadoop.apache.org > Subject: Re: hadoop security API (repost) > Tony, ...

Re: hadoop security API (repost)

2012-07-02 Thread Alejandro Abdelnur
Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 properties s...
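
The two properties in question look like this; a minimal core-site.xml sketch where 'myserveruser' is a placeholder for the server app's Unix user:

  <!-- core-site.xml: allow myserveruser to impersonate other users -->
  <property>
    <name>hadoop.proxyuser.myserveruser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.myserveruser.groups</name>
    <value>*</value>
  </property>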

Re: kerberos mapreduce question

2012-06-07 Thread Alejandro Abdelnur
If you provision your user/group information via LDAP to all your nodes it is not a nightmare. On Thu, Jun 7, 2012 at 7:49 AM, Koert Kuipers wrote: > thanks for your answer. > so at a large place like say yahoo, or facebook, assuming they use kerberos, every analyst that uses hive has an acc...

Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
... I give comma separated values in these settings? > Thanks, > Praveenesh > On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur wrote: > Praveenesh, > If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts...

Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up on this thread on the oozie-us...@incubator.apache.org alias. On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar wrote: ...
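
So on 0.20.205 the values must be spelled out, comma-separated (which also answers the question in the follow-up above; the host and group names below are made-up examples):

  <!-- core-site.xml: explicit, comma-separated hosts/groups instead of '*' -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>oozie-host1.example.com,oozie-host2.example.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>analysts,ops</value>
  </property>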

Re: How do I synchronize Hadoop jobs?

2012-02-15 Thread Alejandro Abdelnur
You can use Oozie for that: write a workflow job that forks A & B and then joins before C. Thanks. Alejandro On Wed, Feb 15, 2012 at 11:23 AM, W.P. McNeill wrote: > Say I have two Hadoop jobs, A and B, that can be run in parallel. I have another job, C, that takes the output of both A...
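
A sketch of such a workflow (action bodies elided; all names are hypothetical):

  <workflow-app name="fork-join-demo" xmlns="uri:oozie:workflow:0.1">
    <start to="forkAB"/>
    <fork name="forkAB">
      <path start="A"/>
      <path start="B"/>
    </fork>
    <action name="A">
      <map-reduce><!-- job A config --></map-reduce>
      <ok to="joinAB"/><error to="fail"/>
    </action>
    <action name="B">
      <map-reduce><!-- job B config --></map-reduce>
      <ok to="joinAB"/><error to="fail"/>
    </action>
    <join name="joinAB" to="C"/>
    <action name="C">
      <map-reduce><!-- job C config --></map-reduce>
      <ok to="end"/><error to="fail"/>
    </action>
    <kill name="fail"><message>Workflow failed</message></kill>
    <end name="end"/>
  </workflow-app>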

Re: Any samples of how to write a custom FileSystem

2012-01-31 Thread Alejandro Abdelnur
Steven, You could also look at HttpFSFileSystem in the hadoop-httpfs module; it is quite simple and self-contained. Cheers. Alejandro On Tue, Jan 31, 2012 at 8:37 PM, Harsh J wrote: > To write a custom filesystem, extend the FileSystem class. > Depending on the scheme it is supposed to se...
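
For orientation, a bare-bones skeleton (0.2x-era API; the scheme and class names are hypothetical, and the exact abstract-method set varies by Hadoop version, so your compiler may demand a few more stubs):

  import java.io.IOException;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;
  import org.apache.hadoop.fs.permission.FsPermission;
  import org.apache.hadoop.util.Progressable;

  public class MyFileSystem extends FileSystem {
    private URI uri;
    private Path workingDir = new Path("/");

    public void initialize(URI name, Configuration conf) throws IOException {
      super.initialize(name, conf);
      this.uri = name;
    }
    public URI getUri() { return uri; }
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
      throw new IOException("TODO: read from the backing store");
    }
    public FSDataOutputStream create(Path f, FsPermission perm, boolean overwrite,
        int bufferSize, short replication, long blockSize, Progressable progress)
        throws IOException {
      throw new IOException("TODO: write to the backing store");
    }
    public FSDataOutputStream append(Path f, int bufferSize, Progressable p)
        throws IOException {
      throw new IOException("append not supported");
    }
    public boolean rename(Path src, Path dst) throws IOException { return false; }
    public boolean delete(Path f, boolean recursive) throws IOException { return false; }
    public FileStatus[] listStatus(Path f) throws IOException { return new FileStatus[0]; }
    public void setWorkingDirectory(Path dir) { workingDir = dir; }
    public Path getWorkingDirectory() { return workingDir; }
    public boolean mkdirs(Path f, FsPermission permission) throws IOException { return false; }
    public FileStatus getFileStatus(Path f) throws IOException {
      throw new IOException("TODO: stat against the backing store");
    }
  }

Registering it is then a matter of setting fs.myscheme.impl to the class name in the configuration, so FileSystem.get() can resolve myscheme:// URIs.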

Re: Hybrid Hadoop with fork/join ?

2012-01-31 Thread Alejandro Abdelnur
Rob, Hadoop has a way to run Map tasks in multithreading mode; look for the MultithreadedMapRunner & MultithreadedMapper. Thanks. Alejandro. On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart wrote: > Hi, I'm investigating the feasibility of a hybrid approach to parallel programming, by fu...
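
With the newer mapreduce API, a job-setup sketch looks like this (MyMapper is a hypothetical mapper class, and it must be thread-safe, since MultithreadedMapper invokes it from several threads within one task):

  Job job = new Job(new Configuration(), "multithreaded-maps");
  job.setMapperClass(MultithreadedMapper.class);
  // the actual map logic runs inside the multithreaded wrapper
  MultithreadedMapper.setMapperClass(job, MyMapper.class);
  MultithreadedMapper.setNumberOfThreads(job, 8);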

Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised

2012-01-09 Thread Alejandro Abdelnur
Bill, In addition you must call DistributedCache.createSymlink(configuration); that should do it. Thxs. Alejandro On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill wrote: > I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes with a softlink to the unzippe...
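
Both calls together, as a sketch (old mapred API; the archive path and link name are hypothetical):

  // ship the zip via the distributed cache; '#stuff' names the symlink
  DistributedCache.addCacheArchive(new URI("hdfs://nn:8020/tmp/stuff.zip#stuff"), conf);
  // without this call the symlink is not created in the task working dir
  DistributedCache.createSymlink(conf);
  // tasks can then read the unpacked contents under ./stuff/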

Re: Timer jobs

2011-09-01 Thread Alejandro Abdelnur
[moving common-user@ to BCC] Oozie is not HA yet, but it would be relatively easy to make it so; it was designed with that in mind, and we even did a prototype. Oozie consists of 2 services: a SQL database to store the Oozie jobs' state and a servlet container where the Oozie app proper runs. The solution f...

Re: Oozie monitoring

2011-08-30 Thread Alejandro Abdelnur
Avi, For Oozie related questions, please subscribe and use the oozie-...@incubator.apache.org alias. Thanks. Alejandro On Tue, Aug 30, 2011 at 2:28 AM, Avi Vaknin wrote: > Hi All, > > First, I really enjoy writing you and I'm thankful for your help. > > I have Oozie installed on dedicated ser

Re: Oozie on the namenode server

2011-08-29 Thread Alejandro Abdelnur
[Moving thread to Oozie aliases and hadoop's alias to BCC] Avi, Currently you can have a cold standby solution. An Oozie setup consists of 2 systems: a SQL DB (storing all Oozie jobs' state) and a servlet container (running Oozie proper). You need your DB to be highly available. You need to have a s...

Re: Hoop into 0.23 release

2011-08-22 Thread Alejandro Abdelnur
...https://issues.apache.org/jira/browse/HADOOP-7560 Thanks. Alejandro On Mon, Aug 22, 2011 at 3:42 PM, Tsz Wo Sze wrote: > +1 > I believe HDFS-2178 is very close to being committed. Great work Alejandro! > Nicholas > From: Alejandro Abdelnur > To...

Hoop into 0.23 release

2011-08-22 Thread Alejandro Abdelnur
Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://iss...

Re: Multiple Output Formats

2011-07-27 Thread Alejandro Abdelnur
Roger, Or you can take a look at Hadoop's MultipleOutputs class. Thanks. Alejandro On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu wrote: > On July 26, 2011 06:11:33 PM Roger Chen wrote: > > Hi all, I am attempting to implement MultipleOutputFormat to write data to multiple files...
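
A condensed sketch of the usage (old mapred API; the "summary" output name and the word-count-style reducer are made up for illustration):

  // driver: declare a named output next to the default one (jobConf is the JobConf)
  MultipleOutputs.addNamedOutput(jobConf, "summary",
      TextOutputFormat.class, Text.class, IntWritable.class);

  // reducer: obtain a collector for the named output and write to it
  private MultipleOutputs mos;
  public void configure(JobConf job) { mos = new MultipleOutputs(job); }
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int total = 0;
    while (values.hasNext()) { total += values.next().get(); }
    output.collect(key, new IntWritable(total));  // regular output
    mos.getCollector("summary", reporter).collect(key, new IntWritable(total));
  }
  public void close() throws IOException { mos.close(); }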

Re: EXT :Re: Problem running a Hadoop program with external libraries

2011-03-05 Thread Alejandro Abdelnur
Why don't you put your native library in HDFS and use the DistributedCache to make it available to the tasks? For example: copy 'foo.so' to 'hdfs://localhost:8020/tmp/foo.so', then add it to the job's distributed cache: DistributedCache.addCacheFile("hdfs://localhost:8020/tmp/foo.so#foo.so", jobConf...
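
Spelled out as a sketch (note the real addCacheFile signature takes a java.net.URI rather than a String; host/port and file name as in the example above):

  // driver side: ship foo.so with a '#foo.so' symlink in each task's cwd
  DistributedCache.addCacheFile(
      new URI("hdfs://localhost:8020/tmp/foo.so#foo.so"), jobConf);
  DistributedCache.createSymlink(jobConf);

  // task side, e.g. in Mapper.configure(): load it from the working directory
  System.load(new File("foo.so").getAbsolutePath());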

Re: Accessing Hadoop using Kerberos

2011-01-12 Thread Alejandro Abdelnur
...have acquired the TicketGrantingTicket from the Authentication Server and the Service Ticket from the Ticket Granting Server. Now how to authenticate myself with hadoop by sending the service ticket received from the Ticket Granting Server? > Regards, > Pikini > On Wed, Jan 12, 201...

Re: Accessing Hadoop using Kerberos

2011-01-12 Thread Alejandro Abdelnur
If you kinit-ed successfully you are done. The hadoop libraries will do the trick of authenticating the user against Hadoop. Alejandro On Thu, Jan 13, 2011 at 12:46 PM, Muruga Prabu M wrote: > Hi, I have a Java program to upload and download files from the HDFS. I am using Hadoop with...
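
Client-side that amounts to very little code; a sketch assuming core-site.xml sets hadoop.security.authentication=kerberos and kinit has already populated the ticket cache:

  Configuration conf = new Configuration();
  UserGroupInformation.setConfiguration(conf);
  // picks up the Kerberos TGT that kinit left in the local ticket cache
  UserGroupInformation ugi = UserGroupInformation.getLoginUser();
  // subsequent RPCs (e.g. HDFS reads/writes) authenticate automatically
  FileSystem fs = FileSystem.get(conf);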

Re: how to run jobs every 30 minutes?

2010-12-14 Thread Alejandro Abdelnur
...coordinator jobs). Regards. Alejandro On Tue, Dec 14, 2010 at 1:26 PM, edward choi wrote: > Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! > Ed > 2010/12/8 Alejandro Abdelnur > Or, if you want...

Re: how to run jobs every 30 minutes?

2010-12-08 Thread Alejandro Abdelnur
Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi wrote: > My mistake. Come to think about it, you are right, I can just make an > infinite loop inside the Hadoop application. > Thanks for the reply. > > 2010/12/7 Harsh
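
A minimal coordinator sketch; the times, names, and workflow path are placeholders:

  <coordinator-app name="every-30-min" frequency="${coord:minutes(30)}"
                   start="2010-12-08T00:00Z" end="2011-12-08T00:00Z"
                   timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
      <workflow>
        <app-path>hdfs://nn:8020/user/me/my-workflow</app-path>
      </workflow>
    </action>
  </coordinator-app>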

Re: HDFS Rsync process??

2010-11-30 Thread Alejandro Abdelnur
The other approach, if the DR cluster is idle or has enough excess capacity, would be running all the jobs on the input data in both clusters and performing checksums on the outputs to ensure everything is consistent. And you could take advantage of this and distribute ad hoc queries between the 2 clusters.
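
A hedged sketch of the checksum step, assuming both clusters use identical block sizes (HDFS file checksums are only comparable then; the namenode URIs and paths are hypothetical):

  Configuration conf = new Configuration();
  FileSystem primary = FileSystem.get(URI.create("hdfs://nn-primary:8020"), conf);
  FileSystem dr = FileSystem.get(URI.create("hdfs://nn-dr:8020"), conf);
  Path part = new Path("/jobs/daily/output/part-00000");
  // MD5-of-MD5-of-CRC checksum computed by each cluster over its copy
  FileChecksum a = primary.getFileChecksum(part);
  FileChecksum b = dr.getFileChecksum(part);
  boolean consistent = (a != null && a.equals(b));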

Re: how to set diffent VM parameters for mappers and reducers?

2010-10-07 Thread Alejandro Abdelnur
...java.opts > but looks like hadoop-0.20.2 ignores it. > On which version have you seen it working? > Regards, > Vitaliy S > On Tue, Oct 5, 2010 at 5:14 PM, Alejandro Abdelnur wrote: > The following 2 properties should work: > mapred...

Re: how to set diffent VM parameters for mappers and reducers?

2010-10-05 Thread Alejandro Abdelnur
The following 2 properties should work: mapred.map.child.java.opts mapred.reduce.child.java.opts Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel wrote: > > Hi, > > You don't say which version of Hadoop you are using. > Going from memory, I believe in the CDH3 release from Cloudera, the
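
In mapred-site.xml that looks like the following (the heap sizes are just examples; per the follow-up above, 0.20.2 predates these two properties):

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>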

Re: Re: Help!!The problem about Hadoop

2010-10-05 Thread Alejandro Abdelnur
Or you could try using MultiFileInputFormat for your MR job. http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro On Tue, Oct 5, 2010 at 4:55 PM, Harsh J wrote: > 500 small files comprising one gigabyte? Perhaps you should try > concat

Re: is there no streaming.jar file in hadoop-0.21.0??

2010-10-05 Thread Alejandro Abdelnur
Edward, Yep, you should use the one from contrib/ Alejandro On Tue, Oct 5, 2010 at 1:55 PM, edward choi wrote: > Thanks, Tom. > Didn't expect the author of THE BOOK would answer my question. Very > surprised and honored :-) > Just one more question if you don't mind. > I read it on the Internet

Re: Relation between number of map tasks and input splits

2010-09-23 Thread Alejandro Abdelnur
And keep in mind that one split is not necessarily 1 file. That depends on the InputFormat. For example, the MultiFileInputFormat clubs together multiple files in 1 split. On Thu, Sep 23, 2010 at 3:16 PM, Greg Roelofs wrote: > > Can a map task work on more than one input split? > As far as I ca...

Re: Classpath

2010-08-28 Thread Alejandro Abdelnur
Yes, you can do #1, but I wouldn't say it is practical. You can do #2 as well, as you suggest. But, IMO, the best way is copying the JARs to HDFS and using the DistributedCache. A On Sun, Aug 29, 2010 at 1:29 PM, Mark wrote: > How can I add jars to Hadoop's classpath when running MapReduce jobs for...
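
A sketch of that route (the jar paths are hypothetical):

  // upload the jar to HDFS once, then reference it at job-setup time
  FileSystem fs = FileSystem.get(conf);
  fs.copyFromLocalFile(new Path("build/mylib.jar"), new Path("/libs/mylib.jar"));
  // puts the HDFS jar on every task's classpath via the distributed cache
  DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);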

Re: REST web service on top of Hadoop

2010-07-29 Thread Alejandro Abdelnur
In Oozie we are working on MR/Pig job submission over HTTP. On Thu, Jul 29, 2010 at 5:09 PM, Steve Loughran wrote: > S. Venkatesh wrote: >> HDFS Proxy in contrib provides an HTTP interface over HDFS. It's not very RESTful but we are working on a new version which will have a REST API. >> AF...

Re: Hadoop multiple output files

2010-06-28 Thread Alejandro Abdelnur
...with the name, but that didn't do anything. > Thanks, > Adam > On 6/28/10 6:17 PM, "Alejandro Abdelnur" wrote: >> Check the MultipleOutputs class >> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

Re: FW: Hadoop multiple output files

2010-06-28 Thread Alejandro Abdelnur
Check the MultipleOutputs class: http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html On Mon, Jun 28, 2010 at 5:31 PM, Adam Silberstein wrote: > Hi, I would like to run a hadoop job that writes to multiple output files. I see a class called Mult...

Re: preserve JobTracker information

2010-05-19 Thread Alejandro Abdelnur
Also you can configure the job tracker to keep the RunningJob information for completed jobs (available via the Hadoop Java API). There is a config property that enables this, another that specifies the location (it can be HDFS or local), and another that specifies for how many hours you want to keep t...
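
A sketch of the three mapred-site.xml properties from that era (the values are examples):

  <property>
    <name>mapred.job.tracker.persist.jobstatus.active</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.job.tracker.persist.jobstatus.hours</name>
    <value>24</value>
  </property>
  <property>
    <name>mapred.job.tracker.persist.jobstatus.dir</name>
    <value>/jobtracker/jobsInfo</value>  <!-- HDFS or local path -->
  </property>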

Re: MultipleTextOutputFormat splitting output into different directories.

2009-09-15 Thread Alejandro Abdelnur
Using the MultipleOutputs class ( http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html ) you can split data into different files in the output dir. After your job finishes you can move the files to different directories. The benefit of doing this is that...