Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Hi Manoj, Reply inline. On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote: Hi All, The normal Hadoop job submission process involves: checking the input and output specifications of the job, computing the InputSplits for the job, and setting up the requisite accounting information
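For context, a minimal sketch of the old-API driver this thread is discussing; calling JobClient.runJob(conf) is what kicks off the spec checks, split computation, and accounting setup listed above. The class name and argument handling are invented for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitExample.class);
            conf.setJobName("submit-example");
            // Identity map/reduce over text input, so the output types
            // are LongWritable/Text.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // runJob() is where the steps above happen: input/output specs
            // are validated, InputSplits are computed, and the accounting
            // info is set up before the job goes to the JobTracker.
            JobClient.runJob(conf);
        }
    }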

Locks in M/R framework

2012-08-13 Thread David Ginzburg
Hi, I have an HDFS folder and an M/R job that periodically updates it by replacing the data with newly generated data. I have a different M/R job that, periodically or ad hoc, processes the data in the folder. The second job, naturally, sometimes fails when the data is replaced by newly generated

Re: Locks in M/R framework

2012-08-13 Thread Tim Robertson
How about introducing a distributed coordination and locking mechanism? ZooKeeper would be a good candidate for that kind of thing. On Mon, Aug 13, 2012 at 12:52 PM, David Ginzburg ginz...@hotmail.com wrote: Hi, I have an HDFS folder and M/R job that periodically updates it by replacing the
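A minimal sketch of the ZooKeeper idea, assuming an ensemble is reachable; the znode path, timeout, and class name are invented, and a real deployment would likely use the proper lock recipe (or a library such as Curator) rather than this bare ephemeral-node check.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class FolderLock {
        private static final String LOCK_PATH = "/locks/hdfs-folder"; // invented path
        private final ZooKeeper zk;

        public FolderLock(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 30000, event -> { });
        }

        // Returns true if we hold the lock; the ephemeral node disappears
        // automatically if this client dies, so the lock cannot leak.
        public boolean tryLock() throws Exception {
            try {
                zk.create(LOCK_PATH, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException alreadyHeld) {
                return false; // another job is currently using the folder
            }
        }

        public void unlock() throws Exception {
            zk.delete(LOCK_PATH, -1);
        }
    }

Both jobs would call tryLock() before touching the folder and unlock() when done.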

Re: doubt on Hadoop job submission process

2012-08-13 Thread Manoj Babu
Hi Harsh, Thanks for your reply. Consider that in my main program I am doing many activities (reading/writing/updating, non-Hadoop activities) before invoking JobClient.runJob(conf); is there any way to separate the process flow programmatically instead of going for a workflow engine? Cheers! Manoj.

Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Sure, you may separate the logic as you want it to be; just ensure the configuration object has a proper setJar or setJarByClass done on it before you submit the job. On Mon, Aug 13, 2012 at 4:43 PM, Manoj Babu manoj...@gmail.com wrote: Hi Harsh, Thanks for your reply. Consider that in my
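A minimal sketch of that separation, keeping the non-Hadoop work in plain methods around the submission; MyDriver and the placeholder method names are invented.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            doNonHadoopWork(); // reading/writing/updating, no cluster involved

            JobConf conf = new JobConf();
            // As Harsh notes: without setJar/setJarByClass the TaskTrackers
            // cannot locate your classes when the job runs remotely.
            conf.setJarByClass(MyDriver.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);

            doNonHadoopPostWork();
        }

        private static void doNonHadoopWork()     { /* placeholder */ }
        private static void doNonHadoopPostWork() { /* placeholder */ }
    }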

Re: Locks in M/R framework

2012-08-13 Thread Harsh J
David, While ZK can solve this, locking may only make you slower. Let's try to keep it simple. Have you considered keeping two directories? One where the older data is moved to (by the first job, instead of replacing files), for consumption by the second job, which triggers by watching this
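A minimal sketch of that two-directory rotation, assuming the producer job writes into a staging directory first; all paths and the class name are invented. Rename in HDFS is an atomic metadata operation, which is what makes the publish step safe.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RotateDirs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path staging  = new Path("/data/staging");  // producer job's output
            Path current  = new Path("/data/current");  // what the consumer reads
            Path previous = new Path("/data/previous"); // kept for late readers

            fs.delete(previous, true);     // drop the oldest generation
            fs.rename(current, previous);  // keep the last generation around
            fs.rename(staging, current);   // publish the new data atomically
        }
    }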

Re: help in distribution of a task with hadoop

2012-08-13 Thread Pierre Antoine DuBoDeNa
We have all documents moved to HDFS. I understand that with our 1st option we need more I/O than what you say, but let's say that's not a problem for now. Could you please point me to option 2)? How could we do that? Any tutorial or example? Thanks 2012/8/13 Bertrand Dechoux decho...@gmail.com 1) A

Re: help in distribution of a task with hadoop

2012-08-13 Thread Bejoy Ks
Hi Bertrand, the -libjars option works well with the 'hadoop jar' command. Instead of executing your runnable with the plain java 'jar' command, use 'hadoop jar'. When you use hadoop jar you can ship the dependent jars/files etc. by 1) including them in the /lib folder in your jar, or 2) using -libjars /
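For -libjars to take effect, the driver has to let GenericOptionsParser see the arguments, which is what ToolRunner does. A minimal driver shape, with MyTool as a placeholder name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyTool extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf(); // -libjars has already been applied here
            // ... build and submit the job using this conf ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
        }
    }

Invoked as: hadoop jar my-job.jar MyTool -libjars dep1.jar,dep2.jar input output (the jar and path names are placeholders).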

Re: help in distribution of a task with hadoop

2012-08-13 Thread Pierre Antoine DuBoDeNa
You mean like that: hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output Where Rdg_lib is a folder containing all required classes/jars stored on HDFS. We get this error, though. Are we doing something wrong? 12/08/10 08:16:24 ERROR security.UserGroupInformation:

Re: hadoop-fuse

2012-08-13 Thread Harsh J
Hi Rishab, Please provide the outputs of: $ uname -a; lsb_release -a $ file $HADOOP_HOME/bin/fuse_dfs $ $HADOOP_HOME/bin/hadoop version On Mon, Aug 13, 2012 at 1:25 PM, Rishabh Agrawal rishabh.agra...@impetus.co.in wrote: So do I have to download the fuse libraries and install them before running

Re: Hadoop in Pseudo-Distributed

2012-08-13 Thread Harsh J
Subho, Can you try to tweak mapred.task.tracker.http.address in mapred-site.xml and set it to always bind to localhost (i.e. set it to localhost:50060 instead of the default 0.0.0.0:50060), and then see if you still get this behavior? On Mon, Aug 13, 2012 at 12:37 PM, Subho Banerjee

Re: Hbase JDBC API

2012-08-13 Thread Jeff Hung
Hi Sandeep, You may try JackHare: http://sourceforge.net/projects/jackhare/. Regards, Jeff Hung

RE: Hadoop in Pseudo-Distributed

2012-08-13 Thread Rishabh Agrawal
Thanks Harsh. I think I have resolved the issue. Now another problem has come up: after I add 'fuse-dfs#dfs://localhost:8020 &lt;mount point&gt; fuse allow_other,usetrash,rw 2 0' to fstab and execute 'mount &lt;mount point&gt;', I get '/bin/sh: fuse-dfs: not found'. Any tip on that? -Rishabh -Original Message-

Re: Hadoop in Pseudo-Distributed

2012-08-13 Thread Subho Banerjee
Hello Harsh, I tried setting it, but it doesn't seem to help. There is also something else that I found out: the link http://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201208131655_0001_m_00_0&filter=stderr works and actually returns me the error, however

DataNode and TaskTracker communication

2012-08-13 Thread Björn-Elmar Macek
Hi, I am currently trying to run my Hadoop program on a cluster. Sadly, though, my datanodes and tasktrackers seem to have difficulties with their communication, as their logs say: * Some datanodes and tasktrackers seem to have port problems of some kind, as can be seen in the logs below. I

Re: DataNode and TaskTracker communication

2012-08-13 Thread Mohammad Tariq
Hello there, Could you please share your /etc/hosts file, if you don't mind? Regards, Mohammad Tariq On Mon, Aug 13, 2012 at 6:01 PM, Björn-Elmar Macek ma...@cs.uni-kassel.de wrote: Hi, I am currently trying to run my Hadoop program on a cluster. Sadly though my datanodes and

Re: DataNode and TaskTracker communication

2012-08-13 Thread Michael Segel
If the nodes can communicate and distribute data, then the odds are that the issue isn't going to be in his /etc/hosts. A more relevant question is if he's running a firewall on each of these machines? A simple test... ssh to one node, ping other nodes and the control nodes at random to see

Re: how to enhance job start up speed?

2012-08-13 Thread Bertrand Dechoux
I am not sure I understand, and I guess I am not the only one. 1) What's a worker in your context? Only the logic inside your Mapper, or something else? 2) You should clarify your cases. You seem to have two cases, but both involve overhead, so I am assuming there is a baseline? Hadoop vs sequential,

unsubscribe

2012-08-13 Thread Christian Gonzalez
SIGIS Soluciones Integrales GIS C.A

Re: DataNode and TaskTracker communication

2012-08-13 Thread Mohammad Tariq
Hi Michael, I asked for the hosts file because there seems to be some loopback problem to me. The log shows that the call is going to 0.0.0.0. Apart from what you have said, I think disabling IPv6 and making sure that there is no problem with DNS resolution is also necessary. Please correct me if I

Re: Hadoop in Pseudo-Distributed mode on Mac OS X 10.8

2012-08-13 Thread Mohammad Tariq
Hello Subho, Please check what the permission of mapred.local.dir is. This is the place where map outputs are stored. The reduce phase is sending the read requests, but this directory is not accessible; as a result a 403 is thrown. Regards, Mohammad Tariq On Mon, Aug 13, 2012 at 9:51 AM,

Re: Can not generate a result

2012-08-13 Thread Mohammad Tariq
Hello Astie, Please make sure your datanode is up. I think you have not included the hadoop.tmp.dir, dfs.name.dir and dfs.data.dir properties. The values of these props default to the /tmp dir, which gets emptied on each restart. As a result you lose all your data and meta information. Regards,

Re: how to enhance job start up speed?

2012-08-13 Thread Matthias Kricke
Ok, I will try to clarify: 1) The worker is the logic inside my mapper, and it is the same for both cases. 2) I have two cases. In the first one I use Hadoop to execute my worker, and in the second one I execute my worker without Hadoop (a simple read of the file). Now I measured, for both cases, the time the

Re: how to enhance job start up speed?

2012-08-13 Thread Bejoy KS
Hi Matthias When a MapReduce program is used there are some extra steps, like checking the input and output dirs, calculating the input splits, the JT assigning a TT for executing each task, etc. If your file is non-splittable, then one map task per file will be generated irrespective of the
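As a sketch of the non-splittable point: with the old mapred API, overriding isSplitable forces one map task per file. The class name here is invented.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // the whole file is handed to a single map task
        }
    }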

Re: Hadoop hardware failure recovery

2012-08-13 Thread Harsh J
Aji, The best place would be to ask on Apache Accumulo's own user lists, subscribable at http://accumulo.apache.org/mailing_list.html That said, if Accumulo bases itself on HDFS, then its data safety should be the same or nearly the same as what HDFS itself can offer. Note that with 2.1.0
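A minimal sketch of the hsync() usage Harsh refers to, assuming a 2.1.0+ client and cluster; the path, payload, and class name are invented.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HsyncExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/tmp/durable.txt"));
            out.writeUTF("a record that must survive a crash");
            // hsync() asks the DataNodes in the pipeline to flush this data
            // to disk before returning, analogous to fsync() on a local file.
            out.hsync();
            out.close();
        }
    }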

Re: DataNode and TaskTracker communication

2012-08-13 Thread Michael Segel
0.0.0.0 means that the call is going to all interfaces on the machine. (Shouldn't be an issue...) IPv4 vs IPv6? Could be an issue; however, the OP says he can write data to DNs and they seem to communicate, therefore if it's IPv6-related, wouldn't it impact all traffic and not just a specific port?

Re: DataNode and TaskTracker communication

2012-08-13 Thread Mohammad Tariq
Thank you so very much for the detailed response Michael. I'll keep the tip in mind. Please pardon my ignorance, as I am still in the learning phase. Regards, Mohammad Tariq On Mon, Aug 13, 2012 at 8:29 PM, Michael Segel michael_se...@hotmail.comwrote: 0.0.0.0 means that the call is

Re: how to enhance job start up speed?

2012-08-13 Thread Bertrand Dechoux
It was almost what I was getting at, but I was not sure about your problem. Basically, Hadoop is only adding overhead due to the way your job is constructed. Now the question is: why do you need a single mapper? Is your need truly not 'parallelisable'? Bertrand On Mon, Aug 13, 2012 at 4:49 PM,

Re: Hadoop hardware failure recovery

2012-08-13 Thread Steve Loughran
On 13 August 2012 07:55, Harsh J ha...@cloudera.com wrote: Note that with 2.1.0 (upcoming) and above releases of HDFS, we offer a working hsync() API that allows you to write files with a guarantee that the data has been written to the disk (like the fsync() *nix call). A guarantee that the OS

Re: Can not generate a result

2012-08-13 Thread Harsh J
Hi Astie, Live Nodes: 0 That the live nodes = 0 is the real issue here. If you're running off of default configs (i.e. haven't overridden hadoop.tmp.dir, dfs.name.dir, or dfs.data.dir), do this: $ rm -rf /tmp/hadoop-$(whoami)/dfs/data And then: $ $HADOOP_HOME/bin/start-all.sh And you should

Re: how to enhance job start up speed?

2012-08-13 Thread Matthias Kricke
@Bejoy KS: Thanks for your advice. @Bertrand: It is parallelisable; this is just a test case. In later cases there will be a lot of big files, each of which should be processed completely in one map step. We want to minimize the overhead of network traffic. The idea is to execute some worker (could be

Re: how to enhance job start up speed?

2012-08-13 Thread Bertrand Dechoux
Seems like you want to misuse Hadoop, but maybe I still don't understand your context. The standard way would be to split your files into multiple maps. Each map could profit from data locality. Do a part of the worker stuff in the mapper and then use a reducer to aggregate all the results (which
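A minimal sketch of that split-then-aggregate shape in the old mapred API; the per-record "worker" here is a stand-in that merely counts records, and all names are invented.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WorkerJob {
        // Each map handles one split of a file and does its share of the work.
        public static class WorkerMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, LongWritable> {
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, LongWritable> out, Reporter r)
                    throws IOException {
                out.collect(new Text("records"), new LongWritable(1)); // partial result
            }
        }

        // The reducer aggregates the partial results from all the maps.
        public static class SumReducer extends MapReduceBase
                implements Reducer<Text, LongWritable, Text, LongWritable> {
            public void reduce(Text key, Iterator<LongWritable> values,
                               OutputCollector<Text, LongWritable> out, Reporter r)
                    throws IOException {
                long sum = 0;
                while (values.hasNext()) sum += values.next().get();
                out.collect(key, new LongWritable(sum));
            }
        }
    }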

Re: Hadoop hardware failure recovery

2012-08-13 Thread Steve Loughran
On 13 August 2012 08:42, Harsh J ha...@cloudera.com wrote: Hey Steve, Interesting, thanks for pointing that out! I didn't know that it disables this by default :) It's always something to watch out for: someone implementing a disk FS, OS, or VM environment discovering that they get great

Re: Native not compiling in OS X Mountain Lion

2012-08-13 Thread Alejandro Abdelnur
Mohamed, Currently Hadoop native code does not compile/run on any flavor of OS X. Thanks. Alejandro On Mon, Aug 13, 2012 at 2:59 AM, J Mohamed Zahoor jmo...@gmail.com wrote: Hi I have problems compiling natives in OS X 10.8 for trunk. Especially in Yarn projects. Anyone faced similar

group assignment on HDFS from Hadoop and Hive

2012-08-13 Thread Chen Song
I am wondering how Hadoop assigns groups when dirs/files are being created by a user, and below are some tests I have done. In my cluster, group hadoop is configured as the supergroup. hadoop fs -ls /tmp drwxrwxrwx - abc hadoop 0 2012-08-10 23:02 /tmp/abc drwxrwxrwx - def other_group
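For what it's worth, HDFS follows the BSD rule: a newly created file or directory gets the group of its parent directory, not the creating user's Unix group. A minimal probe, with an invented path and class name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GroupProbe {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/group-probe");
            fs.mkdirs(p);
            FileStatus st = fs.getFileStatus(p);
            // Expect the group to match /tmp's group, per the BSD rule.
            System.out.println(st.getOwner() + ":" + st.getGroup());
        }
    }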

Re: Can not generate a result

2012-08-13 Thread Harsh J
Astie, Since you've overridden these, do: $ rm -rf /home/astie/hdfs/data And then re-run your start-all command. After this works, please never re-issue a namenode -format unless you really want to wipe everything away and start over. On Mon, Aug 13, 2012 at 9:48 PM, Astie Darmayantie

Re: DataNode and TaskTracker communication

2012-08-13 Thread Sriram Ramachandrasekaran
The logs indicate an 'already in use' exception. Is that some sign? :) On 13 Aug 2012 20:36, Mohammad Tariq donta...@gmail.com wrote: Thank you so very much for the detailed response Michael. I'll keep the tip in mind. Please pardon my ignorance, as I am still in the learning phase. Regards,

Re: native not compiling in OS X Mountain Lion

2012-08-13 Thread Brandon Li
You may see a similar problem compiling the HDFS native code too, since it's not supported on OS X yet. Brandon On Sun, Aug 12, 2012 at 10:49 PM, J Mohamed Zahoor jmo...@gmail.com wrote: Hi I have problems compiling natives in OS X 10.8 for trunk. Especially in Yarn projects. Anyone faced

Re: Hadoop in Pseudo-Distributed mode on Mac OS X 10.8

2012-08-13 Thread Subho Banerjee
Where do I set this? On Mon, Aug 13, 2012 at 7:52 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Subho, Please check what the permission of mapred.local.dir is. This is the place where map outputs are stored. The reduce phase is sending the read requests but this directory is not

unsubscribe

2012-08-13 Thread Maheswaran
unsubscribe