Running Hadoop client as a different user

2013-05-13 Thread Steve Lewis
I have been running Hadoop on a cluster set to not check permissions. I would run a Java client on my local machine and would run as the local user on the cluster. I use: String connectString = "hdfs://" + host + ":" + port + "/"; Configuration config = new Configuration();
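A minimal sketch of such a client, with a hypothetical host and port; FileSystem.get(URI, Configuration) is the standard entry point for talking to a remote NameNode:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClient {
        public static void main(String[] args) throws Exception {
            String host = "namenode.example.com"; // hypothetical host
            int port = 8020;                      // hypothetical NameNode RPC port
            String connectString = "hdfs://" + host + ":" + port + "/";
            Configuration config = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(connectString), config);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath()); // list the root to verify connectivity
            }
            fs.close();
        }
    }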

Re: Running Hadoop client as a different user

2013-05-13 Thread Harsh J
Hi Steve, A normally-written client program would work normally on both permissions and no-permissions clusters. There is no concept of a password for users in Apache Hadoop as of yet, unless you're dealing with a specific cluster that has custom-implemented it. Setting a specific user is not
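On a simple-auth (non-Kerberos) cluster, the usual way to act as a particular user from code is UserGroupInformation.doAs; a sketch, with "hdfsuser" as a hypothetical user name:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class RunAsUser {
        public static void main(String[] args) throws Exception {
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hdfsuser");
            ugi.doAs(new PrivilegedExceptionAction<Void>() {
                public Void run() throws Exception {
                    Configuration conf = new Configuration();
                    FileSystem fs = FileSystem.get(conf);
                    // Filesystem operations here execute as "hdfsuser"
                    System.out.println(fs.exists(new Path("/user/hdfsuser")));
                    return null;
                }
            });
        }
    }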

Re: The minimum memory requirements to datanode and namenode?

2013-05-13 Thread Nitin Pawar
Just one node not having memory does not mean your cluster is down. Can you see your HDFS health on the NN UI? How much memory do you have on the NN? If there are no jobs running on the cluster then you can safely restart the datanode and tasktracker. Also run a top command and figure out which processes

600s timeout during copy phase of job

2013-05-13 Thread David Parks
I have a job that's getting 600s task timeouts during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and it's taking longer than 10 min to do that copy. The process starts copying when the reduce step is 80% complete. This is a very IO bound task as
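The 600s figure matches the default MR1 task timeout of 600000 ms; if the copies are legitimately slow rather than stuck, one workaround (a sketch, with an arbitrary 30-minute value) is to raise mapred.task.timeout in mapred-site.xml:

    <property>
      <name>mapred.task.timeout</name>
      <!-- milliseconds before a task reporting no progress is killed; default 600000 -->
      <value>1800000</value>
    </property>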

Re: The minimum memory requirements to datanode and namenode?

2013-05-13 Thread sam liu
I can issue the command 'hadoop dfsadmin -report', but it does not return any result for a long time. Also, I can open the NN UI (http://namenode:50070), but it always stays in a connecting status and never returns any cluster statistics. The mem of NN: total used

Re: The minimum memory requirements to datanode and namenode?

2013-05-13 Thread Nitin Pawar
4GB memory on the NN? This will run out of memory in a few days. You will need to make sure your NN has at least double the RAM of your DNs if you have a miniature cluster. On Mon, May 13, 2013 at 11:52 AM, sam liu samliuhad...@gmail.com wrote: I can issue a command 'hadoop dfsadmin -report',

How to combine input files for a MapReduce job

2013-05-13 Thread Agarwal, Nikhil
Hi, I have a 3-node cluster, with the JobTracker running on one machine and TaskTrackers on the other two. Instead of using HDFS, I have written my own FileSystem implementation. As an experiment, I kept 1000 text files (all of the same size) on both the slave nodes and ran a simple Wordcount MR job. It

Re: How to combine input files for a MapReduce job

2013-05-13 Thread Harsh J
For the 'control the number of mappers' question: you can use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html, which is designed to solve such cases. However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll
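The old (mapred) API class is abstract, so it has to be subclassed; a sketch of a typical wiring, delegating each file chunk of a combined split to the stock LineRecordReader, with an arbitrary 128 MB packing target:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    public class CombinedTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {
        public CombinedTextInputFormat() {
            setMaxSplitSize(128 * 1024 * 1024); // pack up to ~128 MB of small files per map
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf conf, Reporter reporter) throws IOException {
            return new CombineFileRecordReader<LongWritable, Text>(
                    conf, (CombineFileSplit) split, reporter, (Class) LineReaderWrapper.class);
        }

        // Reads one file chunk of the combined split with the plain-text line reader.
        public static class LineReaderWrapper implements RecordReader<LongWritable, Text> {
            private final LineRecordReader delegate;

            public LineReaderWrapper(CombineFileSplit split, Configuration conf,
                                     Reporter reporter, Integer idx) throws IOException {
                FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx),
                        split.getLength(idx), split.getLocations());
                delegate = new LineRecordReader(conf, fileSplit);
            }

            public boolean next(LongWritable key, Text value) throws IOException {
                return delegate.next(key, value);
            }
            public LongWritable createKey() { return delegate.createKey(); }
            public Text createValue() { return delegate.createValue(); }
            public long getPos() throws IOException { return delegate.getPos(); }
            public float getProgress() throws IOException { return delegate.getProgress(); }
            public void close() throws IOException { delegate.close(); }
        }
    }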

Re: How to combine input files for a MapReduce job

2013-05-13 Thread Harsh J
Shashwat, Tweaking the split sizes affects a single input split, not how the splits are packed. It may be used with CombineFileInputFormat to control packed split sizes, but would otherwise not help merge the processing of several blocks across files into the same map task. On Mon, May

RE: How to combine input files for a MapReduce job

2013-05-13 Thread Agarwal, Nikhil
Hi, @Harsh: Thanks for the reply. Would the patch work with the Hadoop 1.0.4 release? -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Monday, May 13, 2013 1:03 PM To: user@hadoop.apache.org Subject: Re: How to combine input files for a MapReduce job For control number of

Re: How to combine input files for a MapReduce job

2013-05-13 Thread Harsh J
Yes I believe the branch-1 patch attached there should apply cleanly to 1.0.4. On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil nikhil.agar...@netapp.com wrote: Hi, @Harsh: Thanks for the reply. Would the patch work in Hadoop 1.0.4 release? -Original Message- From: Harsh J

Re: The minimum memory requirements to datanode and namenode?

2013-05-13 Thread shashwat shriparv
Due to the small amount of memory available to the nodes, they are not able to send responses in time, hence the socket connection exception; there may be some network issues too. Please check which program is using the memory, as there may be some other co-hosted application eating it up. ps -e

FileNotFoundException When DistributedCache file with YARN

2013-05-13 Thread YouPeng Yang
Hi, I adopted the distributed cache to implement a semi-join. I am using CDH4.1.2. The map-side setup function is as in [1]. It works well in my Eclipse Indigo, however it goes wrong when I run it from the CLI: the exception in one of the containers refers to [2]. How could I solve this exception?

RE: How to combine input files for a MapReduce job

2013-05-13 Thread Agarwal, Nikhil
Hi Harsh, I applied the changes of the patch to the Hadoop source code, but can you please tell me exactly where this log is being printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.

RE: How to combine input files for a MapReduce job

2013-05-13 Thread Agarwal, Nikhil
Hi, I got it. The log info is printed in the userlogs folder on the slave nodes, in the file syslog. Thanks, Nikhil -Original Message- From: Agarwal, Nikhil Sent: Monday, May 13, 2013 4:10 PM To: 'user@hadoop.apache.org' Subject: RE: How to combine input files for a MapReduce job Hi Harsh,

Using FairScheduler to limit # of tasks

2013-05-13 Thread David Parks
Can I use the FairScheduler to limit the number of map/reduce tasks directly from the job configuration? E.g. I have one job that I know should run a more limited number of map/reduce tasks than the default. I want to configure a queue with a limited number of map/reduce tasks, but only apply it to

Re: FileNotFoundException When DistributedCache file with YARN

2013-05-13 Thread YouPeng Yang
Hi, I fixed the problem just by adding the call job.setJar(MyJarName), and the job went well. But I have no idea why this call fixes the exception. Any suggestion will be appreciated. Regards. 2013/5/13 YouPeng Yang yypvsxf19870...@gmail.com HI I adopt distributed cache to
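A sketch of the fix described, with hypothetical jar and cache-file paths; job.setJar tells the framework which jar to ship to the containers, which can otherwise fail to localize the job's classes:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class SemiJoinDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "semi-join");
            job.setJar("/local/path/myjob.jar"); // hypothetical path; this was the missing call
            DistributedCache.addCacheFile(
                    new URI("/user/me/lookup.txt"), job.getConfiguration()); // hypothetical cache file
            // ... set mapper, input and output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }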

Re: Using FairScheduler to limit # of tasks

2013-05-13 Thread Michel Segel
Using the fair scheduler or capacity scheduler, you are creating a queue that is applied to the cluster. Having said that, you can limit who uses the special queue, as well as specify the queue at the start of your job as a command-line option. HTH Sent from a remote device. Please excuse
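For the MR1 fair scheduler, a sketch of such a capped pool in the allocations file, with a hypothetical pool name; the per-pool slot caps and the mapred.fairscheduler.pool job property are the relevant knobs:

    <?xml version="1.0"?>
    <allocations>
      <pool name="limited">
        <maxMaps>10</maxMaps>
        <maxReduces>5</maxReduces>
      </pool>
    </allocations>

The one job can then be pointed at the pool at submission time (assuming the driver uses ToolRunner):

    hadoop jar myjob.jar MyJob -Dmapred.fairscheduler.pool=limited <args>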

Re: 600s timeout during copy phase of job

2013-05-13 Thread Michel Segel
That doesn't make sense... Try introducing a combiner step. Sent from a remote device. Please excuse any typos... Mike Segel On May 13, 2013, at 3:30 AM, shashwat shriparv dwivedishash...@gmail.com wrote: On Mon, May 13, 2013 at 11:35 AM, David Parks davidpark...@yahoo.com wrote: (I’ve

OpenCL with Hadoop

2013-05-13 Thread rohit sarewar
Hi All, How do I use OpenCL (for GPU compute) with Hadoop? It would be great if someone could share some sample code. Thanks Regards Rohit Sarewar

Access HDFS from OpenCL

2013-05-13 Thread rohit sarewar
Hi All, My data set resides in HDFS. I need to compute 5 metrics, among which 2 are compute intensive. So I want to compute those 2 metrics on the GPU using OpenCL, and the remaining 3 metrics using Java MapReduce code on Hadoop. How can I pass data from HDFS to the GPU? Or how can my OpenCL code access data

RE: Access HDFS from OpenCL

2013-05-13 Thread David Parks
Hadoop just runs as a standard Java process; you should find something that bridges between OpenCL and Java. A quick Google search yields: http://www.jocl.org/ I expect that you'll find everything you need to accomplish the handoff from your MapReduce code to OpenCL there. As for HDFS,
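On the HDFS side, a sketch of pulling a file into memory before the GPU handoff; the file path is hypothetical and the OpenCL call is left as a comment since it depends on the chosen bridge:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsToGpu {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/metrics.bin"); // hypothetical input file
            byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
            FSDataInputStream in = fs.open(path);
            in.readFully(0, buf); // positioned read of the whole file into a local buffer
            in.close();
            // Hand buf to OpenCL here, e.g. through a JOCL buffer or a JNI bridge.
            System.out.println("read " + buf.length + " bytes for GPU processing");
        }
    }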

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Nitin Pawar
You do not need root if you want to install everything in your home directory, assuming the Sun JDK is installed. On May 13, 2013 8:13 PM, Raj Hadoop hadoop...@yahoo.com wrote: Hi, I am planning to install Hadoop on Linux in a Pseudo Distributed Mode ( One Machine ). Do I require 'root' privileges

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Mohammad Tariq
Hello Raj, Install in what sense? Are you planning to use Apache's package? If that is the case you just have to download and unzip it, and you don't need root privileges for that. Or something else like CDH? I'm sorry, I didn't quite get the question. Warm Regards, Tariq cloudfront.blogspot.com

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Raj Hadoop
I am thinking of installing both the CDH and Apache versions. So are you saying that if I install CDH I require root privileges? From: Mohammad Tariq donta...@gmail.com To: user@hadoop.apache.org user@hadoop.apache.org; Raj Hadoop hadoop...@yahoo.com Sent: Monday,

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Nitin Pawar
If you want to install CDH, then you will need root access, as it needs to install RPMs. For Apache downloads, it's not needed. On Mon, May 13, 2013 at 8:25 PM, Raj Hadoop hadoop...@yahoo.com wrote: I am thinking to install both CDH and Apache version. So are you saying if i install CDH - do i

Re: Wrapping around BitSet with the Writable interface

2013-05-13 Thread Jim Twensky
Thanks for the suggestions. I ended up switching to JDK 1.7+ just to make the code more readable. I will take a look at the EWAH implementation as well. Jim On Sun, May 12, 2013 at 3:40 PM, Bertrand Dechoux decho...@gmail.com wrote: You can disregard my links as they are only valid for Java

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Raj Hadoop
So for CDH, while installing, what do I request from my Unix admin? Any tips? I am requesting a separate user on the Linux box. What kind of privileges are required for the new user? And does this new user need to have some kind of temporary root access? How does this work

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread shashwat shriparv
If you are installing the CDH version of Hadoop, tell your admin that you need root access, as you need to install RPMs :) Thanks Regards ∞ Shashwat Shriparv

Re: Install Hadoop on Linux Pseudo Distributed Mode - Root Required?

2013-05-13 Thread Harsh J
Raj, Apache Hadoop by itself does not require root privileges to run (assuming a non-secure setup). You can run it out of a tarball from a home directory you have on the server machines. However, many prefer using packages, such as those from Apache Bigtop/etc., to install Hadoop and use it.
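A sketch of the tarball route (version and paths are examples only):

    tar -xzf hadoop-1.0.4.tar.gz -C $HOME
    export HADOOP_HOME=$HOME/hadoop-1.0.4
    # edit conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml
    # for pseudo-distributed settings, then:
    $HADOOP_HOME/bin/hadoop namenode -format
    $HADOOP_HOME/bin/start-all.sh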

Number of records in an HDFS file

2013-05-13 Thread Mix Nin
Hello, What is the best way to get the count of records in an HDFS file generated by a Pig script? Thanks

Re: Number of records in an HDFS file

2013-05-13 Thread Mix Nin
It is a text file. If we want to use wc, we need to copy the file from HDFS and then use wc, and this may take time. Is there a way without copying the file from HDFS to a local directory? Thanks On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: few pointers. what
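One way to count lines without copying the file down first is to stream it through a pipe (the path is an example):

    hadoop fs -cat /user/me/output/part-* | wc -l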

Re: Number of records in an HDFS file

2013-05-13 Thread Rahul Bhattacharjee
How about the second approach: get the application/job id which Pig creates and submits to the cluster, and then find the job's output counter for that job from the JT. Thanks, Rahul On Mon, May 13, 2013 at 11:37 PM, Mix Nin pig.mi...@gmail.com wrote: It is a text file. If we want to use wc,
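A sketch of reading that counter through the old JobClient API; the job id is a made-up example, and the counter group name assumes the standard MR1 task counters:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobRecordCount {
        public static void main(String[] args) throws Exception {
            JobClient client = new JobClient(new JobConf());
            RunningJob job = client.getJob(JobID.forName("job_201305130001_0042")); // hypothetical id
            Counters counters = job.getCounters();
            long records = counters.findCounter(
                    "org.apache.hadoop.mapred.Task$Counter", "REDUCE_OUTPUT_RECORDS").getCounter();
            System.out.println(records);
        }
    }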

Re: Hadoop schedulers!

2013-05-13 Thread Rahul Bhattacharjee
Any pointers to my question? There is another question, kind of dumb, but I just wanted to clarify. Say in a FIFO scheduler or a capacity scheduler, if there are slots available and the first job doesn't need all of the available slots, then the job next in the queue is scheduled for execution

Re: Number of records in an HDFS file

2013-05-13 Thread Mohammad Tariq
If it is just counting the no. of records in a file then how about a short 3-liner: LOGS = LOAD 'log'; LOGS_GROUP = GROUP LOGS ALL; LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS); It did the trick for me. Warm Regards, Tariq cloudfront.blogspot.com On Mon, May 13, 2013 at 11:57 PM,

Re: Hadoop schedulers!

2013-05-13 Thread Harsh J
Hi, On Sat, May 11, 2013 at 8:31 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I was going through the job schedulers of Hadoop and could not see any major operational difference between the capacity scheduler and the fair share scheduler apart from the fact that fair share

Re: Number of records in an HDFS file

2013-05-13 Thread Mix Nin
Hi, The final count file should reside in a local directory, not in an HDFS directory. The above script will store a text file in an HDFS directory. The count file would need to be sent to another team who do not work on HDFS. Thanks On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq

Re: Number of records in an HDFS file

2013-05-13 Thread Mohammad Tariq
Agree with Shahab. Warm Regards, Tariq cloudfront.blogspot.com On Tue, May 14, 2013 at 12:32 AM, Shahab Yunus shahab.yu...@gmail.com wrote: The count file will be a very small file, right? Once it is generated on HDFS, you can automate its downloading or movement anywhere you want. This

Re: Hadoop schedulers!

2013-05-13 Thread Alok Kumar
Hi, As the name suggests, the fair scheduler does a fair allocation of slots to jobs. Say you have 10 map slots in your cluster and they are occupied by job-1, which requires 30 map slots to finish. At the same time, another job-2 requires only 2 map slots to finish. Here, slots will be provided

Setup Eclipse for Hadoop on Mac

2013-05-13 Thread Raj Hadoop
Hi, Can anyone suggest how to configure Eclipse on a Mac for Hadoop? Hadoop is running in pseudo-distributed mode. Please provide any reference articles or other best practices that need to be followed in this case. Thanks, Raj

Re: Setup Eclipse for Hadoop on Mac

2013-05-13 Thread Mohammad Tariq
Hello Raj, I am a Linux addict, but the procedure should be the same for Mac as well. You need to pull the hadoop-eclipse plugin, build it keeping all the dependencies in mind, and copy it into the Eclipse plugins directory. Restart Eclipse and you should be good to go. For detailed info

Re: Hadoop schedulers!

2013-05-13 Thread Rahul Bhattacharjee
Thanks a lot for the replies, they were really helpful. On Tue, May 14, 2013 at 1:02 AM, Alok Kumar alok...@gmail.com wrote: Hi, As the name suggest, Fair-scheduler does a fair allocation of slot to the jobs. Let say, you have 10 map slots in your cluster and it is occupied by a job-1 which