Re: each stage's time in hadoop

2013-03-06 Thread bharath vissapragada
Look into the JobHistory class! On Wed, Mar 6, 2013 at 2:37 PM, Mirko Kämpf mirko.kae...@gmail.com wrote: Hi, please have a look at the Starfish project. http://www.cs.duke.edu/starfish/ Best wishes Mirko 2013/3/6 claytonly clayto...@163.com Hello, all I was using hadoop-1.0.0 in
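
Besides the JobHistory class, the Hadoop 1.x job CLI can print the per-phase times recorded in the job history; a minimal sketch, assuming the job's HDFS output directory is /user/hadoop/out (a hypothetical path):

  # Summarize setup/map/reduce/cleanup start and finish times from the stored history
  hadoop job -history /user/hadoop/out
  # The "all" variant also lists per-task attempt timings
  hadoop job -history all /user/hadoop/out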

Re: dfs.datanode.du.reserved

2013-03-06 Thread Bertrand Dechoux
Not that I know of. If it were possible you would be able to identify each volume, and as of now this isn't the case. BUT it can be done without Hadoop knowing about it, at the OS level, by using different partitions/mounts for datanode and jobtracker storage. That should solve your problem. Regards Bertrand On Mon,

Re: Hadoop cluster setup - could not see second datanode

2013-03-06 Thread Vikas Jadhav
1) Check whether you can ssh to the other node from the namenode. Set your configuration carefully: <property> <name>fs.default.name</name> <value>localhost:9000</value> </property> Replace localhost with the node that has the namenode running, and it should be resolvable (try pinging that node from the other
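
A minimal sketch of those connectivity checks, run from the datanode and assuming the namenode host is named namenode-host (a hypothetical name) and fs.default.name points at port 9000:

  # Confirm the namenode host resolves and is reachable
  ping -c 3 namenode-host
  # Confirm passwordless ssh works
  ssh namenode-host hostname
  # Confirm the fs.default.name RPC port is open (telnet is just one way to probe it)
  telnet namenode-host 9000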

Re: mapper combiner and partitioner for particular dataset

2013-03-06 Thread Vikas Jadhav
Got it, thanks Mahesh. On Tue, Mar 5, 2013 at 1:35 PM, Mahesh Balija balijamahesh@gmail.com wrote: What Harsh means by that is, you should create a custom partitioner which takes care of partitioning the records based on the input record data (Key, Value). i.e., if you have multiple

For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2013-03-06 Thread shubhangi
I am writing an application in C++, which uses the API provided by libhdfs to manipulate Hadoop DFS. I could run the application with 1.0.4 and 1.1.1, setting the classpath equal to $(hadoop classpath). For Hadoop 2.0.3, setting CLASSPATH=$(hadoop classpath) does not load the necessary classes required

Issue: Namenode is in safe mode

2013-03-06 Thread AMARNATH, Balachandar
Hi, I have created a hadoop cluster with two nodes (A and B). 'A' acts both as namenode and datanode, and 'B' acts as datanode only. With this setup, I could store and read files. Now, I added one more datanode 'C' and relieved 'A' from datanode duty. This means 'A' acts only as namenode, and both

RE: Issue: Namenode is in safe mode

2013-03-06 Thread Samir Kumar Das Mohapatra
Just do it $ hadoop dfsadmin -safemode leave From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 06 March 2013 15:21 To: user@hadoop.apache.org Subject: Issue: Namenode is in safe mode Hi, I have created a hadoop cluster with two nodes (A and B). 'A' act both as
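
For reference, a minimal sketch of inspecting and leaving safe mode with the 1.x dfsadmin tool (leave it only once you know why the namenode is stuck there):

  # Report whether the namenode is currently in safe mode
  hadoop dfsadmin -safemode get
  # Force the namenode out of safe mode
  hadoop dfsadmin -safemode leave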

store file gives exception

2013-03-06 Thread AMARNATH, Balachandar
Now I came out of safe mode through the admin command. I tried to put a file into HDFS and encountered this error: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hadoop/hosts could only be replicated to 0 nodes, instead of 1 Any hint to fix this? This happens when the

Re: Execution handover in map/reduce pipeline

2013-03-06 Thread Michel Segel
RTFM? Yes you can do this. See Oozie. When you have a cryptic name, you get a cryptic answer. Sent from a remote device. Please excuse any typos... Mike Segel On Mar 5, 2013, at 5:35 PM, Public Network Services publicnetworkservi...@gmail.com wrote: Hi... I have an application that

fsimage.ckpt are not deleted - Exception in doCheckpoint

2013-03-06 Thread Elmar Grote
Hi, we are writing our fsimage and edits file on the namenode and secondary namenode, and additionally on an NFS share. In these folders we found a lot of fsimage.ckpt_0 . files, the oldest from 9 Aug 2012. As far as I know these files should be deleted after the secondary

Re: Issue: Namenode is in safe mode

2013-03-06 Thread Nitin Pawar
What is your replication factor? When you removed node A as a datanode, did you first mark it for retirement? If you just removed it from service then the blocks from that datanode are missing, and when the namenode starts up it checks for the blocks. Unless it reaches its threshold value it will not

Re: Issue: Namenode is in safe mode

2013-03-06 Thread Bertrand Dechoux
How was node A relieved of its datanode duty, and what was the default replication factor? If the replication factor was 1 and datanode A was unplugged without any care, then you lost half of your files and your namenode is not really happy about it (and is waiting for you to correct the mistake
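
To confirm that missing blocks are what keeps the namenode below its safe-mode threshold, a hedged sketch using fsck:

  # With replication factor 1, any block that lived only on the removed datanode shows up here
  hadoop fsck / -files -blocks -locations | grep -i -E "MISSING|CORRUPT"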

Re: For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2013-03-06 Thread shubhangi
Hi All, I am writing an application in C++, which uses the API provided by libhdfs to manipulate Hadoop DFS. I could run the application with 1.0.4 and 1.1.1, setting the classpath equal to $(hadoop classpath). For Hadoop 2.0.3, setting CLASSPATH=$(hadoop classpath) does not load the necessary classes

Re: Transpose

2013-03-06 Thread Michel Segel
Sandy, Remember KISS. Don't try to read it in as anything but a plain text line. It's really a 3x3 matrix in what looks to be grouped by columns. Your output will drop the initial key; you then parse the lines and output it. Without further explanation, it looks like each tuple is

Re: S3N copy creating recursive folders

2013-03-06 Thread Michel Segel
Have you tried using distcp? Sent from a remote device. Please excuse any typos... Mike Segel On Mar 5, 2013, at 8:37 AM, Subroto ssan...@datameer.com wrote: Hi, Its not because there are too many recursive folders in S3 bucket; in-fact there is no recursive folder in the source. If I

RE: Issue: Namenode is in safe mode

2013-03-06 Thread AMARNATH, Balachandar
The replication factor was 1 when I removed the entry of A in the slaves file. I did not mark it for retirement; I do not know yet how to mark a node for retirement. I waited for a few minutes and then I could see the namenode running again From: Nitin Pawar [mailto:nitinpawar...@gmail.com] Sent: 06

Re: S3N copy creating recursive folders

2013-03-06 Thread Subroto
Hi Mike, I have tried distcp as well and it ended up with an exception: 13/03/06 05:41:13 INFO tools.DistCp: srcPaths=[s3n://acessKey:acesssec...@dm.test.bucket/srcData] 13/03/06 05:41:13 INFO tools.DistCp: destPath=/test/srcData 13/03/06 05:41:18 INFO tools.DistCp: /test/srcData does not exist.

RE: store file gives exception

2013-03-06 Thread AMARNATH, Balachandar
Hi, I could successfully install a hadoop cluster with three nodes (2 datanodes and 1 namenode). However, when I tried to store a file, I got the following error: 13/03/06 16:45:56 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 13/03/06 16:45:56 WARN

RE: store file gives exception

2013-03-06 Thread AMARNATH, Balachandar
Hi all, I thought the issue below was caused by a lack of available space. Hence, I replaced the datanodes with other nodes that have more space and it worked. Now, I have a working HDFS cluster. I am thinking of my application where I need to execute 'a set of similar instructions'

Re: store file gives exception

2013-03-06 Thread Nitin Pawar
In hadoop you don't have to worry about data locality. The Hadoop job tracker will by default try to schedule each task where its data is located, as long as the node has enough compute capacity. Also note that a datanode just stores blocks of files, and multiple datanodes will hold different blocks of the same file.

[no subject]

2013-03-06 Thread ashish_kumar_gupta
Unsubscribe me How many more times, I have to mail u

Re:

2013-03-06 Thread Jean-Marc Spaggiari
Hi Ashish, it's an operation you have to do on your side. Have you tried Google? https://www.google.ca/search?q=unsubscribe+hadoop.apache.orgaq=foq=unsubscribe+hadoop.apache.orgaqs=chrome.0.57.2271sourceid=chromeie=UTF-8 JM 2013/3/6 ashish_kumar_gu...@students.iitmandi.ac.in: Unsubscribe

Re:

2013-03-06 Thread Kai Voigt
In my opinion, another 2782829 times, give or take a few. Or try reading and understanding http://hadoop.apache.org/mailing_lists.html, which tells you to send an email to user-unsubscr...@hadoop.apache.org Cheers Kai On 06.03.2013 at 14:03 wrote

Re:

2013-03-06 Thread Panshul Whisper
lol... as long as u dnt mail to user-unsubscr...@hadoop.apache.org noobs... On Wed, Mar 6, 2013 at 2:03 PM, ashish_kumar_gu...@students.iitmandi.ac.inwrote: Unsubscribe me How many more times, I have to mail u -- Regards, Ouch Whisper 010101010101

Re: [unsubscribe noobs]

2013-03-06 Thread Mike Spreitzer
The question is, how much more of this must we endure before the mailing list server gets smarter? How about making it respond to any short message that includes the word unsubscribe with a message reminding the noob how to manage his subscription and how to send an email with the word

too many memory spills

2013-03-06 Thread Panshul Whisper
Hello, I have a file of size 9GB and having approximately 109.5 million records. I execute a pig script on this file that is doing: 1. Group by on a field of the file 2. Count number of records in every group 3. Store the result in a CSV file using normal PigStorage(,) The job is completed

Re: each stage's time in hadoop

2013-03-06 Thread Shumin Guo
You can also try the following two commands: 1, hadoop job -status job-id For example, hadoop job -status job_201303021057_0004 gives the following output: Job: job_201303021057_0004 file: hdfs://master:54310/user/ec2-user/.staging/job_201303021057_0004/job.xml tracking URL:
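
For reference, a hedged sketch of the related jobtracker CLI queries in Hadoop 1.x, reusing the job id from the example above:

  # List all jobs the jobtracker knows about (running and completed)
  hadoop job -list all
  # Print state, progress and the tracking URL for one job
  hadoop job -status job_201303021057_0004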

Re: store file gives exception

2013-03-06 Thread Shumin Guo
Nitin is right. The hadoop Job tracker will schedule a job based on the data block location and the computing power of the node. Based on the number of data blocks, the job tracker will split a job into map tasks. Optimally, map tasks should be scheduled on nodes with local data. And also because

Re: S3N copy creating recursive folders

2013-03-06 Thread Shumin Guo
I used to have a similar problem. It looks like there is a recursive folder creation bug. How about trying to remove srcData from the destination, for example with the following command: hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/ Or with distcp: hadoop distcp

cannot find /usr/lib/hadoop/mapred/

2013-03-06 Thread Jay Vyas
Hi guys: I'm getting an odd error involving a file called toBeDeleted. I've never seen this - somehow it's blocking my task trackers from starting. 2013-03-06 16:19:24,657 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Cannot find root

RE: dfs.datanode.du.reserved

2013-03-06 Thread John Meza
Thanks for the reply. This sounds like it has potential, but it also seems to be a rather duct-tape type of workaround. It would be nice if there were a change to dfs.datanode.du.reserved that worked within Hadoop, which would mean Hadoop was a little more certain to adhere to it. I

Re: Issue: Namenode is in safe mode

2013-03-06 Thread Shumin Guo
To decommission a live datanode from the cluster, you can do the following steps: 1, edit configuration file $HADOOP_HOME/conf/hdfs-site.xml, and add the following property: <property> <name>dfs.hosts.exclude</name> <value>$HADOOP_HOME/conf/dfs-exclude.txt</value> </property> 2, put the host name of the
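
A condensed sketch of that decommission flow, assuming dfs.hosts.exclude already points at $HADOOP_HOME/conf/dfs-exclude.txt and nodeA is the hypothetical host being retired:

  # Add the retiring host to the exclude file registered in hdfs-site.xml
  echo "nodeA" >> $HADOOP_HOME/conf/dfs-exclude.txt
  # Ask the namenode to re-read its include/exclude lists and start decommissioning
  hadoop dfsadmin -refreshNodes
  # Watch the node move from "Decommission In Progress" to "Decommissioned"
  hadoop dfsadmin -report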

Re: For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2013-03-06 Thread Shumin Guo
You can always print out the hadoop classpath before running the hadoop command, for example by editing the $HADOOP_HOME/bin/hadoop file. HTH. On Wed, Mar 6, 2013 at 5:01 AM, shubhangi shubhangi.g...@oracle.com wrote: Hi All, I am writing an application in c++, which uses API provided by

Re: Issue: Namenode is in safe mode

2013-03-06 Thread shashwat shriparv
You cannot directly remove a datanode from a cluster; that is not the proper way. You need to decommission nodes and wait until the data from the datanodes to be removed has been copied to other nodes. Just read the documentation on proper decommissioning of nodes

Re: For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2013-03-06 Thread Arpit Gupta
When you constructed the classpath with the full paths, did you also add slf4j-log4j12-*.jar (http://www.slf4j.org/codes.html#StaticLoggerBinder) to the classpath? The jar should be in HADOOP_HOME/lib. This should help with the SLF4J issue. 13/03/04 11:17:23 WARN util.NativeCodeLoader: Unable to
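
One hedged way to build such a classpath is to append every jar under $HADOOP_HOME/lib explicitly (a JVM started through JNI, as libhdfs does, may not expand the "*" wildcard form); my_libhdfs_app is a hypothetical binary standing in for the C++ application:

  CLASSPATH="$(hadoop classpath)"
  # Append each lib jar, including slf4j-log4j12-*.jar, with its full path
  for jar in "$HADOOP_HOME"/lib/*.jar; do CLASSPATH="$CLASSPATH:$jar"; done
  export CLASSPATH
  ./my_libhdfs_app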

Re: Execution handover in map/reduce pipeline

2013-03-06 Thread Shumin Guo
Oozie can be a good choice for MapReduce job flow management, though it may be too heavyweight for your problem. Based on your description, I am simply assuming that you are processing some static data files, i.e. the files will not change while being processed, and there are no

Re: cannot find /usr/lib/hadoop/mapred/

2013-03-06 Thread Jay Vyas
interesting: The solution was simply to delete the toBeDeleted directory manually: rm -rf /usr/lib/hadoop/mapred/toBeDeleted/ I guess maybe somehow I changed the privileges of the /usr/lib/hadoop/mapred directory so that it was unreadable or something. Nevertheless, it was a cryptic error

Hadoop Jobtracker API

2013-03-06 Thread Kyle B
Hello, I was wondering if the Hadoop job tracker had an API, such as a web service or xml feed? I'm trying to track Hadoop jobs as they progress. Right now, I'm parsing the HTML of the Running Jobs section at http://hadoop:50030/jobtracker.jsp, but this is definitely not desired if there is a

Difference between HDFS_BYTES_READ and the actual size of input files

2013-03-06 Thread Jeff LI
Dear Hadoop Users, I recently noticed there is a difference between the file system counter HDFS_BYTES_READ and the actual size of the input files in map-reduce jobs. The difference seems to increase as the size of each key/value pair increases. For example, I'm running the same job on two

Re: Difference between HDFS_BYTES_READ and the actual size of input files

2013-03-06 Thread Jeffrey Buell
Jeff, Probably because records are split across blocks, so some of the data has to be read twice. Assuming you have a 64 MB block size and 128 GB of data, I'd estimate the overhead at 1 GB for 1 MB record size, and 32 GB for 32 MB record size. Your overhead is about 75% of that, maybe my
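
That estimate follows from simple block arithmetic: roughly one record straddles each block boundary, so the extra read is about (number of blocks) x (record size) / 2. A small sketch reproducing the numbers above for 128 GB of input and 64 MB blocks:

  awk 'BEGIN {
    blocks = 128 * 1024 / 64;              # 2048 block boundaries, roughly
    for (rec = 1; rec <= 32; rec *= 32)    # 1 MB and 32 MB record sizes
      printf "record %2d MB -> ~%d GB extra read\n", rec, blocks * rec / 2 / 1024
  }'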

Re: Hadoop Jobtracker API

2013-03-06 Thread Dino Kečo
Hi Kyle, Maybe this can help you http://stackoverflow.com/questions/2616524/tracking-hadoop-job-status-via-web-interface-exposing-hadoop-to-internal-clien/4156387#4156387 Regards, Dino On Mar 6, 2013 7:43 PM, Kyle B kbi...@gmail.com wrote: Hello, I was wondering if the Hadoop job tracker

Re: How to solve one Scenario in hadoop ?

2013-03-06 Thread Vikas Jadhav
I would go with the first case because, if the data size is large, it will distribute the data across multiple nodes. On Tue, Mar 5, 2013 at 10:57 AM, samir das mohapatra samir.help...@gmail.com wrote: Hi All, I have one scenario where our organization is trying to implement hadoop. Scenario

Re: How to solve one Scenario in hadoop ?

2013-03-06 Thread Dino Kečo
I would suggest Hive in these cases because it is easy to manage multiple data sources, it uses SQL-like syntax, it scales because of Hadoop, and it has joins implemented and optimized Regards Dino On Mar 6, 2013 8:46 PM, Vikas Jadhav vikascjadha...@gmail.com wrote: I will go with the first case

Hadoop cluster hangs on big hive job

2013-03-06 Thread Daning Wang
We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while running big jobs; basically all the nodes are dead, and from the tasktracker's log it looks like it went into some kind of loop forever. All the log entries look like this when the problem happened. Any idea how to debug the issue? Thanks in

preferential distribution of tasks to tasktracker by jobtracker based on specific criteria (cpu frequency)

2013-03-06 Thread Sayan Kole
Hi, I want the jobtracker to prioritize the assignment of tasks to certain tasktrackers, e.g. if a tasktracker meets certain criteria better than the other ones, I want to assign a task to that tasktracker first (ideally I want the jobtracker to sort tasktrackers based on certain criteria (e.g. cpu

Re: Hadoop Jobtracker API

2013-03-06 Thread Kyle B
Hi Dino, Thanks for the response. I've seen this before, but was hoping to avoid getting locked into the Java road. Do you happen to know if there is an open API for the job tracker included with Hadoop? Something I could call from a variety of languages, like a web service? -Kyle On Wed, Mar 6,

Re: Hadoop Jobtracker API

2013-03-06 Thread Dino Kečo
Hi Kyle, There is only the JobTracker servlet, which you can use as a web service, but you need to parse the HTML response; or you can build a small Java web service using the code from Stack Overflow. Regards Dino On Mar 6, 2013 10:06 PM, Kyle B kbi...@gmail.com wrote: Hi Dino, Thanks for the response. I've seen this

Re: location of log files for mapreduce job run through eclipse

2013-03-06 Thread Harsh J
If you've not changed any configs, look under /tmp/hadoop-${user.name}/ perhaps. On Thu, Mar 7, 2013 at 3:19 AM, Sayan Kole sayank...@gmail.com wrote: Hi, I cannot find the log files for the wordcount job: job_local_0001 when I run it through eclipse. I am getting the standard output on

Re: Hadoop Jobtracker API

2013-03-06 Thread Harsh J
The Java API of the JobClient class lets you query all jobs and provides some task-level info as a public API. In YARN (2.x onwards), the MRv2 AM publishes a REST API that lets you query it (the RM lets you get a list of such AMs as well, as a first step). This sounds more like what you need. A
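
For the 2.x route, a hedged sketch of hitting the ResourceManager's REST interface from any language that can speak HTTP, assuming the RM web UI answers on rm-host:8088 (a hypothetical address):

  # List applications known to the ResourceManager, as JSON
  curl -s "http://rm-host:8088/ws/v1/cluster/apps"
  # Drill into a single application (the id below is made up)
  curl -s "http://rm-host:8088/ws/v1/cluster/apps/application_1362600000000_0001"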

Re: dfs.datanode.du.reserved

2013-03-06 Thread Harsh J
Hey John, Ideas, comments and patches are welcome on https://issues.apache.org/jira/browse/HDFS-1564 for achieving this! On Wed, Mar 6, 2013 at 9:56 PM, John Meza j_meza...@hotmail.com wrote: Thanks for the reply. This sounds like it has potential, but also seems to be a rather duct-tape

Best practices for adding services to Hadoop cluster?

2013-03-06 Thread Mark Kerzner
Hi, my Hadoop cluster needs help: some tasks have to be done by a Windows server with specialized closed-source software. How do I add them to the mix? For example, I can run Tomcat, and the mapper would be calling a servlet there. Is there anything better, which would be closer to the

Reading partitioned sequence file from hdfs throws filenotfoundexception

2013-03-06 Thread Dmitriy Ivanov
Hello, I'm using hadoop 1.1.1 and ran into an unexpected complication with a partitioned file. The file itself is the result of a map-reduce task. Here is the code I'm using to read the file: try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf)) { // skipped

HDFS network traffic

2013-03-06 Thread Bill Q
Hi All, I am working on converting a sequence file to mapfile and just discovered something I wasn't aware of. For example, suppose I am working on a 2-node cluster, one master/namenode/datanode, one slave/datanode. If I do hadoop dfs -cp /data/file1 /data/file2 (a 1G file) from the master, and

Re: HDFS network traffic

2013-03-06 Thread Harsh J
Yes, the simple copy is a client operation. The client reads bytes from the source and writes them to the destination, thereby being in control of failures, etc. However, if you want your cluster to do the copy (and if the copy is a big set), consider using the DistCp (distributed-copy) MR job to do it. On
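
A minimal sketch of the two approaches, reusing the 1 GB file from the question and assuming the namenode answers on master:9000:

  # Plain -cp streams every byte through the client machine
  hadoop dfs -cp /data/file1 /data/file2
  # DistCp runs the copy as a MapReduce job, so the datanodes move the bytes
  hadoop distcp hdfs://master:9000/data/file1 hdfs://master:9000/data/file2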

Re: Best practices for adding services to Hadoop cluster?

2013-03-06 Thread Harsh J
Can the mapper not directly talk to whatever application server the Windows server runs? Is the work needed to be done in the map step (i.e. per record)? If not, you can perhaps also consider the SSH action of Oozie (although I've never tried it with a Windows machine) under a workflow. On Thu,

Re: Best practices for adding services to Hadoop cluster?

2013-03-06 Thread Mark Kerzner
Okay, then there is nothing wrong with the mapper directly talking to the server, and failing the map task if the service does not work out. Thank you, Mark On Wed, Mar 6, 2013 at 11:21 PM, Harsh J ha...@cloudera.com wrote: Can the mapper not directly talk to whatever application server the

mapred.max.tracker.failures

2013-03-06 Thread Mohit Anchlia
I am wondering what the correct behaviour of this parameter is. If it's set to 4, does it mean the job should fail if it has more than 4 failures?

Re: Reading partitioned sequence file from hdfs throws filenotfoundexception

2013-03-06 Thread Abdelrhman Shettia
Hi All, Try giving the full path for the file, such as: /users/ivanovd/1.2a8b1a9c-47de-4631-8013-f0dd3e096036.cvsp/part-r-0 If the job is producing lots of files and there is a need to set the number of mappers to more than one, a file crusher utility may be the best option here to

Re: Best practices for adding services to Hadoop cluster?

2013-03-06 Thread Harsh J
The only thing wrong would be what is said for the DB-talking jobs as well: Distributed mappers talking to a single point of service can bring it down. On Thu, Mar 7, 2013 at 10:59 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Okay, then there is nothing wrong with the mapper directly

Re: mapred.max.tracker.failures

2013-03-06 Thread Abdelrhman Shettia
Hi Mohit, This is the number of failed tasks in the specified job after which the job will not run on that task tracker; the job's tasks will no longer be assigned to it. However, if the same task fails more than 4 times, the job will fail regardless. Hope this helps. Thanks

Re: mapred.max.tracker.failures

2013-03-06 Thread Harsh J
It is a per-job config which controls the automatic job-level blacklist: if, for a single job, a specific tracker has failed 4 (or X) total tasks, then we prevent scheduling any more of the job's tasks on that tracker (but we don't eliminate more than 25% of the available trackers this way, as for

Re: mapred.max.tracker.failures

2013-03-06 Thread bharath vissapragada
No, it's the number of task failures in a job after which that particular tasktracker can be blacklisted *for that job*! Note that it can still take tasks from other jobs! On Thu, Mar 7, 2013 at 11:21 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am wondering what the correct behaviour is of this
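
For completeness, a hedged example of setting the per-job blacklist threshold next to the per-task retry limits at submission time, assuming the driver uses ToolRunner/GenericOptionsParser (jar name, class and paths are hypothetical):

  # Blacklist a tasktracker for this job after 8 of its task attempts fail there,
  # while individual tasks still get the default 4 attempts each
  hadoop jar my-job.jar com.example.MyJob \
    -D mapred.max.tracker.failures=8 \
    -D mapred.map.max.attempts=4 \
    -D mapred.reduce.max.attempts=4 \
    /input /output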

Re: S3N copy creating recursive folders

2013-03-06 Thread George Datskos
Subroto and Shumin, Try adding a slash to the s3n source: - hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData + hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData/ /test/srcData Without the slash, it will keep listing srcData each time it is listed,