Re: High IO Usage in Datanodes due to Replication

2013-04-30 Thread Harsh J
The block scanner is a simple, independent operation of the DN that runs periodically and does work in small phases, to ensure that no blocks exist that don't match their checksums (it's an automatic data validator), so that it can report corrupt/rotting blocks and keep the cluster healthy.

Re: High IO Usage in Datanodes due to Replication

2013-04-30 Thread selva
Thanks Harsh & Manoj for the inputs. Now I have found that the data node is busy with block scanning. I have TBs of data attached to each data node, so it's taking days to complete the data block scanning. I have two questions. 1. Will the data node not allow data to be written while DataBlockScanning is in pr

Re: Can't initialize cluster

2013-04-30 Thread Harsh J
When you run with java -jar, as previously stated on another thread, you aren't loading any configs present on the installation (that configure HDFS to be the default filesystem). When you run with "hadoop jar", the configs under /etc/hadoop/conf get applied automatically to your program, making i
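
For reference, a program launched with plain "java -jar" can still pick up the cluster settings by adding the site files as configuration resources itself; a minimal sketch, assuming the configs live under /etc/hadoop/conf (the class name is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConfigCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With plain "java -jar", /etc/hadoop/conf is not on the classpath,
        // so the *-site.xml files have to be added explicitly.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
        // Should now print an hdfs:// URI instead of file:///.
        System.out.println(FileSystem.get(conf).getUri());
      }
    }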

Re: Hadoop Avro Question

2013-04-30 Thread Harsh J
Oops, moving for sure this time :) On Wed, May 1, 2013 at 10:35 AM, Harsh J wrote: > Moving the question to Apache Avro's user@ lists. Please use the right > lists for the most relevant answers. > > Avro is a different serialization technique that intends to replace > the Writable serialization d

Re: Hadoop Avro Question

2013-04-30 Thread Harsh J
Moving the question to Apache Avro's user@ lists. Please use the right lists for the most relevant answers. Avro is a different serialization technique that intends to replace the Writable serialization defaults in Hadoop. MR accepts a list of serializers it can use for its key/value structures an
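
As a rough illustration of that model, the org.apache.avro.mapred API hands Avro datums to the mapper instead of Writables; a minimal word-count-style sketch (schemas and class names are only an example, and the reducer/output wiring is omitted):

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.avro.mapred.AvroMapper;
    import org.apache.avro.mapred.Pair;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Reporter;

    // Emits Avro (string, long) pairs rather than Writable key/value objects.
    public class TokenCountMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
      @Override
      public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter)
          throws IOException {
        for (String token : line.toString().split("\\s+")) {
          collector.collect(new Pair<Utf8, Long>(new Utf8(token), 1L));
        }
      }

      // AvroJob registers the Avro serialization with the job configuration.
      public static JobConf configure() {
        JobConf conf = new JobConf(TokenCountMapper.class);
        AvroJob.setInputSchema(conf, Schema.create(Schema.Type.STRING));
        AvroJob.setOutputSchema(conf, Pair.getPairSchema(
            Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
        AvroJob.setMapperClass(conf, TokenCountMapper.class);
        return conf;
      }
    }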

Re: partition as block?

2013-04-30 Thread Jay Vyas
What do you mean by "increasing the size"? I'm talking more about increasing the number of partitions... which actually decreases individual file size. On Apr 30, 2013, at 4:09 PM, Mohammad Tariq wrote: > Increasing the size can help us to an extent, but increasing it further might > cause proble

Re: New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
Thank you Mitra. I will change the hostname. On Tue, Apr 30, 2013 at 6:16 PM, Mitra Kaseebhotla < mitra.kaseebho...@gmail.com> wrote: > and change the hostname to reflect your actual hostnames. > > > > On Tue, Apr 30, 2013 at 3:14 PM, Mohammad Tariq wrote: > >> comment out 127.0.1.1 ubuntu in bot

Re: New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
Thank you Tariq. I will try that... On Tue, Apr 30, 2013 at 6:14 PM, Mohammad Tariq wrote: > comment out 127.0.1.1 ubuntu in both the machines. > > if it still doesn't work change 127.0.1.1master to something else, > like 127.0.0.3 or something. > > Warm Regards, > Tariq > https://mtariq.ju

Re: New to Hadoop-SSH communication

2013-04-30 Thread Mitra Kaseebhotla
and change the hostname to reflect your actual hostnames. On Tue, Apr 30, 2013 at 3:14 PM, Mohammad Tariq wrote: > comment out 127.0.1.1 ubuntu in both the machines. > > if it still doesn't work change 127.0.1.1master to something else, > like 127.0.0.3 or something. > > Warm Regards, > Ta

Re: New to Hadoop-SSH communication

2013-04-30 Thread Mohammad Tariq
comment out 127.0.1.1 ubuntu in both the machines. if it still doesn't work change 127.0.1.1master to something else, like 127.0.0.3 or something. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 3:34 AM, Automation Me wrote: > Hi Tariq, > > > Mas

Re: New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
Hi Tariq, Master: Users: hduser hduser hostname: ubuntu *etc/hosts* 127.0.0.1 localhost 127.0.1.1 ubuntu # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters 127

Re: New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
@Mitra Yes, I cloned the same VMs. By default Ubuntu takes the 127.0.0.1 - ubuntu hostname for all machines. @Tariq I will send the hosts file and users of all the machines. On Tue, Apr 30, 2013 at 5:42 PM, Mitra Kaseebhotla < mitra.kaseebho...@gmail.com> wrote: > Looks like you have just cloned/copi

Re: New to Hadoop-SSH communication

2013-04-30 Thread Mitra Kaseebhotla
Looks like you have just cloned/copied the same VMs. Change the hostname of each: http://askubuntu.com/questions/87665/how-do-i-change-the-hostname-without-a-restart On Tue, Apr 30, 2013 at 2:30 PM, Automation Me wrote: > Thank you Tariq. > > I am using the same username on both the machines

Re: New to Hadoop-SSH communication

2013-04-30 Thread Mohammad Tariq
show me your /etc/hosts file along with the output of "users" and "hostname" on both the machines. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 3:00 AM, Automation Me wrote: > Thank you Tariq. > > I am using the same username on both the machines

Re: New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
Thank you Tariq. I am using the same username on both the machines, and when I try to copy a file from master to slave just to make sure SSH is working fine, the file is copied onto the master itself, not onto the slave machine. scp -r /usr/local/somefile hduser@slave:/usr/local/somefile Any suggestions...

Re: New to Hadoop-SSH communication

2013-04-30 Thread Mohammad Tariq
ssh is actually *user@some_machine* to *user@some_other_machine*. Either use the same username on both the machines or add the IPs along with the proper user@hostname in the /etc/hosts file. HTH Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 2:39 AM, Automation M

New to Hadoop-SSH communication

2013-04-30 Thread Automation Me
Hello, I am new to Hadoop and trying to install a multinode cluster on Ubuntu VMs. I am not able to communicate between the two VMs using SSH. My hosts file: 127.0.1.1 Master 127.0.1.2 Slave The following changes I made in the two VMs: 1. Updated the /etc/hosts file in the two VMs; on the Master VM I did SS

Re: partition as block?

2013-04-30 Thread Mohammad Tariq
Increasing the size can help us to an extent, but increasing it further might cause problems during copy and shuffle. If the partitions are too big to be held in memory, we'll end up with *disk based shuffle*, which is going to be slower than *RAM based shuffle*, thus delaying the entire reduce pha

Re: partition as block?

2013-04-30 Thread Jay Vyas
Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - which happen after the mapper phase - can be optimized. On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq wrote: > Hmmm. I was actually thinking about the very first step. How are you going >

Re: partition as block?

2013-04-30 Thread Mohammad Tariq
Hmmm. I was actually thinking about the very first step: how are you going to create the maps? Suppose you are on a block-less filesystem and you have a custom Format that is going to give you the splits dynamically. This means that you are going to store the file as a whole and create the splits as

RE: Can't initialize cluster

2013-04-30 Thread Kevin Burton
Tariq, Thank you. I tried this and the summary of the map reduce job looks like: 13/04/30 14:02:35 INFO mapred.JobClient: Job complete: job_201304301251_0004 13/04/30 14:02:35 INFO mapred.JobClient: Counters: 7 13/04/30 14:02:35 INFO mapred.JobClient: Job Counters 13/04/30 14:02:35 INF

Re: partition as block?

2013-04-30 Thread Jay Vyas
Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized in a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller. Wondering if there is a way to hack an increased number of partitions a

Re: partition as block?

2013-04-30 Thread Mohammad Tariq
Hello Jay, What are you going to do in your custom InputFormat and partitioner? Is your InputFormat going to create larger splits which will overlap with larger blocks? If that is the case, IMHO, you are going to reduce the no. of mappers, thus reducing the parallelism. Also, much larger

partition as block?

2013-04-30 Thread Jay Vyas
Hi guys: I'm wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can I increase performance with either: 1) A custom FileInputFormat 2) A custom partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffi
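
On option (1), one knob that does not require a full custom FileInputFormat is the maximum split size, which caps how much data each map task receives regardless of block size; a minimal sketch with the new-API FileInputFormat (the 64 MB figure is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
      public static Job configure() throws Exception {
        Job job = new Job(new Configuration(), "more-splits");
        // Cap each input split at 64 MB so that even very large blocks or files
        // still yield many map tasks, i.e. finer-grained partitions.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        return job;
      }
    }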

RE: Can't initialize cluster

2013-04-30 Thread Kevin Burton
We/I are/am making progress. Now I get the error: 13/04/30 12:59:40 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 13/04/30 12:59:40 INFO mapred.JobClient: Cleaning up the staging area hdfs://devubuntu05:9000/data/had
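
On the GenericOptionsParser warning in that output: it is advisory only, and goes away once the driver implements Tool so that -D/-conf/-fs/-jt options are parsed for you; a minimal sketch (class and job names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // getConf() already contains anything passed via -D, -conf, -fs or -jt.
        Job job = new Job(getConf(), "my-job");
        job.setJarByClass(MyDriver.class);
        // ... set mapper/reducer, input/output formats and paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
      }
    }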

Re: Can't initialize cluster

2013-04-30 Thread Mohammad Tariq
Set "HADOOP_MAPRED_HOME" in your hadoop-env.sh file and re-run the job. See if it helps. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Tue, Apr 30, 2013 at 10:10 PM, Kevin Burton wrote: > To be clear when this code is run with ‘java –jar’ it runs without > exception. Th

hadoop

2013-04-30 Thread Aditya exalter

Hadoop

2013-04-30 Thread Manoj Sah
Hai Hadoop, -- * With Best Regards Manoj Kumar Sahu Ameerpet, Hyderabad-500016. 8374232928 /7842496524 * Pl. *Save a tree. Please don't print this e-mail unless you really need to...*

[no subject]

2013-04-30 Thread Sandeep Nemuri
-- Regards N.H Sandeep

[no subject]

2013-04-30 Thread Niketh Nikky

Re: Permission problem

2013-04-30 Thread Mohammad Tariq
Sorry Kevin, I was away for a while. Are you good now? Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Tue, Apr 30, 2013 at 9:50 PM, Arpit Gupta wrote: > Kevin > > You will have create a new account if you did not have one before. > > -- > Arpit > > On Apr 30, 2013, at 9

RE: Can't initialize cluster

2013-04-30 Thread Kevin Burton
To be clear when this code is run with 'java -jar' it runs without exception. The exception occurs when I run with 'hadoop jar'. From: Kevin Burton [mailto:rkevinbur...@charter.net] Sent: Tuesday, April 30, 2013 11:36 AM To: user@hadoop.apache.org Subject: Can't initialize cluster I have a

Can't initialize cluster

2013-04-30 Thread Kevin Burton
I have a simple MapReduce job that I am trying to get to run on my cluster. When I run it I get: 13/04/30 11:27:45 INFO mapreduce.Cluster: Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid "mapreduce.jobtracker.address" configuration value for LocalJobRunn
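
A quick way to see which configuration the client is actually picking up (and hence why it falls back to LocalJobRunner) is to print the relevant keys from a freshly loaded Configuration; a small diagnostic sketch using the MRv1 property names:

    import org.apache.hadoop.conf.Configuration;

    public class WhichCluster {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "file:///" or "local" here means the *-site.xml files were not on the classpath.
        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
      }
    }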

Re: Permission problem

2013-04-30 Thread Arpit Gupta
Kevin You will have to create a new account if you did not have one before. -- Arpit On Apr 30, 2013, at 9:11 AM, Kevin Burton wrote: I don’t see a “create issue” button or tab. If I need to log in then I am not sure what credentials I should use to log in because all I tried failed. *From

RE: Permission problem

2013-04-30 Thread Kevin Burton
I am not sure how to create a jira. Again I am not sure I understand your workaround. You are suggesting that I create /data/hadoop/tmp on HDFS like: sudo -u hdfs hadoop fs -mkdir /data/hadoop/tmp I don't think I can chmod -R 777 on /data since it is a disk and as I indicated it is bein

RE: Permission problem

2013-04-30 Thread Kevin Burton
I am not clear on what you are suggesting to create on HDFS or the local file system. As I understand it, hadoop.tmp.dir is on the local file system. I changed it so that the temporary files would be on a disk that has more capacity than /tmp. So you are suggesting that I create /data/hadoop/tmp on HDF

Re: Permission problem

2013-04-30 Thread Arpit Gupta
Ah, this is what mapred.system.dir defaults to: mapred.system.dir = ${hadoop.tmp.dir}/mapred/system ("The directory where MapReduce stores control files.") So that's why it's trying to write to /data/hadoop/tmp/hadoop-mapred/mapred/system. So if you want hadoop.tmp.dir to be /data/hadoop/tmp/

Unsubscribe

2013-04-30 Thread Jensen, Daniel
Unsubscribe

RE: Permission problem

2013-04-30 Thread Kevin Burton
In core-site.xml I have: fs.default.name = hdfs://devubuntu05:9000 ("The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.") In hdfs-site.xml I have: hadoop.tmp.dir = /data/hadoop/tmp/hadoop-${user.name} ("Hadoop tempo

Preparation needed when change the version of Hadoop

2013-04-30 Thread Geelong Yao
Hi everyone, We have implemented a cluster with Apache version 1.0.4. Due to some corporate issues, we need to change the version to Cloudera CDH 4.2.0. 1. How to uninstall the current Hadoop without leaving any unnecessary files behind? 2. Any preparation for this change? BRs Geelong -- From Go

Re: Permission problem

2013-04-30 Thread Arpit Gupta
Based on the logs your system dir is set to hdfs://devubuntu05:9000/data/hadoop/tmp/hadoop-mapred/mapred/system. What are your fs.default.name and hadoop.tmp.dir in core-site.xml set to? -- Arpit Gupta Hortonworks Inc. http://hortonworks.com/ On Apr 30, 2013, at 7:39 AM, "Kevin Burton" wrot

RE: Permission problem

2013-04-30 Thread Kevin Burton
Thank you. mapred.system.dir is not set. I am guessing that it is whatever the default is. What should I set it to? /tmp is already 777 kevin@devUbuntu05:~$ hadoop fs -ls /tmp Found 1 items drwxr-xr-x - hdfs supergroup 0 2013-04-29 15:45 /tmp/mapred kevin@devUbuntu05:~$

Re: Permission problem

2013-04-30 Thread Arpit Gupta
What is your mapred.system.dir set to in mapred-site.xml? By default it will write to /tmp on HDFS. So you can do the following: create /tmp on HDFS and chmod it to 777 as user hdfs, and then restart the jobtracker and tasktrackers. In case it's set to /mapred/something, then create /mapred and chown
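
For completeness, the same steps can also be scripted through the FileSystem API instead of the hadoop fs shell (run it as a user with HDFS superuser rights, e.g. hdfs); a hedged sketch that assumes the default /tmp location:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PrepareMapredSystemDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp = new Path("/tmp");
        // Equivalent of: hadoop fs -mkdir /tmp && hadoop fs -chmod 777 /tmp
        fs.mkdirs(tmp);
        fs.setPermission(tmp, new FsPermission((short) 0777));
      }
    }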

RE: Permission problem

2013-04-30 Thread Kevin Burton
To further complicate the issue the log file in (/var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-devUbuntu05.log) is owned by mapred:mapred and the name of the file seems to indicate some other lineage (hadoop,hadoop). I am out of my league in understanding the permission structure for hado

RE: Permission problem

2013-04-30 Thread Kevin Burton
That is what I perceive as the problem. The HDFS file system was created with the user 'hdfs' owning the root ('/'), but for some reason with an M/R job the user 'mapred' needs to have write permission to the root. I don't know how to satisfy both conditions. That is one reason that I relaxed the per

Hadoop Avro Question

2013-04-30 Thread Rahul Bhattacharjee
Hi, When dealing with Avro data files in MR jobs we use AvroMapper. I noticed that the output K and V of AvroMapper aren't Writable, and neither is the key comparable (these are AvroKey and AvroValue). As the general serialization mechanism is Writable, how are the K,V pairs in the case of Avro tr

RE: Permission problem

2013-04-30 Thread Kevin Burton
I have relaxed it even further so now it is 775 kevin@devUbuntu05:/var/log/hadoop-0.20-mapreduce$ hadoop fs -ls -d / Found 1 items drwxrwxr-x - hdfs supergroup 0 2013-04-29 15:43 / But I still get this error: 2013-04-30 07:43:02,520 FATAL org.apache.hadoop.mapred.JobTracker

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Nitin Pawar
I don't think you can control how many reducers can run in parallel via the framework. The other way to do this is to increase the memory given to an individual reducer so that the tasktracker will be limited by memory from launching more reducers at the same time, and they will queue up. You can try setting up this ma

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Han JU
Yes.. In the conf file of my cluster, mapred.tasktracker.reduce.tasks.maximum is 8. And for this job, I want it to be 4. I set it through conf and build the job with this conf, then submit it. But Hadoop launches 8 reducers per datanode... 2013/4/30 Nitin Pawar > so basically if I understand corre

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Nitin Pawar
So basically, if I understand correctly, you want to limit the # of reducers executing in parallel only for this job? On Tue, Apr 30, 2013 at 4:02 PM, Han JU wrote: > Thanks. > > In fact I don't want to set reducer or mapper numbers, they are fine. > I want to set the reduce slot capacity of my cl

Re: Relations ship between HDFS_BYTE_READ and Map input bytes

2013-04-30 Thread YouPeng Yang
Hi Pralabh, 1. The Map input bytes counter belongs to the MapReduce framework. Hadoop: The Definitive Guide explains that: The number of bytes of uncompressed input consumed by all the maps in the job. Incremented every time a record is read from a RecordReader and passed to the map’s map() me
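
To compare the two values programmatically once a job finishes, the counters can be looked up by group and name; a sketch against the old (mapred) API, where the group strings are the Hadoop 1.x internal names and may differ in other versions:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterCheck {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();  // mapper/reducer/paths configured elsewhere
        RunningJob running = JobClient.runJob(conf);
        Counters counters = running.getCounters();
        // Group/counter names below are assumptions based on the Hadoop 1.x layout.
        long mapInputBytes = counters.findCounter(
            "org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_BYTES").getCounter();
        long hdfsBytesRead = counters.findCounter(
            "FileSystemCounters", "HDFS_BYTES_READ").getCounter();
        System.out.println("Map input bytes = " + mapInputBytes
            + ", HDFS_BYTES_READ = " + hdfsBytesRead);
      }
    }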

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Han JU
Thanks. In fact I don't want to set reducer or mapper numbers, they are fine. I want to set the reduce slot capacity of my cluster when it executes my specific job. Say I have 100 reduce tasks for this job, I want my cluster to execute 4 of them at the same time, not 8 of them at the same time, on

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Nitin Pawar
Forgot to add: there is a similar method for the reducer as well, job.setNumReduceTasks(0); On Tue, Apr 30, 2013 at 3:56 PM, Nitin Pawar wrote: > The mapred.tasktracker.reduce.tasks.maximum parameter sets the > maximum number of reduce tasks that may be run by an individual TaskTracker > se

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Nitin Pawar
The mapred.tasktracker.reduce.tasks.maximum parameter sets the maximum number of reduce tasks that may be run by an individual TaskTracker server at one time. This is not a per-job configuration. The number of map tasks for a given job is driven by the number of input splits and not by the
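
In other words, the only reduce-side knob that is genuinely per job is the total number of reduce tasks; a small sketch of the distinction (the value 4 is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceTasksExample {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Has no effect here: mapred.tasktracker.reduce.tasks.maximum is read by each
        // TaskTracker daemon from its own mapred-site.xml at start-up, not from the job.
        conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");

        Job job = new Job(conf, "reduce-count-demo");
        // This *is* per job: run 4 reduce tasks in total for this job.
        job.setNumReduceTasks(4);
        return job;
      }
    }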

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Han JU
Thanks Nitin. What I need is to set the slots only for a specific job, not for the whole cluster conf. But what I did does NOT work ... Have I done something wrong? 2013/4/30 Nitin Pawar > The config you are setting is for the job only. > > But if you want to reduce the slots on tasktrackers then you wi

Re: Set reducer capacity for a specific M/R job

2013-04-30 Thread Nitin Pawar
The config you are setting is for the job only. But if you want to reduce the slots on tasktrackers then you will need to edit the tasktracker conf and restart the tasktrackers. On Apr 30, 2013 3:30 PM, "Han JU" wrote: > Hi, > > I want to change the cluster's capacity of reduce slots on a per job > basis. Orig

Set reducer capacity for a specific M/R job

2013-04-30 Thread Han JU
Hi, I want to change the cluster's capacity of reduce slots on a per job basis. Originally I have 8 reduce slots for a tasktracker. I did: conf.set("mapred.tasktracker.reduce.tasks.maximum", "4"); ... Job job = new Job(conf, ...) And in the web UI I can see that for this job, the max reduce tas