Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Bejoy Ks
Hi Amit

Are you seeing any errors or warnings on JT logs?

Regards
Bejoy KS


Task Trackers accumulation

2013-04-16 Thread dylan
Hi 

I found that a task tracker still appears on the web interface after I
kill the task tracker process and then restart it.

The old task tracker entry remains, no matter how many times I repeat the
kill-restart cycle.

Only restarting the job tracker solved my problem.



 


VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
Hi,

I have a question about JVM reuse in Hadoop. I understand the purpose of
JVM reuse, but I am wondering how useful it actually is.

For example, for JVM reuse to kick in, more than one map task of the same
job has to be scheduled on a single node. Since Hadoop tries to spawn
mappers on the nodes that actually hold the data, it seems it would rarely
happen that multiple mappers of one job land on a single task tracker. And
even if a single node does get multiple mappers, it might as well run them
in parallel in multiple JVMs rather than sequentially in a single JVM.

I am sure I am missing something here; please help me find it.

Thanks,
Rahul


Hadoop sampler related query!

2013-04-16 Thread Rahul Bhattacharjee
Hi,

I have a question about Hadoop's InputSampler, which is used to examine the
data set up front using random selection, sampling, etc. It is mainly used
for total sort, and it is also used in Pig's skewed join implementation.

The question here is -

Mapper<K, V, OK, OV>

K and V are the input key and value of the mapper, essentially coming from
the input format. OK and OV are the output key and value emitted by the
mapper.

Looking at the InputSampler's code, it looks like it creates the partitions
based on the input key of the mapper.

I think the partitions should be created from the output key (OK), and the
output key's sort comparator should be used for sorting the samples.

If partitioning is done on the input key and the mapper emits a different
key, then the total sort wouldn't hold.

Or is there a condition that the InputSampler is only meant to be used with
a mapper that does not change the key (an identity mapper)?
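
For reference, the kind of total-sort setup I am looking at is roughly the
following. This is only a sketch with the old mapred API and placeholder
paths and key types, not my actual code, and it relies on the default
identity mapper, which is exactly the constraint I am asking about:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TotalSortSketch.class);
    conf.setJobName("total sort sketch");

    // No mapper/reducer set: the default identity mapper passes the input
    // key straight through, so the sampled (input) keys and the map output
    // keys have the same type and ordering.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(4);   // more than one reducer, so partitioning matters

    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    FileInputFormat.setInputPaths(conf, input);
    FileOutputFormat.setOutputPath(conf, output);

    // The sampler reads keys through the InputFormat, i.e. the mapper's INPUT keys.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);

    conf.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(conf, new Path(input, "_partitions"));
    InputSampler.writePartitionFile(conf, sampler);

    JobClient.runJob(conf);
  }
}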


Thanks,
Rahul


Re: VM reuse!

2013-04-16 Thread Bejoy Ks
Hi Rahul

Consider larger clusters and jobs that involve larger input data sets. The
data would be spread across the whole cluster, and a single node might hold
several blocks of that data set. Imagine you have a cluster with 100 map
slots and your job has 500 map tasks: in that case there will be multiple
map tasks running on a single task tracker, based on slot availability.

If you enable JVM reuse, all tasks of that job on a single TaskTracker use
the same JVM. The benefit is just the time you save in spawning and cleaning
up a JVM for each individual task.
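
To turn it on, the knob is mapred.job.reuse.jvm.num.tasks. A rough per-job
sketch with the old mapred API (MyJob is just a placeholder driver class,
not anything from your code):

import org.apache.hadoop.mapred.JobConf;

// -1 means no limit: all tasks of this job on a given TaskTracker share one
// JVM; a positive N caps reuse at N tasks per JVM. The default is 1 (no reuse).
JobConf conf = new JobConf(MyJob.class);   // MyJob: placeholder driver class
conf.setNumTasksToExecutePerJvm(-1);       // same as mapred.job.reuse.jvm.num.tasks = -1

Note that reuse is per job and per TaskTracker; JVMs are never shared across
different jobs.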




On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Hi,
>
> I have a question related to VM reuse in Hadoop.I now understand the
> purpose of VM reuse , but I am wondering how is it useful.
>
> Example. for VM reuse to be effective or kicked in , we need more than one
> mapper task to be submitted to a single node (for the same job).Hadoop
> would consider spawning mappers into nodes which actually contains the data
> , it might rarely happen that multiple mappers are allocated to a single
> task tracker. And even if a single task nodes gets to run multiple mappers
> then it might as well run in parallel in multiple VM rather than
> sequentially in a single VM.
>
> I am sure I am missing some link here , please help me find that.
>
> Thanks,
> Rahul
>


HW infrastructure for Hadoop

2013-04-16 Thread Tadas Makčinskas
We are thinking of deploying a cluster of about 50 nodes and are trying to
figure out what a good hardware configuration would be (disks and I/O, RAM,
CPUs, network). I cannot find any examples of setups that people have run
and found to work well and cost-effectively.

If anybody could share the infrastructure they settled on, it would be a
tremendous help, rather than trying to figure it out on our own.

Regards, Tadas




Re: Task Trackers accumulation

2013-04-16 Thread Harsh J
This is the regular behavior. You should see the old entry disappear after
the ~10 minute timeout period. The reason is that every TT starts on an
ephemeral port and therefore appears as a new TT to the JT (TTs aren't
persistent members of a cluster).


On Tue, Apr 16, 2013 at 2:01 PM, dylan  wrote:

> Hi 
>
> I found that the task tracker still appear on the web interface after I
> killed the task tracker process, then I tried to restart it again,
>
> But old task tracker remains. No matter how many times I repeated it
> kill-restart.
>
> ** **
>
> Only restarting the job tracker solved my problem. 
>
> 
>
> ** **
>



-- 
Harsh J

Re: HW infrastructure for Hadoop

2013-04-16 Thread MARCOS MEDRADO RUBINELLI
Tadas,

"Hadoop Operations" has pretty useful, up-to-date information. The chapter on 
hardware selection is available here: 
http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689

Regards,
Marcos

On 16-04-2013 07:13, Tadas Makčinskas wrote:
We are thinking to distribute like 50 node cluster. And trying to figure out 
what would be a good HW infrastructure (Disks – I/O‘s, RAM, CPUs, network). I 
cannot actually come around any examples that people ran and found it working 
well and cost effectively.

If anybody could share their best considered infrastructure. Would be a 
tremendous help not trying to figure it out on our own.

Regards, Tadas





Re: HW infrastructure for Hadoop

2013-04-16 Thread Bejoy Ks
+1 for "Hadoop Operations"


On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI <
marc...@buscapecompany.com> wrote:

>  Tadas,
>
> "Hadoop Operations" has pretty useful, up-to-date information. The chapter
> on hardware selection is available here:
> http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689
>
> Regards,
> Marcos
>
> On 16-04-2013 07:13, Tadas Makčinskas wrote:
>
>  We are thinking to distribute like 50 node cluster. And trying to figure
> out what would be a good HW infrastructure (Disks – I/O‘s, RAM, CPUs,
> network). I cannot actually come around any examples that people ran and
> found it working well and cost effectively. 
>
> ** **
>
> If anybody could share their best considered infrastructure. Would be a
> tremendous help not trying to figure it out on our own.
>
> ** **
>
> Regards, Tadas
>
> ** **
>
> ** **
>
>
>


Re: VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
OK, thanks Bejoy.

So it is only possible in certain scenarios, like the one you mentioned:
many more map tasks than available map slots.

Regards,
Rahul


On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks  wrote:

> Hi Rahul
>
> If you look at larger cluster and jobs that involve larger input data
> sets. The data would be spread across the whole cluster, and a single node
> might have  various blocks of that entire data set. Imagine you have a
> cluster with 100 map slots and your job has 500 map tasks, now in that case
> there should be multiple map tasks in a single task tracker based on slot
> availability.
>
> Here if you enable jvm reuse, all tasks related to a job on a single
> TaskTracker would use the same jvm. The benefit here is just the time you
> are saving in spawning and cleaning up jvm for individual tasks.
>
>
>
>
> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
> rahul.rec@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question related to VM reuse in Hadoop.I now understand the
>> purpose of VM reuse , but I am wondering how is it useful.
>>
>> Example. for VM reuse to be effective or kicked in , we need more than
>> one mapper task to be submitted to a single node (for the same job).Hadoop
>> would consider spawning mappers into nodes which actually contains the data
>> , it might rarely happen that multiple mappers are allocated to a single
>> task tracker. And even if a single task nodes gets to run multiple mappers
>> then it might as well run in parallel in multiple VM rather than
>> sequentially in a single VM.
>>
>> I am sure I am missing some link here , please help me find that.
>>
>> Thanks,
>> Rahul
>>
>
>


Re: VM reuse!

2013-04-16 Thread Bejoy Ks
When you process larger data volumes, this is the case most of the time. :)

Even for a job with a smaller input, if you have 2 blocks on a single node,
the JT may schedule both tasks on the same TT if free slots are available,
and those tasks can take advantage of JVM reuse.

Which TT the JT assigns tasks to depends entirely on data locality and the
availability of task slots.


On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Ok, Thanks Bejoy.
>
> Only in some typical scenarios it's possible , like the one that you have
> mentioned.
> Much more number of mappers and less number of mappers slots.
>
> Regards,
> Rahul
>
>
> On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks  wrote:
>
>> Hi Rahul
>>
>> If you look at larger cluster and jobs that involve larger input data
>> sets. The data would be spread across the whole cluster, and a single node
>> might have  various blocks of that entire data set. Imagine you have a
>> cluster with 100 map slots and your job has 500 map tasks, now in that case
>> there should be multiple map tasks in a single task tracker based on slot
>> availability.
>>
>> Here if you enable jvm reuse, all tasks related to a job on a single
>> TaskTracker would use the same jvm. The benefit here is just the time you
>> are saving in spawning and cleaning up jvm for individual tasks.
>>
>>
>>
>>
>> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
>> rahul.rec@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question related to VM reuse in Hadoop.I now understand the
>>> purpose of VM reuse , but I am wondering how is it useful.
>>>
>>> Example. for VM reuse to be effective or kicked in , we need more than
>>> one mapper task to be submitted to a single node (for the same job).Hadoop
>>> would consider spawning mappers into nodes which actually contains the data
>>> , it might rarely happen that multiple mappers are allocated to a single
>>> task tracker. And even if a single task nodes gets to run multiple mappers
>>> then it might as well run in parallel in multiple VM rather than
>>> sequentially in a single VM.
>>>
>>> I am sure I am missing some link here , please help me find that.
>>>
>>> Thanks,
>>> Rahul
>>>
>>
>>
>


Re: VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
Agreed.

I am not sure about the behaviour of the JT, though. Consider this situation:

N1 has split 1 and split 2 of a file and two map slots. N2 also has split 2
and one free map slot. I think the JT would probably schedule one map on N1
and another on N2, for better parallel I/O, rather than scheduling two
mappers on N1 and none on N2.

I do not think the JT considers whether JVM reuse is enabled. However, it
could take this into account along with data locality. When a job writer
asks for JVM reuse, it would not be entirely wrong to assume that there is
something in the job that takes a long time to initialize, making reuse
worthwhile. In those scenarios, the JT might consider allocating multiple
mappers to a single node.

Thinking aloud, the situation you mentioned is quite possible in an
unbalanced cluster, where the data is distributed within a small subset of
the nodes.

Thanks,
Rahul




On Tue, Apr 16, 2013 at 8:14 PM, Bejoy Ks  wrote:

>  When you process larger data volumes, this is the case mostly. :)
>
>
> Say you have a job with smaller input size and if you have  2 blocks on a
> single node and then the JT may schedule two tasks on the same TT if there
> are available free slots. So those tasks can take advantage of JVM reuse.
>
> Which TT the JT would assign tasks is totally dependent on data locality
> and availability of task slots.
>
>
> On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee <
> rahul.rec@gmail.com> wrote:
>
>> Ok, Thanks Bejoy.
>>
>> Only in some typical scenarios it's possible , like the one that you have
>> mentioned.
>> Much more number of mappers and less number of mappers slots.
>>
>> Regards,
>> Rahul
>>
>>
>> On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks  wrote:
>>
>>> Hi Rahul
>>>
>>> If you look at larger cluster and jobs that involve larger input data
>>> sets. The data would be spread across the whole cluster, and a single node
>>> might have  various blocks of that entire data set. Imagine you have a
>>> cluster with 100 map slots and your job has 500 map tasks, now in that case
>>> there should be multiple map tasks in a single task tracker based on slot
>>> availability.
>>>
>>> Here if you enable jvm reuse, all tasks related to a job on a single
>>> TaskTracker would use the same jvm. The benefit here is just the time you
>>> are saving in spawning and cleaning up jvm for individual tasks.
>>>
>>>
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
>>> rahul.rec@gmail.com> wrote:
>>>
 Hi,

 I have a question related to VM reuse in Hadoop.I now understand the
 purpose of VM reuse , but I am wondering how is it useful.

 Example. for VM reuse to be effective or kicked in , we need more than
 one mapper task to be submitted to a single node (for the same job).Hadoop
 would consider spawning mappers into nodes which actually contains the data
 , it might rarely happen that multiple mappers are allocated to a single
 task tracker. And even if a single task nodes gets to run multiple mappers
 then it might as well run in parallel in multiple VM rather than
 sequentially in a single VM.

 I am sure I am missing some link here , please help me find that.

 Thanks,
 Rahul

>>>
>>>
>>
>


Re: Hadoop sampler related query!

2013-04-16 Thread Rahul Bhattacharjee
Mighty users@hadoop

Anyone on this?


On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee <
rahul.rec@gmail.com> wrote:

> Hi,
>
> I have a question related to Hadoop's input sampler ,which is used for
> investigating the data set before hand using random selection , sampling
> etc .Mainly used for total sort , used in pig's skewed join implementation
> as well.
>
> The question here is -
>
> Mapper
>
> K and V are input key and value of the mapper .Essentially coming in from
> the input format. OK and OV are output key and value emitted from the
> mapper.
>
> Looking at the input sample's code ,it looks like it is creating the
> partition based on the input key of the mapper.
>
> I think the partitions should be created considering the output key (OK)
> and the output key sort comparator should be used for sorting the samples.
>
> If partitioning is done based on input key and the mapper emits a
> different key then the total sort wouldn't hold any good.
>
>  Is there is any condition that input sample is to be only used for
> mapper?
>
>
> Thanks,
> Rahul
>
>


Re: VM reuse!

2013-04-16 Thread bejoy . hadoop
Hi Rahul

AFAIK there is no guarantee that 1 task would be on N1 and another on N2. Both 
can be on N1 as well.

JT has no notion of JVM reuse. It doesn't consider that for task scheduling.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Rahul Bhattacharjee 
Date: Tue, 16 Apr 2013 21:13:54 
To: 
Reply-To: user@hadoop.apache.org
Subject: Re: VM reuse!

Agreed.

Not sure about the behavour of JT.Consider the situation.

N1 has split 1 and split 2 of a file and there are two map slots.N2 has
split 2 and it also has one mapper slot. I think the JT would probably
schedule a single map in N1 and another map in N2.For better parallel IO.
Rather than scheduling two mappers in N1 and no task in N2.

I do not think the JT considers whether VM reuse is enabled.However it can
consider this into account along with data locality aspect. When a job
writer asked for VM reuse then its would not be entirely wrong to assume
that there might be certain things in the job which takes long time to
initialize and hence making reuse suitable. In those scenarious , JT might
consider allocating multiple mappers in a single node.

Thinking aloud , the situation that you have mentioned is quite possible in
a unbalanced cluster. Where the data is distributed within a small set of
nodes of the entire cluster.

Thanks,
Rahul




On Tue, Apr 16, 2013 at 8:14 PM, Bejoy Ks  wrote:

>  When you process larger data volumes, this is the case mostly. :)
>
>
> Say you have a job with smaller input size and if you have  2 blocks on a
> single node and then the JT may schedule two tasks on the same TT if there
> are available free slots. So those tasks can take advantage of JVM reuse.
>
> Which TT the JT would assign tasks is totally dependent on data locality
> and availability of task slots.
>
>
> On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee <
> rahul.rec@gmail.com> wrote:
>
>> Ok, Thanks Bejoy.
>>
>> Only in some typical scenarios it's possible , like the one that you have
>> mentioned.
>> Much more number of mappers and less number of mappers slots.
>>
>> Regards,
>> Rahul
>>
>>
>> On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks  wrote:
>>
>>> Hi Rahul
>>>
>>> If you look at larger cluster and jobs that involve larger input data
>>> sets. The data would be spread across the whole cluster, and a single node
>>> might have  various blocks of that entire data set. Imagine you have a
>>> cluster with 100 map slots and your job has 500 map tasks, now in that case
>>> there should be multiple map tasks in a single task tracker based on slot
>>> availability.
>>>
>>> Here if you enable jvm reuse, all tasks related to a job on a single
>>> TaskTracker would use the same jvm. The benefit here is just the time you
>>> are saving in spawning and cleaning up jvm for individual tasks.
>>>
>>>
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee <
>>> rahul.rec@gmail.com> wrote:
>>>
 Hi,

 I have a question related to VM reuse in Hadoop.I now understand the
 purpose of VM reuse , but I am wondering how is it useful.

 Example. for VM reuse to be effective or kicked in , we need more than
 one mapper task to be submitted to a single node (for the same job).Hadoop
 would consider spawning mappers into nodes which actually contains the data
 , it might rarely happen that multiple mappers are allocated to a single
 task tracker. And even if a single task nodes gets to run multiple mappers
 then it might as well run in parallel in multiple VM rather than
 sequentially in a single VM.

 I am sure I am missing some link here , please help me find that.

 Thanks,
 Rahul

>>>
>>>
>>
>



Re: HW infrastructure for Hadoop

2013-04-16 Thread Amal G Jose
+1 for "Hadoop Operations".
There is also a document from Hortonworks that explains Hadoop cluster
infrastructure. It is vendor-specific, but after reading "Hadoop Operations"
it is a useful reference for a clear overview.


On Tue, Apr 16, 2013 at 4:50 PM, Bejoy Ks  wrote:

> +1 for "Hadoop Operations"
>
>
> On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI <
> marc...@buscapecompany.com> wrote:
>
>>  Tadas,
>>
>> "Hadoop Operations" has pretty useful, up-to-date information. The
>> chapter on hardware selection is available here:
>> http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689
>>
>> Regards,
>> Marcos
>>
>> On 16-04-2013 07:13, Tadas Makčinskas wrote:
>>
>>  We are thinking to distribute like 50 node cluster. And trying to
>> figure out what would be a good HW infrastructure (Disks – I/O‘s, RAM,
>> CPUs, network). I cannot actually come around any examples that people ran
>> and found it working well and cost effectively. 
>>
>> ** **
>>
>> If anybody could share their best considered infrastructure. Would be a
>> tremendous help not trying to figure it out on our own.
>>
>> ** **
>>
>> Regards, Tadas
>>
>> ** **
>>
>> ** **
>>
>>
>>
>


Re: HW infrastructure for Hadoop

2013-04-16 Thread Adam Smieszny
There are also reference architectures available from a variety of hardware
vendors - the likes of Dell, HP, IBM, Cisco, and others. They often outline
a reasonable framework for disk/cpu/memory mix, and usually include some
description of network as well. If you have a preferred hardware vendor,
that would be another route to pursue.

Cheers,
Adam


On Tue, Apr 16, 2013 at 12:51 PM, Amal G Jose  wrote:

> +1 for Hadoop operations.
> There is one document from Hortonworks that explain hadoop cluster
> infrastructure. That doc is a brand specific. But after referring the
> hadoop operations, we can refer this doc to get a clear overview.
>
>
> On Tue, Apr 16, 2013 at 4:50 PM, Bejoy Ks  wrote:
>
>> +1 for "Hadoop Operations"
>>
>>
>> On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI <
>> marc...@buscapecompany.com> wrote:
>>
>>>  Tadas,
>>>
>>> "Hadoop Operations" has pretty useful, up-to-date information. The
>>> chapter on hardware selection is available here:
>>> http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689
>>>
>>> Regards,
>>> Marcos
>>>
>>> On 16-04-2013 07:13, Tadas Makčinskas wrote:
>>>
>>>  We are thinking to distribute like 50 node cluster. And trying to
>>> figure out what would be a good HW infrastructure (Disks – I/O‘s, RAM,
>>> CPUs, network). I cannot actually come around any examples that people ran
>>> and found it working well and cost effectively. 
>>>
>>> ** **
>>>
>>> If anybody could share their best considered infrastructure. Would be a
>>> tremendous help not trying to figure it out on our own.
>>>
>>> ** **
>>>
>>> Regards, Tadas
>>>
>>> ** **
>>>
>>> ** **
>>>
>>>
>>>
>>
>


-- 
Adam Smieszny
Cloudera | Systems Engineer | http://www.linkedin.com/in/adamsmieszny
917.830.4156


Re: threads quota is exceeded question

2013-04-16 Thread Thanh Do
Hadoop by default limits balancing to 5 concurrent threads per node. That
is what causes your problem.


On Mon, Apr 15, 2013 at 10:24 PM, rauljin  wrote:

> Hi,
>
> The Hadoop cluster is running the balancer.
>
> And one datanode, 172.16.80.72, reports:
>
> DataNode: Not able to copy block -507744952197054725 to
> /172.16.80.73:51658 because threads quota is exceeded.
>
>
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 172.16.80.72:50010, storageID=DS-1202844662-172.16
> .80.72-50010-1330656432004, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_8443528692263789109_8159545 is not valid.
>
> at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
> at java.lang.Thread.run(Thread.java:636)
>
>
>And other datanode:
>
>
>
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 172.16.80.73:50010
> , storageID=DS-1771394657-172.16.80.73-50010-1362474580654, infoPort=50075, 
> ipcPort=50020):DataXceiver
> java.io.EOFException
> at java.io.DataInputStream.readByte(DataInputStream.java:267)
>
> at 
> org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:84)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:92)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.replaceBlock(DataXceiver.java:580)
>
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:115)
> at java.lang.Thread.run(Thread.java:636)
>
>   At that moment, HDFS was not available.
>
>
> I restarted the 172.16.80.72 datanode service, and the service is OK now.
>
>
>
> What causes this problem?
>
> Any ideas?
> Thanks!
>
>
>
>
>
>
>
>
>
>
> --
> rauljin
>


Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Amit Sela
Nothing on JT log, but as I mentioned I see this in the client log:

[WARN ] org.apache.hadoop.mapred.JobClient   » Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
[INFO ] org.apache.hadoop.mapred.JobClient   » Cleaning up the staging
area
hdfs://hadoop-name-node.address:8000{hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034

And this on the NN log:

INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
transactions: 1 Total time for transactions(ms): 0Number of transactions
batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 0
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=
clientusername ip=/xx.xx.xx.xxx cmd=mkdirs src=
{hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034
dst=null
perm=clientusername:supergroup:rwxr-xr-x
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=
clientusername ip=/xx.xx.xx.xxx cmd=setPermission
src={hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034
  dst=null perm=clientusername:supergroup:rwx--

Thanks.


On Tue, Apr 16, 2013 at 10:30 AM, Bejoy Ks  wrote:

> Hi Amit
>
> Are you seeing any errors or warnings on JT logs?
>
> Regards
> Bejoy KS
>


RE: How to configure mapreduce archive size?

2013-04-16 Thread Xia_Yang
Hi Hemanth,

I am not explicitly using DistributedCache in my code, and I am not using any
command line arguments like -libjars either.

Where can I find job.xml? I am using the HBase MapReduce API and am not
setting any job.xml myself.

The key point is that I want to limit the size of
/tmp/hadoop-root/mapred/local/archive. Could you help?

Thanks.

Xia

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 9:09 PM
To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

TableMapReduceUtil has APIs like addDependencyJars which will use 
DistributedCache. I don't think you are explicitly using that. Are you using 
any command line arguments like -libjars etc when you are launching the 
MapReduce job ? Alternatively you can check job.xml of the launched MR job to 
see if it has set properties having prefixes like mapred.cache. If nothing's 
set there, it would seem like some other process or user is adding jars to 
DistributedCache when using the cluster.

Thanks
hemanth



On Thu, Apr 11, 2013 at 11:40 PM, mailto:xia_y...@dell.com>> 
wrote:
Hi Hemanth,

Attached is some sample folders within my 
/tmp/hadoop-root/mapred/local/archive. There are some jar and class files 
inside.

My application uses MapReduce job to do purge Hbase old data. I am using basic 
HBase MapReduce API to delete rows from Hbase table. I do not specify to use 
Distributed cache. Maybe HBase use it?

Some code here:

   Scan scan = new Scan();
   scan.setCaching(500);// 1 is the default in Scan, which will be 
bad for MapReduce jobs
   scan.setCacheBlocks(false);  // don't set to true for MR jobs
   scan.setTimeRange(Long.MIN_VALUE, timestamp);
   // set other scan attrs
   // the purge start time
   Date date=new Date();
   TableMapReduceUtil.initTableMapperJob(
 tableName,// input table
 scan,   // Scan instance to control CF and attribute 
selection
 MapperDelete.class, // mapper class
 null, // mapper output key
 null,  // mapper output value
 job);

   job.setOutputFormatClass(TableOutputFormat.class);
   job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
   job.setNumReduceTasks(0);

   boolean b = job.waitForCompletion(true);

From: Hemanth Yamijala 
[mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 12:29 AM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Could you paste the contents of the directory ? Not sure whether that will 
help, but just giving it a shot.

What application are you using ? Is it custom MapReduce jobs in which you use 
Distributed cache (I guess not) ?

Thanks
Hemanth

On Thu, Apr 11, 2013 at 3:34 AM, mailto:xia_y...@dell.com>> 
wrote:
Hi Arun,

I stopped my application, then restarted my hbase (which include hadoop). After 
that I start my application. After one evening, my 
/tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work.

Is this the right place to change the value?

"local.cache.size" in file core-default.xml, which is in hadoop-core-1.0.3.jar

Thanks,

Jane

From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, April 10, 2013 2:45 PM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Ensure no jobs are running (cache limit is only for non-active cache files), 
check after a little while (takes sometime for the cleaner thread to kick in).

Arun

On Apr 11, 2013, at 2:29 AM, mailto:xia_y...@dell.com>> 
mailto:xia_y...@dell.com>> wrote:

Hi Hemanth,

For the hadoop 1.0.3, I can only find "local.cache.size" in file 
core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in 
mapred-default.xml.

I updated the value in file default.xml and changed the value to 50. This 
is just for my testing purpose. However, the folder 
/tmp/hadoop-root/mapred/local/archive already goes more than 1G now. Looks like 
it does not do the work. Could you advise if what I did is correct?

  local.cache.size
  50

Thanks,

Xia

From: Hemanth Yamijala 
[mailto:yhema...@thoughtworks.com]
Sent: Monday, April 08, 2013 9:09 PM
To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Hi,

This directory is used as part of the 'DistributedCache' feature. 
(http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). 
There is a configuration key "local.cache.size" which controls the amount of 
data stored under DistributedCache. The default limit is 10GB. However, the 
files under this cannot be deleted if they are being used. Also, some 
frameworks on Hadoop could be using DistributedCache transparently to you.

So you could check what is being stored 

Jobtracker memory issues due to FileSystem$Cache

2013-04-16 Thread Marcin Mejran
We've recently run into jobtracker memory issues on our new hadoop cluster. A 
heap dump shows that there are thousands of copies of DistributedFileSystem 
kept in FileSystem$Cache, a bit over one for each job run on the cluster and 
their jobconf objects support this view. I believe these are created when the 
.staging directories get cleaned up but I may be wrong on that.

From what I can tell in the dump, the username (probably not ugi, hard to
tell), scheme and authority parts of the Cache$Key are the same across
multiple objects in FileSystem$Cache. I can only assume that the
UserGroupInformation piece differs somehow every time it's created.

We're using CDH4.2, MR1, CentOS 6.3 and Java 1.6_31. Kerberos, ldap and so on 
are not enabled.

Is there any known reason for this type of behavior?

Thanks,
-Marcin


Get Hadoop cluster topology

2013-04-16 Thread Diwakar Sharma
I understand that when Namenode starts up it reads fsimage to get the state
of HDFS and applies the edits file to complete it.

But what about the cluster topology? Does the namenode read config files
like core-site.xml, slaves, etc. to determine its cluster topology, or does
it use an API to build it?


Thanks
Diwakar


Re: How to configure mapreduce archive size?

2013-04-16 Thread bejoy . hadoop
You can get your Job.xml for each jobs from The JT web UI. Click on the job, on 
the specific job page you'll get this.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: 
Date: Tue, 16 Apr 2013 12:45:26 
To: 
Reply-To: user@hadoop.apache.org
Subject: RE: How to configure mapreduce archive size?

Hi Hemanth,

I did not explicitly using DistributedCache in my code. I did not use any 
command line arguments like -libjars neither.

Where can I find job.xml? I am using Hbase MapReduce API and not setting any 
job.xml.

The key point is I want to limit the size of 
/tmp/hadoop-root/mapred/local/archive. Could you help?

Thanks.

Xia

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 9:09 PM
To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

TableMapReduceUtil has APIs like addDependencyJars which will use 
DistributedCache. I don't think you are explicitly using that. Are you using 
any command line arguments like -libjars etc when you are launching the 
MapReduce job ? Alternatively you can check job.xml of the launched MR job to 
see if it has set properties having prefixes like mapred.cache. If nothing's 
set there, it would seem like some other process or user is adding jars to 
DistributedCache when using the cluster.

Thanks
hemanth



On Thu, Apr 11, 2013 at 11:40 PM, mailto:xia_y...@dell.com>> 
wrote:
Hi Hemanth,

Attached is some sample folders within my 
/tmp/hadoop-root/mapred/local/archive. There are some jar and class files 
inside.

My application uses MapReduce job to do purge Hbase old data. I am using basic 
HBase MapReduce API to delete rows from Hbase table. I do not specify to use 
Distributed cache. Maybe HBase use it?

Some code here:

   Scan scan = new Scan();
   scan.setCaching(500);// 1 is the default in Scan, which will be 
bad for MapReduce jobs
   scan.setCacheBlocks(false);  // don't set to true for MR jobs
   scan.setTimeRange(Long.MIN_VALUE, timestamp);
   // set other scan attrs
   // the purge start time
   Date date=new Date();
   TableMapReduceUtil.initTableMapperJob(
 tableName,// input table
 scan,   // Scan instance to control CF and attribute 
selection
 MapperDelete.class, // mapper class
 null, // mapper output key
 null,  // mapper output value
 job);

   job.setOutputFormatClass(TableOutputFormat.class);
   job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
   job.setNumReduceTasks(0);

   boolean b = job.waitForCompletion(true);

From: Hemanth Yamijala 
[mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 12:29 AM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Could you paste the contents of the directory ? Not sure whether that will 
help, but just giving it a shot.

What application are you using ? Is it custom MapReduce jobs in which you use 
Distributed cache (I guess not) ?

Thanks
Hemanth

On Thu, Apr 11, 2013 at 3:34 AM, mailto:xia_y...@dell.com>> 
wrote:
Hi Arun,

I stopped my application, then restarted my hbase (which include hadoop). After 
that I start my application. After one evening, my 
/tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work.

Is this the right place to change the value?

"local.cache.size" in file core-default.xml, which is in hadoop-core-1.0.3.jar

Thanks,

Jane

From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, April 10, 2013 2:45 PM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Ensure no jobs are running (cache limit is only for non-active cache files), 
check after a little while (takes sometime for the cleaner thread to kick in).

Arun

On Apr 11, 2013, at 2:29 AM, mailto:xia_y...@dell.com>> 
mailto:xia_y...@dell.com>> wrote:

Hi Hemanth,

For the hadoop 1.0.3, I can only find "local.cache.size" in file 
core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in 
mapred-default.xml.

I updated the value in file default.xml and changed the value to 50. This 
is just for my testing purpose. However, the folder 
/tmp/hadoop-root/mapred/local/archive already goes more than 1G now. Looks like 
it does not do the work. Could you advise if what I did is correct?

  local.cache.size
  50

Thanks,

Xia

From: Hemanth Yamijala 
[mailto:yhema...@thoughtworks.com]
Sent: Monday, April 08, 2013 9:09 PM
To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Hi,

This directory is used as part of the 'DistributedCache' feature. 
(http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache

Re: Get Hadoop cluster topology

2013-04-16 Thread shashwat shriparv
On Tue, Apr 16, 2013 at 11:34 PM, Diwakar Sharma
wrote:

> uster topology or uses an API to build it.


If you stop and start the cluster, Hadoop reads these configuration files
for sure.



∞
Shashwat Shriparv


Re: How to configure mapreduce archive size?

2013-04-16 Thread bejoy . hadoop

Also, you need to change the value for 'local.cache.size' in core-site.xml,
not in core-default.xml.

If you need to override any property in the config files, do it in
*-site.xml, not in *-default.xml.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: bejoy.had...@gmail.com
Date: Tue, 16 Apr 2013 18:05:51 
To: 
Reply-To: bejoy.had...@gmail.com
Subject: Re: How to configure mapreduce archive size?

You can get your Job.xml for each jobs from The JT web UI. Click on the job, on 
the specific job page you'll get this.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: 
Date: Tue, 16 Apr 2013 12:45:26 
To: 
Reply-To: user@hadoop.apache.org
Subject: RE: How to configure mapreduce archive size?

Hi Hemanth,

I did not explicitly using DistributedCache in my code. I did not use any 
command line arguments like -libjars neither.

Where can I find job.xml? I am using Hbase MapReduce API and not setting any 
job.xml.

The key point is I want to limit the size of 
/tmp/hadoop-root/mapred/local/archive. Could you help?

Thanks.

Xia

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 9:09 PM
To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

TableMapReduceUtil has APIs like addDependencyJars which will use 
DistributedCache. I don't think you are explicitly using that. Are you using 
any command line arguments like -libjars etc when you are launching the 
MapReduce job ? Alternatively you can check job.xml of the launched MR job to 
see if it has set properties having prefixes like mapred.cache. If nothing's 
set there, it would seem like some other process or user is adding jars to 
DistributedCache when using the cluster.

Thanks
hemanth



On Thu, Apr 11, 2013 at 11:40 PM, mailto:xia_y...@dell.com>> 
wrote:
Hi Hemanth,

Attached is some sample folders within my 
/tmp/hadoop-root/mapred/local/archive. There are some jar and class files 
inside.

My application uses MapReduce job to do purge Hbase old data. I am using basic 
HBase MapReduce API to delete rows from Hbase table. I do not specify to use 
Distributed cache. Maybe HBase use it?

Some code here:

   Scan scan = new Scan();
   scan.setCaching(500);// 1 is the default in Scan, which will be 
bad for MapReduce jobs
   scan.setCacheBlocks(false);  // don't set to true for MR jobs
   scan.setTimeRange(Long.MIN_VALUE, timestamp);
   // set other scan attrs
   // the purge start time
   Date date=new Date();
   TableMapReduceUtil.initTableMapperJob(
 tableName,// input table
 scan,   // Scan instance to control CF and attribute 
selection
 MapperDelete.class, // mapper class
 null, // mapper output key
 null,  // mapper output value
 job);

   job.setOutputFormatClass(TableOutputFormat.class);
   job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
   job.setNumReduceTasks(0);

   boolean b = job.waitForCompletion(true);

From: Hemanth Yamijala 
[mailto:yhema...@thoughtworks.com]
Sent: Thursday, April 11, 2013 12:29 AM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Could you paste the contents of the directory ? Not sure whether that will 
help, but just giving it a shot.

What application are you using ? Is it custom MapReduce jobs in which you use 
Distributed cache (I guess not) ?

Thanks
Hemanth

On Thu, Apr 11, 2013 at 3:34 AM, mailto:xia_y...@dell.com>> 
wrote:
Hi Arun,

I stopped my application, then restarted my hbase (which include hadoop). After 
that I start my application. After one evening, my 
/tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work.

Is this the right place to change the value?

"local.cache.size" in file core-default.xml, which is in hadoop-core-1.0.3.jar

Thanks,

Jane

From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, April 10, 2013 2:45 PM

To: user@hadoop.apache.org
Subject: Re: How to configure mapreduce archive size?

Ensure no jobs are running (cache limit is only for non-active cache files), 
check after a little while (takes sometime for the cleaner thread to kick in).

Arun

On Apr 11, 2013, at 2:29 AM, mailto:xia_y...@dell.com>> 
mailto:xia_y...@dell.com>> wrote:

Hi Hemanth,

For the hadoop 1.0.3, I can only find "local.cache.size" in file 
core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in 
mapred-default.xml.

I updated the value in file default.xml and changed the value to 50. This 
is just for my testing purpose. However, the folder 
/tmp/hadoop-root/mapred/local/archive already goes more than 1G now. Looks like 
it does not do the work. Could you advise if what I did

Re: Get Hadoop cluster topology

2013-04-16 Thread Nikhil
From http://archive.cloudera.com/cdh/3/hadoop/hdfs_user_guide.html
(Assuming you are using Cloudera Hadoop Distribution 3)

$ hadoop dfsadmin -refreshNodes # would help do the same.

-refreshNodes : Updates the set of hosts allowed to connect to the namenode.
Re-reads the config file to update values defined by dfs.hosts and
dfs.hosts.exclude and reads the entries (hostnames) in those files. Each
entry not defined in dfs.hosts but present in dfs.hosts.exclude is
decommissioned. Each entry defined in both dfs.hosts and dfs.hosts.exclude
is stopped from decommissioning if it has already been marked for
decommission. Entries not present in both lists are decommissioned.

There is also -printTopology switch useful to look at the current topology
view.

-printTopology : Print the topology of the cluster. Display a tree of racks
and the datanodes attached to those racks, as viewed by the NameNode.

In most cases, however, I have seen that updating the topology with wrong
information (such as rack numbers, tabs/spaces) gets the master services
into trouble, and in such cases a restart is required.
I have tried looking for ways to refresh the topology cache on both the
namenode and jobtracker without bouncing them, but this can get a little
tricky.

for more information, see:
http://grokbase.com/t/hadoop/common-user/121yqsme6v/refresh-namenode-topology-cache
.



On Tue, Apr 16, 2013 at 11:39 PM, shashwat shriparv <
dwivedishash...@gmail.com> wrote:

>
> On Tue, Apr 16, 2013 at 11:34 PM, Diwakar Sharma  > wrote:
>
>> uster topology or uses an API to build it.
>
>
> If you stop and start the cluster Hadoop Reads thes configuration files
> for sure.
>
>
>
> ∞
> Shashwat Shriparv
>
>


Querying a Prolog Server from a JVM during a MapReduce Job

2013-04-16 Thread Robert Spurrier
Hello!

I'm working on a research project, and I also happen to be relatively new
to Hadoop/MapReduce. So apologies ahead of time for any glaring errors.

On my local machine, my project runs within a JVM and uses a Java API to
communicate with a Prolog server to do information lookups. I was planning
on deploying my project as the mapper during the MR job, but I am unclear
on how I would access the Prolog server during runtime. Would it be O.K. To
just let the server live and run on each data node while my job is running,
and have each mapper hit the server on its respective node? (let's assume
the server can handle the high volume of queries from the mappers)

I am not even remotely aware of what types of issues will arise when the
mappers (from each of their JVMs/process) query the Prolog server (running
in its own single & separate process on each node). They will only be
querying data from the server, not deleting/updating.


Anything that would make this impossible or what I should be looking out
for?

Thanks
-Robert


Re: Querying a Prolog Server from a JVM during a MapReduce Job

2013-04-16 Thread Steve Lewis
Assuming that the server can handle high volume and multiple queries there
is no reason not to run it on a large and powerful machine outside the
cluster. Nothing prevents your mappers from accessing a server or even,
depending on the design, a custom InputFormat from pulling data from the
server.
I would not try to run copies of the server on datanodes without a very
compelling reason.


On Tue, Apr 16, 2013 at 1:31 PM, Robert Spurrier
wrote:

> Hello!
>
> I'm working on a research project, and I also happen to be relatively new
> to Hadoop/MapReduce. So apologies ahead of time for any glaring errors.
>
> On my local machine, my project runs within a JVM and uses a Java API to
> communicate with a Prolog server to do information lookups. I was planning
> on deploying my project as the mapper during the MR job, but I am unclear
> on how I would access the Prolog server during runtime. Would it be O.K. To
> just let the server live and run on each data node while my job is running,
> and have each mapper hit the server on its respective node? (let's assume
> the server can handle the high volume of queries from the mappers)
>
> I am not even remotely aware of what types of issues will arise when the
> mappers (from each of their JVMs/process) query the Prolog server (running
> in its own single & separate process on each node). They will only be
> querying data from the server, not deleting/updating.
>
>
> Anything that would make this impossible or what I should be looking out
> for?
>
> Thanks
> -Robert
>
>
>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Problem: org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out

2013-04-16 Thread Som Satpathy
Hi All,

I have just set up a CDH cluster on EC2 using cloudera manager 4.5. I have
been trying to run a couple of mapreduce jobs as part of an oozie workflow
but have been blocked by the following exception: (my reducer always hangs
because of this) -

2013-04-17 00:32:02,268 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201304170021_0003_r_00_0 copy failed:
attempt_201304170021_0003_m_00_0 from
ip-10-174-49-51.us-west-1.compute.internal
2013-04-17 00:32:02,269 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at
java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:395)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:234)
at sun.net.www.http.HttpClient.New(HttpClient.java:307)
at sun.net.www.http.HttpClient.New(HttpClient.java:324)
at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1573)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1530)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1466)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1360)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1292)

2013-04-17 00:32:02,269 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201304170021_0003_r_00_0: Failed fetch #1 from
attempt_201304170021_0003_m_00_0
2013-04-17 00:32:02,269 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201304170021_0003_r_00_0 adding host
ip-10-174-49-51.us-west-1.compute.internal to penalty box, next contact in
12 seconds

Any suggestions that can help me get around this?

Really appreciate any help here.

Thanks,
Som


Re: Task Trackers accumulation

2013-04-16 Thread dylan
Thank you very much.

This morning I found it and tested it again.

The results were as you said.

From: Harsh J [mailto:ha...@cloudera.com]
Sent: April 16, 2013 18:21
To: 
Subject: Re: Task Trackers accumulation

 

This is the regular behavior. You should see it disappear after ~10 mins of
the timeout period. Reason is that every TT starts on an ephemeral port and
therefore appears as a new TT to the JT (TTs aren't persistent members of a
cluster).

 

On Tue, Apr 16, 2013 at 2:01 PM, dylan  wrote:

Hi 

I found that the task tracker still appear on the web interface after I
killed the task tracker process, then I tried to restart it again,

But old task tracker remains. No matter how many times I repeated it
kill-restart.

 

Only restarting the job tracker solved my problem. 



 





 

-- 
Harsh J 


Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Azuryy Yu
Do you have data on your input path?


On Wed, Apr 17, 2013 at 1:18 AM, Amit Sela  wrote:

> Nothing on JT log, but as I mentioned I see this in the client log:
>
> [WARN ] org.apache.hadoop.mapred.JobClient   » Use
> GenericOptionsParser for parsing the arguments. Applications should
> implement Tool for the same.
> [INFO ] org.apache.hadoop.mapred.JobClient   » Cleaning up the staging
> area
> hdfs://hadoop-name-node.address:8000{hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034
>
> And this on the NN log:
>
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
> transactions: 1 Total time for transactions(ms): 0Number of transactions
> batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 0
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=
> clientusername ip=/xx.xx.xx.xxx cmd=mkdirs src=
> {hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034
>  dst=null
> perm=clientusername:supergroup:rwxr-xr-x
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=
> clientusername ip=/xx.xx.xx.xxx cmd=setPermission 
> src={hadoop.tmp.dir}/mapred/staging/{clientusername}/.staging/job_201304150711_0034
>   dst=null perm=clientusername:supergroup:rwx--
>
> Thanks.
>
>
> On Tue, Apr 16, 2013 at 10:30 AM, Bejoy Ks  wrote:
>
>> Hi Amit
>>
>> Are you seeing any errors or warnings on JT logs?
>>
>> Regards
>> Bejoy KS
>>
>
>


Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Zizon Qiu
Try using job.waitForCompletion(true) instead of job.submit().
It should show more details.
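
Something like this (just a sketch; the driver class name and the paths are
placeholders, not your code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "remote submit test");
    job.setJarByClass(SubmitCheck.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Unlike submit(), waitForCompletion(true) blocks until the job finishes
    // and streams progress, diagnostics and counters back to the client, so
    // failures that stay silent with submit() usually show up here.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}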


On Mon, Apr 15, 2013 at 6:06 PM, Amit Sela  wrote:

> Hi all,
>
> I'm trying to submit a mapreduce job remotely using job.submit()
>
> I get the following:
>
> [WARN ] org.apache.hadoop.mapred.JobClient   » Use
> GenericOptionsParser for parsing the arguments. Applications should
> implement Tool for the same.
> [INFO ] org.apache.hadoop.mapred.JobClient   » Cleaning up the staging
> area hdfs://{namenode
> address}:{port}{hadoop.tmp.dir}/mapred/staging/myusername/.staging/job_201304150711_0022
>
> and nothing happens...
>
> I set the the mapred.job.tracker and changed permissions
> for hadoop.tmp.dir. I also set "hadoop.job.ugi" as "hadoop,supergroup" but
> some how I think that it's not making any difference.
> The system submitting the job is running with another user, call it:
> myusername and not hadoop.
>
> I believe it is related to the user permissions but I can't seem to get it
> right.
>
> Thanks for the help,
>
> Amit.
>
>


Mapreduce jobs to download job input from across the internet

2013-04-16 Thread David Parks
For a set of jobs to run I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).

 

Currently I do this in a simple Linux script, as it's easy to script FTP,
curl, and the like. But it's a mess to maintain a separate server for that
process. I'd rather it ran in MapReduce: just give it a bill of materials
and let it go about downloading it, retrying as necessary to deal with iffy
network conditions.

 

I wrote one such job to crawl images we need to acquire, and it was the
royalest of royal pains. I wonder if there are any good approaches to this
kind of data-acquisition task in Hadoop. It would certainly be nicer to
schedule a data-acquisition job ahead of the processing jobs in Oozie rather
than try to maintain synchronization between the download processes and the
jobs.
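
Roughly, what I have in mind is a map-only job whose input is the bill of
materials (one URL per line) and whose mapper pulls each file into HDFS with
a few retries. A sketch only, with made-up class and path names:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One URL per input line; each map call fetches the file into HDFS.
public class DownloadMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int MAX_RETRIES = 3;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String url = value.toString().trim();
    if (url.isEmpty()) {
      return;
    }
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Made-up target layout: flatten the URL into a file name under /data/downloads.
    Path target = new Path("/data/downloads/" + url.replaceAll("[^A-Za-z0-9._-]", "_"));

    IOException last = null;
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      InputStream in = null;
      try {
        in = new URL(url).openStream();
        // copyBytes with close=true closes both streams when the copy finishes.
        IOUtils.copyBytes(in, fs.create(target, true), conf, true);
        context.write(new Text(url), NullWritable.get());
        return;
      } catch (IOException e) {
        last = e;
        IOUtils.closeStream(in);
        context.setStatus("attempt " + attempt + " failed for " + url);
      }
    }
    throw new IOException("Giving up on " + url, last);
  }
}

Pairing it with something like NLineInputFormat would spread the URLs across
mappers so the downloads fan out.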

 

Ideas?

 



Re: How to configure mapreduce archive size?

2013-04-16 Thread Hemanth Yamijala
You can limit the size by setting local.cache.size in the mapred-site.xml
(or core-site.xml if that works for you). I mistakenly mentioned
mapred-default.xml in my last mail - apologies for that. However, please
note that this does not prevent whatever is writing into the distributed
cache from creating those files when they are required. After they are
done, the property will help cleanup the files due to the limit set.
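
For reference, the stanza would look roughly like this (the value is in
bytes; the default is 10737418240, i.e. 10 GB, and the 2 GB here is just an
example):

<property>
  <name>local.cache.size</name>
  <value>2147483648</value>
</property>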

That's why I am more keen on finding what is using the files in the
Distributed cache. It may be useful if you can ask on the HBase list as
well if the APIs you are using are creating the files you mention (assuming
you are only running HBase jobs on the cluster and nothing else)

Thanks
Hemanth


On Tue, Apr 16, 2013 at 11:15 PM,  wrote:

> Hi Hemanth,
>
> ** **
>
> I did not explicitly using DistributedCache in my code. I did not use any
> command line arguments like –libjars neither.
>
> ** **
>
> Where can I find job.xml? I am using Hbase MapReduce API and not setting
> any job.xml.
>
> ** **
>
> The key point is I want to limit the size of 
> /tmp/hadoop-root/mapred/local/archive.
> Could you help?
>
> ** **
>
> Thanks.
>
> ** **
>
> Xia
>
> ** **
>
> *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
> *Sent:* Thursday, April 11, 2013 9:09 PM
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: How to configure mapreduce archive size?
>
> ** **
>
> TableMapReduceUtil has APIs like addDependencyJars which will use
> DistributedCache. I don't think you are explicitly using that. Are you
> using any command line arguments like -libjars etc when you are launching
> the MapReduce job ? Alternatively you can check job.xml of the launched MR
> job to see if it has set properties having prefixes like mapred.cache. If
> nothing's set there, it would seem like some other process or user is
> adding jars to DistributedCache when using the cluster.
>
> ** **
>
> Thanks
>
> hemanth
>
> ** **
>
> ** **
>
> ** **
>
> On Thu, Apr 11, 2013 at 11:40 PM,  wrote:
>
> Hi Hemanth,
>
>  
>
> Attached is some sample folders within my
> /tmp/hadoop-root/mapred/local/archive. There are some jar and class files
> inside.
>
>  
>
> My application uses MapReduce job to do purge Hbase old data. I am using
> basic HBase MapReduce API to delete rows from Hbase table. I do not specify
> to use Distributed cache. Maybe HBase use it?
>
>  
>
> Some code here:
>
>  
>
>Scan scan = *new* Scan();
>
>scan.setCaching(500);// 1 is the default in Scan, which
> will be bad for MapReduce jobs
>
>scan.setCacheBlocks(*false*);  // don't set to true for MR jobs
>
>scan.setTimeRange(Long.*MIN_VALUE*, timestamp);
>
>// set other scan *attrs*
>
>// the purge start time
>
>Date date=*new* Date();
>
>TableMapReduceUtil.*initTableMapperJob*(
>
>  tableName,// input table
>
>  scan,   // Scan instance to control CF and
> attribute selection
>
>  MapperDelete.*class*, // *mapper* class
>
>  *null*, // *mapper* output key
>
>  *null*,  // *mapper* output value
>
>  job);
>
>  
>
>job.setOutputFormatClass(TableOutputFormat.*class*);
>
>job.getConfiguration().set(TableOutputFormat.*OUTPUT_TABLE*,
> tableName);
>
>job.setNumReduceTasks(0);
>
>
>
>*boolean* b = job.waitForCompletion(*true*);
>
>  
>
> *From:* Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
> *Sent:* Thursday, April 11, 2013 12:29 AM
>
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: How to configure mapreduce archive size?
>
>  
>
> Could you paste the contents of the directory ? Not sure whether that will
> help, but just giving it a shot.
>
>  
>
> What application are you using ? Is it custom MapReduce jobs in which you
> use Distributed cache (I guess not) ? 
>
>  
>
> Thanks
>
> Hemanth
>
>  
>
> On Thu, Apr 11, 2013 at 3:34 AM,  wrote:
>
> Hi Arun,
>
>  
>
> I stopped my application, then restarted my HBase (which includes Hadoop).
> After that I started my application. After one evening, my
> /tmp/hadoop-root/mapred/local/archive grows to more than 1G. It does not
> work.
>
>  
>
> Is this the right place to change the value?
>
>  
>
> "local.cache.size" in file core-default.xml, which is in
> hadoop-core-1.0.3.jar
>
>  
>
> Thanks,
>
>  
>
> Jane
>
>  
>
> From: Arun C Murthy [mailto:a...@hortonworks.com]
> Sent: Wednesday, April 10, 2013 2:45 PM
>
> To: user@hadoop.apache.org
> Subject: Re: How to configure mapreduce archive size?
>
>  
>
> Ensure no jobs are running (cache limit is only for non-active cache
> files), check after a little while (takes sometime for the cleaner th

Basic Doubt in Hadoop

2013-04-16 Thread Raj Hadoop
Hi,

I am new to Hadoop. I started reading the standard WordCount program, and I have 
a basic doubt about Hadoop.

After the Map-Reduce job is done, where is the output generated? Does the 
reducer output sit on individual DataNodes? Please advise.



Thanks,
Raj


How to balance reduce job

2013-04-16 Thread rauljin
There are 8 datanodes in my Hadoop cluster. When running a reduce job, only 2 
datanodes are running the job.

I want to use all 8 datanodes to run the reduce job, so I can balance the I/O 
pressure.

Any ideas?

Thanks.




rauljin

Re: Basic Doubt in Hadoop

2013-04-16 Thread bejoy . hadoop
In the case of the WordCount MR sample, the output data is stored in HDFS.

In HDFS, you have the metadata in the NameNode and the actual data as blocks 
replicated across DataNodes.

In the case of the reducer, if a reducer is running on a particular node then one 
replica of its output blocks is written on that same node (if there are no space 
issues) and the remaining replicas go to other nodes.
Regards 
Bejoy KS

Sent from remote device, Please excuse typos
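
To make this concrete, the reducer output lands as part files (part-00000 or 
part-r-00000, depending on the API) in the job's HDFS output directory, and those 
files are ordinary HDFS files whose blocks are replicated across DataNodes. A minimal 
sketch (the output path and class name are placeholders) that lists them:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListJobOutput {
        public static void main(String[] args) throws Exception {
            // Placeholder: the directory passed to FileOutputFormat.setOutputPath(...)
            Path outputDir = new Path("/user/raj/wordcount/output");
            FileSystem fs = FileSystem.get(new Configuration());

            // Each reducer writes one part file into this directory; the blocks
            // behind those files are replicated across DataNodes like any HDFS file.
            for (FileStatus status : fs.listStatus(outputDir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }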

-Original Message-
From: Raj Hadoop 
Date: Tue, 16 Apr 2013 21:49:34 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: Basic Doubt in Hadoop

Hi,

I am new to Hadoop. I started reading the standard Wordcount program. I got 
this basic doubt in Hadoop.

After the Map-Reduce job is done, where is the output generated? Does the 
reducer output sit on individual DataNodes? Please advise.



Thanks,
Raj



Re: How to balance reduce job

2013-04-16 Thread bejoy . hadoop
Hi Rauljin

A few things to check here.
What is the number of reduce slots in each TaskTracker? What is the number of 
reduce tasks for your job?
Based on the availability of slots, the reduce tasks are scheduled on TTs.

You can do the following:
Set the number of reduce tasks to 8 or more.
Play with the number of slots (tweaking this at a job level is not very 
advisable).

The reducers are scheduled purely based on slot availability, so it won't be 
that easy to ensure that all TTs are evenly loaded with the same number of 
reducers.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos
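
A minimal sketch of the job-side setting mentioned above (the class and job names 
are placeholders); note that raising the reduce task count only gives the scheduler 
the chance to use more TaskTrackers, it does not force an even spread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpreadReducersJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "spread-reducers");

            // Ask for at least as many reduce tasks as there are nodes so the
            // scheduler can place one on each TaskTracker, slots permitting.
            job.setNumReduceTasks(8);

            // The slots themselves (mapred.tasktracker.reduce.tasks.maximum) are a
            // per-TaskTracker setting in mapred-site.xml, not a per-job knob.
        }
    }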

-Original Message-
From: rauljin 
Date: Wed, 17 Apr 2013 12:53:37 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: How to balance reduce job 

8 datanode in my hadoop cluseter ,when running reduce job,there is only 2 
datanode running the job .

I want to use the 8 datanode to run the reduce job,so I can balance the I/O 
press.

Any ideas?

Thanks.




rauljin

Re: How to balance reduce job

2013-04-16 Thread Mohammad Tariq
Just to add to Bejoy's comments, it also depends on the data distribution.
Is your data properly distributed across the HDFS?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Apr 17, 2013 at 10:39 AM,  wrote:

> **
> Hi Rauljin
>
> Few things to check here.
> What is the number of reduce slots in each Task Tracker? What is the
> number of reduce tasks for your job?
> Based on the availability of slots the reduce tasks are scheduled on TTs.
>
> You can do the following
> Set the number of reduce tasks to 8 or more.
> Play with the number of slots (not very advisable for tweaking this on a
> job level )
>
> The reducers are scheduled purely based on the slot availability so it
> won't be that easy to ensure that all TT are evenly loaded with same number
> of reducers.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * rauljin 
> *Date: *Wed, 17 Apr 2013 12:53:37 +0800
> *To: *user@hadoop.apache.org
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *How to balance reduce job
>
> 8 datanode in my hadoop cluseter ,when running reduce job,there is only 2
> datanode running the job .
>
> I want to use the 8 datanode to run the reduce job,so I can balance the
> I/O press.
>
> Any ideas?
>
> Thanks.
>
> --
> rauljin
>


Re: How to balance reduce job

2013-04-16 Thread bejoy . hadoop

Uniform data distribution across HDFS is one of the factors that ensures map 
tasks are uniformly distributed across nodes. But reduce tasks don't depend on 
data distribution; their placement is purely based on slot availability.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Mohammad Tariq 
Date: Wed, 17 Apr 2013 10:46:27 
To: user@hadoop.apache.org; Bejoy 
Ks
Subject: Re: How to balance reduce job

Just to add to Bejoy's comments, it also depends on the data distribution.
Is your data properly distributed across the HDFS?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Apr 17, 2013 at 10:39 AM,  wrote:

> **
> Hi Rauljin
>
> Few things to check here.
> What is the number of reduce slots in each Task Tracker? What is the
> number of reduce tasks for your job?
> Based on the availability of slots the reduce tasks are scheduled on TTs.
>
> You can do the following
> Set the number of reduce tasks to 8 or more.
> Play with the number of slots (not very advisable for tweaking this on a
> job level )
>
> The reducers are scheduled purely based on the slot availability so it
> won't be that easy to ensure that all TT are evenly loaded with same number
> of reducers.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * rauljin 
> *Date: *Wed, 17 Apr 2013 12:53:37 +0800
> *To: *user@hadoop.apache.org
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *How to balance reduce job
>
> 8 datanode in my hadoop cluseter ,when running reduce job,there is only 2
> datanode running the job .
>
> I want to use the 8 datanode to run the reduce job,so I can balance the
> I/O press.
>
> Any ideas?
>
> Thanks.
>
> --
> rauljin
>



Adding new name node location

2013-04-16 Thread Henry Hung
Hi Everyone,

I'm using Hadoop 1.0.4 and define only one location for the name node files, like
this:
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-data/namenode</value>
  </property>

Now I want to protect my name node files by changing the configuration to:
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-data/namenode,/backup/hadoop/hadoop-data/namenode</value>
  </property>

Where /backup is another mount point. This /backup can be another disk or from 
another NFS server.

My questions are:

1.   Is my procedure correct: do stop-dfs.sh, then modify the conf, and finally 
start-dfs.sh?

2.   If the answer to no. 1 is no, could you provide the correct procedure?

3.   Will the new name node directory automatically get a copy of the original 
name node files?

Best regards,
Henry


The privileged confidential information contained in this email is intended for 
use only by the addressees as indicated by the original sender of this email. 
If you are not the addressee indicated in this email or are not responsible for 
delivery of the email to such a person, please kindly reply to the sender 
indicating this fact and delete all copies of it from your computer and network 
server immediately. Your cooperation is highly appreciated. It is advised that 
any unauthorized use of confidential information of Winbond is strictly 
prohibited; and any information in this email irrelevant to the official 
business of Winbond shall be deemed as neither given nor endorsed by Winbond.


Re: How to balance reduce job

2013-04-16 Thread Ajay Srivastava
Tariq probably meant the distribution of keys from the <key, value> pairs emitted 
by the mapper.
The partitioner distributes these pairs to different reducers based on the key. If 
the data is such that the keys are skewed, then most of the records may go to the 
same reducer.



Regards,
Ajay Srivastava
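
To illustrate the skewed-key point, a minimal sketch of key-based partitioning (the 
key/value types and class name are assumptions; the stock HashPartitioner behaves 
essentially the same way), so a single hot key always lands on the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // All pairs with the same key hash to the same partition, so heavily
    // skewed keys pile their records onto a single reducer.
    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }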


On 17-Apr-2013, at 11:08 AM, bejoy.had...@gmail.com wrote:


Uniform Data distribution across HDFS is one of the factor that ensures map 
tasks are uniformly distributed across nodes. But reduce tasks doesn't depend 
on data distribution it is purely based on slot availability.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Mohammad Tariq <donta...@gmail.com>
Date: Wed, 17 Apr 2013 10:46:27 +0530
To: user@hadoop.apache.org; Bejoy Ks <bejoy.had...@gmail.com>
Subject: Re: How to balance reduce job

Just to add to Bejoy's comments, it also depends on the data distribution. Is 
your data properly distributed across the HDFS?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Apr 17, 2013 at 10:39 AM, bejoy.had...@gmail.com wrote:
Hi Rauljin

Few things to check here.
What is the number of reduce slots in each Task Tracker? What is the number of 
reduce tasks for your job?
Based on the availability of slots the reduce tasks are scheduled on TTs.

You can do the following
Set the number of reduce tasks to 8 or more.
Play with the number of slots (not very advisable for tweaking this on a job 
level )

The reducers are scheduled purely based on the slot availability so it won't be 
that easy to ensure that all TT are evenly loaded with same number of reducers.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: rauljin <liujin666...@sina.com>
Date: Wed, 17 Apr 2013 12:53:37 +0800
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org
Subject: How to balance reduce job

8 datanode in my hadoop cluseter ,when running reduce job,there is only 2 
datanode running the job .

I want to use the 8 datanode to run the reduce job,so I can balance the I/O 
press.

Any ideas?

Thanks.


rauljin




Re: Adding new name node location

2013-04-16 Thread varun kumar
Hi Henry,

As per your mail, point number 1 is correct.

After making these changes, the metadata will be written to the new partition.

Regards,
Varun Kumar.P
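
One way to sanity-check this after the restart (the paths are the ones from this 
thread, assuming the usual current/ layout of a 1.x name directory) is to compare 
what the two locations contain:

    import java.io.File;
    import java.util.Arrays;

    public class CompareNameDirs {
        public static void main(String[] args) {
            // Paths taken from the dfs.name.dir value discussed in this thread.
            File primary = new File("/home/hadoop/hadoop-data/namenode/current");
            File backup  = new File("/backup/hadoop/hadoop-data/namenode/current");

            // After the NameNode restarts with both directories configured, each
            // current/ directory should hold the same image/edits files
            // (fsimage, edits, VERSION, ...).
            System.out.println("primary: " + Arrays.toString(primary.list()));
            System.out.println("backup : " + Arrays.toString(backup.list()));
        }
    }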


On Wed, Apr 17, 2013 at 11:32 AM, Henry Hung  wrote:

>  Hi Everyone,
>
> I'm using Hadoop 1.0.4 and define only one location for the name node files,
> like this:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/home/hadoop/hadoop-data/namenode</value>
>   </property>
>
> Now I want to protect my name node files by changing the configuration to:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/home/hadoop/hadoop-data/namenode,/backup/hadoop/hadoop-data/namenode</value>
>   </property>
>
> Where /backup is another mount point. This /backup can be another disk or
> from another NFS server.
>
> My questions are:
>
> 1.   Is my procedure correct: do stop-dfs.sh, then modify the conf, and
> finally start-dfs.sh?
>
> 2.   If the answer to no. 1 is no, could you provide the correct
> procedure?
>
> 3.   Will the new name node directory automatically get a copy of the
> original name node files?
>
> ** **
>
> Best regards,
>
> Henry
>
>



-- 
Regards,
Varun Kumar.P


Re: How to balance reduce job

2013-04-16 Thread bejoy . hadoop
Yes, that is a valid point.

The partitioner might produce a non-uniform distribution, and the reducers can be 
unevenly loaded.

But this doesn't change the number of reducers or their distribution across 
nodes. The bottom issue, as I understand it, is that his reduce tasks are 
scheduled on just a few nodes.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ajay Srivastava 
Date: Wed, 17 Apr 2013 06:02:30 
To: ; 

Reply-To: user@hadoop.apache.org
Cc: Mohammad Tariq
Subject: Re: How to balance reduce job

Tariq probably meant the distribution of keys from the <key, value> pairs emitted 
by the mapper.
The partitioner distributes these pairs to different reducers based on the key. If 
the data is such that the keys are skewed, then most of the records may go to the 
same reducer.



Regards,
Ajay Srivastava


On 17-Apr-2013, at 11:08 AM, bejoy.had...@gmail.com wrote:


Uniform Data distribution across HDFS is one of the factor that ensures map 
tasks are uniformly distributed across nodes. But reduce tasks doesn't depend 
on data distribution it is purely based on slot availability.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Mohammad Tariq <donta...@gmail.com>
Date: Wed, 17 Apr 2013 10:46:27 +0530
To: user@hadoop.apache.org; Bejoy Ks <bejoy.had...@gmail.com>
Subject: Re: How to balance reduce job

Just to add to Bejoy's comments, it also depends on the data distribution. Is 
your data properly distributed across the HDFS?

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Apr 17, 2013 at 10:39 AM, bejoy.had...@gmail.com wrote:
Hi Rauljin

Few things to check here.
What is the number of reduce slots in each Task Tracker? What is the number of 
reduce tasks for your job?
Based on the availability of slots the reduce tasks are scheduled on TTs.

You can do the following
Set the number of reduce tasks to 8 or more.
Play with the number of slots (not very advisable for tweaking this on a job 
level )

The reducers are scheduled purely based on the slot availability so it won't be 
that easy to ensure that all TT are evenly loaded with same number of reducers.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: rauljin <liujin666...@sina.com>
Date: Wed, 17 Apr 2013 12:53:37 +0800
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org
Subject: How to balance reduce job

8 datanode in my hadoop cluseter ,when running reduce job,there is only 2 
datanode running the job .

I want to use the 8 datanode to run the reduce job,so I can balance the I/O 
press.

Any ideas?

Thanks.


rauljin





RE: Adding new name node location

2013-04-16 Thread Henry Hung
Hi Varun Kumar,

Could you elaborate on how the changes get applied to the new name node location?

The scenario in my mind is:
Suppose the old name node metadata contains 100 HDFS files.
Then I restart by using stop-dfs, change the config, and run start-dfs.
Hadoop will automatically create the new name node directory in /backup.
Now I have one old name node directory that has metadata for 100 HDFS files and a 
new name node directory that has no metadata.
When I put a new file into HDFS, the old name node directory will have 101 and the 
new one will have 1.

Best regards,
Henry

From: varun kumar [mailto:varun@gmail.com]
Sent: Wednesday, April 17, 2013 2:18 PM
To: user
Cc: MA11 YTHung1
Subject: Re: Adding new name node location

Hi Henry,

As per your mail, point number 1 is correct.

After making these changes, the metadata will be written to the new partition.

Regards,
Varun Kumar.P

On Wed, Apr 17, 2013 at 11:32 AM, Henry Hung <ythu...@winbond.com> wrote:
Hi Everyone,

I'm using Hadoop 1.0.4 and define only one location for the name node files, like
this:
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-data/namenode</value>
  </property>

Now I want to protect my name node files by changing the configuration to:
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-data/namenode,/backup/hadoop/hadoop-data/namenode</value>
  </property>

Where /backup is another mount point. This /backup can be another disk or from 
another NFS server.

My questions are:

1.   Is my procedure correct: do stop-dfs.sh, then modify the conf, and finally 
start-dfs.sh?

2.   If the answer to no. 1 is no, could you provide the correct procedure?

3.   Will the new name node directory automatically get a copy of the original 
name node files?

Best regards,
Henry





--
Regards,
Varun Kumar.P




Re: Re: How to balance reduce job

2013-04-16 Thread rauljin

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>


   I am not clear on the number of reduce slots in each TaskTracker. Is it defined 
in the configuration?

 






rauljin
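
The slot counts are indeed per-TaskTracker configuration: each TaskTracker reads 
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum from 
its own mapred-site.xml, and a job cannot change them. A small sketch of reading them 
(the class name is a placeholder; the second argument to getInt is the Hadoop 1.x 
default used when the property is unset):

    import org.apache.hadoop.conf.Configuration;

    public class SlotSettings {
        public static void main(String[] args) {
            // Reads the TaskTracker slot settings from whatever *-site.xml files
            // are on the classpath.
            Configuration conf = new Configuration();
            int mapSlots    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println("map slots per TaskTracker    = " + mapSlots);
            System.out.println("reduce slots per TaskTracker = " + reduceSlots);
        }
    }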

From: bejoy.hadoop
Date: 2013-04-17 13:09
To: user; liujin666jin
Subject: Re: How to balance reduce job
Hi Rauljin

Few things to check here.
What is the number of reduce slots in each Task Tracker? What is the number of 
reduce tasks for your job?
Based on the availability of slots the reduce tasks are scheduled on TTs.

You can do the following
Set the number of reduce tasks to 8 or more. 
Play with the number of slots (not very advisable for tweaking this on a job 
level )

The reducers are scheduled purely based on the slot availability so it won't be 
that easy to ensure that all TT are evenly loaded with same number of reducers.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos



From: rauljin  
Date: Wed, 17 Apr 2013 12:53:37 +0800
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org 
Subject: How to balance reduce job


8 datanode in my hadoop cluseter ,when running reduce job,there is only 2 
datanode running the job .

I want to use the 8 datanode to run the reduce job,so I can balance the I/O 
press.

Any ideas?

Thanks.




rauljin