Re: Tasktracker & Jobtracker

2011-11-14 Thread Harsh J
Mohmmadanis,

Yes. Sort of.

The JobTracker keeps submitted jobs queued until TaskTrackers are available to 
accept and run tasks. Its queues remain functional and simply wait for slots to 
return before assigning tasks and continuing the job.

On 15-Nov-2011, at 12:57 PM, mohmmadanis moulavi wrote:

> Hello friends,
> 
> 
> 
> I want to know, 
> 
> Suppose a tasktracker goes down, or I stop the tasktracker;
> will the JobTracker then wait until it comes back up?
> Please let me know.
> 
> 
> Regards,
> Mohmmadanis Moulavi



How is data of each job assigned in Mumak ?

2011-11-14 Thread ArunKumar
Hi guys !

Q> How can I assign the data of each job to Mumak nodes, and what else do I need
to do?
In general, how can I use the pluggable block placement for HDFS in Mumak?
Meaning, in my context I am using the 19-jobs-trace JSON file and a modified
topology JSON file consisting of, say, 4 nodes. Since the number of tasks (map
& reduce) is fixed for these jobs, I want to assign the input data (with
its replication) of each job to some particular nodes, so that I can use
this info in my scheduler. Does the code for "ObtainLocalMapTask" /
"ObtainNonlocalMaptask" in the scheduler need to be changed if I have to use
this data placement info?

Q> If I have to add some fields to the jobs in the job trace file, how do I add
them to the trace and access them in my scheduler code for scheduling in Mumak?
Which classes do I need to modify for this?


Thanks,
Arun




Tasktracker & Jobtracker

2011-11-14 Thread mohmmadanis moulavi
Hello friends,



I want to know, 

Suppose a tasktracker goes down, or I stop the tasktracker;
will the JobTracker then wait until it comes back up?
Please let me know.


Regards,
Mohmmadanis Moulavi


Re: Input split for a streaming job!

2011-11-14 Thread Raj V
Milind

I realised that, thanks to Joey from Cloudera. I have given up on bzip.

Raj



>
>From: "milind.bhandar...@emc.com" 
>To: common-user@hadoop.apache.org; rajv...@yahoo.com; cdh-u...@cloudera.org
>Sent: Monday, November 14, 2011 2:02 PM
>Subject: Re: Input split for a streaming job!
>
>It looks like your hadoop distro does not have
>https://issues.apache.org/jira/browse/HADOOP-4012.
>
>- milind
>
>On 11/10/11 2:40 PM, "Raj V"  wrote:
>
>>All
>>
>>I assumed that the input splits for a streaming job would follow the same
>>logic as a Java map-reduce job, but I seem to be wrong.
>>
>>I started out with 73 gzipped files that vary between 23MB and 255MB in
>>size. My default block size was 128MB. 8 of the 73 files are larger than
>>128MB.
>>
>>When I ran my streaming job, it ran, as expected, 73 mappers (no
>>reducers for this job).
>>
>>Since I have 128 nodes in my cluster, I thought I would use more systems
>>in the cluster by increasing the number of mappers. I changed all the
>>gzip files into bzip2 files. I expected the number of mappers to increase
>>to 81. The mappers remained at 73.
>>
>>I tried a second experiment: I changed my dfs.block.size to 32MB. That
>>should have increased my mappers to about 250. It remains steadfast at
>>73.
>>
>>Is my understanding wrong? With a smaller block size and bzipped files,
>>should I not get more mappers?
>>
>>Raj
>
>
>
>

Re: Input split for a streaming job!

2011-11-14 Thread Milind.Bhandarkar
It looks like your hadoop distro does not have
https://issues.apache.org/jira/browse/HADOOP-4012.

- milind

On 11/10/11 2:40 PM, "Raj V"  wrote:

>All
>
>I assumed that the input splits for a streaming job would follow the same
>logic as a Java map-reduce job, but I seem to be wrong.
>
>I started out with 73 gzipped files that vary between 23MB and 255MB in
>size. My default block size was 128MB. 8 of the 73 files are larger than
>128MB.
>
>When I ran my streaming job, it ran, as expected, 73 mappers (no
>reducers for this job).
>
>Since I have 128 nodes in my cluster, I thought I would use more systems
>in the cluster by increasing the number of mappers. I changed all the
>gzip files into bzip2 files. I expected the number of mappers to increase
>to 81. The mappers remained at 73.
>
>I tried a second experiment: I changed my dfs.block.size to 32MB. That
>should have increased my mappers to about 250. It remains steadfast at
>73.
>
>Is my understanding wrong? With a smaller block size and bzipped files,
>should I not get more mappers?
>
>Raj
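
For reference, HADOOP-4012 added bzip2 splitting support via the
SplittableCompressionCodec interface. A minimal sketch, assuming a Hadoop build
that includes that patch, of checking whether the codec resolved for an input
file is splittable (the path argument and printed messages are illustrative only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            // Hypothetical input path passed on the command line, e.g. /data/input/part-0001.bz2
            Path file = new Path(args[0]);
            CompressionCodec codec = factory.getCodec(file);
            if (codec == null) {
                // Uncompressed input: splits follow the block size / split settings.
                System.out.println(file + ": not compressed, splittable by block");
            } else if (codec instanceof SplittableCompressionCodec) {
                // With HADOOP-4012 in place, bzip2 falls into this branch.
                System.out.println(file + ": splittable codec " + codec.getClass().getSimpleName());
            } else {
                // e.g. gzip: the whole file goes to a single mapper.
                System.out.println(file + ": non-splittable codec, one mapper per file");
            }
        }
    }

If the codec never reports as splittable, the number of map tasks stays at one
per file regardless of dfs.block.size, which matches the behaviour Raj describes.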



RE: setting up eclipse env for hadoop

2011-11-14 Thread Uma Maheswara Rao G
You are right.

From: Tim Broberg [tim.brob...@exar.com]
Sent: Tuesday, November 15, 2011 1:02 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

The ant steps for building the eclipse plugin are replaced by "mvn 
eclipse:eclipse," for versions 0.23+, correct?


From: Uma Maheswara Rao G [mahesw...@huawei.com]
Sent: Monday, November 14, 2011 10:11 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

Yes, you can follow that.
mvn eclipse:eclipse will generate the Eclipse-related files. After that, import 
the projects directly into your Eclipse.
Note: the repository links need updating; hdfs and mapreduce have been moved 
inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse, is this
http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858



Re: hadoop 0.20.205.0 multi-user cluster

2011-11-14 Thread Raj V
Hi Stephen

This is probably happening during jobtracker start. Can you provide any 
relevant logs from the tasktracker log file?

Raj



>
>From: stephen mulcahy 
>To: common-user@hadoop.apache.org
>Sent: Monday, November 14, 2011 5:33 AM
>Subject: Re: hadoop 0.20.205.0 multi-user cluster
>
>On 14/11/11 09:38, stephen mulcahy wrote:
>> Hi Raj,
>>
>> Thanks for your reply, comments below.
>>
>> On 09/11/11 18:45, Raj V wrote:
>>> Can you try the following?
>>>
>>> - Change the permission to 775 for /hadoop/mapred/system
>
>As per the previous problem, the permissions still get reset on cluster 
>restart.
>
>Am I the only one trying to use the cluster in this way?
>Is everyone else submitting all jobs as a single user or using the full 
>authentication support?
>
>-stephen
>
>-- 
>Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
>NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
>http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com
>
>
>

RE: setting up eclipse env for hadoop

2011-11-14 Thread Tim Broberg
The ant steps for building the eclipse plugin are replaced by "mvn 
eclipse:eclipse," for versions 0.23+, correct?


From: Uma Maheswara Rao G [mahesw...@huawei.com]
Sent: Monday, November 14, 2011 10:11 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

Yes, you can follow that.
mvn eclipse:eclipse will generate the Eclipse-related files. After that, import 
the projects directly into your Eclipse.
Note: the repository links need updating; hdfs and mapreduce have been moved 
inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse, is this
http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858



Reading Global Counters in Streaming?

2011-11-14 Thread Ryan Rosario
Hi,

I am writing a Hadoop Streaming job in Python. I know that I can
increment counters by writing a special format to sys.stderr. Is it
possible to *read* the values of counters from my Python program? I am
using the global counter as the denominator of a probability, and must
have this value available to all reducers. Is this possible, and how
would I do it in Python?

Thanks,
Ryan


RE: setting up eclipse env for hadoop

2011-11-14 Thread Uma Maheswara Rao G
Yes, you can follow that. 
mvn eclipse:eclipse will generate the Eclipse-related files. After that, import 
the projects directly into your Eclipse.
Note: the repository links need updating; hdfs and mapreduce have been moved 
inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse, is this
http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858


Re: Mappers and Reducer not being called, but no errors indicated

2011-11-14 Thread Andy Doddington
OK, continuing our earlier conversation...

I have a job that schedules 100 map jobs (a small number, just for testing), 
passing data via a set of 100 sequence files. This is based on the PiEstimator 
example that is shipped with the distribution.

The data consist of a blob of serialised state, amounting to around 20MB of 
data. I have added various checks, including checksums,
to reduce the risk of data corruption or misalignment.

The mapper takes the blob of data as its value input and an integer in the 
range 0-99 as its key (passed as a LongWritable).

Each mapper then does some processing, based upon the deserialised contents of 
the blob and the integer key value (0-99).

The reducer then selects the minimum value that was produced across all of the 
mappers.

Unfortunately, this process is generating an incorrect value, when compared to 
a simple iterative solution.

After inspecting the results it seems that the mappers are generating correct 
values for even-numbered keys, but incorrect
values for odd-numbered keys. I am logging the values of the keys, so I am 
confident that these are correct. My serialisation
checks also make me confident that the ‘value’ blobs are not getting corrupted, 
so it’s all something of a mystery.

Harsh J: Previously, you indicated that this might be a “...key/val data issue… 
...Perhaps bad partitioning/grouping is happening as a result of that”. I 
apologise for the lack of detail, but do you think this still might be the 
case? If so, could you refer me to some place that gives more detail on this 
type of issue?

With apologies for continuing to be a nuisance :-(

Andy D



Re: hadoop 0.20.205.0 multi-user cluster

2011-11-14 Thread stephen mulcahy

On 14/11/11 15:31, Shi Jin wrote:

I am guessing that /tmp is reset upon cluster restart. Maybe try to use
a persistent directory.


Thanks for the suggestion but /tmp will only be reset on server reboot - 
not cluster restart (I'm talking about running stop-all.sh and 
start-all.sh, not a full reboot).


-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


setting up eclipse env for hadoop

2011-11-14 Thread Amir Sanjar

I am trying to build hadoop-trunk using eclipse, is this
http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards
Amir Sanjar

Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax#  512-838-8858


Re: hadoop 0.20.205.0 multi-user cluster

2011-11-14 Thread Shi Jin
I am guessing that /tmp is reset upon cluster restart. Maybe try to use
a persistent directory.

Shi

On Mon, Nov 14, 2011 at 6:33 AM, stephen mulcahy
wrote:

> On 14/11/11 09:38, stephen mulcahy wrote:
>
>> Hi Raj,
>>
>> Thanks for your reply, comments below.
>>
>> On 09/11/11 18:45, Raj V wrote:
>>
>>> Can you try the following?
>>>
>>> - Change the permission to 775 for /hadoop/mapred/system
>>>
>>
> As per the previous problem, the permissions still get reset on cluster
> restart.
>
> Am I the only one trying to use the cluster in this way?
> Is everyone else submitting all jobs as a single user or using the full
> authentication support?
>
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
>



-- 
Shi Jin, Ph.D.


Re: how to start tasktracker only on single port

2011-11-14 Thread mohmmadanis moulavi
@Harsh J
I got it from Wei Wu:
it can be done using the "mapred.task.tracker.report.address" parameter.
Thanks for your reply.



From: Harsh J 
To: common-user@hadoop.apache.org
Sent: Monday, 14 November 2011 2:00 PM
Subject: Re: how to start tasktracker only on single port

If you need this for monitoring purposes, note that the older TaskTracker will 
have a different ID (the port number will differ). You can differentiate based 
on that.

TaskTrackers are mere clients to the JobTracker, so the JobTracker maintains no 
complete state for them, as is done on the HDFS side -- nor do the TaskTrackers 
themselves hold state.

Thus, older TaskTrackers may continue to appear in the live node list until they 
time out (usually 10 minutes or so, configurable), because from the JobTracker's 
perspective new TaskTrackers are new instances, not replacements. Older 
instances are cleaned away after they time out.

On 14-Nov-2011, at 1:09 PM, mohmmadanis moulavi wrote:

> Hello,
> 
> 
> 
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> JobTracker shows one extra tasktracker (the one which was killed & the other
> which was started afterwards).
> I want it to work like this:
> whenever I kill the tasktracker it stops sending heartbeats, but when
> I start the tasktracker again, it should resume sending heartbeats, i.e.
> it should start that tasktracker on the same port as before.
> What changes should I make in the configuration parameters for that? Please
> let me know.
> 
> 
> Thanks & Regards,
> Mohmmadanis Moulavi
> 

Re: hadoop 0.20.205.0 multi-user cluster

2011-11-14 Thread stephen mulcahy

On 14/11/11 09:38, stephen mulcahy wrote:

Hi Raj,

Thanks for your reply, comments below.

On 09/11/11 18:45, Raj V wrote:

Can you try the following?

- Change the permission to 775 for /hadoop/mapred/system


As per the previous problem, the permissions still get reset on cluster 
restart.


Am I the only one trying to use the cluster in this way?
Is everyone else submitting all jobs as a single user or using the full 
authentication support?


-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: how to start tasktracker only on single port

2011-11-14 Thread Wei Wu
It seems you are looking for the parameter
"mapred.task.tracker.report.address". Set it to a fixed port number,
127.0.0.1:50050 for example, and try it.

Wei

On Mon, Nov 14, 2011 at 3:39 PM, mohmmadanis moulavi <
anis_moul...@yahoo.co.in> wrote:

> Hello,
>
>
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> JobTracker shows one extra tasktracker (the one which was killed & the other
> which was started afterwards).
> I want it to work like this:
> whenever I kill the tasktracker it stops sending heartbeats, but when
> I start the tasktracker again, it should resume sending heartbeats, i.e.
> it should start that tasktracker on the same port as before.
> What changes should I make in the configuration parameters for that? Please
> let me know.
>
>
> Thanks & Regards,
> Mohmmadanis Moulavi
>
>
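
For reference, a minimal sketch of how that property might be set in
mapred-site.xml on the tasktracker node (the address and port are just the
example values from above):

    <property>
      <name>mapred.task.tracker.report.address</name>
      <value>127.0.0.1:50050</value>
    </property>

The default value ends in port 0 (a random free port), which is why a restarted
tasktracker normally comes back with a different ID.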


Re: how to start tasktracker only on single port

2011-11-14 Thread mohmmadanis moulavi
Yeah, by "kill" I mean stopping the tasktracker.



- Original Message -
From: Alexander C.H. Lorenz 
To: common-user@hadoop.apache.org; mohmmadanis moulavi 

Cc: 
Sent: Monday, 14 November 2011 1:16 PM
Subject: Re: how to start tasktracker only on single port

Hi,

Please explain the reason for killing (I assume kill -9) a tasktracker. The
best way is to use the start/stop scripts.

best,
Alex

On Mon, Nov 14, 2011 at 8:39 AM, mohmmadanis moulavi <
anis_moul...@yahoo.co.in> wrote:

> Hello,
>
>
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> JobTracker shows one extra tasktracker (the one which was killed & the other
> which was started afterwards).
> I want it to work like this:
> whenever I kill the tasktracker it stops sending heartbeats, but when
> I start the tasktracker again, it should resume sending heartbeats, i.e.
> it should start that tasktracker on the same port as before.
> What changes should I make in the configuration parameters for that? Please
> let me know.
>
>
> Thanks & Regards,
> Mohmmadanis Moulavi
>
>


-- 
Alexander Lorenz
http://mapredit.blogspot.com

Think of the environment: please don't print this email unless you
really need to.



Re: HDFS and Openstack - avoiding excessive redundancy

2011-11-14 Thread Dieter Plaetinck
Or, more generally:
isn't using virtualized I/O counterproductive when dealing with Hadoop M/R?
I would think that for running Hadoop M/R you'd want predictable and consistent
I/O on each node, not to mention that your bottlenecks are usually disk I/O
(and maybe CPU), so virtualisation makes things less performant and less
predictable - in short, inferior. Or am I missing something?

Dieter

On Sat, 12 Nov 2011 07:54:05 +
Graeme Seaton  wrote:

> One advantage to using Hadoop replication though, is that it provides
> a greater pool of potential servers for M/R jobs to execute on.  If
> you simply use Openstack replication it will appear to the JobTracker
> that a particular block only exists on a single server and should
> only be executed on that node.  This may have an impact
> depending on your workload profile.
> 
> Regards,
> Graeme
> 
> On 12/11/11 07:24, Dejan Menges wrote:
> > The replication factor for HDFS can easily be changed to 1 in hdfs-site.xml
> > if you don't need its redundancy.
> >
> > Regards,
> > Dejo
> >
> > Sent from my iPhone
> >
> > On 12. 11. 2011., at 03:58, Edmon Begoli  wrote:
> >
> >> A question related to standing up cloud infrastructure for running
> >> Hadoop/HDFS.
> >>
> >> We are building up an infrastructure using Openstack which has its
> >> own storage management redundancy.
> >>
> >> We are planning to use Openstack to instantiate Hadoop nodes (HDFS,
> >> M/R tasks, Hive, HBase)
> >> on demand.
> >>
> >> The problem is that HDFS by design creates three copies of the
> >> data, so there is a 4x redundancy
> >> which we would prefer to avoid.
> >>
> >> I am asking here if anyone has had a similar case and if anyone has
> >> had any helpful solution to recommend.
> >>
> >> Thank you in advance,
> >> Edmon
> 
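
For reference, the change Dejan mentions is the dfs.replication property; a
minimal hdfs-site.xml sketch (whether dropping replication to 1 is actually
advisable depends on the OpenStack storage guarantees and on the locality point
Graeme raises above):

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

Note that this only affects files written after the change; existing files keep
their replication unless it is changed explicitly (e.g. with hadoop fs -setrep).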



Re: hadoop 0.20.205.0 multi-user cluster

2011-11-14 Thread stephen mulcahy

Hi Raj,

Thanks for your reply, comments below.

On 09/11/11 18:45, Raj V wrote:

Can you try the following?

- Change the permission to 775 for /hadoop/mapred/system


Done.


- Change the group to hadoop


Done.


- Make all users who need to submit hadoop jobs a part of the hadoop group.


The users are remote users. Do I need to create accounts on the hadoop 
cluster for those users to add them to the hadoop group or how should 
this work?


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
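
For anyone following along, a minimal sketch of applying Raj's two suggestions
programmatically through the FileSystem API (this assumes /hadoop/mapred/system
is the HDFS path from the thread; the equivalent CLI commands would be
hadoop fs -chmod / -chgrp):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FixMapredSystemDir {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path systemDir = new Path("/hadoop/mapred/system"); // path from the thread
            fs.setPermission(systemDir, new FsPermission((short) 0775)); // the "775" suggestion
            fs.setOwner(systemDir, null, "hadoop"); // null user = leave owner as-is, set group to "hadoop"
        }
    }

As the later messages in this thread note, the permissions get reset again on
cluster restart, so this alone does not make the change stick.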


Re: how to start tasktracker only on single port

2011-11-14 Thread Harsh J
If you need this for monitoring purposes, note that the older TaskTracker will 
have a different ID (the port number will differ). You can differentiate based 
on that.

TaskTrackers are mere clients to the JobTracker, so the JobTracker maintains no 
complete state for them, as is done on the HDFS side -- nor do the TaskTrackers 
themselves hold state.

Thus, older TaskTrackers may continue to appear in the live node list until they 
time out (usually 10 minutes or so, configurable), because from the JobTracker's 
perspective new TaskTrackers are new instances, not replacements. Older 
instances are cleaned away after they time out.

On 14-Nov-2011, at 1:09 PM, mohmmadanis moulavi wrote:

> Hello,
> 
> 
> 
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> JobTracker shows one extra tasktracker (the one which was killed & the other
> which was started afterwards).
> I want it to work like this:
> whenever I kill the tasktracker it stops sending heartbeats, but when
> I start the tasktracker again, it should resume sending heartbeats, i.e.
> it should start that tasktracker on the same port as before.
> What changes should I make in the configuration parameters for that? Please
> let me know.
> 
> 
> Thanks & Regards,
> Mohmmadanis Moulavi
>
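
Regarding the timeout Harsh mentions above: if I remember correctly, the
interval is controlled by mapred.tasktracker.expiry.interval in mapred-site.xml
(milliseconds; 600000 = 10 minutes is the default), along the lines of:

    <property>
      <name>mapred.tasktracker.expiry.interval</name>
      <value>600000</value>
    </property>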