Re: Tasktracker & Jobtracker
Mohmmadanis,

Yes, sort of. The JobTracker keeps submitted jobs queued until TaskTrackers are available to assign and run tasks on. The JobTracker's queues remain functional and wait for slots to become available again so the job can continue running.

On 15-Nov-2011, at 12:57 PM, mohmmadanis moulavi wrote:

> Hello friends,
>
> I want to know: suppose a tasktracker goes down, or I stopped the
> tasktracker; does the JobTracker wait until it comes up?
> Please let me know.
>
> Regards,
> Mohmmadanis Moulavi
How is data of each job assigned in Mumak?
Hi guys!

Q> How can I assign each job's data to Mumak nodes, and what else do I need to do? In general, how can I use the pluggable block placement for HDFS in Mumak? In my context I am using the 19-jobs-trace JSON file and a modified topology JSON file consisting of, say, 4 nodes. Since the number of tasks (map & reduce) is fixed for these jobs, I want to assign the input data (with its replication) of each job to some particular nodes, so that I can use this information in my scheduler. Does the code for "ObtainLocalMapTask" / "ObtainNonlocalMaptask" in the scheduler need to be changed if I have to use this data-placement information?

Q> If I have to add some fields to the jobs in the job trace file, how do I add them to the job trace and access them in my scheduler code for scheduling in Mumak? Which classes do I need to modify for this?

Thanks,
Arun
Tasktracker & Jobtracker
Hello friends,

I want to know: suppose a tasktracker goes down, or I stopped the tasktracker; does the JobTracker wait until it comes up?
Please let me know.

Regards,
Mohmmadanis Moulavi
Re: Input split for a streaming job!
Milind,

I realised that, thanks to Joey from Cloudera. I have given up on bzip.

Raj

> From: "milind.bhandar...@emc.com"
> To: common-user@hadoop.apache.org; rajv...@yahoo.com; cdh-u...@cloudera.org
> Sent: Monday, November 14, 2011 2:02 PM
> Subject: Re: Input split for a streaming job!
>
> It looks like your hadoop distro does not have
> https://issues.apache.org/jira/browse/HADOOP-4012.
>
> - milind
>
> On 11/10/11 2:40 PM, "Raj V" wrote:
>
>> All,
>>
>> I assumed that the input splits for a streaming job will follow the same
>> logic as a Java map-reduce job, but I seem to be wrong.
>>
>> I started out with 73 gzipped files that vary between 23MB and 255MB in
>> size. My default block size was 128MB. 8 of the 73 files are larger than
>> 128MB.
>>
>> When I ran my streaming job, it ran, as expected, 73 mappers (no
>> reducers for this job).
>>
>> Since I have 128 nodes in my cluster, I thought I would use more systems
>> in the cluster by increasing the number of mappers. I changed all the
>> gzip files into bzip2 files. I expected the number of mappers to increase
>> to 81. The mappers remained at 73.
>>
>> I tried a second experiment: I changed my dfs.block.size to 32MB. That
>> should have increased my mappers to about ~250. It remains steadfast at
>> 73.
>>
>> Is my understanding wrong? With a smaller block size and bzipped files,
>> should I not get more mappers?
>>
>> Raj
Re: Input split for a streaming job!
It looks like your hadoop distro does not have https://issues.apache.org/jira/browse/HADOOP-4012.

- milind

On 11/10/11 2:40 PM, "Raj V" wrote:

> All,
>
> I assumed that the input splits for a streaming job will follow the same
> logic as a Java map-reduce job, but I seem to be wrong.
>
> I started out with 73 gzipped files that vary between 23MB and 255MB in
> size. My default block size was 128MB. 8 of the 73 files are larger than
> 128MB.
>
> When I ran my streaming job, it ran, as expected, 73 mappers (no
> reducers for this job).
>
> Since I have 128 nodes in my cluster, I thought I would use more systems
> in the cluster by increasing the number of mappers. I changed all the
> gzip files into bzip2 files. I expected the number of mappers to increase
> to 81. The mappers remained at 73.
>
> I tried a second experiment: I changed my dfs.block.size to 32MB. That
> should have increased my mappers to about ~250. It remains steadfast at
> 73.
>
> Is my understanding wrong? With a smaller block size and bzipped files,
> should I not get more mappers?
>
> Raj
RE: setting up eclipse env for hadoop
You are right.

From: Tim Broberg [tim.brob...@exar.com]
Sent: Tuesday, November 15, 2011 1:02 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

The ant steps for building the eclipse plugin are replaced by "mvn eclipse:eclipse" for versions 0.23+, correct?

From: Uma Maheswara Rao G [mahesw...@huawei.com]
Sent: Monday, November 14, 2011 10:11 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

Yes, you can follow that. mvn eclipse:eclipse will generate the Eclipse-related files. After that, import them directly into your Eclipse.

Note: the repository links need updating; hdfs and mapreduce have moved inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse. Is this http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards,
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393 Fax# 512-838-8858
Re: hadoop 0.20.205.0 multi-user cluster
Hi Stephen,

This is probably happening during jobtracker start. Can you provide any relevant logs from the tasktracker log file?

Raj

> From: stephen mulcahy
> To: common-user@hadoop.apache.org
> Sent: Monday, November 14, 2011 5:33 AM
> Subject: Re: hadoop 0.20.205.0 multi-user cluster
>
> On 14/11/11 09:38, stephen mulcahy wrote:
>> Hi Raj,
>>
>> Thanks for your reply, comments below.
>>
>> On 09/11/11 18:45, Raj V wrote:
>>> Can you try the following?
>>>
>>> - Change the permission to 775 for /hadoop/mapred/system
>
> As per the previous problem, the permissions still get reset on cluster
> restart.
>
> Am I the only one trying to use the cluster in this way?
> Is everyone else submitting all jobs as a single user or using the full
> authentication support?
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie http://webstar.deri.ie http://sindice.com
RE: setting up eclipse env for hadoop
The ant steps for building the eclipse plugin are replaced by "mvn eclipse:eclipse" for versions 0.23+, correct?

From: Uma Maheswara Rao G [mahesw...@huawei.com]
Sent: Monday, November 14, 2011 10:11 AM
To: common-user@hadoop.apache.org
Subject: RE: setting up eclipse env for hadoop

Yes, you can follow that. mvn eclipse:eclipse will generate the Eclipse-related files. After that, import them directly into your Eclipse.

Note: the repository links need updating; hdfs and mapreduce have moved inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse. Is this http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards,
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393 Fax# 512-838-8858
Reading Global Counters in Streaming?
Hi, I am writing a Hadoop Streaming job in Python. I know that I can increment counters by writing a special format to sys.stderr. Is it possible to *read* the values of counters from my Python program? I am using the global counter as the denominator of a probability, and must have this value available to all reducers. Is this possible, and how would I do it in Python? Thanks, Ryan
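P.S. For concreteness, the increment side I am referring to looks like this (a minimal sketch; the "my-counters" / "records-seen" names are made up for illustration):

    #!/usr/bin/env python
    # Mapper sketch: bump a job counter via the Hadoop Streaming stderr protocol.
    import sys

    def increment_counter(group, counter, amount=1):
        # Streaming scans stderr for lines of exactly this form and
        # increments the named counter by the given amount.
        sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

    for line in sys.stdin:
        increment_counter("my-counters", "records-seen")
        sys.stdout.write(line)

What I am after is the reverse direction: some way for my reducers to read that total back while the job is running.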
RE: setting up eclipse env for hadoop
Yes, you can follow that. mvn eclipse:eclipse will generate the Eclipse-related files. After that, import them directly into your Eclipse.

Note: the repository links need updating; hdfs and mapreduce have moved inside the common folder.

Regards,
Uma

From: Amir Sanjar [v1san...@us.ibm.com]
Sent: Monday, November 14, 2011 9:07 PM
To: common-user@hadoop.apache.org
Subject: setting up eclipse env for hadoop

I am trying to build hadoop-trunk using eclipse. Is this http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards,
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393 Fax# 512-838-8858
Re: Mappers and Reducer not being called, but no errors indicated
OK, continuing our earlier conversation...

I have a job that schedules 100 map tasks (a small number, just for testing), passing data via a set of 100 sequence files. This is based on the PiEstimator example that ships with the distribution. The data consist of a blob of serialised state amounting to around 20MB. I have added various checks, including checksums, to reduce the risk of data corruption or misalignment.

The mapper takes the blob of data as its value input and an integer in the range 0-99 as its key (passed as a LongWritable). Each mapper then does some processing based upon the deserialised contents of the blob and the integer key value (0-99). The reducer then selects the minimum value produced across all of the mappers.

Unfortunately, this process generates an incorrect value when compared to a simple iterative solution. After inspecting the results, it seems that the mappers generate correct values for even-numbered keys but incorrect values for odd-numbered keys. I am logging the values of the keys, so I am confident that these are correct. My serialisation checks also make me confident that the 'value' blobs are not getting corrupted, so it's all something of a mystery.

Harsh J: previously, you indicated that this might be a "...key/val data issue... Perhaps bad partitioning/grouping is happening as a result of that". I apologise for the lack of detail, but do you think this still might be the case? If so, could you refer me to some place that gives more detail on this type of issue?

With apologies for continuing to be a nuisance :-(

Andy D
Re: hadoop 0.20.205.0 multi-user cluster
On 14/11/11 15:31, Shi Jin wrote:
> I am guessing that /tmp is reset upon cluster restart. Maybe try to use a
> persistent directory.

Thanks for the suggestion, but /tmp will only be reset on server reboot, not cluster restart (I'm talking about running stop-all.sh and start-all.sh, not a full reboot).

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com
setting up eclipse env for hadoop
I am trying to build hadoop-trunk using eclipse. Is this http://wiki.apache.org/hadoop/EclipseEnvironment the latest document?

Best Regards,
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393 Fax# 512-838-8858
Re: hadoop 0.20.205.0 multi-user cluster
I am guessing that /tmp is reset upon cluster restart. Maybe try to use a persistent directory.

Shi

On Mon, Nov 14, 2011 at 6:33 AM, stephen mulcahy wrote:

> On 14/11/11 09:38, stephen mulcahy wrote:
>
>> Hi Raj,
>>
>> Thanks for your reply, comments below.
>>
>> On 09/11/11 18:45, Raj V wrote:
>>
>>> Can you try the following?
>>>
>>> - Change the permission to 775 for /hadoop/mapred/system
>
> As per the previous problem, the permissions still get reset on cluster
> restart.
>
> Am I the only one trying to use the cluster in this way?
> Is everyone else submitting all jobs as a single user or using the full
> authentication support?
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie http://webstar.deri.ie http://sindice.com

--
Shi Jin, Ph.D.
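P.S. To make that concrete, a minimal sketch of what I mean, assuming the defaults are in play (if memory serves, hadoop.tmp.dir defaults to a path under /tmp, and mapred.system.dir hangs off it unless overridden); the path below is only an example:

  <!-- core-site.xml: keep Hadoop's scratch space on persistent storage -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>  <!-- example: any persistent local path -->
  </property>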
Re: how to start tasktracker only on single port
@Harsh J

I got it from Wei Wu. It can be done using the "mapred.task.tracker.report.address" parameter.

Thanks for your reply.

From: Harsh J
To: common-user@hadoop.apache.org
Sent: Monday, 14 November 2011 2:00 PM
Subject: Re: how to start tasktracker only on single port

If you are requiring this for monitoring purposes, do notice that the older TaskTracker would have a different ID (the port number will differ). You can differentiate based on that.

TaskTrackers are mere clients to the JobTracker, and hence the JobTracker maintains no complete state of them, as is done on the HDFS side -- neither do TaskTrackers hold state. Thus, older TaskTrackers may continue to be in the live node list until they time out (10 minutes or so usually, configurable), because, from the JobTracker's perspective, new TaskTrackers are new instances, not replacements. Older instances are cleaned away after they time out.

On 14-Nov-2011, at 1:09 PM, mohmmadanis moulavi wrote:

> Hello,
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> jobtracker shows one extra tasktracker (the one which was killed and the
> other which started afterwards).
> I want it to work like this: whenever I kill the tasktracker it will stop
> sending heartbeats, but when I start the tasktracker again, it should
> resume sending heartbeats, i.e. it should start that tasktracker on the
> same port as before.
> What changes should I make in the configuration parameters for that?
> Please let me know.
>
> Thanks & Regards,
> Mohmmadanis Moulavi
Re: hadoop 0.20.205.0 multi-user cluster
On 14/11/11 09:38, stephen mulcahy wrote:
> Hi Raj,
>
> Thanks for your reply, comments below.
>
> On 09/11/11 18:45, Raj V wrote:
>> Can you try the following?
>>
>> - Change the permission to 775 for /hadoop/mapred/system

As per the previous problem, the permissions still get reset on cluster restart.

Am I the only one trying to use the cluster in this way?
Is everyone else submitting all jobs as a single user or using the full authentication support?

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: how to start tasktracker only on single port
It seems you are looking for the parameter "mapred.task.tracker.report.address". Set it to a fixed port number, 127.0.0.1:50050 for example, and try it.

Wei

On Mon, Nov 14, 2011 at 3:39 PM, mohmmadanis moulavi <anis_moul...@yahoo.co.in> wrote:

> Hello,
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> jobtracker shows one extra tasktracker (the one which was killed and the
> other which started afterwards).
> I want it to work like this: whenever I kill the tasktracker it will stop
> sending heartbeats, but when I start the tasktracker again, it should
> resume sending heartbeats, i.e. it should start that tasktracker on the
> same port as before.
> What changes should I make in the configuration parameters for that?
> Please let me know.
>
> Thanks & Regards,
> Mohmmadanis Moulavi
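P.S. For the archives, a minimal sketch of that setting in mapred-site.xml on the tasktracker node (50050 is only an example; any fixed, free port should work):

  <property>
    <name>mapred.task.tracker.report.address</name>
    <value>127.0.0.1:50050</value>
  </property>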
Re: how to start tasktracker only on single port
Yeah, by "kill" I mean stopping the tasktracker.

----- Original Message -----
From: Alexander C.H. Lorenz
To: common-user@hadoop.apache.org; mohmmadanis moulavi
Cc:
Sent: Monday, 14 November 2011 1:16 PM
Subject: Re: how to start tasktracker only on single port

Hi,

Please explain the reason to kill (I assume kill -9) a tasktracker. The best way is to use the start/stop scripts.

best,
Alex

On Mon, Nov 14, 2011 at 8:39 AM, mohmmadanis moulavi <anis_moul...@yahoo.co.in> wrote:

> Hello,
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> jobtracker shows one extra tasktracker (the one which was killed and the
> other which started afterwards).
> I want it to work like this: whenever I kill the tasktracker it will stop
> sending heartbeats, but when I start the tasktracker again, it should
> resume sending heartbeats, i.e. it should start that tasktracker on the
> same port as before.
> What changes should I make in the configuration parameters for that?
> Please let me know.
>
> Thanks & Regards,
> Mohmmadanis Moulavi

--
Alexander Lorenz
http://mapredit.blogspot.com

Think of the environment: please don't print this email unless you really need to.
Re: HDFS and Openstack - avoiding excessive redundancy
Or, more generally: isn't using virtualized I/O counterproductive when dealing with Hadoop M/R? I would think that for running Hadoop M/R you'd want predictable and consistent I/O on each node, not to mention your bottlenecks are usually disk I/O (and maybe CPU), so using virtualisation makes things less performant and less predictable, and therefore inferior. Or am I missing something?

Dieter

On Sat, 12 Nov 2011 07:54:05 + Graeme Seaton wrote:

> One advantage to using Hadoop replication, though, is that it provides
> a greater pool of potential servers for M/R jobs to execute on. If
> you simply use Openstack replication, it will appear to the JobTracker
> that a particular block only exists on a single server and should
> only be executed on that node. This may have an impact
> depending on your workload profile.
>
> Regards,
> Graeme
>
> On 12/11/11 07:24, Dejan Menges wrote:
>> The replication factor for HDFS can easily be changed to 1 in
>> hdfs-site.xml, if you don't need its redundancy.
>>
>> Regards,
>> Dejo
>>
>> Sent from my iPhone
>>
>> On 12. 11. 2011., at 03:58, Edmon Begoli wrote:
>>
>>> A question related to standing up cloud infrastructure for running
>>> Hadoop/HDFS.
>>>
>>> We are building up an infrastructure using Openstack, which has its
>>> own storage management redundancy.
>>>
>>> We are planning to use Openstack to instantiate Hadoop nodes (HDFS,
>>> M/R tasks, Hive, HBase) on demand.
>>>
>>> The problem is that HDFS by design creates three copies of the
>>> data, so there is 4x redundancy, which we would prefer to avoid.
>>>
>>> I am asking here if anyone has had a similar case and whether anyone
>>> has a helpful solution to recommend.
>>>
>>> Thank you in advance,
>>> Edmon
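P.S. For reference, the setting Dejan refers to is, as far as I know, this one in hdfs-site.xml (a value of 1 keeps a single copy per block, giving up the HDFS-level redundancy with the scheduling trade-off Graeme describes):

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>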
Re: hadoop 0.20.205.0 multi-user cluster
Hi Raj,

Thanks for your reply, comments below.

On 09/11/11 18:45, Raj V wrote:
> Can you try the following?
>
> - Change the permission to 775 for /hadoop/mapred/system

Done.

> - Change the group to hadoop

Done.

> - Make all users who need to submit hadoop jobs a part of the hadoop group.

The users are remote users. Do I need to create accounts on the hadoop cluster for those users to add them to the hadoop group, or how should this work?

Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: how to start tasktracker only on single port
If you are requiring this for monitoring purposes, do notice that the older TaskTracker would have a different ID (the port number will differ). You can differentiate based on that.

TaskTrackers are mere clients to the JobTracker, and hence the JobTracker maintains no complete state of them, as is done on the HDFS side -- neither do TaskTrackers hold state. Thus, older TaskTrackers may continue to be in the live node list until they time out (10 minutes or so usually, configurable), because, from the JobTracker's perspective, new TaskTrackers are new instances, not replacements. Older instances are cleaned away after they time out.

On 14-Nov-2011, at 1:09 PM, mohmmadanis moulavi wrote:

> Hello,
>
> Friends, I am using Hadoop version 0.20.2.
> My problem is that whenever I kill the tasktracker and start it again, the
> jobtracker shows one extra tasktracker (the one which was killed and the
> other which started afterwards).
> I want it to work like this: whenever I kill the tasktracker it will stop
> sending heartbeats, but when I start the tasktracker again, it should
> resume sending heartbeats, i.e. it should start that tasktracker on the
> same port as before.
> What changes should I make in the configuration parameters for that?
> Please let me know.
>
> Thanks & Regards,
> Mohmmadanis Moulavi
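P.S. If memory serves, the timeout I mentioned is controlled by mapred.tasktracker.expiry.interval (in milliseconds) on 0.20.x; a sketch for mapred-site.xml, assuming that property name matches your version:

  <property>
    <name>mapred.tasktracker.expiry.interval</name>
    <value>600000</value>  <!-- 10 minutes, the usual default -->
  </property>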