Re: Hadoop and Ubuntu / Java
As per Oracle, going forward OpenJDK will be the official Oracle JDK for Linux, which means OpenJDK will be the same as the official one. On Tue, Dec 20, 2011 at 9:12 PM, hadoopman hadoop...@gmail.com wrote: http://www.omgubuntu.co.uk/2011/12/java-to-be-removed-from-ubuntu-uninstalled-from-user-machines/ I'm curious what this will mean for Hadoop on Ubuntu systems going forward. I tried OpenJDK with Hadoop nearly two years ago; needless to say, it was a real problem. Hopefully we can still download the JDK from the Sun/Oracle web site and use it. It won't be the same, though :/ -- https://github.com/zinnia-phatak-dev/Nectar
Re: How to rebuild NameNode from DataNode.
Switch to MapR M5? :-) Just kidding. A simple way of solving this pre-CDH4: NFS-mount a directory from your SN and add it to your list of checkpoint directories. You may lose some data, but you should be able to rebuild. There is more to this, but that's the basic idea of how to keep a copy of your metadata. Sent from a remote device. Please excuse any typos... Mike Segel On Apr 18, 2012, at 11:48 PM, Saburo Fujioka fuji...@do-it.co.jp wrote: Hello, I am currently planning countermeasures for operational failures in our system. I am investigating how to rebuild the NameNode from the DataNodes if the NameNode is ever lost, but so far I have not found a way. I confirmed that the namespaceID in dfs/name/current/VERSION matches the one on the DataNodes, but the fsimage was not rebuilt, and hadoop dfs -ls listed nothing. The risk of losing the NameNode is low, because it is protected by Corosync + Pacemaker + DRBD. But even for this rare case, we need a clear procedure for reconstructing the NameNode from the DataNodes. Do you know how to do this? I am using hadoop 1.0.1. Thank you very much,
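To make Mike's suggestion concrete, a minimal Hadoop 1.x sketch (the paths are placeholders, and check whether your distribution reads fs.checkpoint.dir from core-site.xml or hdfs-site.xml): keep a checkpoint copy on an NFS mount, and if the NameNode host is lost, point a fresh NameNode at empty dfs.name.dir directories and run hadoop namenode -importCheckpoint to load the image from the checkpoint directory.

  <property>
    <name>fs.checkpoint.dir</name>
    <!-- local checkpoint directory plus an NFS-mounted copy (placeholder paths) -->
    <value>/data/dfs/namesecondary,/mnt/nfs-sn/namesecondary</value>
  </property>

Note that this protects the namespace metadata only up to the last checkpoint; the fsimage cannot be reconstructed from DataNode block data alone, which is why a checkpoint copy (or a Corosync/DRBD replica) has to exist somewhere.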
Algorithms used in fairscheduler 0.20.205
The closest document matching the current implementation of the fair scheduler appears to be http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-55.html from Matei Zaharia et al. Another document, on delay scheduling, can be found from 2010. a) Is there any newer documented version of the implementation? b) Are there any algorithms besides delay scheduling, the copy-compute splitting algorithm, and the fair-share calculation algorithm that are important for cluster performance and fair sharing? c) Is there any connection between copy-compute splitting and the MapReduce phases (copy-sort-reduce)? Thank you.
JobControl run
Hi, I wonder why calling run() on a JobControl object loops forever. Namely, this code doesn't work:

  JobControl jobControl = new JobControl(Name);
  // some stuff here (add jobs and dependencies)
  jobControl.run();

This code works but looks a bit ugly:

  JobControl jobControl = new JobControl(Name);
  // some stuff here (add jobs and dependencies)
  Thread control = new Thread(jobControl);
  control.start();
  while (!jobControl.allFinished()) {
    try {
      Thread.sleep(5000);
    } catch (Exception e) {}
  }

I wonder if the run() method in the JobControl class could add the following condition to break its while(true) loop:

  if (jobsInProgress.isEmpty()) {
    break;
  }

Thanks very much, Juan
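For reference, the usual idiom is Juan's second form plus a stop() call, which tells run()'s internal loop to exit once all jobs are done. A self-contained sketch against the Hadoop 1.x org.apache.hadoop.mapred.jobcontrol API (job creation and dependencies omitted; the group name is made up):

  import org.apache.hadoop.mapred.jobcontrol.JobControl;

  public class JobControlDriver {
    public static void main(String[] args) throws InterruptedException {
      JobControl jobControl = new JobControl("my-chain");
      // ... build Job objects, declare dependencies, jobControl.addJob(...) ...
      Thread control = new Thread(jobControl);
      control.setDaemon(true);   // don't let the polling thread keep the JVM alive
      control.start();
      while (!jobControl.allFinished()) {
        Thread.sleep(5000);      // poll until every job has succeeded or failed
      }
      jobControl.stop();         // asks run()'s while(true) loop to terminate
      System.out.println("failed jobs: " + jobControl.getFailedJobs().size());
    }
  }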
Troubleshoot job failures after upgrade to 1.0.2
Hi, I have an application on Cassandra 1.0.8 + Hadoop. It was previously running on the Cloudera distribution hadoop-0.20.2-cdh3u1; today I tried to upgrade to Hadoop 1.0.2 but ran into consistent job failures. The hadoop userlogs seem quite clear:

  2012-04-19 18:21:09.837 java[57837:1903] Unable to load realm info from SCDynamicStore
  Apr 19, 2012 6:21:10 PM org.apache.hadoop.util.NativeCodeLoader clinit
  WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Apr 19, 2012 6:21:10 PM org.apache.hadoop.mapred.Child main
  SEVERE: Error running child : java.lang.NoClassDefFoundError: Could not initialize class org.apache.log4j.LogManager
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:334)
    at org.apache.hadoop.mapred.Child.main(Child.java:229)

In particular, the NoClassDefFoundError: Could not initialize class org.apache.log4j.LogManager seems to be the most plausible cause. Can you help me troubleshoot the failure? Thanks, -- Filippo Diotalevi
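"Could not initialize class" usually means log4j's static initialization failed earlier, often because of a missing or conflicting log4j jar (or log4j.properties) on the task classpath after the upgrade. One way to see which jar, if any, a JVM resolves log4j from is a throwaway check like this (a debugging sketch, not a fix):

  // Quick check: where is org.apache.log4j.LogManager being loaded from?
  public class WhichLog4j {
    public static void main(String[] args) {
      try {
        Class<?> c = Class.forName("org.apache.log4j.LogManager");
        java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
        System.out.println(src == null ? "bootstrap classpath" : src.getLocation());
      } catch (Throwable t) {
        t.printStackTrace(); // shows the real cause hidden behind the NoClassDefFoundError
      }
    }
  }

Running it with the same classpath the failing task uses should reveal a duplicate, missing, or stale jar.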
Re: Pre-requisites for hadoop 0.23/CDH4
You are better off trying hadoop-0.23.1 or even hadoop-0.23.2-rc0, since CDH4's version of YARN is very incomplete and you might get nasty surprises there. Settings:
# Run 1 NodeManager with yarn.nodemanager.resource.memory-mb = 1024
# Use the CapacityScheduler (significantly better tested) by setting yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
# Set the min container size to 128M or 256M (up to you) via yarn.scheduler.capacity.minimum-allocation-mb
# Set the max container size to 1024M (the max given to the NM) via yarn.scheduler.capacity.maximum-allocation-mb
Arun On Apr 18, 2012, at 8:00 PM, praveenesh kumar wrote: Hi, Sweet.. Can you please elaborate on how I can tweak my configs to make CDH4/hadoop-0.23 run in a 1.5GB RAM VM? Regards, Praveenesh On Wed, Apr 18, 2012 at 8:42 AM, Harsh J ha...@cloudera.com wrote: Praveenesh, Speaking minimally (and thereby requiring fewer tweaks on your end), 1.5 GB would be a good amount of RAM if available (1.0 will do too, if you make sure to tweak your configs to not use too much heap memory). A single processor should do fine for testing purposes. On Tue, Apr 17, 2012 at 8:51 PM, praveenesh kumar praveen...@gmail.com wrote: I am looking to test hadoop 0.23 or the CDH4 beta on my local VM. I want to run the sample example code on the new architecture and play around with the containers/resource managers. Are there any prerequisites on default memory/CPU/core settings I need to keep in mind before setting up the VM? Regards, Praveenesh -- Harsh J -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
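Written out as configuration, Arun's settings might look like the sketch below. Treat it as an illustration: the yarn.scheduler.capacity.* sizing properties belonged in capacity-scheduler.xml rather than yarn-site.xml in the 0.23 line, so check where your build reads them.

  <!-- yarn-site.xml: one small NodeManager, CapacityScheduler -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>

  <!-- capacity-scheduler.xml: container sizing (256M min, 1024M max) -->
  <property>
    <name>yarn.scheduler.capacity.minimum-allocation-mb</name>
    <value>256</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-allocation-mb</name>
    <value>1024</value>
  </property>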
Re: Pre-requisites for hadoop 0.23/CDH4
Thanks Arun, will try out those settings. Is there any good documentation on configuring/playing with hadoop 0.23 apart from the apache hadoop-0.23 page? I have already looked into that page; just wondering whether there is something more that I don't know about. Regards, Praveenesh On Fri, Apr 20, 2012 at 12:45 AM, Arun C Murthy a...@hortonworks.com wrote: You are better off trying hadoop-0.23.1 or even hadoop-0.23.2-rc0, since CDH4's version of YARN is very incomplete and you might get nasty surprises there...
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the Hadoop computations. Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop would normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may have problems in the middle and be relaunched, so when you are writing out your updated manifest, be careful not to replace the old one until the new one is completely ready and will not fail, or you may lose data. You may also need to be careful in your reduce if you are writing directly to a file there too, but because that is a plain write rather than a read-modify-write, it is not as critical. --Bobby Evans On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote: Please help me architect the design of my first significant MR task beyond word count. My program works well, but I am trying to optimize performance to maximize use of the available computing resources. I have 3 questions at the bottom. Project description in an abstract sense (written in Java):
* I have MM MANIFEST files available at storage:/root/1.manif.txt to 4000.manif.txt
* Each MANIFEST in turn contains a variable number EE of URLs to EBOOKS (the range could be 1 - 50,000 ebook URLs per MANIFEST), stored at e.g. storage:/root/1.manif/1223.folder/5443.Ebook.ebk - so we are talking about millions of ebooks.
My task is to:
1. Fetch each ebook and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version).
2. Update each EBOOK entry record in the manifest with the 3 attributes (e.g. ebook 1334 - publisher=aaa, year=bbb, ebook-version=2.01).
3. Create an output file named publisher_year_ebook-version that contains a list of all ebook URLs that met that criteria. Example: file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc., and file storage:/root/summary/PENGUIN_2001_3.12.txt contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk
etc.
4. Finally, I also want to output statistics such that:
publisher_year_ebook-version COUNT_OF_URLs
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01 11,322
etc.
Here is how I implemented it:
* My launcher gets the list of MM manifests.
* My Mapper gets one manifest.
--- It reads the manifest within a WHILE loop,
--- fetches each EBOOK and obtains the attributes from it,
--- updates the manifest for that ebook,
--- does context.write(new Text("RANDOMHOUSE_1999_2.01"), new Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk")).
--- Once all ebooks in the manifest are read, it saves the updated manifest and exits.
* My Reducer gets RANDOMHOUSE_1999_2.01 and a list of ebook URLs.
--- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt with all the storage URLs for the ebooks.
--- It also does context.write(new Text("RANDOMHOUSE_1999_2.01"), new IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST)).
As I mentioned, it's working. I launch it on 15 elastic instances. I have three questions:
1. Is this the best way to implement the MR logic?
2. I don't know if each of the instances is getting one task or multiple tasks simultaneously for the MAP portion. If it is not getting multiple MAP tasks, should I go the route of multithreaded reading of ebooks from each manifest? It's not efficient to read just one ebook at a time per machine. Is Context.write() thread-safe?
3. I can see the log4j logs for the main program, but have no visibility into the logs for the Mapper or Reducer. Any ideas?
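As a concrete reading of the map side Sky describes, here is a sketch using the new (org.apache.hadoop.mapreduce) API; the storage helpers at the bottom are hypothetical stand-ins for whatever storage: actually is:

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Input: one manifest URL per record. Output: (publisher_year_version, ebook URL).
  public class ManifestMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable offset, Text manifestUrl, Context context)
        throws IOException, InterruptedException {
      for (String ebookUrl : listEbookUrls(manifestUrl.toString())) {
        String[] a = fetchAttributes(ebookUrl); // {publisher, year, version}
        outKey.set(a[0] + "_" + a[1] + "_" + a[2]);
        outVal.set(ebookUrl);
        context.write(outKey, outVal);
        context.progress(); // keep slow remote fetches from tripping the task timeout
      }
      // Manifest rewrite omitted: write to a temporary name first, then rename,
      // so a relaunched attempt never leaves a half-written manifest behind.
    }

    // Hypothetical storage-layer helpers; implementations depend on "storage:".
    private List<String> listEbookUrls(String manifestUrl) throws IOException {
      throw new UnsupportedOperationException("storage-specific fetch goes here");
    }
    private String[] fetchAttributes(String ebookUrl) throws IOException {
      throw new UnsupportedOperationException("storage-specific fetch goes here");
    }
  }

On question 2: a single Mapper instance processes its split's records sequentially, so parallelism comes from the number of map tasks (splits), not from threads inside one task; see the NLineInputFormat note later in this thread.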
Re: Multiple data centre in Hadoop
Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details. I can guess at what it would take, but that is all it would be at this point. --Bobby On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Thanks Bobby, I'm looking for something like this. Now the question is what the best strategy is to do hot/hot or hot/warm. I need to consider CPU and network bandwidth, and also to decide from which layer this replication should start. Regards, Abhishek On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote: Hi Abhishek, Manu is correct about high availability within a single colo. I realize that in some cases you have to have failover between colos. I am not aware of any turnkey solution for things like that, but generally what you want to do is run two clusters, one in each colo, either hot/hot or hot/warm - I have seen both, depending on how quickly you need to fail over. In hot/hot, the input data is replicated to both clusters and the same software is run on both. In this case, though, you have to be fairly sure that your processing is deterministic, or the results could be slightly different (i.e. no generating of random IDs). In hot/warm, the data is replicated from one colo to the other at defined checkpoints. The data is only processed on one of the grids, but if that colo goes down, the other one can take up the processing from wherever the last checkpoint was. I hope that helps. --Bobby On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote: Hi Abhishek, 1. Use multiple directories for dfs.name.dir, dfs.data.dir, etc. * Recommendation: write to two local directories on different physical volumes, and to an NFS-mounted directory - data will be preserved even in the event of a total failure of the NameNode machines. * Recommendation: soft-mount the NFS directory - if the NFS mount goes offline, this will not cause the NameNode to fail. 2. Rack awareness: https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Thanks Robert. Is there a best practice or design that can address high availability to a certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does not. Sorry. On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, I just wanted to know whether Hadoop supports more than one data centre. This is basically for DR purposes and high availability, where if one centre goes down the other can take over. Regards, Abhishek -- Thanks & Regards Manu S SI Engineer - OpenSource & HPC Wipro Infotech Mob: +91 8861302855 Skype: manuspkd www.opensourcetalk.co.in
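Manu's multi-directory recommendation, sketched as hdfs-site.xml (the paths are placeholders; the NFS entry assumes a soft-mounted export, e.g. mounted with -o soft,intr, so a dead NFS server cannot hang the NameNode):

  <property>
    <name>dfs.name.dir</name>
    <!-- two local physical volumes plus a soft-mounted NFS directory -->
    <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
  </property>

The NameNode writes its image and edit log to every listed directory, so any one surviving copy is enough to recover the namespace.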
Re: Multiple data centre in Hadoop
I don't know of any open source solution for doing this... And yeah, it's something one can't talk about ;-) On Apr 19, 2012, at 4:28 PM, Robert Evans wrote: Where I work we have done some things like this, but none of them are open source...
Re: Multiple data centre in Hadoop
If you want to start an open source project for this, I am sure that there are others with the same problem who might be very willing to help out. :) --Bobby Evans On 4/19/12 4:31 PM, Michael Segel michael_se...@hotmail.com wrote: I don't know of any open source solution for doing this...
Re: Multiple data centre in Hadoop
Hive is beginning to implement region support, where one metastore will manage multiple filesystems and JobTrackers. When a query creates a table, it will then be copied to one or more datacenters. In addition, the query planner will intelligently attempt to run queries only in regions where all the tables exist. While waiting for these awesome features, I am doing a fair amount of distcp work from groovy scripts. Edward On Thu, Apr 19, 2012 at 5:33 PM, Robert Evans ev...@yahoo-inc.com wrote: If you want to start an open source project for this, I am sure that there are others with the same problem who might be very willing to help out. :) ...
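For the interim approach Edward mentions, the copying itself is stock distcp. As a sketch (the cluster URIs and paths are made up, and this assumes the Hadoop 1.x org.apache.hadoop.tools.DistCp, which implements Tool), the same thing a groovy script shells out to can also be driven in-process:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.tools.DistCp;
  import org.apache.hadoop.util.ToolRunner;

  // Mirror one table's directory from the primary colo to the standby colo.
  public class CopyTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      int rc = ToolRunner.run(conf, new DistCp(conf), new String[] {
          "-update", // copy only files that changed since the last run
          "hdfs://nn.colo-a.example.com:8020/warehouse/t1",
          "hdfs://nn.colo-b.example.com:8020/warehouse/t1" });
      System.exit(rc);
    }
  }

(Copying between clusters on different Hadoop versions typically needs the hftp:// source scheme instead.)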
hadoop datanode is dead but cannot stop it
I have encountered a situation where, after a disk I/O error on a datanode machine, the datanode is dead but the datanode daemon is still alive, and I cannot stop it in order to restart the datanode. When I check the process, it seems that the Linux command du -sk path/to/datadir has hung; this is what caused the datanode to die, and I cannot stop the datanode or kill its process even with kill -9. Is this a bug? Maybe we should set a timeout on the du command for the case where it never returns to the datanode.
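The proposed timeout is easy to sketch in plain Java (an illustration of the idea only, not Hadoop's actual du helper): poll the child process and destroy it once a deadline passes, instead of blocking in waitFor(). Note that a process stuck in uninterruptible disk sleep can ignore even SIGKILL until its I/O completes, which would also explain why kill -9 fails on the datanode itself.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  // Run "du -sk <dir>" but give up after timeoutMs instead of hanging forever.
  public class TimedDu {
    public static String run(String dir, long timeoutMs) throws Exception {
      Process p = new ProcessBuilder("du", "-sk", dir).start();
      long deadline = System.currentTimeMillis() + timeoutMs;
      while (true) {
        try {
          p.exitValue();               // throws if du is still running
          break;
        } catch (IllegalThreadStateException stillRunning) {
          if (System.currentTimeMillis() > deadline) {
            p.destroy();               // best effort; may not reap a D-state process
            throw new Exception("du timed out on " + dir);
          }
          Thread.sleep(100);           // poll rather than block in waitFor()
        }
      }
      BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
      String line = r.readLine();      // e.g. "12345   /path/to/datadir"
      if (p.exitValue() != 0 || line == null) throw new Exception("du failed on " + dir);
      return line;
    }
  }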
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
Thanks for your reply. After I sent my email, I found a fundamental defect in my understanding of how MR is distributed. I discovered that even though I was firing off 15 cores, the map job - the most expensive part of my processing - was run on only 1 core. To start my map job, I was creating a single file with the following data:

1 storage:/root/1.manif.txt
2 storage:/root/2.manif.txt
3 storage:/root/3.manif.txt
...
4000 storage:/root/4000.manif.txt

I thought that each of the available cores would be assigned a map job from the top of this one file, one line at a time, and that as soon as a core was done it would get the next map job. However, it looks like I need to split the file into multiple pieces. While that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need, or how to deal with adding new cores at runtime. Any suggestions? (It doesn't have to be a file; it can be a list, etc.) This would all be much easier to debug if I could somehow get the log4j logs for my mappers and reducers; I can see log4j output for my main launcher, but I am not sure how to enable it for the mappers and reducers. Thx - Akash -Original Message- From: Robert Evans Sent: Thursday, April 19, 2012 2:08 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation From what I can see your implementation seems OK, especially from a performance perspective...
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
How 'large' - or rather, in this case, how small - is your file? On a default system the block size is 64MB, so if your file fits within 64MB you end up with 1 block, and you will only get 1 mapper. On Apr 19, 2012, at 10:10 PM, Sky wrote: Thanks for your reply. After I sent my email, I found a fundamental defect in my understanding of how MR is distributed...
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
My input file for the mapper is very small - all it contains is the URLs of the manifests. The task for the mappers is to fetch each manifest, then fetch files using the URLs in the manifest, and then process them. Besides passing around lists of files, I am not really accessing the disk; the work should be RAM, network, and CPU (unzip, parse XML, extract attributes). So is my only choice to break up the input file and submit multiple files (if I have 15 cores, should I split the file of URLs into 15 files? And how does that look in code?)? The two drawbacks are that some cores might finish early and sit idle, and that I don't know how to deal with dynamically increasing/decreasing the number of cores. Thx - Sky -Original Message- From: Michael Segel Sent: Thursday, April 19, 2012 8:49 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation How 'large' or rather in this case small is your file?...
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
If the file is small enough, you could read it into a Java object like a list and write your own input format that takes a list object as its input and then lets you specify the number of mappers. On Apr 19, 2012, at 11:34 PM, Sky wrote: My file for the input to the mapper is very small - all it contains is the URLs of the manifests...
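Before writing a custom format, note that the stock old-API NLineInputFormat (org.apache.hadoop.mapred.lib) already does roughly this: it splits a text input file so that each map task receives N lines - one manifest URL per mapper at the default N=1. A driver-wiring sketch (the class name is made up):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.NLineInputFormat;

  public class NLineDriverSketch {
    public static JobConf configure() {
      JobConf job = new JobConf(NLineDriverSketch.class);
      // Each map task gets exactly one line of the input file, so 4000
      // manifest lines become 4000 map tasks spread across all task slots.
      job.setInputFormat(NLineInputFormat.class);
      job.setInt("mapred.line.input.format.linespermap", 1);
      return job;
    }
  }

With far more map tasks than slots, early finishers simply pull the next task, and nodes added mid-job pick up work as they join - which addresses both drawbacks Sky raised.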
Re: hadoop datanode is dead but cannot stop it
What distro/version of Hadoop are you using? This was a bug that was fixed quite a while ago. On Fri, Apr 20, 2012 at 7:29 AM, Johnson Chengwu johnsonchen...@gmail.com wrote: I have encountered a situation where, after a disk I/O error on a datanode machine, the datanode is dead but the datanode daemon is still alive... -- Harsh J