Setting Number of Maps in 0.20.2
I am trying to figure out how to set the number of maps to use in 0.20.2. If I were using JobConf in my program I could call:

    conf.setNumMapTasks(numMaps);

However, JobConf and that method are deprecated, and when we started our project we structured everything around Configuration and Job for that reason. Is there a way to set the number of map tasks using Job and Configuration?

Thank you for any help,
Jason
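For context: the deprecated setNumMapTasks(n) just sets the mapred.map.tasks property, which the framework treats only as a hint (the actual number of maps is driven by the input splits). Below is a minimal sketch of doing the same through Configuration and Job, assuming the 0.20.x new API; the class name and value are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapCountHint {
    public static void main(String[] args) throws Exception {
        int numMaps = 10;  // hypothetical value
        Configuration conf = new Configuration();
        // In the old API, conf.setNumMapTasks(n) boils down to setting this
        // property; it is a hint only -- the input splits decide the real count.
        conf.setInt("mapred.map.tasks", numMaps);
        Job job = new Job(conf, "map-count-hint");  // 0.20.x constructor
        System.out.println(job.getConfiguration().get("mapred.map.tasks"));
    }
}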
Re: Question about RAID controllers and hadoop
[cwimmer@hostname bonnie++-1.03e]$ ./bonnie++ -d . -s 5 -m P410 -f
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
P410             5M           68392  12 21153   3           116423   4 216.8   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22238  33 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
P410,5M,,,68392,12,21153,3,,,116423,4,216.8,0,16,22238,33,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

On 8/11/11 5:15 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

    On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer cwim...@yahoo-inc.com wrote:

        We currently use P410s in a 12-disk system. Each disk is set up as a
        RAID0 volume. Performance is at least as good as a bare disk.

    Can you please share what throughput you see with P410s? Are these SATA or SAS?

On 8/11/11 3:23 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

    If I read that email chain correctly, then they were referring to the classic
    JBOD vs. multiple-disks-striped-together conversation. The conversation that
    was started here is about JBOD vs. one RAID0 per disk, and the effects of the
    RAID controller on those independent RAID volumes.

    Matt

-----Original Message-----
From: Kai Voigt [mailto:k...@123.org]
Sent: Thursday, August 11, 2011 5:17 PM
To: common-user@hadoop.apache.org
Subject: Re: Question about RAID controllers and hadoop

Yahoo did some testing 2 years ago: http://markmail.org/message/xmzc45zi25htr7ry
An updated benchmark would be interesting to see, though.

Kai

On 12.08.2011 at 00:13, GOEKE, MATTHEW (AG/1000) wrote:

    My assumption would be that having a set of 4 RAID0 disks would actually be
    better than having a controller that allowed pure JBOD of 4 disks, due to
    the cache on the controller. If anyone has personal experience with this I
    would love to know the performance numbers, but our infrastructure guy is
    doing tests on exactly this over the next couple of days, so I will pass it
    along once we have it.

    Matt

-----Original Message-----
From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com]
Sent: Thursday, August 11, 2011 5:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Question about RAID controllers and hadoop

True, you need a P410 controller. You can create a RAID0 volume for each disk to
make it act as JBOD.

-Bharath

From: Koert Kuipers ko...@tresata.com
To: common-user@hadoop.apache.org
Sent: Thursday, August 11, 2011 2:50 PM
Subject: Question about RAID controllers and hadoop

Hello all,

We are considering using low-end HP ProLiant machines (DL160s and DL180s) for
cluster nodes. However, with these machines, if you want more than 4 hard drives,
HP puts in a P410 RAID controller. We would configure the RAID controller to
function as JBOD by simply creating multiple RAID volumes with one disk each.
Does anyone have experience with this setup? Is it a good idea, or am I
introducing an I/O bottleneck?

Thanks for your help!

Best, Koert
Is Hadoop suitable for my data distribution problem?
Hey guys, I'm new on the list and I'm currently considering Hadoop to solve a data distribution problem. Right now, there's a server which contains very large files (typical files are 30 GB or even more). This server is accessed through the LAN and over the internet but, of course, the latter is difficult without a local connection.

My idea to solve this problem is to deploy new servers at the places which access the data most often, in such a way that they get a local copy of the files they access most. These new servers would download and store parts of the data (entire files) so that the data can be accessed through their own LAN alone, without needing to rely on another server's data.

Is it possible to impose this kind of restriction when distributing a file across Hadoop's nodes? In truth, I don't know whether this restriction is even useful. In my head, enforcing this kind of data locality would make it possible to use the data internally even if there is no internet connection, at the price of limiting the number of nodes available for load balancing and replication. Is this tradeoff acceptable, or at least possible with Hadoop?

thanks, Thiago Moraes - EnC 07 - UFSCar
Re: hdfs format command issue
This should help:

    echo Y | ${hadoophdfshome}/bin/hdfs namenode -format

-giri

On Sat, Aug 13, 2011 at 8:41 AM, Dhodapkar, Chinmay chinm...@qualcomm.com wrote:

    I am trying to automate the installation/bringup of a complete hadoop/hbase
    cluster from a single script. I have run into a very small issue... Before
    bringing up the namenode, I have to format it with the usual:

        hadoop namenode -format

    Executing the above command prompts the user for Y/N. Is there an option
    that can be passed to force the format without prompting? The aim is for
    the script to complete without any human intervention...
WritableComparable
Hi Folks, After much poking around I am still unable to determine why I am seeing 'reduce' being called twice with the same key. Recall from my previous email that sameness is determined by 'compareTo' of my custom key type. AFAIK, the default WritableComparator invokes 'compareTo' for any two keys which are being ordered during sorting and merging. Is it somehow possible that a bitwise comparator is used for the spilled map output rather than the default WritableComparator? I am out of clues, short of studying the shuffling code. If anyone can suggest some further debugging steps, don't be shy. :) Thanks!!! stan
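One debugging step along these lines is to take the comparator choice out of the framework's hands and pin it explicitly. A minimal sketch follows, assuming the custom key class is called Key and implements WritableComparable<Key>; the comparator class name is made up. With createInstances=true the base WritableComparator deserializes both keys and falls back to compareTo(), so no byte-level comparison is involved:

import org.apache.hadoop.io.WritableComparator;

// Delegating comparator for the custom Key type (names are hypothetical).
public class KeyComparator extends WritableComparator {
    public KeyComparator() {
        // createInstances = true: the base class deserializes both keys and
        // compares them via Key.compareTo(), never via raw bytes.
        super(Key.class, true);
    }
}

// In the driver, make the choice explicit so nothing else can be picked up:
//   WritableComparator.define(Key.class, new KeyComparator()); // default for Key
//   job.setSortComparatorClass(KeyComparator.class);           // sorting map output
//   job.setGroupingComparatorClass(KeyComparator.class);       // grouping at the reducer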
Re: WritableComparable
Does your compareTo() method test object pointer equality? If so, you could be getting burned by Hadoop reusing Writable objects.

-Joey

On Aug 14, 2011 9:20 PM, Stan Rosenberg srosenb...@proclivitysystems.com wrote:

    Hi Folks, After much poking around I am still unable to determine why I am
    seeing 'reduce' being called twice with the same key. Recall from my previous
    email that sameness is determined by 'compareTo' of my custom key type. AFAIK,
    the default WritableComparator invokes 'compareTo' for any two keys which are
    being ordered during sorting and merging. Is it somehow possible that a
    bitwise comparator is used for the spilled map output rather than the default
    WritableComparator? I am out of clues, short of studying the shuffling code.
    If anyone can suggest some further debugging steps, don't be shy. :)
    Thanks!!! stan
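To illustrate the reuse pitfall Joey is describing: the reducer's value iterator re-fills one Writable instance on every step, so holding on to references (or relying on pointer identity) silently aliases a single object. A minimal sketch, with MyWritable standing in for a custom value type (hypothetical names throughout):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

// MyWritable stands in for any custom Writable value type.
public class BufferingReducer extends Reducer<Text, MyWritable, Text, MyWritable> {
    @Override
    protected void reduce(Text key, Iterable<MyWritable> values, Context context)
            throws IOException, InterruptedException {
        List<MyWritable> buffered = new ArrayList<MyWritable>();
        for (MyWritable v : values) {
            // BUG: 'v' is the same object on every iteration, so this would
            // leave the list full of references to one (last-read) instance:
            //   buffered.add(v);

            // OK: take a deep copy through the Writable serialization machinery.
            buffered.add(WritableUtils.clone(v, context.getConfiguration()));
        }
        for (MyWritable copy : buffered) {
            context.write(key, copy);
        }
    }
}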
NPE in TaskLogAppender
I'm running the Cassandra Brisk server with Hadoop core 0.20.203 on OS X; everything is local. I keep running into this problem for Hive jobs:

INFO 13:52:39,923 Error from attempt_201108151342_0001_m_01_1: java.lang.NullPointerException
        at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:264)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)

The only info I've found online was http://www.mail-archive.com/common-user@hadoop.apache.org/msg12829.html

Just for fun I tried:

* setting mapred.acls.enabled to true
* setting mapred.queue.default.acl-submit-job and mapred.queue.default.acl-administer-jobs to *

There was no discernible increase in joy, though. Any thoughts?

Cheers

Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Re: WritableComparable
On Sun, Aug 14, 2011 at 9:33 PM, Joey Echeverria j...@cloudera.com wrote:

    Does your compareTo() method test object pointer equality? If so, you could
    be getting burned by Hadoop reusing Writable objects.

Yes, but only the equality between enum values. Interestingly, when 'reduce' is called there are three instances of the same key. Two instances are correctly merged, and they both come from the same mapper. The other instance comes from a different mapper and for some reason does not get merged. I see the key and the values (corresponding to the two merged instances) passed as arguments to 'reduce'; then in a subsequent 'reduce' call I see the key and the value corresponding to the third instance.

For completeness, here is my 'Key.compareTo':

public int compareTo(Key o) {
    if (this.type != o.type) {
        // Type.X < Type.Y
        return (this.type == Type.X ? -1 : 1);
    }
    // otherwise, delegate
    if (this.type == Type.X) {
        return this.key1.compareTo(o.key1);
    } else {
        return this.key2.compareTo(o.key2);
    }
}

The 'type' field is an enum with two possible values, say X and Y. Key is essentially a union type; i.e., at any given time it's the values in key1 or key2 that are being compared (depending on the 'type' value).
Re: WritableComparable
On Sun, Aug 14, 2011 at 10:25 PM, Joey Echeverria j...@cloudera.com wrote:

    What are the types of key1 and key2? What does the readFields() method look like?

The type of key1 is essentially a wrapper for java.util.UUID. Here is its readFields:

public void readFields(DataInput in) throws IOException {
    id = new UUID(in.readLong(), in.readLong());
}

So, it reconstitutes the UUID by deserializing two longs. The 'compareTo' method of this key type delegates to java.util.UUID.compareTo.

The type of key2 wraps a different id, one that fits into a long. In addition to an id, it also stores an enum which designates the source of this id. Here is its readFields:

public void readFields(DataInput in) throws IOException {
    source = Source.values()[in.readByte() & 0xFF];
    id = in.readLong();
}

The source is an enum value which is serialized by writing its ordinal. (There are only two possible enum values, hence only one byte.) The 'compareTo' method of this key type orders by the source values if the id values are different, otherwise by the id values.
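A side note, not from the exchange above: readFields() is only half of the contract, and the shuffle re-serializes and re-reads keys between map and reduce, so a write() that doesn't mirror readFields() exactly is a classic source of keys that stop comparing as expected. Here is a sketch of write() methods that would match the two readFields() implementations above; the field names and class layout are assumptions, not taken from the thread:

import java.io.DataOutput;
import java.io.IOException;

// Sketch only; each write() lives in its own key class.

// In the UUID-backed key (wrapping a java.util.UUID field named 'id'):
public void write(DataOutput out) throws IOException {
    out.writeLong(id.getMostSignificantBits());   // matches: new UUID(in.readLong(), in.readLong())
    out.writeLong(id.getLeastSignificantBits());
}

// In the long-backed key (wrapping a long 'id' and a Source enum 'source'):
public void write(DataOutput out) throws IOException {
    out.writeByte(source.ordinal());              // matches: Source.values()[in.readByte() & 0xFF]
    out.writeLong(id);                            // matches: id = in.readLong()
}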