Re: CDH2 or Apache Hadoop
I've had a great experience with CDH2 on various platforms (Ubuntu, OpenSolaris). It worked as advertised. My 2 cents.

On Tue, Feb 23, 2010 at 3:13 PM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:
> Just wanted to get the group's general feeling on what the preferred distro is, and why? Obviously assuming one didn't have a service agreement with Cloudera.
>
> Ananth T Sarathy

-- Dali Kilani
Re: Multiple disks for DFS
Just specify multiple directories (where the different local partitions are mounted) for dfs.data.dir (HDFS data) in hdfs-site.xml and for mapred.local.dir (intermediate data) in mapred-site.xml. Data should then be striped across the different partitions/disks. See here: http://bit.ly/fbUkr

Dali

On Mon, Sep 14, 2009 at 11:50 AM, Stas Oskin stas.os...@gmail.com wrote:
> Hi.
>
> Thanks for the explanation. Any idea whether I can re-use this round-robin mechanism for local disk writing, or is it DFS only?
>
> Regards.

2009/9/14 Jason Venner jason.had...@gmail.com:
> When you have multiple partitions specified for HDFS storage, they are used for block storage in round-robin fashion. If a partition has insufficient space, it is dropped from the set used for storing new blocks.

On Sun, Sep 13, 2009 at 3:01 AM, Stas Oskin stas.os...@gmail.com wrote:
> Hi.
>
> When I specify multiple disks for DFS, does Hadoop distribute concurrent writes over the multiple disks? I mean, to prevent over-utilization of a single disk?
>
> Thanks for any info on the subject.

-- Dali Kilani
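To make the answer above concrete, here is a minimal configuration sketch; the mount-point paths (/mnt/disk1 and so on) are hypothetical placeholders, not values from the thread.

In hdfs-site.xml:

  <property>
    <name>dfs.data.dir</name>
    <!-- Comma-separated list; the DataNode stores new blocks across
         these directories in round-robin fashion. Paths are placeholders. -->
    <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data</value>
  </property>

In mapred-site.xml:

  <property>
    <name>mapred.local.dir</name>
    <!-- Intermediate map output is spread across these local directories. -->
    <value>/mnt/disk1/mapred/local,/mnt/disk2/mapred/local,/mnt/disk3/mapred/local</value>
  </property>

Each directory should sit on a different physical disk or partition; pointing several entries at the same disk gains nothing.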
Re: Ubuntu/Hadoop incompatibilities?
Can you double-check that your data node doesn't have the same /etc/hosts issue mentioned above in the thread? (i.e. the machine name resolves to 127.0.0.1)

Dali

On Mon, Aug 17, 2009 at 1:33 PM, CubicDesign cubicdes...@gmail.com wrote:
> Thank you all for your answers.
>
> My problem with Hadoop on Ubuntu is that I cannot make the DataNode server work properly (at least this is where I think the error is). I get a "File jobtracker.info could only be replicated to 0 nodes, instead of 1" error message. All the other servers are running fine. I am running Hadoop on a single (test) machine. The results for jps and netstat are:
>
> jps
> 4465 NameNode
> 4553 DataNode
> 5105 Jps
> 4717 JobTracker
> 4649 SecondaryNameNode
> 4807 TaskTracker
>
> sudo netstat -plten | grep java
> tcp  0  0  0.0.0.0:50722    0.0.0.0:*  LISTEN  1000  13858  4553/java
> tcp  0  0  0.0.0.0:50020    0.0.0.0:*  LISTEN  1000  15130  4553/java
> tcp  0  0  127.0.0.1:54310  0.0.0.0:*  LISTEN  1000  13564  4465/java
> tcp  0  0  127.0.0.1:54311  0.0.0.0:*  LISTEN  1000  14571  4717/java
> tcp  0  0  0.0.0.0:59080    0.0.0.0:*  LISTEN  1000  14547  4717/java
> tcp  0  0  0.0.0.0:50090    0.0.0.0:*  LISTEN  1000  14943  4649/java
> tcp  0  0  127.0.0.1:40555  0.0.0.0:*  LISTEN  1000  15057  4807/java
> tcp  0  0  0.0.0.0:50060    0.0.0.0:*  LISTEN  1000  15031  4807/java
> tcp  0  0  0.0.0.0:47661    0.0.0.0:*  LISTEN  1000  14247  4649/java
> tcp  0  0  0.0.0.0:50030    0.0.0.0:*  LISTEN  1000  14941  4717/java
> tcp  0  0  0.0.0.0:57839    0.0.0.0:*  LISTEN  1000  13514  4465/java
> tcp  0  0  0.0.0.0:50070    0.0.0.0:*  LISTEN  1000  14533  4465/java
> tcp  0  0  0.0.0.0:50010    0.0.0.0:*  LISTEN  1000  14765  4553/java
> tcp  0  0  0.0.0.0:50075    0.0.0.0:*  LISTEN  1000  14946  4553/java

-- Dali Kilani
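Note that in the netstat output above, the NameNode (port 54310) and JobTracker (port 54311) are listening only on 127.0.0.1, which is consistent with the /etc/hosts problem Dali points to. A minimal sketch of the usual fix on Ubuntu follows; the hostname hadoop-node and the address 192.168.1.10 are hypothetical placeholders, not values from this thread:

  # /etc/hosts -- typical Ubuntu default that can trigger the problem:
  # the machine's own hostname resolves to a loopback address.
  127.0.0.1     localhost
  127.0.1.1     hadoop-node

  # Fixed version: map the hostname to the machine's real IP instead
  # (hadoop-node and 192.168.1.10 are placeholders).
  127.0.0.1     localhost
  192.168.1.10  hadoop-node

After editing /etc/hosts, restart the Hadoop daemons so the NameNode binds to an address the DataNode can actually reach.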
Re: Why is Spilled Records always equal to Map output records
If I am not mistaken (I am new to this stuff), that's because you need a checkpoint from which the reduce tasks that consume those spilled records can be restarted in case of a reduce task failure.

Dali

On Mon, Jul 13, 2009 at 6:32 PM, Mu Qiao qiao...@gmail.com wrote:
> Thank you. But why do map outputs need to be written to disk at least once? I think my io.sort.mb is large enough for in-memory operations. Could you provide me some information about it?

On Tue, Jul 14, 2009 at 1:27 AM, Owen O'Malley omal...@apache.org wrote:
> On Jul 12, 2009, at 3:55 AM, Mu Qiao wrote:
>> I noticed it from the web console after running several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records.
>
> This is good. The map outputs need to be written to disk at least once, so if the two counters are equal, everything is fitting in memory. If multiple passes are needed, you'll see 2x or more spilled records.
>
>> In the reduce phase, there are also spilled records, equal in number to reduce input records.
>
> This is reasonable, although 0.19 and 0.20 don't need to spill the records in the reduce at all, if you make the buffer big enough.
>
> -- Owen
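To make Owen's point about buffer sizes concrete, these are the two knobs involved on 0.19/0.20; the values shown are illustrative assumptions, not recommendations from the thread.

In mapred-site.xml (or set per job):

  <property>
    <name>io.sort.mb</name>
    <!-- Map-side sort buffer, in MB. Even when everything fits here,
         map output is still spilled to disk at least once, which is
         why Spilled Records matches Map output records in the best case. -->
    <value>200</value>
  </property>

  <property>
    <name>mapred.job.reduce.input.buffer.percent</name>
    <!-- Fraction of the reduce task's heap allowed to retain map
         outputs during the reduce. The default is 0.0 (always spill);
         raising it is the "big enough buffer" Owen refers to. -->
    <value>0.7</value>
  </property>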