Re: Loading Data to HDFS
Loading a petabyte from a single machine will take you about 4 months, assuming you can push 100MB/s (1 GigE) continuously for 24 hrs/day over those 4 months. Any interruptions and the 4 months will become 6 months. You might want to consider a more parallel solution instead of a single gateway machine.

On Tue, Oct 30, 2012 at 3:07 AM, sumit ghosh sumi...@yahoo.com wrote: Hi, I have data on a remote machine accessible over ssh. I have Hadoop CDH4 installed on RHEL. I am planning to load quite a few petabytes of data onto HDFS. Which will be the fastest method to use, and are there any projects around Hadoop which can be used as well? I cannot install Hadoop-Client on the remote machine. Have a great Day Ahead! Sumit.

--- Here I am attaching my previous discussion on CDH-user to avoid duplication. ---

On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com wrote: In addition to Jarcec's suggestions, you could use HttpFS. Then you'd only need to poke a single host:port in your firewall, as all the traffic goes through it. thx Alejandro

On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com wrote: Hi Sumit, there are plenty of ways to achieve that. Please find my feedback below:

Does Sqoop support loading flat files to HDFS? No, Sqoop supports only moving data from external database and warehouse systems. Copying files is not supported at the moment.

Can we use distcp? No. Distcp can be used only to copy data between HDFS filesystems.

How do we use the core-site.xml file on the remote machine to use copyFromLocal? Yes, you can install the Hadoop binaries on your machine (with no running Hadoop services) and use the hadoop binary to upload data. The installation procedure is described in the CDH4 installation guide [1] (follow the client installation).

Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]. Jarcec

Links: 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS

On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote: Hi, I have data on a remote machine accessible over ssh. What is the fastest way to load data onto HDFS? Does Sqoop support loading flat files to HDFS? Can we use distcp? How do we use the core-site.xml file on the remote machine to use copyFromLocal? Which will be the best to use, and are there any other open source projects around Hadoop which can be used as well? Have a great Day Ahead! Sumit
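For reference, a minimal sketch of the WebHDFS/HttpFS route mentioned above, using nothing but curl on the remote machine. Hostname, port and paths are placeholders, WebHDFS (or an HttpFS gateway) must be enabled on the cluster, and the two-step PUT follows the WebHDFS REST convention:

    NN=namenode.example.com:50070        # or an HttpFS gateway, e.g. gateway:14000
    for f in /data/export/*.dat; do
      # Step 1: ask the NameNode where to write; it answers with a 307 redirect.
      LOC=$(curl -s -i -X PUT "http://$NN/webhdfs/v1/ingest/$(basename "$f")?op=CREATE" \
            | awk -F': ' 'tolower($1)=="location" {print $2}' | tr -d '\r')
      # Step 2: stream the file to the returned location (a DataNode, or the gateway).
      curl -s -X PUT -T "$f" "$LOC"
    done

Running several such loops in parallel, one per source disk or source machine, is one way to get the more parallel transfer suggested above.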
Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)
What you are asking for (and much more sophisticated slicing/dicing of the cluster) is possible with MapR's distro. Please contact me offline if you are interested, or try it for yourself at www.mapr.com/download

On Mon, Sep 10, 2012 at 2:06 AM, Safdar Kureishy safdar.kurei...@gmail.com wrote: Hi, I need to run some benchmarking tests for a given mapreduce job on a *subset* of a 10-node Hadoop cluster. Not that it matters, but the current cluster settings allow for ~20 map slots and 10 reduce slots per node. Without loss of generality, let's say I want a job with these constraints below:
- to use only *5* out of the 10 nodes for running the mappers,
- to use only *5* out of the 10 nodes for running the reducers.
Is there any other way of achieving this through Hadoop property overrides at job-submission time? I understand that the Fair Scheduler can potentially be used to create pools with a proportionate # of mappers and reducers, to achieve a similar outcome, but the problem is that I still cannot tie such a pool to a fixed # of machines (right?). Essentially, regardless of the # of map/reduce tasks involved, I only want a *fixed # of machines* to handle the job. Any tips on how I can go about achieving this? Thanks, Safdar
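As a rough illustration of the Fair Scheduler workaround raised in the question: a pool can be capped at roughly five nodes' worth of concurrent slots, though, as noted, this limits concurrency rather than pinning the job to five specific machines. Assuming the Fair Scheduler is enabled, the allocations file ($HADOOP_CONF_DIR/fair-scheduler.xml; pool name and slot counts are assumptions for this cluster) might contain:

    <?xml version="1.0"?>
    <allocations>
      <pool name="benchmark">
        <maxMaps>100</maxMaps>      <!-- ~5 nodes x 20 map slots -->
        <maxReduces>50</maxReduces> <!-- ~5 nodes x 10 reduce slots -->
      </pool>
    </allocations>

The job would then be submitted with -Dmapred.fairscheduler.pool=benchmark so it lands in that pool.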
Re: Ideal file size
On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote: There are many factors to consider beyond just the size of the file. How long can you wait before you *have to* process the data? 5 minutes? 5 hours? 5 days? If you want good timeliness, you need to roll over faster. The longer you wait: 1. the lesser the load on the NN; 2. but the poorer the timeliness; 3. and the larger the chance of lost data (i.e., the data is not saved until the file is closed and rolled over, unless you want to sync() after every write).

To begin with I was going to use Flume and specify a rollover file size. I understand the above parameters; I just want to ensure that too many small files don't cause problems on the NameNode. For instance, there would be times when we get GBs of data in an hour and at times only a few hundred MB. From what Harsh, Edward and you've described, it doesn't cause issues with the NameNode but rather an increase in processing time if there are too many small files. Looks like I need to find that balance. It would also be interesting to see how others solve this problem when not using Flume.

They use NFS with MapR. Any and all log-rotators (like the one in log4j) simply just work over NFS, and MapR does not have a NN, so the problems with small files or number of files do not exist.

On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: We have a continuous flow of data into the sequence file. I am wondering what the ideal file size would be before the file gets rolled over. I know too many small files are not good, but could someone tell me what the ideal size would be such that it doesn't overload the NameNode.
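To get a feel for the balance being discussed, a quick back-of-envelope sketch (the ingest rate and rollover sizes below are made-up examples):

    rate_mb_per_hour=2000                 # a heavy hour: roughly 2 GB of incoming data
    for roll_mb in 64 128 256 512; do
      echo "rollover ${roll_mb} MB -> ~$((rate_mb_per_hour / roll_mb)) files/hour"
    done

Larger rollover sizes mean fewer files (and fewer NameNode objects) but worse timeliness and a larger window of unsaved data, which is exactly the trade-off described above.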
Re: Ideal file size
There are many factors to consider beyond just the size of the file. How long can you wait before you *have to* process the data? 5 minutes? 5 hours? 5 days? If you want good timeliness, you need to roll over faster. The longer you wait: 1. the lesser the load on the NN; 2. but the poorer the timeliness; 3. and the larger the chance of lost data (i.e., the data is not saved until the file is closed and rolled over, unless you want to sync() after every write).

On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: We have a continuous flow of data into the sequence file. I am wondering what the ideal file size would be before the file gets rolled over. I know too many small files are not good, but could someone tell me what the ideal size would be such that it doesn't overload the NameNode.
Re: Feedback on real world production experience with Flume
Karl, since you did ask for alternatives: people using MapR prefer to use the NFS access to directly deposit data (or access it). It works seamlessly from all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems without having to load any agents on those machines, and it is fully automatic HA. Since compression is built into MapR, the data gets compressed coming in over NFS automatically without much fuss. With regard to performance, you can get about 870 MB/s per node if you have 10GigE attached (of course, with compression, the effective throughput will surpass that based on how well the data can be squeezed).

On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote: I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically. I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully. I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases. The other option would be to write our own tools. What methods are you using for these kinds of tasks? Did you write your own, or does Flume (or something else) work for you? I'm also on the Flume mailing list, but I wanted to ask these questions here because I'm interested in Flume _and_ alternatives. Thank you!
Re: NameNode per-block memory usage?
Konstantin's paper http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf mentions that on average a file consumes about 600 bytes of memory in the name-node (1 file object + 2 block objects). To quote from his paper (see page 9) .. in order to store 100 million files (referencing 200 million blocks) a name-node should have at least 60GB of RAM. This matches observations on deployed clusters. On Tue, Jan 17, 2012 at 7:08 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, How much memory/JVM heap does NameNode use for each block? I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find a ton of concrete numbers, just these two: http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block? http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object? Thanks, Otis
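The arithmetic behind the quoted figure, as a quick check (using the ~600 bytes per file, i.e. one file object plus two block objects, from the paper):

    files=100000000          # 100 million files
    bytes_per_file=600
    echo "$files * $bytes_per_file / (1024^3)" | bc    # ~55 GiB, consistent with the ~60 GB cited

Note this covers only the namespace objects; real NameNode heap sizing also has to leave headroom for temporary objects, RPC handling and garbage collection.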
Re: HDFS Backup nodes
Suresh, As of today, there is no option except to use NFS. And as you yourself mention, the first HA prototype when it comes out will require NFS.

(a) I wasn't aware that BookKeeper had progressed that far. I wonder whether it would be able to keep up with the data rates that are required in order to hold the NN log without falling behind.

(b) I do know Karthik Ranga at FB just started a design to put the NN data in HDFS itself, but that is in very preliminary design stages with no real code there.

The problem is that the HA code written with NFS in mind is very different from the HA code written with HDFS in mind, which are both quite different from the code that is written with BookKeeper in mind. Essentially the three options will form three different implementations, since the failure modes of each of the back-ends are different. Am I totally off base? thanks, Srivas.

On Tue, Dec 13, 2011 at 11:00 AM, Suresh Srinivas sur...@hortonworks.com wrote: Srivas, As you may know already, NFS is just being used in the first prototype for HA. Two options for the editlog store are: 1. Using BookKeeper. Work has already completed on trunk towards this. This will replace the need for NFS to store the editlogs and is highly available. This solution will also be used for HA. 2. We also have a short-term goal to enable editlogs going to HDFS itself. The work is in progress. Regards, Suresh

-- Forwarded message -- From: M. C. Srivas mcsri...@gmail.com Date: Sun, Dec 11, 2011 at 10:47 PM Subject: Re: HDFS Backup nodes To: common-user@hadoop.apache.org

You are out of luck if you don't want to use NFS, and yet want redundancy for the NN. Even the new NN HA work being done by the community will require NFS ... and the NFS itself needs to be HA. But if you use a Netapp, then the likelihood of the Netapp crashing is lower than the likelihood of a garbage-collection-of-death happening in the NN. [ disclaimer: I don't work for Netapp, I work for MapR ]

On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote: Thanks Joey. We've had enough problems with nfs (mainly under very high load) that we thought it might be riskier to use it for the NN. randy

On 12/07/2011 06:46 PM, Joey Echeverria wrote: Hey Rand, It will mark that storage directory as failed and ignore it from then on. In order to do this correctly, you need a couple of options enabled on the NFS mount to make sure that it doesn't retry infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10 options set. -Joey

On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote: What happens then if the nfs server fails or isn't reachable? Does hdfs lock up? Does it gracefully ignore the nfs copy? Thanks, randy

- Original Message - From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org Sent: Wednesday, December 7, 2011 6:07:58 AM Subject: Re: HDFS Backup nodes

You should also configure the NameNode to use an NFS mount for one of its storage directories. That will give the most up-to-date backup of the metadata in case of total node failure. -Joey

On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com wrote: This means we are still relying on the Secondary NameNode ideology for the NameNode's backup. Is OS-mirroring of the NameNode a good alternative to keep it alive all the time? Thanks, Praveenesh

On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G mahesw...@huawei.com wrote: AFAIK the backup node was introduced from the 0.21 version onwards.

From: praveenesh kumar [praveen...@gmail.com] Sent: Wednesday, December 07, 2011 12:40 PM To: common-user@hadoop.apache.org Subject: HDFS Backup nodes

Does hadoop 0.20.205 support configuring HDFS backup nodes? Thanks, Praveenesh

-- Joseph Echeverria Cloudera, Inc. 443.305.9434
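For reference, a minimal sketch of the NFS arrangement described in this thread (filer name, export and mount point are placeholders; the mount options are the ones Joey quotes above):

    mount -t nfs -o tcp,soft,intr,timeo=10,retrans=10 filer:/vol/nn_meta /mnt/nn-nfs
    # hdfs-site.xml then lists the mount as an additional storage directory, e.g.
    #   dfs.name.dir = /data/1/dfs/nn,/mnt/nn-nfs

With soft, interruptible mounts, an unreachable filer causes the directory to be marked failed rather than hanging the NameNode, which is the behaviour discussed above.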
Re: HDFS Backup nodes
You are out of luck if you don't want to use NFS, and yet want redundancy for the NN. Even the new NN HA work being done by the community will require NFS ... and the NFS itself needs to be HA. But if you use a Netapp, then the likelihood of the Netapp crashing is lower than the likelihood of a garbage-collection-of-death happening in the NN. [ disclaimer: I don't work for Netapp, I work for MapR ]

On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote: Thanks Joey. We've had enough problems with nfs (mainly under very high load) that we thought it might be riskier to use it for the NN. randy

On 12/07/2011 06:46 PM, Joey Echeverria wrote: Hey Rand, It will mark that storage directory as failed and ignore it from then on. In order to do this correctly, you need a couple of options enabled on the NFS mount to make sure that it doesn't retry infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10 options set. -Joey

On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote: What happens then if the nfs server fails or isn't reachable? Does hdfs lock up? Does it gracefully ignore the nfs copy? Thanks, randy

- Original Message - From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org Sent: Wednesday, December 7, 2011 6:07:58 AM Subject: Re: HDFS Backup nodes

You should also configure the NameNode to use an NFS mount for one of its storage directories. That will give the most up-to-date backup of the metadata in case of total node failure. -Joey

On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com wrote: This means we are still relying on the Secondary NameNode ideology for the NameNode's backup. Is OS-mirroring of the NameNode a good alternative to keep it alive all the time? Thanks, Praveenesh

On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G mahesw...@huawei.com wrote: AFAIK the backup node was introduced from the 0.21 version onwards.

From: praveenesh kumar [praveen...@gmail.com] Sent: Wednesday, December 07, 2011 12:40 PM To: common-user@hadoop.apache.org Subject: HDFS Backup nodes

Does hadoop 0.20.205 support configuring HDFS backup nodes? Thanks, Praveenesh

-- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: Any related paper on how to resolve hadoop SPOF issue?
Are there any performance benchmarks available for Ceph? (with Hadoop, without, both?) On Thu, Aug 25, 2011 at 11:44 AM, Alex Nelson ajnel...@cs.ucsc.edu wrote: Hi George, UC Santa Cruz contributed a ;login: article describing replacing HDFS with Ceph. (I was one of the authors.) One of the key architectural advantages of Ceph over HDFS is that Ceph distributes its metadata service over multiple metadata servers. I hope that helps. --Alex On Aug 25, 2011, at 02:54 , George Kousiouris wrote: Hi, many thanks!! I also found this one, which has some related work also with other efforts, in case it is helpful for someone else: https://ritdml.rit.edu/bitstream/handle/1850/13321/ATalwalkarThesis1-2011.pdf?sequence=1 BR, George On 8/25/2011 12:46 PM, Nan Zhu wrote: Hope it helps http://www.springerlink.com/content/h17r882710314147/ Best, Nan On Thu, Aug 25, 2011 at 5:43 PM, George Kousiouris gkous...@mail.ntua.grwrote: Hi guys, We are currently in the process of writing a paper regarding hadoop and we would like to reference any attempt to remove the single point of failure of the Namenode. We have found in various presentations some efforts (like dividing the namespace between more than one namenodes if i remember correctly) but the search for a concrete paper on the issue came to nothing. Is anyone aware or has participated in such an effort? Thanks, George -- --- George Kousiouris Electrical and Computer Engineer Division of Communications, Electronics and Information Engineering School of Electrical and Computer Engineering Tel: +30 210 772 2546 Mobile: +30 6939354121 Fax: +30 210 772 2569 Email: gkous...@mail.ntua.gr Site: http://users.ntua.gr/gkousiou/ National Technical University of Athens 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece -- --- George Kousiouris Electrical and Computer Engineer Division of Communications, Electronics and Information Engineering School of Electrical and Computer Engineering Tel: +30 210 772 2546 Mobile: +30 6939354121 Fax: +30 210 772 2569 Email: gkous...@mail.ntua.gr Site: http://users.ntua.gr/gkousiou/ National Technical University of Athens 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
Re: question regarding MapR
MapR's entire distro is free, and consists mostly of open-source components. The only closed-source component is MapR's file-system, which is a complete drop-in replacement for HDFS. If you wish, we can discuss further about your concerns re: MapR outside this mailing list. thanks, Srivas. On Sun, Oct 9, 2011 at 4:12 AM, George Kousiouris gkous...@mail.ntua.grwrote: Hi all, We were looking how to overcome the random write/concurrent write issue with hdfs. We came across the MapR platform that claims to resolve this, along with the distribued Namenode. However we are a bit concerned that it seems that this is not open source. Does anyone have any info on which parts of MapR are indeed open source? Or any alternative solution? BR, George -- --- George Kousiouris Electrical and Computer Engineer Division of Communications, Electronics and Information Engineering School of Electrical and Computer Engineering Tel: +30 210 772 2546 Mobile: +30 6939354121 Fax: +30 210 772 2569 Email: gkous...@mail.ntua.gr Site: http://users.ntua.gr/gkousiou/ National Technical University of Athens 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
Re: YCSB Benchmarking for HBase
Lohit did some work on making YCSB run on a bunch of machines in a coordinated manner, plus fixed some limits on how many zk connections/threads can run inside one process. See http://github.com/lohitvijayarenu/YCSB. I believe that code also has a data-verification option to ensure that a *get* reads what was written and not something else.

On Wed, Aug 3, 2011 at 3:10 AM, praveenesh kumar praveen...@gmail.com wrote: Hi, is anyone working on YCSB (Yahoo Cloud Service Benchmarking) for HBase? I am trying to run it, and it's giving me an error:

    $ java -cp build/ycsb.jar com.yahoo.ycsb.CommandLine -db com.yahoo.ycsb.db.HBaseClient
    YCSB Command Line client
    Type "help" for command line help
    Start with -help for usage info
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2406)
        at java.lang.Class.getConstructor0(Class.java:2716)
        at java.lang.Class.newInstance0(Class.java:343)
        at java.lang.Class.newInstance(Class.java:325)
        at com.yahoo.ycsb.CommandLine.main(Unknown Source)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
        ... 6 more

By the error, it seems like it's not able to find the hadoop-core jar file, but it's already in the class path. Has anyone worked on YCSB with HBase? Thanks, Praveenesh
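A hedged sketch of the usual fix for that ClassNotFoundException: put the Hadoop (and HBase) jars on the classpath next to ycsb.jar. Exact jar names and locations vary by version and install; on most distributions `hadoop classpath` prints the right set, and the JVM expands the `*` classpath wildcards itself:

    java -cp "build/ycsb.jar:$HBASE_HOME/*:$HBASE_HOME/lib/*:$HBASE_HOME/conf:$(hadoop classpath)" \
      com.yahoo.ycsb.CommandLine -db com.yahoo.ycsb.db.HBaseClient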
Re: The best architecture for EC2/Hadoop interface?
Avoid Swing. Try GWT, it's quite nice: http://code.google.com/webtoolkit/. You get the Java debuggability and a browser-based GUI.

On Tue, Aug 2, 2011 at 3:35 AM, Steve Loughran ste...@apache.org wrote:

On 02/08/11 05:09, Mark Kerzner wrote: Hi, I want to give my users a GUI that would allow them to start Hadoop clusters and run applications that I will provide on the AMIs. What would be a good approach to make it simple for the user? Should I write a Java Swing app that will wrap around the EC2 commands? Should I use some more direct EC2 API? Or should I use a web browser interface? My idea was to give the user a Java Swing GUI, so that he gives his Amazon credentials to it, and it would be secure because the application is not exposed to the outside. Does this approach make sense?

1. I'm not sure that a Java Swing GUI makes sense for anything anymore, if it ever did.

2. Have a look at what other people have done first before writing your own. Amazon provides something for their derivative of Hadoop, Elastic MR; I suspect KarmaSphere and others may provide UIs in front of it too.

The other thing is that most big jobs are more than one operation, so you are in a workflow world. Things like Cascading, Pig and Oozie help here, and if you can bring them up in-cluster, you can get a web UI.
Re: Which release to use?
Mike, just a minor inaccuracy in your email. Here's setting the record straight:
1. MapR directly sells their distribution of Hadoop. Support is from MapR.
2. EMC also sells the MapR distribution, for use on any hardware. Support is from EMC worldwide.
3. EMC also sells a Hadoop appliance, which has the MapR distribution specially built for it. Support is from EMC.
4. MapR also has a free, unlimited, unrestricted version called M3, which has the same 2-5x performance, management and stability improvements, and includes NFS. It is not crippleware, and the unlimited, unrestricted, free use does not expire on any date.
Hope that clarifies what MapR is doing. thanks and regards, Srivas.

On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel michael_se...@hotmail.com wrote: EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future; however, a Greenplum appliance may not get the customer traction that their reselling of MapR will generate. It sounds like they are hedging their bets and taking an 'IBM' approach.

Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org

Steve, I read your blog - nice post. I believe EMC is selling the Greenplum solution as an appliance. Cheers - Jeffery

-Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use?

On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer-driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see the discussion on hadoop-0.23 with a raft of newer features such as HDFS Federation, NextGen MapReduce and plans for an HA NameNode, etc. As with most successful projects there are several options for commercial support of Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users, and moving it forward at a rapid rate.

Arun makes a good point, which is that the Apache project depends on contributions from the community to thrive. That includes:
- bug reports
- patches to fix problems
- more tests
- documentation improvements: more examples, more on getting started, troubleshooting, etc.
If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale; yes, other people can test large clusters too, but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release-candidate releases in your infrastructure helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working.
Re: large memory tasks
I like this, Shi! Very clever!

On Wed, Jun 15, 2011 at 4:36 PM, Shi Yu sh...@uchicago.edu wrote: Suppose you are looking up a value V of a key K, and V is required for an upcoming process. Suppose the data in the upcoming process has the form R1 K1 K2 K3, where R1 is the record number and K1 to K3 are the keys occurring in the record, which means in the look-up case you would query for V1, V2, V3. Using an inner join you could attach all the V values for a single record and prepare the data like R1 K1 K2 K3 V1 V2 V3; then each record has the complete information for the next process. So you pay the storage for the efficiency. Even taking into account the time required for preparing the data, it is still faster than the look-up approach. I have also tried TokyoCabinet; you need to compile and install some extensions to get it working. Sometimes getting things and APIs to work can be painful. If you don't need to update the lookup table, installing TC, Memcache, or MongoDB locally on each node would be the most efficient solution because all the look-ups are local.

On 6/15/2011 5:56 PM, Ian Upright wrote: If the data set doesn't fit in working memory, but is still of a reasonable size (let's say a few hundred gigabytes), then I'd probably use something like this: http://fallabs.com/tokyocabinet/ From reading the Hadoop docs (which I'm very new to), I might use DistributedCache to replicate that database around. My impression would be that this might be among the most efficient things one could do. However, for my particular application, even using TokyoCabinet introduces too much inefficiency, and plain old memory-based lookups are by far the most efficient (not to mention that some of the lookups I'm doing are specialized trees that can't be done with TokyoCabinet or any typical db, but that's beside the point). I'm having trouble understanding your more efficient method of using more data and HDFS, and having trouble understanding how it could possibly be any more efficient than, say, the above approach. How does increasing the size minimize the lookups? Ian

I had the same problem before: a big lookup table too large to load in memory. I tried and compared the following approaches: an in-memory MySQL DB, a dedicated central memcache server, a dedicated central MongoDB server, and a local DB model (each node has its own MongoDB server). The local DB model is the most efficient one. I believe the dedicated server approach could be improved if the number of servers is increased and distributed; I only tried a single server. But later I dropped the lookup table approach. Instead, I attached the table information in the HDFS data (which could be considered an inner-join DB process), which significantly increases the size of the data sets but avoids the bottleneck of table look-up. There is a trade-off: with no table look-ups, the data to process is intensive (TB size), whereas a look-up table could save 90% of the data storage. According to our experiments on a 30-node cluster, attaching information in HDFS is even 20% faster than the local DB model. When attaching information in HDFS, it is also easier to ping-pong the Map/Reduce configuration to further improve the efficiency. Shi

On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote: Is the lookup table constant across each of the tasks? You could try putting it into memcached: http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf Matt

-Original Message- From: Ian Upright [mailto:i...@upright.net] Sent: Wednesday, June 15, 2011 3:42 PM To: common-user@hadoop.apache.org Subject: large memory tasks

Hello, I'm quite new to Hadoop, so I'd like to get an understanding of something. Let's say I have a task that requires 16 GB of memory in order to execute; let's say hypothetically it's some sort of big lookup table that needs that kind of memory. I could have 8 cores run the task in parallel (multithreaded), and all 8 cores can share that 16 GB lookup table. On another machine, I could have 4 cores run the same task, and they still share that same 16 GB lookup table. Now, with my understanding of Hadoop, each task has its own memory. So if I have 4 tasks that run on one machine, and 8 tasks on another, then the 4 tasks need a 64 GB machine, and the 8 tasks need a 128 GB machine. But really, let's say I only have two machines, one with 4 cores and one with 8, each machine only having 24 GB. How can the work be evenly distributed among these machines? Am I missing something? What other ways can this be configured such that this works properly? Thanks, Ian
Re: Hadoop and WikiLeaks
Interesting to note that Cassandra and ZK are now considered Hadoop projects. They were independent of Hadoop before the recent update.

On Thu, May 19, 2011 at 4:18 AM, Steve Loughran ste...@apache.org wrote:

On 18/05/11 18:05, javam...@cox.net wrote: Yes! -Pete

Edward Capriolo edlinuxg...@gmail.com wrote: = http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards. The Hadoop project won the innovator of the year award from the UK's Guardian newspaper, where it was described as having had the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad. Does this copy text bother anyone else? Sure, winning any award is great, but does hadoop want to be associated with innovation like WikiLeaks?

Ian updated the page yesterday with changes I'd put in for trademarks, and I added this news quote directly from the paper. We could strip out the quote easily enough.
Re: Cluster hard drive ratios
Hey Matt, we are using the same Dell boxes, and we can get 2 GB/s per node (read and write) without problems.

On Wed, May 4, 2011 at 8:43 AM, Matt Goeke msg...@gmail.com wrote: I have been reviewing quite a few presentations on the web from various businesses, in addition to the ones I watched first hand at the Cloudera data summit last week, and I am curious as to others' thoughts around hard drive ratios. Various sources including Cloudera have cited 1 HDD x 2 cores x 4 GB ECC, but this makes me wonder what the upper bound for HDDs is in this ratio. We have specced out various machines from Dell and it is possible to get dual hexacores with 14 drives (2 raided for OS and 12x2TB), but this seems to conflict with that original ratio and some of the specs I have witnessed in presentations (which are mostly 4-drive configurations). I would assume all you incur is additional complexity and more potential for hardware failure on a specific machine, but I have seen little to no data stating at what point there is a plateau in write speed performance. Can anyone give personal experience around this type of setup? If we accept that we are incurring the negatives I stated above but we gain higher data density in the cluster, then is this setup fine or are we overlooking something? Thanks, Matt
Re: Memory mapped resources
Sorry, I don't mean to say you don't know mmap or didn't do cool things in the past. But you will see why anyone would've interpreted this original post, given the title of the posting and the following wording, to mean "can I mmap files that are in hdfs":

On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote: We have some very large files that we access via memory mapping in Java. Someone's asked us about how to make this conveniently deployable in Hadoop. If we tell them to put the files into hdfs, can we obtain a File for the underlying file on any given node?
Re: Memory mapped resources
I am not sure if you realize it, but HDFS is not VM integrated. What you are asking for is support *inside* the Linux kernel for HDFS file systems. I don't see that happening for the next few years, and probably never at all. (HDFS is all Java today, and Java certainly is not going to go inside the kernel.)

The ways to get there are:
a) use the hdfs-fuse proxy
b) do this by hand - copy the file onto each individual machine's local disk, and then mmap the local path
c) more or less do the same as (b), using a thing called the Distributed Cache in Hadoop, and then mmap the local path
d) don't use HDFS, and instead use something else for this purpose

On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies bimargul...@gmail.com wrote: Here's the OP again. I want to make it clear that my question here has to do with the problem of distributing 'the program' around the cluster, not 'the data'. In the case at hand, the issue is a system that has a large data resource that it needs to do its work. Every instance of the code needs the entire model, not just some blocks or pieces. Memory mapping is a very attractive tactic for this kind of data resource. The data is read-only. Memory-mapping it allows the operating system to ensure that only one copy of the thing ends up in physical memory. If we force the model into a conventional file (storable in HDFS) and read it into the JVM in a conventional way, then we get as many copies in memory as we have JVMs. On a big machine with a lot of cores, this begins to add up. For people who are running a cluster of relatively conventional systems, just putting copies on all the nodes in a conventional place is adequate.
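A small sketch of options (b)/(c) above, with placeholder paths: materialize the read-only resource on each node's local disk once, and let every JVM on that node map the same local file so the kernel keeps a single physical copy in the page cache.

    hadoop fs -copyToLocal /models/big-model.bin /data/local/models/big-model.bin
    # Tasks then mmap /data/local/models/big-model.bin (e.g. via FileChannel.map in Java)
    # instead of reading the model through HDFS into every JVM's heap.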
Re: decommissioning node woes
All trunking/bonding at the switch (eg, LACP) gives only 1 NIC's worth of bandwidth point-to-point, even if your boxes all have multiple NICs. It chooses a NIC at connection initiation (via round-robin, or load, or whatever). But once the TCP connection is established, there is no load-balancing -- On Sat, Mar 19, 2011 at 7:11 PM, Michael Segel michael_se...@hotmail.comwrote: Usually the port bonding is done at a lower level so that you and your applications see this as a single port. So you don't have to worry about load balancing between the ports. (Or am I missing something?) thx -Mike From: tdunn...@maprtech.com Date: Sat, 19 Mar 2011 09:00:30 -0700 Subject: Re: decommissioning node woes To: common-user@hadoop.apache.org CC: michael_se...@hotmail.com Unfortunately this doesn't help much because it is hard to get the ports to balance the load. On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.comwrote: With a 1GBe port, you could go 100Mbs for the bandwidth limit. If you bond your ports, you could go higher.
Re: Is this a fair summary of HDFS failover?
The summary is quite inaccurate. On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, is it accurate to say that - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to recreate the HDFS if the Primary NameNode fails, but with the delay of minutes if not hours, and there is also some data loss; The Secondary NN is not a spare. It is used to augment the work of the Primary, by offloading some of its work to another machine. The work offloaded is log rollup or checkpointing. This has been a source of constant confusion (some named it incorrectly as a secondary and now we are stuck with it). The Secondary NN certainly cannot take over for the Primary. It is not its purpose. Yes, there is data loss. - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which replaces the Secondary NameNode. The Backup Node can be used as a warm spare, with the failover being a matter of seconds. There can be multiple Backup Nodes, for additional insurance against failure, and previous best common practices apply to it; There is no Backup NN in the manner you are thinking of. It is completely manual, and requires restart of the whole world, and takes about 2-3 hours to happen. If you are lucky, you may have only a little data loss (people have lost entire clusters due to this -- from what I understand, you are far better off resurrecting the Primary instead of trying to bring up a Backup NN). In any case, when you run it like you mention above, you will have to (a) make sure that the primary is dead (b) edit hdfs-site.xml on *every* datanode to point to the new IP address of the backup, and restart each datanode. (c) wait for 2-3 hours for all the block-reports from every restarted DN to finish 2-3 hrs afterwards: (d) after that, restart all TT and the JT to connect to the new NN (e) finally, restart all the clients (eg, HBase, Oozie, etc) Many companies, including Yahoo! and Facebook, use a couple of NetApp filers to hold the actual data that the NN writes. The two NetApp filers are run in HA mode with NVRAM copying. But the NN remains a single point of failure, and there is probably some data loss. - 0.22 will have further improvements to the HDFS performance, such as HDFS-1093. Does the paper on HDFS Reliability by Tom White http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf still represent the current state of things? See Dhruba's blog-post about the Avatar NN + some custom stackable HDFS code on all the clients + Zookeeper + the dual NetApp filers. It helps Facebook do manual, controlled, fail-over during software upgrades, at the cost of some performance loss on the DataNodes (the DataNodes have to do 2x block reports, and each block-report is expensive, so it limits the DataNode a bit). The article does not talk about dataloss when the fail-over is initiated manually, so I don't know about that. http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html Thank you. Sincerely, Mark
Re: Is this a fair summary of HDFS failover?
I understand you are writing a book Hadoop in Practice. If so, its important that what's recommended in the book should be verified in practice. (I mean, beyond simply posting in this newsgroup - for instance, the recommendations on NN fail-over should be tried out first before writing about how to do it). Otherwise you won't know your recommendations really work or not. On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner markkerz...@gmail.comwrote: Thank you, M. C. Srivas, that was enormously useful. I understand it now, but just to be complete, I have re-formulated my points according to your comments: - In 0.20 the Secondary NameNode performs snapshotting. Its data can be used to recreate the HDFS if the Primary NameNode fails. The procedure is manual and may take hours, and there is also data loss since the last snapshot; - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with HA and act as a cold spare. The data loss is less than with Secondary NN, but it is still manual and potentially error-prone, and it takes hours; - There is an AvatarNode patch available for 0.20, and Facebook runs its cluster that way, but the patch submitted to Apache requires testing and the developers adopting it must do some custom configurations and also exercise care in their work. As a conclusion, when building an HA HDFS cluster, one needs to follow the best practices outlined by Tom White http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf, and may still need to resort to specialized NSF filers for running the NameNode. Sincerely, Mark On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas mcsri...@gmail.com wrote: The summary is quite inaccurate. On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, is it accurate to say that - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to recreate the HDFS if the Primary NameNode fails, but with the delay of minutes if not hours, and there is also some data loss; The Secondary NN is not a spare. It is used to augment the work of the Primary, by offloading some of its work to another machine. The work offloaded is log rollup or checkpointing. This has been a source of constant confusion (some named it incorrectly as a secondary and now we are stuck with it). The Secondary NN certainly cannot take over for the Primary. It is not its purpose. Yes, there is data loss. - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which replaces the Secondary NameNode. The Backup Node can be used as a warm spare, with the failover being a matter of seconds. There can be multiple Backup Nodes, for additional insurance against failure, and previous best common practices apply to it; There is no Backup NN in the manner you are thinking of. It is completely manual, and requires restart of the whole world, and takes about 2-3 hours to happen. If you are lucky, you may have only a little data loss (people have lost entire clusters due to this -- from what I understand, you are far better off resurrecting the Primary instead of trying to bring up a Backup NN). In any case, when you run it like you mention above, you will have to (a) make sure that the primary is dead (b) edit hdfs-site.xml on *every* datanode to point to the new IP address of the backup, and restart each datanode. 
(c) wait for 2-3 hours for all the block-reports from every restarted DN to finish 2-3 hrs afterwards: (d) after that, restart all TT and the JT to connect to the new NN (e) finally, restart all the clients (eg, HBase, Oozie, etc) Many companies, including Yahoo! and Facebook, use a couple of NetApp filers to hold the actual data that the NN writes. The two NetApp filers are run in HA mode with NVRAM copying. But the NN remains a single point of failure, and there is probably some data loss. - 0.22 will have further improvements to the HDFS performance, such as HDFS-1093. Does the paper on HDFS Reliability by Tom White http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf still represent the current state of things? See Dhruba's blog-post about the Avatar NN + some custom stackable HDFS code on all the clients + Zookeeper + the dual NetApp filers. It helps Facebook do manual, controlled, fail-over during software upgrades, at the cost of some performance loss on the DataNodes (the DataNodes have to do 2x block reports, and each block-report is expensive, so it limits the DataNode a bit). The article does not talk about dataloss when the fail-over is initiated manually, so I don't know about that. http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html Thank you. Sincerely, Mark
Re: the performance of HDFS
On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng zhengda1...@gmail.com wrote: Hello, I am trying to measure the performance of HDFS, but the writing rate is quite low. When the replication factor is 1, the rate of writing to HDFS is about 60MB/s. When the replication factor is 3, the rate drops significantly to about 15MB/s. Even though the actual rate of writing data to the disk is about 45MB/s, it's still much lower than when the replication factor is 1. The link between two nodes in the cluster is 1Gbps. The CPU is a Dual-Core AMD Opteron(tm) Processor 2212, so CPU isn't the bottleneck either. I thought I should be able to saturate the disk very easily. I wonder where the bottleneck is. What is the throughput for writing on a Hadoop cluster when the replication factor is 3?

The numbers above seem correct as per my observations. If your data is 3-way replicated, the datanodes write about 3x the actual data, so your write rate will be limited to roughly 1/3 of how fast the disk can write, minus some overhead for replication. The aggregate write rate can get much higher if you use more drives, but a single stream's throughput is limited to the speed of one disk spindle.

Thanks, Da
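A quick sanity check of those numbers (all figures are the poster's observations, not a benchmark):

    disk_write=45    # MB/s of physical write activity observed on the node
    repl=3
    echo "logical ingest ~ $((disk_write / repl)) MB/s"   # ~15 MB/s, matching the report

With replication 3, each logical megabyte written causes roughly three megabytes of physical writes across the pipeline, so single-stream throughput tracks about a third of one spindle's write speed.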
Re: Approached to combing the output of reducers
Not with HDFS, since only one process may write to a single file (and it's not visible until the file is closed). In fact, it's worse than that: the same process that's writing that file cannot see it or read it until after it's done. If you have multiple reducers, you are simply out of luck and will have to run a separate job to copy the data out.

On Sat, Oct 23, 2010 at 3:08 PM, Steve Lewis lordjoe2...@gmail.com wrote: Once I run a map-reduce job I get output in the form of part-r-0 part-r-1 ... In many cases the output is significantly smaller than the original input - take the classic word count. In most cases I want to combine the output into a single file that may well not live on HDFS but on a more accessible file system. Are there standard libraries or approaches for consolidating reducer output? A second Map-Reduce job taking the output directory as an input is an OK start, but as output there needs to be a single reducer that writes a real file and not reduce output. Are there standard libraries or approaches to this? -- Steven M. Lewis PhD 4221 105th Ave Ne Kirkland, WA 98033 206-384-1340 (cell) Institute for Systems Biology Seattle WA
Re: Approached to combing the output of reducers
On Sat, Oct 23, 2010 at 4:19 PM, Steve Lewis lordjoe2...@gmail.com wrote: I am assuming the first job outputs multiple files and that the second (and I presume a map-reduce job) will assign the output intended for a single file to a single reducer (in some cases multiple output files might be supported - one per reducer). One issue is how to allow the reducer to write to some 'external file system', i.e. not hdfs or the instance's local file system, but S3 on Amazon or some mounted nfs system on a stand-alone cluster.

    bin/hadoop jar jarname input-dir output-dir

Thus

    bin/hadoop jar jarname hdfs://... file:///my/nfs/mounted/dir/...

will work, if you nfs-mount your destination dir on all the nodes in the cluster.

On Oct 23, 2010, at 3:32 PM, M. C. Srivas mcsri...@gmail.com wrote: Not with HDFS, since only one process may write to a single file (and it's not visible until the file is closed). In fact, it's worse than that: the same process that's writing that file cannot see it or read it until after it's done. If you have multiple reducers, you are simply out of luck and will have to run a separate job to copy the data out.

On Sat, Oct 23, 2010 at 3:08 PM, Steve Lewis lordjoe2...@gmail.com wrote: Once I run a map-reduce job I get output in the form of part-r-0 part-r-1 ... In many cases the output is significantly smaller than the original input - take the classic word count. In most cases I want to combine the output into a single file that may well not live on HDFS but on a more accessible file system. Are there standard libraries or approaches for consolidating reducer output? A second Map-Reduce job taking the output directory as an input is an OK start, but as output there needs to be a single reducer that writes a real file and not reduce output. Are there standard libraries or approaches to this? -- Steven M. Lewis PhD 4221 105th Ave Ne Kirkland, WA 98033 206-384-1340 (cell) Institute for Systems Biology Seattle WA

-- Steven M. Lewis PhD 4221 105th Ave Ne Kirkland, WA 98033 206-384-1340 (cell) Institute for Systems Biology Seattle WA
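One more option, not raised in the thread but worth noting as a suggestion: if the goal is simply a single flat file outside HDFS, getmerge concatenates all the part-r-* files of an output directory into one local (or NFS-mounted) destination without running a second job (paths below are placeholders):

    hadoop fs -getmerge /user/steve/wordcount-output /my/nfs/mounted/dir/wordcount.txt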
Re: BUG: Anyone use block size more than 2GB before?
I thought the petasort benchmark you published used 12.5G block sizes. How did you make that work? On Mon, Oct 18, 2010 at 4:27 PM, Owen O'Malley omal...@apache.org wrote: Block sizes larger than 2**31 are known to not work. I haven't ever tracked down the problem, just set my block size to be smaller than that. -- Owen