Re: datanode tuning
Thanks Ravi. The number of nodes isn't large, but the total size is: each data node holds about 14-16 TB (560-640 TB overall). For the datanode block scanner, how can I increase its "Current scan rate limit KBps"? On Sun, Oct 6, 2013 at 11:09 PM, Ravi Prakash ravi...@ymail.com wrote: Please look at dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval. 40 datanodes is not a large cluster IMHO, and the Namenode is capable of managing 100 times more datanodes. From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org Sent: Sunday, October 6, 2013 9:49 AM Subject: datanode tuning I would like my 40 data nodes to aggressively report to the namenode whether they are alive, so I think I need to change these params: dfs.block.access.token.lifetime: default is 600 seconds. Can I decrease this to 60? dfs.block.access.key.update.interval: default is 600 seconds. Can I decrease this to 60? Also, what are some other tunings people do for datanodes in a relatively large cluster? -- --- Get your facts first, then you can distort them as you please.--
Re: datanode tuning
For dfs.datanode.scan.period.hours, why isn't it documented here http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml Also, once these settings are in effect, how can I see that they are active? is there a JSON page I can login to see them? On Mon, Oct 7, 2013 at 10:50 AM, Ravi Prakash ravi...@ymail.com wrote: Rita! 14-16 Tb is perhaps a big node. Even then the scalability limits of the Namenode in your case would depend on how many files (more accurately how many blocks) there are on HDFS. In any case, if you want the datanodes to be marked dead quickly when their heartbeats are lost, you should reduce the two parameters I told you about. The datanode block scanner is unfortunately hard coded to use a maximum of 8Mb/s and a minimum of 1 Mb/s. The only thing you can change is dfs.datanode.scan.period.hours HTH Ravi From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org; Ravi Prakash ravi...@ymail.com Sent: Monday, October 7, 2013 5:55 AM Subject: Re: datanode tuning Thanks Ravi. The number of nodes isn't a lot but the size is rather large. Each data node has about 14-16T (560-640T). For the datanode block scanner, how can increase its Current scan rate limit KBps ? On Sun, Oct 6, 2013 at 11:09 PM, Ravi Prakash ravi...@ymail.com wrote: Please look at dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval 40 datanodes is not a large cluster IMHO and the Namenode is capable of managing 100 times more datanodes. From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Sunday, October 6, 2013 9:49 AM Subject: datanode tuning I would like my 40 data nodes to aggressively report to namenode if they are alive or not therefore I think I need to change these params dfs.block.access.token.lifetime : Default is 600 seconds. Can I decrease this to 60? dfs.block.access.key.update.interval: Default is 600 seconds. Can I decrease this to 60? Also, what are some other turnings people do for datanodes in a relatively large cluster? -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
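For reference, a minimal hdfs-site.xml sketch of the two heartbeat settings Ravi mentions (property names as spelled in 0.21 and later; in 0.20 the recheck setting is named heartbeat.recheck.interval; the values shown are the usual defaults, so lower them to taste):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>            <!-- seconds between datanode heartbeats -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>       <!-- milliseconds between namenode liveness rechecks -->
</property>

To confirm which values a running daemon actually picked up, recent versions expose the merged configuration over HTTP from the daemon web UI, e.g. curl http://namenode:50070/conf (hostname and port illustrative); the /jmx servlet on newer builds additionally returns metrics as JSON.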
datanode tuning
I would like my 40 data nodes to aggressively report to the namenode whether they are alive, so I think I need to change these params: dfs.block.access.token.lifetime: default is 600 seconds. Can I decrease this to 60? dfs.block.access.key.update.interval: default is 600 seconds. Can I decrease this to 60? Also, what are some other tunings people do for datanodes in a relatively large cluster? -- --- Get your facts first, then you can distort them as you please.--
Re: hardware for hdfs
Any thoughts? On Wed, Mar 13, 2013 at 7:17 PM, Rita rmorgan...@gmail.com wrote: I am planning to build an HDFS cluster primarily for streaming large files (10 GB average size). I was wondering if anyone can recommend a good hardware vendor. -- --- Get your facts first, then you can distort them as you please.--
Re: measuring iops
I was curious because when a vendor (big storage company) presented they were offering a hadoop solution. They posted IOPS and I wasn't sure how they were determining this number On Tue, Oct 23, 2012 at 9:19 AM, Michael Segel michael_se...@hotmail.comwrote: You have two issues. 1) You need to know the throughput in terms of data transfer between disks and controller cards on the node. 2) The actual network throughput of having all of the nodes talking to one another as fast as they can. This will let you see your real limitations in the ToR Switch's fabric. Not sure why you really want to do this except to test the disk, disk controller, and then networking infrastructure of your ToR and then your backplane to connect multiple racks HTH -Mike On Oct 23, 2012, at 7:47 AM, Ravi Prakash ravi...@ymail.com wrote: Do you mean in a cluster being used by users, or as a benchmark to measure the maximum? The JMX page nn:port/jmx provides some interesting stats, but I'm not sure they have what you want. And I'm unaware of other tools which could. From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org; Ravi Prakash ravi...@ymail.com Sent: Monday, October 22, 2012 6:46 PM Subject: Re: measuring iops Is it possible to know how many reads and writes are occurring thru the entire cluster in a consolidated manner -- this does not include replication factors. On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com wrote: Hi Rita, SliveTest can help you measure the number of reads / writes / deletes / ls / appends per second your NameNode can handle. DFSIO can be used to help you measure the amount of throughput. Both these tests are actually very flexible and have a plethora of options to help you test different facets of performance. In my experience, you actually have to be very careful and understand what the tests are doing for the results to be sensible. HTH Ravi From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Monday, October 22, 2012 7:23 AM Subject: Re: measuring iops Anyone? On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote: Hi, Was curious if there was a method to measure the total number of IOPS (I/O operations per second) on a HDFS cluster. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
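As a rough sketch of the benchmarks Ravi mentions (the jar name and location differ between releases, and the TestDFSIO flags shown are illustrative of common usage):

hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-*test*.jar SliveTest        # run without arguments to print its usage/options

The JMX page Ravi refers to can be fetched with curl http://namenode:50070/jmx (port illustrative), which returns the NameNode's counters as JSON.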
Re: measuring iops
Anyone? On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote: Hi, Was curious if there was a method to measure the total number of IOPS (I/O operations per second) on a HDFS cluster. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: measuring iops
Is it possible to know how many reads and writes are occurring thru the entire cluster in a consolidated manner -- this does not include replication factors. On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com wrote: Hi Rita, SliveTest can help you measure the number of reads / writes / deletes / ls / appends per second your NameNode can handle. DFSIO can be used to help you measure the amount of throughput. Both these tests are actually very flexible and have a plethora of options to help you test different facets of performance. In my experience, you actually have to be very careful and understand what the tests are doing for the results to be sensible. HTH Ravi From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Monday, October 22, 2012 7:23 AM Subject: Re: measuring iops Anyone? On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote: Hi, Was curious if there was a method to measure the total number of IOPS (I/O operations per second) on a HDFS cluster. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: distcp question
Thanks for the advice. Before I push or pull, are there any tests I can run before I do the distcp? I am not 100% sure my webhdfs is set up properly. On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis jrottingh...@gmail.com wrote: Rita, Are you doing a push from the source cluster or a pull from the target cluster? Doing a pull with distcp using hftp (to accommodate version differences) has the advantage of slightly fewer transfers of blocks over the ToRs. Each block is read from exactly the datanode where it is located, and on the target side (where the mappers run) the first write is to the local datanode. With RF=3, each block transfers out of the source ToR, into the target ToR, then out of the first target-cluster ToR into a different target-cluster ToR for replicas 2 and 3. Overall, 2 times out and 2 times in. Doing a pull with webhdfs://, the proxy server has to collect all blocks from the source DNs, then they get pulled to the target machine. The situation is similar to the above, with one extra transfer of all data going through the proxy server. Doing a push with webhdfs:// on the target cluster side, the mapper has to collect all blocks from one or more files (depending on the # of mappers used) and send them to the proxy server, which then writes blocks to the target cluster. The advantage on the target cluster is that each block of a large multi-block file gets spread over different datanodes on the target side. But if I'm counting correctly, you'll have the most data transfer: out of each source DN, through a source-cluster mapper DN, through the target proxy server, to a target DN, and out/in again for replicas 2 and 3. So convenience and setup aside, I think the first option would be the fewest network transfers. Now if your clusters are separated over a WAN, then this may not matter at all. Just something to think about. Cheers, Joep On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote: Rita, I believe, per the implementation, that webhdfs:// URIs should work fine. Please give it a try and let us know. On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote: I have 2 different versions of Hadoop running. I need to copy a significant amount of data (100 TB) from one cluster to another. I know distcp is the way to do it. On the target cluster I have webhdfs running. Would that work? The DistCp manual says I need to use HftpFileSystem. Is that necessary, or will webhdfs do the task? -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J -- --- Get your facts first, then you can distort them as you please.--
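A quick way to sanity-check webhdfs before kicking off the copy, plus the corresponding pull-style distcp invocations (hostnames, ports and paths are placeholders; 50070 is the usual NameNode HTTP port, and the copy is run from the target cluster so its MapReduce framework does the writing):

curl -i "http://source-nn:50070/webhdfs/v1/?op=LISTSTATUS"
hadoop distcp hftp://source-nn:50070/data/src hdfs://target-nn:8020/data/dst
hadoop distcp webhdfs://source-nn:50070/data/src hdfs://target-nn:8020/data/dst

If the curl call returns a JSON directory listing rather than an error, webhdfs is reachable and enabled on that NameNode.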
Re: Re: distcp question
nvermind. Figured it out. On Fri, Oct 12, 2012 at 3:20 PM, kojie.fu kojie...@gmail.com wrote: kojie.fu From: Rita Date: 2012-10-13 03:19 To: common-user Subject: Re: distcp question thanks for the advise. Before I push or pull. Are there any tests I can run before I do the distCP. I am not 100% sure if I have my webhdfs setup properly. On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis jrottingh...@gmail.com wrote: Rita, Are you doing a push from the source cluster or a pull from the target cluster? Doing a pull with distcp using hftp (to accomodate for version differences) has the advantage of slightly fewer transfers of blocks over the TORs. Each block is read from exactly the datanode where it is located, and on the target side (where the mappers run) the first write is to the local datanode. With RF=3 each block transfers out of the source TOR, into the target TOR, out of the first target-cluster TOR into a different target-cluster TOR for replica 2 3. Overall 2 time out, and 2 times in. Doing a pull with webhdfs:// the proxy server has to collect all blocks from the source DNs, then they get pulled to the target machine. Situation is similar as above, with the one extra transfer of all data going through the proxy server. Doing a push with webhdfs:// on the target cluster size, the mapper has to collect all blocks from one or more files (depending on # mappers used) and send them to the proxy server, which then writes blocks to the target cluster. Advantage on the target cluster is that each block for a large multi-block files get spread over different datanodes on the target side. But if I'm counting correctly, you'll have the most data transfer. Out of each source DN, through source cluster mapper DN, through target proxy server, to target DN, and out/in again for replicas 23. So convenience and setup aside, I think the first option would be the least network transfers. Now if you're clusters are separated over a WAN, then this may not matter all at. Just something to think about. Cheers, Joep On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote: Rita, I believe, per the implementation, that webhdfs:// URIs should work fine. Please give it a try and let us know. On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote: I have 2 different versions of Hadoop running. I need to copy significant amount of data (100tb) from one cluster to another. I know distcp is the way to do. On the target cluster I have webhdfs running. Would that work? The DistCp manual says, I need to use HftpFileSystem. Is that necessary or will webhdfs do the task? -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
file checksum
Does Hadoop, HDFS in particular, do any sanity checks of files before and after balancing/copying/reading them? We have 20 TB of data and I want to make sure that after these operations are completed the data is still in good shape. Where can I read about this? tia -- --- Get your facts first, then you can distort them as you please.--
Re: file checksum
What is the parameter I can use to check more often, say every 3 days? On Mon, Jun 25, 2012 at 7:33 AM, Kai Voigt k...@123.org wrote: HDFS has block checksums. Whenever a block is written to the datanodes, a checksum is calculated and written with the block to the datanodes' disks. Whenever a block is requested, the block's checksum is verified against the stored checksum. If they don't match, that block is corrupt. But since there are additional replicas of the block, chances are high that another copy matches the checksum. Corrupt blocks will be scheduled to be re-replicated. Also, to prevent bit rot, blocks are checked periodically (weekly by default, I believe; you can configure that period) in the background. Kai On 25.06.2012 at 13:29, Rita wrote: Does Hadoop, HDFS in particular, do any sanity checks of files before and after balancing/copying/reading them? We have 20 TB of data and I want to make sure that after these operations are completed the data is still in good shape. Where can I read about this? tia -- --- Get your facts first, then you can distort them as you please.-- -- Kai Voigt k...@123.org -- --- Get your facts first, then you can distort them as you please.--
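A sketch of the setting Kai is describing, assuming you want the background block scanner to make a full pass roughly every 3 days (the property takes hours; set it in hdfs-site.xml on the datanodes):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>72</value>   <!-- rescan every block at least once per 72 hours -->
</property>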
Re: freeze a mapreduce job
thanks. I think I will investigate capacity scheduler. On Fri, May 11, 2012 at 7:26 AM, Michael Segel michael_se...@hotmail.comwrote: Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority which will cause new tasks to spawn slower than other jobs so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If its the latter, you will probably want to go to a capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of the needs can be doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT, and you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when needed to freeze it, simply set the queue's maximum-capacity to 0 (percentage) or to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing. Meaning, their existing running tasks during the time of changes made to scheduler config will continue to run till completion but further tasks scheduling from those jobs shall begin seeing effect of the changes made. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Rita rmorgan...@gmail.com wrote: I have a rather large map reduce job which takes few days. I was wondering if its possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then I can increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J -- --- Get your facts first, then you can distort them as you please.--
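A minimal sketch of the FairScheduler approach Harsh describes, assuming the job is submitted to a dedicated pool (the pool name here is made up; maxMaps/maxReduces are the allocation-file elements he refers to, set to 0 to freeze the pool or to small positive numbers to throttle it):

<allocations>
  <pool name="longjob">
    <maxMaps>0</maxMaps>
    <maxReduces>0</maxReduces>
  </pool>
</allocations>

Edit the allocation file in place; as noted above, the FairScheduler rereads it after a few seconds without a JobTracker restart.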
hdfs file browser
Is it possible to get pretty URLs when doing HDFS file browsing via web browser? -- --- Get your facts first, then you can distort them as you please.--
setting client retry
In the hdfs-site.xml file, what parameter do I need to set for client retries? Also, what is its default value? -- --- Get your facts first, then you can distort them as you please.--
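For what it's worth, the closest knobs I'm aware of in this era are the following (the defaults shown come from the stock hdfs-default.xml and core-default.xml; verify against your version before relying on them):

<property>
  <name>dfs.client.block.write.retries</name>
  <value>3</value>    <!-- hdfs-site.xml: attempts to write a block to the datanode pipeline -->
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>   <!-- core-site.xml: attempts to establish an RPC connection -->
</property>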
Re: hadoop filesystem cache
My intention isn't to make it a mandatory feature, just an option. Keeping data locally on a filesystem as a kind of Lx cache is far better than getting it from the network, and the cost of the fs buffer cache is much cheaper than an RPC call. On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo edlinuxg...@gmail.com wrote: The challenge of this design is that people accessing the same data over and over again is the uncommon use case for hadoop. Hadoop's bread and butter is all about streaming through large datasets that do not fit in memory. Also, your shuffle-sort-spill is going to play havoc on any filesystem-based cache. The distributed cache roughly fits this role except that it does not persist after a job. Replicating content to N nodes also is not a hard problem to tackle (you can hack up a content delivery system with ssh+rsync) and get similar results. The approach often taken has been to keep data that is accessed repeatedly and fits in memory in some other system (hbase/cassandra/mysql/whatever). Edward On Mon, Jan 16, 2012 at 11:33 AM, Rita rmorgan...@gmail.com wrote: Thanks. I believe this is a good feature to have for clients, especially if you are reading the same large file over and over. On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon t...@cloudera.com wrote: There is some work being done in this area by some folks over at UC Berkeley's AMP Lab in coordination with Facebook. I don't believe it has been published quite yet, but the title of the project is PACMan -- I expect it will be published soon. -Todd On Sat, Jan 14, 2012 at 5:30 PM, Rita rmorgan...@gmail.com wrote: After reading this article, http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering if there is a filesystem cache for hdfs. For example, if a large file (10 gigabytes) keeps getting accessed on the cluster, instead of fetching it from the network each time, why not store the content of the file locally on the client itself? A use case on the client would be like this: <property> <name>dfs.client.cachedirectory</name> <value>/var/cache/hdfs</value> </property> <property> <name>dfs.client.cachesize</name> <description>in megabytes</description> <value>10</value> </property> Any thoughts on a feature like this? -- --- Get your facts first, then you can distort them as you please.-- -- Todd Lipcon Software Engineer, Cloudera -- --- Get your facts first, then you can distort them as you please.--
Re: hadoop filesystem cache
Thanks. I believe this is a good feature to have for clients especially if you are reading the same large file over and over. On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon t...@cloudera.com wrote: There is some work being done in this area by some folks over at UC Berkeley's AMP Lab in coordination with Facebook. I don't believe it has been published quite yet, but the title of the project is PACMan -- I expect it will be published soon. -Todd On Sat, Jan 14, 2012 at 5:30 PM, Rita rmorgan...@gmail.com wrote: After reading this article, http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering if there was a filesystem cache for hdfs. For example, if a large file (10gigabytes) was keep getting accessed on the cluster instead of keep getting it from the network why not storage the content of the file locally on the client itself. A use case on the client would be like this: property namedfs.client.cachedirectory/name value/var/cache/hdfs/value /property property namedfs.client.cachesize/name descriptionin megabytes/description value10/value /property Any thoughts of a feature like this? -- --- Get your facts first, then you can distort them as you please.-- -- Todd Lipcon Software Engineer, Cloudera -- --- Get your facts first, then you can distort them as you please.--
hadoop filesystem cache
After reading this article, http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering if there is a filesystem cache for hdfs. For example, if a large file (10 gigabytes) keeps getting accessed on the cluster, instead of fetching it from the network each time, why not store the content of the file locally on the client itself? A use case on the client would be like this: <property> <name>dfs.client.cachedirectory</name> <value>/var/cache/hdfs</value> </property> <property> <name>dfs.client.cachesize</name> <description>in megabytes</description> <value>10</value> </property> Any thoughts on a feature like this? -- --- Get your facts first, then you can distort them as you please.--
Re: hadoop filesystem cache
yes, something different from that. To my knowledge, DistributedCache is only for Mapreduce. On Sat, Jan 14, 2012 at 8:33 PM, Prashant Kommireddi prash1...@gmail.comwrote: You mean something different from the DistributedCache? Sent from my iPhone On Jan 14, 2012, at 5:30 PM, Rita rmorgan...@gmail.com wrote: After reading this article, http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering if there was a filesystem cache for hdfs. For example, if a large file (10gigabytes) was keep getting accessed on the cluster instead of keep getting it from the network why not storage the content of the file locally on the client itself. A use case on the client would be like this: property namedfs.client.cachedirectory/name value/var/cache/hdfs/value /property property namedfs.client.cachesize/name descriptionin megabytes/description value10/value /property Any thoughts of a feature like this? -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
measuring network throughput
Is there a tool or a method to measure the throughput of the cluster at a given time? It would be a great feature to add. -- --- Get your facts first, then you can distort them as you please.--
Re: measuring network throughput
Yes, I think they can graph it for you. However, I am looking for the raw data because I would like to build something custom. On Thu, Dec 22, 2011 at 8:19 AM, alo alt wget.n...@googlemail.com wrote: Rita, Ganglia gives you throughput, like Nagios. Could that help? - Alex On Thu, Dec 22, 2011 at 1:58 PM, Rita rmorgan...@gmail.com wrote: Is there a tool or a method to measure the throughput of the cluster at a given time? It would be a great feature to add. -- --- Get your facts first, then you can distort them as you please.-- -- Alexander Lorenz http://mapredit.blogspot.com P Think of the environment: please don't print this email unless you really need to. -- --- Get your facts first, then you can distort them as you please.--
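One way to get at the raw numbers (a sketch; 50075/50070 are the usual daemon HTTP ports, and the exact servlet available depends on the version): each DataNode publishes counters such as bytes_read and bytes_written that you can poll and graph yourself, e.g.

curl http://datanode1:50075/metrics
curl http://namenode:50070/metrics

Alternatively, hadoop-metrics.properties can point the dfs context at a FileContext or GangliaContext sink so the same counters are collected continuously instead of being polled.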
Hadoop RPC question
Hello, I am working on writing a process (bash) which can attach to the namenode and listen to the RPCs. I am interested in, what files are hot and who is reading the data. Currently, I am using the namenode logs to gather this data but was wondering if I can attach to the hadoop/hdfs port and listen to the calls? Has anyone done this before? TIA -- --- Get your facts first, then you can distort them as you please.--
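An alternative to attaching to the RPC port is to turn on the NameNode audit log, which emits one line per metadata operation (open, create, rename, delete, and so on) with the user, source IP and path, and is much easier to tail from a bash process. A sketch for conf/log4j.properties (the logger name is the standard HDFS audit logger; the appender name and file location are illustrative):

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,AUDIT
log4j.appender.AUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.AUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.AUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.AUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n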
Re: Hadoop 0.21
I second Vinod's idea. Get the latest stable from Cloudera. Their binaries are near perfect! On Tue, Dec 6, 2011 at 1:46 PM, T Vinod Gupta tvi...@readypulse.com wrote: Saurabh, It's best if you go through the hbase book - Lars George's HBase: The Definitive Guide. Your best bet is to build all binaries yourself or get a stable build from Cloudera. I was in this situation a few months ago and had to spend a lot of time before I was able to get a production-ready hbase version up and running. thanks vinod On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com wrote: Hi All, According to the Hadoop release notes, version 0.21.0 should not be considered stable or suitable for production: 23 August, 2010: release 0.21.0 available This release contains many improvements, new features, bug fixes and optimizations. It has not undergone testing at scale and should not be considered stable or suitable for production. This release is being classified as a minor release, which means that it should be API compatible with 0.20.2. Is this still the case? Thank you, Saurabh -- --- Get your facts first, then you can distort them as you please.--
replication question
Hello, I am using hbase and I have a default replication factor of 2. Now, if I change the directory's replication factor, will all the new files created there automatically be replicated with a factor of 3? -- --- Get your facts first, then you can distort them as you please.--
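Two details worth noting here: HDFS replication is a per-file attribute rather than a per-directory one, so new files pick up whatever dfs.replication the creating client (the HBase region server, in this case) is configured with, and existing files keep their old factor until it is changed explicitly. A sketch of both knobs (the /hbase path is illustrative):

<property>
  <name>dfs.replication</name>
  <value>3</value>   <!-- hdfs-site.xml on the client/regionserver side; applies to files created afterwards -->
</property>

hadoop fs -setrep -R 3 /hbase        # change files that already exist under the directory
hadoop fs -setrep -R -w 3 /hbase     # same, but wait for re-replication to finish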
Re: Hadoop + cygwin
Why? The beauty of hadoop is that it's OS agnostic. What is your native operating system? I am sure you have a version of the JDK and JRE running there. On Tue, Nov 1, 2011 at 4:53 AM, Masoud mas...@agape.hanyang.ac.kr wrote: Hi, Has anybody run hadoop on cygwin for development purposes? Did you have any problem running the tasktracker? Thanks -- --- Get your facts first, then you can distort them as you please.--
HDFS-RAID
I would like to know if HDFS RAID (http://wiki.apache.org/hadoop/HDFS-RAID) will ever get into the mainline. This would be an extremely useful feature for many sites, especially larger ones; the savings on storage would be noticeable. I haven't really seen any progress in https://issues.apache.org/jira/browse/HDFS-503 so I am a bit worried :-) -- --- Get your facts first, then you can distort them as you please.--
correct way to reserve space
What is the correct way to reserve space for hdfs? I currently have 2 filesystems, /fs1 and /fs2, and I would like to reserve space for non-dfs operations. For example, on /fs1 I would like to reserve 30 GB of space for non-dfs use, and 10 GB of space on /fs2. I fear HADOOP-2991 is still haunting us? I am using CDH 3u1. -- --- Get your facts first, then you can distort them as you please.--
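The knob for this is dfs.datanode.du.reserved, which reserves a number of bytes for non-DFS use; as far as I know it is a single value applied to every dfs.data.dir volume, so reserving 30 GB on /fs1 but only 10 GB on /fs2 is not expressible in this vintage (which is roughly what HADOOP-2991 is about). A sketch reserving 30 GB per volume:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>32212254720</value>   <!-- 30 * 1024^3 bytes reserved on each data volume -->
</property>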
default timeout of datanode
Is there a way to configure the default timeout of a datanode? Currently it's set to 630 seconds and I want something a bit more realistic -- like 30 seconds. -- --- Get your facts first, then you can distort them as you please.--
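The 630 seconds is not a single setting; it is derived from the heartbeat parameters (property names as in 0.21 and later; 0.20 calls the second one heartbeat.recheck.interval):

timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
        = 2 * 300 s + 10 * 3 s = 630 s    (defaults: 300000 ms and 3 s)

So, as an illustrative example, setting the recheck interval to 10000 (ms) and the heartbeat interval to 1 (s) in the NameNode's hdfs-site.xml gives 2*10 + 10*1 = 30 seconds, though values this aggressive risk false dead-node declarations under load.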
Re: Which release to use?
Arun, I second Joeś comment. Thanks for giving us a heads up. I will wait patiently until 0.23 is considered stable. On Mon, Jul 18, 2011 at 11:19 PM, Joe Stein charmal...@allthingshadoop.comwrote: Arun, Thanks for the update. Again, I hate to have to play the part of captain obvious. Glad to hear the same contiguous mantra for this next release. I think sometimes the plebeians ( of which I am one ) need that affirmation. One love, Apache Hadoop! /* Joe Stein http://www.medialets.com Twitter: @allthingshadoop */ On Jul 18, 2011, at 11:06 PM, Arun Murthy a...@hortonworks.com wrote: Joe, The dev community is currently gearing up for hadoop-0.23 off trunk. 0.23 is a massive step forward with with HDFS Federation, NextGen MapReduce and possible others such as wire-compat and HA NameNode. In a couple of weeks I plan to create the 0.23 branch off trunk and we then spend all our energies stabilizing pushing the release out. Please see my note to general@ for more details. Arun On Jul 18, 2011, at 7:01 PM, Joe Stein charmal...@allthingshadoop.com wrote: So, last I checked this list was about Apache Hadoop not about derivative works. The Cloudera team has always been diligent (you rock) about redirecting non apache CDH releases to their list for answers. I commend those supporting apache releases of Hadoop too, very cool!!! But yeah, even I have to ask what the latest release will be. Is there going to be a single Hadoop release or a continued branch that Horton maintains and will only support? There is something to be said for release from trunk that gets everyone on the same page towards our common goals. You can pin the state the obvious paper on my back but kinda feel it had to be said. One love, Apache Hadoop! /* Joe Stein http://www.medialets.com Twitter: @allthingshadoop */ On Jul 18, 2011, at 9:51 PM, Michael Segel michael_se...@hotmail.com wrote: Date: Mon, 18 Jul 2011 18:19:38 -0700 Subject: Re: Which release to use? From: mcsri...@gmail.com To: common-user@hadoop.apache.org Mike, Just a minor inaccuracy in your email. Here's setting the record straight: 1. MapR directly sells their distribution of Hadoop. Support is from MapR. 2. EMC also sells the MapR distribution, for use on any hardware. Support is from EMC worldwide. 3. EMC also sells a Hadoop appliance, which has the MapR distribution specially built for it. Support is from EMC. 4. MapR also has a free, unlimited, unrestricted version called M3, which has the same 2-5x performance, management and stability improvements, and includes NFS. It is not crippleware, and the unlimited, unrestricted, free use does not expire on any date. Hope that clarifies what MapR is doing. thanks regards, Srivas. Srivas, I'm sorry, I thought I was being clear in that I was only addressing EMC and not MapR directly. I was responding to post about EMC selling a Greenplum appliance. I wanted to point out that EMC will resell MapR's release along with their own (EMC) support. The point I was trying to make was that with respect to derivatives of Hadoop, I believe that MapR has a more compelling story than either EMC or DataStax. IMHO replacing Java HDFS w either GreenPlum or Cassandra has a limited market. When a company is going to look at a M/R solution cost and performance are going to be at the top of the list. MapR isn't cheap but if you look at the features in M5, if they work, then you have a very compelling reason to look at their release. Some of the people I spoke to when I was in Santa Clara were in the beta program. 
They indicated that MapR did what they claimed. Things are definitely starting to look interesting. -Mike On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel michael_se...@hotmail.comwrote: EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future, however, a Greenplum appliance may not get the customer transaction that their reselling of MapR will generate. It sounds like they are hedging their bets and are taking an 'IBM' approach. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Steve, I read your blog nice post - I believe EMC is selling the Greenplumb solution as an appliance - Cheers - Jeffery -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a
Re: Which release to use?
I made the big mistake by using the latest version, 0.21.0 and found bunch of bugs so I got pissed off at hdfs. Then, after reading this thread it seems I should of used 0.20.x . I really wish we can fix this on the website, stating 0.21.0 as unstable. On Mon, Jul 18, 2011 at 4:50 PM, Michael Segel michael_se...@hotmail.comwrote: Well that's CDH3. :-) And yes, that's because up until the past month... other releases didn't exist w commercial support. Now there are more players as we look at the movement from leading edge to mainstream adopters. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 14:30:39 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Most people are using CH3 - if you need some features from another distro use that - http://www.cloudera.com/hadoop/ I wonder if the Cloudera people realize that CH3 was a pretty happening punk band back in the day (if not they do now = ) http://en.wikipedia.org/wiki/Channel_3_%28band%29 cheers - Jeffery Schmitz Projects and Technology 3737 Bellaire Blvd Houston, Texas 77001 Tel: +1-713-245-7326 Fax: +1 713 245 7678 Email: jeff.schm...@shell.com Intergalactic Proton Powered Electrical Tentacled Advertising Droids! -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Monday, July 18, 2011 2:10 PM To: common-user@hadoop.apache.org Subject: RE: Which release to use? Tom, I'm not sure that you're really honoring the purpose and approach of this list. I mean on the one hand, you're not under any obligation to respond or participate on the list. And I can respect that. You're not in an SD role so you're not 'customer facing' and not used to having to deal with these types of questions. On the other, you're not being free with your information. So when this type of question comes up, it becomes very easy to discount IBM as a release or source provider for commercial support. Without information, I'm afraid that I may have to make recommendations to my clients that may be out of date. There is even some speculation from analysts that recent comments from IBM are more of an indication that IBM is still not ready for prime time. I'm sorry you're not in a position to detail your offering. Maybe by September you might be ready and then talk to our CHUG? -Mike To: common-user@hadoop.apache.org Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sat, 16 Jul 2011 10:29:55 -0700 Hi Rita - I want to make sure we are honoring the purpose/approach of this list. So you are welcome to ping me for information, but let's take this discussion off the list at this point. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Rita rmorgan...@gmail.com 07/16/2011 08:53 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Re: Which release to use? I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. 
Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame
Re: Which release to use?
I am a dimwit. On Mon, Jul 18, 2011 at 8:12 PM, Allen Wittenauer a...@apache.org wrote: On Jul 18, 2011, at 5:01 PM, Rita wrote: I made the big mistake by using the latest version, 0.21.0 and found bunch of bugs so I got pissed off at hdfs. Then, after reading this thread it seems I should of used 0.20.x . I really wish we can fix this on the website, stating 0.21.0 as unstable. It is stated in a few places on the website that 0.21 isn't stable: http://hadoop.apache.org/common/releases.html#23+August%2C+2010%3A+release+0.21.0+available It has not undergone testing at scale and should not be considered stable or suitable for production. ... and ... http://hadoop.apache.org/common/releases.html#Download 0.21.X - unstable, unsupported, does not include security and it isn't in the stable directory on the apache download mirrors. -- --- Get your facts first, then you can distort them as you please.--
Re: Which release to use?
I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -- --- Get your facts first, then you can distort them as you please.--
Re: large data and hbase
Thanks. If you mean asking to ask the MapReduce list they will naturally recommend it :) I suppose I will look into it eventually but we invested a lot of time into Torque. On Tue, Jul 12, 2011 at 9:01 AM, Harsh J ha...@cloudera.com wrote: For a query to work in a fully distributed manner, MapReduce may still be required (atop HBase, i.e.). There's been work ongoing to assist the same at the HBase side as well, but you're guaranteed better responses on their mailing lists instead. On Tue, Jul 12, 2011 at 3:31 PM, Rita rmorgan...@gmail.com wrote: This is encouraging. ¨Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons by running bin/start-hdfs.sh over in the HADOOP_HOME directory. You can ensure it started properly by testing the *put* and *get* of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started.¨ On Mon, Jul 11, 2011 at 1:40 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: Another option to look at is Pig Or Hive. These need MapReduce. -Bharath From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Monday, July 11, 2011 4:31 AM Subject: large data and hbase I have a dataset which is several terabytes in size. I would like to query this data using hbase (sql). Would I need to setup mapreduce to use hbase? Currently the data is stored in hdfs and I am using `hdfs -cat ` to get the data and pipe it into stdin. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J -- --- Get your facts first, then you can distort them as you please.--
Re: large data and hbase
This is encouraging: "Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons by running bin/start-hdfs.sh over in the HADOOP_HOME directory. You can ensure it started properly by testing the *put* and *get* of files into the Hadoop filesystem. HBase does not normally use the mapreduce daemons. These do not need to be started." On Mon, Jul 11, 2011 at 1:40 PM, Bharath Mundlapudi bharathw...@yahoo.com wrote: Another option to look at is Pig or Hive. These need MapReduce. -Bharath From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org Sent: Monday, July 11, 2011 4:31 AM Subject: large data and hbase I have a dataset which is several terabytes in size. I would like to query this data using hbase (sql). Would I need to set up mapreduce to use hbase? Currently the data is stored in hdfs and I am using `hdfs -cat` to get the data and pipe it into stdin. -- --- Get your facts first, then you can distort them as you please.--
large data and hbase
I have a dataset which is several terabytes in size. I would like to query this data using hbase (sql). Would I need to setup mapreduce to use hbase? Currently the data is stored in hdfs and I am using `hdfs -cat ` to get the data and pipe it into stdin. -- --- Get your facts first, then you can distort them as you please.--
Re: parallel cat
Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't see any example code for the implementation. On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran ste...@apache.org wrote: On 06/07/11 11:08, Rita wrote: I have many large files ranging from 2 GB to 800 GB and I use hadoop fs -cat a lot to pipe to various programs. I was wondering if it's possible to prefetch the data for clients with more bandwidth. Most of my clients have 10G interfaces and the datanodes are 1G. I was thinking: prefetch x blocks (even though it will cost extra memory) while reading block y. After block y is read, read the prefetched blocks and then throw them away. It would be used like this: export PREFETCH_BLOCKS=2 #default would be 1 hadoop fs -pcat hdfs://namenode/verylargefile | program Any thoughts? Look at Russ Perry's work on doing very fast fetches from an HDFS filestore: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf Here the DFS client got some extra data on where every copy of every block was, and the client decided which machine to fetch it from. This made the best use of the entire cluster, by keeping each datanode busy. -steve -- --- Get your facts first, then you can distort them as you please.--
Re: thrift and python
Could someone please compile and provide the jar for this class? It would be much appreciated. I am running r0.21.0 (http://hadoop.apache.org/common/docs/r0.21.0/). On Thu, Jul 7, 2011 at 3:56 AM, Rita rmorgan...@gmail.com wrote: By looking at this, http://www.mail-archive.com/mapreduce-dev@hadoop.apache.org/msg02088.html , is it still necessary to compile the jar to resolve "Could not find the main class: org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit."? I would think the jar would exist in the latest version of hadoop/hdfs. -- --- Get your facts first, then you can distort them as you please.--
Re: parallel cat
Thanks again Steve. I will try to implement it with thrift. On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran ste...@apache.org wrote: On 07/07/11 08:22, Rita wrote: Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't see any example code for the implementation. No. I think I have access to Russ's source somewhere, but there'd be paperwork in getting it released. Russ said it wasn't too hard to do; he just had to patch the DFS client to offer up the entire list of block locations to the client, and let the client program make the decision. If you discussed this on the hdfs-dev list (via a JIRA), you may be able to get a patch for this accepted, though you have to do the code and tests yourself. On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran ste...@apache.org wrote: On 06/07/11 11:08, Rita wrote: I have many large files ranging from 2 GB to 800 GB and I use hadoop fs -cat a lot to pipe to various programs. I was wondering if it's possible to prefetch the data for clients with more bandwidth. Most of my clients have 10G interfaces and the datanodes are 1G. I was thinking: prefetch x blocks (even though it will cost extra memory) while reading block y. After block y is read, read the prefetched blocks and then throw them away. It would be used like this: export PREFETCH_BLOCKS=2 #default would be 1 hadoop fs -pcat hdfs://namenode/verylargefile | program Any thoughts? Look at Russ Perry's work on doing very fast fetches from an HDFS filestore: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf Here the DFS client got some extra data on where every copy of every block was, and the client decided which machine to fetch it from. This made the best use of the entire cluster, by keeping each datanode busy. -steve -- --- Get your facts first, then you can distort them as you please.--
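For anyone wanting to experiment without patching the DFS client, the per-block replica locations Steve describes are already visible from the stock tooling, e.g. (path illustrative):

hadoop fsck /data/verylargefile -files -blocks -locations

This prints every block of the file together with the datanodes holding each replica; a client-side prefetcher could use the same information to decide which replica to read.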
parallel cat
I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat a lot to pipe to various programs. I was wondering if its possible to prefetch the data for clients with more bandwidth. Most of my clients have 10g interface and datanodes are 1g. I was thinking, prefetch x blocks (even though it will cost extra memory) while reading block y. After block y is read, read the prefetched blocked and then throw it away. It should be used like this: export PREFETCH_BLOCKS=2 #default would be 1 hadoop fs -pcat hdfs://namenode/verylarge file | program Any thoughts? -- --- Get your facts first, then you can distort them as you please.--
tar or hadoop archive
We use hadoop/hdfs to archive data. I archive a lot of files by creating one large tar file and then placing it in hdfs. Is it better to use hadoop archive for this, or is it essentially the same thing? -- --- Get your facts first, then you can distort them as you please.--
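They are not quite the same thing: a tar file is a single opaque blob that has to be copied out and unpacked to reach one member, while a Hadoop Archive (HAR) keeps every file individually addressable through the har:// filesystem, at the cost of running a small MapReduce job to build it. A usage sketch with made-up paths:

hadoop archive -archiveName results.har -p /user/rita/results jan feb /user/rita/archives
hadoop fs -ls har:///user/rita/archives/results.har
hadoop fs -cat har:///user/rita/archives/results.har/jan/part-00000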
measure throughput of cluster
I am trying to acquire statistics about my hdfs cluster in the lab. One stat I am really interested in is the total throughput (gigabytes served) of the cluster for 24 hours. I suppose I can look for 'cmd=open' in the log file of the name node but how accurate is it? It seems there is no 'cmd=close' to distinguish a full file read. Is there a better way to acquire this? -- --- Get your facts first, then you can distort them as you please.--
Re: hdfs log question
I am guessing I should change this in the file: log4j.logger.org.apache.hadoop = DEBUG. Do I need to restart anything, or does the change take effect immediately? Some examples in the documentation would help immensely. On Thu, Apr 21, 2011 at 12:30 AM, Harsh J ha...@cloudera.com wrote: Hello, Have a look at conf/log4j.properties to configure all logging options. On Thu, Apr 21, 2011 at 3:14 AM, Rita rmorgan...@gmail.com wrote: I guess I should ask: how does one enable debug mode for the namenode and datanode logs? I would like to see if in debug mode I am able to see the close calls on a file. -- Harsh J -- --- Get your facts first, then you can distort them as you please.--
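To answer the restart question with a sketch: edits to conf/log4j.properties are only read when a daemon starts, so they need a restart, but the log level of a running daemon can be flipped at runtime (non-persistently) with the daemonlog tool; the host, port and exact logger names below are illustrative:

hadoop daemonlog -getlevel namenode-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel namenode-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG
hadoop daemonlog -setlevel datanode-host:50075 org.apache.hadoop.hdfs.server.datanode.DataNode DEBUG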
Re: hdfs log question
I guess I should ask, how does one enable debug mode for the namenode and datanode logs? I would like to see if in the debug mode I am able to see close calls of a file. On Tue, Apr 19, 2011 at 8:48 PM, Rita rmorgan...@gmail.com wrote: I know in the logs you can see 'cmd=open' and the filename. Is there a way to see the closing of the file? Basically, I want to account for the total number of megabytes transferred in hdfs. What is the best way to achieve this? Running 0.21 btw -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: changing node's rack
I think I tried this. I have a data file which holds the mapping: ip address:rack, hostname:rack. I changed that and did a refreshNodes. Is this what you mean, or something else? I would be more than happy to test it. On Mon, Mar 28, 2011 at 4:15 PM, Michael Segel michael_se...@hotmail.com wrote: This may be weird, but I could have sworn that the script is called repeatedly. One simple test would be to change the rack-aware script and print a message out when the script is called. Then change the script and see if it catches the change without restarting the cluster. -Mike From: tdunn...@maprtech.com Date: Sat, 26 Mar 2011 15:50:58 -0700 Subject: Re: changing node's rack To: common-user@hadoop.apache.org CC: rmorgan...@gmail.com I think that the namenode remembers the rack. Restarting the datanode doesn't make it forget. On Sat, Mar 26, 2011 at 7:34 AM, Rita rmorgan...@gmail.com wrote: What is the best way to change the rack of a node? I have tried the following: killed the datanode process, changed the rackmap file so the node and ip address entries reflect the new rack, did a '-refreshNodes', and restarted the datanode. But it seems the datanode keeps getting registered to the old rack. -- --- Get your facts first, then you can distort them as you please.--
live/dead node problem
Hello All, Is there a parameter or procedure to check more aggressively for a live/dead node? Despite my killing the hadoop process, I see the node active for more than 10 minutes on the Live Nodes page. Fortunately, the last-contact counter increments. Using branch-0.21, 0985326. -- --- Get your facts first, then you can distort them as you please.--
Re: live/dead node problem
what about for 0.21 ? Also, where do you set this? in the data node configuration or namenode? It seems the default is set to 3 seconds. On Tue, Mar 29, 2011 at 5:37 PM, Ravi Prakash ravip...@yahoo-inc.comwrote: I set these parameters for quickly discovering live / dead nodes. For 0.20 : heartbeat.recheck.interval For 0.22 : dfs.namenode.heartbeat.recheck-interval dfs.heartbeat.interval Cheers, Ravi On 3/29/11 10:24 AM, Michael Segel michael_se...@hotmail.com wrote: Rita, When the NameNode doesn't see a heartbeat for 10 minutes, it then recognizes that the node is down. Per the Hadoop online documentation: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. I was trying to find out if there's an hdfs-site parameter that could be set to decrease this time period, but wasn't successful. HTH -Mike Date: Tue, 29 Mar 2011 08:13:43 -0400 Subject: live/dead node problem From: rmorgan...@gmail.com To: common-user@hadoop.apache.org Hello All, Is there a parameter or procedure to check more aggressively for a live/dead node? Despite me killing the hadoop process, I see the node active for more than 10+ minutes in the Live Nodes page. Fortunately, the last contact increments. Using, branch-0.21, 0985326 -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: changing node's rack
Thanks Allen. I really hope this gets addressed. Leaving it in cache can become dangerous. On Sat, Mar 26, 2011 at 7:49 PM, Allen Wittenauer awittena...@linkedin.comwrote: On Mar 26, 2011, at 3:50 PM, Ted Dunning wrote: I think that the namenode remembers the rack. Restarting the datanode doesn't make it forget. Correct. https://issues.apache.org/jira/browse/HDFS-870 -- --- Get your facts first, then you can distort them as you please.--
Re: CDH and Hadoop
Thanks everyone for your replies. I knew Cloudera had their release but never knew Y! had one too... On Thu, Mar 24, 2011 at 5:04 PM, Eli Collins e...@cloudera.com wrote: Hey Rita, All software developed by Cloudera for CDH is Apache (v2) licensed and freely available. See these docs [1,2] for more info. We publish source packages (which includes the packaging source) and source tarballs, you can find these at http://archive.cloudera.com/cdh/3/. See the CHANGES.txt file (or the cloudera directory in the tarballs) for the specific patches that have been applied. CDH contains a number of projects (Hadoop, Pig, Hive, HBase, Oozie, Flume, Sqoop, Whirr, Hue, ZooKeeper, etc). Most have a small handful of patches applied (often there's only a couple additional patches as we've rolled an upstream dot release that folded in the delta from the previous release). The vast majority of the patches to Hadoop come from the Apache security and append [3, 4] branches. Aside from those the rest are critical backports and bug fixes. In general, we develop upstream first. Hope this clarifies things. Thanks, Eli 1. https://wiki.cloudera.com/display/DOC/Apache+License 2. https://wiki.cloudera.com/display/DOC/CDH3+Installation+Guide 3. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security 4. http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append On Wed, Mar 23, 2011 at 7:29 AM, Rita rmorgan...@gmail.com wrote: I have been wondering if I should use CDH ( http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? do they provide the tars or does it provide source code and I simply compile? Can I have some data nodes as CDH and the rest as regular Hadoop? I am asking this because so far I noticed a serious bug (IMO) in the decommissioning process ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e ) -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
CDH and Hadoop
I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? Do they provide tars, or do they provide source code that I simply compile? Can I have some data nodes running CDH and the rest running regular Hadoop? I am asking this because so far I have noticed a serious bug (IMO) in the decommissioning process ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e ) -- --- Get your facts first, then you can distort them as you please.--
Re: CDH and Hadoop
Mike, Thanks. This helps a lot. At our lab we have close to 60 servers which only run hdfs. I don't need mapreduce and other bells and whistles. We just use hdfs for storing dataset results ranging from 3gb to 90gb. So, what is the best practice for hdfs? should I always deploy one version before? I understand that Cloudera's version is heavily patched (similar to Redhat Linux kernel versus standard Linux kernel). On Wed, Mar 23, 2011 at 10:44 AM, Michael Segel michael_se...@hotmail.comwrote: Rita, Short answer... Cloudera's release is free, and they do also offer a support contract if you want support from them. Cloudera has sources, but most use yum (redhat/centos) to download an already built release. Should you use it? Depends on what you want to do. If your goal is to get up and running with Hadoop and then focus on *using* Hadoop/HBase/Hive/Pig/etc... then it makes sense. If your goal is to do a deep dive in to Hadoop and get your hands dirty mucking around with the latest and greatest in trunk? Then no. You're better off building your own off the official Apache release. Many companies choose Cloudera's release for the following reasons: * Paid support is available. * Companies focus on using a tech not developing the tech, so Cloudera does the heavy lifting while Client Companies focus on 'USING' Hadoop. * Cloudera's release makes sure that the versions in the release work together. That is that when you down load CHD3B4, you get a version of Hadoop that will work with the included version of HBase, Hive, etc ... And no, its never a good idea to try and mix and match Hadoop from different environments and versions in a cluster. (I think it will barf on you.) Does that help? -Mike Date: Wed, 23 Mar 2011 10:29:16 -0400 Subject: CDH and Hadoop From: rmorgan...@gmail.com To: common-user@hadoop.apache.org I have been wondering if I should use CDH ( http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? do they provide the tars or does it provide source code and I simply compile? Can I have some data nodes as CDH and the rest as regular Hadoop? I am asking this because so far I noticed a serious bug (IMO) in the decommissioning process ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e ) -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: decommissioning node woes
Any help? On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote: Hello, I have been struggling with decommissioning data nodes. I have a 50+ data node cluster (no MR) with each server holding about 2TB of storage. I split the nodes into 2 racks. I edit the 'exclude' file and then do a -refreshNodes. I see the node immediate in 'Decommiosied node' and I also see it as a 'live' node! Eventhough I wait 24+ hours its still like this. I am suspecting its a bug in my version. The data node process is still running on the node I am trying to decommission. So, sometimes I kill -9 the process and I see the 'under replicated' blocks...this can't be the normal procedure. There were even times that I had corrupt blocks because I was impatient -- waited 24-34 hours I am using 23 August, 2010: release 0.21.0 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available version. Is this a known bug? Is there anything else I need to do to decommission a node? -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
decommissioning node woes
Hello, I have been struggling with decommissioning data nodes. I have a 50+ data node cluster (no MR) with each server holding about 2 TB of storage. I split the nodes into 2 racks. I edit the 'exclude' file and then do a -refreshNodes. I see the node immediately under 'Decommissioned nodes', and I also see it as a 'live' node! Even though I wait 24+ hours, it's still like this. I suspect it's a bug in my version. The data node process is still running on the node I am trying to decommission. So, sometimes I kill -9 the process and I see the 'under replicated' blocks... this can't be the normal procedure. There were even times that I had corrupt blocks because I was impatient -- I only waited 24-34 hours. I am using the 23 August, 2010 release 0.21.0 ( http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available ). Is this a known bug? Is there anything else I need to do to decommission a node? -- --- Get your facts first, then you can distort them as you please.--
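For completeness, the usual recipe (file path and hostname illustrative): the NameNode must be pointed at the exclude file via dfs.hosts.exclude before -refreshNodes has any effect; the node should then show up as "Decommission In Progress" until all of its blocks have been re-replicated, and only after it flips to "Decommissioned" is it safe to stop the datanode process.

<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>

echo datanode14.example.com >> /etc/hadoop/conf/excludes
hadoop dfsadmin -refreshNodes
hadoop dfsadmin -report      # watch the node's state and the under-replicated block count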
Hadoop Streaming?
Hi :) May I ask two simple (and general) questions regarding Hadoop Streaming? 1. What's the difference among Hadoop Streaming, Hadoop Pipes, and Hadoop Online (HOP), the pipelining version developed by UC Berkeley? 2. In the current hadoop trunk, where can we find hadoop-streaming.jar? Further -- may I have an example which shows me how to use the hadoop-streaming feature? Thanks a lot! -Rita :)
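On the second question, a minimal streaming invocation looks like the sketch below (the jar lives under different paths depending on the release -- e.g. contrib/streaming or mapred/contrib/streaming inside the distribution -- and the input/output paths are made up). On the first question: Streaming runs any executable as mapper/reducer over stdin/stdout, Pipes is the C++-specific API, and HOP is a separate research fork that pipelines data between map and reduce tasks.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
  -input /user/rita/input \
  -output /user/rita/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc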