Re: rack awareness unexpected behaviour
Marc, The rack awareness script is an artificial concept, meaning you can say which machine is in which rack, and that may or may not reflect where the machine is actually located. The idea is to balance the number of nodes in the racks, at least on paper. So you can have 14 machines in rack 1 and 16 machines in rack 2, even though there may physically be 20 machines in rack 1 and 10 machines in rack 2. HTH -Mike

On Oct 3, 2013, at 2:52 AM, Marc Sturlese marc.sturl...@gmail.com wrote: I've checked it out and it works like that. The problem is, if the two racks don't have the same capacity, one will have its disk space filled up much faster than the other (that's what I'm seeing). If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each, rack A will get filled up faster as rack B is writing more (because it has more reduce slots). Could a solution be to modify the bash script used to decide to which rack a block replica is written? It would use probability and give rack B double the chance to receive the write.

-- View this message in context: http://lucene.472066.n3.nabble.com/rack-awareness-unexpected-behaviour-tp4086029p4093270.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com
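[Editor's note: the "paper" rack mapping Mike describes does not have to be a shell script. A minimal sketch of the same idea in Java is below; the class name and host names are hypothetical, and on 0.20.x-era releases the class is wired in via the topology.node.switch.mapping.impl property (net.topology.node.switch.mapping.impl on newer releases, which also require a reloadCachedMappings() method on the interface).]

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Hypothetical example: assign hosts to "paper" racks regardless of physical location,
    // so the logical racks stay balanced even if the physical racks are not.
    public class BalancedRackMapping implements DNSToSwitchMapping {

        private static final Map<String, String> RACK_BY_HOST = new HashMap<String, String>();
        static {
            RACK_BY_HOST.put("node01.example.com", "/rack1");
            RACK_BY_HOST.put("node02.example.com", "/rack2");
            RACK_BY_HOST.put("node03.example.com", "/rack1");
            RACK_BY_HOST.put("node04.example.com", "/rack2");
        }

        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String name : names) {
                String rack = RACK_BY_HOST.get(name);
                racks.add(rack != null ? rack : "/default-rack"); // unknown hosts fall back
            }
            return racks;
        }
    }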
Re: how to get the time of a hadoop cluster, v0.20.2
Then you have a problem where the solution is more a matter of people management than a technical one. All of your servers should be using NTP. At a minimum, you have one server that gets the time from a national (government) time server, and then have all of the machines in that data center use that machine as their NTP server, or you can have all machines by default use the government server for NTP. You can also buy your own clock server that syncs to either GPS or national time servers via a radio signal. But you have a problem of staff that is either unwilling or unable to do their job. You can either take a carrot or a stick approach. I suggest maybe bribing them with a bottle of scotch. (That seems to be the current liquid lubricator that works universally these days, unless of course they don't drink...) HTH -Mike

On May 17, 2013, at 9:13 AM, Jane Wayne jane.wayne2...@gmail.com wrote: and please remember, i stated that although the hadoop cluster uses NTP, the server (the machine that is not a part of the hadoop cluster) cannot be assumed to be using NTP (and in fact, doesn't).

On Fri, May 17, 2013 at 10:10 AM, Jane Wayne jane.wayne2...@gmail.comwrote: "if NTP is correctly used" - that's the key statement. in several of our clusters, NTP setup is kludgy. note that the professionals administering the cluster are different from us the engineers. so, there's a lot of red tape to go through to get something, trivial or not, fixed. we have noticed that NTP is not set up correctly (using the default GMT timezone, for example). without explaining all the tedious details, this mismatch of date/time (of the hadoop cluster to the server machine) is causing some pains. i'm not sure i agree that the local OS time from your server machine will be the best estimation. that doesn't make sense. but what i want to achieve is very simple. as stated before, i just want to ask the namenode or jobtracker, hey, what date/time do you have? unfortunately for me, as niels pointed out, this query is not possible via the hadoop api. thanks for helping, though. :)

On Fri, May 17, 2013 at 10:02 AM, Bertrand Dechoux decho...@gmail.comwrote: For hadoop, 'cluster time' is the local OS time. You might want to get the time of the namenode machine but indeed if NTP is correctly used, the local OS time from your server machine will be the best estimation. If you request the time from the namenode machine, you will be penalized by the delay of your request. Regards Bertrand

On Fri, May 17, 2013 at 3:17 PM, Niels Basjes ni...@basjes.nl wrote: Hi, i have another computer (which i have referred to as a server, since it is running tomcat), and this computer is NOT a part of the hadoop cluster (it doesn't run any of the hadoop daemons), but does submit jobs to the hadoop cluster via a JEE webapp interface. i need to check that the time on this computer is in sync with the time on the hadoop cluster. when i say check that the time is in sync, there is a defined tolerance/threshold difference in date/time that i am willing to accept (e.g. the date/time should be the same down to the minute). If you ensure (using NTP) that all your servers have the same time then you can simply query your local server for the time and you have the correct answer to your question. You are searching for a solution in the Hadoop API (where this does not exist) when the solution is present at a different level. -- Best regards / Met vriendelijke groeten, Niels Basjes
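[Editor's note: if the machine outside the cluster really cannot be put under NTP, one workaround is to query the namenode host's NTP service directly and compare the offset against the allowed tolerance. This is only a sketch under two assumptions that may not hold: Apache commons-net is on the classpath, and the namenode host actually runs a reachable NTP daemon.]

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class ClockSkewCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "namenode.example.com"; // hypothetical host
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000);      // don't hang forever on an unreachable host
            TimeInfo info = client.getTime(InetAddress.getByName(host));
            info.computeDetails();               // fills in offset/delay fields
            Long offsetMs = info.getOffset();    // remote clock minus local clock, in ms
            long toleranceMs = 60 * 1000L;       // "same down to the minute"
            if (offsetMs != null && Math.abs(offsetMs) > toleranceMs) {
                System.err.println("Clock skew of " + offsetMs + " ms exceeds tolerance");
            } else {
                System.out.println("Clocks are within tolerance (offset=" + offsetMs + " ms)");
            }
            client.close();
        }
    }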
Re: How to handle sensitive data
Simple, have your app encrypt the field prior to writing to HDFS. Also consider HBase. On Feb 14, 2013, at 10:35 AM, abhishek abhishek.dod...@gmail.com wrote: Hi all, we have some sensitive data in some particular fields (columns). Can I know how to handle sensitive data in Hadoop? How do different people handle sensitive data in Hadoop? Thanks Abhi Michael Segel | (m) 312.755.9623 Segel and Associates
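[Editor's note: a rough illustration of the field-level encryption suggestion. This is only a sketch, not a hardened design: key management, algorithm choice, and the commons-codec Base64 dependency are all assumptions.]

    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.commons.codec.binary.Base64;

    public class FieldEncryptor {
        private final SecretKeySpec key;
        private final SecureRandom random = new SecureRandom();

        public FieldEncryptor(byte[] rawKey) {          // e.g. a 16-byte AES key from a key store
            this.key = new SecretKeySpec(rawKey, "AES");
        }

        // Returns Base64(iv + ciphertext) so the value stays a printable string in the file.
        public String encryptField(String plaintext) throws Exception {
            byte[] iv = new byte[16];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes("UTF-8"));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.encodeBase64String(out);
        }
    }

The non-sensitive columns would be written as-is; only the sensitive ones are passed through encryptField() before the record lands in HDFS.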
Re: NN Memory Jumps every 1 1/2 hours
Hey, silly question... How long have you had 27 million files? I mean, can you correlate the number of files to the spate of OOMs? Even without problems... I'd say it would be a good idea to upgrade due to the probability of a lot of code fixes... If you're running anything pre 1.x, going to 1.7 Java wouldn't be a good idea. Having said that... outside of MapR, have any of the distros certified themselves on 1.7 yet?

On Dec 22, 2012, at 6:54 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I will give this a go. I have actually gone into JMX and manually triggered GC; no memory is returned. So I assumed something was leaking.

On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris afa...@linkedin.com wrote: I know this will sound odd, but try reducing your heap size. We had an issue like this where GC kept falling behind and we either ran out of heap or would be in full GC. By reducing heap, we were forcing concurrent mark sweep to occur and avoided both full GC and running out of heap space, as the JVM would collect objects more frequently.

On Dec 21, 2012, at 8:24 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I have an old Hadoop 0.20.2 cluster. Have not had any issues for a while (which is why I never bothered with an upgrade). Suddenly it OOMed last week. Now the OOMs happen periodically. We have a fairly large NameNode heap, Xmx 17GB. It is a fairly large FS, about 27,000,000 files. So the strangest thing is that every 1 and 1/2 hours the NN memory usage increases until the heap is full. http://imagebin.org/240287 We tried failing over the NN to another machine. We changed the Java version from 1.6_23 to 1.7.0. I have set the NameNode logs to debug and ALL, and I have done the same with the data nodes. Secondary NN is running and shipping edits and making new images. I am thinking something has corrupted the NN metadata and after enough time it becomes a time bomb, but this is just a total shot in the dark. Does anyone have any interesting troubleshooting ideas?
Re: Disks RAID best practice
Oleg, that's for an overall raid preference. Specifically for the 'control nodes' aka (NN, SN, JT, HM, ZK...) I tend to just use simple mirroring because these processes are not really I/O bound. (RAID-1). I guess you could go RAID-10 (Stripe and Mirrored) but that may be a little overkill and my preference comes from working in the RDBMS world. If we are using commodity servers, JBOD tends to be the preferred way of handling things. However, I've seen cases where people will use RAIDed Drives on a node for a couple of reasons. The nice thing about doing mirrored DN drives is that if you have a disk failure you just pop the drive and replace it. Much simpler. If we're looking at using NetApp's E Series in conjunction with a compute cluster, then you are using their raided configuration and can reduce the cluster's replication factor to 2 from 3. While its easy to recommend RAID on the control nodes, data nodes is a bit trickier. I mean you can run with straight JBOD and based on a cost issue, its the cheapest in terms of hardware. If you go with RAID on the DN, you reduce your storage density per node because you have redundancy in hardware. And this has an impact on your overall machine density and TCO. This is offset by easier and faster recovery time from some hardware failure events. Lets face it, the number one thing to fail is going to be your hard drives. So we are going to have to balance the costs against the benefits. Now I have to state the obvious caveats... 1) YMMV, 2) The factors which go in to the cluster design decision are going to be unique to the company setting up the cluster. These are IMHO, and you know what they say about opinions... ;-) HTH -Mike On Nov 1, 2012, at 7:52 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Do you mean RAID 10 for Master Node? What about DataNode? Thanks Oleg. On Thu, Nov 1, 2012 at 2:43 PM, Michael Segel michael_se...@hotmail.comwrote: I prefer RAID 10, but some say RAID 6. I thought NetApp used RAID 6 ? Its definitely an interesting discussion point though. -Mike On Nov 1, 2012, at 7:37 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , What is the best practice for DISKS RAID (Master and Data Nodes). Thanks in advance Oleg.
Re: measuring iops
You have two issues. 1) You need to know the throughput in terms of data transfer between disks and controller cards on the node. 2) The actual network throughput of having all of the nodes talking to one another as fast as they can. This will let you see your real limitations in the ToR Switch's fabric. Not sure why you really want to do this except to test the disk, disk controller, and then networking infrastructure of your ToR and then your backplane to connect multiple racks HTH -Mike On Oct 23, 2012, at 7:47 AM, Ravi Prakash ravi...@ymail.com wrote: Do you mean in a cluster being used by users, or as a benchmark to measure the maximum? The JMX page nn:port/jmx provides some interesting stats, but I'm not sure they have what you want. And I'm unaware of other tools which could. From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org; Ravi Prakash ravi...@ymail.com Sent: Monday, October 22, 2012 6:46 PM Subject: Re: measuring iops Is it possible to know how many reads and writes are occurring thru the entire cluster in a consolidated manner -- this does not include replication factors. On Mon, Oct 22, 2012 at 10:28 AM, Ravi Prakash ravi...@ymail.com wrote: Hi Rita, SliveTest can help you measure the number of reads / writes / deletes / ls / appends per second your NameNode can handle. DFSIO can be used to help you measure the amount of throughput. Both these tests are actually very flexible and have a plethora of options to help you test different facets of performance. In my experience, you actually have to be very careful and understand what the tests are doing for the results to be sensible. HTH Ravi From: Rita rmorgan...@gmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Monday, October 22, 2012 7:23 AM Subject: Re: measuring iops Anyone? On Sun, Oct 21, 2012 at 8:30 AM, Rita rmorgan...@gmail.com wrote: Hi, Was curious if there was a method to measure the total number of IOPS (I/O operations per second) on a HDFS cluster. -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.-- -- --- Get your facts first, then you can distort them as you please.--
Re: reg hadoop on AWS
Actually the best way to do it is to use EMR since everything is then built for you. On Oct 5, 2012, at 7:21 AM, Nitin Pawar nitinpawar...@gmail.com wrote: Hi Sudha, best way to use hadoop on aws is via whirr Thanks, nitin On Fri, Oct 5, 2012 at 4:45 PM, sudha sadhasivam sudhasadhasi...@yahoo.com wrote: Sir We tried to setup hadoop on AWS. The procedure is given. We face problem with the parameters needed for input and output files. Can somebody provide us with a sample exercise with steps for working on hadoop in AWS? thanking you Dr G Sudha -- Nitin Pawar
Re: Which hardware to choose
Well... If you're not running HBase, you're less harmed by minimal swapping so you could push the number of slots and over subscribe. The only thing I would have to suggest is that you monitor your system closely as you adjust the number of slots. You have to admit though, its fun to tune the cluster. :-) On Oct 3, 2012, at 12:09 PM, J. Rottinghuis jrottingh...@gmail.com wrote: Of course it all depends... But something like this could work: Leave 1-2 GB for the kernel, pagecache, tools, overhead etc. Plan 3-4 GB for Datanode and Tasktracker each Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot. Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run). As Micheal pointed out the ratio of cores (hyperthreads) per disk matters. With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers Try, test, measure, adjust, rinse, repeat. Cheers, Joep On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.comwrote: All configs are per node. No HBase, only Hive and Pig installed On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote: I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
Re: Which hardware to choose
I think he's saying that its 24 maps 8 reducers per node and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 c2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
Re: Learning about different distributions of Hadoop
Now that's a loaded question. I'm going to plead the 5th because no matter how I answer it, I will probably piss someone off. ;-P They all have their own respective strengths and weaknesses. (Like that's stopped me before. ;-) -Mike On Aug 8, 2012, at 10:53 AM, Harit Himanshu harit.subscripti...@gmail.com wrote: Hello I have a very basic question - There are various flavors of hadoop by Apache, Cloudera, MapR, HortonWorks(may be more I am not aware of). I would like to learn what are the differences between these distributions and how do I know which distribution is best for me? I am current using Apache Hadoop Thank you + Harit
Re: Learning about different distributions of Hadoop
Well Scott kinda side stepped MapR in his response. ;-) On Aug 8, 2012, at 11:20 AM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: I agree with Scoot! The first question to answer is Do you have a problem with your current distribution? Thanks Serge On 8/8/12 9:17 AM, Scott Fines scottfi...@gmail.com wrote: That's a bit like asking people what the best Linux Distro is..they all serve (mostly) the same function, and you're likely to start a religious war by stating their differences. The main point running through all the different flavors of Hadoop is that they are all Hadoop. The differences only come from the chosen patch sets, which are all open-sourced anyway. At least in theory, you could rebuild Cloudera/Hortonworks/whatever just by applying the right sequences of patch sets to core Hadoop. The real question is: Are you happy with what you are currently using? If so, why worry about it? If not, why are you unhappy? Answering that question is likely to give you the guidance you would like in terms of what flavor you wish to pick. Scott On Aug 8, 2012, at 11:10 AM, Michael Segel wrote: Now that's a loaded question. I'm going to plead the 5th because no matter how I answer it, I will probably piss someone off. ;-P They all have their own respective strengths and weaknesses. (Like that's stopped me before. ;-) -Mike On Aug 8, 2012, at 10:53 AM, Harit Himanshu harit.subscripti...@gmail.com wrote: Hello I have a very basic question - There are various flavors of hadoop by Apache, Cloudera, MapR, HortonWorks(may be more I am not aware of). I would like to learn what are the differences between these distributions and how do I know which distribution is best for me? I am current using Apache Hadoop Thank you + Harit
Re: Learning about different distributions of Hadoop
Uhm... I'd take that report with a grain of salt. ;-) On Aug 8, 2012, at 4:16 PM, Alberto Ortiz aort...@gmail.com wrote: I recently came across this Forrester report; it may give you more information to consider for a decision: http://www.forrester.com/dl/The+Forrester+Wave+Enterprise+Hadoop+Solutions+Q1+2012/-/E-RES60755/PDF?oid=1-KCJIQAaction=PDFobjectid=RES60755 Regards On Wed, Aug 8, 2012 at 2:32 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: nah! I am happy with Apache Hadoop as I am learning, but I saw so many distributions that I wanted to know about them. On Aug 8, 2012, at 9:10 AM, Michael Segel wrote: They all have their own respective strengths and weaknesses.
Re: migrate cluster to different datacenter
The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem. On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani bumble@gmail.com wrote: Given the size of data, there can be several approaches here: 1. Moving the boxes Not possible, as I suppose the data must be needed for 24x7 analytics. 2. Mirroring the data. This is a good solution. However, if you have data being written/removed continuously (if a part of live system), there are chances of losing some of the data during mirroring happens, unless a) You block writes/updates during that time (if you do so, that would be as good as unplugging and moving the machine around), or, b) Keep a track of what was modified since you started the mirroring process. I would recommend you to go with 2b) because it minimizes downtime. Here is how I think you can do it, by using some of the tools provided by Hadoop itself. a) You can use some fast distributed copying tool to copy large chunks of data. Before you kick-off with this, you can create a utility that tracks the modification of data made to your live system while copying is going on in the background. The utility will log the modifications into an audit trail. b) Once you're done copying the files, allow the new data store replication to catch up by reading the real-time modifications that were made, from your utility's log file. Once sync'ed up you can begin with the minimal downtime by switching off the JobTracker in live cluster so that new files are not created. c) As soon as you reach the last chunk of copying, change the DNS entries so that the hostnames referenced by the Hadoop jobs points to the new location. d) Turn on the JobTracker for the new cluster. e) Enjoy a drink with the money you saved by not using other paid third party solutions and pat your back! ;) The key of the above solution is to make data copying of step a) as fast as possible. Lesser the time, lesser the contents in audit trail, lesser the overall downtime. You can develop some in house solution for this, or use DistCp, provided by Hadoop that uses copies over the data using Map/Reduce. On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel michael_se...@hotmail.comwrote: Sorry at 1PB of disk... compression isn't going to really help a whole heck of a lot. Your networking bandwidth will be your bottleneck. So lets look at the problem. How much down time can you afford? What does your hardware look like? How much space do you have in your current data center? You have 1PB of data. OK, what does the access pattern look like? There are a couple of ways to slice and dice this. How many trucks do you have? On Aug 3, 2012, at 4:24 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: Moving 1 PB of data would take loads of time, - check if this new data center provides something similar to http://aws.amazon.com/importexport/ - Consider multi part uploading of data - consider compressing the data On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote: thanks for response. Physical move is not a choice in this case. Purely looking for copying data and how to catch up with the update of a file while it is being migrated. On Fri, Aug 3, 2012 at 12:40 PM, Chen He airb...@gmail.com wrote: sometimes, physically moving hard drives helps. :) On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, We have a plan to migrate Hadoop cluster to a different datacenter where we can triple the size of the cluster. 
Currently, our 0.20.2 cluster have around 1PB of data. We use only Java/Pig. I would like to get some input how we gonna handle with transferring 1PB of data to a new site, and also keep up with new files that thrown into cluster all the time. Happy friday !! P
Re: Decommissioning runs forever
Did you change the background bandwidth from 10MB/s to something higher? Worst case is that you can kill the DN and wait 10 mins for the cluster to realize it's down and then rebalance. (It's ugly, but it works.)

On Aug 6, 2012, at 7:59 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, I tried decommissioning a node in my Hadoop cluster. I am running Apache Hadoop 1.0.2 and ours is a four node cluster. I also have HBase installed in my cluster. I have shut down the region server on this node. For decommissioning, I did the following steps:

* Added the following XML in hdfs-site.xml:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/full/path/of/host/exclude/file</value>
  </property>

* Ran $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

But node decommissioning has been running for the last 6 hrs. I don't know when it will get over. I am in need of this node for other activities. From the HDFS health status JSP:

Cluster Summary
338 files and directories, 200 blocks = 538 total. Heap Size is 16.62 MB / 888.94 MB (1%)
Configured Capacity : 1.35 TB
DFS Used : 759.57 MB
Non DFS Used : 179.36 GB
DFS Remaining : 1.17 TB
DFS Used% : 0.05 %
DFS Remaining% : 86.92 %
Live Nodes : 4
Dead Nodes : 0
Decommissioning Nodes : 1
Number of Under-Replicated Blocks : 129

Please share if you have any idea. Thanks a lot. Regards, Anand.C
Re: Decommissioning runs forever
Yup. By default it looks like 10MB/sec. With 1GbE, you could probably push this up to 100MB/sec or even higher, depending on your cluster usage. 10GbE... obviously higher.

On Aug 6, 2012, at 8:15 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Are you referring to this setting: dfs.balance.bandwidthPerSec?

-Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Tuesday, August 07, 2012 6:36 AM To: common-user@hadoop.apache.org Subject: Re: Decommissioning runs forever

Did you change the background bandwidth from 10MB/s to something higher? Worst case is that you can kill the DN and wait 10 mins for the cluster to realize it's down and then rebalance. (It's ugly, but it works.)

On Aug 6, 2012, at 7:59 PM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, I tried decommissioning a node in my Hadoop cluster. I am running Apache Hadoop 1.0.2 and ours is a four node cluster. I also have HBase installed in my cluster. I have shut down the region server on this node. For decommissioning, I did the following steps:

* Added the following XML in hdfs-site.xml:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/full/path/of/host/exclude/file</value>
  </property>

* Ran $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

But node decommissioning has been running for the last 6 hrs. I don't know when it will get over. I am in need of this node for other activities. From the HDFS health status JSP:

Cluster Summary
338 files and directories, 200 blocks = 538 total. Heap Size is 16.62 MB / 888.94 MB (1%)
Configured Capacity : 1.35 TB
DFS Used : 759.57 MB
Non DFS Used : 179.36 GB
DFS Remaining : 1.17 TB
DFS Used% : 0.05 %
DFS Remaining% : 86.92 %
Live Nodes : 4
Dead Nodes : 0
Decommissioning Nodes : 1
Number of Under-Replicated Blocks : 129

Please share if you have any idea. Thanks a lot. Regards, Anand.C
Re: migrate cluster to different datacenter
Sorry at 1PB of disk... compression isn't going to really help a whole heck of a lot. Your networking bandwidth will be your bottleneck. So lets look at the problem. How much down time can you afford? What does your hardware look like? How much space do you have in your current data center? You have 1PB of data. OK, what does the access pattern look like? There are a couple of ways to slice and dice this. How many trucks do you have? On Aug 3, 2012, at 4:24 PM, Harit Himanshu harit.subscripti...@gmail.com wrote: Moving 1 PB of data would take loads of time, - check if this new data center provides something similar to http://aws.amazon.com/importexport/ - Consider multi part uploading of data - consider compressing the data On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote: thanks for response. Physical move is not a choice in this case. Purely looking for copying data and how to catch up with the update of a file while it is being migrated. On Fri, Aug 3, 2012 at 12:40 PM, Chen He airb...@gmail.com wrote: sometimes, physically moving hard drives helps. :) On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, We have a plan to migrate Hadoop cluster to a different datacenter where we can triple the size of the cluster. Currently, our 0.20.2 cluster have around 1PB of data. We use only Java/Pig. I would like to get some input how we gonna handle with transferring 1PB of data to a new site, and also keep up with new files that thrown into cluster all the time. Happy friday !! P
Re: Merge Reducers Output
You really don't want to run a single reducer unless you know that you don't have a lot of mappers. As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code. If your input and output don't match, then you can use the existing code as a combiner, and then write a new reducer. It could as easily be an identity reducer too. (Don't know the exact problem.) So here's a silly question. Why wouldn't you want to run a combiner? On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote: Its not clear to me that you need custom input formats 1) Getmerge might work or 2) Simply run a SINGLE reducer job (have mappers output static final int key=1, or specify numReducers=1). In this case, only one reducer will be called, and it will read through all the values. On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. Is there better way to merge reducers output files? -- Jay Vyas MMSB/UCHC
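[Editor's note: a rough sketch of the combiner-plus-reducer wiring described above. MyMapper and MyReducer are hypothetical classes, and the combiner's input and output types must match the map output types for this to be legal.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "merge-output");
            job.setJarByClass(MergeJob.class);

            job.setMapperClass(MyMapper.class);        // hypothetical mapper
            job.setCombinerClass(MyReducer.class);     // same code runs map-side first...
            job.setReducerClass(MyReducer.class);      // ...and again as the reducer,
            // or use the framework's identity reducer instead:
            // job.setReducerClass(Reducer.class);
            job.setNumReduceTasks(1);                  // single output file

            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(BytesWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }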
Re: Merge Reducers Output
Sorry, but the OP was saying he had map/reduce job where the job had multiple reducers where he wanted to then combine the output to a single file. While you could merge the output files, you could also use a combiner then an identity reducer all within the same M/R job. On Jul 31, 2012, at 10:10 AM, Raj Vishwanathan rajv...@yahoo.com wrote: Is there a requirement for the final reduce file to be sorted? If not, wouldn't a map only job ( + a combiner, ) and a merge only job provide the answer? Raj From: Michael Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Sent: Tuesday, July 31, 2012 5:24 AM Subject: Re: Merge Reducers Output You really don't want to run a single reducer unless you know that you don't have a lot of mappers. As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code. If your input and output don't match, then you can use the existing code as a combiner, and then write a new reducer. It could as easily be an identity reducer too. (Don't know the exact problem.) So here's a silly question. Why wouldn't you want to run a combiner? On Jul 31, 2012, at 12:08 AM, Jay Vyas jayunit...@gmail.com wrote: Its not clear to me that you need custom input formats 1) Getmerge might work or 2) Simply run a SINGLE reducer job (have mappers output static final int key=1, or specify numReducers=1). In this case, only one reducer will be called, and it will read through all the values. On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. 
Is there better way to merge reducers output files? -- Jay Vyas MMSB/UCHC
Re: task jvm bootstrapping via distributed cache
Hi Stan, If I understood your question... you want to ship a jar to the nodes where the task will run prior to the start of the task? Not sure what it is you're trying to do... Your example isn't really clear. See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html When you pull stuff out of the cache you get the path to the jar. Or you should be able to get it. I'm assuming you're doing this in your setup() method? Can you give a better example, there may be a different way to handle this... On Jul 31, 2012, at 3:50 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote: Forwarding to common-user to hopefully get more exposure... -- Forwarded message -- From: Stan Rosenberg stan.rosenb...@gmail.com Date: Tue, Jul 31, 2012 at 11:55 AM Subject: Re: task jvm bootstrapping via distributed cache To: mapreduce-u...@hadoop.apache.org I am guessing this is either a well-known problem or an edge case. In any case, would it be a bad idea to designate predetermined output paths? E.g., DistributedCache.addCacheFileInto(uri, conf, outputPath) would attempt to copy the cached file into the specified path resolving to a task's local filesystem. Thanks, stan On Mon, Jul 30, 2012 at 6:23 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote: Hi, I am seeking a way to leverage hadoop's distributed cache in order to ship jars that are required to bootstrap a task's jvm, i.e., before a map/reduce task is launched. As a concrete example, let's say that I need to launch with '-javaagent:/path/profiler.jar'. In theory, the task tracker is responsible for downloading cached files onto its local filesystem. However, the absolute path to a given cached file is not known a priori; however, we need the path in order to configure '-javaagent'. Is this currently possible with the distributed cache? If not, is the use case appealing enough to open a jira ticket? Thanks, stan
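[Editor's note: one approach worth trying for the -javaagent case is to symlink the cached jar under a fixed name and reference it with a relative path in the child JVM options. This is only a sketch: it assumes the distributed-cache symlink exists in the task working directory before the child JVM launches, which should be verified on the version in use, and the jar path is hypothetical.]

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class ProfiledJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Create a symlink named "profiler.jar" in each task's working directory.
            DistributedCache.createSymlink(conf);
            DistributedCache.addCacheFile(
                    new URI("hdfs:///apps/lib/profiler.jar#profiler.jar"), conf);

            // Relative path, because the child JVM starts in the task work directory.
            conf.set("mapred.child.java.opts", "-Xmx1g -javaagent:./profiler.jar");

            Job job = new Job(conf, "profiled-job");
            // ... set mapper/reducer/input/output as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }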
Re: Merge Reducers Output
Why not use a combiner?

On Jul 30, 2012, at 7:59 PM, Mike S wrote: As asked several times, I need to merge my reducers' output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper reads a complete file in full into memory and outputs that, and then only one reducer has to merge them together. To do so, I had to write a custom file input reader that reads the complete file into memory and then another custom file output format to append each reducer item's bytes together. This is how my mapper and reducer look:

public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable> {
    @Override
    public void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}

public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
    @Override
    public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        for (BytesWritable value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}

I still have to have one reducer and that is a bottleneck. Please note that I must do this merging, as the users of my MR job are outside my hadoop environment and need the result as one file. Is there a better way to merge reducers' output files?
Re: Counting records
Look at using a dynamic counter. You don't need to set up or declare an enum. The only caveat is that counters are passed back to the JT by each task and are stored in memory. On Jul 23, 2012, at 9:32 AM, Kai Voigt wrote: http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/
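[Editor's note: a minimal sketch of the dynamic-counter approach; the group name, counter name, and matching criterion here are just examples.]

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {        // whatever the criteria are
                // Dynamic counter: no enum needed, just group and counter names.
                context.getCounter("MyRecords", "Matched").increment(1);
            }
            // Nothing is emitted, so the job can run with zero reducers.
        }
    }

    // Driver side, after job.waitForCompletion(true):
    //   long matched = job.getCounters().findCounter("MyRecords", "Matched").getValue();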
Re: Counting records
If the task fails the counter for that task is not used. So if you have speculative execution turned on and the JT kills a task, it won't affect your end results. Again the only major caveat is that the counters are in memory so if you have a lot of counters... On Jul 23, 2012, at 4:52 PM, Peter Marron wrote: Yeah, I thought about using counters but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before another replacement Mapper is started? Otherwise in the case of any Mapper failure I'm going to get an overcount am I not? Or is there some way to make sure that counters have the correct semantics in the face of failures? Peter Marron -Original Message- From: Dave Shine [mailto:Dave.Shine@channelintelligence. com] Sent: 23 July 2012 15:35 To: common-user@hadoop.apache.org Subject: RE: Counting records You could just use a counter and never emit anything from the Map(). Use the getCounter(MyRecords, RecordTypeToCount).increment(1) whenever you find the type of record you are looking for. Never call output.collect(). Call the job with reduceTasks(0). When the job finishes, you can programmatically get the values of all counters including the one you create in the Map() method. Dave Shine Sr. Software Engineer 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com -Original Message- From: Peter Marron [mailto:Peter.Marron@trilliumsoftware. com] Sent: Monday, July 23, 2012 10:25 AM To: common-user@hadoop.apache.org Subject: Counting records Hi, I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. For the purposes of discussion I'll assume that I'm using a standard TextInputFormat. (I don't think that this changes things too much.) To simplify (a fair bit) I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want to get the performance and reliability that MapReduce offers. So the obvious and simple approach is to have my Mapper check whether each record meets the criteria and emit a 0 or a 1. Then I could use a combiner which accumulates (like a LongSumReducer) and use this as a reducer as well, and I am sure that that would work fine. However it seems massive overkill to have all those 1s and 0s emitted and stored on disc. It seems tempting to have the Mapper accumulate the count for all of the records that it sees and then just emit once at the end the total value. This seems simple enough, except that the Mapper doesn't seem to have any easy way to know when it is presented with the last record. Now I could just make the Mapper take a copy of the OutputCollector for each record called and then in the close method it could do a single emit. However, although, this looks like it would work with the current implementation, there seem to be no guarantees that the collector is valid at the time that the close is called. This just seems ugly. Or I could get the Mapper to record the first offset that it sees and read the split length using report.getInputSplit().getLength() and then it could monitor how far it is through the split and it should be able to detect the last record. It looks like the MapRunner class creates a Mapper object and uses it to process a split, and so it looks like it's safe to store state in the mapper class between invocations of the map method. 
(But is this just an implementation artefact? Is the mapper class supposed to be completely stateless?) Maybe I should have a custom InputFormat class and have it flag the last record by placing some extra information in the key? (Assuming that the InputFormant has enough information from the split to be able to detect the last record, which seems reasonable enough.) Is there some blessed way to do this? Or am I barking up the wrong tree because I should really just generate all those 1s and 0s and accept the overhead? Regards, Peter Marron Trillium Software UK Limited The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.
Re: Concurrency control
It goes beyond the atomic writes... There isn't the concept of transactions in HBase. He could also be talking about Hive, which would be appropriate for this mailing list... -Mike On Jul 18, 2012, at 11:49 AM, Harsh J wrote: Hi, Do note that there are many users who haven't used Teradata out there and they may not directly pick up what you meant to say here. Since you're speaking of Tables, I am going to assume you mean HBase. If what you're looking for is atomicity, HBase does offer it already. If you want to order requests differently, depending on a condition, the HBase coprocessors (new from Apache HBase 0.92 onwards) provide you an ability to do that too. If your question is indeed specific to HBase, please ask it in a more clarified form on the u...@hbase.apache.org lists. If not HBase, do you mean read/write concurrency over HDFS files? Cause HDFS files do not allow concurrent writers (one active lease per file), AFAICT. On Wed, Jul 18, 2012 at 9:09 PM, saubhagya dey saubhagya@gmail.com wrote: how do i manage concurrency in hadoop like we do in teradata. We need to have a read and write lock when simultaneous the same table is being hit with a read query and write query -- Harsh J
Re: Hive/Hdfs Connector
Have you tried using Hive's thrift server? On Jul 5, 2012, at 10:20 AM, Sandeep Reddy P wrote: We use hive Jdbc drivers to connect to RDMS. But we need our application which generates HQL to connect directly to Hive. On Thu, Jul 5, 2012 at 11:12 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Sandeep You can connect to hdfs from a remote machine if that machine is reachable from the cluster, and you have the hadoop jars and right hadoop configuration files. Similarly you can issue HQL programatically from your application using hive jdbc driver. --Original Message-- From: Sandeep Reddy P To: common-user@hadoop.apache.org To: cdh-u...@cloudera.org Cc: t...@cloudwick.com ReplyTo: common-user@hadoop.apache.org Subject: Hive/Hdfs Connector Sent: Jul 5, 2012 20:32 Hi, We have some application which generates SQL queries and connects to RDBMS using connectors like JDBC for mysql. Now if we generate HQL using our application is there any way to connect to Hive/Hdfs using connectors?? I need help on what connectors i have to use? We dont want to pull data from Hive/Hdfs to RDBMS instead we need our application to connect to Hive/Hdfs. -- Thanks, sandeep Regards Bejoy KS Sent from handheld, please excuse typos. -- Thanks, sandeep
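[Editor's note: if the application already speaks JDBC, pointing it at the Hive Thrift server is mostly a connection-string change. A minimal sketch follows; the host, port, and table are assumptions, and this uses the HiveServer1 JDBC driver that shipped with Hive at the time.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            // Assumes the Hive Thrift server is running, e.g.: hive --service hiveserver -p 10000
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://hiveserver.example.com:10000/default", "", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT f1, f2, f3 FROM test3");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }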
Re: Multiple cores vs multiple nodes
Hi, First, you have to explain what you mean by 'equivalent' . The short answer is that it depends. The longer answer is that you have to consider cost in your design. The whole issue of design is to maintain the correct ratio of cores to memory and cores to spindles while optimizing the box within the cost, space and hardware (box configurations) limitations. Note that you can sacrifice some of the ratio, however, you will leave some of the performance on the table. On Jul 1, 2012, at 6:13 AM, Safdar Kureishy wrote: Hi, I have a reasonably simple question that I thought I'd post to this list because I don't have enough experience with hardware to figure this out myself. Let's assume that I have 2 separate cluster setups for slave nodes. The master node is a separate machine *outside* these clusters: *Setup A*: 28 nodes, each with a 2-core CPU, 8 GB RAM and 1 SATA drives (1 TB each) *Setup B*: 7 nodes, each with a 8-core CPU, 32 GB Ram and 4 SATA drives (1 TB each) Note that I have maintained the same *core:memory:spindle* ratio above. In essence, setup B has the same overall processing + memory + spindle capacity, but achieved with 4 times fewer nodes. Ignoring the* cost* of each node above, and assuming a 10Gb Ethernet connectivity and the same speed-per-core across nodes in both the scenarios above, are Setup A and Setup B equivalent to each other in the context of setting up a Hadoop cluster? Or will the relative performance be different? Excluding the network connectivity between the nodes, what would be some other criteria that might give one setup an edge over the other, for regular Hadoop jobs? Also, assuming the same type of Hadoop jobs on both clusters, how different would the load experienced by the master node be for each setup above? Thanks in advance, Safdar
Re: Can I remove a folder on HDFS when decommissioning a data node?
Well, there you go... care to open a Jira for it? It would be very helpful... The OP's question presents a very good use case. Thx -Mike On Jun 26, 2012, at 10:00 AM, Harsh J wrote: Hi, On Tue, Jun 26, 2012 at 8:21 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, Yes you can remove a file while there is a node or node(s) being decommissioned. I wonder if there's a way to manually clear out the .trash which may also give you more space. There's no way to do that across all users presently (for each user, -expunge works though), other than running a manual rm (-skipTrash) command for the .Trash files under each /user/* directory. I suppose we can add in an admin command to clear out all trash manually (forcing, other than relying on the periodic auto-emptier thread). -- Harsh J
Re: Hive error when loading csv data.
Alternatively you could write a simple script to convert the CSV to a pipe-delimited file, so that the quoted field "abc,def" comes through as a single field, abc,def.

On Jun 26, 2012, at 2:51 PM, Harsh J wrote: Hive's delimited-fields-format record reader does not handle quoted text that carries the same delimiter within it. Excel supports such records, so it reads them fine. You will need to create your table with a custom InputFormat class that can handle this (try using OpenCSV readers, they support this), instead of relying on Hive to do this for you. If you're successful in your approach, please also consider contributing something back to Hive/Pig to help others.

On Wed, Jun 27, 2012 at 12:37 AM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote: Hi all, I have a csv file with 46 columns but I'm getting an error when I do some analysis on that data. For simplification I have taken 3 columns and now my csv is like:

c,zxy,xyz
d,"abc,def",abcd

I have created a table for this data using:

hive> create table test3 (f1 string, f2 string, f3 string) row format delimited fields terminated by ',';
OK
Time taken: 0.143 seconds
hive> load data local inpath '/home/training/a.csv' into table test3;
Copying data from file:/home/training/a.csv
Copying file: file:/home/training/a.csv
Loading data to table default.test3
OK
Time taken: 0.276 seconds
hive> select * from test3;
OK
c zxy xyz
d abc def
Time taken: 0.156 seconds

When I do select f2 from test3; my results are:

OK
zxy
abc

but this should be abc,def. When I open the same csv file with Microsoft Excel I get abc,def. How should I solve this error??
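[Editor's note: a sketch of that conversion using the OpenCSV reader Harsh mentions. File names are examples, and this assumes none of the field values contain a pipe character.]

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import au.com.bytecode.opencsv.CSVReader;

    public class CsvToPipe {
        public static void main(String[] args) throws Exception {
            CSVReader reader = new CSVReader(new FileReader("/home/training/a.csv"));
            PrintWriter out = new PrintWriter(new FileWriter("/home/training/a.psv"));
            String[] fields;
            while ((fields = reader.readNext()) != null) {   // handles quoted, comma-bearing fields
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < fields.length; i++) {
                    if (i > 0) sb.append('|');
                    sb.append(fields[i]);
                }
                out.println(sb.toString());
            }
            out.close();
            reader.close();
        }
    }

The Hive table would then be created with row format delimited fields terminated by '|'.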
Re: oozie workflow file for teragen and terasort
There's a series of articles on Oozie by Boris Lublinsky over on InfoQ. http://www.infoq.com/articles/introductionOozie On Jun 23, 2012, at 7:42 PM, Hadoop James wrote: I want to be able to submit my teragen and terasort jobs via oozie. I have tried different things in workflow.xml to no avail. Has anybody had any success doing so? Can you share your workflow.xml file ? Many thanks -James
Re: Split brain - is it possible in hadoop?
In your example, you only have one active Name Node. So how would you encounter a 'split brain' scenario? Maybe it would be better if you defined what you mean by a split brain? -Mike On Jun 18, 2012, at 8:30 PM, hdev ml wrote: All hadoop contributors/experts, I am trying to simulate split brain in our installation. There are a few things we want to know 1. Does data corruption happen? 2. If Yes in #1, how to recover from it. 3. What are the corrective steps to take in this situation e.g. killing one namenode etc So to simulate this I took following steps. 1. We already have a healthy test cluster, consisting of 4 machines. One machine runs namenode and a datanode, other machine runs secondarynamenode and a datanode, 3rd runs jobtracker and a datanode, and 4th one just a datanode. 2. Copied the hadoop installation folder to a new location in the datanode. 3. Kept all configurations same in hdfs-site and core-site xmls, except renamed the fs.default.name to a different URI 4. The namenode directory - dfs.name.dir was pointing to the same shared NFS mounted directory to which the main namenode points to. I started this standby namenode using following command bin/hadoop-daemon.sh --config conf --hosts slaves start namenode It errored out saying that the directory is already locked, which is an expected behaviour. The directory has been locked by the original namenode. So I changed the dfs.name.dir to some other folder, and issued the same command. It fails with message - namenode has not been formatted, which is also expected. This makes me think - does splitbrain situation really occur in hadoop? My understanding is that split brain happens because of timeouts on the main namenode. The way it happens is, when the timeout occurs, the HA implementation - Be it Linux HA, Veritas etc., thinks that the main namenode has died and tries to start the standby namenode. The standby namenode starts up and then main namenode comes back from the timeout phase and starts functioning as if nothing happened, giving rise to 2 namenodes in the cluster - Split Brain. Considering the error messages and the above understanding, I cannot point 2 different namenodes to same directory, because the main namenode isn't responding but has locked the directory. So can I safely conclude that split brain does not occur in hadoop? Or am I missing any other situation where split brain happens and the namenode directory is not locked, thus allowing the standby namenode also to start up? Has anybody encountered this? Any help is really appreciated. Harshad
Re: Hardware specs calculation for io
You will want something in between... 8 cores means 8 spindles. 16 cores means 16 spindles. You may want to up the memory, especially if you're running or thinking about running HBase. If you go beyond 4 spindles, you will saturate your 1GBe link. If you think about Type B, you will need 10GBe. On Jun 13, 2012, at 9:36 AM, Sandeep Reddy P wrote: Hi, I need to know difference between two hardware configurations below for 24TB of data. (slave machines only for hadoop,hive and pig) TYPE A: 2 quad core, 32 GB memory, 6 x 1TB drives(6TB / machine) TYPE B: 4 quad core, 48 GB memory, 12 x 1TB drives (12TB / machine) suppose we choose 4 type A machines for 24tb of data and 2 type b machines for 24 tb data. Assuming disk io speed is constant (7200 RPM sata), cost is same for 4Type A and 2 Type B machines. I need which type of machines will give me best results in terms of performance. -- Thanks, sandeep
Re: Hadoop with Sharded MySql
OK, just tossing out some ideas... Take them with a grain of salt... With Hive you can create external tables. Write a custom Java app that creates one thread for each server. Then iterate through each table, selecting the rows you want. You can then easily write the output directly to HDFS in each thread. It's not a map reduce, but it should be fairly efficient. You can even expand on this if you want. Java and JDBC... Sent from my iPhone

On Jun 1, 2012, at 11:30 AM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, I'm trying to get data into HDFS directly from a sharded database and expose it to the existing Hive infrastructure. (We are currently doing it this way: mysql -> staging server -> hdfs put commands -> hdfs, which is taking a lot of time.) If we have a way of running a single sqoop job across all shards for a single table, I believe it makes life easier in terms of monitoring and exception handling. Thanks, Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote: Hi Sujith, Srinivas is asking how to import data into HDFS using sqoop. I believe he must have thought it out well before designing the entire architecture/solution. He has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use-case. Thanks, Anil

On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, instead of pulling 70K tables from mysql into hdfs, take a dump of all 30 tables and put it into the HBase database. If you pulled 70K tables from mysql into hdfs, you would need to use Hive, but modification will not be possible in Hive :( *@ common-user :* please correct me if I am wrong. Kind Regards Sujit Dhamale (+91 9970086652)

On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs.

On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, We are trying to implement sqoop in our environment, which has 30 mysql sharded databases; all the databases have around 30 databases with 150 tables in each of the databases, which are all sharded (horizontally sharded, meaning the data is divided across all the tables in mysql). The problem is that we have a total of around 70K tables which need to be pulled from mysql into hdfs. So, my question is: is generating 70K sqoop commands and running them in parallel feasible or not? Also, doing incremental updates is going to be like invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs. The main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing it, or is sqoop a good candidate for this type of scenario? Currently the same process is done by generating tsv files on the mysql server and dumping them onto a staging server, and from there we'll generate hdfs put statements.. Appreciate your suggestions !!! Thanks, Srinivas Surasani

-- Thanks Regards, Anil Gupta -- Regards, -- Srinivas srini...@cloudwick.com
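[Editor's note: a bare-bones sketch of the threaded JDBC-to-HDFS approach Mike describes. Shard URLs, credentials, table name, and output paths are placeholders; error handling and retries are omitted, and for most setups Sqoop would still be the first thing to try.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShardExtractor {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            final FileSystem fs = FileSystem.get(new Configuration());
            final List<String> shards = Arrays.asList(
                    "jdbc:mysql://shard01/db", "jdbc:mysql://shard02/db"); // placeholders

            ExecutorService pool = Executors.newFixedThreadPool(shards.size());
            for (int i = 0; i < shards.size(); i++) {
                final String url = shards.get(i);
                final Path out = new Path("/staging/my_table/shard-" + i + ".tsv");
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            Connection con = DriverManager.getConnection(url, "user", "pass");
                            Statement st = con.createStatement();
                            ResultSet rs = st.executeQuery("SELECT * FROM my_table");
                            ResultSetMetaData md = rs.getMetaData();
                            FSDataOutputStream os = fs.create(out, true);
                            while (rs.next()) {
                                StringBuilder row = new StringBuilder();
                                for (int c = 1; c <= md.getColumnCount(); c++) {
                                    if (c > 1) row.append('\t');
                                    row.append(rs.getString(c));
                                }
                                row.append('\n');
                                os.write(row.toString().getBytes("UTF-8"));
                            }
                            os.close();
                            rs.close();
                            st.close();
                            con.close();
                        } catch (Exception e) {
                            e.printStackTrace(); // real code needs proper handling/retries
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }

The /staging/my_table directory could then be exposed as a Hive external table, as suggested above.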
Re: How to Integrate LDAP in Hadoop ?
I believe that their CDH3u3 or later has this... parameter. (Possibly even earlier.) On May 29, 2012, at 6:12 AM, samir das mohapatra wrote: It is cloudera version .20 On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.comwrote: Which release? Version? I believe there are variables in the *-site.xml that allow LDAP integration ... Sent from a remote device. Please excuse any typos... Mike Segel On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com wrote: Hi All, Did any one work on hadoop with LDAP integration. Please help me for same. Thanks samir
Re: Pragmatic cluster backup strategies?
Hi, That's not a back up strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10 machine cluster with 40Tb of storage, obviously as this gets full actually trying to create an offsite backup becomes a problem unless we build another 10 machine cluster (too expensive right now). Not sure if it will help but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have the copies of all the blocks with a replication factor of 3? Apart form the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other peoples experiences/opinions are for backing up cluster data? Thanks Darrell.
Re: is hadoop suitable for us?
You are going to have to put HDFS on top of your SAN. The issue is that you introduce overhead and latencies by having attached storage rather than the drives physically on the bus within the case. Also I'm going to assume that your SAN is using RAID. One of the side effects of using a SAN is that you could reduce your replication factor from 3 to 2. (The SAN already protects you from disk failures if you're using RAID) On May 17, 2012, at 11:10 PM, Pierre Antoine DuBoDeNa wrote: You used HDFS too? or storing everything on SAN immediately? I don't have number of GB/TB (it might be about 2TB so not really that huge) but they are more than 100 million documents to be processed. In a single machine currently we can process about 200.000 docs/day (several parsing, indexing, metadata extraction has to be done). So in the worst case we want to use the 50 VMs to distribute the processing.. 2012/5/17 Sagar Shukla sagar_shu...@persistent.co.in Hi PA, In my environment, we had a SAN storage and I/O was pretty good. So if you have similar environment then I don't see any performance issues. Just out of curiosity - what amount of data are you looking forward to process ? Regards, Sagar -Original Message- From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com] Sent: Thursday, May 17, 2012 8:29 PM To: common-user@hadoop.apache.org Subject: Re: is hadoop suitable for us? Thanks Sagar, Mathias and Michael for your replies. It seems we will have to go with hadoop even if I/O will be slow due to our configuration. I will try to update on how it worked for our case. Best, PA 2012/5/17 Michael Segel michael_se...@hotmail.com The short answer is yes. The longer answer is that you will have to account for the latencies. There is more but you get the idea.. Sent from my iPhone On May 17, 2012, at 5:33 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We have large amount of text files that we want to process and index (plus applying other algorithms). The problem is that our configuration is share-everything while hadoop has a share-nothing configuration. We have 50 VMs and not actual servers, and these share a huge central storage. So using HDFS might not be really useful as replication will not help, distribution of files have no meaning as all files will be again located in the same HDD. I am afraid that I/O will be very slow with or without HDFS. So i am wondering if it will really help us to use hadoop/hbase/pig etc. to distribute and do several parallel tasks.. or is better to install something different (which i am not sure what). We heard myHadoop is better for such kind of configurations, have any clue about it? For example we now have a central mySQL to check if we have already processed a document and keeping there several metadata. Soon we will have to distribute it as there is not enough space in one VM, But Hadoop/HBase will be useful? we don't want to do any complex join/sort of the data, we just want to do queries to check if already processed a document, and if not to add it with several of it's metadata. We heard sungrid for example is another way to go but it's commercial. We are somewhat lost.. so any help/ideas/suggestions are appreciated. Best, PA 2012/5/17 Abhishek Pratap Singh manu.i...@gmail.com Hi, For your question if HADOOP can be used without HDFS, the answer is Yes. Hadoop can be used with any kind of distributed file system. But I m not able to understand the problem statement clearly to advice my point of view. 
Are you processing text file and saving in distributed database?? Regards, Abhishek On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We want to distribute processing of text files.. processing of large machine learning tasks, have a distributed database as we have big amount of data etc. The problem is that each VM can have up to 2TB of data (limitation of VM), and we have 20TB of data. So we have to distribute the processing, the database etc. But all those data will be in a shared huge central file system. We heard about myHadoop, but we are not sure why is that any different from Hadoop. If we run hadoop/mapreduce without using HDFS? is that an option? best, PA 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com Hadoop does not perform well with shared storage and vms. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to distribute processing across them. However these VMs share a huge data storage system and thus their virtual HDD are all located in the same computer. Would Hadoop be useful for such configuration? Could we use hadoop
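A minimal sketch of the replication-factor point from the first reply in this thread (dropping from 3 to 2 when the SAN's RAID already covers disk failure); the path is hypothetical, and the cluster-wide default would normally be set in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // default for files created through this client
        FileSystem fs = FileSystem.get(conf);
        // Change the factor of an existing file (directories themselves carry no factor).
        fs.setReplication(new Path("/data/corpus/part-00000"), (short) 2);
      }
    }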
Re: is hadoop suitable for us?
The short answer is yes. The longer answer is that you will have to account for the latencies. There is more but you get the idea.. Sent from my iPhone On May 17, 2012, at 5:33 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We have large amount of text files that we want to process and index (plus applying other algorithms). The problem is that our configuration is share-everything while hadoop has a share-nothing configuration. We have 50 VMs and not actual servers, and these share a huge central storage. So using HDFS might not be really useful as replication will not help, distribution of files have no meaning as all files will be again located in the same HDD. I am afraid that I/O will be very slow with or without HDFS. So i am wondering if it will really help us to use hadoop/hbase/pig etc. to distribute and do several parallel tasks.. or is better to install something different (which i am not sure what). We heard myHadoop is better for such kind of configurations, have any clue about it? For example we now have a central mySQL to check if we have already processed a document and keeping there several metadata. Soon we will have to distribute it as there is not enough space in one VM, But Hadoop/HBase will be useful? we don't want to do any complex join/sort of the data, we just want to do queries to check if already processed a document, and if not to add it with several of it's metadata. We heard sungrid for example is another way to go but it's commercial. We are somewhat lost.. so any help/ideas/suggestions are appreciated. Best, PA 2012/5/17 Abhishek Pratap Singh manu.i...@gmail.com Hi, For your question if HADOOP can be used without HDFS, the answer is Yes. Hadoop can be used with any kind of distributed file system. But I m not able to understand the problem statement clearly to advice my point of view. Are you processing text file and saving in distributed database?? Regards, Abhishek On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: We want to distribute processing of text files.. processing of large machine learning tasks, have a distributed database as we have big amount of data etc. The problem is that each VM can have up to 2TB of data (limitation of VM), and we have 20TB of data. So we have to distribute the processing, the database etc. But all those data will be in a shared huge central file system. We heard about myHadoop, but we are not sure why is that any different from Hadoop. If we run hadoop/mapreduce without using HDFS? is that an option? best, PA 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com Hadoop does not perform well with shared storage and vms. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to distribute processing across them. However these VMs share a huge data storage system and thus their virtual HDD are all located in the same computer. Would Hadoop be useful for such configuration? Could we use hadoop without HDFS? so that we can retrieve and store everything in the same storage? Thanks, PA
Re: freeze a mapreduce job
Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority which will cause new tasks to spawn slower than other jobs so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If its the latter, you will probably want to go to a capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of the needs can be doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT, and you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when needed to freeze it, simply set the queue's maximum-capacity to 0 (percentage) or to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing. Meaning, their existing running tasks during the time of changes made to scheduler config will continue to run till completion but further tasks scheduling from those jobs shall begin seeing effect of the changes made. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Rita rmorgan...@gmail.com wrote: I have a rather large map reduce job which takes few days. I was wondering if its possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then I can increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J
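A minimal sketch of the "lower the job's priority" option using the old MR1 JobClient API; the job id is hypothetical, and the pool/queue moves themselves are done in the scheduler configuration rather than in code:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class DemoteJob {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName("job_201205110001_0042")); // hypothetical id
        // Already-running tasks keep their slots; new tasks are scheduled behind other jobs.
        job.setJobPriority("VERY_LOW");
      }
    }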
Re: freeze a mapreduce job
I haven't seen any. Haven't really had to test that... On May 11, 2012, at 9:03 AM, Shi Yu wrote: Is there any risk to suppress a job too long in FS?I guess there are some parameters to control the waiting time of a job (such as timeout ,etc.), for example, if a job is kept idle for more than 24 hours is there a configuration deciding kill/keep that job? Shi On 5/11/2012 6:52 AM, Rita wrote: thanks. I think I will investigate capacity scheduler. On Fri, May 11, 2012 at 7:26 AM, Michael Segelmichael_se...@hotmail.comwrote: Just a quick note... If your task is currently occupying a slot, the only way to release the slot is to kill the specific task. If you are using FS, you can move the task to another queue and/or you can lower the job's priority which will cause new tasks to spawn slower than other jobs so you will eventually free up the cluster. There isn't a way to 'freeze' or stop a job mid state. Is the issue that the job has a large number of slots, or is it an issue of the individual tasks taking a long time to complete? If its the latter, you will probably want to go to a capacity scheduler over the fair scheduler. HTH -Mike On May 11, 2012, at 6:08 AM, Harsh J wrote: I do not know about the per-host slot control (that is most likely not supported, or not yet anyway - and perhaps feels wrong to do), but the rest of the needs can be doable if you use schedulers and queues/pools. If you use FairScheduler (FS), ensure that this job always goes to a special pool and when you want to freeze the pool simply set the pool's maxMaps and maxReduces to 0. Likewise, control max simultaneous tasks as you wish, to constrict instead of freeze. When you make changes to the FairScheduler configs, you do not need to restart the JT, and you may simply wait a few seconds for FairScheduler to refresh its own configs. More on FS at http://hadoop.apache.org/common/docs/current/fair_scheduler.html If you use CapacityScheduler (CS), then I believe you can do this by again making sure the job goes to a specific queue, and when needed to freeze it, simply set the queue's maximum-capacity to 0 (percentage) or to constrict it, choose a lower, positive percentage value as you need. You can also refresh CS to pick up config changes by refreshing queues via mradmin. More on CS at http://hadoop.apache.org/common/docs/current/capacity_scheduler.html Either approach will not freeze/constrict the job immediately, but should certainly prevent it from progressing. Meaning, their existing running tasks during the time of changes made to scheduler config will continue to run till completion but further tasks scheduling from those jobs shall begin seeing effect of the changes made. P.s. A better solution would be to make your job not take as many days, somehow? :-) On Fri, May 11, 2012 at 4:13 PM, Ritarmorgan...@gmail.com wrote: I have a rather large map reduce job which takes few days. I was wondering if its possible for me to freeze the job or make the job less intensive. Is it possible to reduce the number of slots per host and then I can increase them overnight? tia -- --- Get your facts first, then you can distort them as you please.-- -- Harsh J
Re: Reduce Hangs at 66%
Well That was one of the things I had asked. ulimit -a says it all. But you have to do this for the users... hdfs, mapred, and hadoop (Which is why I asked about which flavor.) On May 3, 2012, at 7:03 PM, Raj Vishwanathan wrote: Keith What is the the output for ulimit -n? Your value for number of open files is probably too low. Raj From: Keith Thompson kthom...@binghamton.edu To: common-user@hadoop.apache.org Sent: Thursday, May 3, 2012 4:33 PM Subject: Re: Reduce Hangs at 66% I am not sure about ulimits, but I can answer the rest. It's a Cloudera distribution of Hadoop 0.20.2. The HDFS has 9 TB free. In the reduce step, I am taking keys in the form of (gridID, date), each with a value of 1. The reduce step just sums the 1's as the final output value for the key (It's counting how many people were in the gridID on a certain day). I have been running other more complicated jobs with no problem, so I'm not sure why this one is being peculiar. This is the code I used to execute the program from the command line (the source is a file on the hdfs): hadoop jar jarfile driver source /thompson/outputDensity/density1 The program then executes the map and gets to 66% of the reduce, then stops responding. The cause of the error seems to be: Error from attempt_201202240659_6432_r_00_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist! I don't understand what the _temporary is. I am assuming it's something Hadoop creates automatically. On Thu, May 3, 2012 at 5:02 AM, Michel Segel michael_se...@hotmail.comwrote: Well... Lots of information but still missing some of the basics... Which release and version? What are your ulimits set to? How much free disk space do you have? What are you attempting to do? Stuff like that. Sent from a remote device. Please excuse any typos... Mike Segel On May 2, 2012, at 4:49 PM, Keith Thompson kthom...@binghamton.edu wrote: I am running a task which gets to 66% of the Reduce step and then hangs indefinitely. Here is the log file (I apologize if I am putting too much here but I am not exactly sure what is relevant): 2012-05-02 16:42:52,975 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6433_r_00_0' to tip task_201202240659_6433_r_00, for tracker 'tracker_analytix7:localhost.localdomain/127.0.0.1:56515' 2012-05-02 16:42:53,584 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201202240659_6433_m_01_0' has completed task_201202240659_6433_m_01 successfully. 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_00_0: Task attempt_201202240659_6432_r_00_0 failed to report status for 1800 seconds. Killing! 
2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_0' 2012-05-02 17:00:47,546 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201202240659_6432_r_00_0' to tip task_201202240659_6432_r_00, for tracker 'tracker_analytix4:localhost.localdomain/127.0.0.1:44204' 2012-05-02 17:00:48,763 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_0' 2012-05-02 17:00:48,957 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_00_1' to tip task_201202240659_6432_r_00, for tracker 'tracker_analytix5:localhost.localdomain/127.0.0.1:59117' 2012-05-02 17:00:56,559 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202240659_6432_r_00_1: java.io.IOException: The temporary job-output directory hdfs://analytix1:9000/thompson/outputDensity/density1/_temporary doesn't exist! at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250) at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:438) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:416) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:262) 2012-05-02 17:00:59,903 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202240659_6432_r_00_1' 2012-05-02 17:00:59,906 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201202240659_6432_r_00_2' to tip task_201202240659_6432_r_00, for tracker
Re: Changing the Java heap
Not sure of your question. Java child Heap size is independent of how files are split on HDFS. I suggest you look at Tom White's book on HDFS and how files are split in to blocks. Blocks are split on set sizes. 64MB by default. Your record boundaries are not necessarily on file block boundaries so one process may read the rest of the last record in block A and then complete reading it at the start of block B. A different task may start with block B and skip the first n bytes until it hits the start of a record. HTH -Mike On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote: Within my small 2 node cluster I set up my 4 core slave node to have 4 task trackers and I also limited my java heap size to -Xmx1024m Is there a possibility that when the data gets broken up that it will break it at a place in the file that is not a whitespace? Or is that already handled when the data on HDFS is broken up into blocks? -SB
Re: Feedback on real world production experience with Flume
Gee Edward, what about putting a link to a company website or your blog in your signature... ;-) Seriously one could also mention fuse, right? ;-) Sent from my iPhone On Apr 22, 2012, at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I think this is valid to talk about for example one need not need a decentralized collector if they can just write log directly to decentralized files in a decentralized file system. In any case it was not even a hard vendor pitch. It was someone describing how they handle centralized logging. It stated facts and it was informative. Lets face it, if fuse-mounting-hdfs or directly soft mounting NFS in a way that performs well many of the use cases for flume and scribe like tools would be gone. (not all but many) I never knew there was a rule that discussing alternative software on a mailing list. It seems like a closed minded thing. I also doubt the ASF would back a rule like that. Are we not allowed to talk about EMR or S3, or am I not even allowed to mention S3? Can flume run on ec2 and log to S3? (oops party foul I guess I cant ask that.) Edward On Sun, Apr 22, 2012 at 12:59 AM, Alexander Lorenz wget.n...@googlemail.com wrote: no. That is the Flume Open Source Mailinglist. Not a vendor list. NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe. sent via my mobile device On Apr 22, 2012, at 12:23 AM, Edward Capriolo edlinuxg...@gmail.com wrote: It seems pretty relevant. If you can directly log via NFS that is a viable alternative. On Sat, Apr 21, 2012 at 11:42 AM, alo alt wget.n...@googlemail.com wrote: We decided NO product and vendor advertising on apache mailing lists! I do not understand why you'll put that closed source stuff from your employe in the room. It has nothing to do with flume or the use cases! -- Alexander Lorenz http://mapredit.blogspot.com On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote: Karl, since you did ask for alternatives, people using MapR prefer to use the NFS access to directly deposit data (or access it). Works seamlessly from all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems without having to load any agents on those machines. And it is fully automatic HA Since compression is built-in in MapR, the data gets compressed coming in over NFS automatically without much fuss. Wrt to performance, can get about 870 MB/s per node if you have 10GigE attached (of course, with compression, the effective throughput will surpass that based on how good the data can be squeezed). On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote: I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically. I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully. I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases. The other option would be to write our own tools. What methods are you using for these kinds of tasks? Did you write your own or does Flume (or something else) work for you? I'm also on the Flume mailing list, but I wanted to ask these questions here because I'm interested in Flume _and_ alternatives. Thank you!
Re: Multiple data centre in Hadoop
I don't know of any open source solution in doing this... And yeah its something one can't talk about ;-) On Apr 19, 2012, at 4:28 PM, Robert Evans wrote: Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details of it. I can guess about what it would take, but that is all it would be at this point. --Bobby On 4/17/12 5:46 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Thanks bobby, I m looking for something like this. Now the question is what is the best strategy to do Hot/Hot or Hot/Warm. I need to consider the CPU and Network bandwidth, also needs to decide from which layer this replication should start. Regards, Abhishek On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote: Hi Abhishek, Manu is correct about High Availability within a single colo. I realize that in some cases you have to have fail over between colos. I am not aware of any turn key solution for things like that, but generally what you want to do is to run two clusters, one in each colo, either hot/hot or hot/warm, and I have seen both depending on how quickly you need to fail over. In hot/hot the input data is replicated to both clusters and the same software is run on both. In this case though you have to be fairly sure that your processing is deterministic, or the results could be slightly different (i.e. No generating if random ids). In hot/warm the data is replicated from one colo to the other at defined checkpoints. The data is only processed on one of the grids, but if that colo goes down the other one can take up the processing from where ever the last checkpoint was. I hope that helps. --Bobby On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote: Hi Abhishek, 1. Use multiple directories for *dfs.name.dir* *dfs.data.dir* etc * Recommendation: write to *two local directories on different physical volumes*, and to an *NFS-mounted* directory - Data will be preserved even in the event of a total failure of the NameNode machines * Recommendation: *soft-mount the NFS* directory - If the NFS mount goes offline, this will not cause the NameNode to fail 2. *Rack awareness* https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh manu.i...@gmail.comwrote: Thanks Robert. Is there a best practice or design than can address the High Availability to certain extent? ~Abhishek On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote: No it does not. Sorry On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi All, Just wanted if hadoop supports more than one data centre. This is basically for DR purposes and High Availability where one centre goes down other can bring up. Regards, Abhishek -- Thanks Regards *Manu S* SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855Skype: manuspkd www.opensourcetalk.co.in
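As a side note on the NameNode metadata advice quoted above (two local volumes plus a soft-mounted NFS directory), dfs.name.dir takes a comma-separated list; a sketch with hypothetical paths, though in practice this lives in hdfs-site.xml on the NameNode:

    import org.apache.hadoop.conf.Configuration;

    public class NameDirExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // NameNode image/edits written to two local volumes and one NFS mount;
        // losing any single copy does not lose the filesystem metadata.
        conf.set("dfs.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn");
        System.out.println(conf.get("dfs.name.dir"));
      }
    }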
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
How 'large' or rather in this case small is your file? If you're on a default system, the block size is 64MB. So if your file is <= 64MB, you end up with 1 block, and you will only have 1 mapper. On Apr 19, 2012, at 10:10 PM, Sky wrote: Thanks for your reply. After I sent my email, I found a fundamental defect - in my understanding of how MR is distributed. I discovered that even though I was firing off 15 COREs, the map job - which is the most expensive part of my processing - was run only on 1 core. To start my map job, I was creating a single file with the following data: 1 storage:/root/1.manif.txt 2 storage:/root/2.manif.txt 3 storage:/root/3.manif.txt ... 4000 storage:/root/4000.manif.txt I thought that each of the available COREs would be assigned a map job from the top of that same file, one at a time, and as soon as one CORE was done, it would get the next map job. However, it looks like I need to split the file into that number of pieces. Now while that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need to do, and also how to deal with adding new CORES at runtime. Any suggestions? (it doesn't have to be a file, it can be a list, etc). This all would be much easier to debug if somehow I could get my log4j logs for my mappers and reducers. I can see log4j for my main launcher, but not sure how to enable it for mappers and reducers. Thx - Akash
example: File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains: storage:/root/1.manif/1223.folder/2143.Ebook.ebk storage:/root/2.manif/2133.folder/5449.Ebook.ebk storage:/root/2.manif/2133.folder/5450.Ebook.ebk etc.. and File storage:/root/summary/PENGUIN_2001_3.12.txt contains: storage:/root/19.manif/2223.folder/4343.Ebook.ebk storage:/root/13.manif/9733.folder/2149.Ebook.ebk storage:/root/21.manif/3233.folder/1110.Ebook.ebk etc 4. finally, I also want to output statistics such that: publisher_year_ebook-version COUNT_OF_URLs PENGUIN_2001_3.12 250,111 RANDOMHOUSE_1999_2.01 11,322 etc Here is how I implemented: * My launcher gets list of MM manifests * My Mapper gets one manifest. --- It reads the manifest, within a WHILE loop, --- fetches each EBOOK, and obtain attributes from each ebook, --- updates the manifest for that ebook --- context.write(new Text(RANDOMHOUSE_1999_2.01), new Text(storage:/root/1.manif/1223.folder/2143.Ebook.ebk)) --- Once all ebooks in the manifest are read, it saves the updated Manifest, and exits * My Reducer gets the RANDOMHOUSE_1999_2.01 and a list of ebooks urls. --- It writes a new file storage:/root/summary/RANDOMHOUSE_1999_2.01.txt with all the storage urls for the ebooks --- It also does a
Re: Help me with architecture of a somewhat non-trivial mapreduce implementation
If the file is small enough you could read it into a java object like a list and write your own input format that takes a list object as its input and then lets you specify the number of mappers. On Apr 19, 2012, at 11:34 PM, Sky wrote: My file for the input to the mapper is very small - all it has is urls to a list of manifests. The task for the mappers is to fetch each manifest, then fetch files using the urls from the manifests and process them. Besides passing around lists of files, I am not really accessing the disk. It should be RAM, network, and CPU (unzip, parse xml, extract attributes). So is my only choice to break the input file and submit multiple files (if I have 15 cores, I should split the file with urls into 15 files? also how does it look in code?)? The two drawbacks are - some cores might finish early and stay idle, and I don't know how to deal with dynamically increasing/decreasing cores. Thx - Sky -Original Message- From: Michael Segel Sent: Thursday, April 19, 2012 8:49 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation How 'large' or rather in this case small is your file? If you're on a default system, the block size is 64MB. So if your file is <= 64MB, you end up with 1 block, and you will only have 1 mapper. On Apr 19, 2012, at 10:10 PM, Sky wrote: Thanks for your reply. After I sent my email, I found a fundamental defect - in my understanding of how MR is distributed. I discovered that even though I was firing off 15 COREs, the map job - which is the most expensive part of my processing - was run only on 1 core. To start my map job, I was creating a single file with the following data: 1 storage:/root/1.manif.txt 2 storage:/root/2.manif.txt 3 storage:/root/3.manif.txt ... 4000 storage:/root/4000.manif.txt I thought that each of the available COREs would be assigned a map job from the top of that same file, one at a time, and as soon as one CORE was done, it would get the next map job. However, it looks like I need to split the file into that number of pieces. Now while that's clearly trivial to do, I am not sure how I can detect at runtime how many splits I need to do, and also how to deal with adding new CORES at runtime. Any suggestions? (it doesn't have to be a file, it can be a list, etc). This all would be much easier to debug if somehow I could get my log4j logs for my mappers and reducers. I can see log4j for my main launcher, but not sure how to enable it for mappers and reducers. Thx - Akash -Original Message- From: Robert Evans Sent: Thursday, April 19, 2012 2:08 PM To: common-user@hadoop.apache.org Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce implementation From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the hadoop computations. Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop will normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may have problems in the middle, and be relaunched. So when you are writing out your updated manifest be careful to not replace the old one until the new one is completely ready and will not fail, or you may lose data.
You may also need to be careful in your reduce if you are writing directly to the file there too, but because it is not a read modify write, but just a write it is not as critical. --Bobby Evans On 4/18/12 4:56 PM, Sky USC sky...@hotmail.com wrote: Please help me architect the design of my first significant MR task beyond word count. My program works well. but I am trying to optimize performance to maximize use of available computing resources. I have 3 questions at the bottom. Project description in an abstract sense (written in java): * I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 4000.manif.txt * Each MANIFEST in turn contains varilable number EE of URLs to EBOOKS (range could be 1 - 50,000 EBOOKS urls per MANIFEST) -- stored on storage:/root/1.manif/1223.folder/5443.Ebook.ebk So we are talking about millions of ebooks My task is to: 1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version). 2. Update each of the EBOOK entry record in the manifest - with the 3 attributes (eg: ebook 1334 - publisher=aaa year=bbb, ebook-version=2.01) 3. Create a output file such that the named publisher_year_ebook-version contains a list of all ebook urls that met that criteria. example: File storage:/root/summary/RANDOMHOUSE_1999_2.01.txt contains: storage:/root/1.manif/1223.folder/2143.Ebook.ebk storage
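For the "split the manifest list across mappers" problem in this thread, one stock alternative to writing a custom list InputFormat is NLineInputFormat, which gives each map task a fixed number of input lines instead of one task per 64MB block. A sketch against the old 0.20 mapred API; the paths are hypothetical and IdentityMapper stands in for the poster's manifest-fetching mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class ManifestDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ManifestDriver.class);
        conf.setJobName("manifest-fetch");
        // One input line (one manifest URL) per map task: 4000 lines -> 4000 map tasks,
        // scheduled across whatever slots exist, rather than one mapper per HDFS block.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);
        conf.setMapperClass(IdentityMapper.class);   // swap in the real manifest-fetching mapper
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/root/manifest-list.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("/root/summary"));
        JobClient.runJob(conf);
      }
    }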
Re: start hadoop slave over WAN
Probably a timeout. Really, not a good idea to do this in the first place... Sent from my iPhone On Mar 30, 2012, at 12:35 PM, Ben Cuthbert bencuthb...@ymail.com wrote: Strange thing is the datanode in the remote location has a log zero bytes. So nothing there. Its strange it is like the master does and ssh, login, and then attempts to start it but nothing. Maybe there is a timeout? On 30 Mar 2012, at 18:22, kasi subrahmanyam wrote: Try checking the logs in the logs folder for the datanode.It might give some lead. Maybe there is a mismatch between the namespace iDs in the system and user itself while starting the datanode. On Fri, Mar 30, 2012 at 10:32 PM, Ben Cuthbert bencuthb...@ymail.comwrote: All We have a master in one region and we are trying to start a slave datanode in another region. When executing the scripts it looks to login to the remote host, but never starts the datanode. When executing hbase tho it does work. Is there a timeout or something with hadoop?
Re: where are my logging output files going to?
You don't want users actually running anything directly on the cluster. You would set up some machine to launch jobs. Essentially any sort of Linux machine where you can install Hadoop, but you don't run any jobs... Sent from my iPhone On Mar 28, 2012, at 3:30 AM, Jane Wayne jane.wayne2...@gmail.com wrote: what do you mean by an edge node? do you mean any node that is not the master node (or NameNode or JobTracker node)? On Wed, Mar 28, 2012 at 3:51 AM, Michel Segel michael_se...@hotmail.comwrote: First you really don't want to launch the job from the cluster but from an edge node. To answer your question, in a word, yes, you should have a consistent set of configuration files as possible, noting that overtime this may not be possible as hardware configs may change, Sent from a remote device. Please excuse any typos... Mike Segel On Mar 27, 2012, at 8:42 PM, Jane Wayne jane.wayne2...@gmail.com wrote: if i have a hadoop cluster of 10 nodes, do i have to modify the /hadoop/conf/log4j.properties files on ALL 10 nodes to be the same? currently, i ssh into the master node to execute a job. this node is the only place where i have modified the logj4.properties file. i notice that although my log files are being created, nothing is being written to them. when i test on cygwin, the logging works, however, when i go to a live cluster (i.e. amazon elastic mapreduce), the logging output on the master node no longer works. i wonder if logging is happening at each slave/task node? could someone explain logging or point me to the documentation discussing this issue?
Re: MR job launching is slower
Hi, First, it sounds like you have 2 6 core CPUs giving you 12 cores not 24. Even though the OS reports 24 cores that's the hyper threading and not real cores. This becomes an issue with respect to tuning. To answer your question ... You have a single 1TB HD. That's going to be a major bottleneck in terms of performance. You usually want to have at least 1 drive per core. With a 12 core box that's 12 spindles. How large is your hadoop job's jar? This gets pushed around to all of the nodes. Bigger jars take longer to process and handle. Having said that, the start up time isn't out of whack. It depends on what job you're launching and what you are doing within the job. Remember that the tasks have to report back to the JT. Do you have Ganglia up and running? You will probably see a high load on the CPUs and then a lot of Wait IOs. HTH -Mike On Mar 20, 2012, at 5:40 AM, praveenesh kumar wrote: I have 10 node cluster ( around 24 CPUs, 48 GB RAM, 1 TB HDD, 10 GB ethernet connection) After triggering any MR job, its taking like 3-5 seconds to launch ( I mean the time when I can see any MR job completion % on the screen). I know internally its trying to launch the job,intialize mappers, loading data etc. What I want to know - Is it a default/desired/expected hadoop behavior or there are ways in which I can decrease this startup time ? Also I feel like my hadoop jobs should run faster, but I am still not able to make it as fast as it should be according to me ? I did some tunning also, following are the parameters I am playing around these days but still I feel there are something missing that I can still use: dfs.block.size: mapred.compress.map.output mapred.map/reduce.tasks.speculative.execution mapred.tasktracker.map/reduce.tasks.maximum: mapred.child.java.opts io.sort.mb: io.sort.factor: mapred.reduce.parallel.copies: mapred.job.reuse.jvm.num.tasks: Thanks, Praveenesh
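For reference, the knobs listed above are ordinary MR1 job/site properties; a sketch of setting a few of them per job, where the values are purely illustrative rather than recommendations for this particular cluster:

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output
        conf.setInt("io.sort.mb", 256);                       // map-side sort buffer (MB)
        conf.setInt("io.sort.factor", 25);                    // streams merged at once
        conf.setInt("mapred.reduce.parallel.copies", 10);     // parallel shuffle fetch threads
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);    // -1 = reuse task JVMs indefinitely
        conf.set("mapred.child.java.opts", "-Xmx1024m");      // per-task heap
        System.out.println(conf.get("io.sort.mb"));
      }
    }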
Re: Issue when starting services on CDH3
Are you running the init.d scripts as root and what is order of the services you want to start. Sent from my iPhone On Mar 15, 2012, at 11:22 AM, Manish Bhoge manishbh...@rocketmail.com wrote: Ys, I understand the order and I formatted namenode before starting services. As I suspect there may be ownership and an access issue. Not able to nail down issue exactly. I also have question why there are 2 routes to start services. When we have start-all.sh script then why need to go to init.d to start services?? Thank you, Manish Sent from my BlackBerry, pls excuse typo -Original Message- From: Manu S manupk...@gmail.com Date: Thu, 15 Mar 2012 21:43:26 To: common-user@hadoop.apache.org; manishbh...@rocketmail.com Reply-To: common-user@hadoop.apache.org Subject: Re: Issue when starting services on CDH3 Did you check the service status? Is it like dead, but pid exist? Did you check the ownership and permissions for the dfs.name.dir,dfs.data.dir,mapped.local.dir etc ? The order for starting daemons are like this: 1 namenode 2 datanode 3 jobtracker 4 tasktracker Did you format the namenode before starting? On Mar 15, 2012 9:31 PM, Manu S manupk...@gmail.com wrote: Dear manish Which daemons are not starting? On Mar 15, 2012 9:21 PM, Manish Bhoge manishbh...@rocketmail.com wrote: I have CDH3 installed in standalone mode. I have install all hadoop components. Now when I start services (namenode,secondary namenode,job tracker,task tracker) I can start gracefully from /usr/lib/hadoop/ ./bin/start-all.sh. But when start the same servises from /etc/init.d/hadoop-0.20-* then I unable to start. Why? Now I want to start Hue also which is in init.d that also I couldn't start. Here I suspect authentication issue. Because all the services in init.d are under root user and root group. Please suggest I am stuck here. I tried hive and it seems it running fine. Thanks Manish. Sent from my BlackBerry, pls excuse typo
Re: Hadoop pain points?
What? The lack of documentation is what made Hadoop, really HBase, a lot of fun:-) You know what they say... Not guts, no glory... I'm sorry, while I agree w Harsh, I just don't want to sound like some old guy talking about how when they were young, they had to walk in chest high snow, in a blizzard, uphill (both ways)to and from school ... And how you newbies have it so much better... ;-P Sent from my iPhone On Mar 2, 2012, at 6:42 PM, Russell Jurney russell.jur...@gmail.com wrote: +2 Russell Jurney http://datasyndrome.com On Mar 2, 2012, at 4:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: +1 On Fri, Mar 2, 2012 at 4:09 PM, Harsh J ha...@cloudera.com wrote: Since you ask about anything in general, when I forayed into using Hadoop, my biggest pain was lack of documentation clarity and completeness over the MR and DFS user APIs (and other little points). It would be nice to have some work done to have one example or semi-example for every single Input/OutputFormat, Mapper/Reducer implementations, etc. added to the javadocs. I believe examples and snippets help out a ton (tons more than explaining just behavior) to new devs. On Fri, Mar 2, 2012 at 9:45 PM, Kunaal kunalbha...@alumni.cmu.edu wrote: I am doing a general poll on what are the most prevalent pain points that people run into with Hadoop? These could be performance related (memory usage, IO latencies), usage related or anything really. The goal is to look for what areas this platform could benefit the most in the near future. Any feedback is much appreciated. Thanks, Kunal. -- Harsh J
Re: Hadoop Datacenter Setup
If you are going this route why not net boot the nodes in the cluster? Sent from my iPhone On Jan 30, 2012, at 8:17 PM, Patrick Angeles patrickange...@gmail.com wrote: Hey Aaron, I'm still skeptical when it comes to flash drives, especially as pertains to Hadoop. The write cycle limit is impractical to make them usable for dfs.data.dir and mapred.local.dir, and as you pointed out, you can't use them for logs either. If you put HADOOP_LOG_DIR in /mnt/d0, you will still have to shut down the TT and DN in order to replace the drive. So you may as well just carve out 100GB from that drive and put your root filesystem there. I'd say that unless you're running some extremely CPU-heavy workloads, you should consider getting more than 3 drives per node. Most shops get 6-12 drives per node (with dual quad or hex core processors). Then you can sacrifice one of the drives for swap and the OS. I'd keep the RegionServer heap at 12GB or under to mitigate long GC pauses (the bigger the heap, the longer the eventual full GC). Finally, you can run Hive on the same cluster as HBase, just be wary of load spikes due to MR jobs and configure properly. You don't want a large Hive query to knock out your RegionServers thereby causing cascading failures. - P On Mon, Jan 30, 2012 at 6:44 PM, Aaron Tokhy aaron.to...@resonatenetworks.com wrote: I forgot to add: Are there use cases for using a swap partition for Hadoop nodes if our combined planned heap size is not expected to go over 24GB for any particular node type? I've noticed that if HBase starts to GC, it will pause for unreasonable amounts of time if old pages get swapped to disk, causing the regionserver to crash (which we've mitigated by setting vm.swappiness=5). Our slave node template will have a 1 GB heap Task Tracker, a 1 GB heap Data Node and a 12-16GB heap RegionServer. We assume the OS memory overhead is 1 GB. We added another 1 GB for combined Java VM overhead across services, which comes up to be around a max of 16-20GB used. This gives us around 4-8GB for tasks that would work with HBase. We may also use Hive on the same cluster for queries. On 01/30/2012 05:40 PM, Aaron Tokhy wrote: Hi, Our group is trying to set up a prototype for what will eventually become a cluster of ~50 nodes. Anyone have experiences with a stateless Hadoop cluster setup using this method on CentOS? Are there any caveats with a read-only root file system approach? This would save us from having to keep a root volume on every system (whether it is installed on a USB thumb drive, or a RAID 1 of bootable / partitions). http://citethisbook.net/Red_**Hat_Introduction_to_Stateless_**Linux.htmlhttp://citethisbook.net/Red_Hat_Introduction_to_Stateless_Linux.html We would like to keep the OS root file system separate from the Hadoop filesystem(s) for maintenance reasons (we can hot swap disks while the system is running) We were also considering installing the root filesystem on USB flash drives, making it persistent yet separate. However we would identify and turn off anything that would cause excess writes to the root filesystem given the limited number of USB flash drive write cycles (keep IO writes to the root filesystem to a minimum). We would do this by storing the Hadoop logs on the same filesystem/drive as what we specify in dfs.data.dir/dfs.name.dir. 
In the end we would have something like this: USB (MS DOS partition table + 1 ext2/3/4 partition) /dev/sda /dev/sda1mounted as /(possibly read-only) /dev/sda2mounted as /var(read-write) /dev/sda3mounted as /tmp(read-write) Hadoop Disks (no partition table or GPT since these are 3TB disks) /dev/sdb/mnt/d0 /dev/sdc/mnt/d1 /dev/sdd/mnt/d2 /mnt/d0 would contain all Hadoop logs. Hadoop configuration files would still reside on / Any issues with such a setup? Are there better ways of achieving this?
Re: Send Data and Message to all nodes
Unless I'm missing something, it sounds like the OP wants to chain jobs where the results from one job are the input to another... Of course it's Sun morning and I haven't had my first cup of coffee so I could be misinterpreting the OP's question. If the OP wanted to send the data to each node and use it as a lookup table, his initial output is on HDFS so he could just open the file and read it in to memory in Mapper.setup(). Note: if the file is too big, then you probably wouldn't want to use distributed cache any way... Sent from my iPhone On Jan 28, 2012, at 7:11 PM, Ravi Prakash ravihad...@gmail.com wrote: Take a look at distributed cache for distributing data to all nodes. I'm not sure what you mean by messages. The MR programming paradigm is different from MPI. http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache On Sat, Jan 28, 2012 at 5:52 AM, Oliaei oli...@gmail.com wrote: Hi, I want to run a MR procedure under Hadoop and then send some messages data to all of nodes and after that run anther MR. What's the easiest way for sending data to all or some nodes? Or Is there any way to do that under Hadoop without using other frameworks? Regards, Oliaei oli...@gmail.com -- View this message in context: http://old.nabble.com/Send-Data-and-Message-to-all-nodes-tp33219535p33219535.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
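A minimal sketch of the "read the first job's output in Mapper.setup()" idea, assuming that output is small, tab-separated, and at a known (hypothetical) path; for large lookup data this in-memory approach would not be appropriate, as noted above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Load the previous job's (small) output from HDFS into memory once per task.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path prev = new Path("/jobs/first-pass/part-r-00000");   // hypothetical output file
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(prev)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] kv = line.split("\t", 2);
          if (kv.length == 2) lookup.put(kv[0], kv[1]);
        }
        in.close();
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String hit = lookup.get(value.toString());
        if (hit != null) context.write(value, new Text(hit));
      }
    }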
Re: Connect to HDFS running on a different Hadoop-Version
BigInsights? ... Ok, I'll be nice ... :-) Ok, so of I understand your question, you want to use a single HDFS file system to be used by different 'Hadoop' frameworks ? (derivatives) First, it doesn't make sense. I mean it really doesn't make any sense. Second.. I don't think it would be possible except in the rare case that the two flavors of Hadoop were from the same code stream and similar release level. As a hypothetical example, Oracle forks their own distro from Cloudera but makes relatively few changes under the hood. But getting back to the first point... Not a good idea when you considers that it violates the KISS principle to design. IMHO, you would be better off w two clusters using distcp. Sent from my iPhone On Jan 25, 2012, at 5:38 AM, Romeo Kienzler ro...@ormium.de wrote: Dear List, we're trying to use a central HDFS storage in order to be accessed from various other Hadoop-Distributions. Do you think this is possible? We're having trouble, but not related to different RPC-Versions. When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from BigInsights 1.3 we're getting this error: Bad connection to FS. Command aborted. Exception: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException java.io.IOException: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142) at org.apache.hadoop.ipc.Client.call(Client.java:1110) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:213) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:180) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at com.ibm.biginsights.hadoop.patch.PatchedDistributedFileSystem.initialize(PatchedDistributedFileSystem.java:19) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111) at org.apache.hadoop.fs.FsShell.init(FsShell.java:82) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1785) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1939) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724) But we've already replaced the client hadoop-common.jar's with the Cloudera ones. Please note also that we're getting an EOFException and not an RPC.VersionMismatch. FsShell.java: try { init(); } catch (RPC.VersionMismatch v) { System.err.println(Version Mismatch between client and server + ... command aborted.); return exitCode; } catch (IOException e) { System.err.println(Bad connection to FS. command aborted.); System.err .println(Bad connection to FS. Command aborted. 
Exception: + e.getLocalizedMessage()); e.printStackTrace(); return exitCode; } Any ideas?
Re: Connect to HDFS running on a different Hadoop-Version
Alex, I said I would be nice and hold my tongue when it comes to IBM and their IM pillar products... :-) You could write a client that talks to two different hadoop versions but then you would be using hftp which is what you have under the hood in distcp... But that doesn't seem to be what he wants to do... I can only imagine why he is asking this question... ;-) Sent from my iPhone On Jan 25, 2012, at 7:32 AM, alo alt wget.n...@googlemail.com wrote: Insight is a IBM related product, based on an fork of hadoop I think. The mixing of totally different stacks make no sense. And will not work, I guess. - Alex -- Alexander Lorenz http://mapredit.blogspot.com On Jan 25, 2012, at 1:12 PM, Harsh J wrote: Hello Romeo, Inlineā¦ On Wed, Jan 25, 2012 at 4:07 PM, Romeo Kienzler ro...@ormium.de wrote: Dear List, we're trying to use a central HDFS storage in order to be accessed from various other Hadoop-Distributions. The HDFS you've setup, what 'distribution' is that from? You will have to use that particular version's jar across all client applications you use, else you'll run into RPC version incompatibilities. Do you think this is possible? We're having trouble, but not related to different RPC-Versions. It should be possible _most of the times_ by replacing jars at the client end to use the one that runs your cluster, but there may be minor API incompatibilities between certain versions that can get in the way. Purely depends on your client application and its implementation. If it sticks to using the publicly supported APIs, you are mostly fine. When trying to access a Cloudera CDH3 Update 2 (cdh3u2) HDFS from BigInsights 1.3 we're getting this error: BigInsights runs off IBM's own patched Hadoop sources if I am right, and things can get a bit tricky there. See the following points: Bad connection to FS. Command aborted. Exception: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException java.io.IOException: Call to localhost.localdomain/127.0.0.1:50070 failed on local exception: java.io.EOFException This is surely an RPC issue. The call tries to read off a field, but gets no response, EOFs and dies. We have more descriptive error messages with the 0.23 version onwards, but the problem here is that your IBM client jar is not the same as your cluster's jar. The mixture won't work. com.ibm.biginsights.hadoop.patch.PatchedDistributedFileSystem.initialize(PatchedDistributedFileSystem.java:19) ^^ This is what am speaking of. Your client (BigInsights? Have not used it reallyā¦) is using an IBM jar with their supplied 'PatchDistributedFileSystem', and that is probably incompatible with the cluster's HDFS RPC protocols. I do not know enough about IBM's custom stuff to know for sure it would work if you replace it with your clusters' jar. But we've already replaced the client hadoop-common.jar's with the Cloudera ones. Apparently not. Your strace shows that com.ibm.* classes are still being pulled. My guess is that BigInsights would not work with anything non IBM, but I have not used it to know for sure. If they have a user community, you can ask there if there is a working way to have BigInsights run against Apache/CDH/etc. distributions. For CDH specific questions, you may ask at https://groups.google.com/a/cloudera.org/group/cdh-user/topics instead of the Apache lists here. -- Harsh J Customer Ops. Engineer, Cloudera
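A sketch of the "client that talks to both versions over hftp" route mentioned above: hftp is a read-only HTTP filesystem, so the copy reads through it and writes with the destination cluster's native client. Hostnames, ports, and paths are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CrossVersionCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read from the remote cluster over hftp (HTTP, read-only, version tolerant)...
        FileSystem src = FileSystem.get(URI.create("hftp://old-nn.example.com:50070"), conf);
        // ...and write through the local cluster's own RPC client.
        FileSystem dst = FileSystem.get(URI.create("hdfs://new-nn.example.com:8020"), conf);
        FileUtil.copy(src, new Path("/data/in"), dst, new Path("/data/in"), false, conf);
      }
    }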
Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Steve, Ok, first your client connection to the cluster is a non-issue. If you go in to /etc/Hadoop/conf That supposed to be a little h but my iPhone knows what's best... Look and see what you have set for your bandwidth... I forget which parameter but there are only a couple that deal with bandwidth. I think it's set to 1mb or 10mb by default. You need to up it to 100-200mb if you're on a 1 GB network. That would solve your balancing issue. See if that helps... Sent from my iPhone On Jan 20, 2012, at 4:57 PM, Steve Lewis lordjoe2...@gmail.com wrote: On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel michael_se...@hotmail.comwrote: Steve, If you want me to debug your code, I'll be glad to set up a billable contract... ;-) What I am willing to do is to help you to debug your code.. The code seems to work well for small input files and is basically a standard sample. Did you time how long it takes in the Mapper.map() method? The reason I asked this is to first confirm that you are failing within a map() method. It could be that you're just not updating your status... The map method starts out running very fast - generateSubstrings, the only interesting part, runs in milliseconds. The only other thing the mapper does is context.write, which SHOULD update status. You said that you are writing many output records for a single input. So let's take a look at your code. Are all writes of the same length? Meaning that in each iteration of Mapper.map() you will always write K number of rows? Because in my sample the input strings are the same length - every call to the mapper will write the same number of records. If so, ask yourself why some iterations are taking longer and longer? I believe the issue may relate to local storage getting filled and Hadoop taking a LOT of time to rebalance the output. Assuming the string length is the same on each map there is no reason for some iterations to be longer than others. Note: I'm assuming that the time for each iteration is taking longer than the previous... I assume so as well since in my cluster the first 50% of mapping goes pretty fast. Or am I missing something? How do I get timing of map iterations?? -Mike Sent from a remote device. Please excuse any typos... Mike Segel On Jan 20, 2012, at 11:16 AM, Steve Lewis lordjoe2...@gmail.com wrote: We have been having problems with mappers timing out after 600 sec when the mapper writes many more, say thousands of records for every input record - even when the code in the mapper is small and fast. I have no idea what could cause the system to be so slow and am reluctant to raise the 600 sec limit without understanding why there should be a timeout when all MY code is very fast. I am enclosing a small sample which illustrates the problem. It will generate a 4GB text file on hdfs if the input file does not exist or is not at least that size and this will take some time (hours in my configuration) - then the code is essentially wordcount but instead of finding and emitting words - the mapper emits all substrings of the input data - this generates a much larger output data and number of output records than wordcount generates. Still, the amount of data emitted is no larger than other data sets I know Hadoop can handle. All mappers on my 8 node cluster eventually timeout after 600 sec - even though I see nothing in the code which is even a little slow and suspect that any slow behavior is in the called Hadoop code. 
This is similar to a problem we have in bioinformatics where a colleague saw timeouts on his 50 node cluster. I would appreciate any help from the group. Note - if you have a text file at least 4 GB the program will take that as an input without trying to create its own file. /* */ import org.apache.hadoop.conf.*; import org.apache.hadoop.fs.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.*; import org.apache.hadoop.mapreduce.lib.output.*; import org.apache.hadoop.util.*; import java.io.*; import java.util.*; /** * org.systemsbiology.hadoop.SubstringGenerator * * This illustrates an issue we are having where a mapper generating a much larger volume of * data and number of records times out even though the code is small, simple and fast * * NOTE!!! as written the program will generate a 4GB file in hdfs with good input data - * this is done only if the file does not exist but may take several hours. It will only be * done once. After that the failure is fairly fast * * What this will do is count unique Substrings of lines of length * between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating all * substrings and then using the word count algorithm * What is interesting is that the
Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
That's the one ... Sent from my iPhone On Jan 20, 2012, at 6:28 PM, Paul Ho p...@walmart.com wrote: I think the balancing bandwidth property you are looking for is in hdfs-site.xml: <property> <name>dfs.balance.bandwidthPerSec</name> <value>402653184</value> </property> Set the value that makes most sense for your NIC. But I thought this is only for balancing.
Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec
But Steve, it is your code... :-) Here is a simple test... Set your code up where the run fails... Add a simple timer to see how long you spend in the Mapper.map() method. Only print out the time if it's greater than, let's say, 500 seconds... The other thing is to update a dynamic counter in Mapper.map(). This would force a status update to be sent to the JT. Also you don't give a lot of detail... Are you writing out to an HBase table??? HTH -Mike On Jan 18, 2012, at 6:21 PM, Steve Lewis wrote: 1) I do a lot of progress reporting 2) Why would the job succeed when the only change in the code is if(NumberWrites++ % 100 == 0) context.write(key,value); comment out the test allowing full writes and the job fails Since every write is a report I assume that something in the write code or other hadoop code for dealing with output is failing. I do increment a counter for every write or in the case of the above code potential write What I am seeing is that wherever the timeout occurs it is not in a place where I am capable of inserting more reporting On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina lurb...@mit.edu wrote: Perhaps you are not reporting progress throughout your task. If you happen to run a large enough job you hit the default timeout mapred.task.timeout (that defaults to 10 min). Perhaps you should consider reporting progress in your mapper/reducer by calling progress() on the Reporter object. Check tip 7 of this link: http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/ Hope that helps, -Leo Sent from my phone On Jan 18, 2012, at 6:46 PM, Steve Lewis lordjoe2...@gmail.com wrote: I KNOW it is a task timeout - what I do NOT know is WHY merely cutting the number of writes causes it to go away. It seems to imply that some context.write operation or something downstream from that is taking a huge amount of time and that is all hadoop internal code - not mine so my question is why should increasing the number and volume of writes cause a task to time out On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez t...@supertom.com wrote: Sounds like mapred.task.timeout? The default is 10 minutes. http://hadoop.apache.org/common/docs/current/mapred-default.html Thanks, Tom On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis lordjoe2...@gmail.com wrote: The map tasks fail timing out after 600 sec. I am processing one 9 GB file with 16,000,000 records. Each record (think of it as a line) generates hundreds of key value pairs. The job is unusual in that the output of the mapper in terms of records or bytes is orders of magnitude larger than the input. I have no idea what is slowing down the job except that the problem is in the writes. If I change the job to merely bypass a fraction of the context.write statements the job succeeds.
This is one map task that failed and one that succeeded - I cannot understand how a write can take so long or what else the mapper might be doing. JOB FAILED WITH TIMEOUT: Parser: TotalProteins 90,103, NumberFragments 10,933,089. FileSystemCounters: HDFS_BYTES_READ 67,245,605, FILE_BYTES_WRITTEN 444,054,807. Map-Reduce Framework: Combine output records 10,033,499, Map input records 90,103, Spilled Records 10,032,836, Map output bytes 3,520,182,794, Combine input records 10,844,881, Map output records 10,933,089. Same code but fewer writes, JOB SUCCEEDED: Parser: TotalProteins 90,103, NumberFragments 206,658,758. FileSystemCounters: FILE_BYTES_READ 111,578,253, HDFS_BYTES_READ 67,245,607, FILE_BYTES_WRITTEN 220,169,922. Map-Reduce Framework: Combine output records 4,046,128, Map input records 90,103, Spilled Records 4,046,128, Map output bytes 662,354,413, Combine input records 4,098,609, Map output records 2,066,588. Any bright ideas? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
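Below is a minimal sketch of the two diagnostics suggested in this thread: timing each Mapper.map() call and bumping a dynamic counter so a status report goes back to the JobTracker on every write. It assumes the new (org.apache.hadoop.mapreduce) API; the class name, the counter group and the substring loop are hypothetical stand-ins for the actual job.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimedSubstringMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private static final int MAX_SUBSTRING_LENGTH = 10;   // hypothetical limit

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        String line = value.toString();
        for (int i = 0; i < line.length(); i++) {
            int end = Math.min(line.length(), i + MAX_SUBSTRING_LENGTH);
            for (int j = i + 1; j <= end; j++) {
                context.write(new Text(line.substring(i, j)), ONE);
                // A counter increment is reported back to the framework, which
                // keeps the task from looking idle between writes.
                context.getCounter("Debug", "substringWrites").increment(1);
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > 500L * 1000L) {   // only log map() calls longer than ~500 seconds
            System.err.println("map() took " + elapsed + " ms for key " + key);
        }
    }
}

If the timer never fires but tasks still die at 600 seconds, the time is being spent downstream of map(), in the map-side sort/spill path, which would match the very large Spilled Records count in the failed run above rather than anything in the user code.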
Re: Does hadoop installations need to be at same locations in cluster ?
Sure, You could do that, but in doing so, you will make your life a living hell. Literally. Think about it... You will have to manually manage each node's config files... So if something goes wrong you will have a hard time diagnosing the issue. Why make life harder? Why not just do the simple thing and make all of your DN the same? Sent from my iPhone On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com wrote: When installing hadoop on slave machines, do we have to install hadoop at the same location on each machine? Can we have the hadoop installation at different locations on different machines in the same cluster? If yes, what things do we have to take care of in that case? Thanks, Praveenesh
Re: Hadoop configuration
Class project due? Sorry, second set of questions on setting up a 2 node cluster... Sent from my iPhone On Dec 22, 2011, at 3:25 AM, Humayun kabir humayun0...@gmail.com wrote: someone please help me to configure hadoop (core-site.xml, hdfs-site.xml, mapred-site.xml etc.). please provide some examples. it is badly needed, because i run a 2 node cluster. when i run the wordcount example it gives the result: too much fetch failure.
RE: Does hadoop installations need to be at same locations in cluster ?
Ok, Here's the thing... 1) When building the cluster, you want to be consistent. 2) Location of $HADOOP_HOME is configurable. So you can place it anywhere. Putting the software in two different locations isn't a good idea because you now have to set it up with a unique configuration per node. It would be faster and make your life a lot easier to put the software in the same location on *all* machines. So my suggestion would be to bite the bullet and rebuild your cluster. HTH -Mike Date: Fri, 23 Dec 2011 19:47:45 +0530 Subject: Re: Does hadoop installations need to be at same locations in cluster ? From: praveen...@gmail.com To: common-user@hadoop.apache.org What I mean to say is, does hadoop internally assume that all installations on each node need to be in the same location? I had hadoop installed in different locations on 2 different nodes. I configured the hadoop config files to be a part of the same cluster. But when I started hadoop on the master, I saw it was also searching for the hadoop startup scripts in the same location as on the master. Do we have any workaround in this kind of situation or do I have to reinstall hadoop in the same location as the master? Thanks, Praveenesh On Fri, Dec 23, 2011 at 6:26 PM, Michael Segel michael_se...@hotmail.com wrote: Sure, You could do that, but in doing so, you will make your life a living hell. Literally. Think about it... You will have to manually manage each node's config files... So if something goes wrong you will have a hard time diagnosing the issue. Why make life harder? Why not just do the simple thing and make all of your DN the same? Sent from my iPhone On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com wrote: When installing hadoop on slave machines, do we have to install hadoop at the same location on each machine? Can we have the hadoop installation at different locations on different machines in the same cluster? If yes, what things do we have to take care of in that case? Thanks, Praveenesh
RE: More cores Vs More Nodes ?
Tom, Look, I've said this before and I'm going to say it again. Your knowledge of Hadoop is purely academic. It may be ok to talk to C level execs who visit the San Jose IM Lab or in Markham, but when you give answers on issues you don't have first hand practical experience, you end up doing more harm than good. The problem is that too many people blindly except what they see on the web as fact when its not always accurate and may not suit their needs. I've lost count on the number of hours I've spent in meetings trying to undo the damage cause by someone saying ... but FB does it this way...therefore that's how we should do it. Now Michael St.Ack is a pretty smart guy. He knows his shit. He's extremely credible. However when he says that FB does something a specific way, that is because FB has certain requirements and the solution works for them. It doesn't mean that it will be the best solution for your customer/client. And Tom, if we pull out your business card, you have a nice fancy title with IBM. So you instantly have some credibility. Unfortunately, you're no St.Ack. (I'd put a smile face but I'm actually trying to be serious.) Even in this post, you continue to go down the wrong path. Unfortunately I don't have time to lecture you on why what you said is wrong and that your thoughts on cluster design are way off base. Oh and I tease you because frankly, you deserve it. I have to apologize to everyone on the list, but in the past, you failed to actually stop and take the hint that maybe you need to rethink your views on Hadoop. That had you had practical experience setting up actual clusters (Not EC2 clusters) you would have the necessary understanding of what can go wrong and how to fix it. If I get time, I'll have to find my copy of Up Front by Bill Maudlin. There's a cartoon that really fits you. Later To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Wed, 14 Dec 2011 11:40:51 -0800 Your eagerness to insult is throwing you off track here Michael. For example, the workload profile of a cluster doing heavy NLP is very different than one doing serving as a destination for large scale application/web logs. Ditto for PC risk modeling vs smart meter use cases, etc etc...Those are not general purpose clusters. You may - and should I'd say - have the NLP use cases in a common analytics environment (internal cloud model) for sharing of methods/skills, but putting orthogonal use cases on that cluster is not inherently a best practice. How those clusters should be built does vary, and no it is not uncommon to have focused use cases like that. If you know it is going to be a general purpose cluster then do build it in a balanced spec. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com
RE: More cores Vs More Nodes ?
Sorry, But having read the thread, I am going to have to say that this is definitely a silly question. NOTE THE FOLLOWING: Silly questions are not a bad thing. I happen to ask them all the time. ;-) Here's why I say it's a silly question... Hadoop is a cost effective solution when you build out 'commodity' servers. Now here's the rub. Commodity servers means something different to each person, and I don't want to get in to a debate on its definition. When building out a cluster, too many people gloss over the complexity. 1U vs 2U in box size. Do you go with a 1/2 size MB or a full size MB? How many disks per node. How much memory. Physical plant limitations. (Available rack space, costs if this is going in to a colo...) Power consumption, budget... At a client, back in 2009, our first cluster was built on whatever hardware we could get. It was 5 blade servers w SCSI/SAS 2.5" disks where we split each blade so we could have 10 nodes. Yeah, it was a mistake and a royal pain. But we got the cluster up and could do some simple PoCs. But we then came up with our reference architecture for further PoCs and development. We built out the DN w 8 cores, 32GB, and 4 x 2TB 3.5" drives. Why? Because based on our constraints, this gave us the optimal combination of price and performance. Note: We knew we would leave some performance on the table. It was a conscious decision to leave some performance on the table so that we could maximize the number of nodes to fit within our budget. We chose 2TB drives because at the time they offered the best price/performance ratio. Today, that may be different. We chose 32GB because at the time it was the sweet spot in memory prices. Today w 3 channel memory it looks like 36GB is the sweet spot. Of course YMMV. (It could be 48GB...) Moving forward, I would reconsider the design because the price points on hardware have changed. That's going to be your driving factor. You want to look at 64 Core boxes, then you need 256GB of memory. Think of how many disks you have to add. (64-128 disks) Now then ask yourself is this a commodity box? Now price that box out. Then price out how many 8 core 1U boxes you can buy. Kind of puts it in to perspective, doesn't it? ;-) The reason why I call this a 'silly question' is that you're attempting to look at your cluster by focusing on only one variable. This is not to say that it's a bad question, because it forces you to realize that there are definitely lots of other options that you have to consider. HTH -Mike Date: Tue, 13 Dec 2011 20:25:17 -0600 Subject: Re: More cores Vs More Nodes ? From: airb...@gmail.com To: common-user@hadoop.apache.org Hi Brad This is a really interesting experiment. I am curious why you did not use 2 cores on each machine but 32 nodes. That would make the number of CPU cores in the two groups equal. Chen On Tue, Dec 13, 2011 at 7:15 PM, Brad Sarsfield b...@bing.com wrote: Hi Prashant, In each case I had a single tasktracker per node. I oversubscribed the total tasks per tasktracker/node by 1.5 x # of cores. So for the 64 core allocation comparison: In A: 8 cores; Each machine had a single tasktracker with 8 map / 4 reduce slots for 12 task slots total per machine x 8 machines (including head node) In B: 2 cores; Each machine had a single tasktracker with 2 map / 1 reduce slots for 3 slots total per machine x 29 machines (including head node which was running 8 cores) The experiment was done in a cloud hosted environment running a set of VMs.
~Brad -Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: Tuesday, December 13, 2011 9:46 AM To: common-user@hadoop.apache.org Subject: Re: More cores Vs More Nodes ? Hi Brad, how many tasktrackers did you have on each node in both cases? Thanks, Prashant Sent from my iPhone On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote: Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependent. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact
RE: More cores Vs More Nodes ?
Aw Tommy, Actually no. You really don't want to do this. If you actually ran a cluster and worked in the real world, you would find that if you purposely build a cluster for one job, there will be a mandate that some other group needs to use the cluster and that their job has different performance issues and your cluster is now suboptimal for their jobs... Perhaps you meant that you needed to think about the purpose of the cluster? That is, do you want to minimize the nodes but maximize the disk space per node and use the cluster as your backup cluster? (Assuming that you are considering your DR and BCP in your design.) The problem with your answer is that a job has a specific meaning within the Hadoop world. You should have asked what is the purpose of the cluster. I agree w Brad, that it depends ... But the factors which will impact your cluster design are more along the lines of the purpose of the cluster and then the budget along with your IT constraints. IMHO it's better to avoid building purpose built clusters. You end up not being able to recycle the hardware in to new clusters easily. But hey what do I know? ;-) To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Tue, 13 Dec 2011 09:46:49 -0800 It also helps to know the profile of your job in how you spec the machines. So in addition to Brad's response you should consider if you think your jobs will be more storage or compute oriented. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Brad Sarsfield b...@bing.com 12/13/2011 09:41 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org common-user@hadoop.apache.org cc Subject RE: More cores Vs More Nodes ? Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependent. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact same network / machine configuration; A: I had 8 machines with 8 cores B: I had 28 machines with 2 cores (and 1x8 core head node) B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment; where some of the IO capabilities behind the scenes were being regulated to 400Mbps per node when running in the 2 core configuration vs 1Gbps on the 8 core. So I would expect the non-throttled scenario to work even better. ~Brad -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ? Hey Guys, So I have a very naive question in my mind regarding Hadoop cluster nodes ?
more cores or more nodes - Shall I spend money on going from 2 to 4 core machines, or spend money on buying more nodes with fewer cores, e.g. 2 machines of 2 cores each? Thanks, Praveenesh
RE: More cores Vs More Nodes ?
Brian, I think you missed my point. The moment you go and design a cluster for a specific job, you end up getting fscked because there's another group who wants to use the shared resource for their job which could be orthogonal to the original purpose. It happens every day. This is why you have to ask if the cluster is being built for a specific purpose. Meaning answering the question 'Which of the following best describes your cluster: a) PoC b) Development c) Pre-prod d) Production e) Secondary/Backup' Note that sizing the cluster is a different matter. Meaning if you know you need a PB of storage, you're going to design the cluster differently because once you get to a certain size, you have to recognize that your clusters are going to have lots of disk, and require 10GbE just for the storage. Number of cores would be less of an issue, however again look at pricing. 2 socket 8 core Xeon MBs are currently at an optimal price point. And again this goes back to the point I was trying to make. You need to look beyond the number of cores as a determining factor. You go too small, you're going to take a hit because of the price/performance curve. (Remember that you have to consider Machine Room real estate. 100 2 core boxes take up much more space than 25 8 core boxes) If you go to the other extreme... a 64 core giant SMP box: $ for $, you could build out an 8 node cluster for less money. Beyond that, you really, really don't want to build a custom cluster for a specific job unless you know that you're going to be running that specific job or set of jobs (24x7x365) [And yes, I came across such a use case...] HTH -Mike From: bbock...@cse.unl.edu Subject: Re: More cores Vs More Nodes ? Date: Wed, 14 Dec 2011 07:41:25 -0600 To: common-user@hadoop.apache.org Actually, there are varying degrees here. If you have a successful project, you will find other groups at your door wanting to use the cluster too. Their jobs might be different from the original use case. However, if you don't understand the original use case (CPU heavy or storage heavy? is a great beginning question), your original project won't be successful. Then there will be no follow-up users because you failed. So, you want to have a reasonably general-purpose cluster, but make sure it matches well with the type of jobs. As an example, we had one group who required an estimated CPU-millennia per byte of data... they needed a general purpose cluster for a certain value of general purpose. Brian
RE: More cores Vs More Nodes ?
Tommy, Again, I think you need to really have some real world experience before you make generalizations like that. Sorry, but at a client, we put 6 different groups' applications in production. Without going in to detail the jobs in production were orthogonal to one another. The point is that were we to build our cluster optimized to one job we would have been screwed. Oh wait, I forgot that you worked for IBM and they would love to sell you more hardware and consulting to improve the situation... (I kee-id, I kee-id) Now Seriously, The point of this discussion is that you really, really don't want to build the cluster optimized for a single job. The only time you want to do that is if you have a job or set of jobs that you plan on running every day 24x7 and the job takes the entire cluster. Yes, such jobs do exist. However they are highly irregular and definitely not the norm. One of the other pain points is that developers have to get used to the cluster as a shared resource to be used between different teams. This helps to defer the costs including maintenance. So as a shared resource, development and production, you need to build out a box that handles everything equally. Had you attended our session at Hadoop World, not only would you have learned this... (Don't tune the cluster to the application, but tune the application to the cluster) I would have also poked fun of you in person. ;-) We also talked about avoiding the internet myths and 'truisms'. Unless you've had your hands dirty and at customer's sites you're going to find the real world is a different place. ;-) But hey! What do I know? To: common-user@hadoop.apache.org Subject: RE: More cores Vs More Nodes ? From: tdeut...@us.ibm.com Date: Wed, 14 Dec 2011 07:56:30 -0800 Putting aside any smarmy responses for a moment - sorry that job(s) wasn't understood as equating to purpose. If you are building a general purpose sandbox then I think we all agree on building a balanced general purpose cluster. But if you have production use cases in mind then you darn well better try to understand how the cluster will be used/stressed so you don't end up with a hardware spec that doesn't match how the cluster is actually used. If you can't profile a production use case as to how it will stress the cluster that is a huge warning sign as to project risk. If you are tearing down and re-purposing a cluster that was implemented to support a production use case then the planning failed. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com
RE: Regarding loading a big XML file to HDFS
Just wanted to address this: Basically in my mapreduce program i am expecting a complete XML as my input. i have a CustomReader (for XML) in my mapreduce job configuration. My main confusion is, if the namenode distributes data to DataNodes, there is a chance that a part of the xml can go to one data node and the other half can go to another datanode. If that is the case, will my custom XMLReader in the mapreduce be able to combine it (as mapreduce reads data locally only). Please help me on this? if you can not do anything parallel here, make your input split size to cover the complete file size. also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But here you won't get any parallel processing advantage. You can do this in parallel. You need to write a custom input format class. (Which is what you're already doing...) Let's see if I can explain this correctly. You have an XML record split across block A and block B. Your map reduce job will instantiate a task per block. So in the mapper processing block A, you read and process the XML records... when you get to the last record, which is only partly in A, mapper A will continue on to block B and continue reading the last record. Then it stops. In the mapper for block B, the reader will skip and not process data until it sees the start of a record. So you end up getting all of your XML records processed (no duplication) and done in parallel. Does that make sense? -Mike Date: Tue, 22 Nov 2011 03:08:20 + From: mahesw...@huawei.com Subject: RE: Regarding loading a big XML file to HDFS To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org Also I am surprised at how you are writing the mapreduce application here. Map and reduce will work with key value pairs. From: Uma Maheswara Rao G Sent: Tuesday, November 22, 2011 8:33 AM To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org Subject: RE: Regarding loading a big XML file to HDFS __ From: hari708 [hari...@gmail.com] Sent: Tuesday, November 22, 2011 6:50 AM To: core-u...@hadoop.apache.org Subject: Regarding loading a big XML file to HDFS Hi, I have a big file consisting of XML data. The XML is not represented as a single line in the file. if we stream this file using the ./hadoop dfs -put command to a hadoop directory, how does the distribution happen? HDFS will divide the blocks based on your block size configured for the file. Basically in my mapreduce program i am expecting a complete XML as my input. i have a CustomReader (for XML) in my mapreduce job configuration. My main confusion is, if the namenode distributes data to DataNodes, there is a chance that a part of the xml can go to one data node and the other half can go to another datanode. If that is the case, will my custom XMLReader in the mapreduce be able to combine it (as mapreduce reads data locally only). Please help me on this? if you can not do anything parallel here, make your input split size to cover the complete file size. also configure the block size to cover the complete file size. In this case, only one mapper and reducer will be spawned for the file. But here you won't get any parallel processing advantage. -- View this message in context: http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
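As a concrete illustration of the simpler of the two options above (Uma's one-mapper-per-file fallback), here is a minimal sketch assuming the new mapreduce API; the class name is hypothetical. The parallel approach Mike describes instead needs a custom record reader that reads past the end of its split to finish the last record and, on every split after the first, skips forward to the first record start.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Refuses to split files, so a single mapper reads each XML file end to end and
// no record can be cut in half by a split boundary. The price is that you lose
// parallelism within a file, as noted above.
public class WholeXmlFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}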
RE: Matrix multiplication in Hadoop
Ok Mike, First I admire that you are studying Hadoop. To answer your question... not well. Might I suggest that if you want to learn Hadoop, you try and find a problem which can easily be broken in to a series of parallel tasks where there is minimal communication requirements between each task? No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase. Yes it can be done. But not the best use of resources. To: common-user@hadoop.apache.org CC: common-user@hadoop.apache.org Subject: Re: Matrix multiplication in Hadoop From: mspre...@us.ibm.com Date: Fri, 18 Nov 2011 12:39:00 -0500 That's also an interesting question, but right now I am studying Hadoop and want to know how well dense MM can be done in Hadoop. Thanks, Mike From: Michel Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: 11/18/2011 12:34 PM Subject:Re: Matrix multiplication in Hadoop Is Hadoop the best tool for doing large matrix math. Sure you can do it, but, aren't there better tools for these types of problems? Sent from a remote device. Please excuse any typos... Mike Segel
RE: mapr common library?
Not sure what you mean. But w MapR, you have everything under the /opt/mapr/hadoop/hadoop* tree. There you'll see the core library and also the other ./contrib, etc. directories. Sorry I'm not more specific... I'm going from memory. HTH -Mike Date: Wed, 19 Oct 2011 20:06:50 -0700 Subject: mapr common library? From: alexgauthie...@gmail.com To: common-user@hadoop.apache.org Is there such a thing somewhere? I have the basic nPath, lucene-like search processing but am looking for ETL like transformations, a typical weblog processor or clickstream. Anything beyond wordcount would be appreciated :) GodSpeed. Alex http://twitter.com/#!/A23Corp
RE: Can we replace namenode machine with some other machine ?
I agree w Steve except on one thing... RAID 5 Bad. RAID 10 (1+0) good. Sorry this goes back to my RDBMs days where RAID 5 will kill your performance and worse... Date: Thu, 22 Sep 2011 11:28:39 +0100 From: ste...@apache.org To: common-user@hadoop.apache.org Subject: Re: Can we replace namenode machine with some other machine ? On 22/09/11 05:42, praveenesh kumar wrote: Hi all, Can we replace our namenode machine later with some other machine. ? Actually I got a new server machine in my cluster and now I want to make this machine as my new namenode and jobtracker node ? Also Does Namenode/JobTracker machine's configuration needs to be better than datanodes/tasktracker's ?? 1. I'd give it lots of RAM - holding data about many files, avoiding swapping, etc. 2. I'd make sure the disks are RAID5, with some NFS-mounted FS that the secondary namenode can talk to. avoids risk of loss of the index, which, if it happens, renders your filesystem worthless. If I was really paranoid I'd have twin raid controllers with separate connections to disk arrays in separate racks, as [Jiang2008] shows that interconnect problems on disk arrays can be higher than HDD failures. 3. if your central switches are at 10 GbE, consider getting a 10GbE NIC and hooking it up directly -this stops the network being the bottleneck, though it does mean the server can have a lot more packets hitting it, so putting more load on it. 4. Leave space for a second CPU and time for GC tuning. JT's are less important; they need RAM but use HDFS for storage. If your cluster is small, NN and JT can be run locally. If you do this, set up DNS to have two hostnames to point to same network address. Then if you ever split them off, everyone whose bookmark says http://jobtracker won't notice Either way: the NN and the JT are the machines whose availability you care about. The rest is just a source of statistics you can look at later. -Steve [Jiang2008] Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage.
RE: Can we replace namenode machine with some other machine ?
Well you could do RAID 1 which is just mirroring. I don't think you need to do any raid 0 or raid 5 (striping) to get better performance. Also if you're using a 1U box, you just need 2 SATA drives internal and then NFS mount a drive from your SN for your backup copy... Date: Thu, 22 Sep 2011 17:18:55 +0100 From: ste...@apache.org To: common-user@hadoop.apache.org Subject: Re: Can we replace namenode machine with some other machine ? On 22/09/11 17:13, Michael Segel wrote: I agree w Steve except on one thing... RAID 5 Bad. RAID 10 (1+0) good. Sorry this goes back to my RDBMs days where RAID 5 will kill your performance and worse... sorry, I should have said RAID =5. The main thing is you don't want the NN data lost. ever
RE: risks of using Hadoop
Tom, Normally someone who has a personal beef with someone will take it offline and deal with it. Clearly manners aren't your strong point... unfortunately making me respond to you in public. Since you asked, no, I don't have any beefs with IBM. In fact, I happen to have quite a few friends within IBM's IM pillar. (although many seem to taking Elvis' advice and left the building...) What I do have a problem with is you and your response to the posts in this thread. Its bad enough that you really don't know what you're talking about. But this is compounded by the fact that your posts end with your job title seems to indicate that you are a thought leader from a well known, brand name company. So unlike some schmuck off the street, because of your job title, someone may actually pay attention to you and take what you say at face value. The issue at hand is that the OP wanted to know the risks so that he can address them to give his pointy haired stake holders a warm fuzzy feeling. SPOF isn't a risk, but a point of FUD that is constantly being brought out by people who have an alternative that they wanted to promote. Brian pretty much put it in to perspective. You attempted to correct him, and while Brian was polite, I'm not. Why? Because I happen to know of enough people who still think that what BS IBM trots out must be true and taken at face value. I think you're more concerned with making an appearance than you are with anyone having a good experience. No offense, but again, you're not someone who has actual hands on experience so you're not in position to give advice. I don't know to write what you say out of being arrogant, but I have to wonder if you actually paid attention in your SSM class. Raising FUD and non issues as risk doesn't help anyone promote Hadoop, regardless of the vendor. What it does is cause the stakeholders reason to pause. Overstating risks can cause just as much harm as over promising results. Again, its Sales 101. Perhaps you're still trying to convert these folks off Hadoop on to IBM's DB2? No wait, that was someone else... and it wasn't Hadoop, it was Informix. (Sorry to the list, that was an inside joke that probably went over Tom's head, but for someone's benefit.) To help drill the point of the issue home... 1) Look at MapR, an IBM competitor who's derivative already solves this SPOF problem. 2) Look at how to set up a cluster (Apache, HortonWorks, Cloudera) where you can mitigate this by your node configuration along with simple sysadmin tricks like NFS mounting a drive from a different machine within the cluster (Preferably a different rack for a back up.) 3) Think about your backup and recovery of your Name Node's files. There's more, and I would encourage you to actually talk to a professional before giving out advice. ;-) HTH -Mike PS. My last PS talked about the big power switch in a switch box in the machine room that cuts the power. (When its a lever, do you really need to tell someone that its not a light switch? And I guess you could padlock it too) Seriously, there is more risk to data loss and corruption based on luser issues than there is of a SPOF (NN failure). To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop From: tdeut...@us.ibm.com Date: Wed, 21 Sep 2011 06:20:53 -0700 I am truly sorry if at some point in your life someone dropped an IBM logo on your head and it left a dent - but you are being a jerk. 
Right after you were engaging in your usual condescension a person from Xerox posted on the very issue you were blowing off. Things happen. To any system. I'm not knocking Hadoop - and frankly making sure new users have a good experience based on the real things that need to be aware of / manage is in everyone's interests here to grow the footprint. Please take note that no where in here have I ever said anything to discourage Hadoop deployments/use or anything that is vendor specific. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 09/20/2011 02:52 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: risks of using Hadoop Tom, I think it is arrogant to parrot FUD when you've never had your hands dirty in any real Hadoop environment. So how could your response reflect the operational realities of running a Hadoop cluster? What Brian was saying was that the SPOF is an over played FUD trump card. Anyone who's built clusters will have mitigated the risks of losing the NN. Then there's MapR... where you don't have a SPOF. But again that's a derivative of Apache Hadoop. (Derivative isn't a bad thing...) You're right that you need
RE: risks of using Hadoop
Kobina The points 1 and 2 are definitely real risks. SPOF is not. As I pointed out in my mini-rant to Tom was that your end users / developers who use the cluster can do more harm to your cluster than a SPOF machine failure. I don't know what one would consider a 'long learning curve'. With the adoption of any new technology, you're talking at least 3-6 months based on the individual and the overall complexity of the environment. Take anyone who is a strong developer, put them through Cloudera's training, plus some play time, and you've shortened the learning curve. The better the java developer, the easier it is for them to pick up Hadoop. I would also suggest taking the approach of hiring a senior person who can cross train and mentor your staff. This too will shorten the runway. HTH -Mike Date: Wed, 21 Sep 2011 17:02:45 +0100 Subject: Re: risks of using Hadoop From: kobina.kwa...@gmail.com To: common-user@hadoop.apache.org Jignesh, Will your point 2 still be valid if we hire very experienced Java programmers? Kobina. On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote: @Kobina 1. Lack of skill set 2. Longer learning curve 3. Single point of failure @Uma I am curious to know about .20.2 is that stable? Is it same as the one you mention in your email(Federation changes), If I need scaled nameNode and append support, which version I should choose. Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is updating the Hadoop API. When that will be integrated with Hadoop. If I need -Jignesh On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote: Hi Kobina, Some experiences which may helpful for you with respective to DFS. 1. Selecting the correct version. I will recommend to use 0.20X version. This is pretty stable version and all other organizations prefers it. Well tested as well. Dont go for 21 version.This version is not a stable version.This is risk. 2. You should perform thorough test with your customer operations. (of-course you will do this :-)) 3. 0.20x version has the problem of SPOF. If NameNode goes down you will loose the data.One way of recovering is by using the secondaryNameNode.You can recover the data till last checkpoint.But here manual intervention is required. In latest trunk SPOF will be addressed bu HDFS-1623. 4. 0.20x NameNodes can not scale. Federation changes included in latest versions. ( i think in 22). this may not be the problem for your cluster. But please consider this aspect as well. 5. Please select the hadoop version depending on your security requirements. There are versions available for security as well in 0.20X. 6. If you plan to use Hbase, it requires append support. 20Append has the support for append. 0.20.205 release also will have append support but not yet released. Choose your correct version to avoid sudden surprises. Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 3:42 am Subject: Re: risks of using Hadoop To: common-user@hadoop.apache.org We are planning to use Hadoop in my organisation for quality of servicesanalysis out of CDR records from mobile operators. We are thinking of having a small cluster of may be 10 - 15 nodes and I'm preparing the proposal. my office requires that i provide some risk analysis in the proposal. thank you. On 16 September 2011 20:34, Uma Maheswara Rao G 72686 mahesw...@huawei.comwrote: Hello, First of all where you are planning to use Hadoop? 
Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 0:41 am Subject: risks of using Hadoop To: common-user common-user@hadoop.apache.org Hello, Please can someone point some of the risks we may incur if we decide to implement Hadoop? BR, Isaac.
RE: risks of using Hadoop
Tom, I think it is arrogant to parrot FUD when you've never had your hands dirty in any real Hadoop environment. So how could your response reflect the operational realities of running a Hadoop cluster? What Brian was saying was that the SPOF is an over played FUD trump card. Anyone who's built clusters will have mitigated the risks of losing the NN. Then there's MapR... where you don't have a SPOF. But again that's a derivative of Apache Hadoop. (Derivative isn't a bad thing...) You're right that you need to plan accordingly, however from risk perspective, this isn't a risk. In fact, I believe Tom White's book has a good layout to mitigate this and while I have First Ed, I'll have to double check the second ed to see if he modified it. Again, the point Brian was making and one that I agree with is that the NN as a SPOF is an overblown 'risk'. You have a greater chance of data loss than you do of losing your NN. Probably the reason why some of us are a bit irritated by the SPOF reference to the NN is that its clowns who haven't done any work in this space, pick up on the FUD and spread it around. This makes it difficult for guys like me from getting anything done because we constantly have to go back and reassure stake holders that its a non-issue. With respect to naming vendors, I did name MapR outside of Apache because they do have their own derivative release that improves upon the limitations found in Apache's Hadoop. -Mike PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything. Now would you consider this a risk? Sure. But is it something you should really lose sleep over? Do you understand that there are risks and there are improbable risks? To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop From: tdeut...@us.ibm.com Date: Tue, 20 Sep 2011 12:48:05 -0700 No worries Michael - it would be stretch to see any arrogance or disrespect in your response. Kobina has asked a fair question, and deserves a response that reflects the operational realities of where we are. If you are looking at doing large scale CDR handling - which I believe is the use case here - you need to plan accordingly. Even you use the term mitigate - which is different than prevent. Kobina needs an understanding of that they are looking at. That isn't a pro/con stance on Hadoop, it is just reality and they should plan accordingly. (Note - I'm not the one who brought vendors into this - which doesn't strike me as appropriate for this list) Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 09/17/2011 07:37 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: risks of using Hadoop Gee Tom, No disrespect, but I don't believe you have any personal practical experience in designing and building out clusters or putting them to the test. Now to the points that Brian raised.. 1) SPOF... it sounds great on paper. Some FUD to scare someone away from Hadoop. But in reality... you can mitigate your risks by setting up raid on your NN/HM node. You can also NFS mount a copy to your SN (or whatever they're calling it these days...) Or you can go to MapR which has redesigned HDFS which removes this problem. But with your Apache Hadoop or Cloudera's release, losing your NN is rare. 
Yes it can happen, but not your greatest risk. (Not by a long shot) 2) Data Loss. You can mitigate this as well. Do I need to go through all of the options and DR/BCP planning? Sure there's always a chance that you have some Luser who does something brain dead. This is true of all databases and systems. (I know I can probably recount some of IBM's Informix and DB2 having data loss issues. But that's a topic for another time. ;-) I can't speak for Brian, but I don't think he's trivializing it. In fact I think he's doing a fine job of level setting expectations. And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their current release does address points 3 and 4 again making their risks moot. (At least if you're using MapR) -Mike Subject: Re: risks of using Hadoop From: tdeut...@us.ibm.com Date: Sat, 17 Sep 2011 17:38:27 -0600 To: common-user@hadoop.apache.org I disagree Brian - data loss and system down time (both potentially non-trival) should not be taken lightly. Use cases and thus availability requirements do vary, but I would not encourage anyone to shrug them off as overblown, especially as Hadoop become more production oriented in utilization
RE: Using HBase for real time transaction
Since Tom isn't technical... ;-) The short answer is No. HBase is not capable of being a transactional database because it doesn't support transactions. Nor is HBase ACID compliant. Having said that, yes you can use HBase to serve data in real time. HTH -Mike Subject: Re: Using HBase for real time transaction From: jign...@websoft.com Date: Tue, 20 Sep 2011 17:25:17 -0400 To: common-user@hadoop.apache.org Tom, Let me reword: can HBase be used as a transactional database (i.e. as a replacement for mysql)? The requirement is to have real time read and write operations. I mean as soon as data is written the user should see the data (here the data should be written in Hbase). -Jignesh On Sep 20, 2011, at 5:11 PM, Tom Deutsch wrote: Real-time means different things to different people. Can you share your latency requirements from the time the data is generated to when it needs to be consumed, or how you are thinking of using Hbase in the overall flow? Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Jignesh Patel jign...@websoft.com 09/20/2011 12:57 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Using HBase for real time transaction We are exploring the possibility of using HBase for real time transactions. Is that possible? -Jignesh
RE: Using HBase for real time transaction
Date: Tue, 20 Sep 2011 15:05:31 -0700 Subject: Re: Using HBase for real time transaction From: jdcry...@apache.org To: common-user@hadoop.apache.org While HBase isn't ACID-compliant, it does have some guarantees: http://hbase.apache.org/acid-semantics.html J-D I think there has to be some clarification. The OP was asking about a mySQL replacement. HBase will never be an RDBMS replacement. No Transactions means no way of doing OLTP. It's the wrong tool for that type of work. Sure I know I can kludge something but it's not worth the effort. Choose a better tool like a real database... e.g. Informix. Recognize what HBase is and what it is not. This doesn't mean you can't take in or deliver data in real time, it can. So if you want to use it in a real time manner, sure. Note that like with other databases, you will have to do some work to handle real time data. I guess you would have to provide a specific use case on what you want to achieve in order to know if it's a good fit. HTH -Mike On Tue, Sep 20, 2011 at 2:56 PM, Michael Segel michael_se...@hotmail.com wrote: Since Tom isn't technical... ;-) The short answer is No. HBase is not capable of being a transactional database because it doesn't support transactions. Nor is HBase ACID compliant. Having said that, yes you can use HBase to serve data in real time. HTH -Mike
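For what it is worth, here is a minimal, hedged sketch (the table, family and qualifier names are hypothetical) using the HBase client API of that era. It shows the kind of guarantee HBase does offer, an atomic check-and-put on a single row, and also where the guarantees stop: there are no multi-row or multi-table transactions, which is why it is not a drop-in replacement for mysql-style OLTP.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "accounts");   // hypothetical table

        // The Put is applied only if the current value of cf:balance is "100";
        // the check and the write are atomic, but only within this one row.
        Put put = new Put(Bytes.toBytes("user-42"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("balance"), Bytes.toBytes("90"));
        boolean applied = table.checkAndPut(Bytes.toBytes("user-42"),
                Bytes.toBytes("cf"), Bytes.toBytes("balance"),
                Bytes.toBytes("100"), put);

        System.out.println("update applied: " + applied);
        table.close();
    }
}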
RE: risks of using Hadoop
Gee Tom, No disrespect, but I don't believe you have any personal practical experience in designing and building out clusters or putting them to the test. Now to the points that Brian raised.. 1) SPOF... it sounds great on paper. Some FUD to scare someone away from Hadoop. But in reality... you can mitigate your risks by setting up raid on your NN/HM node. You can also NFS mount a copy to your SN (or whatever they're calling it these days...) Or you can go to MapR which has redesigned HDFS which removes this problem. But with your Apache Hadoop or Cloudera's release, losing your NN is rare. Yes it can happen, but not your greatest risk. (Not by a long shot) 2) Data Loss. You can mitigate this as well. Do I need to go through all of the options and DR/BCP planning? Sure there's always a chance that you have some Luser who does something brain dead. This is true of all databases and systems. (I know I can probably recount some of IBM's Informix and DB2 having data loss issues. But that's a topic for another time. ;-) I can't speak for Brian, but I don't think he's trivializing it. In fact I think he's doing a fine job of level setting expectations. And if you talk to Ted Dunning of MapR, I'm sure he'll point out that their current release does address points 3 and 4 again making their risks moot. (At least if you're using MapR) -Mike Subject: Re: risks of using Hadoop From: tdeut...@us.ibm.com Date: Sat, 17 Sep 2011 17:38:27 -0600 To: common-user@hadoop.apache.org I disagree Brian - data loss and system down time (both potentially non-trival) should not be taken lightly. Use cases and thus availability requirements do vary, but I would not encourage anyone to shrug them off as overblown, especially as Hadoop become more production oriented in utilization. --- Sent from my Blackberry so please excuse typing and spelling errors. - Original Message - From: Brian Bockelman [bbock...@cse.unl.edu] Sent: 09/17/2011 05:11 PM EST To: common-user@hadoop.apache.org Subject: Re: risks of using Hadoop On Sep 16, 2011, at 11:08 PM, Uma Maheswara Rao G 72686 wrote: Hi Kobina, Some experiences which may helpful for you with respective to DFS. 1. Selecting the correct version. I will recommend to use 0.20X version. This is pretty stable version and all other organizations prefers it. Well tested as well. Dont go for 21 version.This version is not a stable version.This is risk. 2. You should perform thorough test with your customer operations. (of-course you will do this :-)) 3. 0.20x version has the problem of SPOF. If NameNode goes down you will loose the data.One way of recovering is by using the secondaryNameNode.You can recover the data till last checkpoint.But here manual intervention is required. In latest trunk SPOF will be addressed bu HDFS-1623. 4. 0.20x NameNodes can not scale. Federation changes included in latest versions. ( i think in 22). this may not be the problem for your cluster. But please consider this aspect as well. With respect to (3) and (4) - these are often completely overblown for many Hadoop use cases. If you use Hadoop as originally designed (large scale batch data processing), these likely don't matter. If you're looking at some of the newer use cases (low latency stuff or time-critical processing), or if you architect your solution poorly (lots of small files), these issues become relevant. Another case where I see folks get frustrated is using Hadoop as a plain old batch system; for non-data workflows, it doesn't measure up against specialized systems. 
You really want to make sure that Hadoop is the best tool for your job. Brian
RE: risks of using Hadoop
Risks? Well if you come to Hadoop World in Nov, we actually have a presentation that might help reduce some of your initial risks. There are always risks when starting a new project. Regardless of the underlying technology, you have costs associated with failure and unless you can level set expectations you'll increase your odds of failure. Best advice... don't listen to sales critters or marketing folks. ;-) [Right Tom?] They have an agenda. ;-) Date: Fri, 16 Sep 2011 20:11:20 +0100 Subject: risks of using Hadoop From: kobina.kwa...@gmail.com To: common-user@hadoop.apache.org Hello, Please can someone point some of the risks we may incur if we decide to implement Hadoop? BR, Isaac.
RE: Hadoop multi tier backup
Matthew, the short answer is hire a consultant to work with you on your DR/BCP strategy. :-) Short of that... you have a couple of things... Your back-up cluster, is it in the same site? (What happens when site goes down?) Are you planning to make your back up cluster and main cluster homogenous? By this I mean if your main cluster has 1PB of disk w 4x2TB or 4x3TB drives, will your backup cluster have the same configuration? (You may want to consider asymmetry in designing your clusters) So your backup cluster has fewer nodes but more drives per node. You also have to look at your data. Are your data sets small and discrete? If so, you could probably back them up to tape, (snapshots) , just in case of human error and you didn't catch it in time and the error gets propagated to your backup cluster. I haven't played with fuse, so I don't know if there are any performance issues, but on a back up cluster, I don't think its much of an issue. From: matthew.go...@monsanto.com To: common-user@hadoop.apache.org; cdh-u...@cloudera.org Subject: Hadoop multi tier backup Date: Tue, 30 Aug 2011 16:54:07 + All, We were discussing how we would backup our data from the various environments we will have and I was hoping someone could chime in with previous experience in this. My primary concern about our cluster is that we would like to be able to recover anything within the last 60 days so having full backups both on tape and through distcp is preferred. Out initial thoughts can be seen in the jpeg attached but just in case any of you are weary of attachments it can also be summarized below: Prod Cluster --DistCp-- On-site Backup cluster with Fuse mount point running NetBackup daemon --NetBackup-- Media Server -- Tape One of our biggest grey areas so far is how do most people accomplish incremental backups? Our thought was to tie this into our NetBackup configuration as this can be done for other connectors but we do not see anything for HDFS yet. Thanks, Matt This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
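For what it's worth, the distcp leg of that pipeline can also be driven from Java rather than the shell. Below is a minimal sketch, not a definitive recipe: it assumes the 0.20-era org.apache.hadoop.tools.DistCp class (from the Hadoop tools jar) with its Configuration constructor, and the two cluster URIs are placeholders. The -update flag only copies files that are missing or changed on the target, which is the usual building block for incremental distcp runs; HDFS itself has no change log to drive a NetBackup-style incremental, so most people approximate it by partitioning data by date and copying only the new directories.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class BackupCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String src = "hdfs://prod-nn:8020/data";    // placeholder: production cluster
        String dst = "hdfs://backup-nn:8020/data";  // placeholder: on-site backup cluster
        // -update skips files that are already present and unchanged on the target
        int rc = ToolRunner.run(new DistCp(conf), new String[] { "-update", src, dst });
        System.exit(rc);
    }
}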
RE: Hadoop--store a sequence file in distributed cache?
Sofia, I was about to say that if your file is already on hdfs, you should just be able to open it. But as I type this, I have this thing kicking me in the back of the head reminding me that you may not be able to access the hdfs file at the same time someone else is accessing it? (Going from memory, is there an exclusive lock on the file when you open it in HDFS?) If not, you can just use your file. If so, you will need to use distributed cache which copies a copy of the file to some place local on each node running the task. Within your task you need to query the distributed cache for your file and get the path to the file so you can open it. Depending on the size of your index... which can get large, you need to open the file once and just reset to the beginning of the file. My suggestion is to consider putting your RTree into HBase. So HBase contains your index. Date: Sat, 13 Aug 2011 03:02:32 -0700 From: geosofie_...@yahoo.com Subject: Re: Hadoop--store a sequence file in distributed cache? To: common-user@hadoop.apache.org Good morning, I am a little confused, I have to say. A summury of the project first: I want to examine how an Rtree on HDFS would speed up spatial queries like point/range queries, that normally target a very small part of the original input. I have built my Rtree on HDFS, and now I need to answer queries using it. I thought I could make an MR Job that takes as input a text file where each line is a query (for example we have 2 queries). To answer the queries efficiently, I need to check some information about the root nodes of the tree, which are stored in R files (R=the #reducers of the previous job). These files are small in size and are read from every mapper, thus the idea of distributed cache fits, right? I have built an ArrayList during setup() to avoid opening all the files in distributed cache, and open only 3-4 of them for example. I agree, though, that opening and closing these files so many times is an important overhead. I think however, that opening these files from HDFS rather than distributed cache would be even worse, since the file accessing operations in HDFS are much more expensive than accessing files locally. Thank you all for your response, I would be glad to have more feedback. Sofia From: GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Friday, August 12, 2011 7:05 PM Subject: RE: Hadoop--store a sequence file in distributed cache? Sofia correct me if I am wrong, but Mike I think this thread was about using the output of a previous job, in this case already in sequence file format, as in memory join data for another job. Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs just reading from HDFS (join data not binary files)? I always thought that having a setup phase on a mapper read directly from HDFS was a asking for trouble and that you should always distribute to each node but I am hearing more and more people say to just read directly from HDFS for larger file sizes to avoid the IO cost of the distributed cache. Matt -Original Message- From: Ian Michael Gumby [mailto:michael_se...@hotmail.com] Sent: Friday, August 12, 2011 10:54 AM To: common-user@hadoop.apache.org Subject: RE: Hadoop--store a sequence file in distributed cache? This whole thread doesn't make a lot of sense. 
If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use the distributed cache, since the output of the first m/r job is going to be in HDFS. (Dino is correct on that account.) Sofia replied saying that she needed to open and close the sequence file to access the data in each Mapper.map() call. Without knowing more about the specific app, Ashook is correct that you could read the file in Mapper.setup() and then access it in memory. Joey is correct that you can put anything in the distributed cache, but you don't want to put an HDFS file into the distributed cache. The distributed cache is a tool for taking something from your job and distributing it to each task tracker node as a local object. It does have a bit of overhead. A better example is if you're distributing binary objects that you want on each node, e.g. a C++ .so file that you want to call from within your Java m/r. If you're not using all of the data in the sequence file, what about using HBase? From: ash...@clearedgeit.com To: common-user@hadoop.apache.org Date: Fri, 12 Aug 2011 09:06:39 -0400 Subject: RE: Hadoop--store a sequence file in distributed cache? If you are looking for performance gains, then possibly reading these files once during the setup() call in your Mapper and storing them in some data structure like a Map or a List will
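To make the read-once suggestion concrete, here is a minimal sketch using the 0.20 mapreduce API. The path and the Text key/value types are assumptions (Sofia's actual R-tree record types weren't posted); the point is simply that the small root-node file is opened from HDFS once in setup() and kept in memory, instead of being reopened in every map() call.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> rootNodes = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path rootFile = new Path("/user/sofia/rtree/part-r-00000"); // placeholder path
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, rootFile, conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {   // the file is small, so read it all once
            rootNodes.put(key.toString(), value.toString());
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text query, Context context)
            throws IOException, InterruptedException {
        // each input line is one query; answer it against the in-memory root index
        // (lookup logic omitted; it depends entirely on the R-tree layout)
    }
}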
RE: Speed up node under replicated block during decommission
Just a thought... The really quick and dirty thing to do is to turn off the node. Within 10 minutes the node looks down to the JT and NN, so it gets marked as down. Run an fsck and it will show the files as under-replicated, and the cluster will then re-replicate at the faster speed to rebalance. (100MB/sec should be ok on a 1GbE link.) Then you can drop the next node... much faster than trying to decommission the node. It's not the best way to do it, but it works. From: ha...@cloudera.com Date: Fri, 12 Aug 2011 22:38:08 +0530 Subject: Re: Speed up node under replicated block during decommission To: common-user@hadoop.apache.org It could be that your process has hung because a particular resident block (file) requires a very large replication factor, and your remaining # of nodes is less than that value. This is a genuine reason for a hang (but must be fixed). The process usually waits until there are no under-replicated blocks, so I'd use fsck to check if any such ones are present and setrep them to a lower value. On Fri, Aug 12, 2011 at 9:28 PM, jonathan.hw...@accenture.com wrote: Hi All, I'm trying to decommission a data node from my cluster. I put the data node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name nodes. The under-replicated blocks are starting to replicate, but it's going at a very slow pace. For 1 TB of data it takes over 1 day to complete. We changed the settings as below to try to increase the replication rate. Added this to hdfs-site.xml on all the nodes on the cluster and restarted the data nodes and name node processes.
<property>
  <!-- 100Mbit/s -->
  <name>dfs.balance.bandwidthPerSec</name>
  <value>131072000</value>
</property>
Speed didn't seem to pick up. Do you know what may be happening? Thanks! Jonathan -- Harsh J
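If fsck does turn up a file whose replication factor is higher than what the shrinking cluster can satisfy, the setrep suggestion above can also be done programmatically. A minimal sketch only; the path and the new factor of 2 are placeholders, not a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // lower the file to 2 replicas so decommissioning isn't stuck waiting on a
        // replication factor the remaining nodes can never reach
        boolean ok = fs.setReplication(new Path("/data/overreplicated/file"), (short) 2);
        System.out.println("setReplication returned " + ok);
    }
}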
RE: Moving Files to Distributed Cache in MapReduce
Yeah, I'll write something up and post it on my web site. Definitely not InfoQ stuff, but a simple tip and tricks stuff. -Mike Subject: Re: Moving Files to Distributed Cache in MapReduce From: a...@apache.org Date: Sun, 31 Jul 2011 19:21:14 -0700 To: common-user@hadoop.apache.org We really need to build a working example to the wiki and add a link from the FAQ page. Any volunteers? On Jul 29, 2011, at 7:49 PM, Michael Segel wrote: Here's the meat of my post earlier... Sample code on putting a file on the cache: DistributedCache.addCacheFile(new URI(path+MyFileName,conf)); Sample code in pulling data off the cache: private Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration()); boolean exitProcess = false; int i=0; while (!exit){ fileName = localFiles[i].getName(); if (fileName.equalsIgnoreCase(model.txt)){ // Build your input file reader on localFiles[i].toString() exitProcess = true; } i++; } Note that this is SAMPLE code. I didn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. Also I set exit to false because its easier to read this as Do this loop until the condition exitProcess is true. When you build your file reader you need the full path, not just the file name. The path will vary when the job runs. HTH -Mike From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? 
A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
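Since the archive above mangled the string literals in the earlier sample (and the loop variable), here is a cleaned-up, compilable version of the same idea, assuming the Hadoop 0.20 new-API Mapper; the file name model.txt and the HDFS path are placeholders. The driver registers the HDFS file with the distributed cache, and the task finds its local copy by name and opens it via its full local path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // driver side: register an HDFS file with the distributed cache
    public static void addModelFile(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/hadoop/model.txt"), conf);
    }

    // task side: locate the local copy by name, then open it with the full local path
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private BufferedReader modelReader;

        @Override
        protected void setup(Context context) throws IOException {
            Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (localFiles == null) {
                throw new IOException("nothing found in the distributed cache");
            }
            for (Path p : localFiles) {
                if (p.getName().equalsIgnoreCase("model.txt")) {
                    modelReader = new BufferedReader(new FileReader(p.toString()));
                    break;
                }
            }
            if (modelReader == null) {
                throw new IOException("model.txt not found in the distributed cache");
            }
        }
    }
}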
RE: Hadoop cluster network requirement
Yeah, what he said. It's never a good idea. Forget about losing a NN or a rack; think about just losing connectivity between data centers. (It happens more than you think.) Your entire cluster in both data centers goes down. Boom! It's a bad design. You're better off running two different clusters. Is anyone really trying to sell this as a design? That's even more scary. Subject: Re: Hadoop cluster network requirement From: a...@apache.org Date: Sun, 31 Jul 2011 20:28:53 -0700 To: common-user@hadoop.apache.org; saq...@margallacomm.com On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote: Thanks, I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario? Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems: 1) no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!) 2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm 3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and is going to cause delays for the DN heartbeats 4) I don't even want to think about rebalancing. ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it. If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
RE: Moving Files to Distributed Cache in MapReduce
I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
RE: Moving Files to Distributed Cache in MapReduce
Here's the meat of my post earlier... Sample code on putting a file on the cache: DistributedCache.addCacheFile(new URI(path+MyFileName,conf)); Sample code in pulling data off the cache: private Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration()); boolean exitProcess = false; int i=0; while (!exit){ fileName = localFiles[i].getName(); if (fileName.equalsIgnoreCase(model.txt)){ // Build your input file reader on localFiles[i].toString() exitProcess = true; } i++; } Note that this is SAMPLE code. I didn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. Also I set exit to false because its easier to read this as Do this loop until the condition exitProcess is true. When you build your file reader you need the full path, not just the file name. The path will vary when the job runs. HTH -Mike From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org jobConf is deprecated in 0.20.2 I believe; you're supposed to be using Configuration for that On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for jobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using -files option in your hadoop jar command as: /usr/bin/hadoop jar jar name main class name -files absolute path of file to be added to distributed cache input dir output dir On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI(/user/hadoop/thefile.dat), conf); However I am still having trouble locating the file in the distributed cache. *How do I call the file path of thefile.dat in the distributed cache as a string?* I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
RE: Hadoop upgrade Java version
Yeah... you can do that... I haven't tried to mix/match different releases within a cluster, although I suspect I could without any problems, but I don't want to risk it. Until we have a problem, or until we expand our clouds with a batch of new nodes, I like to follow the mantra... if it aint broke, don't fix it. (I would suggest if / when you upgrade your java that you bounce the cloud. Even with a rolling restart, you have to plan for it...) Date: Mon, 18 Jul 2011 22:54:54 -0500 Subject: RE: Hadoop upgrade Java version From: jshrini...@gmail.com To: common-user@hadoop.apache.org We are using Oracle JDK 6 update 26 and have not observed any problems so far. EA of JDK 6 update 27 is available now. We are planning to move to update 27 when the GA release is made available. -Shrinivas On Jul 18, 2011 7:52 PM, Michael Segel michael_se...@hotmail.com wrote: Any release after _21 seems to work fine. CC: highpoint...@gmail.com; common-user@hadoop.apache.org From: john.c.st...@gmail.com Subject: Re: Hadoop upgrade Java version Date: Mon, 18 Jul 2011 19:37:02 -0600 To: common-user@hadoop.apache.org We're using u26 without any problems. On Jul 18, 2011, at 4:45 PM, highpointe highpoint...@gmail.com wrote: So uhm yeah. Thanks for the Informica commercial. Now back to my original question. Anyone have a suggestion on what version of Java I should be using with the latest Hadoop release. Sent from my iPhone On Jul 18, 2011, at 11:26 AM, high pointe highpoint...@gmail.com wrote: We are in the process of upgrading to the most current version of Hadoop. At the same time we are in need of upgrading Java. We are currently running u17. I have read elsewhere that u21 or up is the best route to go. Currently the version is u26. Has anyone gone all the way to u26 with or without issues? Thanks for the help.
RE: Which release to use?
EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future, however, a Greenplum appliance may not get the customer transaction that their reselling of MapR will generate. It sounds like they are hedging their bets and are taking an 'IBM' approach. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Steve, I read your blog nice post - I believe EMC is selling the Greenplumb solution as an appliance - Cheers - Jeffery -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see discussion on hadoop-0.23 with a raft newer features such as HDFS Federation, NextGen MapReduce and plans for HA NameNode etc. As with most successful projects there are several options for commercial support to Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users and to move it forward at a rapid rate. Arun makes a good point which is that the Apache project depends on contributions from the community to thrive. That includes -bug reports -patches to fix problems -more tests -documentation improvements: more examples, more on getting started, troubleshooting, etc. If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale, yes, other people can test large clusters too -but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release candidate releases in your infrastructure, helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working
RE: Which release to use?
Tom, I'm not sure that you're really honoring the purpose and approach of this list. I mean on the one hand, you're not under any obligation to respond or participate on the list. And I can respect that. You're not in an SD role so you're not 'customer facing' and not used to having to deal with these types of questions. On the other, you're not being free with your information. So when this type of question comes up, it becomes very easy to discount IBM as a release or source provider for commercial support. Without information, I'm afraid that I may have to make recommendations to my clients that may be out of date. There is even some speculation from analysts that recent comments from IBM are more of an indication that IBM is still not ready for prime time. I'm sorry you're not in a position to detail your offering. Maybe by September you might be ready and then talk to our CHUG? -Mike To: common-user@hadoop.apache.org Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sat, 16 Jul 2011 10:29:55 -0700 Hi Rita - I want to make sure we are honoring the purpose/approach of this list. So you are welcome to ping me for information, but let's take this discussion off the list at this point. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Rita rmorgan...@gmail.com 07/16/2011 08:53 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Re: Which release to use? I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -- --- Get your facts first, then you can distort them as you please.--
RE: Which release to use?
Well that's CDH3. :-) And yes, that's because up until the past month... other releases didn't exist w commercial support. Now there are more players as we look at the movement from leading edge to mainstream adopters. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 14:30:39 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Most people are using CH3 - if you need some features from another distro use that - http://www.cloudera.com/hadoop/ I wonder if the Cloudera people realize that CH3 was a pretty happening punk band back in the day (if not they do now = ) http://en.wikipedia.org/wiki/Channel_3_%28band%29 cheers - Jeffery Schmitz Projects and Technology 3737 Bellaire Blvd Houston, Texas 77001 Tel: +1-713-245-7326 Fax: +1 713 245 7678 Email: jeff.schm...@shell.com Intergalactic Proton Powered Electrical Tentacled Advertising Droids! -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Monday, July 18, 2011 2:10 PM To: common-user@hadoop.apache.org Subject: RE: Which release to use? Tom, I'm not sure that you're really honoring the purpose and approach of this list. I mean on the one hand, you're not under any obligation to respond or participate on the list. And I can respect that. You're not in an SD role so you're not 'customer facing' and not used to having to deal with these types of questions. On the other, you're not being free with your information. So when this type of question comes up, it becomes very easy to discount IBM as a release or source provider for commercial support. Without information, I'm afraid that I may have to make recommendations to my clients that may be out of date. There is even some speculation from analysts that recent comments from IBM are more of an indication that IBM is still not ready for prime time. I'm sorry you're not in a position to detail your offering. Maybe by September you might be ready and then talk to our CHUG? -Mike To: common-user@hadoop.apache.org Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sat, 16 Jul 2011 10:29:55 -0700 Hi Rita - I want to make sure we are honoring the purpose/approach of this list. So you are welcome to ping me for information, but let's take this discussion off the list at this point. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Rita rmorgan...@gmail.com 07/16/2011 08:53 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Re: Which release to use? I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? On Fri, Jul 15, 2011 at 12:38 PM, Tom Deutsch tdeut...@us.ibm.com wrote: One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects - but provides a number of extensions (that are compatible) based on customer requests. Not appropriate to say any more on this list, but the info on it is all publically available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject RE: Which release to use? Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. 
Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -- --- Get your facts first, then you can distort them as you please.--
RE: Which release to use?
Date: Mon, 18 Jul 2011 18:19:38 -0700 Subject: Re: Which release to use? From: mcsri...@gmail.com To: common-user@hadoop.apache.org Mike, Just a minor inaccuracy in your email. Here's setting the record straight: 1. MapR directly sells their distribution of Hadoop. Support is from MapR. 2. EMC also sells the MapR distribution, for use on any hardware. Support is from EMC worldwide. 3. EMC also sells a Hadoop appliance, which has the MapR distribution specially built for it. Support is from EMC. 4. MapR also has a free, unlimited, unrestricted version called M3, which has the same 2-5x performance, management and stability improvements, and includes NFS. It is not crippleware, and the unlimited, unrestricted, free use does not expire on any date. Hope that clarifies what MapR is doing. thanks regards, Srivas. Srivas, I'm sorry, I thought I was being clear in that I was only addressing EMC and not MapR directly. I was responding to post about EMC selling a Greenplum appliance. I wanted to point out that EMC will resell MapR's release along with their own (EMC) support. The point I was trying to make was that with respect to derivatives of Hadoop, I believe that MapR has a more compelling story than either EMC or DataStax. IMHO replacing Java HDFS w either GreenPlum or Cassandra has a limited market. When a company is going to look at a M/R solution cost and performance are going to be at the top of the list. MapR isn't cheap but if you look at the features in M5, if they work, then you have a very compelling reason to look at their release. Some of the people I spoke to when I was in Santa Clara were in the beta program. They indicated that MapR did what they claimed. Things are definitely starting to look interesting. -Mike On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel michael_se...@hotmail.comwrote: EMC has inked a deal with MapRTech to resell their release and support services for MapRTech. Does this mean that they are going to stop selling their own release on Greenplum? Maybe not in the near future, however, a Greenplum appliance may not get the customer transaction that their reselling of MapR will generate. It sounds like they are hedging their bets and are taking an 'IBM' approach. Subject: RE: Which release to use? Date: Mon, 18 Jul 2011 08:30:59 -0500 From: jeff.schm...@shell.com To: common-user@hadoop.apache.org Steve, I read your blog nice post - I believe EMC is selling the Greenplumb solution as an appliance - Cheers - Jeffery -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Friday, July 15, 2011 4:07 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 15/07/2011 18:06, Arun C Murthy wrote: Apache Hadoop is a volunteer driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see discussion on hadoop-0.23 with a raft newer features such as HDFS Federation, NextGen MapReduce and plans for HA NameNode etc. As with most successful projects there are several options for commercial support to Hadoop or its derivatives. However, Apache Hadoop has thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users and to move it forward at a rapid rate. 
Arun makes a good point which is that the Apache project depends on contributions from the community to thrive. That includes -bug reports -patches to fix problems -more tests -documentation improvements: more examples, more on getting started, troubleshooting, etc. If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale, yes, other people can test large clusters too -but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release candidate releases in your infrastructure, helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working
RE: Hadoop upgrade Java version
Any release after _21 seems to work fine. CC: highpoint...@gmail.com; common-user@hadoop.apache.org From: john.c.st...@gmail.com Subject: Re: Hadoop upgrade Java version Date: Mon, 18 Jul 2011 19:37:02 -0600 To: common-user@hadoop.apache.org We're using u26 without any problems. On Jul 18, 2011, at 4:45 PM, highpointe highpoint...@gmail.com wrote: So uhm yeah. Thanks for the Informica commercial. Now back to my original question. Anyone have a suggestion on what version of Java I should be using with the latest Hadoop release. Sent from my iPhone On Jul 18, 2011, at 11:26 AM, high pointe highpoint...@gmail.com wrote: We are in the process of upgrading to the most current version of Hadoop. At the same time we are in need of upgrading Java. We are currently running u17. I have read elsewhere that u21 or up is the best route to go. Currently the version is u26. Has anyone gone all the way to u26 with or without issues? Thanks for the help.
RE: Which release to use?
Well I'm sort of curious as to what is in the 'free' version which differentiates from the Apache release? Earlier you wrote that IBM was faithful to the Apache release, plus a few 'extras'. (I think I can find your exact quote and I'm sorry I'm paraphrasing your statements.) This begs two questions... 1) What is IBM providing to your 'customers' to justify the uplift or premium for IBM's brand name. 2) If your release includes components which are not part of the Apache release, is it Apache's Hadoop? or considered a derivative? The interesting thing about #2 is that I don't know if or what represents Hadoop. I mean if you take an earlier release of Hadoop like 20.2 where current is 20.203 and apply a subset of patches that are Apache committed, is this not Apache Hadoop or a derivative work since you are not 100% at the latest release. Note: This is a broader question than just what IBM is releasing but what is meant by saying Hadoop or derived from Hadoop. Clearly DataStax and MapR are derivatives. Cloudera? This goes back to the OP's question 'Which release to use?'... And I have to apologize if I seem a bit suspect on what IBM has to say. When IBM first entered with an announced Hadoop release it was only for 32bit JVM and only on IBM's JVM. The last I heard, IBM's upsell was a configuration tool, which if anyone has built more than one Cloud/Cluster, its pretty much worthless. So it would be interesting to see what IBM is really offering in this space. HTH -Mike Subject: Re: Which release to use? From: tdeut...@us.ibm.com Date: Sun, 17 Jul 2011 14:07:20 -0600 To: common-user@hadoop.apache.org There are two release levels - one is free but most of our customers want our additional engineering so they use Enterprise Edition (which is not free). Happy to answer questions off list. --- Sent from my Blackberry so please excuse typing and spelling errors. - Original Message - From: Steve Loughran [ste...@apache.org] Sent: 07/17/2011 08:34 PM CET To: common-user@hadoop.apache.org Subject: Re: Which release to use? On 16/07/2011 16:53, Rita wrote: I am curious about the IBM product BigInishgts. Where can we download it? It seems we have to register to download it? I think you have to pay to use it
RE: Which release to use?
Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) The issue is that outside of HortonWorks which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -Mike From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Fri, 15 Jul 2011 07:35:45 -0700 Subject: Re: Which release to use? Adarsh, Yahoo! no longer has its own distribution of Hadoop. It has been merged into the 0.20.2XX line so 0.20.203 is what Yahoo is running internally right now, and we are moving towards 0.20.204 which should be out soon. I am not an expert on Cloudera so I cannot really map its releases to the Apache Releases, but their distro is based off of Apache Hadoop with a few bug fixes and maybe a few features like append added in on top of it, but you need to talk to Cloudera about the exact details. For the most part they are all very similar. You need to think most about support, there are several companies that can sell you support if you want/need it. You also need to think about features vs. stability. The 0.20.203 release has been tested on a lot of machines by many different groups, but may be missing some features that are needed in some situations. --Bobby On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Hadoop releases are issued time by time. But one more thing related to hadoop usage, There are so many providers that provides the distribution of Hadoop ; 1. Apache Hadoop 2. Cloudera 3. Yahoo etc. Which distribution is best among them on production usage. I think Cloudera's is best among them. Best Regards, Adarsh Owen O'Malley wrote: On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote: I'm a newbie and I am confused by the Hadoop releases. I thought 0.21.0 is the latest greatest release that I should be using but I noticed 0.20.203 has been released lately, and 0.21.X is marked unstable, unsupported. Should I be using 0.20.203? Yes, I apologize for confusing release numbering, but the best release to use is 0.20.203.0. It includes security, job limits, and many other improvements over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support so it isn't suitable for using with HBase. Most large clusters use a separate version of HDFS for HBase. -- Owen
RE: Which release to use?
See, I knew there was something that I forgot. It all goes back to the question ... 'which release to use'... 2 years ago it was a very simple decision. Now, not so much. :-) And while Arun and Ownen work for a vendor, I do not and I try to follow each company and their offering. As Hadoop goes mainstream, the question of which vendor to choose gets interesting. Just like in the 90's during the database vendor wars, it looks like the vendor who has the best sales force and PR will win. (Not necessarily the best product.) JMHO -Mike Date: Fri, 15 Jul 2011 16:25:55 -0500 Subject: Re: Which release to use? From: markkerz...@gmail.com To: common-user@hadoop.apache.org Steve, this is so well said, do you mind if I repeat it here, http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html Thank you, Mark On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote: On 15/07/2011 15:58, Michael Segel wrote: Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax + Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?) I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (oh, and I didn't care about security)
Re: Deduplication Effort in Hadoop
You don't have dupes because the key has to be unique. Sent from my Palm Pre on AT&T On Jul 14, 2011 11:00 AM, jonathan.hw...@accenture.com <jonathan.hw...@accenture.com> wrote: Hi All, In databases you can define primary keys to ensure no duplicate data gets loaded into the system. Let's say I have around 1 billion records flowing into my system every day and some of these are repeated data (same records). I can use 2-3 columns in the record to match and look for duplicates. What is the best strategy for de-duplication? The duplicated records should only appear within the last 2 weeks. I want a fast way to get the data into the system without much delay. Can HBase or Hive help in any way? Thanks! Jonathan
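To make the 'the key has to be unique' point concrete with HBase: if the row key is built from the two or three columns you would otherwise match on, re-loading the same record simply overwrites the same row, so the table is de-duplicated by construction. A rough sketch only, using the old HTable client API; the table name, column family and field values are invented for the example. The two-week window could then come from a TTL on the column family rather than an explicit purge job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class DedupLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");

        // pretend these three fields came out of one incoming record
        String custId = "12345", txnId = "A-678", day = "2011-07-14";

        // row key = hash of the natural-key columns, so duplicates collapse onto one row
        byte[] rowKey = Bytes.toBytes(MD5Hash.getMD5AsHex(Bytes.toBytes(custId + "|" + txnId + "|" + day)));

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("rest of the record here"));
        table.put(put);
        table.close();
    }
}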
RE: Performance Tuning
Matthew, I understood that Juan was talking about a 2 socket quad core box. We run boxes with the e5500 (xeon quad core ) chips. Linux sees these as 16 cores. Our data nodes are 32GB Ram w 4 x 2TB SATA. Its a pretty basic configuration. What I was saying was that if you consider 1 core for each TT, DN and RS jobs, thats 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread cores'. So you could put up 10 m/r slots on the machine. Note that on the main tasks (TT, DN, RS) I dedicate the physical core. Of course your mileage may vary if you're doing non-standard or normal things. A good starting point is 6 mappers and 4 reducers. And of course YMMV depending on if you're using MapR's release, Cloudera, and if you're running HBase or something else on the cluster. From our experience... we end up getting disk I/O bound first, and then network or memory becomes the next constraint. Really the xeon chipsets are really good. HTH -Mike From: matthew.go...@monsanto.com To: common-user@hadoop.apache.org Subject: RE: Performance Tunning Date: Tue, 28 Jun 2011 14:46:40 + Mike, I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes that he meant 4 physical cores and not HT cores. I was more stating that the starting point should be 1 slot per thread (or hyper-threaded core) but obviously reviewing the results from ganglia, or any other monitoring solution, will help you come up with a more concrete configuration based on the load. My brain might not be working this morning but how did you get the 10 slots again? That seems low for an 8 physical core box but somewhat overextending for a 4 physical core box. Matt -Original Message- From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel Segel Sent: Tuesday, June 28, 2011 7:39 AM To: common-user@hadoop.apache.org Subject: Re: Performance Tunning Matt, You have 2 threads per core, so your Linux box thinks an 8 core box has16 cores. In my calcs, I tend to take a whole core for TT DN and RS and then a thread per slot so you end up w 10 slots per node. Of course memory is also a factor. Note this is only a starting point.you can always tune up. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Per node: 4 cores * 2 processes = 8 slots Datanode: 1 slot Tasktracker: 1 slot Therefore max of 6 slots between mappers and reducers. Below is part of our mapred-site.xml. The thing to keep in mind is the number of maps is defined by the number of input splits (which is defined by your data) so you only need to worry about setting the maximum number of concurrent processes per node. In this case the property you want to hone in on is mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other tuning improvements that can be made but it requires an strong understanding of your job load. 
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
</configuration>
RE: Any reason Hadoop logs cant be directed to a separate filesystem?
Yes, and its called using cron and writing a simple ksh script to clear out any files that are older than 15 days. There may be another way, but that's really the easiest. Date: Thu, 23 Jun 2011 02:44:48 +0530 From: jagaran_...@yahoo.co.in Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem? To: common-user@hadoop.apache.org Hi, Can I limit the log file duration ? I want to keep files for last 15 days only. Regards, Jagaran From: Jack Craig jcr...@carrieriq.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Wed, 22 June, 2011 2:00:23 PM Subject: Re: Any reason Hadoop logs cant be directed to a separate filesystem? Thx to both respondents. Note i've not tried this redirection as I have only production grids available. Our grids are growing and with them, log volume. As until now that log location has been in the same fs as the grid data, so running out of space due log bloat is a growing problem. From your replies, sounds like I can relocate my logs, Cool! But now the tough question, if i set up a too small partition and it runs out of space, will my grid become unstable if hadoop can no longer write to its logs? Thx again, jackc... Jack Craig, Operations CarrierIQ.comhttp://CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 On Jun 22, 2011, at 1:09 PM, Harsh J wrote: Jack, I believe the location can definitely be set to any desired path. Could you tell us the issues you face when you change it? P.s. The env var is used to set the config property hadoop.log.dir internally. So as long as you use the regular scripts (bin/ or init.d/ ones) to start daemons, it would apply fine. On Thu, Jun 23, 2011 at 1:32 AM, Jack Craig jcr...@carrieriq.commailto:jcr...@carrieriq.com wrote: Hi Folks, In the hadoop-env.sh, we find, ... # Where log files are stored. $HADOOP_HOME/logs by default. # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs is there any reason this location could not be a separate filesystem on the name node? Thx, jackc... Jack Craig, Operations CarrierIQ.comhttp://CarrierIQ.com 1200 Villa Ct, Suite 200 Mountain View, CA. 94041 650-625-5456 -- Harsh J
RE: Hadoop Cluster Multi-datacenter
PWC now getting in to Hadoop? Interesting Sanjeev, the simple short answer is that you don't create a cloud that spans a data center. Bad design. You build two clusters one per data center. To: common-user@hadoop.apache.org Subject: Hadoop Cluster Multi-datacenter From: sanjeev.ta...@us.pwc.com Date: Mon, 6 Jun 2011 22:07:51 -0700 Hello, I wanted to know if anyone has any tips or tutorials on howto install the hadoop cluster on multiple datacenters Do you need ssh connectivity between the nodes across these data centers? Thanks in advance for any guidance you can provide. __ The information transmitted, including any attachments, is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited, and all liability arising therefrom is disclaimed. If you received this in error, please contact the sender and delete the material from any computer. PricewaterhouseCoopers LLP is a Delaware limited liability partnership. This communication may come from PricewaterhouseCoopers LLP or one of its subsidiaries.
RE: Why inter-rack communication in mapreduce slow?
Chris, I've gone back through the thread and here's Elton's initial question... On 06/06/11 08:22, elton sky wrote: hello everyone, As I don't have experience with big-scale clusters, I cannot figure out why the inter-rack communication in a mapreduce job is significantly slower than intra-rack. I saw the Cisco Catalyst 4900 series switch can reach up to 320 Gbps forwarding capacity. Connected to 48 nodes with 1 Gbps Ethernet each, there should not be much contention at the switch, should there? Elton's question deals with why connections within the same switch are faster than connections that traverse a set of switches. The issue isn't so much one of the fabric within the switch itself, but the width of the connection between the two switches. If you have 40 Gbps (each direction) on a switch and you want it to communicate seamlessly with machines on the next switch, you have to be able to bond four 10 GbE ports together. (Note: there's a bit more to it, but that's the general idea.) You're going to have a significant slowdown on communication between nodes that are on different racks because of the bandwidth limitations on the ports used to connect the switches, and not the 'fabric' within the switch itself. To your point, you can monitor your jobs and see how much of your work is being done by 'data local' tasks. In one job we had 519 tasks started, of which 482 were 'data local'. So for ~93% of the tasks we didn't have an issue with any network latency. And for the remaining ~7% of tasks, you have to consider what percentage would have pulled their data across a 'trunk'. So yes, network latency isn't going to be a huge factor in terms of improving overall efficiency. However, that's just for Hadoop. What happens when you run HBase? ;-) (You can have more network traffic during a m/r job.) HTH -Mike
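To put rough, purely illustrative numbers on that: 48 nodes at 1 Gbps each can offer up to 48 Gbps toward their top-of-rack switch, but if the trunk to the next rack is, say, two bonded 10 GbE ports (20 Gbps), cross-rack flows are oversubscribed roughly 2.4:1 at the trunk even though the 320 Gbps switching fabric is nowhere near saturated. Intra-rack traffic never touches that trunk, which is exactly the gap Elton was asking about.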
RE: Dynamic Data Sets
James, If I understand correctly, you get a set of immutable attributes plus a state which can change. If you wanted to use HBase... I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming that you're really interested in looking at the state change over time. So what you end up with is one table of immutable attributes, with a unique key, and then another table where you use the same unique key and create columns whose names are timestamps and whose values are the state. HTH -Mike Date: Wed, 13 Apr 2011 18:12:58 -0700 Subject: Dynamic Data Sets From: selek...@yahoo.com To: common-user@hadoop.apache.org I have a requirement where I have large sets of incoming data into a system I own. A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time. What is the best way to run analytical queries on data of such nature? One way is to maintain this data in a separate store, take a snapshot at a point in time, and then import it into HDFS for analysis using Hadoop Map-Reduce. I do not see this approach scaling, since moving data is obviously expensive. If I were to directly maintain this data as Sequence Files in HDFS, how would updates work? I am new to Hadoop/HDFS, so any suggestions/critique is welcome. I know that HBase works around this problem through multi-version concurrency control techniques. Is that the only option? Are there any alternatives? Also note that all aggregation and analysis I want to do is time based, i.e. sum of x on pivot y over a day, 2 days, week, month, etc. For such use cases, is it advisable to use HDFS directly or to use systems built on top of Hadoop like Hive or HBase?
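A rough sketch of the two-table layout Mike suggests, driven through the HBase shell. The table, family, row, and value names are all invented for illustration:

# One table for the immutable attributes, one for state-over-time.
hbase shell <<'EOF'
create 'entity_attrs', 'a'
create 'entity_state', 's'
# Immutable attributes, keyed by the unique id.
put 'entity_attrs', 'unit-00042', 'a:model',  'X100'
put 'entity_attrs', 'unit-00042', 'a:region', 'us-west'
# State changes: the column qualifier is the timestamp, the value is the state.
put 'entity_state', 'unit-00042', 's:1302740000000', 'ACTIVE'
put 'entity_state', 'unit-00042', 's:1302826400000', 'DEGRADED'
EOF

Because column qualifiers sort lexicographically within a family, fixed-width epoch-millisecond qualifiers come back in time order when you read a row of 'entity_state', which is what the time-based aggregation needs; 'entity_attrs' stays write-once.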
RE: live/dead node problem
Rita, When the NameNode doesn't see a heartbeat for about 10 minutes, it recognizes that the node is down. Per the Hadoop online documentation: Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. I was trying to find out if there's an hdfs-site parameter that could be set to decrease this time period, but wasn't successful. HTH -Mike Date: Tue, 29 Mar 2011 08:13:43 -0400 Subject: live/dead node problem From: rmorgan...@gmail.com To: common-user@hadoop.apache.org Hello All, Is there a parameter or procedure to check more aggressively for a live/dead node? Despite my killing the hadoop process, I see the node listed as active for more than 10 minutes on the Live Nodes page. Fortunately, the last contact increments. Using branch-0.21, 0985326 -- --- Get your facts first, then you can distort them as you please.--
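For what it's worth, a hedged sketch of the settings usually cited for this timeout. The NameNode treats a DataNode as dead after roughly 2 * the heartbeat recheck interval + 10 * the heartbeat interval, which with the defaults (5 minutes and 3 seconds) is about 10.5 minutes. The property names vary between releases (heartbeat.recheck.interval on 0.20/1.x-derived code, dfs.namenode.heartbeat.recheck-interval on later releases), so verify against your version; the values below are examples only, not recommendations:

<!-- hdfs-site.xml sketch: shrink the dead-node detection window. -->
<property>
  <!-- heartbeat.recheck.interval on older releases -->
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>60000</value> <!-- milliseconds -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds -->
</property>

With a 60-second recheck interval and 3-second heartbeats, the formula above works out to about 2.5 minutes before a node is marked dead.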
RE: changing node's rack
This may be weird, but I could have sworn that the script is called repeatedly. One simple test would be to change the rack-aware script and print a message when the script is called. Then change the script and see if it catches the change without restarting the cluster. -Mike From: tdunn...@maprtech.com Date: Sat, 26 Mar 2011 15:50:58 -0700 Subject: Re: changing node's rack To: common-user@hadoop.apache.org CC: rmorgan...@gmail.com I think that the namenode remembers the rack. Restarting the datanode doesn't make it forget. On Sat, Mar 26, 2011 at 7:34 AM, Rita rmorgan...@gmail.com wrote: What is the best way to change the rack of a node? I have tried the following: Killed the datanode process. Changed the rackmap file so the node and IP address entry reflect the new rack, and did a '-refreshNodes'. Restarted the datanode. But it seems the datanode keeps getting registered to the old rack. -- --- Get your facts first, then you can distort them as you please.--
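A minimal topology-script sketch, mainly to make Mike's test concrete. The rackmap path, the logger line, and the exact wiring are assumptions; on the 0.20/0.21 line the script is configured via the topology.script.file.name property in core-site.xml:

#!/bin/bash
# rack-topology.sh -- invoked by the NameNode with one or more IPs/hostnames;
# it must print one rack path per argument.
RACKMAP=/etc/hadoop/rackmap.txt          # hypothetical "host rack" file, e.g. "10.1.1.12 /rack1"
logger "topology script called for: $*"  # Mike's test: watch syslog to see when and how often it runs
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$RACKMAP")
  echo "${rack:-/default-rack}"
done

If the logger line keeps firing for a datanode after you edit the rackmap file, you know the script is being re-run and the stale rack is coming from the NameNode's memory rather than from the script.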
RE: CDH and Hadoop
Rita, Short answer... Cloudera's release is free, and they also offer a support contract if you want support from them. Cloudera provides sources, but most people use yum (Red Hat/CentOS) to download an already-built release. Should you use it? Depends on what you want to do. If your goal is to get up and running with Hadoop and then focus on *using* Hadoop/HBase/Hive/Pig/etc., then it makes sense. If your goal is to do a deep dive into Hadoop and get your hands dirty mucking around with the latest and greatest in trunk? Then no. You're better off building your own off the official Apache release. Many companies choose Cloudera's release for the following reasons: * Paid support is available. * Companies focus on using a tech, not developing the tech, so Cloudera does the heavy lifting while client companies focus on 'USING' Hadoop. * Cloudera's release makes sure that the versions in the release work together. That is, when you download CDH3B4, you get a version of Hadoop that will work with the included versions of HBase, Hive, etc. And no, it's never a good idea to try to mix and match Hadoop from different environments and versions in a cluster. (I think it will barf on you.) Does that help? -Mike Date: Wed, 23 Mar 2011 10:29:16 -0400 Subject: CDH and Hadoop From: rmorgan...@gmail.com To: common-user@hadoop.apache.org I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? Do they provide the tars, or do they provide source code that I simply compile? Can I have some data nodes on CDH and the rest on regular Hadoop? I am asking this because so far I have noticed a serious bug (IMO) in the decommissioning process ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e ) -- --- Get your facts first, then you can distort them as you please.--
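For reference, a hedged sketch of what "use yum" looked like for CDH3. The package names below are from memory of the CDH3 repositories, so double-check them against Cloudera's install docs:

# After adding Cloudera's CDH3 yum repository on each node:
yum install hadoop-0.20                                   # common libraries and client
yum install hadoop-0.20-namenode hadoop-0.20-jobtracker   # on the master(s)
yum install hadoop-0.20-datanode hadoop-0.20-tasktracker  # on the worker nodes

The "don't mix and match" point applies here too: every node in a cluster should be installed from the same repository and version.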