streaming but no sorting
Hi. I'm writing streaming-based tasks that involve running thousands of mappers; after that I want to put all their outputs into a small number (say 30) of output files, mainly so that disk space will be used more efficiently. The way I'm doing it right now is using /bin/cat as the reducer and setting the number of reducers to the desired count. This involves two steps that are highly inefficient for the task - sorting and fetching. Is there a way to get around that? Ideally I'd want all mapper outputs to be written to one file, one record per line. Thanks. --- Dmitry Pushkarev +1-650-644-8988
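For context, a short python sketch of what the /bin/cat trick amounts to: a pure pass-through reducer, so the reduce stage's only real effect is merging mapper outputs into as many files as there are reducers. The script and job names in the comment are made up for illustration.

```python
import sys

def identity(records):
    # All /bin/cat does as a streaming reducer: pass every record
    # through unchanged. With the reducer count set to 30, the job
    # merges the mapper outputs into 30 files - at the cost of the
    # sort and fetch phases the question complains about.
    for rec in records:
        yield rec

# In the actual streaming job this would be wired up roughly as:
#   hadoop jar hadoop-streaming.jar -mapper mymapper.py \
#       -reducer identity_reducer.py -jobconf mapred.reduce.tasks=30
merged = list(identity(["rec1\n", "rec2\n"]))
for rec in merged:
    sys.stdout.write(rec)
```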
RE: CloudBurst: Hadoop for DNA Sequence Analysis
As a matter of fact it is nowhere near data-intensive; it does take gigabytes of input data to process, but it is mostly RAM- and CPU-intensive. Post-processing of the alignment files, though, is exactly where hadoop excels. At least as far as I understand, the majority of the time is spent on DP alignment, whereas navigation in seed space and the N*log(N) sort require only a fraction of that time - that was my experience applying a hadoop cluster to sequencing human genomes. --- Dmitry Pushkarev +1-650-644-8988 -Original Message- From: michael.sch...@gmail.com [mailto:michael.sch...@gmail.com] On Behalf Of Michael Schatz Sent: Wednesday, April 08, 2009 9:19 PM To: core-user@hadoop.apache.org Subject: CloudBurst: Hadoop for DNA Sequence Analysis Hadoop Users, I just wanted to announce that my Hadoop application 'CloudBurst' is available open source at: http://cloudburst-bio.sourceforge.net In a nutshell, it is an application for mapping millions of short DNA sequences to a reference genome, for example to map out differences in one individual's genome compared to the reference genome. As you might imagine, this is a very data-intensive problem, but Hadoop enables the application to scale up linearly on large clusters. A full description of the program is available in the journal Bioinformatics: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236 I also wanted to take this opportunity to thank everyone on this mailing list. The discussions posted were essential for navigating the ins and outs of hadoop during the development of CloudBurst. Thanks everyone! Michael Schatz http://www.cbcb.umd.edu/~mschatz
HDD benchmark/checking tool
Dear hadoop users, Recently I have had a number of drive failures that slowed processing down a lot until they were discovered. Is there any easy way, or tool, to check HDD performance and see if there are any IO errors? Currently I use a simple script that looks at /var/log/messages and greps everything abnormal for /dev/sdaX. But if you have a better solution I'd appreciate it if you shared it. --- Dmitry Pushkarev +1-650-644-8988
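The grep described above can be sketched in a few lines of python. The error patterns here are an assumption (common linux kernel messages for failing disks of that era) and should be extended for your particular controller and driver:

```python
import re

# Patterns that usually indicate a failing disk in /var/log/messages.
# This list is illustrative, not exhaustive.
ERROR_PATTERNS = re.compile(
    r'(I/O error|medium error|SMART error|sector|ata\d+: error)',
    re.IGNORECASE)

def suspicious_lines(log_lines, device='sd'):
    """Return log lines that mention the device and match an error pattern."""
    return [line for line in log_lines
            if device in line and ERROR_PATTERNS.search(line)]

sample = [
    "Apr  8 12:00:01 node7 kernel: end_request: I/O error, dev sda, sector 12345",
    "Apr  8 12:00:02 node7 sshd[123]: session opened",
]
bad = suspicious_lines(sample)
```

In practice the script would read the log file (or follow it with a cron job) and alert when `bad` is non-empty.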
RE: Is Hadoop Suitable for me?
Definitely not. You should be looking at expandable Ethernet storage that can be extended by connecting additional SAS arrays (like Dell PowerVault and similar things from other companies). 600MB is just 6 seconds over a gigabit network... --- Dmitry Pushkarev -Original Message- From: Simon [mailto:sim...@bigair.net.au] Sent: Wednesday, January 28, 2009 10:02 PM To: core-user@hadoop.apache.org Subject: Is Hadoop Suitable for me? Hi Hadoop Users, I am trying to build a storage system for an office of about 20-30 users which will store everything, from normal everyday documents to computer configuration files to big files (600MB) which are generated every hour. Is Hadoop suitable for this kind of environment? Regards, Simon
streaming split sizes
Hi. I'm running streaming on a relatively big (2TB) dataset, which is being split by hadoop into 64MB pieces. One of the problems I have with that is that my map tasks take a very long time to initialize (they need to load a 3GB database into RAM) and then finish these 64MB in 10 seconds. So I'm wondering if there is any way to make hadoop give larger splits to map jobs? (The trivial way, of course, would be to split the dataset into N files and feed one file at a time, but is there any standard solution for that?) Thanks. --- Dmitry Pushkarev +1-650-644-8988
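Whatever the split size ends up being, the pattern that makes large splits pay off is loading once per task and then streaming every record past the resident database. A sketch, with load_database() standing in for the real 3GB load:

```python
# Sketch of a streaming mapper that pays the expensive load exactly
# once per task, then does a cheap lookup per record. load_database()
# is a placeholder for the real 3GB in-RAM database.
def load_database():
    return {"chr1": "ACGT"}   # stand-in data

def run_mapper(records, database=None):
    if database is None:
        database = load_database()   # happens once, however big the split
    out = []
    for rec in records:
        # one cheap lookup per record once the database is resident
        out.append("%s\t%s" % (rec, database.get(rec, "unmapped")))
    return out

results = run_mapper(["chr1", "chrX"])
```

With this structure, a 640MB split costs the same single load as a 64MB one, so the initialization overhead is amortized ten times better.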
RE: streaming split sizes
Well, the database is specifically designed to fit into memory, and if it doesn't it will slow things down hundreds of times. One simple hack I came to is to replace the map tasks by /bin/cat and then run 150 reducers that keep the database constantly in memory. Parallelism is also not a problem, since we're running a very small cluster (15 nodes, 120 cores) built specifically for the task. --- Dmitry Pushkarev +1-650-644-8988 -Original Message- From: Delip Rao [mailto:delip...@gmail.com] Sent: Tuesday, January 20, 2009 6:19 PM To: core-user@hadoop.apache.org Subject: Re: streaming split sizes Hi Dmitry, Not a direct answer to your question, but I think the right approach would be to not load your database into memory during config() but instead look the database up from map() via HBase or something similar. That way you don't have to worry about the split sizes. In fact, using fewer splits would limit the parallelism you can achieve, given that your maps are so fast. - delip On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev u...@stanford.edu wrote: Hi. I'm running streaming on a relatively big (2TB) dataset, which is being split by hadoop into 64MB pieces. One of the problems I have with that is that my map tasks take a very long time to initialize (they need to load a 3GB database into RAM) and then finish these 64MB in 10 seconds. So I'm wondering if there is any way to make hadoop give larger splits to map jobs? (The trivial way, of course, would be to split the dataset into N files and feed one file at a time, but is there any standard solution for that?) Thanks. --- Dmitry Pushkarev +1-650-644-8988
Hadoop and Matlab
Hi. Can anyone share experience of successfully parallelizing matlab tasks using hadoop? We have implemented this for python (in the form of a simple module that takes a serialized function and a data array and runs the function on the cluster), but we really have no clue how to do that in Matlab. Ideally we want to use Matlab in the same way - write an .m file that takes a set of parameters and returns some value, specify a list of input parameters (like lists of variables to try for Gaussian kernels), and run it on the cluster in a somewhat failproof manner - that's the ideal situation. Has anyone tried that? --- Dmitry
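The python scheme described above can be sketched in a few lines - serialize a function and a data array on the submitting side, then unpickle and apply on the cluster side. This is an illustration of the idea, not the lab's actual module; here both halves run in one process, and gaussian_score is a made-up stand-in for the real kernel computation:

```python
import pickle

def gaussian_score(x):
    # stand-in for the real per-parameter computation; must be a
    # module-level function so pickle can serialize it by reference
    return x * x

def submit(func, data):
    """What the driver would ship to the cluster: function + data."""
    return pickle.dumps((func, data))

def worker(payload):
    """What each cluster task would execute on its payload."""
    func, data = pickle.loads(payload)
    return [func(x) for x in data]

answers = worker(submit(gaussian_score, [1, 2, 3]))
```

Matlab has no comparable built-in serialization of functions plus data, which is why the same trick does not transfer directly; compiled .m executables driven by streaming are the usual workaround.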
Hadoop and security.
Dear hadoop users, I'm lucky to work in an academic environment where information security is not the question. However, I'm sure that most hadoop users aren't. Here is the question: how secure is hadoop? (or let's say, how foolproof) Here is the answer: http://www.google.com/search?client=opera&rls=en&q=Hadoop+Map/Reduce+Administration&sourceid=opera&ie=utf-8&oe=utf-8 - not quite. What we're seeing here are open hadoop clusters, where anyone capable of installing hadoop and changing his username to webcrawl can use their cluster and read their data, even though the firewall is perfectly installed and ports like ssh are filtered to outsiders. After you've played enough with the data, you can observe that you can submit jobs as well, and these jobs can execute shell commands. Which is very, very sad. In my view, this significantly limits distributed hadoop applications, where part of your cluster may reside on EC2 or another distant datacenter, since you always need to have certain ports open to an array of ip addresses (if your instances are dynamic), which isn't acceptable if anyone in that ip range can connect to your cluster. Can we propose to the developers to introduce some basic user management and access controls to help hadoop take one step further towards a production-quality system? And, by the way, add robots.txt to the default distribution. (But I doubt it will help, as it takes less than a week to scan the whole internet for a given port on home DSL...) --- Dmitry
hadoop under windows.
Hi. I have a strange problem with hadoop when I run jobs under windows (my laptop runs XP, but all cluster machines including the namenode run Ubuntu). I run a job (which runs perfectly under linux, and all configs and Java versions are the same); all mappers finish successfully, and so does the reducer, but when it tries to copy the resulting file to the output directory I get things like: 03.10.2008 21:47:24 *INFO * audit: ugi=Dmitry,mkpasswd,root,None,Administrators,Users ip=/171.65.102.211 cmd=rename src=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0 dst=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0 perm=Dmitry:supergroup:rw-r--r-- (FSNamesystem.java, line 94) And then it deletes the file, and I get no output. Why does it rename the file onto itself, and does it have anything to do with Path.getParent()? Thanks.
jython HBase map/red task
Hi. I'm writing a mapreduce task in jython, and I can't launch ToolRunner.run; jython says TypeError: integer required on the ToolRunner.run line, and I can't get a more detailed explanation. I guess the error is either in ToolRunner or in setConf. What am I doing wrong? :) And can anyone share a sample working MapRed Hadoop job in jython (preferably for HBase)?

class rowcounter(TableMap, Tool):
    def map(self, key, value, output, reporter):
        pass
    def setConf(self, c):
        pass
    def getConf(self):
        return Configuration()
    def run(self, args):
        print "Someday I'll be doing something"
        pass

def main(args):
    ToolRunner.run(Configuration, rowcounter(), args)
    sys.exit()

if __name__ == '__main__':
    main(sys.argv)
RE: Installing Hadoop on OS X 10.5 single node cluster (MacPro) posted to wiki
Awesome - wish I had it a couple of weeks ago. By the way, can someone give me Jython code that interacts with HBase? I want to learn to write simple mapreducers (for example, to go over all rows in a given column, compute something, and put the result into another column). If it works I promise to write a detailed wiki page. -Original Message- From: Sandy [mailto:[EMAIL PROTECTED] Sent: Sunday, September 14, 2008 8:58 PM To: core-user@hadoop.apache.org Subject: Installing Hadoop on OS X 10.5 single node cluster (MacPro) posted to wiki Hi all, I just posted a set of instructions on how to install hadoop on a Mac Pro single node cluster. Please let me know if you see any changes to be made or if you have any suggestions or comments! It can be viewed here: http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster) Thanks, -SM
RE: HDFS
Why not use HAR over HDFS? The idea being that if you don't do too much writing, compacting files into har archives (which will be stored in 64MB slices) might be a good answer. Hence the question for the hadoop developers: is hadoop har-aware? In two senses: 1. Does it try to assign tasks close to the data (i.e., does it know which piece of the har file a given file belongs to)? 2. If a har folder is fed to a map task, will hadoop give all the files to the task together through the local datanode, or will getting each file result in accessing the namenode, resolving, etc.? -Original Message- From: Monchanin Eric [mailto:[EMAIL PROTECTED] Sent: Saturday, September 13, 2008 4:59 PM To: core-user@hadoop.apache.org Subject: Re: HDFS Hi, Thank you all for your answers, I am digging deep into various distributed file systems, so far checking on mogilefs, kfs ... which seem to better fit my needs. I'll be dealing indeed with thousands of small files (a few kB to a few MB, 10 MB top). Our service is based on a web portal driven by php. Though I'm having a hard time compiling any of them (mogilefs with perl and dependencies, kfs with c++ and dependencies). If any one of you has any experience to share, I'll be glad to listen. Cheers, Eric Robert Krüger wrote: Hi Eric, we are currently building a system for a very similar purpose (digital asset management) and we use HDFS currently for a volume of approx. 100TB with the option to scale into the PB range. Since we haven't gone into production yet, I cannot say it will work flawlessly, but so far everything has worked very well, with really good performance (especially read performance, which is probably also in your case the most important factor). The most important thing you have to be aware of IMHO is that you will not have a real file system on the OS level. If you use tools which need that to process the data you will need to do some copying (which we do in some cases).
There is a project out there that makes HDFS available via FUSE, but it appears to be rather alpha, which is why we haven't dared to take a look at it for this project. Apart from the namenode, which you have to make redundant yourself (lots of posts in the archives on this topic), you can simply configure the level of redundancy (see docs). Hope this helps, Robert Monchanin Eric wrote: Hello to all, I have been attracted by the Hadoop project while looking for a solution for my application. Basically, I have an application hosting user-generated content (images, sounds, videos) and I would like to have this available at all times for all my servers. Servers will basically add new content; users can manipulate the existing content, make compositions, etc. We have a few servers (2 for now) dedicated to hosting content, and right now they are connected via sshfs on some folders, in order to shorten the transfer time between these content servers and the application servers. Would the Hadoop filesystem be useful in my case; is it worth digging into? In the case it is doable, how redundant is the system? For instance, to store 1 MB of data, how much storage do I need (I guess at least 2 MB ...)? I hope I made myself clear enough and will get encouraging answers, Bests to all, Eric
RE: namenode multithreaded
I have 15+ million small files I'd like to process and move around. Thus my operations don't really involve the datanodes - they're idle when I, for example, do FS operations (like sorting a bunch of new files written by the tasktracker into appropriate folders). Now I tried to use HADOOP_OPTS=-server and it seems to help a little, but performance still isn't great. Perhaps the problem is in the way I work with the files - it's a perl script over davfs2 over WebDAV, which uses the native API. Can anyone give an example of a jython or jruby file that would recursively go over an hdfs folder and move all files to a different folder? (My programming skills are very modest...) -Original Message- From: Raghu Angadi [mailto:[EMAIL PROTECTED] Sent: Friday, September 12, 2008 9:41 AM To: core-user@hadoop.apache.org Subject: Re: namenode multithreaded The core of namenode functionality happens in a single thread because of a global lock, unfortunately. The other cpus would still be used to some extent by network IO and other threads. Usually we don't see just one cpu at 100% and nothing on the other cpus. What kind of load do you have? Raghu. Dmitry Pushkarev wrote: Hi. My namenode runs on an 8-core server with lots of RAM, but it only uses one core (100%). Is it possible to tell the namenode to use all available cores? Thanks.
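A jython sketch of the recursive move requested above, written from memory against the 0.18-era FileSystem API (listStatus, rename, mkdirs, FileStatus.isDir); the /user/public paths are made up, and the code is untested, so treat it as a starting point rather than a working script:

```python
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

def move_tree(fs, src, dst):
    # rename is a pure namenode metadata operation, so no block data
    # moves; recreate each directory under dst and rename files into it
    fs.mkdirs(dst)
    for status in fs.listStatus(src):
        p = status.getPath()
        target = Path(dst, p.getName())
        if status.isDir():
            move_tree(fs, p, target)
        else:
            fs.rename(p, target)

fs = FileSystem.get(Configuration())
move_tree(fs, Path('/user/public/incoming'), Path('/user/public/sorted'))
```

Because everything goes straight through the Java API in one JVM, this avoids the per-file WebDAV round-trips that make the perl-over-davfs2 approach slow.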
namenode multithreaded
Hi. My namenode runs on an 8-core server with lots of RAM, but it only uses one core (100%). Is it possible to tell the namenode to use all available cores? Thanks.
RE: Thinking about retrieving DFS metadata from datanodes!!!
This will effectively ruin the system at large scale, since you would have to update all blocks whenever you touch the metadata... -Original Message- From: 叶双明 [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 10, 2008 12:06 AM To: core-user@hadoop.apache.org Subject: Re: Thinking about retrieving DFS metadata from datanodes!!! I think we could let each block carry three simple pieces of additional information which aren't used in normal operation: 1. which file it belongs to 2. which block of the file it is 3. how many blocks the file has After the cluster has been destroyed, we can set up a new NameNode and then rebuild the metadata from the information reported by the datanodes. And the cost is a little disk space, indeed less than 1k per block I think. I don't see it as a replacement for multiple NameNodes, but just as a possible mechanism to recover data; the point is 'recover'. hehe~~ thanks. 2008/9/10 Raghu Angadi [EMAIL PROTECTED] The main problem is the complexity of maintaining the accuracy of the metadata. In other words, what do you think is the cost? Do you think writing the fsimage to multiple places helps with the terrorist attack? That is supported even now. Raghu. 叶双明 wrote: Thanks for paying attention to my tentative idea! What I thought of isn't how to store the metadata, but a final (or last-resort) way to recover valuable data in the cluster when the worst (something that destroys the metadata on all the NameNodes) happens. I.e., if a terrorist attack or natural disaster destroys half of the cluster nodes including all the NameNodes, we can recover as much data as possible by this mechanism, and have a big chance of recovering the entire data of the cluster because of the original replication. Any suggestion is appreciated! -- Sorry for my english!! 明
number of tasks on a node.
Hi. How can a node find out how many tasks are being run on it at a given time? I want tasktracker nodes (which are allocated from Amazon EC2) to shut down if nothing has been run for some period of time, but I don't yet see the right way of implementing this.
RE: task assignment management.
How about just specifying the machines to run the task on? I haven't seen that anywhere... -Original Message- From: Devaraj Das [mailto:[EMAIL PROTECTED] Sent: Sunday, September 07, 2008 9:55 PM To: core-user@hadoop.apache.org Subject: Re: task assignment management. No, that is not possible today. However, you might want to look at the TaskScheduler to see if you can implement a scheduler that provides this kind of task scheduling. In the current hadoop, one point regarding computationally intensive tasks is that if a machine is not able to keep up with the rest of the machines (and the task on that machine is running slower than the others), speculative execution, if enabled, can help a lot. Also, implicitly, faster/better machines get more work than the slower machines. On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote: Dear Hadoop users, Is it possible, without using java, to manage task assignment to implement some simple rules? Like: do not launch more than 1 instance of a crawling task on a machine, do not run data-intensive tasks on remote machines, do not run computationally intensive tasks on single-core machines, etc. Right now it's done by failing tasks that decided to run on the wrong machine, but I hope to find some solution on the jobtracker side... --- Dmitry
task assignment management.
Dear Hadoop users, Is it possible, without using java, to manage task assignment to implement some simple rules? Like: do not launch more than 1 instance of a crawling task on a machine, do not run data-intensive tasks on remote machines, do not run computationally intensive tasks on single-core machines, etc. Right now it's done by failing tasks that decided to run on the wrong machine, but I hope to find some solution on the jobtracker side... --- Dmitry
RE: no output from job run on cluster
Hi, I'd check the installed java version; that was the problem in my case, and surprisingly hadoop gave no output. If it helps - can you submit a bug request? :) -Original Message- From: Shirley Cohen [mailto:[EMAIL PROTECTED] Sent: Thursday, September 04, 2008 10:07 AM To: core-user@hadoop.apache.org Subject: no output from job run on cluster Hi, I'm running on hadoop-0.18.0. I have a m-r job that executes correctly in standalone mode. However, when run on a cluster, the same job produces zero output. It is very bizarre. I looked in the logs and couldn't find anything unusual. All I see are the usual deprecated filesystem name warnings. Has this ever happened to anyone? Do you have any suggestions on how I might go about diagnosing the problem? Thanks, Shirley
RE: har/unhar utility
Not quite; I want to be able to create har archives on the local system and then send them to HDFS, and back, since I work with many small files (10kb) and hadoop seems to behave poorly with them. Perhaps HBASE is another option. Is anyone using it in production mode? And do I really need to downgrade to 17.x to install it? -Original Message- From: Devaraj Das [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 03, 2008 3:35 AM To: core-user@hadoop.apache.org Subject: Re: har/unhar utility Are you looking for user documentation on har? If so, here it is: http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html On 9/3/08 3:21 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote: Does anyone have a har/unhar utility? Or at least a format description? It looks pretty obvious, but just in case. Thanks
RE: har/unhar utility
Probably, but the current idea is to bypass writing small files to HDFS by creating my own har archive locally and uploading it. (Small files lower the transfer speed from 40-70MB/s to hundreds of kbps. :( -Original Message- From: Devaraj Das [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 03, 2008 4:00 AM To: core-user@hadoop.apache.org Subject: Re: har/unhar utility You could create a har archive of the small files and then pass the corresponding har filesystem as input to your mapreduce job. Would that work? On 9/3/08 4:24 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote: Not quite; I want to be able to create har archives on the local system and then send them to HDFS, and back, since I work with many small files (10kb) and hadoop seems to behave poorly with them. Perhaps HBASE is another option. Is anyone using it in production mode? And do I really need to downgrade to 17.x to install it? -Original Message- From: Devaraj Das [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 03, 2008 3:35 AM To: core-user@hadoop.apache.org Subject: Re: har/unhar utility Are you looking for user documentation on har? If so, here it is: http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html On 9/3/08 3:21 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote: Does anyone have a har/unhar utility? Or at least a format description? It looks pretty obvious, but just in case. Thanks
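A sketch of the local packing step, using plain tar as the container (creating a real har archive locally would need the hadoop tooling, but the bandwidth argument is the same: one large sequential upload instead of thousands of 10kb ones). The file names and the upload command in the comment are illustrative:

```python
import os
import tarfile
import tempfile

def pack_directory(src_dir, archive_path):
    # One archive means one large sequential HDFS write, instead of
    # thousands of tiny writes that collapse the transfer speed.
    with tarfile.open(archive_path, "w") as tar:
        for name in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, name), arcname=name)
    return archive_path

# tiny demo with throwaway files; the upload step would be e.g.
#   hadoop fs -put pack.tar /dest   (needs a live cluster, so omitted)
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, "read_%d.txt" % i), "w") as f:
        f.write("ACGT")
archive = pack_directory(src, os.path.join(tempfile.mkdtemp(), "pack.tar"))
with tarfile.open(archive) as tar:
    members = tar.getnames()
```

Note the archive is written outside src_dir, so the tar never tries to include itself.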
datanodes in virtual networks.
Dear hadoop users, Our lab is slowly switching from SGE to hadoop, but not everything seems to be easy and obvious. We are in no way computer scientists; we're just physicists, biologists and a couple of statisticians trying to solve our computational problems, so please take this into consideration if the questions look obvious to you.

Our setup:
1. Data cluster - 4 raided and hadooped servers with 2TB of storage each; they all have real IP addresses, and one of them is reserved for the NameNode.
2. Computational cluster: 100 dual-core servers running Sun Grid Engine; they live on a virtual network (10.0.0.X) and can connect to the outside world, but are not accessible from outside the cluster. On these we don't have root access, and they are shared via SGE with other people, who get reasonably nervous when they see idle reserved servers.

The basic idea is to create an on-demand computational cluster, which when needed will reserve servers from the second cluster, run jobs and release them. Currently this is done via a script that reserves a server for the namenode and 25 servers for datanodes, copies data from the first cluster, runs the job, sends the results back and releases the servers. I still want to make them work together using one namenode. After a week playing with hadoop I couldn't answer some of my questions via thorough RTFM, so I'd really appreciate it if you could answer at least some of them in our context:

1. Is it possible to connect servers from the second cluster to the first namenode? What worries me is the implementation of the data-transfer protocol, because some of the nodes cannot be reached, though they can easily reach any other node. Will hadoop try to establish connections both ways to transfer data between nodes?
2. Is it possible to specify the reliability of a node, that is, to make a replica on a node with raid installed count as two replicas, since the probability of failure is much lower?
3. I also bumped into problems with decommissioning: after I add the hosts to be freed to the dfs.hosts.exclude file and refreshNodes, they are marked as "Decommission in progress" for days, even though the data is removed from them within the first several minutes. What I currently do is shoot them down after some delay, but I really hope to see "Decommissioned" one day. What am I probably doing wrong?
4. The same question about dead hosts. I do a simple exercise: I create 20 datanodes on an empty cluster, then kill 15 of them and try to store a file on HDFS; hadoop fails because some nodes that it thinks are in service aren't accessible. Is it possible to tell hadoop to remove these nodes from the list and not try to store data on them? My current solution is hadoop stop/start via cron every hour.
5. We also have some external secure storage that can be accessed via NFS from the first data cluster, and it'd be great if I could somehow mount this storage onto an HDFS folder and tell hadoop that data written to that folder shouldn't be replicated, but rather should go directly to NFS.
6. Ironically, none of us who use the cluster knows java, and most tasks are launched via streaming with C++ programs/perl scripts. The problem is how to write/read files from HDFS in this context; we currently use things like -moveFromLocal, but it doesn't seem to be the right answer, because it slows things down a lot.
7. On one of the data cluster machines we run a pretty large MySQL database, and we're wondering whether it is possible to spread the database across the cluster; has anyone tried that?
8. Fuse-hdfs works great, but we really hope to be able to write to HDFS someday; how do we enable that?
9. And maybe someone can point out where to look for ways to specify how to partition data for the map jobs. In some of our tasks, processing one line of the input file takes several minutes; currently we split these files into many one-line files and process them independently, but a simple streaming-compatible way to tell hadoop that, for example, we want each job to take 10 lines, or to split a 10kb input file into many map tasks, would help us a lot!

Thanks in advance.
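The pre-splitting workaround described in point 9 - many small input files, one per map task - can at least be automated. A sketch that groups the input into N-line chunks, each destined for its own file (the file-name pattern is made up):

```python
# Group input lines into chunks of at most n lines; in the real
# workflow each chunk is written to its own file and the directory
# is fed to the streaming job, yielding one map task per chunk.
def chunk_lines(lines, n):
    return [lines[i:i + n] for i in range(0, len(lines), n)]

def plan_chunks(lines, n, name_pattern="input_part_%04d"):
    # Return (filename, chunk) pairs; writing them out is left to
    # the caller so the planning logic stays testable.
    return [(name_pattern % i, chunk)
            for i, chunk in enumerate(chunk_lines(lines, n))]

parts = plan_chunks(["line%d\n" % i for i in range(25)], 10)
```

With 25 input lines and n=10 this plans three files of 10, 10 and 5 lines, so each map task gets roughly ten minutes' worth of work.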