Hadoop as master's thesis
Hello everyone, I'm thinking of using Hadoop as the subject of my master's thesis in Computer Science. I'm supposed to solve some kind of problem with Hadoop, but can't think of any :)). We have a lab with 10-15 computers and I thought of installing Hadoop on those computers, and now I should write some kind of program to run on my cluster. I really hope you understood my problem :). I really need any kind of suggestion. P.S. Sorry for my bad English, I'm from Croatia.
Re: Hadoop as master's thesis
Tonci, to start with, you can run Hadoop on one computer in pseudo-distributed mode. Installing and configuring will be enough of a headache on its own. Then you can think of a problem, such as processing student records and grades to find some statistics, or relating grades to students' future achievements. Or you can look at some publicly available datasets and do something with them. Cheers, Mark
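To make the grades idea concrete, here is a minimal sketch of a MapReduce job that averages each student's grades. It assumes CSV input lines of the form studentId,courseId,grade; every class, field, and path name is illustrative, not part of any existing codebase.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GradeAverage {

  // Emits (studentId, grade) for every well-formed input line.
  public static class GradeMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 3) return;          // skip malformed lines
      double grade;
      try {
        grade = Double.parseDouble(fields[2].trim());
      } catch (NumberFormatException e) {
        return;                               // skip lines with a non-numeric grade field
      }
      context.write(new Text(fields[0].trim()), new DoubleWritable(grade));
    }
  }

  // Averages all grades seen for a student.
  public static class AverageReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text studentId, Iterable<DoubleWritable> grades, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable g : grades) { sum += g.get(); count++; }
      context.write(studentId, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "grade average");
    job.setJarByClass(GradeAverage.class);
    job.setMapperClass(GradeMapper.class);
    job.setReducerClass(AverageReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /grades/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /grades/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}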
Re: Hadoop as master's thesis
Thank you for your reply. I didn't mention that I already installed Hadoop on 2 machines back at home (for an essay on Hadoop which I did), one as a namenode and datanode and one as a datanode only. Everything worked perfectly. I would really like to try installing it on more machines to see how a cluster works in more detail. So I was thinking: "Now I have a cluster, where do I find a large dataset to work with?" I like your idea about publicly available datasets, do you have any links on that? The other idea, about student grades, is also great (thank you for that) and I might just start with that. Thank you very much, you both really helped me.
Re: Hadoop as master's thesis
Tonci, here are the Enron email files used in the litigation against them: http://edrm.net/resources/data-sets/enron-data-set-files Here is much more stuff: http://infochimps.org/ Sincerely, Mark
Re: Sun JVM 1.6.0u18
On Mon, Mar 1, 2010 at 6:37 AM, Steve Loughran ste...@apache.org wrote:

Todd Lipcon wrote: On Thu, Feb 25, 2010 at 11:09 AM, Scott Carey sc...@richrelevance.com wrote: I have found some notes that suggest that -XX:-ReduceInitialCardMarks will work around some known crash problems with 6u18, but that may be unrelated. Yep, I think that is probably a likely workaround as well. For now I'm recommending a downgrade to our clients, rather than introducing cryptic -XX flags :)

Lots of bug reports come in once you search for ReduceInitialCardMarks. Looks like a feature has been turned on: http://bugs.sun.com/view_bug.do?bug_id=6889757 and now it is in wide beta test http://bugs.sun.com/view_bug.do?bug_id=698 http://permalink.gmane.org/gmane.comp.lang.scala/19228 Looks like the root cause is a new garbage collector, one that is still settling down. The ReduceInitialCardMarks flag is tuning the GC, but it is the GC itself that is possibly playing up, or it is an old GC plus some new features. Either way: trouble. -steve

FYI, we are still running:

[r...@nyhadoopdata10 ~]# java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)

u14 added support for 64-bit compressed object pointers, which seemed important given that Hadoop can be memory hungry. u15 has been stable in our deployments. Not saying you should not go newer, but I would not go older than u14.
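For anyone who has to stay on 6u18 and wants to try the flag rather than downgrade, a minimal sketch, assuming the standard conf/hadoop-env.sh mechanism; whether the flag actually avoids the crash is unconfirmed, and the -Xmx value below is illustrative:

# conf/hadoop-env.sh -- pass the (unconfirmed) workaround flag to the Hadoop daemon JVMs
export HADOOP_OPTS="$HADOOP_OPTS -XX:-ReduceInitialCardMarks"

# Task JVMs take their flags from mapred.child.java.opts instead, e.g. in mapred-site.xml:
#   <property><name>mapred.child.java.opts</name>
#             <value>-Xmx512m -XX:-ReduceInitialCardMarks</value></property>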
Re: Hadoop as master's thesis
Hi Tonci, You'll find good dataset pointers here: http://www.simpy.com/user/otis/search/dataset You may find inspiration for Hadoop usage here, assuming you have an ML background: http://cwiki.apache.org/MAHOUT/algorithms.html Oh, and you may also want to look out for GSOC (Google Summer of Code). Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/
Re: Hadoop as master's thesis
Tonci Buljan wrote: "I'm thinking of using Hadoop as a subject in my master's thesis in Computer Science. I'm supposed to solve some kind of a problem with Hadoop, but can't think of any :))."

Well, you need some interesting data, then mine it. So ask around. Physicists often have stuff.
LocalDirAllocator error
Hi, we use Hadoop 0.20.1. I saw the following in our log:

2010-02-27 10:05:09,808 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Failed to create /disk2/opt/kindsight/hadoop/data/mapred/local

[r...@snv-qa-lin-cg ~]# df
Filesystem                       1K-blocks    Used       Available   Use%  Mounted on
/dev/sda3                        7936288      2671376    4855256     36%   /
/dev/mapper/VolGroup00-LogVol01  1864008392   396885836  1370909156  23%   /opt
/dev/mapper/VolGroup00-LogVol00  3935944      194560     3538224     6%    /var
/dev/sda1                        497829       16649      455478      4%    /boot
tmpfs                            8219276      0          8219276     0%    /dev/shm

What should I do to get past the above warning? Thanks
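For reference, the path in that warning is normally one of the entries in mapred.local.dir; a minimal sketch of how that property looks in conf/mapred-site.xml (the first path below is illustrative, not a claim about this cluster's actual configuration):

<!-- local scratch space for map/reduce intermediate data;
     each listed directory must be creatable and writable by the mapred user -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/hadoop/mapred/local,/disk2/opt/kindsight/hadoop/data/mapred/local</value>
</property>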
Re: cluster involvement trigger
Hi, You mentioned you pass the files packed together using the -archives option. This will uncompress the archive on the compute node itself, so the namenode won't be hampered in this case. However, cleaning up the distributed cache is a tricky scenario (the user doesn't have explicit control over this); you may search this list for the many discussions pertaining to it. And while on the topic of archives: while it may not be practical for you right now, Hadoop Archives (har) provide similar functionality. Hope this helps. Amogh

On 2/27/10 12:53 AM, Michael Kintzer michael.kint...@zerk.com wrote:

Amogh, Thank you for the detailed information. Our initial prototyping seems to agree with your statements below, i.e. a single large input file is performing better than an index file plus an archive of small files. I will take a look at CombineFileInputFormat as you suggested. One question: since the many small input files are all in a single jar archive managed by the name node, does that still hamper name node performance? I was under the impression these archives are only unpacked into the temporary map-reduce file space (and I'm assuming cleaned up after the map-reduce completes). Does the name node need to store the metadata of each individual file during the unpacking for this case? -Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

Hi, The number of mappers initialized depends largely on your input format (the getSplits of your input format). Almost all input formats available in Hadoop derive from FileInputFormat, hence the one-mapper-per-file-block notion (this is actually one mapper per split). You say that you have too many small files. In general each of these small files (< 64 MB) will be processed by a single mapper. However, I would suggest looking at CombineFileInputFormat, which does the job of packaging many small files together depending on data locality, for better performance (initialization time is a significant factor in Hadoop's performance). On the other side, many small files will hamper your namenode performance, since file metadata is stored in memory, and limit its overall capacity with respect to the number of files. Amogh

On 2/25/10 11:15 PM, Michael Kintzer michael.kint...@zerk.com wrote:

Hi, We are using the streaming API. We are trying to understand what Hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given map-reduce execution. With default settings (64 MB chunk size in HDFS), if the input file is less than 64 MB, will the data processing only occur on a single TaskTracker node, even if our cluster size is greater than 1? For example, we are trying to figure out whether Hadoop is more efficient at processing: a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the -archives argument, or b) a single input file containing all the raw data represented by the 100K or 1M small files. With (a), our input file is under 64 MB. With (b) our input file is very large. Thanks for any insight, -Michael
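To make the one-mapper-per-split point concrete, a small throwaway driver against the old (0.20 mapred) API; the class name is illustrative, and TextInputFormat is used here because streaming defaults to it. It just prints how many splits, and hence map tasks, a given input path would produce with the current configuration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitCount.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    TextInputFormat format = new TextInputFormat();
    format.configure(conf);
    // getSplits() is what the framework calls at job submission time;
    // each returned split becomes one map task.
    InputSplit[] splits = format.getSplits(conf, 1);
    System.out.println(args[0] + " would produce " + splits.length + " map task(s)");
  }
}

Pointing it at a single sub-64 MB file should report one split, while pointing it at a directory of many small files should report roughly one split per file, which is the behavior CombineFileInputFormat is designed to avoid.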
Re: Hadoop as master's thesis
Thank you all for your replies. Matteo, I'm definitely interested in what you did, and I would be very happy to check it out in detail. Mark Kerzner's link http://infochimps.org/ was very useful. Thank you Mark for that. I'll probably download and work with some data from there.

For Marko: I had no idea there were other people in Croatia working with Hadoop. I study at FESB in Split, and the whole department that deals with distributed computing is thin. The professor didn't even know what Hadoop was when I asked him for an idea. Java is an even bigger bogeyman; the same professor teaches it, so it will be a real struggle to write any kind of thesis on this topic. In any case, thanks for the reply.

On 1 March 2010 22:35, Song Liu lamfeeli...@gmail.com wrote:

Hi Tonci, Actually, I am doing a master's thesis by developing algorithms on Hadoop. My project is to extend algorithms into a MapReduce fashion and to discover whether there is an optimal choice. Most of them belong to the machine learning area. Personally, I think this is a fresh area, and if you search the main academic databases you will find little literature about it. I recently made a proposal about my study on Hadoop, and I would like to discuss it with you in depth if you wish. Another interesting topic is to discover the limits of Hadoop. We have a very large cluster at a very high rank among the TOP500, so I'm wondering whether Hadoop can perform as we expect. Hope this is helpful. Regards, Song Liu

On Mon, Mar 1, 2010 at 9:16 PM, Stephen Watt sw...@us.ibm.com wrote:

Hi Tonci, Public data sets: check out infochimps.org/ or aws.amazon.com/publicdatasets/. I find a lot of the Hadoopified algorithms out there originate from linguistics departments, TF-IDF being one example, but have you considered looking into information theory? i.e. entropy analytics using algorithms like pointwise mutual information. I'd imagine most government security agencies would be interested in using Hadoop for signal processing/code breaking, especially given the cost savings of using commodity machines. The trick will be to find a dataset that suits your algorithm. Kind regards, Steve Watt
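For reference, the pointwise mutual information mentioned above is just

\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}

i.e. how much more (or less) often two events co-occur than independence would predict; the counts needed to estimate p(x), p(y), and p(x, y) over a large corpus are a natural fit for a MapReduce counting job.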
Re: Big-O Notation for Hadoop
On Mon, Mar 1, 2010 at 4:13 PM, Darren Govoni dar...@ontrenet.com wrote:

Theoretically, O(n). All other variables being equal across all nodes, it should... map/reduce to n. The part that really can't be measured is the cost of Hadoop's bookkeeping chores as the data set grows, since some things in Hadoop involve synchronous/serial behavior.

On Mon, 2010-03-01 at 12:27 -0500, Edward Capriolo wrote:

A previous post to core-user mentioned some formula to determine job time. I was wondering if anyone out there is trying to tackle designing a formula that can calculate the job run time of a map/reduce program. Obviously there are many variables here, including but not limited to disk speed, network speed, processor speed, input data, many constants, data skew, map complexity, reduce complexity, number of nodes... As an intellectual challenge, has anyone started trying to write a formula that takes all these factors into account and actually predicts a job time in minutes/hours?

Understood, Big-O notation is really not what I am looking for. Given all variables are the same, a Hadoop job on a finite set of data should run for a finite time. There are parts of the process that run serially and parts that run in parallel, but there must be a way to express how long a job actually takes (although admittedly it is very involved to figure out).
Re: Big-O Notation for Hadoop
It's a Turing-class problem and thus non-deterministic by nature - a priori. But given the uniform aspect of map/reduce, an estimate could be continually approximated as the data is processed, noting that the farther from completion it is, the less accurate that calculation would be. And of course, once completed, the estimate is 100% accurate. In order to do it beforehand, you would need an algorithm that can examine your map/reduce code and predict the running cost. Without data on prior runs, that is not -mathematically- possible. As a function of cycle complexity over time (which is what big O is), map/reduce will scale somewhat linearly (maybe even log n) with regard to data - is my hunch. There's probably a quotient in there for the bookkeeping no one has data on yet, though. But it's a good inquiry.
Re: Sun JVM 1.6.0u18
On Mar 1, 2010, at 10:46 AM, Allen Wittenauer wrote:

On 3/1/10 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote: u14 added support for 64-bit compressed object pointers, which seemed important given that Hadoop can be memory hungry. u15 has been stable in our deployments. Not saying you should not go newer, but I would not go older than u14.

How are the compressed memory pointers working for you? I've been debating turning them on here, so real-world experience would be useful from those that have taken the plunge.

Been using them since they came out, both for Hadoop where needed and in many other applications. Performance gains and memory reduction in most places -- sometimes rather significant (25%). GC times are significantly lower for any heap that is reference-heavy. Heaps are still a little larger than a 32-bit one, but the benefits of native 64-bit code on x86 include improved computational performance as well. 6u18 introduces some performance enhancements to the feature that we might be able to use if 6u19 fixes the other bugs. The next HotSpot version will make it the default setting, whenever that gets integrated and tested into the JDK 6 line. 6u14 and 6u18 are the last two JDK releases with updated HotSpot versions.
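For anyone wanting to try them, a minimal sketch, assuming the standard conf/hadoop-env.sh mechanism and a 64-bit JVM at 6u14 or later:

# conf/hadoop-env.sh -- enable compressed object pointers (64-bit JVMs only;
# beneficial for heaps small enough to fit the compressed addressing range)
export HADOOP_OPTS="$HADOOP_OPTS -XX:+UseCompressedOops"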
RE: Sun JVM 1.6.0u18
1.6.0_18 also claims to fix bug_id=5103988, which may or may not improve the performance of the transferTo code used in org.apache.hadoop.net.SocketOutputStream.
Re: Big-O Notation for Hadoop
I am looking at this many different ways. For example: shuffle/sort might run faster if we have 12 disks per node rather than 8. So shuffle/sort involves data size, disk speed, network speed, processor speed, and number of nodes. Can we find a formula that takes these (and more) factors into account? Once we find it, we should be able to plug in 12 or 8 and get a result close to the shuffle/sort time. I think it would be rather cool to have a long, drawn-out formula that even made reference to some constants, like the time to copy data to the distributed cache. I am looking at source data size, map complexity, map output size, shuffle/sort time, reduce complexity, and number of nodes, and trying to arrive at a formula that will say how long a job will take. From there we can factor in something like all nodes having 10 gigabit Ethernet and watch the entire thing fall apart :)

On 3/1/10, brien colwell xcolw...@gmail.com wrote:

Map/reduce should be a constant-factor improvement for the algorithm complexity. I think you're asking for the overhead as a function of input/cluster size? If your algorithm has some complexity O(f(n)), and you spread it over M nodes (constant), with some merge complexity less than f(n), the total time will still be O(f(n)). I run a small job, measure the time, and then extrapolate based on the big-O.
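One purely illustrative shape such a formula might take (every symbol here is an assumption, not a measurement): with $D_{in}$ the input bytes, $D_{shuf}$ the map-output bytes, $N$ the number of nodes, and $r_{map}$, $r_{reduce}$, $r_{disk}$, $r_{net}$ the per-node processing and I/O rates,

$$T_{job} \approx T_{overhead} + \frac{D_{in}}{N\, r_{map}} + \frac{D_{shuf}}{\min(N\, r_{disk},\ r_{net})} + \frac{D_{shuf}}{N\, r_{reduce}}$$

Plugging 8 versus 12 disks per node into $r_{disk}$ is exactly the kind of sensitivity being asked about, and $T_{overhead}$ is the bookkeeping term no one has good numbers for yet.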
bulk data transfer to HDFS remotely (e.g. via wan)
I am considering the basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN. An example of this is moving data from Amazon S3 to EC2, which is supported in the latest Hadoop by specifying s3(n)://authority/path in distcp. But generally speaking, what is the best way to load data into a Hadoop cluster from a remote box? Clearly, in this scenario it is unreasonable to copy the data onto the namenode's local disk and then issue a command like hadoop fs -copyFromLocal to put the data in the cluster (besides this, the desired data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).

I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to Hadoop):

cat datafile | ssh hadoopbox 'hadoop fs -put - dst'

There are pros (simple, and it will do the job without storing a local copy of each data file and then running something like 'hadoop fs -copyFromLocal') and cons (it will obviously need many such pipelines running in parallel to speed up the job, but at the cost of creating processes on the remote machines to read the data and maintaining ssh connections; so if the data files are small, it is better to archive the small files into a tar file before calling 'cat'). As an alternative to 'cat', a program can be written that keeps reading data files and feeding them to the pipe in parallel.

Any comments about this or thoughts about a better solution? Thanks, -- Michael
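For the many-pipelines-in-parallel variant, a minimal sketch; the host name hadoopbox, the paths, and the flat one-pipeline-per-file fan-out are all illustrative, and it assumes passwordless ssh to hadoopbox with the hadoop command on its PATH:

# push each file into HDFS through its own ssh pipe, run from the box holding the data;
# $(basename "$f") is expanded locally, so the remote side just sees "hadoop fs -put - <name>"
for f in /data/export/part-*; do
  cat "$f" | ssh hadoopbox "hadoop fs -put - /ingest/$(basename "$f")" &
done
wait

This is fine for a modest number of files; for very many files you would want to throttle the number of concurrent pipelines rather than background them all at once.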