DC area event: Investing in the Cloud: A Breakfast Conversation
Just another reminder for our upcoming event this week...

---

We are pleased to present the following special event at the University of Maryland, featuring Christophe Bisciglia of Cloudera and Deepak Singh of Amazon Web Services...

Investing in the Cloud: A Breakfast Conversation
Wednesday, May 13, 2009
8:00 a.m. - 10:00 a.m.

Co-hosted by:
Dingman Center for Entrepreneurship, Robert H. Smith School of Business
Human-Computer Interaction Laboratory (HCIL), The iSchool
University of Maryland

Sponsored by Redshift Ventures and Pillsbury

URL: http://www.umiacs.umd.edu/~jimmylin/cloud-computing/2009-05-13-breakfast/

= Overview

Cloud computing, whether in reference to utility computing, software as a service, or the ability to perform analytics on large datasets with emerging technologies such as Hadoop, represents tremendous market and investment opportunities. Established organizations are leveraging cloud technologies to provide a path to greater efficiencies in data center operations, through centralized management and economies of scale. Even more exciting, however, are the yet-to-be-exploited new business opportunities in the cloud space. Already, organizations are using Hadoop for Web-scale analytics, tackling problems that only a few years ago seemed intractable to all but a few. Similarly, organizations are taking advantage of utility computing services, converting capital costs into operational costs and reaping the benefits of on-demand computational resources.

Join us as we explore the intersection of investment, entrepreneurship, and cloud computing with Christophe Bisciglia, representing Cloudera, whose mission is to provide enterprise-level support to users of Hadoop, and Deepak Singh, representing Amazon Web Services, a major provider of utility computing and cloud infrastructure. The discussion will be moderated by Prof. Jimmy Lin, who leads Maryland's cloud computing efforts in the Google/IBM Academic Cloud Computing Initiative.

= Schedule and Logistics

8:00 am – 8:30 am    Breakfast and Networking
8:30 am – 9:30 am    Plenary session by invited speakers
9:30 am – 10:00 am   Panel session moderated by Jimmy Lin

The event will take place at:

Robert H. Smith School of Business
2505 Van Munching Hall
College Park, Maryland

Directions at http://www.rhsmith.umd.edu/about/directions.aspx

This event is free and open to the public. However, please register for the event at http://www.slyreply.com/Event/EventDetails.aspx?eid=kwIpwWNRrO8%3d

= Speaker Bios

Christophe Bisciglia joins Cloudera from Google, where he created and managed their Academic Cloud Computing Initiative. Starting in 2007, he began working with the University of Washington to teach students about Google's core data management and processing technologies: MapReduce and GFS. This quickly brought Hadoop into the curriculum, and has since resulted in an extensive partnership with the National Science Foundation (NSF) which makes Google-hosted Hadoop clusters available for research and education worldwide. Beyond his work with Hadoop, he holds patents related to search quality and personalization, and spent a year working in Shanghai. Christophe earned his degree, and remains a visiting scientist, at the University of Washington.

Deepak Singh is a business development manager at Amazon Web Services where he spends a lot of time working with developers and organizations looking to leverage Amazon EC2 for a variety of applications, especially in the areas of scientific research and data analytics. Prior to his time at Amazon Web Services Deepak spent time at a number of life science informatics and software companies; as a strategist at Rosetta Biosoftware, a product manager and consortium director at Accelrys, and a scientific programmer at GeneFormatics. He has a PhD in physical chemistry from Syracuse University. Deepak is also an active blogger and podcaster. At business|bytes|genes|molecules (http://mndoci.com) and Coast to Coast Bio (http://c2cbio.com) he writes and talks about a variety of topics at the interface of the biosciences and technology, with special interests in open data, computing, and the web as a platform for science.

Jimmy Lin is an Associate Professor in the iSchool at the University of Maryland, with affiliations in the Department of Computer Science and the Institute for Advanced Computer Studies, as well as the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), the National Institutes of Health (NIH). He received a Ph.D. in Electrical Engineering and Computer Science from MIT in 2004. Dr. Lin's research primarily lies at the intersection of information retrieval and natural language processing, but his interests extend to human-computer interaction, bioinformatics, medical informatics, and large-scale distributed systems.
Re: Coordination between Mapper tasks
Hmmm... sounds odd. Given the same memcached servers (config), the hashing should be consistent.

FYI, all code for the experiments described in that tech report is in cloud9, the library I use for teaching my courses. Download at: http://www.umiacs.umd.edu/~jimmylin/

Hope this helps! (Let me know off list if you need more details)

-Jimmy

Stuart White wrote:
>> You might want to look at a memcached solution some students and I worked out for exactly this problem.
>
> Thanks, Jimmy! This paper does exactly describe my problem. I started working to implement the memcached solution you describe, and I've run into a small problem. I've described it on the spymemcached forum:
>
> http://groups.google.com/group/spymemcached/browse_thread/thread/7b4d82bca469ed20
>
> Essentially, it seems the keys are being hashed inconsistently by spymemcached across runs. This, of course, will result in inconsistent/invalid results. Did you guys run into this? Since I'm new to memcached, I'm hoping that this is simply something I don't understand or am overlooking.
Re: Using HDFS to serve www requests
Brian---

Can you share some performance figures for typical workloads with your HDFS/Fuse setup? Obviously, latency is going to be bad but throughput will probably be reasonable... but I'm curious to hear about concrete latency/throughput numbers. And, of course, I'm interested in these numbers as a function of concurrent clients... ;)

Somewhat independent of file size is the workload... you can have huge TB-size files, but still have a seek-heavy workload (in which case HDFS is probably a sub-optimal choice). But if seek-heavy loads are reasonable, one can solve the lots-of-little-files problem by simple concatenation.

Finally, I'm curious about the Fuse overhead (vs. directly using the Java API).

Thanks in advance for your insights!

-Jimmy

Brian Bockelman wrote:
> On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:
>> In general, Hadoop is unsuitable for the application you're suggesting. Systems like Fuse HDFS do exist, though they're not widely used.
>
> We use FUSE on a 270TB cluster to serve up physics data because the client (2.5M lines of C++) doesn't understand how to connect to HDFS directly.
>
> Brian
>
>> I don't know of anyone trying to connect Hadoop with Apache httpd. When you say that you have huge images, how big is "huge?" It might be useful if these images are 1 GB or larger. But in general, "huge" on Hadoop means 10s of GBs up to TBs. If you have a large number of moderately-sized files, you'll find that HDFS responds very poorly for your needs.
>>
>> It sounds like glusterfs is designed more for your needs.
>>
>> - Aaron
>>
>> On Thu, Mar 26, 2009 at 4:06 PM, phil cryer wrote:
>>> This is somewhat of a noob question I know, but after learning about Hadoop, testing it in a small cluster and running Map Reduce jobs on it, I'm still not sure if Hadoop is the right distributed file system to serve web requests. In other words, can, or is it right to, serve Images and data from HDFS using something like FUSE to mount a filesystem where Apache could serve images from it? We have huge images, thus the need for a distributed file system, and they go in, get stored with lots of metadata, and are redundant with Hadoop/HDFS - but is it the right way to serve web content?
>>>
>>> I looked at glusterfs before, they had an Apache and Lighttpd module which made it simple, does HDFS have something like this, do people just use a FUSE option as I described, or is this not a good use of Hadoop?
>>>
>>> Thanks
>>> P
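
[Editorial sketch] The "simple concatenation" idea mentioned in the reply can be illustrated roughly as follows: many small files are packed back to back into one large blob, and a small index records where each one starts. This is only an in-memory stand-in written against the Java standard library; the class name PackedFiles and its methods are hypothetical and are not part of Hadoop or HDFS. On a real cluster the blob would be one large file and each lookup would cost a seek, which is why the approach only helps if seek-heavy loads are acceptable.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): packs small files into one blob plus an index.
class PackedFiles {
    private final byte[] blob;               // all file contents, stored back to back
    private final Map<String, int[]> index;  // file name -> {offset, length}

    private PackedFiles(byte[] blob, Map<String, int[]> index) {
        this.blob = blob;
        this.index = index;
    }

    // Concatenates the files in iteration order and records each offset and length.
    static PackedFiles pack(Map<String, byte[]> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, int[]> index = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] data = e.getValue();
            index.put(e.getKey(), new int[] {out.size(), data.length});
            out.write(data, 0, data.length);
        }
        return new PackedFiles(out.toByteArray(), index);
    }

    // One index lookup, then one "seek" into the blob followed by a short read.
    byte[] read(String name) {
        int[] entry = index.get(name);
        if (entry == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return Arrays.copyOfRange(blob, entry[0], entry[0] + entry[1]);
    }

    int size() {
        return blob.length;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.jpg", new byte[] {1, 2, 3});
        files.put("b.jpg", new byte[] {4, 5});
        PackedFiles packed = pack(files);
        System.out.println(packed.size() + " bytes packed, b.jpg = "
                + Arrays.toString(packed.read("b.jpg")));
    }
}
```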
Re: Coordination between Mapper tasks
Hi Stuart,

You might want to look at a memcached solution some students and I worked out for exactly this problem. It's written up in:

Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework. Technical Report HCIL-2009-01, University of Maryland, College Park, January 2009.

Available at: http://www.umiacs.umd.edu/~jimmylin/publications/by_year.html

Best,
Jimmy

Stuart White wrote:
> Thanks to everyone for your feedback. I'm unfamiliar with many of the technologies you've mentioned, so it may take me some time to digest all your responses. The first thing I'm going to look at is Ted's suggestion of a pure map-reduce solution by pre-joining my data with my lookup values.
>
> On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley wrote:
>> On Thu, Mar 19, 2009 at 6:42 PM, Stuart White wrote:
>>> My process requires a large dictionary of terms (~ 2GB when loaded into RAM). The terms are looked-up very frequently, so I want the terms memory-resident.
>>>
>>> So, the problem is, I want 3 processes (to utilize CPU), but each process requires ~2GB, but my nodes don't have enough memory to each have their own copy of the 2GB of data. So, I need to somehow share the 2GB between the processes.
>>
>> I would recommend using the multi-threaded map runner. Have 1 map/node and just use 3 worker threads that all consume the input. The only disadvantage is that it works best for cpu-heavy loads (or maps that are doing crawling, etc.), since you only have one record reader for all three of the map threads.
>>
>> In the longer term, it might make sense to enable parallel jvm reuse in addition to serial jvm reuse.
>>
>> -- Owen
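
[Editorial sketch] The shape of Owen's suggestion (one process per node, one copy of the dictionary, several worker threads fed by a single record reader) can be sketched with plain java.util.concurrent. This is an assumption-laden illustration, not Hadoop's own multi-threaded map runner: the class SharedDictionaryRunner, its run method, and the "UNKNOWN" label are all hypothetical names, and the per-record work is reduced to a dictionary lookup and a count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a multi-threaded map runner (not the Hadoop class).
class SharedDictionaryRunner {

    // All workers share one read-only dictionary and pull from one record source.
    static Map<String, Integer> run(Map<String, String> dictionary,
                                    List<String> records, int workers) {
        Iterator<String> reader = records.iterator();  // the single "record reader"
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                while (true) {
                    String record;
                    synchronized (reader) {            // only one thread reads at a time
                        if (!reader.hasNext()) {
                            return;
                        }
                        record = reader.next();
                    }
                    // The lookup runs outside the lock, so it can proceed in parallel.
                    String label = dictionary.getOrDefault(record, "UNKNOWN");
                    counts.merge(label, 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("a", "X");
        dictionary.put("b", "Y");
        System.out.println(run(dictionary, Arrays.asList("a", "b", "a", "c"), 3));
    }
}
```

The synchronized block around the reader mirrors the disadvantage Owen points out: with a single record reader, the threads only help when the per-record work dominates the cost of reading.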
Re: OT: How to search mailing list archives?
I've found nabble to be helpful: http://www.nabble.com/Hadoop-core-user-f30590.html

-Jimmy

Miles Osborne wrote:
> posts tend to get indexed by Google, so try that
>
> Miles
>
> 2009/3/8 Stuart White :
>> This is slightly off-topic, and I realize this question is not specific to Hadoop, but what is the best way to search the mailing list archives? Here's where I'm looking:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/
>>
>> I don't see any way to search the archives. Am I missing something? Is there another archive site I should be looking at?
>>
>> Thanks!
Re: Lazily deserializing Writables
Hi Bryan,

Thanks, this answers my question! So at the very least you'll have to read in the raw bytes and hang on to them.

-Jimmy

> We do this with some of our Thrift-serialized types. We account for this behavior explicitly in the ThriftWritable class and make it so that we can read the serialized version off the wire completely by prepending the size. Then, we can read in the raw bytes and hang on to them for later as we see fit. I would think that leaving the bytes on the DataInput would break things in a very impressive way.
>
> -Bryan
>
> On Oct 2, 2008, at 2:48 PM, Jimmy Lin wrote:
>
>> Hi everyone,
>>
>> I'm wondering if it's possible to lazily deserialize a Writable. That is, when my custom Writable is handed a DataInput from readFields, can I simply hang on to the reference and read from it later? This would be useful if the Writable is a complex data structure that may be expensive to deserialize, so I'd only want to do it on-demand. Or does the runtime mutate the underlying stream, leaving the Writable with a reference to something completely different later?
>>
>> I'm wondering about both present behavior, and the implicit contract provided by the Hadoop API.
>>
>> Thanks!
>>
>> -Jimmy
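
[Editorial sketch] The size-prefix approach Bryan describes might look roughly like the following. It is a minimal sketch under stated assumptions: LazyRecord is a hypothetical class that uses only java.io (it mirrors the readFields signature but does not implement Hadoop's Writable interface), and the payload is simply an array of strings. The point is that readFields consumes exactly this record's bytes from the DataInput and keeps them, and the expensive decoding is deferred until getTerms is first called.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type; it mimics a Writable but depends only on java.io.
class LazyRecord {
    private byte[] raw;      // serialized payload, held until someone asks for it
    private String[] terms;  // decoded form, built on first access

    // Writes a length prefix, then the payload, so a reader knows how much to consume.
    static void write(DataOutput out, String[] values) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream payload = new DataOutputStream(buffer);
        payload.writeInt(values.length);
        for (String value : values) {
            payload.writeUTF(value);
        }
        payload.flush();
        out.writeInt(buffer.size());
        out.write(buffer.toByteArray());
    }

    // Reads exactly this record's bytes off the stream and decodes nothing yet.
    void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        raw = new byte[length];
        in.readFully(raw);
        terms = null;
    }

    boolean isDecoded() {
        return terms != null;
    }

    // Decodes from the saved bytes on demand; the original DataInput is never touched again.
    String[] getTerms() {
        if (terms == null) {
            try {
                DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
                String[] decoded = new String[in.readInt()];
                for (int i = 0; i < decoded.length; i++) {
                    decoded[i] = in.readUTF();
                }
                terms = decoded;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return terms;
    }

    // Convenience for trying the sketch: serialize, then read back lazily.
    static LazyRecord roundTrip(String[] values) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            write(new DataOutputStream(wire), values);
            LazyRecord record = new LazyRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(wire.toByteArray())));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LazyRecord record = roundTrip(new String[] {"hadoop", "hdfs"});
        System.out.println("decoded before access: " + record.isDecoded());
        System.out.println("first term: " + record.getTerms()[0]);
    }
}
```

The trade-off is the one noted in the reply: the raw bytes still have to be copied off the stream, so only the decoding cost is deferred, not the read.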
Lazily deserializing Writables
Hi everyone, I'm wondering if it's possible to lazily deserialize a Writable. That is, when my custom Writable is handed a DataInput from readFields, can I simply hang on to the reference and read from it later? This would be useful if the Writable is a complex data structure that may be expensive to deserialize, so I'd only want to do it on-demand. Or does the runtime mutate the underlying stream, leaving the Writable with a reference to something completely different later? I'm wondering about both present behavior, and the implicit contract provided by the Hadoop API. Thanks! -Jimmy
slash in AWS Secret Key, WAS Re: Namenode Exceptions with S3
I've come across this problem before. My simple solution was to regenerate new keys until I got one without a slash... ;)

-Jimmy

> I have Hadoop 0.17.1 and an AWS Secret Key that contains a slash ('/').
>
> With distcp, I found that using the URL format s3://ID:[EMAIL PROTECTED]/ did not work, even if I encoded the slash as "%2F". I got "org.jets3t.service.S3ServiceException: S3 HEAD request failed. ResponseCode=403, ResponseMessage=Forbidden"
>
> When I put the AWS Secret Key in hadoop-site.xml and wrote the URL as s3://BUCKET/ it worked.
>
> I have periods ('.') in my bucket name, that was not a problem.
>
> What's weird is that org.apache.hadoop.fs.s3.Jets3tFileSystemStore uses java.net.URI, which should take care of unencoding the %2F.
>
> -Stuart
>
> On Wed, Jul 9, 2008 at 1:41 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
>> So far, I've had no luck.
>>
>> Can anyone out there clarify the permissible characters/format for aws keys and bucket names?
>>
>> I haven't looked at the code here, but it seems strange to me that the same restrictions on host/port etc apply given that it's a totally different system. I'd love to see exceptions thrown that are particular to the protocol/subsystem being employed. The s3 'handler' (or whatever) might be nice enough to check for format violations and throw an appropriate exception, for instance. It might URL-encode the secret key so that the user doesn't have to worry about this, or throw an exception notifying the user of a bad format. Currently, apparent problems with my s3 settings are throwing exceptions that give no indication that the problem is actually with those settings.
>>
>> My mitigating strategy has been to change my configuration to use "instance-local" storage (/mnt). I then copy the results out to s3 using 'distcp'. This is odd since distcp seems ok with my s3/aws info.
>>
>> I'm still unclear as to the permissible characters in bucket names and access keys. I gather '/' is bad in the secret key and that '_' is bad for bucket names. Thus far I have only been able to get buckets to work in distcp that have only letters in their names, but I haven't tested too extensively.
>>
>> For example, I'd love to use buckets like: 'com.organization.hdfs.purpose'. This seems to fail. Using 'comorganizationhdfspurpose' works but clearly that is less than optimal.
>>
>> Like I say, I haven't dug into the source yet, but it is curious that distcp seems to work (at least where s3 is the destination) and hadoop fails when s3 is used as its storage.
>>
>> Anyone who has dealt with these issues, please post! It will help make the project better.
>>
>> -lincoln
>>
>> --
>> lincolnritter.com
>>
>> On Wed, Jul 9, 2008 at 7:10 AM, slitz <[EMAIL PROTECTED]> wrote:
>>> I'm having the exact same problem, any tip?
>>>
>>> slitz
>>>
>>> On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
>>> Hello,
>>>
>>> I am trying to use S3 with Hadoop 0.17.0 on EC2. Using this style of configuration:
>>>
>>> fs.default.name            s3://$HDFS_BUCKET
>>> fs.s3.awsAccessKeyId       $AWS_ACCESS_KEY_ID
>>> fs.s3.awsSecretAccessKey   $AWS_SECRET_ACCESS_KEY
>>>
>>> on startup of the cluster with the bucket having no non-alphabetic characters, I get:
>>>
>>> 2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode: java.lang.RuntimeException: Not a host:port pair: X
>>>   at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
>>>   at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>>>   at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>>>   at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>>>   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>>>   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>>>
>>> If I use this style of configuration:
>>>
>>> fs.default.name            s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]
>>>
>>> I get (where the all-caps portions are the actual values...):
>>>
>>> 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode: java.lang.NumberFormatException: For input string: "[EMAIL PROTECTED]"
>>>   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>>   at java.lang.Integer.parseInt(Integer.java:447)
>>>   at java.lang.Integer.parseInt(Integer.java:497)
>>>   at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>>>   at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>>>   at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>>>   at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>>>   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>>>   at org.apache.hadoo
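
[Editorial sketch] Since the thread turns on how java.net.URI treats slashes, "%2F", and underscores, here is a small probe of that class on its own. It says nothing about what Hadoop or jets3t do with the parsed pieces afterwards; the class name S3UriSketch and the placeholder values ID, abc/def, bucket, and my_bucket are made up for illustration.

```java
import java.net.URI;

// Hypothetical probe class: shows java.net.URI behaviour only, not Hadoop's S3 handling.
class S3UriSketch {

    static String describe(String text) {
        URI uri = URI.create(text);
        return "userInfo=" + uri.getUserInfo()
                + " rawUserInfo=" + uri.getRawUserInfo()
                + " host=" + uri.getHost()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Secret with the slash percent-encoded: the authority parses as user-info@host,
        // and getUserInfo() returns the decoded form.
        System.out.println(describe("s3://ID:abc%2Fdef@bucket/"));

        // Secret with a raw slash: the authority ends at the first '/', so the
        // rest of the secret and the bucket name end up in the path instead.
        System.out.println(describe("s3://ID:abc/def@bucket/"));

        // Underscore in the bucket: not a valid hostname for java.net.URI,
        // so getHost() comes back null even though parsing succeeds.
        System.out.println(describe("s3://my_bucket/"));
    }
}
```

This is consistent with Stuart's observation that java.net.URI itself decodes "%2F" in the user-info part, which suggests the 403 he saw arises somewhere after URI parsing; where exactly is not established in this thread.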
Re: walkthrough of developing first hadoop app from scratch
Hi Stephen et al.,

I would take advantage of the Hadoop plug-in for Eclipse to handle the mundane aspects of putting together your job and running it on the cluster.

With respect to gentler introductions on application development, you might want to take a look at the following:

http://www.umiacs.umd.edu/~jimmylin/cloud9/umd-hadoop-dist/cloud9-docs/index.html

Cloud9 is a MapReduce library primarily intended for teaching, which I use in my cloud computing course (going on right now). The associated tutorials might help you get started. Thus far it's worked well with U. Maryland grads and undergrads, but I'd appreciate additional feedback.

Incidentally, I will be talking at the Hadoop summit next week, so if anyone else on the list will be there, I look forward to meeting everyone!

-Jimmy

Stephen J. Barr wrote:
> Hello,
>
> I am working on developing my first hadoop app from scratch. It is a Monte-Carlo simulation, and I am using the PiEstimator code from the examples as a reference. I believe I have what I want in a .java file. However, I couldn't find any documentation on how to make that .java file into a .jar that I could run, and I haven't found much documentation that is hadoop specific.
>
> Is it basically
>
> javac MyApp.java
> jar -cf MyApp
>
> or something to that effect, or is there more to it?
>
> Thanks! Sorry for the newbie question.
>
> -stephen barr
Re: Add your project or company to the powered by page?
University of Maryland
http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html

We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.

Eric Baldeschwieler wrote:
> Hi Folks,
>
> Let's get the word out that Hadoop is being used and is useful in your organizations, ok? Please add yourselves to the Hadoop powered by page, or reply to this email with what details you would like to add and I'll do it.
>
> http://wiki.apache.org/hadoop/PoweredBy
>
> Thanks!
>
> E14
>
> ---
> eric14 a.k.a. Eric Baldeschwieler
> senior director, grid computing
> Yahoo! Inc.
Question about key sorting interaction effects
Hi guys,

I was wondering if someone could explain the possible interaction effects between the different methods available to control key sorting. Based on my understanding, there are three separate knobs:

- a WritableComparable's compareTo method
- registering a WritableComparator optimization
- setOutputKeyComparatorClass method in JobConf

So here's my question: what happens if these each define a different sort order?

To be more concrete, in a recent application I inadvertently defined an output key comparator that defined an ordering that was different from the WritableComparable's natural ordering (as defined by its compareTo). Running the application on small data sets led to (my) expected behavior, sort order as defined by the output key comparator. However, I got unanticipated results with larger data sets, which leads me to suspect that different methods are used to sort at different times...

Thanks in advance for the response!

-Jimmy
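
[Editorial sketch] One general way that mismatched orderings can produce surprising output is sketched below: if sorted runs are produced under one ordering and later merged under another, the merged result is sorted under neither. This is only an illustration of that hazard in plain Java; whether Hadoop actually applies the three knobs at different stages in this way is exactly the open question in the post, and nothing here confirms it. The class SortOrderSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of merging sorted runs with a different ordering.
class SortOrderSketch {

    // Standard two-way merge; it is only correct if both runs are sorted by 'order'.
    static <T> List<T> merge(List<T> a, List<T> b, Comparator<? super T> order) {
        List<T> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            if (order.compare(a.get(i), b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) {
            out.add(a.get(i++));
        }
        while (j < b.size()) {
            out.add(b.get(j++));
        }
        return out;
    }

    // True if every adjacent pair respects 'order'.
    static <T> boolean isSorted(List<T> items, Comparator<? super T> order) {
        for (int i = 1; i < items.size(); i++) {
            if (order.compare(items.get(i - 1), items.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Both runs are sorted by natural (ascending) order.
        List<Integer> run1 = Arrays.asList(1, 4, 7);
        List<Integer> run2 = Arrays.asList(2, 5, 8);
        Comparator<Integer> ascending = Comparator.<Integer>naturalOrder();
        Comparator<Integer> descending = Comparator.<Integer>reverseOrder();

        System.out.println("consistent merge: " + merge(run1, run2, ascending));
        System.out.println("mismatched merge: " + merge(run1, run2, descending));
    }
}
```

With a single run there is nothing to merge, so only one ordering is ever applied; that would be one possible, unconfirmed reading of why small inputs looked fine while larger ones did not.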
Why no DoubleWritable?
Hi guys, What's the design decision for not implementing a DoubleWritable type that implements WritableComparable? I noticed that there are classes corresponding to all Java primitives except for double. Thanks in advance, Jimmy
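
[Editorial sketch] For readers wondering what such a type would involve, a minimal version is sketched below. It is an assumption-labeled illustration, not code from Hadoop: SketchDoubleWritable is a hypothetical class that mirrors the write/readFields/compareTo shape of a WritableComparable using only java.io and java.lang.Comparable, so that it compiles without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical type; mirrors the WritableComparable shape without depending on Hadoop.
class SketchDoubleWritable implements Comparable<SketchDoubleWritable> {
    private double value;

    SketchDoubleWritable() {
    }

    SketchDoubleWritable(double value) {
        this.value = value;
    }

    double get() {
        return value;
    }

    void set(double value) {
        this.value = value;
    }

    // Serializes the value as 8 bytes.
    public void write(DataOutput out) throws IOException {
        out.writeDouble(value);
    }

    // Restores the value from the same 8-byte form.
    public void readFields(DataInput in) throws IOException {
        value = in.readDouble();
    }

    // Double.compare gives a total order, including for NaN and signed zeros.
    @Override
    public int compareTo(SketchDoubleWritable other) {
        return Double.compare(value, other.value);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SketchDoubleWritable
                && Double.compare(value, ((SketchDoubleWritable) o).value) == 0;
    }

    @Override
    public int hashCode() {
        return Double.hashCode(value);
    }

    // Convenience for trying the sketch: serialize, then deserialize into a new object.
    static SketchDoubleWritable copyViaBytes(SketchDoubleWritable original) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            original.write(new DataOutputStream(wire));
            SketchDoubleWritable copy = new SketchDoubleWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(wire.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SketchDoubleWritable copy = copyViaBytes(new SketchDoubleWritable(3.25));
        System.out.println("round-tripped value: " + copy.get());
    }
}
```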