DC area event: Investing in the Cloud: A Breakfast Conversation

2009-05-11 Thread Jimmy Lin

Just another reminder for our upcoming event this week...

---
We are pleased to present the following special event at the University 
of Maryland, featuring Christophe Bisciglia of Cloudera and Deepak Singh 
of Amazon Web Services...


Investing in the Cloud: A Breakfast Conversation

Wednesday, May 13, 2009
8:00 a.m. - 10:00 a.m.

Co-hosted by:
Dingman Center for Entrepreneurship, Robert H. Smith School of Business
Human-Computer Interaction Laboratory (HCIL), The iSchool
University of Maryland

Sponsored by Redshift Ventures and Pillsbury

URL:
http://www.umiacs.umd.edu/~jimmylin/cloud-computing/2009-05-13-breakfast/

= Overview

Cloud computing, whether in reference to utility computing, software as
a service, or the ability to perform analytics on large datasets with
emerging technologies such as Hadoop, represents tremendous market and
investment opportunities.

Established organizations are leveraging cloud technologies to provide a
path to greater efficiencies in data center operations, through
centralized management and economies of scale. Even more exciting,
however, are the yet-to-be-exploited new business opportunities in the
cloud space. Already, organizations are using Hadoop for Web-scale
analytics, tackling problems that only a few years ago seemed
intractable to all but a few. Similarly, organizations are taking
advantage of utility computing services, converting capital costs into
operational costs and reaping the benefits of on-demand computational
resources.

Join us as we explore the intersection of investment, entrepreneurship,
and cloud computing with Christophe Bisciglia, representing Cloudera,
whose mission is to provide enterprise-level support to users of Hadoop,
and Deepak Singh, representing Amazon Web Services, a major provider of
utility computing and cloud infrastructure. The discussion will be
moderated by Prof. Jimmy Lin, who leads Maryland's cloud computing
efforts in the Google/IBM Academic Cloud Computing Initiative.

= Schedule and Logistics

8:00 am – 8:30 am   Breakfast and Networking
8:30 am – 9:30 am   Plenary session by invited speakers
9:30 am – 10:00 am  Panel session moderated by Jimmy Lin

The event will take place at:

 Robert H. Smith School of Business
 2505 Van Munching Hall
 College Park, Maryland
 Directions at http://www.rhsmith.umd.edu/about/directions.aspx

This event is free and open to the public. However, please register for
the event at
http://www.slyreply.com/Event/EventDetails.aspx?eid=kwIpwWNRrO8%3d

= Speaker Bios

Christophe Bisciglia joins Cloudera from Google, where he created and
managed their Academic Cloud Computing Initiative. Starting in 2007, he
began working with the University of Washington to teach students about
Google's core data management and processing technologies—MapReduce and
GFS. This quickly brought Hadoop into the curriculum, and has since
resulted in an extensive partnership with the National Science
Foundation (NSF) which makes Google-hosted Hadoop clusters available for
research and education worldwide. Beyond his work with Hadoop, he holds
patents related to search quality and personalization, and spent a year
working in Shanghai. Christophe earned his degree, and remains a
visiting scientist, at the University of Washington.

Deepak Singh is a business development manager at Amazon Web Services
where he spends a lot of time working with developers and organizations
looking to leverage Amazon EC2 for a variety of applications, especially
in the areas of scientific research and data analytics. Prior to joining
Amazon Web Services, Deepak worked at a number of life science
informatics and software companies: as a strategist at Rosetta
Biosoftware, a product manager and consortium director at Accelrys, and
a scientific programmer at GeneFormatics. He has a PhD in physical
chemistry from Syracuse University. Deepak is also an active blogger and
podcaster. At business|bytes|genes|molecules (http://mndoci.com) and
Coast to Coast Bio (http://c2cbio.com) he writes and talks about a
variety of topics at the interface of the biosciences and technology,
with special interests in open data, computing, and the web as a
platform for science.

Jimmy Lin is an Associate Professor in the iSchool at the University of
Maryland, with affiliations in the Department of Computer Science and
the Institute for Advanced Computer Studies, as well as the National
Center for Biotechnology Information (NCBI) at the National Library of
Medicine (NLM), National Institutes of Health (NIH). He received a
Ph.D. in Electrical Engineering and Computer Science from MIT in 2004.
Dr. Lin's research primarily lies at the intersection of information
retrieval and natural language processing, but his interests extend to
human-computer interaction, bioinformatics, medical informatics, and
large-scale distributed systems.



Re: Coordination between Mapper tasks

2009-03-28 Thread Jimmy Lin
Hmmm... sounds odd. Given the same memcached servers (config), the 
hashing should be consistent.

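One thing to double-check: every map task must construct its client with
the same locator and the same ordered server list. A minimal sketch of
what I mean (my assumption about your setup, using spymemcached's ketama
factory; the class and method names are made up):

import java.io.IOException;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.KetamaConnectionFactory;
import net.spy.memcached.MemcachedClient;

public class DictionaryClient {
  // "servers" should come from the JobConf so that every task sees
  // exactly the same string, e.g. "host1:11211 host2:11211"
  public static MemcachedClient create(String servers) throws IOException {
    return new MemcachedClient(
        new KetamaConnectionFactory(),    // consistent (ketama) hashing
        AddrUtil.getAddresses(servers));  // same order on every node
  }
}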

FYI, all code for the experiments described in that tech report is in 
Cloud9, the library I use for teaching my courses.  Download at:


http://www.umiacs.umd.edu/~jimmylin/

Hope this helps! (Let me know off list if you need more details)

-Jimmy

Stuart White wrote:

You might want to look at a memcached solution some students and I worked
out for exactly this problem.


Thanks, Jimmy!  This paper does exactly describe my problem.

I started working to implement the memcached solution you describe,
and I've run into a small problem.  I've described it on the
spymemcached forum:

http://groups.google.com/group/spymemcached/browse_thread/thread/7b4d82bca469ed20

Essentially, it seems the keys are being hashed inconsistently by
spymemcached across runs.  This, of course, will result in
inconsistent/invalid results.

Did you guys run into this?  Since I'm new to memcached, I'm hoping
that this is simply something I don't understand or am overlooking.



Re: Using HDFS to serve www requests

2009-03-26 Thread Jimmy Lin

Brian---

Can you share some performance figures for typical workloads with your 
HDFS/Fuse setup?  Obviously, latency is going to be bad but throughput 
will probably be reasonable... but I'm curious to hear about concrete 
latency/throughput numbers.  And, of course, I'm interested in these 
numbers as a function of concurrent clients... ;)


Somewhat independent of file size is the workload... you can have huge 
TB-size files, but still have a seek-heavy workload (in which case HDFS 
is probably a sub-optimal choice).  But if seek-heavy loads are 
reasonable, one can solve the lots-of-little-files problem by simple 
concatenation.

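To make the concatenation idea concrete, here's a rough sketch (my own,
with made-up names, against the pre-0.20 API) that packs many small local
files into one SequenceFile keyed by filename:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  // usage: PackSmallFiles <output.seq> <local input files...>
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);          // HDFS, for the output
    FileSystem local = FileSystem.getLocal(conf);  // local disk, for inputs
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, BytesWritable.class);
    for (int i = 1; i < args.length; i++) {
      Path p = new Path(args[i]);
      byte[] buf = new byte[(int) local.getFileStatus(p).getLen()];
      FSDataInputStream in = local.open(p);
      in.readFully(buf);
      in.close();
      // key = original filename, value = raw bytes
      writer.append(new Text(p.getName()), new BytesWritable(buf));
    }
    writer.close();
  }
}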

Finally, I'm curious about the Fuse overhead (vs. directly using the 
Java API).


Thanks in advance for your insights!

-Jimmy

Brian Bockelman wrote:


On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:


In general, Hadoop is unsuitable for the application you're suggesting.
Systems like Fuse HDFS do exist, though they're not widely used.


We use FUSE on a 270TB cluster to serve up physics data because the 
client (2.5M lines of C++) doesn't understand how to connect to HDFS 
directly.


Brian


I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is "huge?" It might be
useful if these images are 1 GB or larger. But in general, "huge" on
Hadoop means 10s of GBs up to TBs.  If you have a large number of
moderately-sized files, you'll find that HDFS responds very poorly for
your needs.

It sounds like glusterfs is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer  wrote:


This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster and running Map Reduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can, or is it right to, serve
Images and data from HDFS using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at glusterfs before, they had an Apache and Lighttpd module
which made it simple, does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P






Re: Coordination between Mapper tasks

2009-03-21 Thread Jimmy Lin

Hi Stuart,

You might want to look at a memcached solution some students and I 
worked out for exactly this problem.  It's written up in:


Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. 
Low-Latency, High-Throughput Access to Static Global Resources within 
the Hadoop Framework. Technical Report HCIL-2009-01, University of 
Maryland, College Park, January 2009.


Available at:

http://www.umiacs.umd.edu/~jimmylin/publications/by_year.html

Best,
Jimmy

Stuart White wrote:

Thanks to everyone for your feedback.  I'm unfamiliar with many of the
technologies you've mentioned, so it may take me some time to digest
all your responses.  The first thing I'm going to look at is Ted's
suggestion of a pure map-reduce solution by pre-joining my data with
my lookup values.

On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley  wrote:

On Thu, Mar 19, 2009 at 6:42 PM, Stuart White wrote:


My process requires a large dictionary of terms (~ 2GB when loaded
into RAM).  The terms are looked-up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.


I would recommend using the multi-threaded map runner. Have 1 map/node and
just use 3 worker threads that all consume the input. The only disadvantage
is that it works best for cpu-heavy loads (or maps that are doing crawling,
etc.), since you only have one record reader for all three of the map
threads.

In the longer term, it might make sense to enable parallel jvm reuse in
addition to serial jvm reuse.

-- Owen
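
For reference, here's what I take Owen's suggestion to look like in code
(a sketch against the old mapred API; the dictionary itself would be
loaded once in Mapper.configure() and shared via a static reference):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class SharedDictionaryJob {
  public static void configure(JobConf conf) {
    // one JVM per node, multiple threads sharing one in-memory dictionary
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // all three worker threads consume records from a single RecordReader
    conf.setInt("mapred.map.multithreadedrunner.threads", 3);
  }
}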





Re: OT: How to search mailing list archives?

2009-03-08 Thread Jimmy Lin

I've found nabble to be helpful:
http://www.nabble.com/Hadoop-core-user-f30590.html

-Jimmy

Miles Osborne wrote:

posts tend to get indexed by Google, so try that

Miles

2009/3/8 Stuart White :

This is slightly off-topic, and I realize this question is not
specific to Hadoop, but what is the best way to search the mailing
list archives?  Here's where I'm looking:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/

I don't see any way to search the archives.  Am I missing something?
Is there another archive site I should be looking at?

Thanks!







Re: Lazily deserializing Writables

2008-10-02 Thread Jimmy Lin
Hi Bryan,

Thanks, this answers my question!  So at the very least you'll have to
read in the raw bytes and hang on to them.

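For the archives, a rough sketch of the size-prefix trick Bryan describes
(my own illustration, not the actual ThriftWritable code; note that a real
Writable also needs a no-arg constructor):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class LazyWritable<T extends Writable> implements Writable {
  private final T instance;    // the expensive-to-deserialize payload
  private byte[] raw;          // serialized bytes, kept until needed
  private boolean deserialized;

  public LazyWritable(T instance) {
    this.instance = instance;
  }

  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    instance.write(new DataOutputStream(baos));
    out.writeInt(baos.size());  // prepend the size...
    out.write(baos.toByteArray());
  }

  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();    // ...so we can slurp the bytes off the wire
    raw = new byte[size];
    in.readFully(raw);
    deserialized = false;
  }

  public T get() throws IOException {  // deserialize only on demand
    if (!deserialized) {
      instance.readFields(new DataInputStream(new ByteArrayInputStream(raw)));
      deserialized = true;
    }
    return instance;
  }
}
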
-Jimmy

> We do this with some of our Thrift-serialized types. We account for
> this behavior explicitly in the ThriftWritable class and make it so
> that we can read the serialized version off the wire completely by
> prepending the size. Then, we can read in the raw bytes and hang on
> to them for later as we see fit. I would think that leaving the bytes
> on the DataInput would break things in a very impressive way.
>
> -Bryan
>
> On Oct 2, 2008, at 2:48 PM, Jimmy Lin wrote:
>
>> Hi everyone,
>>
>> I'm wondering if it's possible to lazily deserialize a Writable.
>> That is,
>> when my custom Writable is handed a DataInput from readFields, can I
>> simply hang on to the reference and read from it later?  This would be
>> useful if the Writable is a complex data structure that may be
>> expensive
>> to deserialize, so I'd only want to do it on-demand.  Or does the
>> runtime
>> mutate the underlying stream, leaving the Writable with a reference to
>> something completely different later?
>>
>> I'm wondering about both present behavior, and the implicit contract
>> provided by the Hadoop API.
>>
>> Thanks!
>>
>> -Jimmy
>>
>>
>
>
>




Lazily deserializing Writables

2008-10-02 Thread Jimmy Lin
Hi everyone,

I'm wondering if it's possible to lazily deserialize a Writable.  That is,
when my custom Writable is handed a DataInput from readFields, can I
simply hang on to the reference and read from it later?  This would be
useful if the Writable is a complex data structure that may be expensive
to deserialize, so I'd only want to do it on-demand.  Or does the runtime
mutate the underlying stream, leaving the Writable with a reference to
something completely different later?

I'm wondering about both present behavior, and the implicit contract
provided by the Hadoop API.

Thanks!

-Jimmy




slash in AWS Secret Key, WAS Re: Namenode Exceptions with S3

2008-07-09 Thread Jimmy Lin
I've come across this problem before.  My simple solution was to
regenerate new keys until I got one without a slash... ;)

-Jimmy

> I have Hadoop 0.17.1 and an AWS Secret Key that contains a slash ('/').
>
> With distcp, I found that using the URL format s3://ID:[EMAIL PROTECTED]/
> did not work, even if I encoded the slash as "%2F".  I got
> "org.jets3t.service.S3ServiceException: S3 HEAD request failed.
> ResponseCode=403, ResponseMessage=Forbidden"
>
> When I put the AWS Secret Key in hadoop-site.xml and wrote the URL as
> s3://BUCKET/ it worked.
>
> I have periods ('.') in my bucket name, that was not a problem.
>
> What's weird is that org.apache.hadoop.fs.s3.Jets3tFileSystemStore
> uses java.net.URI, which should take care of unencoding the %2F.
>
> -Stuart
>
>
> On Wed, Jul 9, 2008 at 1:41 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> So far, I've had no luck.
>>
>> Can anyone out there clarify the permissible characters/format for aws
>> keys and bucket names?
>>
>> I haven't looked at the code here, but it seems strange to me that the
>> same restrictions on host/port etc apply given that it's a totally
>> different system.  I'd love to see exceptions thrown that are
>> particular to the protocol/subsystem being employed.  The s3 'handler'
>> (or whatever) might be nice enough to check for format violations and
>> throw an appropriate exception, for instance.  It might URL-encode
>> the secret key so that the user doesn't have to worry about this, or
>> throw an exception notifying the user of a bad format.  Currently,
>> apparent problems with my s3 settings are throwing exceptions that
>> give no indication that the problem is actually with those settings.
>>
>> My mitigating strategy has been to change my configuration to use
>> "instance-local" storage (/mnt).  I then copy the results out to s3
>> using 'distcp'.  This is odd since distcp seems ok with my s3/aws
>> info.
>>
>> I'm still unclear as to the permissible characters in bucket names and
>> access keys.  I gather '/' is bad in the secret key and that '_' is
>> bad for bucket names.  Thus far I have only been able to get buckets to
>> work in distcp that have only letters in their names, but I haven't
>> tested too extensively.
>>
>> For example, I'd love to use buckets like:
>> 'com.organization.hdfs.purpose'.  This seems to fail.  Using
>> 'comorganizationhdfspurpose' works but clearly that is less than
>> optimal.
>>
>> Like I say, I haven't dug into the source yet, but it is curious that
>> distcp seems to work (at least where s3 is the destination) and hadoop
>> fails when s3 is used as its storage.
>>
>> Anyone who has dealt with these issues, please post!  It will help
>> make the project better.
>>
>> -lincoln
>>
>> --
>> lincolnritter.com
>>
>>
>>
>> On Wed, Jul 9, 2008 at 7:10 AM, slitz <[EMAIL PROTECTED]> wrote:
>>> I'm having the exact same problem, any tip?
>>>
>>> slitz
>>>
>>> On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter
>>> <[EMAIL PROTECTED]>
>>> wrote:
>>>
 Hello,

 I am trying to use S3 with Hadoop 0.17.0 on EC2.  Using this style of
 configuration:

 <property>
   <name>fs.default.name</name>
   <value>s3://$HDFS_BUCKET</value>
 </property>

 <property>
   <name>fs.s3.awsAccessKeyId</name>
   <value>$AWS_ACCESS_KEY_ID</value>
 </property>

 <property>
   <name>fs.s3.awsSecretAccessKey</name>
   <value>$AWS_SECRET_ACCESS_KEY</value>
 </property>

 on startup of the cluster with the bucket having no non-alphabetic
 characters, I get:

 2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode:
 java.lang.RuntimeException: Not a host:port pair: X
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

 If I use this style of configuration:

 <property>
   <name>fs.default.name</name>
   <value>s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]</value>
 </property>

 I get (where the all-caps portions are the actual values...):

 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
 java.lang.NumberFormatException: For input string:
 "[EMAIL PROTECTED]"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:447)
        at java.lang.Integer.parseInt(Integer.java:497)
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
        at org.apache.hadoo

Re: walkthrough of developing first hadoop app from scratch

2008-03-22 Thread Jimmy Lin

Hi Stephen et al.,

I would take advantage of the Hadoop plug-in for Eclipse to handle the 
mundane aspects of putting together your job and running it on the cluster.


With respect to gentler introductions on application development, you 
might want to take a look at the following:


http://www.umiacs.umd.edu/~jimmylin/cloud9/umd-hadoop-dist/cloud9-docs/index.html

Cloud9 is a MapReduce library primarily intended for teaching, which I 
use in my cloud computing course (going on right now).  The associated 
tutorials might help you get started.  Thus far it's worked well with U. 
Maryland grads and undergrads, but I'd appreciate additional feedback.


Incidentally, I will be talking at the Hadoop summit next week, so if 
anyone else on the list will be there, I look forward to meeting everyone!

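As for the jar question itself: it's roughly the two commands you
guessed, plus one to run the job on the cluster. A sketch (paths and
names are placeholders):

javac -classpath /path/to/hadoop-core.jar -d classes MyApp.java
jar -cvf myapp.jar -C classes .
bin/hadoop jar myapp.jar MyApp <input> <output>
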

-Jimmy

Stephen J. Barr wrote:

Hello,

I am working on developing my first hadoop app from scratch. It is a 
Monte-Carlo simulation, and I am using the PiEstimator code from the 
examples as a reference. I believe I have what I want in a .java file. 
However, I couldn't find any documentation on how to make that .java 
file into a .jar that I could run, and I haven't found much 
documentation that is hadoop specific.


Is it basically javac MyApp.java
jar -cf MyApp

or something to that effect, or is there more to it?

Thanks! Sorry for the newbie question.

-stephen barr



Re: Add your project or company to the powered by page?

2008-02-22 Thread Jimmy Lin


University of Maryland
http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html

We are one of six universities participating in IBM/Google's academic
cloud computing initiative.  Ongoing research and teaching efforts
include projects in machine translation, language modeling,
bioinformatics, email analysis, and image processing.


Eric Baldeschwieler wrote:

Hi Folks,

Let's get the word out that Hadoop is being used and is useful in your 
organizations, ok?  Please add yourselves to the Hadoop powered by page, 
or reply to this email with what details you would like to add and I'll 
do it.


http://wiki.apache.org/hadoop/PoweredBy

Thanks!

E14

---
eric14 a.k.a. Eric Baldeschwieler
senior director, grid computing
Yahoo!  Inc.







Question about key sorting interaction effects

2008-02-08 Thread Jimmy Lin

Hi guys,

I was wondering if someone could explain the possible interaction 
effects between the different methods available to control key sorting. 
 Based on my understanding, there are three separate knobs:


- a WritableComparable's compareTo method
- registering a WritableComparator optimization
- setOutputKeyComparatorClass method in JobConf

So here's my question: what happens if these each define a different
sort order?


To be more concrete, in a recent application I inadvertently defined an 
output key comparator that defined an ordering that was different from 
the WritableComparable's natural ordering (as defined by its compareTo). 
 Running the application on small data sets led to (my) expected
behavior, sort order as defined by the output key comparator.  However, 
I got unanticipated results with larger data sets, which leads me to 
suspect that different methods are used to sort at different times...
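
For concreteness, here's a stripped-down sketch of all three knobs in one
place (my own illustration, not the code in question):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyKey implements WritableComparable {
  private int value;

  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }

  // Knob 1: the natural ordering, via compareTo.
  public int compareTo(Object o) {
    int other = ((MyKey) o).value;
    return (value < other) ? -1 : (value == other) ? 0 : 1;
  }

  // Knob 2: a registered raw-bytes comparator; it should impose the same
  // order as compareTo, since the framework may use either one.
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(MyKey.class);
    }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int v1 = readInt(b1, s1);
      int v2 = readInt(b2, s2);
      return (v1 < v2) ? -1 : (v1 == v2) ? 0 : 1;
    }
  }

  static {
    WritableComparator.define(MyKey.class, new Comparator());
  }
}

// Knob 3: jobConf.setOutputKeyComparatorClass(...) overrides both of the
// above for sorting, if set.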


Thanks in advance for the response!

-Jimmy


Why no DoubleWritable?

2008-02-05 Thread Jimmy Lin
Hi guys,

What's the rationale for not implementing a DoubleWritable type
that implements WritableComparable? I noticed that there are classes
corresponding to all Java primitives except for double.

Thanks in advance,
Jimmy
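
For concreteness, here's roughly what I have in mind, mirroring the
existing IntWritable (a sketch, since no such class ships with Hadoop at
the moment):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class DoubleWritable implements WritableComparable {
  private double value;

  public DoubleWritable() {}

  public DoubleWritable(double value) {
    this.value = value;
  }

  public void set(double value) { this.value = value; }
  public double get() { return value; }

  public void write(DataOutput out) throws IOException {
    out.writeDouble(value);
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readDouble();
  }

  public int compareTo(Object o) {
    return Double.compare(value, ((DoubleWritable) o).value);
  }

  public boolean equals(Object o) {
    return (o instanceof DoubleWritable)
        && ((DoubleWritable) o).value == value;
  }

  public int hashCode() {
    return (int) Double.doubleToLongBits(value);
  }
}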