Fwd: May 21 talk at Pasadena JUG

2012-05-04 Thread Mattmann, Chris A (388J)
(apologies for cross posting)

Hey Folks in the SoCal area -- if you're around on May 21st, I'll be speaking 
at the Pasadena JUG on Apache OODT,
Big Data, and likely Apache Hadoop (in prep for my upcoming Hadoop Summit talk).

Info is below; thanks to David Noble for setting this up!

Cheers,
Chris

Begin forwarded message:

The announcement is up on the Meetup site and the Pasadena JUG website, and has 
been sent to mailing lists for the Pasadena JUG, LA JUG, and OC JUG.

If you invite people, please do encourage them to RSVP on the Meetup site. It's 
useful to make sure we have enough food, but also to make sure we set up the 
right room. Last month's talk on Mule & MongoDB had 55 people RSVP (and 
probably more attend) and we had to bump up to a larger room than usual. 
Fortunately Idealab is equipped for that size group :-)

http://www.meetup.com/pasadenajug/
http://www.pasadenajug.org/

I'll follow up with the Apache lists in the next day or so, unless you beat me 
to it.



++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Mattmann, Chris A (388J)
UMMM wow!

That's awesome Jason! Thanks so much!

Cheers,
Chris

On Feb 28, 2012, at 5:41 PM, Jason Trost wrote:

> Blog post for anyone who's interested.  I cover a basic howto for
> getting Nutch to use Apache Gora to store web crawl data in Accumulo.
> 
> Let me know if you have any questions.
> 
> Accumulo, Nutch, and GORA
> http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
> 
> --Jason





Re: Similar frameworks like hadoop and taxonomy of distributed computing

2012-01-11 Thread Mattmann, Chris A (388J)
Here are some links to it:

Long Version: 
http://csse.usc.edu/csse/TECHRPTS/2008/usc-csse-2008-820/usc-csse-2008-820.pdf
Shorter Version (published in WICSA): 
http://wwwp.dnsalias.org/w/images/3/3f/AnatomyPhysiologyGridRevisited66.pdf

Cheers,
Chris

On Jan 11, 2012, at 4:02 PM, Mattmann, Chris A (388J) wrote:

> Also check out my paper, "The Anatomy and Physiology of the Grid Revisited" 
> (just Google for it), where we also tried to look at this very issue.
> 
> Cheers,
> Chris 
> 
> Sent from my iPhone
> 
> On Jan 11, 2012, at 3:55 PM, "Brian Bockelman"  wrote:
> 
>> 
>> On Jan 11, 2012, at 10:15 AM, George Kousiouris wrote:
>> 
>>> 
>>> Hi,
>>> 
>>> see comments in text
>>> 
>>> On 1/11/2012 4:42 PM, Merto Mertek wrote:
>>>> Hi,
>>>> 
>>>> I was wondering if anyone knows any paper discussing and comparing the
>>>> mentioned topic. I am a little bit confused about the classification of
>>>> Hadoop. Is it a cluster, a compute grid, or a mix of them?
>>> I think that a strict definition would be an implementation of the 
>>> map-reduce computing paradigm, for cluster usage.
>>> 
>>>> What is hadoop in
>>>> relation with a cloud - probably just a technology that enables cloud
>>>> services..
>>> It can be used to enable cloud services through a service oriented 
>>> framework, like we are doing in
>>> http://users.ntua.gr/gkousiou/publications/PID2095917.pdf
>>> 
>>> in which we are trying to create a cloud service that offers MapReduce 
>>> clusters as a service and distributed storage (through HDFS).
>>> But this is not the primary usage. This is the back end heavy processing in 
>>> a cluster-like manner, specifically for parallel jobs that follow the MR 
>>> logic.
>>> 
>>>> 
>>>> Can it be compared to cluster middleware like beowulf, oscar, condor,
>>>> sector/sphere, hpcc, dryad, etc? Why not?
>>> I could see some similarities with Condor, mainly in the job submission 
>>> processes; however, I am not really sure how Condor deals with parallel jobs.
>>> 
>> 
>> Since you asked…
>> 
>> 
>> 
>> Condor has a built-in concept of a set of jobs (called a "job cluster").  On 
>> top of its scheduler, there is a product called "DAGMan" (DAG = directed 
>> acyclic graph) that can manage a large number of jobs with interrelated 
>> dependencies (providing a partial ordering between jobs).  Condor with DAG 
>> is somewhat comparable to the concept of Hadoop tasks plus Oozie workflows 
>> (although the data aspects are very different - don't try to stretch it too 
>> far).
>> 
>> Condor / PBS / LSF / {OGE,SGE,GE} / SLURM provide the capability to start 
>> many identical jobs in parallel for MPI-type computations, but I consider 
>> MPI wildly different than the sort of workflows you see with MapReduce.  
>> Specifically, "classic MPI" programming (the versions in wide use; MPI-2 and 
>> later are improved) mostly requires all processes to start simultaneously, 
>> and the job crashes if one process dies.  I think this is why the Top 10 
>> computers tend to measure mean time between failures in tens of hours.
>> 
>> Unlike Hadoop, Condor jobs can flow between pools (they call this 
>> "flocking") and pools can naturally cover multiple data centers.  The 
>> largest demonstration I'm aware of is 100,000 cores across the US; the 
>> largest production pool I'm aware of is about 20-30k cores across 100 
>> universities/labs on multiple continents.  This is not a criticism of Hadoop 
>> - Condor doesn't really have the same level of data integration as Hadoop 
>> does, so it tackles a much simpler problem (i.e., 
>> bring-your-own-data-management!).
>> 
>> 
>> 
>> Brian
>> 
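Brian's point about DAGMan providing a partial (not total) ordering between jobs can be sketched with a toy DAG. The job names below are invented for illustration, and this is plain Python rather than DAGMan's submit-file syntax:

```python
# A toy job DAG: each job maps to the set of jobs it depends on.
# DAGMan consumes a submit-file description of the same shape.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {
    "fetch": set(),          # no prerequisites
    "parse": {"fetch"},      # runs after fetch
    "index": {"parse"},      # runs after parse
    "report": {"parse"},     # also after parse; independent of index
}

# static_order() emits one ordering consistent with the dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Because "index" and "report" are unordered relative to each other, a scheduler is free to run them concurrently; that is what a partial ordering buys over a strictly sequential script.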






[Call for Papers] ICSE Software Engineering for Cloud Computing (SECLOUD) Workshop

2011-01-20 Thread Mattmann, Chris A (388J)
(apologies for the cross posting)

*** PLEASE NOTE - the deadline for submitting papers has been extended by 1 
week to 1/28/2011! ***

Please consider submitting a paper to the ICSE 2011 Software Engineering for 
Cloud Computing (SECLOUD) Workshop to be held Sunday, May 22, 2011, at the 
Hilton Hawaiian Village Resort in Waikiki, Honolulu, HI.

This workshop focuses on identifying the grand challenges that lie before us in 
the realm of software engineering for cloud computing. We will debate existing 
notions of SE for the construction of cloud services and for their deployment. 
We will discuss and evangelize individual successes in the field and attempt to 
identify the key ingredients that enabled that success. Participants in this 
workshop will take a role in helping to formulate a long-term, concrete 
software engineering agenda for cloud and will have an opportunity to share in 
and contribute to the existing “tribal knowledge” for cloud-related software 
engineering.

We invite high quality submissions of technical and research papers, as well as 
research demonstrations describing original and unpublished results of research 
relevant to software engineering for cloud. Authors are asked to reserve a 
section in their submitted papers for identifying their suggested near-term and 
long-term challenge areas for the field.

The workshop seeks to focus discussion around the ways that the disruptive 
effect of cloud computing is engendering a new set of principles and approaches 
to software engineering. Specific topics of interest include, but are not 
limited to:

Topic Areas of Interest:

- Agile Software development on the cloud
- Architecting Applications using the cloud
- Benchmarking and Performance Metrics in the cloud
- Cloud Application Reliability
- Distributed collaboration on the cloud
- Interoperability and Portability challenges in multi-cloud environments
- Open Source Cloud Environments: Hadoop, Eucalyptus and other programming 
paradigms, languages and environments for the Cloud
- Requirements Elicitation effectiveness when using cloud service abstractions
- Secure Software Engineering in the Cloud
- Software Engineering Practices using popular cloud vendor services
- Software Engineering tools
- Software Project Estimations, Testing, Verification and Validation over cloud 
service abstractions

Submission Guidelines:

Submissions may take one of two acceptable forms. Research paper contributions 
may be submitted with a maximum length of 7 pages, and must conform to ICSE 
2011 paper formatting guidelines [1]. Research demonstration submissions are 
limited to 1 page in the conference proceedings and will be given during the 
two 15-minute breakout sessions.

All submissions must be in English and in PDF format [2]. An abstract must 
accompany each research paper contribution. The workshop website [3] will be 
updated to include a link to the CyberChairPRO online submission system once it 
has been set up. All accepted papers will be published and disseminated on the 
workshop website after the conclusion of the workshop.

Review and Evaluation Criteria:

Each submission will be reviewed by at least two members of the Program 
Committee and will be assessed on the basis of originality, importance and 
relevance of contribution to SE for cloud, soundness, quality of presentation, 
and an appropriate comparison to related work. The program committee as a whole 
will make final decisions about acceptance.

Workshop Organizers and PC Chairs:

Nenad Medvidovic, University of Southern California
T.S. Mohan, Infosys Technologies
Chris A. Mattmann, NASA Jet Propulsion Laboratory
Owen O’Malley, Yahoo, Inc.

Important Dates:

Paper Submission: 21.January.2011
Paper Notification: 21.February.2011
Camera Ready: 10.March.2011

Website: http://sites.google.com/site/icsecloud2011

Program Committee:

Andrew F. Hart, NASA JPL, United States
Ben Reed, Yahoo, United States
Craig Lee, Aerospace Corp., United States
Daniel J. Crichton, NASA JPL, United States
David C. Kale, Children's Hospital LA, United States
David M. Woollard, Project WBS, United States
Dennis Kubes, Bit.ly, United States
Eric Dashofy, Aerospace Corp., United States
Jacek Becla, Stanford National Accelerator Lab, United States
Justin Erenkrantz, Project WBS/Apache, United States
Luca Cinquini, NASA JPL, United States
Mohanakrishna, Infosys, India
Nabor Mendonca, Univ. of Fortaleza, Brazil
Sanjay Radia, Yahoo, United States
Schahram Dustdar, TU Wien, Vienna
Srinivas Padmanabhuni, Infosys, India
Stefan Tai, Karlsruhe Institute of Technology, Germany
YN Srikant, Indian Institute of Science, Bangalore, India

Links:

[1] ICSE-2011 Format: 
http://2011.icse-conferences.org/content-submission-guidelines
[2] ACM Submission Policies: 
http://www.acm.org/sigs/publications/proceedings-templates
[3] Workshop Website: http://sites.google.com/site/icsecloud2011


About ICSE: 

The International Conference on Software Engineering (ICSE) is the premier 
software engineering conference, providing a forum for researchers, 
practitioners, and educators.


Re: AWS Hadoop 20.2 AMIs

2010-11-18 Thread Mattmann, Chris A (388J)
Hey Mike,

Do you have time to submit a patch? You could probably create a JIRA issue here 
[1] and then attach a diff of your update...

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/HADOOP

On Nov 17, 2010, at 11:26 AM, Gangl, Michael E (388K) wrote:

> FYI, I commented out the kernel version in the hadoop-ec2-env.sh script for 
> the c1.xlarge if statements (at the bottom).
> 
> Before it was using aki-427d952b
> 
> Now it's using aki-b51cf9dc
> 
> And I'm able to connect. Turns out the problem was a hang during the boot. 
> This should probably be changed in the future releases of the ec2 scripts (if 
> it's not changed already :) )
> 
> -Mike
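For reference, the edit Mike describes would look roughly like this in src/contrib/ec2/bin/hadoop-ec2-env.sh; the variable name here is illustrative (from memory of the script), but the AKI IDs are the ones from his mail:

```shell
# In the c1.xlarge if-statement near the bottom of hadoop-ec2-env.sh:
# AMI_KERNEL=aki-427d952b   # old kernel: instance hung during boot
AMI_KERNEL=aki-b51cf9dc     # kernel Mike reports booting cleanly
```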
> 
> 
> On 11/17/10 10:41 AM, "Michael Gangl"  wrote:
> 
> I've been running into an issue today.
> 
> I'm trying to procure 5 c1.xlarge instances on Amazon EC2. I was able to use 
> the 453820947548/bixolabs-hadoop-0.20.2-i386 AMI for my previous m1.large 
> instances, so I figured I could use the c1.xlarge instances with the x86_64 
> versions.
> 
> When I start these with the src/contrib/ec2/bin scripts, however, the master 
> starts but then I'm unable to connect to the xlarge instances. I can still 
> use the m1.large instances, but these are too slow for me, so I'd like to use 
> the bigger machine. Has anyone else been having problems today, or in the 
> past with getting an AMI to work on the xlarge instances?
> 
> Thanks,
> 
> Mike
> 





Re: REST web service on top of Hadoop

2010-07-28 Thread Mattmann, Chris A (388J)
Hi Alex,

I had one of my students in my Search Engines class at USC build this very 
project. I will work on cleaning it up and trying to get it patch-ready...

Cheers,
Chris


On 7/28/10 1:03 PM, "Alex Kozlov"  wrote:

Since no one answered: AFAIK there is no REST interface to Hadoop/HDFS.
Would be an interesting contrib project.  -- Alex K

On Wed, Jul 28, 2010 at 7:01 AM, eluharani zineellabidine <
eluhar...@gmail.com> wrote:

> Hi All,
>
> I am still a newbie to Hadoop so please understand my confusions. I went
> through many tutorials and through Hadoop's API and yet couldn't figure out
> how to build a REST web service that would initialize a MapReduce job and read
> the result. Can anybody help with this issue? Are there any specific
> class(es) to use, or do I have to make some specific integration?
>
> I really appreciate your time and help.
>
> --
> ZineEllabidine Eluharani
>
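A minimal sketch of the contrib idea Alex suggests, assuming the obvious design of fronting the `hadoop` CLI with an HTTP endpoint. The job-request fields and jar path are invented for illustration, not an existing API:

```python
# Sketch: accept a JSON job description over REST and launch a Hadoop
# streaming job via the CLI. Field names and the jar path are illustrative.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_streaming_cmd(job):
    """Compose the `hadoop jar` streaming invocation for a job request."""
    return [
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", job["input"],
        "-output", job["output"],
        "-mapper", job["mapper"],
        "-reducer", job["reducer"],
    ]

class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON body into a job description.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        cmd = build_streaming_cmd(json.loads(body))
        # Fire and forget; a fuller service would track the job and serve
        # its output (e.g. via `hadoop fs -cat`) from a GET handler.
        proc = subprocess.Popen(cmd)
        self.send_response(202)  # accepted
        self.end_headers()
        self.wfile.write(json.dumps({"pid": proc.pid}).encode())

def serve(port=8080):
    HTTPServer(("", port), JobHandler).serve_forever()
```

POSTing `{"input": ..., "output": ..., "mapper": ..., "reducer": ...}` to such a server would launch one streaming job; reading results back cleanly is the harder half of the problem.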






Re: HDF5 and Hadoop

2010-05-03 Thread Mattmann, Chris A (388J)
Hi Andrew,

There has been some work in the Tika [1] project recently on looking at NetCDF4 
[2] and HDF4/5 [3] and extracting metadata/text content from them. Though this 
doesn't directly apply to your question below, it might be worth looking at how 
to marry Tika and Hadoop in that regard.

HTH!

Cheers,
Chris

[1] http://lucene.apache.org/tika/
[2] http://issues.apache.org/jira/browse/TIKA-400
[3] https://issues.apache.org/jira/browse/TIKA-399


On 5/3/10 10:36 AM, "Andrew Nguyen"  wrote:

Does anyone know of any existing work integrating HDF5 
(http://www.hdfgroup.org/HDF5/whatishdf5.html) with Hadoop?

I don't know much about HDF5 but it was recently brought to my attention as a 
way to store high-density scientific data.  Since I've confirmed that having 
Hadoop dramatically speeds up our analysis, it seems like marrying the two 
might have some benefits.

I've done some searches on google and it doesn't turn up much.

Thanks!

--Andrew






Re: bulk data transfer to HDFS remotely (e.g. via wan)

2010-03-02 Thread Mattmann, Chris A (388J)
Hi All,

I co-authored a paper about this that was published at the NASA/IEEE Mass 
Storage conference in 2006 [1]. Also, my Ph.D. Dissertation [2] contains 
information about making these types of data movement selections when needed. 
Thought I'd throw it out there in case it helps.

HTH,
Chris

[1] http://sunset.usc.edu/~mattmann/pubs/MSST2006.pdf
[2] http://sunset.usc.edu/~mattmann/Dissertation.pdf




On 3/2/10 11:10 AM, "jiang licht"  wrote:

Hi Brian,

Thanks a lot for sharing your experience. Here I have some questions to bother 
you for more help :)

So basically, data transfer in your case is a two-step job: first, use gridftp 
to make a local copy of the data on the target; second, load the data into the 
target cluster with something like "hadoop fs -put". If this is correct, I am 
wondering if this will consume too much disk space on your target box (since it 
is stored in a local file system prior to being distributed to the hadoop 
cluster). Also, do you do an integrity check for each file transferred (one 
straightforward method might be a 'cksum' or similar comparison, but is that 
doable in terms of efficiency)?
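One way to do the integrity check Michael asks about, sketched in plain Python (the function names are mine; any strong-enough digest works, and streaming in chunks keeps memory flat for large files):

```python
import hashlib

def file_digest(path, algo="md5", chunk=1 << 20):
    """Hash a file in 1 MB chunks so multi-GB files don't need the RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_transfer(src_path, dst_path):
    """Compare digests computed independently on each side of a copy."""
    return file_digest(src_path) == file_digest(dst_path)
```

Efficiency-wise this is one extra sequential read per side, which is usually cheap next to a WAN transfer.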

I am not familiar with gridftp except that I know it is a better choice than 
scp, sftp, etc. in that it can tune TCP settings and run parallel transfers. 
So, I want to know: does it keep a log of which files have and have not been 
successfully transferred, and does gridftp do a file integrity check? Right 
now, I only have one box for data storage (not in the hadoop cluster) and want 
to transfer that data to hadoop. Can I just install gridftp on this box and on 
the name node box to enable gridftp transfer from the first to the second?

Thanks,
--

Michael

--- On Tue, 3/2/10, Brian Bockelman  wrote:

From: Brian Bockelman 
Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
To: common-user@hadoop.apache.org
Date: Tuesday, March 2, 2010, 8:38 AM

Hey Michael,

distcp does a MapReduce job to transfer data between two clusters - but it 
might not be acceptable security-wise for your setup.

Locally, we use gridftp between two clusters (not necessarily Hadoop!) and a 
protocol called SRM to load-balance between gridftp servers.  GridFTP was 
selected because it is common in our field, and we already have the certificate 
infrastructure well setup.

GridFTP is fast too - many Gbps is not too hard.

YMMV

Brian

On Mar 2, 2010, at 1:30 AM, jiang licht wrote:

> I am considering a basic task of loading data to hadoop cluster in this 
> scenario: hadoop cluster and bulk data reside on different boxes, e.g. 
> connected via LAN or WAN.
>
> An example to do this is to move data from amazon s3 to ec2, which is 
> supported in latest hadoop by specifying s3(n)://authority/path in distcp.
>
> But generally speaking, what is the best way to load data to hadoop cluster 
> from a remote box? Clearly, in this scenario, it is unreasonable to copy data 
> to local name node and then issue some command like "hadoop fs 
> -copyFromLocal" to put data in the cluster (besides this, a desired data 
> transfer tool is also a factor, scp or sftp, gridftp, ..., compression and 
> encryption, ...).
>
> I am not aware of generic support for fetching data from a remote box (like 
> that from s3 or s3n), so I am thinking about the following solution (run on 
> remote boxes to push data to hadoop):
>
> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
>
> There are pros (simple and will do the job without storing a local copy of 
> each data file and then do a command like 'hadoop fs -copyFromLocal') and 
> cons (obviously will need many such pipelines running in parallel to speed up 
> the job, but at the cost of creating processes on remote machines to read 
> data and maintain ssh connections, so if data file is small, better archive 
> small files into a tar file before calling 'cat'). Alternative to using a 
> 'cat', a program can be written to keep reading data files and dump to stdin 
> in parallel.
>
> Any comments about this or thoughts about a better solution?
>
> Thanks,
> --
> Michael
>
>
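Michael's suggestion of archiving small files before piping them can be sketched like this (the function name and size cutoff are mine); the resulting tar would then move through a single `cat archive.tar | ssh ... 'hadoop fs -put - dst'` pipeline instead of one connection per tiny file:

```python
import os
import tarfile

def archive_small_files(src_dir, archive_path, max_bytes=64 << 20):
    """Bundle every file under max_bytes into one tar, so one ssh/put
    pipeline moves them all instead of one connection per tiny file."""
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path) and os.path.getsize(path) <= max_bytes:
                tar.add(path, arcname=name)
                count += 1
    return count
```

Writing the archive outside src_dir avoids accidentally tarring the tar into itself.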







