Re: [gt-user] Data Mining using Globus

Jan Ploski Fri, 04 Apr 2008 18:43:24 -0700

Lengyel, Florian wrote:


This needed editing... take two:

These seem like good questions to me. I would like to know
if there is something for software analogous to the
domain naming service for URLs--a "Software Naming Service."
Does such a thing exist?


Hello,

I'd say that there are many approaches to software deployment out there and that the best one depends on your application (see below) and scale (number of execution environments, users, and versions that you care about). The "software naming service" is likely to depend greatly on the preferred deployment approach, which is probably why there is no such service in Globus. Another reason is that Grid hardware resources can be used by very independent communities who might actually be less than happy with artificial dependencies created using some grand unified package management scheme.

One approach is to not install anything up-front, just ship all your statically compiled executables and data along with your jobs. Just do it and be glad if you can. It's not always viable because of two main issues: 1) waste of bandwidth, especially if you have noticeable amounts of job-independent "master data" (some caching schemes might help); 2) it does not work at all with MPI-based parallel applications, which must be compiled and linked on site against the local MPI library to execute at all or to achieve good performance.
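To get a feeling for the bandwidth issue, here is a back-of-the-envelope sketch in Python. The function name, its parameters, and all the sizes are made up for illustration; the "cached" case simply assumes that master data is transferred once per site instead of once per job.

```python
def transfer_gb(jobs_per_site, sites, exe_gb, master_gb, job_data_gb,
                cache_master=False):
    """Estimate total data shipped (in GB) when nothing is pre-installed.

    Every job carries its executable and its own input data. The
    job-independent master data is shipped with every job unless
    cache_master is True, in which case it is shipped once per site.
    """
    total_jobs = jobs_per_site * sites
    per_job_gb = exe_gb + job_data_gb
    master_copies = sites if cache_master else total_jobs
    return total_jobs * per_job_gb + master_copies * master_gb


# Hypothetical workload: 100 jobs on each of 5 sites, a 50 MB executable,
# 2 GB of master data, and 100 MB of per-job input.
uncached = transfer_gb(100, 5, 0.05, 2.0, 0.1)
cached = transfer_gb(100, 5, 0.05, 2.0, 0.1, cache_master=True)
print(f"without caching: {uncached:.0f} GB, with per-site caching: {cached:.0f} GB")
```

With numbers like these, the master data dominates the total, which is why a caching scheme makes such a difference for this approach.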

Another approach is to install and build your software in each cluster more or less manually (using typical SCM tools like CVS/Subversion). Once you manage to get your software deployed "everywhere", just keep a list of configured sites and consult it when submitting jobs (this is easy to automate if you use something like Condor-G for job submission and match-making). In other words, create your own directory service with the kinds of information that you need. Better yet, run some test jobs regularly that check that your working configuration has not been broken externally (by the cluster's admin). You might think that if everyone deployed their own directory services, some terrible redundancy and waste of effort would result. From my experience, constructing such a directory service is easy compared to getting non-trivial software successfully built at different target sites. Unfortunately, the application software that needs building tends to be community-specific, so the outlook for saving effort by doing something across communities is not good.
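A minimal sketch of such a home-grown directory, in Python, might look like the following. Everything in it (the class names, the fields, the idea of treating a site as usable only if a test job succeeded recently) is an assumption made for illustration; it is not part of Globus or Condor-G, and a real setup would feed the test results from actual test jobs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class SiteEntry:
    """One cluster where the application has been deployed (illustrative fields)."""
    name: str
    install_path: str
    app_version: str
    last_test_ok: Optional[datetime] = None  # time of last successful test job


class SiteDirectory:
    """In-memory list of configured sites, consulted before job submission."""

    def __init__(self, max_age: timedelta):
        self.max_age = max_age  # how old a successful test may be
        self._sites: Dict[str, SiteEntry] = {}

    def register(self, entry: SiteEntry) -> None:
        self._sites[entry.name] = entry

    def record_test(self, name: str, ok: bool, when: datetime) -> None:
        # A failed test job invalidates the site until a later test passes.
        self._sites[name].last_test_ok = when if ok else None

    def usable_sites(self, app_version: str, now: datetime) -> List[str]:
        """Names of sites with the wanted version and a recent passing test."""
        return sorted(
            e.name
            for e in self._sites.values()
            if e.app_version == app_version
            and e.last_test_ok is not None
            and now - e.last_test_ok <= self.max_age
        )
```

The result of usable_sites() is what you would hand to your submission tool as the set of acceptable targets; the periodic test jobs only need to call record_test().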

Yet another approach is to maintain hopefully just a single image of a complete system and to rely on the ability of virtual machines (such as UML or VMware) to execute anywhere. This has very similar problems to the first-mentioned approach. Virtual machines may abstract away too much of hardware and network for good performance and they may be too bulky in size.

If you have administrative power over your computing resources or decide to use the VM solution, you could enforce a certain degree of homogeneity and rely on a Linux package manager as your software distribution mechanism. Package managers (and Unix file system layouts) are not optimized toward maintaining multiple versions of the same software on a single system, though. They also require some expertise to build packages and to correctly describe their dependencies. Most people who know how to get application software configured and compiled don't have this sort of expertise. Furthermore, package managers invariably make the assumption of being used by "the administrator", while in a Grid setting each community will have its own set of "administrators".

If you prefer a package management system designed from scratch for Grid clusters, maybe http://www.cmtsite.org/ will be interesting. I swear there was another similar solution, but my attempt to find it again today failed. Maybe someone else can give you some pointers.

Each of these tools has query features. Now try querying what software
is installed on a cluster, or a grid.  What tools would you use? They don't
seem to exist, or if they do, they haven't made it very far in Google's
page ranking.

There seems to be no Software Naming Service that could
be queried and used, comparable to the Domain Naming Service.

The point is, if you installed the software yourself, you know where it is and don't really need a Grid-wide service to find it. On the other hand, if you didn't install the software, then it is either system software, so again you don't need to query for it, or you can't trust the installer to have done it in the exact way you would need it and to keep it so over time. In case you trusted the installer, you would have been informed by her how to find and use this software (e.g., what sort of specifications to include in your jobs to make sure they execute properly).

Perhaps you are suggesting a scenario where a stranger would like to "browse" Grid resources to see what is installed where to select potential job destinations. Based on my experience, this sort of activity only makes sense within a community, which may very well have some sort of directory services, but not at the level of a whole Grid used by different communities.

While I'm on the subject of tools for the end user, what about
a shell that abstracts commands that you do from a workstation to the
grid level? Something that might be called the "gshell."

I'm pretty sure I've heard a talk about something like that being implemented in the development version of gLite, but again, Google fails to locate it.

Where is the grid equivalent of the path? Of ls?
Or for someone who wants to run a job on some collection of clusters,
but needs certain libraries, which may be installed on different
machines out there, somewhere. Is there a grid equivalent of
ldconfig? Or even of something deprecated, like LD_LIBRARY_PATH?


You might be mixing up two different roles:
1) People who deploy software
2) People who run jobs

The way I see it working is: deployers do their thing "somehow" and deliver user interfaces and instructions for people who run jobs. People who run jobs operate on the level of abstraction relevant to their task (which input data should be processed, which application configuration parameters should be used, what to do with the output). The selection of suitable target clusters shouldn't bother end users (it often does, for prosaic reasons such as the amount of free local disk space, local availability of data, and unscheduled downtimes).
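To illustrate what hiding the cluster selection from the end user could mean, here is a small Python sketch that a deployer might put behind the user interface. The SiteStatus record and choose_site() function are hypothetical names; the selection rule (skip sites that are down or short on disk, prefer sites that already hold the input data) is only one plausible policy built from the prosaic reasons mentioned above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class SiteStatus:
    """Snapshot of one target cluster (fields are illustrative)."""
    name: str
    free_disk_gb: float
    local_datasets: FrozenSet[str]  # datasets already present on site
    is_up: bool                     # False during (un)scheduled downtime


def choose_site(sites: Iterable[SiteStatus], needed_disk_gb: float,
                dataset: str) -> Optional[str]:
    """Pick a target cluster so that the person running the job need not.

    Sites that are down or lack disk space are excluded. Among the rest,
    a site that already holds the dataset wins; ties are broken by the
    amount of free disk. Returns None if no site qualifies.
    """
    candidates = [
        s for s in sites
        if s.is_up and s.free_disk_gb >= needed_disk_gb
    ]
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda s: (dataset in s.local_datasets, s.free_disk_gb),
    )
    return best.name
```

The person running the job would only say which dataset to process and roughly how much scratch space is needed; the rest stays the deployer's problem.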

What if end users actually want to tweak and recompile application code? Then the deployer's task should be to provide a reasonable user interface for doing distributed builds and on-site testing. Once again, the application programmers don't need to see much difference from when they were programming for their local cluster or machine.

While I appreciate that the globus toolkit is intended to solve
recurrent middleware problems, where is it being used to address
the most recurring problem of all: getting users to use it?

As you wrote, Globus Toolkit is middleware. The way I see it, the immediate users of Grid middleware are Grid application developers. The actual end users shouldn't need to know much about this sort of thing, just like they shouldn't need to know about Unix administration. The real world might look different, but it's hardly a problem of Globus Toolkit.


Regards,
Jan Ploski
