On Sun, Oct 09, 2011 at 03:23:51PM -0700, Fernando Perez wrote:
> Well, if it integrates 'quite poorly', it would be nice to hear this
> as a bug report or a question on the ipython lists, because PBS/SGE
> support is precisely one of the current goals...  Stefan just pointed
> me to this thread, otherwise I'd have no idea your office chatter has
> seen this problem.

I discussed this with Min at the scipy conference. He seemed to be very
aware of the limitations of the current code base. He did tell me that it
was an area where he wanted to work more. I am happy to hear you confirm
this.

Now let me lay out the technical problem, as far as I understand it. This
may be outdated if there have been major changes since the scipy
conference. I may also be slightly off on the details. But this is the
big picture as I see it.

* Our jobs are typically data processing jobs that require no
  communication: they grab large data as input, run on it, and spit out
  the result. In the configuration that we were discussing in the
  original thread, there might be a large spread in computing time
  between the different jobs (for instance due to varying parameters
  during a cross-validation).

* Our clusters bill per total run time of all the jobs. Some have a
  scheduling service that knows how to dispatch jobs where the data is.
  The pattern of usage is to queue the jobs in the execution queue. The
  data is then uploaded to the cluster by the cluster engine, or even
  to the right node of the cluster if the cluster is asymmetric. Once the
  data is available for a job, and a CPU is available, the cluster engine
  launches the job and starts billing. Once the job is finished, the
  billing for this job ends.

* As far as I know, IPython does not really use this scheduling service:
  it uses the launcher queue to start workers, which then need to be fed
  the jobs. By itself, this does not optimize data transfer very well. In
  addition, if jobs have different run times, it can easily lead to extra
  time being billed on the cluster, for two reasons. One is that some
  jobs may finish much earlier than others, and the workers should really
  die as soon as their job is done. The other is that workers come up
  asynchronously as the queue schedules them. As a user, I can choose to
  wait for all my workers to come up, but this might take a while. If I
  want to dispatch jobs with a greedy strategy, as workers come up, it
  seems to me that I have to write some not completely trivial
  boilerplate code.

* Finally, an extra pain point is that some of the clusters that we work
  on simply kill a job if it takes too long. This means that if I bypass
  the cluster's scheduler and try to use long-running workers, with
  IPython's scheduler dispatching jobs to them, I will most probably have
  my workers killed after they complete a few of the jobs.
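To make the greedy-dispatch point concrete, here is a minimal sketch of
the pattern I have in mind, with plain Python threads standing in for
cluster workers (IPython itself is not involved, and all names here are
made up for illustration): jobs are queued up front, workers come up at
staggered times, and each one greedily pulls work the moment it is up,
rather than waiting for the full pool to be ready.

```python
import queue
import threading
import time

# All jobs are queued before any worker exists.
jobs = queue.Queue()
for i in range(8):
    jobs.put(i)

results = []
results_lock = threading.Lock()

def worker(startup_delay):
    time.sleep(startup_delay)        # workers come up asynchronously
    while True:
        try:
            job = jobs.get_nowait()  # greedy: grab work as soon as we are up
        except queue.Empty:
            return                   # nothing left: stop immediately
        with results_lock:
            results.append(job * job)

# Three "workers" starting at staggered times.
threads = [threading.Thread(target=worker, args=(0.01 * i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

The point is not the code itself, which is trivial in-process, but that
doing the equivalent across a batch queue, with workers appearing minutes
apart, is the non-trivial boilerplate I was referring to.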

A lot of the impedance mismatches that I describe above can most probably
be worked around by writing code to make clever use of IPython and of
the cluster engine together. I am sure that IPython exposes all the right
primitives for this. However, it is a significant investment, and it
seems much easier for me, for the time being, to simply use the queuing
engine as it is meant to be used.

One thing that Min seemed to agree with me on is that an easy feature to
add to IPython, one that would make our lives easier, would be the
ability for workers to commit suicide cleanly once they are done, to
avoid consuming billable resources. It's probably not hard to do, and he
seemed to have this in mind.
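In case the behavior I mean is unclear, here is a hedged sketch (again
plain threads as stand-ins, with a made-up class name): a worker runs
through the work it has been given and, the moment there is none left,
shuts itself down instead of idling, and being billed, until the whole
pool is torn down.

```python
import queue
import threading

class SuicidalWorker(threading.Thread):
    """Hypothetical worker that dies cleanly as soon as its work is done."""

    def __init__(self, work):
        super().__init__()
        self.work = work              # queue.Queue of jobs for this worker
        self.results = []
        self.exited_cleanly = False

    def run(self):
        while True:
            try:
                job = self.work.get_nowait()
            except queue.Empty:
                self.exited_cleanly = True  # done: die, billing stops here
                return
            self.results.append(job + 1)

work = queue.Queue()
for n in [1, 2, 3]:
    work.put(n)

w = SuicidalWorker(work)
w.start()
w.join()
print(w.results, w.exited_cleanly)  # → [2, 3, 4] True
```

On a billed cluster, the `return` above would correspond to the engine
process exiting, so the queuing system stops charging for that slot.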

The lack of understanding between you and me, if there is any, may come
from different usage patterns. As far as I know, physicists who use
clusters for massive simulations typically have processing that goes
together with a lot of communication, and the setup/tear-down is
simultaneous across the grid and fairly cheap. The typical data
processing situation is quite different. This is why we have seen tools
like Hadoop or Disco (http://discoproject.org/) being developed in
addition to things like MPI, which answers a different usage pattern. The
cluster configuration that I am describing above is quite common. I have
seen it in many labs.

> If in addition what you decide to do is to spread the notion that it
> 'works poorly' without providing any specifics nor feedback, and
> without even saying it on our own lists, it's really not helpful.
> It's simply classic FUD, and I'm quite surprised and saddened to hear
> it coming from a core IPython developer.

I cannot wage every war. I do not do cluster work. Honestly: I have
simply never used a cluster. I do work with people who do cluster work,
and I have looked a bit at their difficulties; this is why I was able to
write the paragraphs above. But you cannot ask me to do the impedance
matching between them and the IPython mailing lists. I already work on
many projects, and I find that my efforts are spreading thin.

Now, I am a bit sad that you are calling this FUD. It is quite common
for issues to be under-reported, because reporting issues properly takes
a lot of time. I was just echoing the impression that IPython doesn't
solve the problem for some types of cluster usage. If you look down a bit
in the same thread, when Olivier mentioned a different usage pattern, one
that consists of taking ownership of the cluster, I said:

> Yes, this is the way the IPython guys use it, and it works very well.
> But many of us do not have the option of doing this.

Therefore I am not bashing IPython, just saying that it doesn't seem to
work for everybody. I agree with you that the issue was not fully spelled
out. That makes it useless in terms of improving the software, but it
doesn't automatically make it FUD. If expressing opinions, even when they
are not properly backed up or are flawed, draws blame, I do not think
that we are heading in the right direction. I clearly stated in my
message that it was a second-hand opinion:

> I haven't tried it out myself, but this is really what I hear around me.

I don't cry FUD when somebody says that Mayavi is not the right tool for
their job. If I have time, I try to understand what that person means. It
might be that it is not the right tool for their usage pattern. It might
simply be a misunderstanding of the documentation, in which case I need
to improve it. If I don't have time, as is often the case, I just let it
go.

I think that it is just a question of realizing that matching everybody's
use cases with one piece of software is possible, but requires a very
large investment. In the short term, with limited manpower, we just have
to accept that there will be gaps in our tool stack.

I hope that this exchange will have been useful for you to understand
better some aspects of cluster computing in certain environments. I
cannot really go much further, as it is beyond my expertise.

Cheers,

Gaël

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
