On Sun, Oct 9, 2011 at 5:06 PM, Gael Varoquaux <[email protected]> wrote:

> On Sun, Oct 09, 2011 at 03:23:51PM -0700, Fernando Perez wrote:
>> Well, if it integrates 'quite poorly', it would be nice to hear this
>> as a bug report or a question on the ipython lists, because PBS/SGE
>> support is precisely one of the current goals... Stefan just pointed
>> me to this thread, otherwise I'd have no idea your office chatter has
>> seen this problem.
>
> I discussed this with Min at the scipy conference. He seemed to be very
> aware of the limitations of the current code base. He did tell me that
> it was an area where he wanted to work more. I am happy to hear you
> confirm this.
Oh, we're certainly aware of *many* limitations in the current code base.
We're currently working on finding funding that would go precisely
towards this problem, in partnership with users at large supercomputing
facilities who want and need these improvements. The feedback you provide
below is fantastic, and for that I really thank you. This is the kind of
thing that is best kept for longer-term reference in a place where we can
refine it further, use it to drive design, and later distill it into
concrete issues, so I've lifted most of it verbatim and made a page for
it on our wiki:

http://wiki.ipython.org/Parallel_Computing

It's an excellent start for us.

[...]

> * As far as I know, IPython does not really use this scheduling
> service, and uses the launcher queue to start workers, which then need
> to be fed the jobs. By itself, this does not optimize data transfer
> very well. In
[...]

Yes, it's indeed quite likely that for this combination of operational
constraints, the current API is a poor fit (or, as you correctly suggest
below, that it's possible to get it done but highly inconvenient).

> * Finally, an extra pain point is that some of the clusters we work on
> simply kill a job if it takes too long. This means that if I bypass the
> cluster's scheduler and try to use long-running workers, with IPython's
> scheduler dispatching jobs to them, I will most probably have my
> workers killed after they accomplish a few of the jobs.

It is possible to have engines come and go (leaving their lifetime to be
controlled by the local scheduler). But whether that would be more
convenient with IPython than just doing it by hand, I can't really say
without digging into the details of the specific configuration needed.

> A lot of the impedance mismatches that I describe above can most
> probably be worked around by writing code that makes clever use of
> IPython and the cluster engine together.
> I am sure that IPython exposes all the right primitives for this.
> However, it is a significant investment, and it seems much easier for
> me, at the time being, to simply use the queuing engine as it is meant
> to be used.

And we certainly don't expect ipython to be the right answer for every
conceivable scenario, far from it. But understanding a concrete set of
constraints makes it much more likely that we can improve things in that
direction.

> One thing that Min seemed to agree with me on is that an easy feature
> to add to IPython, one that would make our lives easier, would be the
> ability of workers to commit suicide cleanly once they are done, to
> avoid consuming billable resources. It's probably not hard to do, and
> he seemed to have this in mind.

Absolutely (hey, for all we know, he may have already implemented it!).

> The lack of understanding between you and me, if there is any, may come
> from different usage patterns. As far as I know, physicists who use
> clusters for massive simulations typically have processing going
> together with a lot of communication, and the setup/tear-down is
> simultaneous across the grid and fairly costless. The typical data
> processing situation is quite different. This is why we have seen tools
> like Hadoop or Disco (http://discoproject.org/) being developed in
> addition to things like MPI, which answer a different usage pattern.
> The cluster configuration that I am describing above is quite common.
> I have seen it in many labs.

Yup.

>> If in addition what you decide to do is to spread the notion that it
>> 'works poorly' without providing any specifics nor feedback, and
>> without even saying it on our own lists, it's really not helpful.
>> It's simply classic FUD, and I'm quite surprised and saddened to hear
>> it coming from a core IPython developer.
>
> I cannot wage every war. I do not do cluster work. Honestly: I have
> simply never used a cluster.
> I do work with people who do cluster work, and I have looked a bit at
> their difficulties; this is why I was able to write the paragraph
> above. But you cannot ask me to do the impedance matching between them
> and the IPython mailing lists. I already work on many projects, and I
> find that my efforts are thinning out.
>
> Now, I am a bit sad that you are calling this FUD. It is quite common
> for issues to be under-reported, because reporting issues properly
> takes a lot of time. I was just echoing the impression that IPython
> doesn't solve the problem for some types of cluster usage. If you look
> down a bit in the same thread, when Olivier mentioned a different
> usage pattern that consists of taking ownership of the cluster, I said:
>
>> Yes, this is the way the IPython guys use it, and it works very well.
>> But many of us do not have the option of doing this.
>
> Therefore I am not bashing IPython, just saying it doesn't seem to work
> for everybody. I agree with you that the issue was not fully spelled
> out. That makes it useless in terms of improving the software, but it
> doesn't automatically make it FUD. If expressing opinions, even ones
> that are flawed or not properly backed up, draws blame, I do not think
> that we are heading in the right direction. I clearly stated in my
> message that it was a second-hand opinion:

What I object to is not an opinion (which is a matter of personal
preference), but rather what sounded like a statement of fact, likely to
lead others who come and read these threads later via google to make
decisions on whether IPython works or not, which is completely
unsubstantiated. Keep in mind that people do decide whether to use these
tools based, amongst other things, on reading these discussions.
If they see that, on the list of a very well-run project like sklearn,
the opinion of core devs (and in particular of one who is also an
IPython core dev himself) is that IPython 'works quite poorly for batch
engines', they're quite likely to run away. This, despite the fact that
there are specific pages of documentation on batch engines, as I pointed
out above.

Sowing doubt in the minds of third parties with statements that hint at
general, unspecified problems that can happen if one follows a given
path is what I referred to as FUD, and I think that's how the term is
widely used. Specific criticism, no matter how harsh, I welcome
*always*. We have ~200 open tickets indicating flaws, and I could
probably open another 50 off the top of my head, some pretty sad and
major. Recently a colleague who has been experimenting with
ipython.parallel in a large cluster environment said he was quite happy,
but I kept pressing him for feedback on what *didn't* work for him,
because only by hearing what is a problem will we improve. I know that,
and I've been on the net long enough to know that substantiated
criticism is one of the biggest contributions anyone can give you.

>> I haven't tried it out myself, but this is really what I hear around
>> me.
>
> I don't call FUD when somebody says that Mayavi is not the right tool
> for their job. If I have time, I try to understand what that person
> means. It might be that it is not the right tool for their usage
> pattern. It might simply be a misunderstanding of the documentation,
> in which case I need to improve it. If I don't have time, as is often
> the case, I just let it go.
>
> I think that it is just a question of realizing that matching
> everybody's use cases with one piece of software is possible, but
> requires a very large investment. In the short term, with limited
> manpower, we just have to accept that there will be gaps in our tool
> stack.
> I hope that this exchange will have been useful for you to understand
> better some aspects of cluster computing in some environments. I
> cannot really go much further, as it is beyond my expertise.

In this form, it truly has, and for that I thank you immensely (that's
why I put it up on the wiki, as it will help us refine those ideas
further over time). And I do want to apologize if my response seemed
harsh: if I responded in this manner, it's because I know you are
ultimately much better than that; I hold the people I admire and respect
to a higher standard than most :)

And I know you're already overworked to the hilt, so I don't expect you
to be able to work on every front always. But please do keep in mind
that *how* you express these ideas matters.

All the best,

f

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
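The "workers commit suicide cleanly once they are done" idea discussed
in this thread can be sketched independently of IPython's actual API.
The following is a minimal, hypothetical illustration using only the
Python standard library (names like `worker` and `run` are made up for
this sketch, not IPython functions): each worker drains a shared task
queue and terminates itself as soon as no work remains, instead of
idling on billable cluster time.

```python
# Sketch of the "worker suicide" pattern: workers exit on their own
# once the task queue is drained, rather than waiting to be killed by
# the cluster scheduler. Illustrative only; not the IPython API.
import queue
import threading

def worker(tasks, results):
    """Process tasks until none remain, then exit cleanly."""
    while True:
        try:
            job = tasks.get_nowait()
        except queue.Empty:
            return  # queue drained: shut down instead of idling
        results.put(job * job)  # stand-in for real work

def run(jobs, n_workers=4):
    """Fan jobs out to short-lived workers; return sorted results."""
    tasks, results = queue.Queue(), queue.Queue()
    for j in jobs:
        tasks.put(j)
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # every worker has already exited on its own
    return sorted(results.get() for _ in range(results.qsize()))

if __name__ == "__main__":
    print(run([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

On a real batch system, the same shutdown-when-done logic would live in
the engine process, so the batch job ends (and billing stops) as soon as
the engine's share of the work is finished.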
