Re: [OMPI devel] autodetect broken
Heh. The list overrides the reply-to. :-) Agreed -- let's you, me, and Brian discuss off-list and let anyone who cares know what the final solution is that we come up with. FWIW, we've become quite fond of developing in mercurial branches and pushing them somewhere to share when you want to get others opinions before committing to the trunk. bitbucket.org, for example, offers free mercurial hosting. I use it quite heavily. On Jul 22, 2009, at 1:46 PM, Iain Bason wrote: On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote: > The current autodetect implementation seems like the wrong approach > to me. I'm rather unhappy the base functionality was hacked up like > it was without any advanced notice or questions about original > design intent. We seem to have a set of base functions which are now > more unreadable than before, overly complex, and which leak memory. First, I should apologize for my procedural misstep. I had asked my group here at Sun whether I should do an RFC or something, and I guess I didn't make my question clear enough. I was under the impression that I should check something in and let people comment on it. That being said, I would argue that there are good reasons for adding some complexity to the base component: 1. The pre-existing implementation of expansion is broken (although the cases for which it is broken are somewhat obscure). 2. The autodetect component cannot set more than one directory without some knowledge of the relationships amongst the various directories. Giving it that knowledge would violate the independence of the components. You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..' OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you have installed in /opt/SUNWhpc/HPC8.2). Yes, I know, "Why would anyone do that?" Nonetheless, I find that a poor excuse for having a bug in the code. To expand on #2 a little, consider the case where OpenMPI is configured with "--prefix=/path/one --libdir=/path/two". 
We can tell that libopen-pal is in /path/two, but it is not correct to assume that the prefix is therefore /path. (Unfortunately, there is code in OpenMPI that does make that sort of assumption -- see orterun.c.) We need information from the config component to avoid making incorrect assumptions. There are, of course, alternate ways of getting to the same point. But it is not feasible simply to leave the design of the base component unchanged. (More on that below.) As for readability, I am always open to constructive suggestions as to how to make code more readable. I didn't fix the memory leak because I hadn't yet found a way to do that without decreasing readability. > The intent of the installdirs framework was to allow this type of > behavior, but without rehacking all this infer crap into base. The > autodetect component should just set $prefix in the set of functions > it returns (and possibly libdir and bindir if you really want, which > might actually make sense if you guess wrong), and let the expansion > code take over from there. The general thought on how this would > work went something like: > > - Run after config > - If determine you have a special $prefix, set the > opal_instal_dirs.prefix to NULL (yes, it's a bit of a hack) and > set your special prefix. > - Same with bindir and libdir if needed > - Let expansion (which runs after all components have had the > chance to fill in their fields) expand out with your special > data If we run the autodetect component after config, and allow it to override values that are already in opal_install_dirs, then there will be no way for users to have environment variables take precedence. (Let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component will not know whether opal_install_dirs.libdir has been set by the env component or by the config component.) 
Moreover, if the user has used an environment variable to override one of the paths in the config component, then the autodetect component may make the wrong inference. For example, let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component finds libopen-pal in / usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/ lib. However, it has to use the config component's idea of libdir (e.g., ${exec_prefix}/lib) to correctly infer that prefix should be / usr/renamed. Since it only has /foo from the environment variable, it will have to decide that it cannot infer the prefix. All of this will lead to behavior that users will have trouble diagnosing. While I appreciate simple code, I think that a simple user interface is more important. We could add some infrastructure so that the autodetect component can figure out the provenance of each field in opal_install_dirs, but that would make the boundary between the base component and the autodetect component unclear. > And the base stays simple, the components do all the heavy lifting, > and life is happy. Except
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
I apologize for coming to this late - IU's email forwarding jammed up yesterday, so I'm only now getting around to reading the backlog. There has been some off-list discussion about advanced paffinity mappings/bindings. As I noted there, I am in the latter stages of completing a new mapper that allows users to easily specify #cpus to "bind" to each processor. As part of that effort, we have interfaced to the slurm cpus_per_task and cpuset envars. So we should (once this gets done) pretty much handle the slurm environment in that regard. Having worked on the paffinity issue for some time, I am somewhat strongly opinionated that PLPA is doing exactly what it should do. It is up to OMPI/ORTE to identify what cpusets were assigned to the job and figure out the mappings - the PLPA is there to tell us how many processors are available, how many are in each socket, etc., and to do the mechanics of binding the specified process to the specified cpus. I would tend to oppose any change in the relative responsibilities of OMPI/ORTE and PLPA. It is a good division as it stands, and is working well. I haven't read anything in this chain that would change my opinion. Just my $0.0002 Ralph On Jul 22, 2009, at 11:22 AM, Jeff Squyres wrote: On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote: I'm interested in joining the effort, since we will likely have the same problem with SLURM's cpuset support. Ok. > But as to why it's getting EINVAL, that could be wonky. We might want to > take this to the PLPA list and have you run some small, non-MPI examples to > ensure that PLPA is parsing your /sys tree properly, etc. I don't see the /sys implication here. Can you be more precise on which files are read to determine placement ? Check in opal/mca/paffinity/linux/plpa/src/libplpa/ plpa_map.c:load_cache(). IIRC, when you are inside a cpuset, you can see all cpus (/sys should be unmodified) but calling set_schedaffinity with a mask containing a cpu outside the cpuset will return EINVAL. 
Ah, that could be the issue. The only solution I see to solve this would be to get the "allowed" cpus with sched_getaffinity, which should be set according to the cpuset mask. There are two issues here: - what should OMPI do - what should PLPA do PLPA currently does two things: 1. provide a portable set/get affinity API (to isolate you from whatever version you have in your linux install) 2. provide topology mapping information (sockets, cores) PLPA does not currently deal with cpusets. If we want to expand PLPA to somehow interact with cpusets, that should probably be brought up on the PLPA mailing lists (someone made this suggestion to me about a month or two ago and I haven't had a chance to follow up on it :-( ). OMPI (as a whole -- meaning: including the ORTE layer) does the following: 1. decide whether to bind MPI processes or not 2. if we do bind, use the paffinity module to bind processes to specific processors (the linux paffinity module uses PLPA to do the actual binding -- PLPA is wholly embedded inside OMPI's linux paffinity module) And there's two layers involved here: - the main ORTE logic saying both "yes, bind" and making the decision as to which processors to bind to - the linux paffinity component does a thin layer of translation between ORTE's/OMPI's requests and calling the back-end PLPA library As Ralph described, OMPI is currently fairly "dumb" about how it chooses which processors it uses -- 0 to N-1. I think the issue here is to make OMPI smarter about how it chooses which processors to use. It could be in ORTE itself, or it could be in the linux paffinity translation layer (e.g., linux paffinity component could report only as many processors as are available in the cpuset...? And binding could be relative to the cpuset...?). -- Jeff Squyres jsquy...@cisco.com ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] BTL receive callback
Hello Sebastian,

Sounds like you are using the openib btl as a starting point, which is a good place to start. I am curious if you are indeed using a new interconnect (new hardware and protocol) or if it is requirements of the 3D-torus network that are not addressed by the openib btl that are driving the need for a new btl?

-DON

On 07/21/09 11:55, Sebastian Rinke wrote:

Hello,

I am developing a new BTL component (Open MPI v1.3.2) for a new 3D-torus interconnect. During a simple message transfer of 16362 B between two nodes with MPI_Send(), MPI_Recv() I encounter the following:

The sender:
---
1. prepare_src() size: 16304 reserve: 32
   -> alloc() size: 16336
   -> ompi_convertor_pack(): 16304
2. send()
3. component_progress()
   -> send cb ()
   -> free()
4. component_progress()
   -> recv cb ()
   -> prepare_src() size: 58 reserve: 32
   -> alloc() size: 90
   -> ompi_convertor_pack(): 58
   -> free() size: 90    Send is missing !!!
5. NO PROGRESS

The receiver:
---
1. component_progress()
   -> recv cb ()
   -> alloc() size: 32
   -> send()
2. component_progress()
   -> send cb ()
   -> free() size: 32
3. component_progress() for ever !!!

The problem is that after prepare_src() for the 2nd fragment, the sender calls free() instead of send() in its recv cb. Thus, the 2nd fragment is not being transmitted. As a consequence, the receiver waits for the 2nd fragment.

I have found that mca_pml_ob1_recv_frag_callback_ack() is the corresponding recv cb. Before diving into the ob1 code, could you tell me under which conditions this cb calls free() instead of send() so that I can get an idea of where to look for errors in my BTL component.

Thank you very much in advance.

Sebastian Rinke
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
On Wed, Jul 22, 2009 at 19:24, Jeff Squyres wrote:
> Bert -- is this functionality something we'd want to incorporate into PLPA?

What functionality? The complete libcpuset, or just the 'get me the cpuset mask of this task' part? I don't think it is good to duplicate the whole functionality of libcpuset, and taking libcpuset as a dependency of PLPA sounds too heavy. Actually, I really don't know yet.

That the cpuset should be taken into account in the decision process of ORTE/OMPI is beyond question. But I think it suffices if ORTE/OMPI queries the current affinity mask and takes only those processors into account. Whoever set this mask to something smaller than the cpuset mask (or the online mask, in the absence of cpusets) either has to live with that decision taken a priori or knows what he is doing (maybe a batch system without cpuset support).

Bert
Re: [OMPI devel] autodetect broken
On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote: The current autodetect implementation seems like the wrong approach to me. I'm rather unhappy the base functionality was hacked up like it was without any advanced notice or questions about original design intent. We seem to have a set of base functions which are now more unreadable than before, overly complex, and which leak memory. First, I should apologize for my procedural misstep. I had asked my group here at Sun whether I should do an RFC or something, and I guess I didn't make my question clear enough. I was under the impression that I should check something in and let people comment on it. That being said, I would argue that there are good reasons for adding some complexity to the base component: 1. The pre-existing implementation of expansion is broken (although the cases for which it is broken are somewhat obscure). 2. The autodetect component cannot set more than one directory without some knowledge of the relationships amongst the various directories. Giving it that knowledge would violate the independence of the components. You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..' OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you have installed in /opt/SUNWhpc/HPC8.2). Yes, I know, "Why would anyone do that?" Nonetheless, I find that a poor excuse for having a bug in the code. To expand on #2 a little, consider the case where OpenMPI is configured with "--prefix=/path/one --libdir=/path/two". We can tell that libopen-pal is in /path/two, but it is not correct to assume that the prefix is therefore /path. (Unfortunately, there is code in OpenMPI that does make that sort of assumption -- see orterun.c.) We need information from the config component to avoid making incorrect assumptions. There are, of course, alternate ways of getting to the same point. But it is not feasible simply to leave the design of the base component unchanged. (More on that below.) 
As for readability, I am always open to constructive suggestions as to how to make code more readable. I didn't fix the memory leak because I hadn't yet found a way to do that without decreasing readability. The intent of the installdirs framework was to allow this type of behavior, but without rehacking all this infer crap into base. The autodetect component should just set $prefix in the set of functions it returns (and possibly libdir and bindir if you really want, which might actually make sense if you guess wrong), and let the expansion code take over from there. The general thought on how this would work went something like: - Run after config - If determine you have a special $prefix, set the opal_instal_dirs.prefix to NULL (yes, it's a bit of a hack) and set your special prefix. - Same with bindir and libdir if needed - Let expansion (which runs after all components have had the chance to fill in their fields) expand out with your special data If we run the autodetect component after config, and allow it to override values that are already in opal_install_dirs, then there will be no way for users to have environment variables take precedence. (Let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component will not know whether opal_install_dirs.libdir has been set by the env component or by the config component.) Moreover, if the user has used an environment variable to override one of the paths in the config component, then the autodetect component may make the wrong inference. For example, let's say someone runs with OPAL_LIBDIR=/foo. The autodetect component finds libopen-pal in / usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/ lib. However, it has to use the config component's idea of libdir (e.g., ${exec_prefix}/lib) to correctly infer that prefix should be / usr/renamed. Since it only has /foo from the environment variable, it will have to decide that it cannot infer the prefix. 
All of this will lead to behavior that users will have trouble diagnosing. While I appreciate simple code, I think that a simple user interface is more important. We could add some infrastructure so that the autodetect component can figure out the provenance of each field in opal_install_dirs, but that would make the boundary between the base component and the autodetect component unclear. And the base stays simple, the components do all the heavy lifting, and life is happy. Except in the cases where it doesn't work. I would not be opposed to putting in a "find expaneded part" type function that takes two strings like "${prefix}/lib" and "/usr/local/ lib" and returns "/usr/local" being added to the base so that other autodetect-style components don't need to handle such a case, but that's about the extent of the base changes I think are appropriate. Finally, a first quick code review
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Bert -- is this functionality something we'd want to incorporate into PLPA? On Jul 22, 2009, at 1:13 PM, Bert Wesarg wrote: On Wed, Jul 22, 2009 at 18:55, Bert Wesargwrote: > I does not know any C interface to get a tasks cpuset mask (ok, > libcpuset Just an amendment to give the url to the libcpuset homepage: http://oss.sgi.com/projects/cpusets/ > > Bert >> >> Sylvain >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote: I'm interested in joining the effort, since we will likely have the same problem with SLURM's cpuset support. Ok. > But as to why it's getting EINVAL, that could be wonky. We might want to > take this to the PLPA list and have you run some small, non-MPI examples to > ensure that PLPA is parsing your /sys tree properly, etc. I don't see the /sys implication here. Can you be more precise on which files are read to determine placement ? Check in opal/mca/paffinity/linux/plpa/src/libplpa/ plpa_map.c:load_cache(). IIRC, when you are inside a cpuset, you can see all cpus (/sys should be unmodified) but calling set_schedaffinity with a mask containing a cpu outside the cpuset will return EINVAL. Ah, that could be the issue. The only solution I see to solve this would be to get the "allowed" cpus with sched_getaffinity, which should be set according to the cpuset mask. There are two issues here: - what should OMPI do - what should PLPA do PLPA currently does two things: 1. provide a portable set/get affinity API (to isolate you from whatever version you have in your linux install) 2. provide topology mapping information (sockets, cores) PLPA does not currently deal with cpusets. If we want to expand PLPA to somehow interact with cpusets, that should probably be brought up on the PLPA mailing lists (someone made this suggestion to me about a month or two ago and I haven't had a chance to follow up on it :-( ). OMPI (as a whole -- meaning: including the ORTE layer) does the following: 1. decide whether to bind MPI processes or not 2. 
if we do bind, use the paffinity module to bind processes to specific processors (the linux paffinity module uses PLPA to do the actual binding -- PLPA is wholly embedded inside OMPI's linux paffinity module) And there's two layers involved here: - the main ORTE logic saying both "yes, bind" and making the decision as to which processors to bind to - the linux paffinity component does a thin layer of translation between ORTE's/OMPI's requests and calling the back-end PLPA library As Ralph described, OMPI is currently fairly "dumb" about how it chooses which processors it uses -- 0 to N-1. I think the issue here is to make OMPI smarter about how it chooses which processors to use. It could be in ORTE itself, or it could be in the linux paffinity translation layer (e.g., linux paffinity component could report only as many processors as are available in the cpuset...? And binding could be relative to the cpuset...?). -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
On Wed, Jul 22, 2009 at 18:55, Bert Wesarg wrote:
> I do not know of any C interface to get a task's cpuset mask (ok,
> libcpuset

Just an amendment to give the URL of the libcpuset homepage: http://oss.sgi.com/projects/cpusets/

Bert
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
On Wed, Jul 22, 2009 at 17:17, Sylvain Jeaugey wrote:
> Hi Jeff,
>
> I'm interested in joining the effort, since we will likely have the same
> problem with SLURM's cpuset support.
>
> On Wed, 22 Jul 2009, Jeff Squyres wrote:
>
>> But as to why it's getting EINVAL, that could be wonky. We might want to
>> take this to the PLPA list and have you run some small, non-MPI examples to
>> ensure that PLPA is parsing your /sys tree properly, etc.
>
> I don't see the /sys implication here. Can you be more precise on which
> files are read to determine placement?

Most files under /sys/devices/system/cpu/cpu*/topology/*

> IIRC, when you are inside a cpuset, you can see all cpus (/sys should be
> unmodified) but calling set_schedaffinity with a mask containing a cpu
> outside the cpuset will return EINVAL.

No. The Linux kernel ANDs the cpumask of the task's cpuset with the given affinity mask. If no cpuset is involved, it takes the online mask instead. After that, it checks whether the resulting mask is a subset of the online mask and returns -EINVAL if it is not.

> The only solution I see to solve this
> would be to get the "allowed" cpus with sched_getaffinity, which should be
> set according to the cpuset mask.

sched_getaffinity() doesn't return the cpuset mask. It returns the mask of CPUs the task can run on, which is a subset of the cpuset. Also, the initial mask of the task (right after exec) does not need to be the cpuset mask, because the affinity mask is inherited from the parent.

I do not know of any C interface to get a task's cpuset mask (ok, libcpuset; it looks like this lib is in Debian now, note to myself: check this). The Cpus_allowed* fields in /proc/<pid>/status are the same as what sched_getaffinity returns, and /proc/<pid>/cpuset needs to be resolved, i.e. where is the cpuset fs mounted?

Bert
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
Hi Jeff, I'm interested in joining the effort, since we will likely have the same problem with SLURM's cpuset support. On Wed, 22 Jul 2009, Jeff Squyres wrote: But as to why it's getting EINVAL, that could be wonky. We might want to take this to the PLPA list and have you run some small, non-MPI examples to ensure that PLPA is parsing your /sys tree properly, etc. I don't see the /sys implication here. Can you be more precise on which files are read to determine placement ? IIRC, when you are inside a cpuset, you can see all cpus (/sys should be unmodified) but calling set_schedaffinity with a mask containing a cpu outside the cpuset will return EINVAL. The only solution I see to solve this would be to get the "allowed" cpus with sched_getaffinity, which should be set according to the cpuset mask. Sylvain
Re: [OMPI devel] pb with --enable-mpi-threads --enable-progress-threads options
Progress thread support currently does not work, and may never be fully implemented. If you remove that configure option, it should work.

I'm pretty sure we only left that option so developers could play at fixing it, though I don't know of anyone actually making the attempt at the moment (certainly, it would require significant changes to ORTE).

From: Bernard Secher - SFME/LGLS (bernard.secher_at_[hidden])
Date: 2009-07-22 06:29:32

> Hi, I have added the two following options: --enable-mpi-threads --enable-progress-threads in configure step of openmpi-1.3.3. After install, mpirun command doesn't work on a very simple mpi program. There is a dead lock and program is not executed.
>
> Bernard
Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support
I'm the "primary PLPA" guy that Ralph referred to, and I was on vacation last week -- sorry for missing all the chatter. Based on your mails, it looks like you're out this week -- so little will likely occur. I'm at the MPI Forum standards meeting next week, so my replies to email will be sporadic. OMPI is pretty much directly calling PLPA to set affinity for "processors" 0, 1, 2, 3 (which PLPA translates into Linux virtual processor IDs, and then invokes sched_setaffinity with each of those IDs). Note that the EFAULT errors you're seeing in the output are deliberate. PLPA has to "probe" the kernel to see what flavor of API it uses. Based on the error codes that come back, it knows which flavor to use when actually invoking the syscall for sched_setaffinity. So you can ignore those EFAULTs. But as to why it's getting EINVAL, that could be wonky. We might want to take this to the PLPA list and have you run some small, non-MPI examples to ensure that PLPA is parsing your /sys tree properly, etc. Ping when you get back from vacation. On Jul 19, 2009, at 8:14 PM, Chris Samuel wrote: - "Ralph Castain" wrote: > Should just be > > -mca paffinity_base_verbose 5 > > Any value greater than 4 should turn it "on" Yup, that's what I was trying, but couldn't get any output. > Something I should have mentioned. The paffinity_base_service.c file > is solely used by the rank_file syntax. It has nothing to do with > setting mpi_paffinity_alone and letting OMPI self-determine the > process-to-core binding. That would explain why I'm not seeing any output from it then; it and the solaris module are the only ones containing any opal_output() statements in the paffinity section of MCA. I'll try scattering some opal_output()'s into the linux module instead along the same lines as the base module. > You want to dig into the linux module code that calls down > into the plpa. 
The same mca param should give you messages > from the module, and -might- give you messages from inside > plpa (not sure of the latter). The PLPA output is not run time selectable: #if defined(PLPA_DEBUG) && PLPA_DEBUG && 0 :-) cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] autodetect broken
The current autodetect implementation seems like the wrong approach to me. I'm rather unhappy the base functionality was hacked up like it was without any advance notice or questions about original design intent. We seem to have a set of base functions which are now more unreadable than before, overly complex, and which leak memory.

The intent of the installdirs framework was to allow this type of behavior, but without rehacking all this infer crap into base. The autodetect component should just set $prefix in the set of functions it returns (and possibly libdir and bindir if you really want, which might actually make sense if you guess wrong), and let the expansion code take over from there. The general thought on how this would work went something like:

- Run after config
- If you determine you have a special $prefix, set opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and set your special prefix.
- Same with bindir and libdir if needed
- Let expansion (which runs after all components have had the chance to fill in their fields) expand out with your special data

And the base stays simple, the components do all the heavy lifting, and life is happy. I would not be opposed to a "find expanded part" type function, which takes two strings like "${prefix}/lib" and "/usr/local/lib" and returns "/usr/local", being added to the base so that other autodetect-style components don't need to handle such a case, but that's about the extent of the base changes I think are appropriate.

Finally, a first quick code review reveals a couple of problems:

- We don't AC_SUBST variables adding .lo files to build sources in OMPI. Instead, we use AM_CONDITIONALS to add sources as needed.
- Obviously, there's a problem with the _happy variable name consistency in configure.m4
- There's a naming convention issue - files should all start with opal_installdirs_autodetect_, and a number of the added files do not.
- From a finding code standpoint, I'd rather walkcontext.c and backtrace.c be one file with #ifs - for such short functions, it makes it more obvious that it's just portability implementations of the same function.

I'd be happy to discuss the issues further or review any code before it gets committed. But I think the changes as they exist today (even with bugs fixed) aren't consistent with what the installdirs framework was trying to accomplish and should be reworked.

Brian
Re: [OMPI devel] default btl eager_limit
Jeff Squyres wrote:
> Just to follow up for the web archives -- we discussed this on the teleconf yesterday and decided that the assert()'s were not the way to go. Brian was going to hack up a quick check at the end of OB1 add_procs that checks each btl's eager_limit, etc. Terry would expand this to cover dr and csum.

I've received the change from Brian and am working on porting it across all the other PMLs.

--td
Re: [OMPI devel] default btl eager_limit
Just to follow up for the web archives -- we discussed this on the teleconf yesterday and decided that the assert()'s were not the way to go. Brian was going to hack up a quick check at the end of OB1 add_procs that checks each btl's eager_limit, etc. Terry would expand this to cover dr and csum.

On Jul 16, 2009, at 10:10 AM, Terry Dontje wrote:

> Another way to do this, which I am not sure makes sense, is to just add sizeof(mca_pml_ob1_hdr_t) to the btl_eager_limit passed in by the user, thus defining the limit to be specifically for the user data and not the internal headers, which the user may not have any inkling about. However, that may lead the user to not realize there is a man behind the curtain bumping up the limit for the internal headers.
>
> --td
>
> Terry Dontje wrote:
>> I was playing around with some really silly fragment sizes (sub 72
>> bytes) when I ran into some asserts in the btl_openib_sendi. I traced
>> the assert to be caused by mca_pml_ob1_send_request_start_btl()
>> calculating the true eager_limit with the following line:
>>
>> size_t eager_limit = btl->btl_eager_limit - sizeof(mca_pml_ob1_hdr_t);
>>
>> If btl_eager_limit ends up being less than
>> sizeof(mca_pml_ob1_hdr_t), the eager_limit calculated results in a very
>> large number and an assert later on in the stack.
>>
>> It seems to me that it would be nice to insert some checks in
>> mca_btl_base_param_register() to make sure btl_eager_limit is
>> > sizeof(mca_pml_ob1_hdr_t). Am I missing a reason why this was not
>> done in the first place?
>>
>> --td

-- Jeff Squyres jsquy...@cisco.com
[OMPI devel] pb with --enable-mpi-threads --enable-progress-threads options
Hi,

I added the following two options to the configure step of openmpi-1.3.3:

--enable-mpi-threads --enable-progress-threads

After installing, the mpirun command doesn't work, even on a very simple MPI program. There is a deadlock and the program is never executed.

Bernard
Re: [OMPI devel] fortran MPI_COMPLEX datatype broken
A little more data...

ompi_datatype_module.c:442 says

#if 0 /* XXX TODO The following may be deleted, both CXX and F77/F90 complex types are statically set up */

...followed by code to initialize ompi_mpi_cplex (i.e., MPI_COMPLEX). (another TODO!!)

But ompi_mpi_cplex is set up with:

ompi_predefined_datatype_t ompi_mpi_cplex = OMPI_DATATYPE_INIT_DEFER (COMPLEX, OMPI_DATATYPE_FLAG_DATA_FORTRAN | OMPI_DATATYPE_FLAG_DATA_COMPLEX );

and OMPI_DATATYPE_INIT_DEFER has a comment above it saying:

/*
 * Initilization for these types is deferred until runtime.
 *
 * Using this macro implies that at this point not all informations needed
 * to fill up the datatype are known. We fill them with zeros and then later
 * when the datatype engine will be initialized we complete with the
 * correct information. This macro should be used for all composed types.
 */

So the first comment is clearly wrong. Presumably, ompi_mpi_cplex (and friends) *do* need to be set up dynamically at runtime, and the code must be fixed to do so.

On Jul 21, 2009, at 8:51 PM, Jeff Squyres (jsquyres) wrote:

On Jul 21, 2009, at 8:44 PM, Jeff Squyres (jsquyres) wrote:
> The extent for MPI_COMPLEX is returning 0.
>

Sorry -- I accidentally hit "send" way before I finished typing. :-\

You can reproduce the problem with a trivial program:

-----
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Aint extent;

    MPI_Init(NULL, NULL);
    /* Query the extent of the Fortran COMPLEX datatype */
    MPI_Type_extent(MPI_COMPLEX, &extent);
    /* MPI_Aint may be wider than int, so cast explicitly for printf */
    printf("Got extent: %ld\n", (long) extent);
    MPI_Finalize();
    return 0;
}
-----

This is an OMPI that was compiled with Fortran support.
If I break at MPI_Type_extent in gdb, here's what *type is:

-----
(gdb) p *type
$1 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x2a95aa0520, obj_reference_count = 1, cls_init_file_name = 0x2a95626ce0 "ompi_datatype_module.c", cls_init_lineno = 134}, flags = 63011, id = 0, bdt_used = 0, size = 0, true_lb = 0, true_ub = 0, lb = 0, ub = 0, align = 0, nbElems = 1, name = "OPAL_UNAVAILABLE", '\0' , desc = {length = 1, used = 1, desc = 0x2a95ac4640}, opt_desc = {length = 1, used = 1, desc = 0x2a95ac4640}, btypes = {0 }}, id = 25, d_f_to_c_index = 18, d_keyhash = 0x0, args = 0x0, packed_description = 0x0, name = "MPI_COMPLEX", '\0' }
-----

The OPAL_UNAVAILABLE looks ominous...? When I do the same thing with MPI_INTEGER, it doesn't say OPAL_UNAVAILABLE:

-----
(gdb) p *type
$2 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x2a95aa0520, obj_reference_count = 1, cls_init_file_name = 0x2a95626ce0 "ompi_datatype_module.c", cls_init_lineno = 131}, flags = 55094, id = 6, bdt_used = 64, size = 4, true_lb = 0, true_ub = 4, lb = 0, ub = 4, align = 4, nbElems = 1, name = "OPAL_INT4", '\0' , desc = {length = 1, used = 1, desc = 0x2a95777920}, opt_desc = {length = 1, used = 1, desc = 0x2a95777920}, btypes = {0, 0, 0, 0, 0, 0, 1, 0 }}, id = 22, d_f_to_c_index = 7, d_keyhash = 0x0, args = 0x0, packed_description = 0x0, name = "MPI_INTEGER", '\0' }
-----

Note that configure was happy with all the COMPLEX datatypes; config.out and config.log attached. This was with gcc 3.4 on RHEL4.

--
Jeff Squyres
jsquy...@cisco.com
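As a rough illustration of why a deferred datatype reports an extent of 0 until something completes it at runtime, here is a toy model of the pattern described in the OMPI_DATATYPE_INIT_DEFER comment. Everything in it is hypothetical: toy_datatype_t, TOY_DATATYPE_INIT_DEFER, and the two helper functions are made-up names, not Open MPI's structures or macros, and the layout assumed for COMPLEX (two floats) is an assumption for the example only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy descriptor holding only the fields needed to compute an extent.
 * This is NOT Open MPI's datatype structure. */
typedef struct {
    ptrdiff_t lb;      /* lower bound */
    ptrdiff_t ub;      /* upper bound */
    size_t size;       /* size in bytes */
    const char *name;
} toy_datatype_t;

/* Static initializer that zero-fills the layout fields, mirroring the
 * "fill them with zeros and complete later" idea from the quoted comment. */
#define TOY_DATATYPE_INIT_DEFER(NAME) { 0, 0, 0, "MPI_" #NAME }

static toy_datatype_t toy_mpi_cplex = TOY_DATATYPE_INIT_DEFER(COMPLEX);

/* Extent is the span between the bounds: 0 for a never-completed type. */
static ptrdiff_t toy_type_extent(const toy_datatype_t *type)
{
    assert(type != NULL);
    return type->ub - type->lb;
}

/* Runtime completion step: fill in the layout once the element size is
 * known. If this step is skipped (for example because it sits under
 * "#if 0"), the descriptor keeps its zeroed bounds. */
static void toy_datatype_complete(toy_datatype_t *type,
                                  size_t elem_size, size_t count)
{
    assert(type != NULL);
    type->size = elem_size * count;
    type->lb = 0;
    type->ub = (ptrdiff_t) type->size;
}
```

In this model, calling toy_type_extent(&toy_mpi_cplex) before toy_datatype_complete() returns 0, which matches the symptom in the gdb dump (size, lb, and ub all 0); after toy_datatype_complete(&toy_mpi_cplex, sizeof(float), 2) the extent is 2 * sizeof(float).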