On Apr 25, 2012, at 13:59, Alex Margolin wrote:
> I guess you are right.
>
> I started looking into the messages passed between processes, and I may
> have found a problem with the way I handle the "reserved" data requested at
> prepare_src(). I tried to write pretty much the same code as the TCP BTL (the
> relevant part is around "if(opal_convertor_need_buffers(convertor))"), and
> when I copy the buffered data to (frag+1) the program works. When I try to
> optimize by letting the segment point at the data's original location,
> I get MPI_ERR_TRUNCATE. I've printed out the data sent and received; what
> I got ("[]" for sent, "<>" for received, running osu_latency) is appended
> below.
>
> The question is: where is the code responsible for writing the reserved
> data?
It is the PML headers. Based on the error you reported, OMPI is complaining
about truncated data on an MPI_Barrier … which is quite bad, since the barrier
is one of the few operations that do not manipulate any data. I suspect the
PML headers are not located at the expected displacement in the fragment, so
the PML is reading wrong values.
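
To make the expected layout concrete, here is a small stand-alone sketch of
the two ways a prepare_src can describe the reserve and the payload. This is
NOT Open MPI code: demo_frag, prepare_buffered and prepare_inplace are
invented names, and the hdr array plays the role of the memory at (frag + 1).
Whichever path you take, the bytes that go on the wire must be the reserve
(where the PML writes its header) immediately followed by the packed data:

/* Stand-alone sketch -- not Open MPI code; all names below are invented
 * for illustration.  RESERVE plays the role of the PML header size. */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

#define RESERVE 14                       /* header bytes requested via 'reserve' */

struct demo_frag {
    struct iovec  seg[2];                /* stand-in for mca_btl_base_segment_t */
    int           seg_count;             /* stand-in for des_src_cnt */
    unsigned char hdr[RESERVE + 32];     /* the memory "right after the frag",
                                            i.e. what (frag + 1) points to */
};

/* Buffered path (opal_convertor_need_buffers() is true): pack the payload
 * behind the header so one segment covers header + data. */
static void prepare_buffered(struct demo_frag *f, const void *data, size_t len)
{
    memcpy(f->hdr + RESERVE, data, len); /* the convertor packs here */
    f->seg[0].iov_base = f->hdr;
    f->seg[0].iov_len  = RESERVE + len;
    f->seg_count = 1;
}

/* In-place path (need_buffers is false): the header still lives after the
 * frag, but the payload stays in the user buffer and must be exposed as a
 * second segment. */
static void prepare_inplace(struct demo_frag *f, const void *data, size_t len)
{
    f->seg[0].iov_base = f->hdr;
    f->seg[0].iov_len  = RESERVE;        /* first segment: header only */
    f->seg[1].iov_base = (void *)data;   /* second segment: zero-copy payload */
    f->seg[1].iov_len  = len;
    f->seg_count = 2;                    /* forgetting this drops the payload */
}

int main(void)
{
    struct demo_frag f;
    size_t total = 0;
    int i;

    memset(f.hdr, 'H', RESERVE);          /* pretend the PML wrote its header */
    prepare_buffered(&f, "a", 1);
    prepare_inplace(&f, "a", 1);

    for (i = 0; i < f.seg_count; i++)     /* a writev()-style gather */
        total += f.seg[i].iov_len;
    printf("bytes on the wire: %zu (expected %d)\n", total, RESERVE + 1);
    return 0;
}

The in-place path only works if the descriptor really exposes both segments
(des_src_cnt set to 2 in a real BTL); if only the first segment is counted,
nothing but the header goes on the wire, and the receiver ends up parsing
whatever follows as the next header.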
george.
>
> Thanks,
> Alex
>
>
> Always assuming opal_convertor_need_buffers() is true - works (97 is the
> application data, preceded by the 14 reserved bytes):
>
> ...
> [65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> ...
>
> Detecting when opal_convertor_need_buffers() is false and pointing the
> segment at the data in place - fails:
>
> ...
> [65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
> 1 453.26
> [65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
> <65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
> <65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
> [singularity:13509] *** An error occurred in MPI_Barrier
> [singularity:13509] *** reported by process [2239889409,140733193388033]
> [singularity:13509] *** on communicator MPI_COMM_WORLD
> [singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
> [singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [singularity:13509] *** and potentially your MPI job)
> [singularity:13507] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> alex@singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$
>
> On 04/25/2012 04:35 PM, George Bosilca wrote:
>> Alex,
>>
>> You got the banner of the FT benchmark, so I guess at least rank 0
>> successfully completed the MPI_Init call. This is a hint that you should
>> look more closely into the point-to-point logic of your MOSIX BTL.
>>
>> george.
>>
>> On Apr 25, 2012, at 09:30, Alex Margolin wrote:
>>
>>> NAS Parallel Benchmarks 3.3 -- FT Benchmark
>>>
>>> No input file inputft.data. Using compiled defaults
>>> Size : 64x 64x 64
>>> Iterations : 6
>>> Number of processes : 4
>>> Processor array : 1x 4
>>> Layout type : 1D