Hi Christoph,
this type:
x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >
looks like the type of v_dst and v_src and _not_ of A. This suggests
that for some reason the evaluations of v_dst(pt) and v_src(pt) are not
being reused between iterations of the inner for loop: two such lookups per
(i,j) point over a 500x500 region, times 10 timing iterations, account for
almost exactly the ~5 million apply() calls in your profile. Can you pull
them out of the loop, i.e.
val p:Place = v_src.dist()(pt);
val dst = v_dst(pt);
val src = v_src(pt);
for ( (i,j) in (A|p) ) {
    dst(i) += A(i,j)*src(j);
}
and report whether this makes a difference?
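
For reference, dropped into the kernel you quote below, the hoisted version
would look roughly like this (just a sketch, assuming the surrounding
declarations stay as they are and debug remains false, so I have left that
branch out):

finish ateach (pt in v_src) {
    // one DistArray apply() per point of v_src instead of one per (i,j) pair
    val p:Place = v_src.dist()(pt);
    val dst = v_dst(pt);
    val src = v_src(pt);
    for ( (i,j) in (A|p) ) {
        dst(i) += A(i,j)*src(j);
    }
}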
Cheers,
Josh
On 09/09/10 20:13, Christoph Pospiech wrote:
> Hi,
>
> I wrote a small X10 test program which runs an ordinary matrix vector multiply
> in a timing loop, and I am currently stuck assessing the performance.
>
> Running with Eclipse X10DT 2.0.6 and the C++ backend on Linux x86 with one
> place, I am getting the following.
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count = 10
> places = 1
> axis for parallelization = 1
>
> The time is 7.22883933400044 s
>
> This has to be compared to an equivalent Fortran program.
> $ make clean;make
> mpif77 -O3 -c -o mmp.o mmp.f
> Linking mmp ...
> done
> $ mpiexec -np 1 ./mmp
> The wall clock time [s] was 9.17305762413889170E-003
>
> The difference is 2.9 orders of magnitude (7.23 s versus 9.17 ms, roughly a
> factor of 790). Clearly, the X10 program has a performance issue.
>
> OK, there are hints at the following URL.
> http://x10.codehaus.org/Performance+Tuning+an+X10+Application
>
> I compiled my own MPI runtime and have currently ended up with the following.
> $ x10c++ -x10rt mpipg -O -NO_CHECKS -o matmul ../src/matmul.x10
> $ mpiexec -np 1 ./matmul 500 10 1
> ****************************************************
> * X10 test program for matrix vector multiplication
> ****************************************************
>
> matrix size = 500
> loop count = 10
> places = 1
> axis for parallelization = 1
>
> The time is 1.652489519001392 s
>
> That still leaves a gap of 2.25 orders of magnitude (roughly a factor of 180)
> to the Fortran performance.
>
> The "pg" in "-x10rt mpipg" has the following significance.
> $ cat /opt/sw/X10_compiler/etc/x10rt_mpipg.properties
> CXX=mpicxx
> CXXFLAGS=-pg -g -O3
> LDFLAGS=-pg -g -O3
> LDLIBS=-lx10rt_mpi
>
> So I can now run gprof over the resulting gmon.out file (gprof ./matmul
> gmon.out) and see the following.
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  58.06      0.18     0.18  4999853     0.00     0.00  x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>)
>  29.03      0.27     0.09                             matmul__closure__15::apply()
>   6.45      0.29     0.02        1    20.00    20.96  x10_array_DistArray__closure__0<double>::apply()
>   3.23      0.30     0.01  5513706     0.00     0.00  x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*)
> matmul__closure__15::apply() is the calling parent to the hot spot. The call
> graph profile looks like this.
>
> granularity: each sample hit covers 4 byte(s) for 3.23% of 0.31 seconds
>
> index % time    self  children    called     name
>                                                  <spontaneous>
> [1]     90.6    0.09    0.19                 matmul__closure__15::apply() [1]
>                 0.18    0.00 4999823/4999853     x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
>                 0.01    0.00 4999820/5513706     x10::lang::Iterator<x10aux::ref<x10::array::Point> >::itable<x10::lang::Reference>* x10aux::findITable<x10::lang::Iterator<x10aux::ref<x10::array::Point> > >(x10aux::itable_entry*) [5]
>                 0.00    0.00      10/12          x10::array::DistArray<double>::__bar(x10::lang::Place) [10]
> -----------------------------------------------
>                 0.00    0.00      10/4999853     matmul__closure__17::apply() [24]
>                 0.00    0.00      20/4999853     matmul__closure__13::apply() [49]
>                 0.18    0.00 4999823/4999853     matmul__closure__15::apply() [1]
> [2]     58.1    0.18    0.00 4999853         x10::array::DistArray<x10aux::ref<x10::array::Array<double> > >::apply(x10aux::ref<x10::array::Point>) [2]
> -----------------------------------------------
>
> matmul__closure__15::apply() can be identified as the following code snippet,
> which is in fact the heart of the matrix vector multiply.
>
> /**
>  * Next do the local part of the
>  * matrix multiply.
>  */
> finish ateach (pt in v_src) {
>     val p:Place = v_src.dist()(pt);
>     for ( (i,j) in (A|p) ) {
>         v_dst(pt)(i) += A(i,j)*v_src(pt)(j);
>     }
>     if (debug) {
>         val v_src_str = "v_src("+p.id()+")";
>         prettyPrintArray1D(v_src_str, v_src(pt));
>         val v_dst_str = "v_dst("+p.id()+")";
>         prettyPrintArray1D(v_dst_str, v_dst(pt));
>     }
> }
> where
> static type Array1D = Array[Double]{rank==1};
>
> global val v_dst: DistArray[Array1D]{rank==1};
> global val v_src: DistArray[Array1D]{rank==1};
> global val A: DistArray[Double]{rank==2};
>
> - the region for all objects of type Array1D is [0..vsize-1];
> - the region for v_src and v_dst is [0..number_of_places-1];
> - the distribution for v_src and v_dst maps exactly one point to each place;
> - the region for A is [0..vsize-1, 0..vsize-1];
> - in all of the above, number_of_places == 1;
> - in all of the above, debug:Boolean == false.
>
> Am I correct that 58% of the time is spent in
> x10::array::DistArray...::apply(...Point), which I interpret as the evaluation
> of A(i,j) (and perhaps also of v_src(pt) and v_dst(pt))? And is each of these a
> function call, adding up to the 4999823 calls?
>
> That seems like a lot of CPU cycles just to fetch the matrix value A(i,j).
> Perhaps this can be inlined? How?
> And where do all the remaining cycles go that add up to the current performance
> gap of 2.25 orders of magnitude?
>