Re: [pocl-devel] Incorporating Vecmathlib into pocl

Erik Schnetter Wed, 20 Mar 2013 12:25:03 -0700

I am unable to reproduce the problem locally. I do see many test case
failures in as_type and convert_type, but these are for 32-byte vectors
only. (I tried llvm-trunk, but it currently has the minmax problem you
mentioned above.)


With llvm-3.2, I do not see a hang. When I look at the generated code, I
see:

llc kernel-x86_64-apple-darwin12.3.0.bc

In kernel-x86_64-apple-darwin12.3.0.s, search for the string "roundd:"
(yes, two d's, and a colon at the end). My code looks like this:

__Z9_cl_roundd:                         ## @_Z9_cl_roundd
.cfi_startproc
## BB#0:                                ## %entry
vandpd LCPI8848_0(%rip), %xmm0, %xmm1
vaddsd LCPI8848_1(%rip), %xmm1, %xmm1
vroundsd $1, %xmm1, %xmm0, %xmm1
vandpd LCPI8848_2(%rip), %xmm1, %xmm1
vandpd LCPI8848_3(%rip), %xmm0, %xmm0
vorpd %xmm0, %xmm1, %xmm0
ret

Without AVX, the code should be similar, except that the instructions
lack the leading "v" and each takes two operands instead of three. This
corresponds to the implementation of round() in VML:

(1) vandpd: fabs()
(2) vaddsd: add 0.5
(3) vroundsd: floor()
(4) vandpd: set second (unused) vector argument to zero, to ensure it has a
defined value
(5) vandpd: extract sign of the original argument
(6) vorpd: copysign()
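
A minimal scalar sketch of the steps above, for illustration only. It is
not the VML source: the names as_bits, as_double and round_sketch are
hypothetical, and step (4) has no scalar counterpart because it only
clears the unused vector lane. The memcpy calls are the usual way to
reinterpret a double as an integer; with optimization they are expected
to compile to no instructions. The sketch ignores edge cases such as
inputs just below 0.5 or very large magnitudes.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helpers (not VML names): reinterpret bits via memcpy.
static std::uint64_t as_bits(double x) {
  std::uint64_t u;
  std::memcpy(&u, &x, sizeof u);
  return u;
}

static double as_double(std::uint64_t u) {
  double x;
  std::memcpy(&x, &u, sizeof x);
  return x;
}

// Scalar analogue of the instruction sequence; assumes IEEE 754 binary64.
double round_sketch(double x) {
  const std::uint64_t sign_mask = 0x8000000000000000ULL;
  double mag = as_double(as_bits(x) & ~sign_mask);  // (1) fabs: clear sign bit
  mag += 0.5;                                       // (2) add 0.5
  mag = std::floor(mag);                            // (3) floor
  std::uint64_t sign = as_bits(x) & sign_mask;      // (5) extract original sign
  return as_double(as_bits(mag) | sign);            // (6) copysign: OR sign back in
}
```

For example, round_sketch(2.5) takes the path 2.5, 3.0, 3.0 through steps
(1) to (3), and the sign bit is OR-ed back in at the end.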

I have some changes to configure.ac and Makefile.am that allow me to pass
-O3 to clang++ when it compiles VML.

-erik





On Wed, Mar 20, 2013 at 1:48 PM, Erik Schnetter <
[email protected]> wrote:

> On Wed, Mar 20, 2013 at 1:15 PM, Erik Schnetter <
> [email protected]> wrote:
>
>> On Wed, Mar 20, 2013 at 12:06 PM, Pekka Jääskeläinen <
>> [email protected]> wrote:
>>
>>> On 03/20/2013 05:49 PM, Erik Schnetter wrote:
>>>
>>>  With gcc, memcpy is completely optimized away. With clang as well -- I
>>>> am using memcpy internally e.g. to convert doubles into integers to
>>>> access certain bits, and this translates to no instruction at all,
>>>> things are just kept in the same register. I would therefore hope that
>>>> the pocl->vecmathlib transition would be similarly ideal.
>>>>
>>>
>>> Let's hope so. Anyways, the generic type version is useful to ensure
>>> portability to other targets. After all, the most important thing is to
>>> have an inlineable math library. Other optimizations are secondary
>>> at this point.
>>>
>>> I cannot compile vecmathlib separately to produce the 'test' binary:
>>>
>>> [  2%] Building CXX object CMakeFiles/bench.dir/bench.cc.o
>>> make[2]: clang++-mp-3.3: Command not found
>>> make[2]: *** [CMakeFiles/bench.dir/bench.cc.o] Error 127
>>> make[1]: *** [CMakeFiles/bench.dir/all] Error 2
>>>
>>
>> There is a file CMakeLists.txt that hard-codes (didn't use autoconf here)
>> the clang compiler name and compiler options. If you modify these manually,
>> you should be able to build.
>>
>>
>>> But I know it uses the SSE2 optimized header as I inserted an
>>> #warning there where it includes them. I do not have AVX.
>>>
>>> cat /proc/cpuinfo
>>> ...
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>>> nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>>> ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt aes lahf_lm ida arat dtherm
>>> tpr_shadow vnmi flexpriority ept vpid
>>
>>
>> On this system, round() should translate to a single machine instruction.
>> With optimization, clang should inline all function calls, and there should
>> not be a long string of calls.
>>
>> I'm continuing to investigate.
>>
>
> I think I found the problems. The C++ compiler does not know that long and
> double are to be supported, since the C++ code does not include types.h.
> Therefore, only round(float) is generated, and not round(double).
> Presumably, round(double) is then taken from somewhere else. Also, the C++
> compiler doesn't seem to see the optimization settings, so it produces
> unoptimized code, so that the calls to memcpy remain, and the call chain
> within VML is not inlined.
>
> -erik
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
> AIM: eschnett247, Skype: eschnett, Google Talk: [email protected]
>



-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
AIM: eschnett247, Skype: eschnett, Google Talk: [email protected]


_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
