Hi Bogdan,

> -----Original Message-----
> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> Sent: Wednesday, July 29, 2015 4:22 PM
> To: internals@lists.php.net
> Subject: [PHP-DEV] Introduction and some opcache SSE related stuff
> 
> Hi Guys,
> 
> My name is Bogdan Andone and I work for Intel in the area of SW
> performance analysis and optimizations.
> We would like to actively contribute to the Zend PHP project and to
> involve ourselves in finding new performance improvement opportunities
> based on available and/or new hardware features.
> I am still in the source code digesting phase, but I had a look at the
> fast_memcpy() implementation in the opcache extension, which uses SSE
> intrinsics:
> 
> If I am not wrong, the fast_memcpy() function is not currently used, as
> I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you
> probably didn't see any performance benefit, so you preserved the generic
> memcpy() usage.
> 
> I would like to propose a slightly different implementation which uses
> _mm_store_si128() instead of _mm_stream_si128(). This ensures that the
> copied memory is preserved in the data cache, which helps because the
> interpreter will start to use this data without needing to go back to
> memory one more time. _mm_stream_si128() in the current implementation is
> intended for stores where we want to avoid reading data into the cache
> and polluting it; in the opcache scenario it seems that preserving the
> data in cache has a positive impact.
> 
> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1%
> performance increase for the new version of fast_memcpy() compared with
> the generic memcpy(). I get the same result using a full load test with
> http_load on an 18-core Haswell EP.
> 
> Here is the proposed pull request:
> https://github.com/php/php-src/pull/1446
> 
> Related to the SW prefetching instructions in fast_memcpy()... they are
> not really useful in this place. Their benefit is almost negligible, as
> the address requested for prefetch will be needed at the next iteration
> (a few cycles later), while the time needed to get data from RAM is
> usually >100 cycles.
> Nevertheless... they don't hurt, and it seems they still have a very
> small benefit, so I preserved the original instruction and added a new
> prefetch request for the destination pointer.
> 
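
For readers following along, the difference between the two store variants
discussed above can be sketched roughly as below. This is only a minimal
illustration, not the opcache fast_memcpy() code: the names copy_cached() and
copy_streaming() are hypothetical, both buffers are assumed to be 16-byte
aligned, and the prefetch instructions are left out.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies 64 bytes per iteration with regular stores,
 * so the destination lines go through the cache as usual. */
static void copy_cached(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    /* Assumption of this sketch: both pointers are 16-byte aligned. */
    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);

    for (; blocks > 0; blocks--, s += 4, d += 4) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
    }
    memcpy(d, s, size % 64);  /* remaining tail bytes, if any */
}

/* Same loop with non-temporal stores, which hint the CPU to bypass the
 * cache for the destination. */
static void copy_streaming(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = size / 64;

    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);

    for (; blocks > 0; blocks--, s += 4, d += 4) {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, x0);
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);
    }
    _mm_sfence();             /* order the non-temporal stores */
    memcpy(d, s, size % 64);  /* remaining tail bytes, if any */
}
```

Both functions produce the same bytes in the destination; they differ only in
whether the written data is likely to be in cache afterwards. Any performance
difference would have to be measured on the workload in question.
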
AFAIR we always rely on the standard features, thus SSE2 in this particular
case, for better compatibility. IMHO newer features should be adopted more
carefully. Having more stats would help; from what I see at least here
http://store.steampowered.com/hwsurvey it's still not safe to just switch
away from SSE2. Maybe introducing some flexible solution, like compile-time
switches for people who want to exploit features of modern hardware or
specific features available from vendors, could be an approach. But that is,
of course, a question of project policy.
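
To illustrate what such a compile-time switch might look like, here is a
rough sketch. It is not taken from php-src: the macro EXAMPLE_USE_SSE2_COPY
and the function example_copy() are hypothetical names, and the check relies
on the __SSE2__ macro that GCC and Clang define when SSE2 is enabled for the
target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(EXAMPLE_USE_SSE2_COPY) && defined(__SSE2__)
# include <emmintrin.h>

/* Opt-in path: only compiled when the (hypothetical) switch is set and the
 * compiler targets SSE2. Uses unaligned loads/stores, so no alignment
 * requirement is placed on the caller. */
static void example_copy(void *dest, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    assert(dest != NULL && src != NULL);

    for (; i + 16 <= size; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), x);
    }
    memcpy(d + i, s + i, size - i);  /* tail shorter than 16 bytes */
}

#else

/* Default path: portable, works on any target. */
static void example_copy(void *dest, const void *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, size);
}

#endif
```

Building with something like -DEXAMPLE_USE_SSE2_COPY would select the
intrinsic path; without it, the generic memcpy() fallback is used, so default
builds stay compatible with any hardware.
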

Regards

Anatol


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
