Thanks for the detailed reply Rasmus. I'll look into these points this week.

Pete

On Tue, May 28, 2019 at 6:40 PM Rasmus Munk Larsen <[email protected]>
wrote:

> Hi Pete,
>
> The way to optimize the tensor library for hardware with limited cache
> sizes would be to
>
> 1. Reduce the size of the buffer used for the ".block()" interface. I
> believe we currently try to fit them in L1, but perhaps the detection
> doesn't work correctly on your hardware.
> 2. Reduce the block sizes used in TensorContraction.
>
> 1. By default the block size is chosen such that the blocks fit in L1:
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-166
>
>  Each evaluator in an expression reports how much scratch memory it needs to
> compute a block's worth of data through the getResourceRequirements() API,
> e.g.:
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorShuffling.h#lines-230
>
>  These values are then merged by the executor in the calls here:
>
> https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-185
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-324
>
> 2. The tensor contraction blocking uses a number of heuristics to choose
> block sizes and the level of parallelism. In particular, it tries to pack
> the lhs into L2 and the rhs into L3.
>
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-127
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-647
>
> https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-239
>
> I hope these pointers help.
>
> Rasmus
>
>
> On Tue, May 28, 2019 at 7:38 AM Pete Blacker <[email protected]>
> wrote:
>
>> Hi there,
>>
>> I'm currently using the Eigen::Tensor module on a relatively small
>> processor which has very limited cache: 16KB of level 1 and no level 2 at
>> all! I've been looking for a way to optimise the blocking of the
>> operations performed by Eigen for a particular block size, but I can't
>> find anything so far.
>>
>> Is there a way to optimise the Tensor operations for this type of small
>> cache?
>>
>> Thanks,
>>
>> Pete
>>
>
