Hi Neil,

Thanks for your answer. I'm afraid I do not fully understand what you mean by a vector of chunks and a plane of chunks. Is a vector of chunks the collection of chunks needed to access an entire vector in some direction? E.g., in a 3-dim case, the chunks needed to get i,*,k (a vector in y)? To process all such vectors (e.g. to average them) you have to iterate over i and k. And is a plane of chunks the set of all chunks needed to get a plane, say *,*,k?
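To be concrete, this is roughly the access I mean by i,*,k. It is only a sketch (dataset axes assumed to be ordered x,y,z, float elements, error checking omitted; the function name read_y_vector is just for illustration):

    #include <hdf5.h>
    #include <vector>

    // Read the vector (i, *, k) from an open 3-D dataset; ny is the length of the y axis.
    std::vector<float> read_y_vector(hid_t dset, hsize_t i, hsize_t k, hsize_t ny)
    {
        hsize_t start[3] = {i, 0, k};   // fix x and z, run over the full y axis
        hsize_t count[3] = {1, ny, 1};

        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        hsize_t mdims[1] = {ny};
        hid_t mspace = H5Screate_simple(1, mdims, NULL);

        std::vector<float> buf(ny);
        H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, &buf[0]);

        H5Sclose(mspace);
        H5Sclose(fspace);
        return buf;
    }

Processing all such vectors then means calling this in a double loop over i and k, which is exactly where the chunk cache behaviour matters.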
I imagine the cache to be a collection of chunks, say a std::vector<chunk> in C++ terms. Ideally the cache should be large enough that all chunks needed to get a hyperslab fit in it. That means that when getting the next hyperslab, the data needed are usually still in the cache, so at the end, after iterating through the entire cube, each chunk has been read only once. A chunk should never have to be read back as you describe in your pictures.

My idea would be that when reading, say, 0,*,0, HDF5 knows it needs, say, chunks 0, 4, 8 and 12. So it can first test whether they are in the cache (using the chunk cache hash) before searching the chunk B-tree. If the cache is large enough, these tests almost always succeed, so a B-tree lookup is needed only seldom (in fact only when a chunk actually has to be read from disk). I got the impression that HDF5 does a B-tree lookup for each hyperslab access instead of using the approach I sketched, and I assume a B-tree lookup is much more expensive than a hash lookup. But maybe my idea of what the chunk cache looks like and how HDF5 uses it is totally wrong; Quincey or you can correct me.

In my test program the cache is sized that way. The program uses 3-dim arrays and can iterate over vectors in any direction (say, get i,*,k for all i and k) or iterate over chunks. It knows the access pattern, so it sizes the cache beforehand. I'll attach the test program so you can see it. Note that in the vector case the naive iteration is over all i and then over all k, which requires a large cache. You can do better by first iterating over the chunks in i and k and then iterating over i and k inside those chunks (sketched below); that requires a much smaller cache.

We also use the casacore package, which contains an RDBMS-like table system that stores arrays in a chunked way and uses a cache when accessing them. It performs far better when iterating over vectors in the way described above, so it should be possible to improve HDF5 considerably. BTW, casacore is moving towards using mmap where possible, so the OS takes care of the caching.
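The blocked iteration I mean looks roughly like this (reusing the read_y_vector sketch above; ci and ck are the chunk lengths along x and z, nx, ny, nz the array lengths -- again purely illustrative):

    #include <hdf5.h>
    #include <algorithm>
    #include <vector>

    // Process every y-vector, but visit all (i,k) positions that fall inside one
    // chunk footprint before moving to the next footprint. The cache then only has
    // to hold the column of chunks covering the full y axis for one (x,z) chunk
    // footprint, and each chunk is read from disk once.
    void process_all_y_vectors(hid_t dset,
                               hsize_t nx, hsize_t ny, hsize_t nz,
                               hsize_t ci, hsize_t ck)
    {
        for (hsize_t i0 = 0; i0 < nx; i0 += ci) {          // chunk tiles along x
            for (hsize_t k0 = 0; k0 < nz; k0 += ck) {      // chunk tiles along z
                for (hsize_t i = i0; i < std::min(i0 + ci, nx); ++i) {
                    for (hsize_t k = k0; k < std::min(k0 + ck, nz); ++k) {
                        std::vector<float> v = read_y_vector(dset, i, k, ny);
                        // ... average or otherwise process v ...
                    }
                }
            }
        }
    }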
Cheers,
Ger

>>> On 1/13/2010 at 5:12 PM, in message <4b4df0d7.9020...@hdfgroup.org>, Neil Fortner <nfort...@hdfgroup.org> wrote:
> Hi all,
>
> I've been following this for a while, and while I still don't have a complete picture of what the benchmark is doing, I have an idea that might shed some light on this.
>
> My understanding is that the chunk cache is set large enough to hold one vector of chunks but not one plane of chunks, correct? If the vectors are iterated over in the obvious way (i.e. one plane at a time) then each vector of chunks in the chunk cache will have to be thrown away once for every plane that passes through the chunks, obviously hurting performance. In this case, it would be better to disable the chunk cache entirely (set nbytes or nslots to 0), hence why Francesc sees better performance with the smaller chunk cache. Ideally, you would either use a chunk cache large enough for a plane of chunks, or iterate over the entire vector of chunks before moving on to the next vector of chunks in the plane.
>
> Does this make sense?
>
> Here are some (2-D) diagrams to help illustrate what I'm talking about:
>
> [2-D diagrams: a 3x2 grid of chunks is read element by element along the top row. The top-left chunk stays in cache while the read position X is inside it, is evicted once X has moved on to the next chunk in the row, and has to be read back into cache from disk for the second time when the next row of elements returns to it.]
>
> -Neil
>
> On 01/12/2010 09:10 AM, Ger van Diepen wrote:
>> Hi Quincey,
>>
>> It uses fixed dimensions.
>> It iterates over vectors in the x, y or z direction. In the test the z axis was quite short, so it results in many small hyperslab accesses. Of course, I could access a plane at a time and get vectors myself, but I would think that is just what the cache should do for me. Besides, I do not always know if a plane fits in memory.
>>
>> I can imagine that doing so many B-tree lookups is quite expensive. Do you think that is the main performance bottleneck? But why so many B-tree lookups? Because the chunk and array size are fixed, it can derive the chunks to use from the hyperslab coordinates. Then it can first look if they are in the cache, which is (I assume) much faster than looking in the chunk B-tree. Does the experimental branch improve in this way?
>>
>> Note that Francesc's original question is still open: why does performance degrade when using a larger chunk cache?
>>
>> Cheers,
>> Ger
>>
>>>>> On 1/12/2010 at 3:27 PM, in message <f2abecc0-20f4-4a1e-b178-bde7baee7...@hdfgroup.org>, Quincey Koziol <koz...@hdfgroup.org> wrote:
>>> Hi Francesc,
>>>
>>> On Jan 8, 2010, at 11:45 AM, Francesc Alted wrote:
>>>> Hi Ger,
>>>>
>>>> On Friday 08 January 2010 08:25:09 you wrote:
>>>>> Hi Francesc,
>>>>>
>>>>> This might be related to a problem I reported last June. I did tests using a 3-dim array with various chunk shapes and access patterns. It got very slow when iterating through the data by vector in the Z-direction. I believe it was filed as a bug by the HDF5 group. I sent a test program to Quincey that shows the behaviour. I'll forward that mail and the test program to you, so you can try it out yourself if you like to.
>>>>>
>>>>> I suspect the cache lookup algorithm to be the culprit. The larger the cache and the more often it has to look up, the slower things get. BTW, did you adapt the cache's hash size to the number of slots in the cache?
>>>>
>>>> Thanks for your suggestion. I've been looking at your problem, and my profiles seem to say that it is not a cache issue.
>>>> Have a look at the attached screenshots showing profiles for your test bed reading in the x axis with a cache size of 4 KB (the HDF5 cache subsystem does not enter into action at all) and 256 KB (your size). I've also added a profile for the tiled case for comparison purposes. For all the profiles (except tiled), the bottleneck is clearly in the `H5FL_reg_free` and `H5FL_reg_malloc` calls, no matter how large the cache size is (even if it does not enter into action).
>>>>
>>>> I think this is expected, because HDF5 has to reserve space for each I/O operation. When you walk the dataset following directions x or y, you have to do (32*2)x more I/O operations than for the tiled case, and HDF5 needs to book (and free again!) (32*2)x more memory areas. Also, when you read through the z axis, the additional number of times to book/release memory is (32*32)x. All of this is consistent with both the profiles and running the benchmark manually:
>>>>
>>>> fal...@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 t
>>>> real 0m0.057s
>>>> user 0m0.048s
>>>> sys 0m0.004s
>>>>
>>>> fal...@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 x
>>>> setting cache to 32 chunks (4096 bytes) with 3203 slots // forcing no cache
>>>> real 0m1.055s
>>>> user 0m0.860s
>>>> sys 0m0.168s
>>>>
>>>> fal...@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 y
>>>> setting cache to 32 chunks (262144 bytes) with 3203 slots
>>>> real 0m1.211s
>>>> user 0m1.176s
>>>> sys 0m0.028s
>>>>
>>>> fal...@antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 z
>>>> setting cache to 5 chunks (40960 bytes) with 503 slots
>>>> real 0m14.813s
>>>> user 0m14.777s
>>>> sys 0m0.024s
>>>>
>>>> So, in my opinion, there is little that HDF5 can do here. You had better adapt the chunk shape to your most used case (if you have just one, but I know that this is not typically the case).
>>>
>>> Interesting results. I'll try to comment on a number of fronts:
>>>
>>> - The H5FL_* routines are an internal "free list" that the HDF5 library uses in order to avoid calling malloc() as frequently. You can disable them for your benchmarking by configuring the library with "--enable-using-memchecker".
>>>
>>> - It looks like you are using the hyperslab routines quite a bit; are you creating complicated hyperslabs?
>>>
>>> - The B-tree comparison routine is being used to locate the correct chunk to perform I/O on, which may or may not be in the cache. Do your datasets have fixed or unlimited dimensions? If they are fixed (or only have 1 unlimited dimension), I can point you at an experimental branch with the new chunk indexing features and we can see if that would help your performance.
>>>
>>> Quincey
>>>
>>>>> In your tests you only mention the chunk size, but not the chunk shape. Isn't that important? It gives me the impression that in your tests the data are stored and accessed fully sequentially, which makes the cache useless.
>>>>
>>>> Yes, chunk shape is important, sorry, I forgot this important detail. As I mentioned in a previous message to Rob Latham, I want to optimize 'semi-random' access mode in a certain row of the dataset, so I normally choose the chunk shape as (1, X), where X is the value needed for obtaining a chunksize between 1 KB and 8 MB -- if X is larger than the maximum number of columns, I expand the number of rows in the chunk shape accordingly.
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Francesc Alted
>
> <tHDF5-tile.png><tHDF5-x-4KB.png><tHDF5-x-256KB.png>
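P.S. For completeness: sizing the chunk cache per dataset, as in the "setting cache to 32 chunks (262144 bytes) with 3203 slots" lines above, can be done along these lines with H5Pset_chunk_cache (available in HDF5 1.8.3 and later). This is just an illustrative sketch with made-up names, not necessarily how tHDF5.cc does it:

    #include <hdf5.h>

    // Open a dataset with a chunk cache sized for a known access pattern:
    // n_chunks chunks of chunk_bytes each, with n_slots hash slots (a prime
    // well above n_chunks keeps hash collisions rare). w0 = 1.0 tells HDF5
    // to evict fully read/written chunks first.
    hid_t open_with_cache(hid_t file, const char* name,
                          size_t n_chunks, size_t chunk_bytes, size_t n_slots)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, n_slots, n_chunks * chunk_bytes, 1.0);
        hid_t dset = H5Dopen2(file, name, dapl);
        H5Pclose(dapl);
        return dset;
    }

    // e.g. the y case above: 32 chunks of 32*32*2 floats (8192 bytes) and 3203 slots
    // hid_t dset = open_with_cache(file, "cube", 32, 8192, 3203);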
Attachment: tHDF5.cc