Thank you for the prompt response, Francesc!

Regarding the Python suggestions: our application is developed in C#, so
I'm using the HDF5DotNet "wrapper" classes.  Yesterday I absorbed the basic
HDF5 tutorial; today I'm trying to stream structs from a C# application to
an HDF5 file.  I'm fighting my way through creating the COMPOUND type now,
jumping between the basic documentation and that for HDF5DotNet.
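
To make this concrete, here is roughly what I am attempting, sketched
against the underlying C API that the wrapper exposes (the struct and its
fields are simplified stand-ins for my real types):

    #include <hdf5.h>

    /* Simplified stand-in for my real struct. */
    typedef struct {
        unsigned long long id;        /* unique sample identifier */
        double             range;
        double             velocity;
    } Detection;

    /* Build an HDF5 compound type matching the in-memory layout. */
    hid_t make_detection_type(void)
    {
        hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(Detection));
        H5Tinsert(t, "id",       HOFFSET(Detection, id),    H5T_NATIVE_ULLONG);
        H5Tinsert(t, "range",    HOFFSET(Detection, range), H5T_NATIVE_DOUBLE);
        H5Tinsert(t, "velocity", HOFFSET(Detection, velocity),
                  H5T_NATIVE_DOUBLE);
        return t;
    }

The HDF5DotNet equivalents (H5T.create, H5T.insert) appear to mirror these
calls; the extra work on my side is marshaling the C# structs to match.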

Based on your suggestions, my first inclination was to use tables and the
TB functions.  At first glance, however, I can't find many of the H5TB
functions implemented in the HDF5DotNet wrapper.  I don't have the time
budget to write my own wrapper classes for functions like H5TBmake_table
and H5TBread_table.  Unless those exist in the HDF5DotNet API in a location
other than the H5TB class (which only appears to contain getTableInfo and
getFieldInfo methods), that approach is out.  Unless I'm missing something?
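
For reference, these are the C entry points I would otherwise have to wrap
myself, which gives a sense of the interop surface involved (prototypes as
given in the H5TB reference):

    /* Create a table from an array of records in one call. */
    herr_t H5TBmake_table(const char *table_title, hid_t loc_id,
                          const char *dset_name, hsize_t nfields,
                          hsize_t nrecords, size_t type_size,
                          const char *field_names[],
                          const size_t *field_offset,
                          const hid_t *field_types, hsize_t chunk_size,
                          void *fill_data, int compress, const void *buf);

    /* Append records to an existing table. */
    herr_t H5TBappend_records(hid_t loc_id, const char *dset_name,
                              hsize_t nrecords, size_t type_size,
                              const size_t *field_offset,
                              const size_t *field_sizes, const void *buf);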

Looking at your recommended approach of using an extensible dataset and
appending my sample "array of structs" to it every 50 ms: every sample has
a unique identifier, and I need to be able to locate that identifier
quickly when reading.  Since each sample consists of between 0 and 500
elements of my struct, I cannot rely on equal spacing unless I do a whole
lot of padding.
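
If I understand the approach correctly, each append would look something
like the following in the C API (a chunked dataset with an unlimited
dimension; file, dtype, buf, nrec, and cur come from context, and the
chunk size is only a guess I would have to tune):

    /* Create an empty, extensible dataset for one device. */
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {512};           /* chunk size: a tuning guess */
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset  = H5Dcreate2(file, "/device0/samples", dtype, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Every ~50 ms: append the nrec structs in buf; cur is the
       running record count. */
    hsize_t newsize[1] = {cur + nrec};
    H5Dset_extent(dset, newsize);               /* grow the dataset */
    hid_t fspace = H5Dget_space(dset);
    hsize_t start[1] = {cur}, count[1] = {nrec};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, dtype, mspace, fspace, H5P_DEFAULT, buf);
    cur += nrec;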

Before finding out about HDF5, my plan was to write the data to a plain
binary file, interleaved by device, and to maintain an "index" of each
device's samples, which would be written at the end of the file and "read
first" upon opening it, to avoid a computationally expensive re-indexing
pass.

Could such an "index" be created as its own dataset, containing the unique
identifier of each "sample" and the offset in the "sample" dataset array
at which it is found?
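
Concretely, I am picturing a second extensible dataset of small records
like this, appended in lockstep with each sample (field names are mine):

    /* One index record per sample, kept in its own extensible dataset
       and appended exactly like the sample dataset above. */
    typedef struct {
        unsigned long long id;      /* unique sample identifier        */
        unsigned long long offset;  /* first record of this sample in  */
                                    /* the device's sample dataset     */
        unsigned int       count;   /* structs in this sample (0-500)  */
    } IndexEntry;

Reading that index dataset first and building an in-memory map from id to
(offset, count) would make seeking cheap, much like my binary-file plan.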

It seems doing it this way would work, but I was hoping HDF5 would solve
my indexing problem for me.  I'm reluctant to keep investing the time
needed to learn the API and code up the interop for my types (there is
more to this than what I show in my example) if in the end I still have
to do my own indexing...

What are your thoughts?


Best regards,

Petr Klapka
System Tools Engineer
*Valeo* Radar Systems
46 River Rd
Hudson, NH 03051
Mobile: (603) 921-4440
Office: (603) 578-8045
*"Festina lente."*

On Thu, Aug 6, 2015 at 12:19 PM, Francesc Alted <[email protected]> wrote:

> Hi Peter,
>
> 2015-08-06 16:46 GMT+02:00 Petr KLAPKA <[email protected]>:
>
>> Good morning!
>>
>> My name is Petr Klapka.  My colleagues and I are in the process of
>> evaluating HDF5 as a potential file format for a data acquisition tool.
>>
>> I have been working through the HDF5 tutorials and overcoming the API
>> learning curve.  I was hoping you could offer some advice on the
>> suitability of HDF5 for our intended purpose and perhaps save me the time
>> of mis-using the format or API.
>>
>> The data being acquired are "samples" from four devices.  Every ~50 ms a
>> device provides a sample.  The sample is an array of structs.  The total
>> size of the array varies but will average around 8 kilobytes (about 160
>> KB per second per device).
>>
>> The data will need to be recorded over a period of about an hour, meaning
>> an uncompressed file size of around 2.3 Gigabytes.
>>
>> I will need to "play back" these samples, as well as jump around in the
>> file, seeking on sample metadata and time.
>>
>> My questions to you are:
>>
>>    - Is HDF5 intended for data sets of this size and throughput given a
>>    high-performance Windows workstation?
>>
>>
> Indeed, HDF5 is a very good option for what you are trying to do.
>
>
>>
>>    - What is the "correct" usage pattern for this scenario?
>>       - Is it to use a "Group" for each device, and create a "Dataset"
>>       for each sample?  This would result in thousands of datasets per
>>       group in the file, but I fully understand how to navigate this
>>       structure.
>>
> No, creating too many datasets will slow down your queries a lot later on.
>
>
>>
>>    - Or should there only be four "Datasets" that are extensible, and
>>       each sensor "sample" be appended into the dataset?
>>
> IMO, this is the way to go.  You can append your array of structs to the
> dataset that is created initially empty.
>
>
>>
>>    -   If this is the case, can the dataset itself be searched for
>>       specific samples by time and metadata?
>>
>>
> In case your time samples are equally binned, you could use dimension
> scales for that.  But in general HDF5 does not allow you to do queries on
> non-uniform time series or other fields, and you should do a full scan for
> that.
>
> If you want to avoid the full scan for table queries, you will need to use
> 3rd party apps on top of HDF5.  For example, the indexing capabilities in
> PyTables can help:
>
> http://www.pytables.org/usersguide/optimization.html#indexed-searches
>
> Also, you may want to use either Pandas or TsTables:
>
> http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#hdf5-pytables
>
> http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/
>
> However, all of the above packages are Python packages, so I am not sure
> whether they would fit your scenario.
>
>
>>
>>    - Or is this use case appropriate for the Table API?
>>
> The Table API is perfectly compatible with the above suggestion of using
> a large dataset for storing the time series (in fact, this is the API that
> PyTables uses behind the scenes).
>
>> I will begin by prototyping the first scenario, since it is the most
>> straightforward to understand and implement.  Please let me know your
>> suggestions.  Many thanks!
>>
>
> Hope this helps,
>
> --
> Francesc Alted
>

