2015-08-06 19:04 GMT+02:00 Petr KLAPKA <[email protected]>:

> Thank you for the prompt response, Francesc!
>
> Regarding the Python suggestions: our application is developed in C#, so
> I'm using the HDF5DotNet "wrapper" classes. Yesterday I absorbed the basic
> HDF5 tutorial; today I'm trying to stream structs from a C# application
> into an HDF5 file. I'm fighting my way through creating the COMPOUND type
> now, jumping between the basic documentation and that for HDF5DotNet.
>
> Based on your suggestions, my first leaning was toward using tables and
> the TB functions. At first glance, however, I can't find many of the H5TB
> functions implemented in the HDF5DotNet wrapper, and I don't have the time
> budget to write my own wrapper classes for functions like H5TBmake_table
> and H5TBread_table. Unless those exist in the HDF5DotNet API somewhere
> other than the H5TB class (which only appears to contain getTableInfo and
> getFieldInfo methods), that approach is out. Unless I am missing something?
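About the COMPOUND type: through the C API that HDF5DotNet wraps, the
construction is mostly mechanical. A minimal sketch follows; the sample_t
struct and its field names are invented for illustration, so substitute
your real layout:

#include "hdf5.h"
#include <stddef.h>   /* HOFFSET is offsetof() underneath */

/* Hypothetical sample element -- your real struct will differ. */
typedef struct {
    unsigned long long id;     /* unique sample identifier  */
    double             time;   /* acquisition time, seconds */
    float              range;  /* example payload field     */
} sample_t;

/* Build an HDF5 compound type that mirrors sample_t in memory. */
hid_t make_sample_type(void)
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(sample_t));
    H5Tinsert(t, "id",    HOFFSET(sample_t, id),    H5T_NATIVE_ULLONG);
    H5Tinsert(t, "time",  HOFFSET(sample_t, time),  H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "range", HOFFSET(sample_t, range), H5T_NATIVE_FLOAT);
    return t;   /* caller releases it with H5Tclose() */
}

The C# version should map closely, since HDF5DotNet is a thin layer over
these calls.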
Yes, probably the Table API is not supported for .NET.

> Looking at your recommended approach of using an extensible dataset and
> appending my sample "array of structs" to it every 50 ms: every sample
> has a unique identifier, and I need to be able to quickly locate that
> identifier during reading. Since each sample consists of between 0 and
> 500 elements of my struct, I cannot rely on equal spacing unless I do a
> whole lot of padding.
>
> Before finding out about HDF5, my plan was to write the data to a plain
> binary file, interleaved by device, and to maintain an "index" of each
> device's samples, which would be written to the end of the file and "read
> first" upon opening the file, to avoid a computationally expensive
> re-indexing process.

Yes, a custom-made index is a perfectly viable solution.

> Could such an "index" be created as its own dataset, containing the
> unique identifier of each "sample" and the offset in the "sample" dataset
> at which it is found?

That would be my recommendation, yes.
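To make that concrete, here is what such an index could look like through
the C API (the "/index" dataset name, the three-field row layout, and the
chunk size are my choices for illustration, not requirements; error
checking omitted). The index is itself a 1-D, chunked, unlimited dataset
that grows by one row per sample:

#include "hdf5.h"
#include <stddef.h>

/* One index row per appended sample (layout is illustrative). */
typedef struct {
    unsigned long long id;     /* unique sample identifier             */
    unsigned long long offset; /* first element in the sample dataset  */
    unsigned long long length; /* number of struct elements (0..500)   */
} index_row_t;

hid_t create_index_dataset(hid_t file, hid_t row_type)
{
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 1024;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);  /* chunking is required for unlimited dims */
    hid_t dset = H5Dcreate2(file, "/index", row_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

void append_index_row(hid_t dset, hid_t row_type, const index_row_t *row)
{
    /* How many rows are there so far? */
    hid_t space = H5Dget_space(dset);
    hsize_t n;
    H5Sget_simple_extent_dims(space, &n, NULL);
    H5Sclose(space);

    /* Grow by one row, then write into the new slot. */
    hsize_t newsize = n + 1;
    H5Dset_extent(dset, &newsize);

    space = H5Dget_space(dset);
    hsize_t start = n, count = 1;
    H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mem = H5Screate_simple(1, &count, NULL);
    H5Dwrite(dset, row_type, mem, space, H5P_DEFAULT, row);
    H5Sclose(mem);
    H5Sclose(space);
}

The sample dataset itself is created and appended to with exactly the same
H5Dset_extent-plus-hyperslab pattern; only the element type and the count
(0 to 500 structs per sample) change. Recording the length along with the
offset is what frees you from equal spacing and padding.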
> It seems doing it this way would work, but I was hoping HDF5 would solve
> my indexing problem for me. I'm reluctant to keep investing the time
> needed to learn the API and to code up the interop for my types (there is
> more to this than what I give in my example) if in the end I still have
> to do my own indexing...
>
> What are your thoughts?

Well, I think that creating your own index would not be that hard.

Luck!

Francesc

> Best regards,
>
> Petr Klapka
> System Tools Engineer
> *Valeo* Radar Systems
> 46 River Rd
> Hudson, NH 03051
> Mobile: (603) 921-4440
> Office: (603) 578-8045
> *"Festina lente."*
>
> On Thu, Aug 6, 2015 at 12:19 PM, Francesc Alted <[email protected]> wrote:
>
>> Hi Petr,
>>
>> 2015-08-06 16:46 GMT+02:00 Petr KLAPKA <[email protected]>:
>>
>>> Good morning!
>>>
>>> My name is Petr Klapka. My colleagues and I are in the process of
>>> evaluating HDF5 as a potential file format for a data acquisition tool.
>>>
>>> I have been working through the HDF5 tutorials and overcoming the API
>>> learning curve. I was hoping you could offer some advice on the
>>> suitability of HDF5 for our intended purpose, and perhaps save me the
>>> time of misusing the format or API.
>>>
>>> The data being acquired are "samples" from four devices. Every ~50 ms
>>> each device provides a sample. The sample is an array of structs. The
>>> total size of the array varies but will average around 8 kilobytes
>>> (160 KB per second per device).
>>>
>>> The data will need to be recorded over a period of about an hour,
>>> meaning an uncompressed file size of around 2.3 gigabytes.
>>>
>>> I will need to "play back" these samples, as well as jump around in
>>> the file, seeking on sample metadata and time.
>>>
>>> My questions to you are:
>>>
>>> - Is HDF5 intended for data sets of this size and throughput, given a
>>>   high-performance Windows workstation?
>>
>> Indeed, HDF5 is a very good option for what you are trying to do.
>>
>>> - What is the "correct" usage pattern for this scenario? Is it to use
>>>   a "Group" for each device and create a "Dataset" for each sample?
>>>   This would result in thousands of datasets per group in the file,
>>>   but I fully understand how to navigate this structure.
>>
>> No, creating too many datasets will slow down your queries a lot later
>> on.
>>
>>> - Or should there be only four extensible "Datasets", with each sensor
>>>   "sample" appended to its device's dataset?
>>
>> IMO, this is the way to go. You can append your array of structs to a
>> dataset that is created initially empty.
>>
>>> - If this is the case, can the dataset itself be searched for specific
>>>   samples by time and metadata?
>>
>> If your time samples are equally binned, you could use dimension scales
>> for that. But in general HDF5 does not let you query non-uniform time
>> series or other fields, and you would have to do a full scan instead.
>>
>> If you want to avoid a full scan for table queries, you will need to use
>> 3rd-party packages on top of HDF5. For example, the indexing
>> capabilities in PyTables can help:
>>
>> http://www.pytables.org/usersguide/optimization.html#indexed-searches
>>
>> Also, you may want to use either Pandas or TsTables:
>>
>> http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#hdf5-pytables
>> http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/
>>
>> However, all of the above are Python packages, so I am not sure they
>> would fit your scenario.
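For your particular access pattern, though, you may not need any of those
packages: with the custom index sketched above, the reader side stays in
plain HDF5 calls. A sketch, again using the invented index_row_t layout
and "/index" name, with error checking omitted:

#include "hdf5.h"
#include <stdlib.h>

/* Same (invented) index row layout as in the writer sketch. */
typedef struct {
    unsigned long long id, offset, length;
} index_row_t;

/* Load the whole index once at open time; it is small
   (one row per 50 ms sample, ~72,000 rows per device per hour). */
index_row_t *load_index(hid_t file, hid_t row_type, hsize_t *nrows)
{
    hid_t dset  = H5Dopen2(file, "/index", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);
    H5Sget_simple_extent_dims(space, nrows, NULL);
    index_row_t *rows = malloc((size_t)*nrows * sizeof *rows);
    H5Dread(dset, row_type, H5S_ALL, H5S_ALL, H5P_DEFAULT, rows);
    H5Sclose(space);
    H5Dclose(dset);
    return rows;
}

/* Random access: read exactly one sample's structs from the big dataset. */
void read_sample(hid_t samples, hid_t elem_type,
                 const index_row_t *row, void *buf)
{
    hid_t   space = H5Dget_space(samples);
    hsize_t start = row->offset, count = row->length;
    H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mem = H5Screate_simple(1, &count, NULL);
    H5Dread(samples, elem_type, mem, space, H5P_DEFAULT, buf);
    H5Sclose(mem);
    H5Sclose(space);
}

A binary search over the in-memory rows (if the identifiers grow
monotonically) or a hash table built once at load time then gives you the
"jump around" behaviour without ever scanning the 2.3 GB of sample data.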
>>> - Or is this use case appropriate for the Table API?
>>
>> The Table API is perfectly compatible with the above suggestion of using
>> one large dataset to store the time series (in fact, it is the API that
>> PyTables uses behind the scenes).
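For completeness, this is roughly what the two H5TB calls you mentioned
boil down to in C: they bundle the compound type, the chunked unlimited
dataset, and the append shown earlier. Same invented sample_t struct and
"dev0" dataset name as before; the chunk size is arbitrary:

#include "hdf5.h"
#include "hdf5_hl.h"   /* high-level APIs, including H5TB */
#include <stddef.h>

typedef struct {        /* same hypothetical struct as before */
    unsigned long long id;
    double             time;
    float              range;
} sample_t;

/* Create an empty, appendable table and push one sample into it. */
void table_demo(hid_t file, const sample_t *sample_buf, hsize_t n)
{
    const char *names[3]   = { "id", "time", "range" };
    size_t      offsets[3] = { HOFFSET(sample_t, id),
                               HOFFSET(sample_t, time),
                               HOFFSET(sample_t, range) };
    size_t      sizes[3]   = { sizeof(unsigned long long),
                               sizeof(double), sizeof(float) };
    hid_t       types[3]   = { H5T_NATIVE_ULLONG, H5T_NATIVE_DOUBLE,
                               H5T_NATIVE_FLOAT };

    /* 0 records and NULL data create the table empty. */
    H5TBmake_table("device 0", file, "dev0", 3 /* fields */, 0,
                   sizeof(sample_t), names, offsets, types,
                   4096 /* chunk */, NULL /* fill */, 0 /* compress */,
                   NULL /* data */);

    /* Append one 50 ms sample: an array of n structs. */
    H5TBappend_records(file, "dev0", n, sizeof(sample_t),
                       offsets, sizes, sample_buf);
}

So you lose nothing by starting with the plain dataset approach: the file
layout a table produces is just a 1-D chunked compound dataset like the
one sketched above, plus a few descriptive attributes.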
>>> I will begin with prototyping the first scenario, since it is the most
>>> straightforward to understand and implement. Please let me know your
>>> suggestions. Many thanks!
>>
>> Hope this helps,
>>
>> --
>> Francesc Alted

--
Francesc Alted

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5