2015-08-06 19:04 GMT+02:00 Petr KLAPKA <[email protected]>:

> Thank you for the prompt response, Francesc!
>
> Regarding the Python suggestions: our application is developed in C#, so
> I'm using the HDF5DotNet "wrapper" classes. Yesterday I absorbed the basic
> HDF5 tutorial; today I'm trying to stream structs from a C# application
> into an HDF5 file. I'm fighting my way through creating the COMPOUND type
> now, jumping between the basic documentation and that for HDF5DotNet.
>
> Based on your suggestions, my first leaning was toward using tables and
> the TB functions. At first glance, however, I can't find many of the H5TB
> functions implemented in the HDF5DotNet wrapper, and I don't have the time
> budget to write my own wrapper classes for functions like H5TBmake_table
> and H5TBread_table. Unless those exist in the HDF5DotNet API somewhere
> other than the H5TB class (which only appears to contain getTableInfo and
> getFieldInfo methods), that approach is out. Unless I am missing something?
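About the COMPOUND type: through the C API that HDF5DotNet wraps, the
construction is mostly mechanical. A minimal sketch follows; the sample_t
struct and its field names are invented for illustration, so substitute
your real layout:

#include "hdf5.h"
#include <stddef.h>   /* HOFFSET is offsetof() underneath */

/* Hypothetical sample element -- your real struct will differ. */
typedef struct {
    unsigned long long id;     /* unique sample identifier  */
    double             time;   /* acquisition time, seconds */
    float              range;  /* example payload field     */
} sample_t;

/* Build an HDF5 compound type that mirrors sample_t in memory. */
hid_t make_sample_type(void)
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(sample_t));
    H5Tinsert(t, "id",    HOFFSET(sample_t, id),    H5T_NATIVE_ULLONG);
    H5Tinsert(t, "time",  HOFFSET(sample_t, time),  H5T_NATIVE_DOUBLE);
    H5Tinsert(t, "range", HOFFSET(sample_t, range), H5T_NATIVE_FLOAT);
    return t;   /* caller releases it with H5Tclose() */
}

The C# version should map closely, since HDF5DotNet is a thin layer over
these calls.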
Yes, probably the Table API is not supported for .NET.

> Looking at your recommended approach of using an extensible dataset and
> appending my sample "array of structs" to it every 50 ms: every sample
> has a unique identifier, and I need to be able to quickly locate that
> identifier during reading. Since each sample consists of between 0 and
> 500 elements of my struct, I cannot rely on equal spacing unless I do a
> whole lot of padding.
>
> Before finding out about HDF5, my plan was to write the data to a plain
> binary file, interleaved by device, and to maintain an "index" of each
> device's samples, which would be written to the end of the file and "read
> first" upon opening the file, to avoid a computationally expensive
> re-indexing process.

Yes, a custom-made index is a perfectly viable solution.

> Could such an "index" be created as its own dataset, containing the
> unique identifier of each "sample" and the offset in the "sample" dataset
> at which it is found?

That would be my recommendation, yes.
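To make that concrete, here is what such an index could look like through
the C API (the "/index" dataset name, the three-field row layout, and the
chunk size are my choices for illustration, not requirements; error
checking omitted). The index is itself a 1-D, chunked, unlimited dataset
that grows by one row per sample:

#include "hdf5.h"
#include <stddef.h>

/* One index row per appended sample (layout is illustrative). */
typedef struct {
    unsigned long long id;     /* unique sample identifier             */
    unsigned long long offset; /* first element in the sample dataset  */
    unsigned long long length; /* number of struct elements (0..500)   */
} index_row_t;

hid_t create_index_dataset(hid_t file, hid_t row_type)
{
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 1024;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);  /* chunking is required for unlimited dims */
    hid_t dset = H5Dcreate2(file, "/index", row_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

void append_index_row(hid_t dset, hid_t row_type, const index_row_t *row)
{
    /* How many rows are there so far? */
    hid_t space = H5Dget_space(dset);
    hsize_t n;
    H5Sget_simple_extent_dims(space, &n, NULL);
    H5Sclose(space);

    /* Grow by one row, then write into the new slot. */
    hsize_t newsize = n + 1;
    H5Dset_extent(dset, &newsize);

    space = H5Dget_space(dset);
    hsize_t start = n, count = 1;
    H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mem = H5Screate_simple(1, &count, NULL);
    H5Dwrite(dset, row_type, mem, space, H5P_DEFAULT, row);
    H5Sclose(mem);
    H5Sclose(space);
}

The sample dataset itself is created and appended to with exactly the same
H5Dset_extent-plus-hyperslab pattern; only the element type and the count
(0 to 500 structs per sample) change. Recording the length along with the
offset is what frees you from equal spacing and padding.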
> It seems doing it this way would work, but I was hoping HDF5 would solve
> my indexing problem for me. I'm reluctant to keep investing the time
> needed to learn the API and to code up the interop for my types (there is
> more to this than what I give in my example) if in the end I still have
> to do my own indexing...
>
> What are your thoughts?

Well, I think that creating your own index would not be that hard.

Luck!

Francesc

> Best regards,
>
> Petr Klapka
> System Tools Engineer
> *Valeo* Radar Systems
> 46 River Rd
> Hudson, NH 03051
> Mobile: (603) 921-4440
> Office: (603) 578-8045
> *"Festina lente."*
>
> On Thu, Aug 6, 2015 at 12:19 PM, Francesc Alted <[email protected]> wrote:
>
>> Hi Petr,
>>
>> 2015-08-06 16:46 GMT+02:00 Petr KLAPKA <[email protected]>:
>>
>>> Good morning!
>>>
>>> My name is Petr Klapka. My colleagues and I are in the process of
>>> evaluating HDF5 as a potential file format for a data acquisition tool.
>>>
>>> I have been working through the HDF5 tutorials and overcoming the API
>>> learning curve. I was hoping you could offer some advice on the
>>> suitability of HDF5 for our intended purpose, and perhaps save me the
>>> time of misusing the format or API.
>>>
>>> The data being acquired are "samples" from four devices. Every ~50 ms
>>> each device provides a sample. The sample is an array of structs. The
>>> total size of the array varies but will average around 8 kilobytes
>>> (160 KB per second per device).
>>>
>>> The data will need to be recorded over a period of about an hour,
>>> meaning an uncompressed file size of around 2.3 gigabytes.
>>>
>>> I will need to "play back" these samples, as well as jump around in
>>> the file, seeking on sample metadata and time.
>>>
>>> My questions to you are:
>>>
>>> - Is HDF5 intended for data sets of this size and throughput, given a
>>>   high-performance Windows workstation?
>>
>> Indeed, HDF5 is a very good option for what you are trying to do.
>>
>>> - What is the "correct" usage pattern for this scenario? Is it to use
>>>   a "Group" for each device and create a "Dataset" for each sample?
>>>   This would result in thousands of datasets per group in the file,
>>>   but I fully understand how to navigate this structure.
>>
>> No, creating too many datasets will slow down your queries a lot later
>> on.
>>
>>> - Or should there be only four extensible "Datasets", with each sensor
>>>   "sample" appended to its device's dataset?
>>
>> IMO, this is the way to go. You can append your array of structs to a
>> dataset that is created initially empty.
>>
>>> - If this is the case, can the dataset itself be searched for specific
>>>   samples by time and metadata?
>>
>> If your time samples are equally binned, you could use dimension scales
>> for that. But in general HDF5 does not let you query non-uniform time
>> series or other fields, and you would have to do a full scan instead.
>>
>> If you want to avoid a full scan for table queries, you will need to use
>> 3rd-party packages on top of HDF5. For example, the indexing
>> capabilities in PyTables can help:
>>
>> http://www.pytables.org/usersguide/optimization.html#indexed-searches
>>
>> Also, you may want to use either Pandas or TsTables:
>>
>> http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#hdf5-pytables
>> http://andyfiedler.com/projects/tstables-store-high-frequency-data-with-pytables/
>>
>> However, all of the above are Python packages, so I am not sure they
>> would fit your scenario.
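For your particular access pattern, though, you may not need any of those
packages: with the custom index sketched above, the reader side stays in
plain HDF5 calls. A sketch, again using the invented index_row_t layout
and "/index" name, with error checking omitted:

#include "hdf5.h"
#include <stdlib.h>

/* Same (invented) index row layout as in the writer sketch. */
typedef struct {
    unsigned long long id, offset, length;
} index_row_t;

/* Load the whole index once at open time; it is small
   (one row per 50 ms sample, ~72,000 rows per device per hour). */
index_row_t *load_index(hid_t file, hid_t row_type, hsize_t *nrows)
{
    hid_t dset  = H5Dopen2(file, "/index", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);
    H5Sget_simple_extent_dims(space, nrows, NULL);
    index_row_t *rows = malloc((size_t)*nrows * sizeof *rows);
    H5Dread(dset, row_type, H5S_ALL, H5S_ALL, H5P_DEFAULT, rows);
    H5Sclose(space);
    H5Dclose(dset);
    return rows;
}

/* Random access: read exactly one sample's structs from the big dataset. */
void read_sample(hid_t samples, hid_t elem_type,
                 const index_row_t *row, void *buf)
{
    hid_t   space = H5Dget_space(samples);
    hsize_t start = row->offset, count = row->length;
    H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mem = H5Screate_simple(1, &count, NULL);
    H5Dread(samples, elem_type, mem, space, H5P_DEFAULT, buf);
    H5Sclose(mem);
    H5Sclose(space);
}

A binary search over the in-memory rows (if the identifiers grow
monotonically) or a hash table built once at load time then gives you the
"jump around" behaviour without ever scanning the 2.3 GB of sample data.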
>>> - Or is this use case appropriate for the Table API?
>>
>> The Table API is perfectly compatible with the above suggestion of using
>> one large dataset to store the time series (in fact, it is the API that
>> PyTables uses behind the scenes).
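For completeness, this is roughly what the two H5TB calls you mentioned
boil down to in C: they bundle the compound type, the chunked unlimited
dataset, and the append shown earlier. Same invented sample_t struct and
"dev0" dataset name as before; the chunk size is arbitrary:

#include "hdf5.h"
#include "hdf5_hl.h"   /* high-level APIs, including H5TB */
#include <stddef.h>

typedef struct {        /* same hypothetical struct as before */
    unsigned long long id;
    double             time;
    float              range;
} sample_t;

/* Create an empty, appendable table and push one sample into it. */
void table_demo(hid_t file, const sample_t *sample_buf, hsize_t n)
{
    const char *names[3]   = { "id", "time", "range" };
    size_t      offsets[3] = { HOFFSET(sample_t, id),
                               HOFFSET(sample_t, time),
                               HOFFSET(sample_t, range) };
    size_t      sizes[3]   = { sizeof(unsigned long long),
                               sizeof(double), sizeof(float) };
    hid_t       types[3]   = { H5T_NATIVE_ULLONG, H5T_NATIVE_DOUBLE,
                               H5T_NATIVE_FLOAT };

    /* 0 records and NULL data create the table empty. */
    H5TBmake_table("device 0", file, "dev0", 3 /* fields */, 0,
                   sizeof(sample_t), names, offsets, types,
                   4096 /* chunk */, NULL /* fill */, 0 /* compress */,
                   NULL /* data */);

    /* Append one 50 ms sample: an array of n structs. */
    H5TBappend_records(file, "dev0", n, sizeof(sample_t),
                       offsets, sizes, sample_buf);
}

So you lose nothing by starting with the plain dataset approach: the file
layout a table produces is just a 1-D chunked compound dataset like the
one sketched above, plus a few descriptive attributes.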
>>> I will begin with prototyping the first scenario, since it is the most
>>> straightforward to understand and implement. Please let me know your
>>> suggestions. Many thanks!
>>
>> Hope this helps,
>>
>> --
>> Francesc Alted

--
Francesc Alted

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5