Hi,

Have you tried mapping all strings (the symbols) to numeric IDs, which you can then use for indexing, and naming the data blocks by date, for example 2014-01-01.mat?
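To make that concrete, here is a minimal Python sketch of the idea; the dict-based interning (rather than a literal hash function) and the helper name symbol_id are illustrative only, not something prescribed in this thread:

symbol_ids = {}  # persisted once, reused across daily files

def symbol_id(symbol):
    # Return the dense integer ID for `symbol`, assigning the next one if unseen.
    # Dense IDs can be used directly as array indices in the datasets.
    if symbol not in symbol_ids:
        symbol_ids[symbol] = len(symbol_ids)
    return symbol_ids[symbol]

aapl = symbol_id("AAPL")          # 0 on first call
msft = symbol_id("MSFT")          # 1
assert symbol_id("AAPL") == aapl  # stable on repeat lookups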
Adding chunking and setting the compression filter high balances performance against space. The data blocks should be laid out carefully: irregular series are treated differently from regular time series, since indexing the latter is a breeze. For irregular time series you can use a custom HDF5 datatype with packet tables, which allows high-speed serial access. For regular time series, n-dimensional slabs of some datatype fit better; as an example, candle data laid out as instrument x time x OHLC can do well with a float type. (Illustrative sketches of both layouts follow the quoted message below.)

My setup is somewhat different from yours, as it involves a parallel filesystem + cluster + MPI + parallel HDF5 via the C API, but the data size is much the same. Throughput and space usage are far superior to a SQL-based approach.

Steve

On Oct 27, 2014 8:46 AM, "Dan E" <[email protected]> wrote:

> I have equity options historical data, currently in CSV files (one file for each day), that I'd like to store in one or multiple h5 files for efficient resource usage and fast access. One row of data is about a dozen columns: date, symbol name, option characteristics (maturity, exercise price, etc.), price (open/high/low/close), volume, plus some other info. One day is about 700,000 rows across 4,000 different symbols.
>
> Now here's the thing: I'd like to be able to quickly retrieve the data in "both directions", meaning:
> - for a specific day, all options on a particular symbol or list of symbols;
> - for a specific option (or option list), the time series of historical prices between two dates.
>
> What would be the ideal structure of the data? One vs. multiple files? Group hierarchy? Python modules you recommend for accessing the data (ideally pandas)?
>
> I tried a first implementation creating for each day a pandas DataFrame indexed by symbol name and then storing it in a separate h5 file under one root group 'OPTIONS'. Then, to access the data on a specific day for a specific symbol list, I would do:
>
> h5 = pd.HDFStore(filename)
> opt_df = h5['OPTIONS'].ix[symbol_list]
>
> I would then slice opt_df using masks to finally get the options I really want.
>
> The problem though is:
> - I usually need to access only a small fraction of the 4,000 symbols (typically a couple at a time), and with this method the whole daily file is loaded into a DataFrame, which causes significant overhead.
> - Although it is reasonably fast if I just need data for a specific day, it becomes extremely painful if I want to access a specific contract over a period of time, since I would then have to load the file for each of the requested dates.
>
> How would you implement this? Also keep in mind that I am constantly adding more data (every day).
>
> Thanks a lot for your help
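To illustrate the regular-series layout, here is a minimal h5py sketch of a chunked, gzip-compressed instrument x time x OHLC float dataset; the file name, dataset name, and chunk shape are illustrative assumptions, not from the post:

import h5py
import numpy as np

n_symbols, n_days = 4000, 252  # roughly one trading year of daily candles

with h5py.File("candles_2014.h5", "w") as f:
    # 3-D slab: instrument x time x OHLC, float type. Chunk shape
    # (1, n_days, 4) keeps one symbol's whole year in a single chunk,
    # so per-symbol reads touch minimal I/O; gzip level 9 is the
    # "compression filter high" part.
    dset = f.create_dataset("candles",
                            shape=(n_symbols, n_days, 4),
                            dtype="f8",
                            chunks=(1, n_days, 4),
                            compression="gzip",
                            compression_opts=9)
    dset[0, 0, :] = np.array([100.0, 101.5, 99.8, 101.0])  # one OHLC candle

with h5py.File("candles_2014.h5", "r") as f:
    series = f["candles"][0, :, :]  # all candles for symbol ID 0, one hyperslab read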
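The packet-table API itself lives in the HDF5 C library (H5PT); from Python, an appendable PyTables Table with a compound row type is a rough stand-in for the irregular-series case. The schema and file name below are illustrative assumptions:

import tables as tb

class Tick(tb.IsDescription):
    # Compound row type for an irregular tick series.
    timestamp = tb.Int64Col()   # e.g. epoch nanoseconds
    price = tb.Float64Col()
    volume = tb.Int64Col()

with tb.open_file("ticks.h5", "w") as f:
    filters = tb.Filters(complevel=9, complib="zlib")  # compression set high
    table = f.create_table("/", "ticks_AAPL", Tick, filters=filters)
    row = table.row
    # Serial appends are the packet-table access pattern.
    for ts, px, vol in [(1414396800000000000, 101.2, 500),
                        (1414396801500000000, 101.3, 200)]:
        row["timestamp"], row["price"], row["volume"] = ts, px, vol
        row.append()
    table.flush()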
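And for the pandas side of the original question, one possible approach (a sketch, not a prescription) is a single HDFStore in table format with data_columns, so both query directions become indexed where-clauses instead of whole-file loads:

import pandas as pd

# Illustrative one-day frame; the real one has ~700,000 rows and a dozen columns.
daily_df = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-27", "2014-10-27"]),
    "symbol": ["AAPL", "MSFT"],
    "strike": [100.0, 45.0],
    "close": [3.15, 1.02],
})

with pd.HDFStore("options.h5") as store:
    # Table format plus data_columns builds an on-disk index, so the
    # queries below read only the matching rows.
    store.append("OPTIONS", daily_df, format="table",
                 data_columns=["date", "symbol"])

    # Direction 1: one day, a short symbol list.
    day_slice = store.select(
        "OPTIONS", where="date == '2014-10-27' & symbol = ['AAPL', 'MSFT']")

    # Direction 2: one symbol's history between two dates.
    history = store.select(
        "OPTIONS",
        where="symbol == 'AAPL' & date >= '2014-01-01' & date <= '2014-12-31'")

Appending each new day to the same table also covers the constant daily additions mentioned in the question.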
