I have equity options historical data, now in csv files - one file for each day, that I'd like to store in one or multiple h5 files for efficient resource usage and fast access. One row of data is about a dozen columns (date, symbol name, options characteristics (maturiy, exercise price etc..), price (open/high/low/close), volume + some other infos). One day is about 700,000 rows accross 4,000 different symbols.
Now here's the thing, I'd like to be able to quickly retrieve the data in "both directions", meaning: - request for a specific day, all options on a particular symbol or list of symbols - request for a specific option (or option list) the time series of historical prices between two dates. What would be the ideal structure of the data? One vs multiple files? Group hierarchy? Python modules you recommend for accessing the data? (ideally pandas) I tried a first implementation creating for each day a Pandas DataFrame indexed by symbol name and then storing it under a separate h5 file under one root group 'OPTIONS' Then to access the data on a specific day for a specific symbol list, I would do : h5 = pd.HDFStore(filename) opt_df = h5['OPTIONS'].ix[symbol_list] I would then slice opt_df using masks to finally get the options I really want. The problem though is: - I usually need only to access a small fraction of the 4,000 symbols (typically a couple at a time) and with this method it loads the whole daily file in a DataFrame which causes significant overhead - Although it is reasonably fast if I just need data for a specific day, it becomes extremely painful if I want to access a specific contract over a period of time as I would then have to load the file for each of the requested dates. How would you implement this? Also keep in mind that I am constantly adding more data (every day). Thanks a lot for your help
_______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org Twitter: https://twitter.com/hdf5
