I have equity options historical data, now in csv files - one file for each
day, that I'd like to store in one or multiple h5 files for efficient
resource usage and fast access. One row of data is about a dozen columns
(date, symbol name, options characteristics (maturiy, exercise price
etc..), price (open/high/low/close), volume + some other infos). One day is
about 700,000 rows accross 4,000 different symbols.

Now here's the thing, I'd like to be able to quickly retrieve the data in
"both directions", meaning: - request for a specific day, all options on a
particular symbol or list of symbols - request for a specific option (or
option list) the time series of historical prices between two dates.

What would be the ideal structure of the data? One vs multiple files? Group
hierarchy? Python modules you recommend for accessing the data? (ideally
pandas)

I tried a first implementation creating for each day a Pandas DataFrame
indexed by symbol name and then storing it under a separate h5 file under
one root group 'OPTIONS' Then to access the data on a specific day for a
specific symbol list, I would do :

h5 = pd.HDFStore(filename)

opt_df = h5['OPTIONS'].ix[symbol_list]

I would then slice opt_df using masks to finally get the options I really
want.

The problem though is: - I usually need only to access a small fraction of
the 4,000 symbols (typically a couple at a time) and with this method it
loads the whole daily file in a DataFrame which causes significant overhead
- Although it is reasonably fast if I just need data for a specific day, it
becomes extremely painful if I want to access a specific contract over a
period of time as I would then have to load the file for each of the
requested dates.

How would you implement this? Also keep in mind that I am constantly adding
more data (every day).

Thanks a lot for your help
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Reply via email to