Paul, thank you very much for this elaborate answer. There are a lot of points which make me rethink what I was planning. Especially I will take a close look on the TSdbi package/source code. Probably this is something I can rely on - that would be very great.
To conclude very shortly what my idea was: Use R to read the data in and give it a kind of standardised structure, which is then written to SQLite. Use SQLite to store the data and to filter/order it. Use R to get parts of the data and to estimate models on it. The models get an S4 class and have everything they need. By huge I think about Tickdata from e.g. Future markets for a year: For DAX Futures this could be around 28.5 Mio. Entries (each entry with timestamp, IDs, price, volume, etc). Another example would be Spot market data for a whole Index (say the Dow30). Usually we get the data from the wrds in a batched form like a csv with all entries for all securities batched. Further order book data with each order send to the exchange for one security for three months - altogether often more than 20 Mio entries (including deleted errors etc.). >From now on I will move this conversation to R-SIG_DB, as suggested in your >last comment. Best Simon On Sep 20, 2013, at 12:27 AM, Paul Gilbert <pgilbert...@gmail.com> wrote: > Simon > > Your idea to use SQLite and the nature of some of the sorting and extracting > you are suggesting makes me wonder why you are thinking of R data structures > as the home for the data storage. I would be inclined to put the data in an > SQL database as the prime repository, then extract parts you want with SQL > queries and bring them into R for analysis and graphics. If the full data set > is large, and the parts you want to analyze in R at any one time are > relatively small, then this will be much faster. After all, SQL is primarily > for databases, whereas R's strength is more in statistics and graphics. > > In the project http://tsdbi.r-forge.r-project.org/ I have code that does some > of the things you probably want. There the focus is on a single identifier > for a series, and various observation frequencies are supported. Tick data is > supported (as time stamped data) but not extensively tested as I do not work > with tick data much. There is a function TSquery, currently in TSdbi on CRAN > but very shortly being split with the SQL specific parts of the interface > into a package TSsql. It is very much like the queries you seem to have in > mind, but I have not used it with tick data. It is used to generate a time > series by formulating a query to a database with several possible sorting > fields, very much like you describe, and then order the data according to the > time index. > > If your data set is large, then you need to think carefully about which > fields you index. You certainly do not want to be building the indexes on the > fly, as you would need to do if you dump all the data out of R into an SQL db > just to do a sort. If the data set is small then indexing does not matter too > much. Also, for a small data set there will be much less advantage of keeping > the data in an SQL db rather than in R. You do need to be a bit more specific > about what "huge" means. (Tick data for 5 days or 20 years? A 100 IDs or 10 > million?) Large for an R structure is not necessarily large for an SQL db. > With more specifics I might be able to give more suggestions. > > (R-SIG-DB may be a better forum for this discussion.) > > HTH, > Paul > > On 13-09-18 01:06 PM, Simon Zehnder wrote: >> Dear R-Devels, >> >> I am designing right now a package intended to simplify the handling >> of market microstructure data (tick data, order data, etc). As these >> data is most times pretty huge and needs to be reordered quite often >> (e.g. if several security data is batched together or if only a >> certain time range should be considered) - the package needs to >> handle this. >> >> Before I start, I would like to mention some facts which made me >> decide to construct an own package instead of using e.g. the packages >> bigmemory, highfrequency, zoo or xts: AFAIK big memory does not >> provide the opportunity to handle data with different types >> (timestamp, string and numerics) and their appropriate sorting, for >> this task databases offer better tools. Package highfrequency is >> designed to work specifically with a certain data structure and the >> data in market microstructure has much greater versatility. Packages >> zoo and xts offer a lot of versatility but do not offer the data >> sorting ability needed for such big data. >> >> I would like to get some feedback in regard to my decision and in >> regard to the short design overview following. >> >> My design idea is now: >> >> 1. Base the package on S4 classes, with one class that handles >> data-reading from external sources, structuring and reordering. >> Structuring is done in regard to specific data variables, i.e. >> security ID, company ID, timestamp, price, volume (not all have to be >> provided, but some surely exist on market microstructure data). The >> less important variables are considered as a slot @other and are only >> ordered in regard to the other variables. Something like this: >> >> .mmstruct <- setClass('mmstruct', representation( name = >> "character", index = "array", N = "integer", K = >> "integer", compiD >> = "array", secID = "array", tradetime = "POSIXlt", flag >> = >> "array", price = "array", vol = "array", other = >> "data.frame")) >> >> 2. To enable a lightweight ordering function, the class should >> basically create an SQLite database on construction and delete it if >> 'rm()' is called. Throughout its life an object holds the database >> path and can execute queries on the database tables. By this, I can >> use the table sorting of SQLite (e.g. by constructing an index for >> each important variable). I assume this is faster and more efficient >> than programming something on my own - why reinventing the wheel? For >> this I would use VIRTUAL classes like: >> >> .mmstructBASE <- setClass('mmstructBASE', representation( dbName >> = >> "character", dbTable = "character")) >> >> .mmstructDB <- setClass('mmstructDB', representation( conn >> = >> "SQLiteConnection"), contains = c("mmstructBASE")) >> >> .mmstruct <- setClass('mmstruct', representation( name = >> "character", index = "array", N = "integer", K = >> "integer", compiD >> = "array", secID = "array", tradetime = "POSIXlt", price = >> "array", vol = "array", other = "data.frame"), contains = >> c("mmstructDB")) >> >> The slots in the mistrust class hold then a view (e.g. only the >> head()) of the data or can be used to hold retrieved data from the >> underlying database. >> >> 3. The workflow would than be something like: a) User reads in the >> data from an external source and gets a data.frame from it. b) This >> data.frame then can be used to construct an mmstruct object from it >> by formatting the variables and read them into the SQLite database >> constructed. c) Given the data structure in the database, the user >> can sort the data by secID, timestamp etc. and can use several >> algorithms for cleaning the data (package-specific not in the >> database) d) Example: The user makes a query to get only price from >> entries compID = "AA" with tradetime < "2012-03-09" or with trade >> time only first trading day in a month. This can then be converted >> e.g. to a 'ts' object in R by coercing e) In addition the user can >> perform several estimations of market microstructure models by >> calling package-specific functions. >> >> >> Is there a big fault in my design, something I haven't considered? I >> am very sure on this list are researchers and developers with much >> more experience. I would like to hear your opinion and ideas. I learn >> from it and can maybe get to a design which I can then implement for >> the research on such data and models. >> >> >> Best >> >> Simon >> >> >> >> >> ______________________________________________ R-devel@r-project.org >> mailing list https://stat.ethz.ch/mailman/listinfo/r-devel >> ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel