Paul,

thank you very much for this elaborate answer. It raises a lot of points that 
make me rethink what I was planning. In particular, I will take a close look 
at the TSdbi package/source code. It is probably something I can rely on, 
which would be great.

To summarise my idea briefly: use R to read the data in and give it a 
standardised structure, which is then written to SQLite. Use SQLite to store 
the data and to filter/order it. Use R to retrieve parts of the data and to 
estimate models on them. The models get an S4 class and carry everything 
they need.
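
For concreteness, here is a minimal sketch of that workflow with DBI/RSQLite 
(the file, table and column names are made up for illustration):

    library(DBI)

    ## 1. read the raw data and give it a standardised structure
    ticks <- read.csv("dax_fut_2012.csv", stringsAsFactors = FALSE)

    ## 2. SQLite stores, filters and orders the data
    con <- dbConnect(RSQLite::SQLite(), "ticks.db")
    dbWriteTable(con, "ticks", ticks)
    slice <- dbGetQuery(con,
        "SELECT tradetime, price, vol FROM ticks
         WHERE secID = 'FDAX' ORDER BY tradetime")
    dbDisconnect(con)

    ## 3. R estimates a model on the retrieved slice; the fitted object
    ##    would then be wrapped in the package's S4 class
    fit <- lm(price ~ vol, data = slice)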

By "huge" I mean, e.g., tick data from futures markets for a year: for DAX 
futures this could be around 28.5 million entries (each entry with timestamp, 
IDs, price, volume, etc.). Another example would be spot market data for a 
whole index (say the Dow 30); we usually get these data from WRDS in batched 
form, e.g. a CSV with the entries for all securities batched together. A 
further example is order book data, with every order sent to the exchange for 
one security over three months, altogether often more than 20 million entries 
(including deleted and erroneous ones).

From now on I will move this conversation to R-SIG-DB, as suggested in your 
last comment.


Best

Simon



On Sep 20, 2013, at 12:27 AM, Paul Gilbert <pgilbert...@gmail.com> wrote:

> Simon
> 
> Your idea to use SQLite, and the nature of some of the sorting and extracting 
> you are suggesting, make me wonder why you are thinking of R data structures 
> as the home for the data storage. I would be inclined to put the data in an 
> SQL database as the prime repository, then extract the parts you want with SQL 
> queries and bring them into R for analysis and graphics. If the full data set 
> is large, and the parts you want to analyze in R at any one time are 
> relatively small, then this will be much faster. After all, SQL is primarily 
> for databases, whereas R's strength is more in statistics and graphics.
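> 
> For example (a sketch; the table and column names are hypothetical):
> 
>     library(DBI)
>     con <- dbConnect(RSQLite::SQLite(), "market.db")
>     ## only the slice you query ever crosses into R
>     aa <- dbGetQuery(con,
>         "SELECT tradetime, price FROM ticks
>          WHERE secID = 'AA' AND tradetime >= '2012-01-01'")
>     dbDisconnect(con)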
> 
> In the project http://tsdbi.r-forge.r-project.org/ I have code that does some 
> of the things you probably want. There the focus is on a single identifier 
> for a series, and various observation frequencies are supported. Tick data is 
> supported (as time-stamped data) but not extensively tested, as I do not work 
> with tick data much. There is a function TSquery, currently in TSdbi on CRAN 
> but very shortly being split, along with the SQL-specific parts of the 
> interface, into a package TSsql. It is very much like the queries you seem to 
> have in mind, but I have not used it with tick data. It is used to generate a 
> time series by formulating a query to a database with several possible 
> sorting fields, very much as you describe, and then ordering the data 
> according to the time index.
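> 
> The idea is roughly the following, written here in plain DBI (this is 
> not TSquery's actual interface, only an illustration, with hypothetical 
> table and column names):
> 
>     x <- dbGetQuery(con,
>         "SELECT tradetime, price FROM ticks
>          WHERE secID = ? ORDER BY tradetime",
>         params = list("FDAX"))
>     ## the ordering is done in SQL; R then builds the time series
>     z <- zoo::zoo(x$price, order.by = as.POSIXct(x$tradetime))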
> 
> If your data set is large, then you need to think carefully about which 
> fields you index. You certainly do not want to be building the indexes on the 
> fly, as you would need to do if you dump all the data out of R into an SQL db 
> just to do a sort. If the data set is small, then indexing does not matter too 
> much. Also, for a small data set there will be much less advantage to keeping 
> the data in an SQL db rather than in R. You do need to be a bit more specific 
> about what "huge" means. (Tick data for 5 days or 20 years? 100 IDs or 10 
> million?) Large for an R structure is not necessarily large for an SQL db. 
> With more specifics I might be able to give more suggestions.
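> 
> With RSQLite, for instance, the indexes would be created once, at load 
> time (illustrative names):
> 
>     dbExecute(con, "CREATE INDEX idx_sec  ON ticks (secID)")
>     dbExecute(con, "CREATE INDEX idx_time ON ticks (tradetime)")
>     ## a compound index serves the common 'one security, time-ordered' query
>     dbExecute(con, "CREATE INDEX idx_sec_time ON ticks (secID, tradetime)")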
> 
> (R-SIG-DB may be a better forum for this discussion.)
> 
> HTH,
> Paul
> 
> On 13-09-18 01:06 PM, Simon Zehnder wrote:
>> Dear R-Devels,
>> 
>> I am currently designing a package intended to simplify the handling
>> of market microstructure data (tick data, order data, etc.). As these
>> data are usually quite large and need to be reordered quite often
>> (e.g. if data for several securities are batched together or if only
>> a certain time range should be considered), the package needs to
>> handle this.
>> 
>> Before I start, I would like to mention some facts which made me
>> decide to construct my own package instead of using e.g. the packages
>> bigmemory, highfrequency, zoo or xts: AFAIK bigmemory does not
>> provide the opportunity to handle data of different types
>> (timestamp, string and numerics) and their appropriate sorting; for
>> this task databases offer better tools. Package highfrequency is
>> designed to work with one specific data structure, whereas market
>> microstructure data is much more varied. Packages zoo and xts offer
>> a lot of versatility but do not offer the data sorting ability needed
>> for such big data.
>> 
>> I would like to get some feedback in regard to my decision and in
>> regard to the short design overview following.
>> 
>> My design idea is now:
>> 
>> 1. Base the package on S4 classes, with one class that handles
>> data reading from external sources, structuring and reordering.
>> Structuring is done with regard to specific data variables, i.e.
>> security ID, company ID, timestamp, price, volume (not all have to be
>> provided, but some surely exist in market microstructure data). The
>> less important variables are collected in a slot @other and are only
>> ordered relative to the main variables. Something like this:
>> 
>> .mmstruct <- setClass('mmstruct',
>>     representation(name      = "character",
>>                    index     = "array",
>>                    N         = "integer",
>>                    K         = "integer",
>>                    compID    = "array",
>>                    secID     = "array",
>>                    tradetime = "POSIXlt",
>>                    flag      = "array",
>>                    price     = "array",
>>                    vol       = "array",
>>                    other     = "data.frame"))
>> 
>> 2. To enable lightweight ordering, the class should create an SQLite
>> database on construction and delete it when the object is removed via
>> 'rm()'. Throughout its life an object holds the database path and can
>> execute queries on the database tables. This way I can use the table
>> sorting of SQLite (e.g. by constructing an index for each important
>> variable). I assume this is faster and more efficient than programming
>> something on my own; why reinvent the wheel? For this I would use
>> VIRTUAL classes like:
>> 
>> .mmstructBASE <- setClass('mmstructBASE',
>>     representation(dbName  = "character",
>>                    dbTable = "character"))
>> 
>> .mmstructDB <- setClass('mmstructDB',
>>     representation(conn = "SQLiteConnection"),
>>     contains = c("mmstructBASE"))
>> 
>> .mmstruct <- setClass('mmstruct',
>>     representation(name      = "character",
>>                    index     = "array",
>>                    N         = "integer",
>>                    K         = "integer",
>>                    compID    = "array",
>>                    secID     = "array",
>>                    tradetime = "POSIXlt",
>>                    price     = "array",
>>                    vol       = "array",
>>                    other     = "data.frame"),
>>     contains = c("mmstructDB"))
>> 
>> The slots in the mmstruct class then hold a view (e.g. only the
>> head()) of the data, or can be used to hold data retrieved from the
>> underlying database.
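>> 
>> As a rough sketch of the construction/cleanup part (the constructor
>> and the extra guard slot are hypothetical, not fixed names): note that
>> rm() alone only removes the binding, so the file deletion is best tied
>> to a finalizer that runs when the object is garbage collected:
>> 
>>     ## the extra 'guard' slot keeps the finalizer environment alive
>>     ## exactly as long as the object itself
>>     .mmstructDB <- setClass('mmstructDB',
>>         representation(conn = "SQLiteConnection", guard = "environment"),
>>         contains = "mmstructBASE")
>> 
>>     newMmstructDB <- function(dbName, dbTable = "ticks") {
>>         conn  <- DBI::dbConnect(RSQLite::SQLite(), dbName)
>>         guard <- new.env()
>>         guard$conn   <- conn
>>         guard$dbName <- dbName
>>         ## fires once the object (and hence the guard) has been
>>         ## garbage collected, i.e. some time after rm(x); gc()
>>         reg.finalizer(guard, function(e) {
>>             DBI::dbDisconnect(e$conn)
>>             unlink(e$dbName)    # remove the database file
>>         }, onexit = TRUE)
>>         .mmstructDB(dbName = dbName, dbTable = dbTable,
>>                     conn = conn, guard = guard)
>>     }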
>> 
>> 3. The workflow would then be something like:
>>    a) The user reads in the data from an external source and gets a
>>       data.frame from it.
>>    b) This data.frame can then be used to construct an mmstruct object,
>>       by formatting the variables and reading them into the SQLite
>>       database constructed alongside it.
>>    c) Given the data structure in the database, the user can sort the
>>       data by secID, timestamp, etc. and can use several algorithms for
>>       cleaning the data (package-specific, not in the database).
>>    d) Example: the user makes a query to get only the price from
>>       entries with compID = "AA" and tradetime < "2012-03-09", or with
>>       tradetime only on the first trading day of each month. The result
>>       can then be coerced to e.g. a 'ts' object in R (a sketch follows
>>       below).
>>    e) In addition, the user can perform several estimations of market
>>       microstructure models by calling package-specific functions.
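>> 
>> To make step d) concrete, a sketch of the query and the coercion
>> (column names as in the class above; the first-trading-day filter is
>> left out for brevity):
>> 
>>     res <- DBI::dbGetQuery(x@conn,
>>         "SELECT tradetime, price FROM ticks
>>          WHERE compID = 'AA' AND tradetime < '2012-03-09'
>>          ORDER BY tradetime")
>>     ## an irregular time index fits zoo/xts more naturally than 'ts'
>>     z <- zoo::zoo(res$price, order.by = as.POSIXct(res$tradetime))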
>> 
>> 
>> Is there a big fault in my design, something I haven't considered? I
>> am sure there are researchers and developers on this list with much
>> more experience. I would like to hear your opinions and ideas. I will
>> learn from them and can hopefully arrive at a design which I can then
>> implement for the research on such data and models.
>> 
>> 
>> Best
>> 
>> Simon
>> 
>> 
>> 
>> 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
