Re: [Rd] Design for classes with database connection

2013-09-20 Thread Simon Zehnder
Paul,

thank you very much for this elaborate answer. There are a lot of points that 
make me rethink what I was planning. In particular, I will take a close look at 
the TSdbi package and its source code. This is probably something I can rely 
on, which would be great.

To summarise my idea briefly: use R to read the data in and give it a kind of 
standardised structure, which is then written to SQLite. Use SQLite to store 
the data and to filter/order it. Use R to get parts of the data and to estimate 
models on them. The models get an S4 class and carry everything they need.
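In code, roughly (a minimal sketch with DBI/RSQLite; the file, table and 
column names are only placeholders, not an actual interface):

library(DBI)

## open (or create) the SQLite file that serves as the prime repository
con <- dbConnect(RSQLite::SQLite(), "ticks.sqlite")

## 'ticks' stands for a data.frame produced by the R import step,
## already brought into the standardised structure
dbWriteTable(con, "ticks", ticks, overwrite = TRUE)

## let SQLite do the filtering/ordering and pull only the needed slice into R
dax <- dbGetQuery(con,
    "SELECT tradetime, price, vol
       FROM ticks
      WHERE secID = 'FDAX'
      ORDER BY tradetime")

## ... estimate models on 'dax' and wrap the results in an S4 class ...
dbDisconnect(con)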

By "huge" I mean, for example, tick data from futures markets for a year: for 
DAX futures this could be around 28.5 million entries (each entry with 
timestamp, IDs, price, volume, etc.). Another example would be spot market data 
for a whole index (say the Dow 30). Usually we get the data from WRDS in 
batched form, e.g. a CSV with the entries for all securities batched together. 
A further example is order book data with every order sent to the exchange for 
one security over three months - altogether often more than 20 million entries 
(including deleted errors, etc.).

From now on I will move this conversation to R-SIG-DB, as suggested in your 
last comment.


Best

Simon



On Sep 20, 2013, at 12:27 AM, Paul Gilbert  wrote:

> Simon
> 
> Your idea to use SQLite, and the nature of some of the sorting and extracting 
> you are suggesting, make me wonder why you are thinking of R data structures 
> as the home for the data. I would be inclined to put the data in an 
> SQL database as the prime repository, then extract the parts you want with SQL 
> queries and bring them into R for analysis and graphics. If the full data set 
> is large, and the parts you want to analyze in R at any one time are 
> relatively small, then this will be much faster. After all, SQL is primarily 
> for databases, whereas R's strength is more in statistics and graphics.
> 
> In the project http://tsdbi.r-forge.r-project.org/ I have code that does some 
> of the things you probably want. There the focus is on a single identifier 
> for a series, and various observation frequencies are supported. Tick data is 
> supported (as time-stamped data) but not extensively tested, as I do not work 
> with tick data much. There is a function TSquery, currently in TSdbi on CRAN, 
> which is very shortly being split out, together with the SQL-specific parts 
> of the interface, into a package TSsql. It is very much like the queries you 
> seem to have in mind, but I have not used it with tick data. It is used to 
> generate a time series by formulating a query to a database with several 
> possible sorting fields, very much like you describe, and then ordering the 
> data according to the time index.
> 
> If your data set is large, then you need to think carefully about which 
> fields you index. You certainly do not want to be building the indexes on the 
> fly, as you would need to do if you dumped all the data out of R into an SQL 
> db just to do a sort. If the data set is small, then indexing does not matter 
> too much. Also, for a small data set there is much less advantage to keeping 
> the data in an SQL db rather than in R. You do need to be a bit more specific 
> about what "huge" means. (Tick data for 5 days or 20 years? 100 IDs or 10 
> million?) Large for an R structure is not necessarily large for an SQL db. 
> With more specifics I might be able to give more suggestions.
> 
> (R-SIG-DB may be a better forum for this discussion.)
> 
> HTH,
> Paul
> 

Re: [Rd] Design for classes with database connection

2013-09-19 Thread Paul Gilbert

Simon

Your idea to use SQLite, and the nature of some of the sorting and 
extracting you are suggesting, make me wonder why you are thinking of R 
data structures as the home for the data. I would be inclined to 
put the data in an SQL database as the prime repository, then extract 
the parts you want with SQL queries and bring them into R for analysis and 
graphics. If the full data set is large, and the parts you want to 
analyze in R at any one time are relatively small, then this will be 
much faster. After all, SQL is primarily for databases, whereas R's 
strength is more in statistics and graphics.


In the project http://tsdbi.r-forge.r-project.org/ I have code that does 
some of the things you probably want. There the focus is on a single 
identifier for a series, and various observation frequencies are 
supported. Tick data is supported (as time-stamped data) but not 
extensively tested, as I do not work with tick data much. There is a 
function TSquery, currently in TSdbi on CRAN, which is very shortly being 
split out, together with the SQL-specific parts of the interface, into a 
package TSsql. It is very much like the queries you seem to have in mind, 
but I have not used it with tick data. It is used to generate a time 
series by formulating a query to a database with several possible sorting 
fields, very much like you describe, and then ordering the data according 
to the time index.
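Not TSquery itself, but a rough sketch of the kind of query meant here, 
with made-up table and column names: select one series by its identifier 
and hand it back to R already ordered by the time index.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "ticks.sqlite")

## the database does the sorting; the sorting fields are whatever
## identifier and time columns the table provides
x <- dbGetQuery(con,
    "SELECT tradetime, price, vol
       FROM ticks
      WHERE secID = 'FDAX'
        AND tradetime >= '2013-01-01'
      ORDER BY secID, tradetime")
dbDisconnect(con)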


If your data set is large, then you need to think carefully about which 
fields you index. You certainly do not want to be building the indexes 
on the fly, as you would need to do if you dumped all the data out of R 
into an SQL db just to do a sort. If the data set is small, then indexing 
does not matter too much. Also, for a small data set there is much 
less advantage to keeping the data in an SQL db rather than in R. You do 
need to be a bit more specific about what "huge" means. (Tick data for 5 
days or 20 years? 100 IDs or 10 million?) Large for an R structure is 
not necessarily large for an SQL db. With more specifics I might be able 
to give more suggestions.
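With SQLite, for example, the indexes for the fields you filter and sort on 
would be declared once, when the table is loaded (again with made-up table 
and column names):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "ticks.sqlite")

## build the indexes up front, not on the fly for every sort
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_ticks_sec  ON ticks (secID)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_ticks_time ON ticks (tradetime)")
## a composite index for the typical 'one security, ordered by time' query
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_ticks_sec_time ON ticks (secID, tradetime)")
dbDisconnect(con)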


(R-SIG-DB may be a better forum for this discussion.)

HTH,
Paul


[Rd] Design for classes with database connection

2013-09-18 Thread Simon Zehnder
Dear R-Devels,

I am currently designing a package intended to simplify the handling of market 
microstructure data (tick data, order data, etc.). These data are usually 
pretty huge and need to be reordered quite often (e.g. if data for several 
securities are batched together or if only a certain time range should be 
considered), and the package needs to handle this.

Before I start, I would like to mention some facts that made me decide to 
write my own package instead of using e.g. the packages bigmemory, 
highfrequency, zoo or xts: AFAIK bigmemory does not provide a way to handle 
data of different types (timestamp, string and numerics) and their appropriate 
sorting; databases offer better tools for this task. Package highfrequency is 
designed to work with one specific data structure, whereas market 
microstructure data are far more varied. Packages zoo and xts offer a lot of 
versatility, but not the sorting ability needed for data this big.

I would like to get some feedback on my decision and on the short design 
overview that follows.
  
My design idea is now:

1. Base the package on S4 classes, with one class that handles data reading 
from external sources, structuring and reordering. Structuring is done with 
respect to specific data variables, i.e. security ID, company ID, timestamp, 
price, volume (not all have to be provided, but some surely exist in market 
microstructure data). The less important variables go into a slot @other and 
are only ordered according to the main variables. Something like this:

.mmstruct <- setClass('mmstruct', representation(
    name      = "character",
    index     = "array",
    N         = "integer",
    K         = "integer",
    compiD    = "array",
    secID     = "array",
    tradetime = "POSIXlt",
    flag      = "array",
    price     = "array",
    vol       = "array",
    other     = "data.frame"))

2. To enable a lightweight ordering function, the class should basically create 
an SQLite database on construction and delete it when the object is removed 
with 'rm()'. Throughout its life an object holds the database path and can 
execute queries on the database tables. This way I can use the table sorting of 
SQLite (e.g. by constructing an index for each important variable). I assume 
this is faster and more efficient than programming something on my own - why 
reinvent the wheel? For this I would use VIRTUAL classes like:

.mmstructBASE <- setClass('mmstructBASE', representation(
    dbName  = "character",
    dbTable = "character"))

.mmstructDB <- setClass('mmstructDB', representation(
    conn = "SQLiteConnection"),
    contains = c("mmstructBASE"))

.mmstruct <- setClass('mmstruct', representation(
    name      = "character",
    index     = "array",
    N         = "integer",
    K         = "integer",
    compiD    = "array",
    secID     = "array",
    tradetime = "POSIXlt",
    price     = "array",
    vol       = "array",
    other     = "data.frame"),
    contains = c("mmstructDB"))

The slots in the mmstruct class then hold a view (e.g. only the head()) of the 
data, or can be used to hold data retrieved from the underlying database.
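One possible way to wire up the create-on-construction and clean-up 
behaviour (a hedged sketch with a simplified stand-in class, not the actual 
package code): create the SQLite file in the initialize method and register 
a finalizer on a small environment carried by the object, so that the file 
is removed once the object has been rm()'d and garbage collected.

library(methods)
library(DBI)
library(RSQLite)   # provides the "SQLiteConnection" class

.mmstructDemo <- setClass('mmstructDemo', representation(
    dbName = "character",
    conn   = "SQLiteConnection",
    guard  = "environment"))   # its only purpose is to carry a finalizer

setMethod("initialize", "mmstructDemo", function(.Object, dbName, ...) {
    .Object@dbName <- dbName
    .Object@conn   <- dbConnect(SQLite(), dbName)
    ## finalizers can only be registered on environments (or external
    ## pointers); the guard is collected together with the object, i.e.
    ## some time after rm() and gc(), and then removes the database file
    guard <- new.env(parent = emptyenv())
    guard$conn   <- .Object@conn
    guard$dbName <- dbName
    reg.finalizer(guard, function(e) {
        try(dbDisconnect(e$conn), silent = TRUE)
        unlink(e$dbName)
    }, onexit = TRUE)
    .Object@guard <- guard
    callNextMethod(.Object, ...)
})

## x <- .mmstructDemo(dbName = "mm_tmp.sqlite")
## rm(x); gc()   # the file is unlinked once the object is collected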

3. The workflow would then be something like:

a) The user reads in the data from an external source and gets a 
data.frame from it.

b) This data.frame can then be used to construct an mmstruct object by 
formatting the variables and reading them into the SQLite database that was 
created.

c) Given the data structure in the database, the user can sort the data by 
secID, timestamp, etc. and can use several algorithms for cleaning the data 
(package-specific, not in the database).

d) Example: the user makes a query to get only prices from entries with 
compID = "AA" and tradetime < "2012-03-09", or with trade times only on the 
first trading day of each month. This can then be converted, e.g. by 
coercion, to a 'ts' object in R (a sketch of such a query follows at the end 
of this message).

e) In addition the user can p
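A hedged sketch of the query in d), again with made-up table and column 
names, and with zoo as the target class (tick data are irregularly spaced, 
so a regular 'ts' would only result after aggregation):

library(DBI)
library(zoo)

con <- dbConnect(RSQLite::SQLite(), "mmstruct.sqlite")
aa  <- dbGetQuery(con,
    "SELECT tradetime, price
       FROM ticks
      WHERE compID = 'AA'
        AND tradetime < '2012-03-09'
      ORDER BY tradetime")
dbDisconnect(con)

## coerce the query result into a time series object
aa.z <- zoo(aa$price, order.by = as.POSIXct(aa$tradetime))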