[R] I need guidance on better data management in preparation for time series analysis

Ted Byers Wed, 30 Jun 2010 14:36:06 -0700

OK, I have managed to use some of the basic processes of getting data from
my DB, passing it as a whole to something like fitdistr, &c.  I know I can
implement most of what I need using a brute force algorithm based on a
series of nested loops.  I also know I can handle some of this logic in a
brute force method using a blend of perl and R, with considerable file IO.
But some of what I need needs a smarter/faster way.


To understand what I am after, consider the following.  I have transaction
data comprised of sales and refunds, each of which has a timestamp.  The
refund data has a timestamp representing when the refund was issued and an
"original transaction ID" representing the sale it refunds.  I have massaged
this data in my schema so that there is a table that has a record for each
refund, and this record includes, among other things, the timestamps for
both the original sale and the refund.  I can construct a SQL query to get
these along with the elapsed time (in days, as a real number) between the
sale and refund.  For some merchants, I have such data going back years.  I
know, from the amount of data I have examined, that the rate at which sales
result in refunds changes through time, though I have not run tests to
determine whether or not the changes I see are significant.  In most cases,
I can break the data for a merchant into weekly subsamples.

Obviously, I can construct loops that iterate over merchant ID, and
year/week (or day) covering the entire period for which I have data for a
given merchant.  What I am asking is, "Is there a smarter way?"

I can't load all the data as there are many GB of data, but the data for
individual merchants varies from a few hundred kB to a few dozen MB.  Thus,
I expect an outer loop iterating over merchant ID will be inevitable.

But is there a smarter way to apply fitdistr (or a similar function) to
samples representing sales in each week of each year (or each day of the year
when there is sufficient data), and then test whether the parameter of the
exponential distribution that best fits the data varies significantly
through time?  (There are both theoretical and empirical reasons to expect an
exponential distribution, but the specific distribution doesn't really
matter for the purpose of this question.)  That is one question I need to
deal with.  Is there a simple way to specify a function, a dataset and a
rule for determining all the subsamples, and then tell R to apply the
function to each subsample and then say whether or not the estimated
parameters for the subsamples are significantly different?  Or do I have to
resort to the simple brute force approach of using a set of nested loops to
get what I need?
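
One possible shape for the "function, dataset and subsampling rule" idea is
the split/lapply idiom.  The following is only a minimal sketch: the data
frame, its column names (sale_time, elapsed_days) and the simulated values
are hypothetical stand-ins for one merchant's query result, and the
likelihood-ratio test at the end is one assumed way of asking whether the
weekly rates differ (it relies on a large-sample chi-squared approximation).

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Hypothetical stand-in for one merchant's refund records:
# sale timestamp plus elapsed days between sale and refund.
refunds <- data.frame(
  sale_time = as.POSIXct("2010-01-03", tz = "UTC") +
    sort(runif(400, 0, 8 * 7 * 86400)),
  elapsed_days = rexp(400, rate = 0.1)
)

# Subsampling rule: year and week of the original sale.
refunds$yearweek <- format(refunds$sale_time, "%Y-%U")

# Split-apply: one fitdistr() call per weekly subsample, no explicit loops.
by_week <- split(refunds$elapsed_days, refunds$yearweek)
fits <- lapply(by_week, fitdistr, densfun = "exponential")

# Collect the per-week estimates and their standard errors.
est <- data.frame(
  yearweek = names(fits),
  n = sapply(fits, function(f) f$n),
  rate = sapply(fits, function(f) unname(f$estimate["rate"])),
  se = sapply(fits, function(f) unname(f$sd["rate"]))
)

# Likelihood-ratio test: one common rate versus a separate rate per week.
pooled <- fitdistr(refunds$elapsed_days, densfun = "exponential")
ll_weekly <- sum(sapply(fits, function(f) as.numeric(f$loglik)))
lr_stat <- 2 * (ll_weekly - as.numeric(pooled$loglik))
p_value <- pchisq(lr_stat, df = length(fits) - 1, lower.tail = FALSE)
```

With this layout the outer loop over merchant ID stays, but everything
inside it is a single split plus lapply, and est holds one row per week.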

The other question I have at present is more a statistical question:
Integrating an exponential pdf over a given time period is simple enough,
but I need to learn how confidence intervals for that integral are to be
computed when you have the estimate and standard error of the parameter of
the exponential distribution from something like fitdistr.  This gets to how
to get confidence intervals when dealing with integrals of functions of
uncertain numbers.  Not only is there a confidence interval for the
parameter of the exponential distribution, but to estimate how many refunds
to expect for the next week, one not only needs the confidence intervals of
the integral of the pdf over the next week for a given sample, but one needs
to integrate this over all the samples that could produce a refund in the
coming week.
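
For a single sample, one common approach to this kind of interval is the
delta method, with a simulation as a cross-check.  The sketch below assumes
an estimated rate and standard error of the kind fitdistr reports; the
numbers, the window (14 to 21 days after the sale) and the helper name
window_prob are hypothetical, and the normal approximation for the rate is
itself an assumption.

```r
# Hypothetical values standing in for fit$estimate and fit$sd from fitdistr().
rate <- 0.10   # estimated exponential rate, per day
se   <- 0.005  # standard error of that estimate

# Integral of the exponential pdf from a to b days after the sale:
# exp(-rate * a) - exp(-rate * b).
window_prob <- function(rate, a, b) exp(-rate * a) - exp(-rate * b)

a <- 14
b <- 21
p_hat <- window_prob(rate, a, b)

# Delta method: se of g(rate) is approximately |g'(rate)| * se(rate),
# where g'(rate) = -a * exp(-rate * a) + b * exp(-rate * b).
grad <- -a * exp(-rate * a) + b * exp(-rate * b)
se_p <- abs(grad) * se
ci_delta <- p_hat + c(-1, 1) * qnorm(0.975) * se_p

# Simulation cross-check: draw plausible rates, push each through the
# integral, and take empirical quantiles of the results.
set.seed(1)
draws <- rnorm(10000, mean = rate, sd = se)
ci_sim <- quantile(window_prob(draws, a, b), c(0.025, 0.975))
```

The simulation route may extend more naturally to the sum over all samples
that could produce a refund in the coming week, since each draw can be
pushed through every sample's window before taking quantiles of the total.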

I'd appreciate any information anyone can provide, even if that consists of
a URL that points to a resource that deals with the specific questions I
have.  I am afraid all the resources I have found searching so far have been
at a more introductory level of simply making a connection to a DB and then
submitting a SQL statement to it.  Something in between that level and the
level comprised of the maze of documentation for the plethora of relevant
packages is needed here (there is such an embarrassment of riches, I find
myself getting confused as to how to proceed).

Thanks

Ted


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
