Thanks, Marc

Part of the challenge here is that EVERYTHING is dynamic.  New data is being
added to the DB all the time.  Each active ID produces a new sample every
day, or at a minimum every week, and new IDs are added every week.  So I
can't hard-code anything.  If, for a given ID, I had 50 weekly samples last
week, I'll have 51 samples this week.

But some of the IDs have sample sizes that are so small it would be pure
BS to try to use fitdist on their data.

I have figured out a way to handle this for a given ID, so I have a loop
that iterates over the IDs and processes the data for each ID only IF there
is sufficient data.  And to make things interesting, the number of IDs I
need to process this week is greater than the number I had to process last
week.

So, I iterate over IDs, from 1 up through perhaps 500.  If a given ID has
sufficient data, I get the z lists.  And I have checked, applying rbind to
these works great!  Of all the IDs' datasets I have examined, perhaps 10% do
not yet have enough data to work with (but that, too, changes over time).

From what you have said, it would seem that I ought to make a master list.
So, I need to learn how to make a master list grow from nothing to include
all these z lists.  That reduces to a question of how can one append
dynamically created lists of varying size (from just a few list elements to
a few hundred list elements) to such a master list.

Actually, when it gets right down to it, I think I am ignorant of a key
piece of the puzzle (I have probably missed the key part of the
documentation dealing with this).  I do not yet know how to add even one
element to a list, within a loop, when the list does not exist (or at least
is empty) at the beginning of the loop.

I get your example "do.call(rbind, c(z1, z2, z3, z4))", but what do you do
if there is no list at the beginning of a loop and you need to handle
something like:

# n is some large number; for about 10% of the values of 'i' (not known
# a priori), the creation of x and y is skipped
for (i in 1:n) {
  if (test_that_returns_true_only_90_percent_of_the_time()) {
    x <- function_that_makes_a_data_frame()
    y <- function_that_makes_a_list_of_data_frames()
  }
}

We have not created any lists on entry into the loop.  How do we create a
list containing all instances of x, and another that contains all the
elements that had been in each instance of y?  If I can learn how to do
that, then I can call do.call(rbind, x_list) and do.call(rbind, y_element_list).

If you know C++, and specifically the STL containers and algorithms: one
can grow a vector or list using a member function called 'push_back', which
is defined on most STL containers.  I am looking for the R equivalent of
that for single objects, and for the R equivalent of the C++ STL algorithm
std::copy (passed the begin and end iterators of the source list and a back
inserter for the recipient container), for appending a source list to a
master list.
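In R terms, I imagine the two analogues would look something like this (a
sketch with placeholder data frames):

```r
master <- list()

# 'push_back' analogue: append a single element at the end of a list
master[[length(master) + 1]] <- data.frame(a = 1)

# 'std::copy' with a back_inserter analogue: append every element of a
# source list onto the master list
src <- list(data.frame(a = 2), data.frame(a = 3))
master <- c(master, src)

length(master)           # 3 elements
do.call(rbind, master)   # a single data frame with 3 rows
```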

Thanks

Ted

On Thu, Jul 15, 2010 at 4:52 PM, Marc Schwartz <marc_schwa...@me.com> wrote:

> Ted,
>
> I may not be completely clear on how you have your processes implemented,
> but some thoughts:
>
> If you will be creating multiple lists initially, where each list (say
> z1...z4) contains 1 or more data frames and all of the data frames have the
> same column structure, you can use:
>
>  do.call(rbind, c(z1, z2, z3, z4))
>
> For example, using the iris data set:
>
>  list1 <- list(head(iris), head(iris), head(iris))
>
>  list2 <- list(head(iris), head(iris))
>
> So these now have 3 and 2 copies, respectively, of 6 rows from the iris
> data set. You can then do:
>
> DF <- do.call(rbind, c(list1, list2))
>
> > str(DF)
> 'data.frame':   30 obs. of  5 variables:
>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 5.1 4.9 4.7 4.6 ...
>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.5 3 3.2 3.1 ...
>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.4 1.3 1.5 ...
>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1
> 1 1 1 ...
>
>
> So DF now contains 30 rows (6 rows * 5 data frames).
>
> I am not sure if that will spark some thoughts, but ideally, if you can
> figure out a way such that the result of all of your operations will be a
> single list (eg. within a loop construct), you can avoid the copying of
> objects, which both adds time and RAM overhead. Then you can just use the
> do.call(rbind, YourList) construct on the single 'all inclusive' list.  If
> you need to preallocate a 'master' list object, which you can then index in
> a loop, presuming that you know ahead of time how many total data frames
> will be created, you can use vector("list", N), where N is the number of
> total list elements that you will require. For example:
>
> > vector("list", 5)
> [[1]]
> NULL
>
> [[2]]
> NULL
>
> [[3]]
> NULL
>
> [[4]]
> NULL
>
> [[5]]
> NULL
>
> will preallocate a list of 5 elements, each of which can then be indexed to
> contain a data frame that is a result of your looping operation.
>
>
> HTH,
>
> Marc
>
>
> On Jul 15, 2010, at 2:58 PM, Ted Byers wrote:
>
> > Thanks Marc
> >
> > The next part of the question, though, involves the fact that there is a
> new
> > 'z' list made in almost every iteration through the ID loop.
> >
> > I guess there are two parts to the question.  First, how would I make a
> list
> > containing all the data frames created by a call to rbind?  I assume,
> then,
> > that I could call rbind again to make that new list into a single
> > data.frame.  Second, is it possible to just append one list of objects to
> > another list of objects, and would doing that and calling rbind on that
> > master list be more efficient than calling rbind on each z list and then
> > calling rbind after the loop on the list of such data.frames?
> >
> > Thanks again,
> >
> > Ted
> >
> > On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz <marc_schwa...@me.com>
> wrote:
> >
> >> On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:
> >>
> >>> The data.frame is constructed by one of the following functions:
> >>>
> >>> funweek <- function(df)
> >>> if (length(df$elapsed_time) > 5) {
> >>>   rv = fitdist(df$elapsed_time,"exp")
> >>>   rv$year = df$sale_year[1]
> >>>   rv$sample = df$sale_week[1]
> >>>   rv$granularity = "week"
> >>>   rv
> >>> }
> >>> funmonth <- function(df)
> >>> if (length(df$elapsed_time) > 5) {
> >>>   rv = fitdist(df$elapsed_time,"exp")
> >>>   rv$year = df$sale_year[1]
> >>>   rv$sample = df$sale_month[1]
> >>>   rv$granularity = "month"
> >>>   rv
> >>> }
> >>>
> >>> It is basically the data.frame created by fitdist extended to include
> the
> >>> variables used to distinguish one sample from another.
> >>>
> >>> I have the following statement that gets me a set of IDs from my db:
> >>>
> >>> ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")
> >>>
> >>> And then I have a loop that allows me to analyze one dataset after
> >> another:
> >>>
> >>> for (i in 1:length(ids[,1])) {
> >>> print(i)
> >>> print(ids[i,1])
> >>>
> >>> Then, after a set of statements that give me information about the
> >> dataset
> >>> (such as its size), within a conditional block that ensures I apply the
> >>> analysis only on sufficiently large samples, I have the following:
> >>>
> >>> z <-
> >> lapply(split(moreinfo,list(moreinfo$sale_year,moreinfo$sale_week),drop
> >>> = TRUE), funweek)
> >>>
> >>> or z <-
> >>> lapply(split(moreinfo,list(moreinfo$sale_year,moreinfo$sale_month),drop
> =
> >>> TRUE), funmonth)
> >>>
> >>> followed by:
> >>>
> >>> str(z)
> >>>
> >>> Of course, I close the loop and disconnect from my db.
> >>>
> >>> NB: I don't see any way to get rid of the loop by adding ID as a factor
> >> to
> >>> split because I have to query the DB for several key bits of data in
> >> order
> >>> to determine whether or not there is sufficient data to work on.
> >>>
> >>> I have everything working, except the final step of storing the results
> >> back
> >>> into the db.  Storing data in the Db is easy enough.  But I am at a
> loss
> >> as
> >>> to how to combine the lists placed in z in most of the iterations
> through
> >>> the ID loop into a single data.frame.
> >>>
> >>> Now, I did take a look at rbind and cbind, but it isn't clear to me if
> >>> either is appropriate.  All the data frames have the same structure,
> but
> >> the
> >>> lists are of variable length, and I am not certain how either might be
> >> used
> >>> inside the IDs loop.
> >>>
> >>> So, what is the best way to combine all lists assigned to z into a
> single
> >>> data.frame?
> >>>
> >>> Thanks
> >>>
> >>> Ted
> >>
> >>
> >> Ted,
> >>
> >> If each of the data frames in the list 'z' have the same column
> structure,
> >> you can use:
> >>
> >> do.call(rbind, z)
> >>
> >> The result of which will be a single data frame containing all of the
> rows
> >> from each of the data frames in the list.
> >>
> >> HTH,
> >>
> >> Marc Schwartz
> >>
> >>
> >
> >
> > --
> > R.E.(Ted) Byers, Ph.D.,Ed.D.
> > t...@merchantservicecorp.com
> > CTO
> > Merchant Services Corp.
> > 350 Harry Walker Parkway North, Suite 8
> > Newmarket, Ontario
> > L3Y 8L3
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>




