Jim and others, As far as I can see the computation of certain conditional probabilities (required for computing my likelihood function) is what is slowing down the evaluation of the likelihood function. Let me explain what these conditional probabilities are and how I am (no doubt inefficiently) calculating them. I apologize for the long post but I could not explain the whole thing without some detailed examples, etc.
For every family, we are given a completed birth sequence (call it S_i) and the desired maximum number of children (call it N_i); for instance S_i might be GBB (where G stands for a girl and B stands for a boy) and N_i might be 4. For each family, we want to compute the the probability of observing the birth sequence, S_i, given that the family is "targeting" sons. Since, a priori, we do not know the desired target (for sons) for family i, we need to allow for all the feasible possibilities. So, when a family states that the maximum number of children that it desires is N_i, we need to allow for the possibilities that the family targets 1 son, 2 sons, ... , (N_i-1) sons. Of course the actual birth sequence might assign zero probability to some of these possibilities; but we cannot rule out any of these a priori. Let us denote a target for sons as k_i. So, for family i with birth sequence S_i and desired maximum number of children N_i, we need to compute the following (N_i-1) conditional probabilities: P(S_i|N_i, k_i=1, T_i=1), P(S_i|N_i, k_i=2, T_i=1), ... , P(S_i|N_i, k_i=N_i-1, T_i=1). Let q be the probability of male birth; it is a parameter in my model and will be estimated. Now, to compute something like P(S_i|N_i, k_i, T_i=1), we merely need to observe whether the family has any child after the k_i^{th} son. If there is a child after the k_i^{th} son, then we assign zero probability to P(S_i|N_i, k_i, T_i=1); else we assign it a probability of (q^(nb[j]))*((1-q)^(ng[j])), where q is the probability of a male birth, nb is the number of boys in the sequence S_i and ng is the number of girls in the sequence. An example might clarify matters. Suppose a family reports that the maximum number of children it desires to have is 4 and the birth sequence (starting with the first born child) for the family is observed to be GGBG (where G stands for a girl and B stands for a boy). For such a family, we need to compute the following probabilities: $(GGBG|N_i=4, k_i=1, T_i=1), P(GGBG|N_i=4, k_i=2, T_i=1), and P(GGBG|N_i=4, k_i=3, T_i=1). Since there is a child after the first boy, this family could not possibly be targeting one son; hence P(GGBG|N_i=4, k_i=1, T_i=1)=0. But the family could conceivably be targeting two or even three sons; these possibilities are not ruled out by the observed birth sequence. Hence P(GGBG|N_i=4, k_i=2, T_i=1)=(q)*((1-q)^3), and similarly P(GGBG| N_i=4, k_i=3, T_i=1)=(q)*((1-q)^3). To clarify matters further, take another example. Suppose the family in question reports a maximum desired family size (number of children) of 4 and we observe the following completed birth sequence for the same family: BGB. Since there is a child after the first boy, this family could not possibly be targeting one son; hence P(BGB|N_i=4, k_i=1, T_i=1)=0. Since there is no child after the second boy, the family could possibly be targeting 2 sons; hence P(BGB|N_i=4, k_i=2, T_i=1)=(q^2)*((1-q)). And since the family stops at three children (with two sons), it cannot be targeting three sons; to target three sons, the family should have gone for another child and not stopped at the third child (and second son). Hence, P(BGB|N_i=4, k_i=3, T_i=1)=0. In my sample I let N_i run from 2 to 6. So, depending on whether N_i is 2 , 3, 4, 5 or 6, I need to compute these conditional probabilities. For instance, if N_i is 2, I need to compute only one conditional probability; if N_i is 6, I need to compute five of these conditional probabilities. This is how I proceed. I start by creating a 6x6 matrix of zeros and a 1x15 vector of zeros. S <- matrix(0, 6, 6) # CREATE A 6x6 MATRIX OF ZEROS y <- numeric(15) # CREATE A 1x15 VECTOR OF ZEROS Then the following part of the code computes the conditional probabilities as rows of the matrix S. The code picks up the jth row if the family has N_i=j. Once this is done, I store the lower triangle of the matrix (i.e., entries below the principal diagonal) in the y vector. --------------- begin code fragment ------------------------- q <- 1/(1+exp(delta)) # PROBABILITY OF MALE BIRTH ###### N1 IS "DESIRED FAMILY SIZE" ###################### ##### alive IS THE NUMBER OF CHILDREN ALIVE IN THE FAMILY ##### nb IS NUMBER OF BOYS IN THE FAMILY ##### ng IS THE NUMBER OF GIRLS IN THE FAMILY if (alive[j]==N1) { for (i in 1:(N1-1)) { S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j])) } } else { for (i in 1:(N1-1)) { S[N1,i] <- 0 } } ######### CREATE A 1x6 VECTOR WITH POSITION OF BOYS WITHIN FAMILY # boy1 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS NO BOYS) # boy2 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 1 BOY) # boy3 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 2 BOYS) # boy4 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 3 BOYS) # boy5 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 4 BOYS) # boy6 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 5 BOYS) x <- c(boy1[j], boy2[j], boy3[j], boy4[j], boy5[j]) if (N1>1) { for (i in 1:(N1-1)) { if (alive[j]>x[i] & x[i]>0) { S[N1,i] <- 0 } if (x[i] == alive[j] ) { S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j])) } } } y <- c(S[2,1],S[3,1],S[3,2],S[4,1],S[4,2],S[4,3],S[5,1],S[5,2],S[5,3],S[5,4],S[6,1],S[6,2],S[6,3],S[6,4],S[6,5]) ------------------ end code fragment ------------------------------ Later, I use the entries in the y vector for computing the likelihood. Any suggestions on how to rework this part of the code will, I think, improve overall efficiency. Thanks in advance for your time. Deepankar On Sun, 2007-12-02 at 20:54 -0500, jim holtman wrote: > One thing that I would suggest that you do is to use Rprof on a subset > of the data that runs for 10-15 minutes and see where some of the hot > spots are. Since you have not provided commented, minimal, > self-contained, reproducible code, it is hard to determine where the > inefficiencies are since we don't have any data to run against it. > Some of the loop look like you are just assigning a value to a vector, > e.g., > > > > if (alive[j]==N1) { > > > > for (i in 1:(N1-1)) { > > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j])) > > } > > } > > > > else { > > for (i in 1:(N1-1)) { > > S[N1,i] <- 0 > > } > > > > } > > that can be done without loops, but without data, it is hard to > determine. Run Rprof and see what summary.Rprof shows to indicate > where to focus on. > > On Dec 2, 2007 12:49 PM, DEEPANKAR BASU <[EMAIL PROTECTED]> wrote: > > R Users: > > > > I am trying to estimate a model of fertility behaviour using birth history > > data with maximum likelihood. My code works but is extremely slow (because > > of several for loops and my programming inefficiencies); when I use the > > genetic algorithm to optimize the likelihood function, it takes several > > days to complete (on a machine with Intel Core 2 processor [2.66GHz] and > > 2.99 GB RAM). Computing the hessian and using it to calculate the standard > > errors takes a large chunk of this time. > > > > I am copying the code for my likelihood function below; it would be great > > if someone could suggest any method to speed up the code (by avoiding the > > for loops or by any other method). > > > > I am not providing details of my model or what exactly I am trying to do in > > each step of the computation below; i would be happy to provide these > > details if they are deemed necessary for re-working the code. > > > > Thanks. > > Deepankar > > > > > > --------- begin code ----------------------- > > > > LLK1 <- function(paramets, data.frame, ...) { # DEFINING THE LOGLIKELIHOOD > > FUNCTION > > > > # paramets IS A 1x27 VECTOR OF PARAMETERS OVER WHICH THE FUNCTION WILL BE > > MAXIMISED > > # data.frame IS A DATA FRAME. THE DATA FRAME CONTAINS OBSERVATIONS ON > > SEVERAL VARIABLES > > # (LIKE EDUCATION, AGE, ETC.) FOR EACH RESPONDENT. COLUMNS REFER TO > > VARIABLES AND ROWS REFER > > # TO OBSERVATIONS. > > > > ########## PARAMETERS ############################### > > > > # alpha: interaction between son targeting and family size > > # beta : son targeting > > # gamma : family size > > # delta : a 1x6 vector of probabilities of male birth at various parities > > (q1, q2, q3, q4, q5, q6) > > # zeta : a 1x11 vector of conditional probabilities with zeta[1]=1 always > > > > alpha <- paramets[1] # FIRST PARAMETER > > beta <- paramets[2:9] # SECOND TO SEVENTH PARAMETER > > gamma <- paramets[10:16] > > delta <- paramets[17] > > zeta <- paramets[18:27] # LAST 10 PARAMETERS > > > > ################# VARIABLES ############################### > > # READING IN THE VARIABLES IN THE DATA FRAME > > # AND RENAMING THEM FOR USE IN THE LIKELIHOOD FUNCTION > > > > everborn <- data.frame$v201 > > alive <- data.frame$alive > > age <- data.frame$age > > edu <- data.frame$edu > > rural <- data.frame$rur > > rich <- data.frame$rich > > middle <- data.frame$middle > > poor <- data.frame$poor > > work <- data.frame$work > > jointfam <- data.frame$jfam > > contracep <- data.frame$contra > > hindu <- data.frame$hindu > > muslim <- data.frame$muslim > > scaste <- data.frame$scaste > > stribe <- data.frame$stribe > > obc <- data.frame$obc > > ucaste <- data.frame$ucaste > > N <- data.frame$dfsize > > indN <- data.frame$dfsize1 # INDICATOR FUNCTION THAT dfsize==alive > > nb <- data.frame$nboy > > ng <- data.frame$ngirl > > ncord1 <- data.frame$ncord1 # FIRST CHILD: BOY=0; GIRL=1 > > ncord2 <- data.frame$ncord2 #SECOND CHILD: BOY=0; GIRL=1 > > ncord3 <- data.frame$ncord3 > > ncord4 <- data.frame$ncord4 > > ncord5 <- data.frame$ncord5 > > ncord6 <- data.frame$ncord6 # SIXTH CHILD: BOY=0; GIRL=1 > > > > > > > > ######### POSITION OF i^th BOY > > ################################################ > > boy1 <- data.frame$boy1 # BIRTH POSITION OF FIRST BOY (ZERO IF THE > > FAMILY HAS NO BOYS) > > boy2 <- data.frame$boy2 # BIRTH POSITION OF SECOND BOY (ZERO IF THE > > FAMILY HAS ONLY ONE BOY) > > boy3 <- data.frame$boy3 > > boy4 <- data.frame$boy4 > > boy5 <- data.frame$boy5 > > boy6 <- data.frame$boy6 # BIRTH POSITION OF SIXTH BOY (ZERO IF THE > > FAMILY HAS ONLY FIVE BOYS) > > > > > > ######################## CONDITIONAL PROBABILITIES > > ########################## > > qq21 <- 1 > > > > qq31 <- 1/(1+exp(zeta[1])) > > qq32 <- exp(zeta[1])/(1+exp(zeta[1])) > > > > qq41 <- 1/(1+exp(zeta[2])+exp(zeta[3])) > > qq42 <- exp(zeta[2])/(1+exp(zeta[2])+exp(zeta[3])) > > qq43 <- exp(zeta[3])/(1+exp(zeta[2])+exp(zeta[3])) > > > > qq51 <- 1/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6])) > > qq52 <- exp(zeta[4])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6])) > > qq53 <- exp(zeta[5])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6])) > > qq54 <- exp(zeta[6])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6])) > > > > qq61 <- 1/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10])) > > qq62 <- > > exp(zeta[7])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10])) > > qq63 <- > > exp(zeta[8])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10])) > > qq64 <- > > exp(zeta[9])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10])) > > qq65 <- > > exp(zeta[10])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10])) > > > > zeta1 <- > > c(qq21,qq31,qq32,qq41,qq42,qq43,qq51,qq52,qq53,qq54,qq61,qq62,qq63,qq64,qq65) > > > > ######################################################################### > > > > n <- length(N) # LENGTH OF SAMPLE; SIZE OF THE MAIN LOOP > > > > lglik <- numeric(n) # INITIALIZING THE LOGLIKELIHOOD FUNCTION > > # CREATES A 1xn VECTOR OF ZEROS > > > > for (j in 1:n) { # START OF MAIN LOOP > > > > S <- matrix(0, 6, 6) # CREATE A 6x6 MATRIX OF ZEROS > > y <- numeric(15) # CREATE A 1x15 VECTOR OF ZEROS > > N1 <- N[j] # DESIRED FAMILY SIZE > > > > > > q <- 1/(1+exp(delta)) # PROBABILITY OF MALE BIRTH > > > > > > if (alive[j]==N1) { > > > > for (i in 1:(N1-1)) { > > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j])) > > } > > } > > > > else { > > for (i in 1:(N1-1)) { > > S[N1,i] <- 0 > > } > > > > } > > > > ######### CREATE A 1x6 VECTOR WITH POSITION OF BOYS WITHIN FAMILY > > x <- c(boy1[j], boy2[j], boy3[j], boy4[j], boy5[j]) > > > > if (N1>1) { > > for (i in 1:(N1-1)) { > > if (alive[j]>x[i] & x[i]>0) { > > S[N1,i] <- 0 > > } > > if (x[i] == alive[j] ) { > > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j])) > > } > > } > > } > > > > y <- > > c(S[2,1],S[3,1],S[3,2],S[4,1],S[4,2],S[4,3],S[5,1],S[5,2],S[5,3],S[5,4],S[6,1],S[6,2],S[6,3],S[6,4],S[6,5]) > > > > > > z1 <- c(age[j],edu[j],work[j],rural[j],poor[j],middle[j],hindu[j]) > > # DETERMINANTS OF FAMILY SIZE > > z2 <- > > c(1,age[j],edu[j],work[j],contracep[j],rural[j],jointfam[j],hindu[j]) # > > DETERMINANTS OF SON TARGETING > > > > t1 <- > > (indN[j])*((q^(nb[j]))*((1-q)^(ng[j])))*(exp(-exp(sum(z1*gamma)))*((exp(sum(z1*gamma)))^N1)*pnorm(-sum(z2*beta)))/factorial(N1) > > t2 <- (sum(y*zeta1))*(exp(-exp(sum(z1*gamma) + > > alpha))*((exp(sum(z1*gamma) + > > alpha))^N1)*(1-pnorm(-sum(z2*beta)))/factorial(N1)) > > lglik[j] <- log(t1+t2) > > } > > > > return(-sum(lglik)) # RETURNING THE NEGATIVE OF THE LOGLIKELIHOOD > > # SUMMED OVER ALL OBSERVATIONS > > > > > > } > > > > ------------ end code ---------------------- > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.