You could try method = "pfn".

Sent from my iPhone
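For reference, a minimal sketch of that suggestion (assuming the quantreg package; "pfn" is the preprocessing variant of the Frisch-Newton interior-point algorithm, intended for very large n. The simulated data and model below are only stand-ins for the poster's actual 18-million-row data set):

```r
library(quantreg)

## Stand-in data; the real set has ~18 million rows with fields f1, f2, f3, output
n <- 1e6
d <- data.frame(f1 = rbinom(n, 1, 0.5),
                f2 = rbinom(n, 1, 0.5),
                f3 = rbinom(n, 1, 0.5))
d$output <- with(d, 70 - 20 * f2 + rnorm(n, sd = 5))

## method = "pfn" preprocesses the problem before the Frisch-Newton
## solver, which scales far better than the default "br"
## (Barrodale-Roberts simplex) as n grows
fit <- rq(output ~ f1 * f2 * f3, tau = 0.5, data = d, method = "pfn")
```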
> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzh...@ucsd.edu> wrote:
>
> Hi William,
>
> Thank you very much for your reply.
>
> I did a subsampling to reduce the number of samples to ~1.8 million. It
> seems to work fine except for the 99th percentile (the p-values for all
> the features are 1.0). Does this mean I'm subsampling too much? How
> should I interpret the result?
>
> tau: [1] 0.25
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)   72.15700    0.03651   1976.10513   0.00000
> f1            -0.51000    0.04906    -10.39508   0.00000
> f2           -20.44200    0.03933   -519.78766   0.00000
> f3            -2.37000    0.04871    -48.65117   0.00000
> f1:f2         -2.52500    0.05315    -47.50361   0.00000
> f1:f3          1.03600    0.06573     15.76193   0.00000
> f2:f3          3.41300    0.05247     65.05075   0.00000
> f1:f2:f3      -0.83800    0.07120    -11.77002   0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>     0.99), data = data_stats)
>
> tau: [1] 0.5
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)   83.80900    0.05626   1489.61222   0.00000
> f1            -0.92200    0.07528    -12.24692   0.00000
> f2           -27.90700    0.05937   -470.07189   0.00000
> f3            -6.45000    0.07204    -89.53909   0.00000
> f1:f2         -2.66500    0.07933    -33.59275   0.00000
> f1:f3          1.99000    0.09869     20.16440   0.00000
> f2:f3          7.09600    0.07611     93.23813   0.00000
> f1:f2:f3      -1.71200    0.10390    -16.47660   0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>     0.99), data = data_stats)
>
> tau: [1] 0.75
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  102.71700    0.10175   1009.45946   0.00000
> f1            -1.59300    0.13241    -12.03125   0.00000
> f2           -40.64200    0.10623   -382.58456   0.00000
> f3           -14.40900    0.12096   -119.11988   0.00000
> f1:f2         -2.97600    0.13867    -21.46071   0.00000
> f1:f3          3.74600    0.16335     22.93165   0.00000
> f2:f3         14.14800    0.12692    111.47217   0.00000
> f1:f2:f3      -3.16400    0.17159    -18.43899   0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>     0.99), data = data_stats)
>
> tau: [1] 0.9
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  130.89400    0.20609    635.12464   0.00000
> f1            -2.55500    0.28139     -9.07995   0.00000
> f2           -60.90500    0.21322   -285.64558   0.00000
> f3           -29.42300    0.23409   -125.69092   0.00000
> f1:f2         -2.77700    0.29052     -9.55870   0.00000
> f1:f3          7.89700    0.33308     23.70870   0.00000
> f2:f3         27.78100    0.24338    114.14722   0.00000
> f1:f2:f3      -6.95800    0.34491    -20.17327   0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>     0.99), data = data_stats)
>
> tau: [1] 0.95
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  157.45900    0.42733    368.47413   0.00000
> f1            -4.10200    0.55834     -7.34678   0.00000
> f2           -81.24000    0.44012   -184.58697   0.00000
> f3           -46.17500    0.46235    -99.87033   0.00000
> f1:f2         -2.01700    0.57651     -3.49866   0.00047
> f1:f3         15.67000    0.67409     23.24600   0.00000
> f2:f3         43.00100    0.47973     89.63500   0.00000
> f1:f2:f3     -14.05100    0.69737    -20.14843   0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 +
>     f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95,
>     0.99), data = data_stats)
>
> tau: [1] 0.99
>
> Coefficients:
>              Value          Std. Error    t value       Pr(>|t|)
> (Intercept)   2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
> f1           -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
> f2           -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
> f3           -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
> f1:f2         5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
> f1:f3         6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
> f2:f3         1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
> f1:f2:f3     -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00
>
> Warning message:
> In summary.rq(xi, ...) : 288000 non-positive fis
>
>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdun...@tibco.com> wrote:
>>
>> You can time it yourself on increasingly large subsets of your data. E.g.,
>>
>> > dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>> +                   x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
>> > dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>> > t <- vapply(n <- 4^(3:10), FUN=function(n) {
>> +     d <- dat[seq_len(n), ]
>> +     print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
>> +   }, FUN.VALUE=numeric(5))
>>    user  system elapsed
>>       0       0       0
>>    user  system elapsed
>>       0       0       0
>>    user  system elapsed
>>    0.02    0.00    0.01
>>    user  system elapsed
>>    0.01    0.00    0.02
>>    user  system elapsed
>>    0.10    0.00    0.11
>>    user  system elapsed
>>    1.09    0.00    1.10
>>    user  system elapsed
>>   13.05    0.02   13.07
>>    user  system elapsed
>>  273.30    0.11  273.74
>> > t
>>             [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>> user.self      0    0 0.02 0.01 0.10 1.09 13.05 273.30
>> sys.self       0    0 0.00 0.00 0.00 0.00  0.02   0.11
>> elapsed        0    0 0.01 0.02 0.11 1.10 13.07 273.74
>> user.child    NA   NA   NA   NA   NA   NA    NA     NA
>> sys.child     NA   NA   NA   NA   NA   NA    NA     NA
>>
>> Do some regressions on t["elapsed",] as a function of n and predict up to
>> n=10^7.
>> E.g.,
>>
>> > summary(lm(t["elapsed",] ~ poly(n, 4)))
>>
>> Call:
>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>
>> Residuals:
>>          1          2          3          4          5          6          7          8
>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>
>> Coefficients:
>>              Estimate Std. Error  t value Pr(>|t|)
>> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
>> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
>> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
>> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
>> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Residual standard error: 0.003565 on 3 degrees of freedom
>> Multiple R-squared:      1, Adjusted R-squared:      1
>> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>>
>> It does not look good for n=10^7.
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzh...@ucsd.edu> wrote:
>>>
>>> Hi all,
>>>
>>> I'm using quantreg rq() to perform quantile regression on a large data
>>> set. Each record has 4 fields and there are about 18 million records in
>>> total. I wonder if anyone has tried rq() on a large dataset and how long
>>> I should expect it to take, or whether it is simply too large and I
>>> should subsample the data. I would like to have an idea before I start
>>> to run and wait forever.
>>>
>>> In addition, I would appreciate it if anyone could give me an idea of
>>> roughly how long rq() takes to run for a given dataset size.
>>>
>>> Yunqi
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
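The extrapolation Bill describes could be re-run along these lines (a hypothetical sketch, hard-coding the elapsed times from his transcript; `raw = TRUE` keeps the polynomial on the original n scale so `predict()` can be applied directly to new sample sizes):

```r
## Sample sizes and elapsed times from the timing loop above
n <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

## Fit elapsed time as a quartic in n
fit <- lm(elapsed ~ poly(n, 4, raw = TRUE))

## Predicted run time (seconds) for the subsampled and full data sets
predict(fit, newdata = data.frame(n = c(1.8e6, 1e7, 1.8e7)))
```

The quartic fit is purely empirical; the point is only that the elapsed time grows much faster than linearly in n, so the full 18-million-row fit would take orders of magnitude longer than the largest timed subset.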