Thanks Scott and Ian. Interesting examples. I've done survival analysis on 140 million rows, which I was able to aggregate down to about 3 million rows (I didn't need the additional time-series detail) with about 300 dependent variables. It's hard to imagine when it would be a billion or more rows unless it's consumer data for very large companies and involves time series. It's neat to hear that you've worked on problems that large.
I also tackled it in R using logistic regression. I skimmed Andrew Ng's Coursera course on machine learning, which covers gradient descent, but haven't put it into practice yet. It's been on my "some day" list to implement it, starting from an example that came across my reading list in F#: http://clear-lines.com/blog/post/Logistic-Regression.aspx

There is another example in Python: http://www.searsmerritt.com/blog/2013/7/parallel-stochastic-gradient-descent-for-logistic-regression-in-python

I searched the mailing list and found only one reference to it, which didn't come with a solution: http://www.jsoftware.com/pipermail/programming/2006-February/001246.html

Looking at the F# and Python examples, it seems like a skilled J programmer should be able to implement it fairly easily. I'd be challenged on both the J and the algorithm. Some day. A rough, untested first stab is below Scott's message.

On Sun, Nov 24, 2013 at 3:53 PM, Scott Locklin <[email protected]> wrote:
> Joe Bogner wrote:
>
>> Can anyone share specific examples where it was needed to scale out to multiple cores and machines? I am interested in learning about the types of problems this would be applied to. I have read some examples while researching but haven't run into anyone who has.
>
>> For example, last week I had to create a database of the best 100,000 solutions out of 56 billion combinations as part of a work deliverable. I am sure there may have been more elegant solutions; however, brute forcing with 4 instances of R and 32 gig of RAM took 3 hours, which was fine.
>
> Here's a couple I've run into:
>
> Marketing problems typically involve running some kind of clustering and/or survival analysis on a billion or more rows with up to dozens of dependent variables (channel, geo, time, past behavior, etc.). This has gotten a lot of attention in recent years, and is what most commercial "big data" a la websites is concerned with. You want to increase ad click-through rates, or sift through remainder ads looking for stuff that can be targeted. You can sample, but you might miss a lot of the most interesting stuff.
>
> There are plenty of forecasting problems consisting of a few hundred thousand or a few million highly seasonal channels with unknown but strong relationships (basically hierarchical clustering on ~500,000 * 700 days, or 700*90 if they are evil and want shorter time intervals), then standard forecasting on some reasonably sized aggregates. For this, I found a server with 512G and just let 'er rip in R, but not every company has a big server like that, I don't keep one 'round the house (and I might be contractually forbidden from using it anyway), and there are always bigger problems.
>
> Ones I haven't done, but am aware of:
>
> Insurance companies and other groups interested in probabilities do things like run logistic regression (or relatives) on terascale databases.
>
> Social advertising companies often need to build giant social network graphs. These are pretty easy to do, conceptually, in a SQL-like thing (I'm pretty sure this is what Hive was invented for). If you had something better than a SQL-like thing, you could do something better.
>
> Recommendation engines: often they do PCA or SVD type things on very large data sets (Netflix stuff).
>
> Document classification on very large databases (Yandex is probably doing this internally, but I know others have the problem).
> Real-time ad serving on cell phones: involves all kinds of big data problems; reconstructing geographic paths taken by the cell phone, correlating them with local things, with past behavior of the cell phone's owner, and with other cell phone owners with similar habits. I don't think this has been done well yet (my cell phone has no screen, so I don't see any creepy ads), but it's definitely being worked on.
>
> The simplest thing is just "big regression" or "big classification" of some kind. It's easy to construct an artificial data set. Stochastic gradient descent logistic regression might make a useful test algorithm against Mahout.
>
> -SL
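
For what it's worth, here is the sort of thing I have in mind: a plain batch gradient descent for logistic regression (not the stochastic/parallel variant the Python post covers), written off the top of my head and untested. The verb names and the toy data at the end are made up, not taken from any of the posts above.

   NB. sigmoid(z) = 1 / (1 + e^-z), applied elementwise
   sigmoid =: monad : '% 1 + ^ - y'
   mp =: +/ . *                         NB. matrix (inner) product

   NB. one descent step on the average negative log-likelihood:
   NB.   w <- w - alpha * X^T (sigmoid(X w) - labels) / n
   NB. assumes globals X (n by p design matrix with a leading column of
   NB. ones for the intercept), labels (n-vector of 0s and 1s), alpha
   step =: monad define
     w =. y                             NB. current weight vector
     p =. sigmoid X mp w                NB. predicted probabilities
     w - alpha * ((|: X) mp p - labels) % # labels
   )

   NB. toy usage: 200 rows, intercept plus two random features
   X =: 1 ,. ? 200 2 $ 0
   labels =: ? 200 $ 2
   alpha =: 0.1
   w =: (step ^: 1000) 3 $ 0            NB. 1000 iterations from zero weights

Making it stochastic would mean sampling a handful of rows of X and labels on each step instead of using the whole matrix, which is the part I'd want to get right before comparing against anything like Mahout.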
