Hi, I am doing the following exercise: I have 100 million labeled records (2.7 GB in total) in LibSVM (sparse) format, split across 200 files on HDFS (each file ~14 MB), so each file has about 500K records. Only 50K of these 100 million are labeled "positive"; the rest are all "negative". I am taking a sample of 50K from the "negative" set, merging it with the 50K positives, and splitting the result into a 50% training set and a 50% test set. I am training an Elastic Net logistic regression (without regularization) on the training set, testing its performance on the 50K test data points, and then applying the model to the rest of the data (100 million - 100K) to find the class-conditional probabilities of those examples being positive.
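For concreteness, here is a minimal sketch of the sampling, merging and splitting steps described above. It is written in plain Python, not Spark, and it is not the actual job. It assumes that positive records carry a label greater than 0 and that the negatives are drawn with a single-pass Bernoulli sample, so roughly 50K negatives are expected, not exactly 50K. The names `label_of`, `sample_and_split` and `NEG_FRACTION` are hypothetical, chosen only for illustration.

```python
import random

# Fraction of negatives to keep so that about 50K of the
# (100 million - 50K) negative records are expected to be sampled.
NEG_FRACTION = 50_000 / (100_000_000 - 50_000)


def label_of(line):
    """Return the label of one LibSVM-format line, e.g. '1 3:0.5 7:1.2'."""
    return float(line.split(None, 1)[0])


def sample_and_split(lines, neg_fraction, seed=42):
    """Keep every positive, Bernoulli-sample the negatives, split 50/50.

    Assumption: a record is positive when its label is greater than 0.
    The data is read once; each negative is kept with probability
    neg_fraction, so the number of sampled negatives is only approximate.
    """
    rng = random.Random(seed)
    selected = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if label_of(line) > 0 or rng.random() < neg_fraction:
            selected.append(line)
    # Shuffle so that positives and negatives are mixed before splitting.
    rng.shuffle(selected)
    half = len(selected) // 2
    return selected[:half], selected[half:]


if __name__ == "__main__":
    # Tiny synthetic stand-in for the real data: 100 positives, 10,000 negatives.
    data = ["1 1:1.0"] * 100 + ["0 2:1.0"] * 10_000
    train, test = sample_and_split(data, neg_fraction=0.01, seed=0)
    print(len(train), len(test))
```

The point of the sketch is that positives and sampled negatives can be selected in one pass over the records, with no separate filter, sample and merge passes. Whether that helps in the Spark job depends on how the stages are actually written.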
I have a 2-node cluster, one of them set up as master and both of them workers, each node having 10 GB executor memory and the driver having 10 GB memory. My Hadoop cluster runs on the same machines as my Spark cluster. My Spark application is aborting after running for more than 3 hours, and it is not even reaching the logistic regression part in these 3 hours: all of the time is going into the sampling, filtering and merging. Any ballpark estimate of how long it should take? Are there some known benchmarks for logistic regression?

--
Bibudh Lahiri
Senior Data Scientist, Impetus Technologies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/