Are you using the logistic_regression.py in examples/src/main/python or examples/src/main/python/mllib? The first one is an example of writing logistic regression by hand and won’t be as efficient as the MLlib one. I suggest trying the MLlib one.
You may also want to check how many iterations it runs; by default I think it runs 100, which may be more than you need.

Matei

On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:

> Hi All,
>
> I am new to Spark and I am trying to run LogisticRegression (with SGD)
> using MLlib on a beefy single machine with about 128 GB RAM. The dataset
> has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
>
> I am running the code using all 8 cores with 20 GB of memory via
>
> spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>
> It seems to take about 3.5 hours without caching and over 5 hours with
> caching.
>
> What is the recommended use for Spark on a beefy single machine?
>
> Any suggestions will help!
>
> Regards,
> Krishna
>
>
> Code sample:
> ---------------------------------------------------------------------------
> # Dataset
> d = sys.argv[1]
> data = sc.textFile(d)
>
> # Load and parse the data
> # -------------------------------------------------------------------------
> def parsePoint(line):
>     values = [float(x) for x in line.split(',')]
>     return LabeledPoint(values[0], values[1:])
>
> _parsedData = data.map(parsePoint)
> parsedData = _parsedData.cache()
> results = {}
>
> # Spark
> # -------------------------------------------------------------------------
> start_time = time.time()
>
> # Build the gl_model
> niters = 10
> spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>
> # Evaluate the gl_model on training data
> labelsAndPreds = parsedData.map(lambda p: (p.label,
>                                            spark_model.predict(p.features)))
> trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
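For what it's worth, the parsing and error-rate logic in the snippet above can be sanity-checked locally without Spark before launching the full job. The sketch below mirrors the same map/filter pipeline on plain Python lists; the predictor here is a hypothetical stand-in that always returns 0, just to exercise the error-rate arithmetic, and the two input lines are made up for illustration:

```python
# Spark-free sketch of the same parse-and-score pipeline, useful for
# verifying the per-line logic on a handful of rows. No pyspark needed.

def parse_point(line):
    # Same split as parsePoint above: first field is the label,
    # the rest are features.
    values = [float(x) for x in line.split(',')]
    return (values[0], values[1:])

# Two made-up CSV lines in the same label-first format as the dataset.
lines = ["1,0.5,2.0,3.0,4.0", "0,1.5,2.5,0.0,1.0"]
parsed = [parse_point(line) for line in lines]

# Stand-in predictor (always predicts 0.0) in place of spark_model.predict.
labels_and_preds = [(label, 0.0) for label, _features in parsed]

# Same error-rate computation as trainErr above: fraction of mismatches.
train_err = sum(1 for v, p in labels_and_preds if v != p) / float(len(labels_and_preds))
print(train_err)  # one of the two labels is 1.0, so this prints 0.5
```

Note that the `lambda (v, p): ...` tuple-unpacking form in the original snippet is Python 2 only; on Python 3 you would index into the tuple instead, as the generator expression here does.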