Are you using the logistic_regression.py in examples/src/main/python or 
examples/src/main/python/mllib? The first one is an example of writing logistic 
regression by hand and won’t be as efficient as the MLlib one. I suggest trying 
the MLlib one.

You may also want to check how many iterations it runs — by default I think it 
runs 100, which may be more than you need.
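
To see concretely why the iteration count matters, here is a toy logistic-regression trainer in plain Python (no Spark; the data, learning rate, and iteration counts are all made up for illustration) that reports training error for a few pass counts:

```python
import math
import random

def train(points, iterations, lr=0.5):
    """points: list of (label, [features]); plain per-point gradient updates."""
    w = [0.0] * len(points[0][1])
    for _ in range(iterations):
        for label, x in points:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of the log-loss for this single point
            for i, xi in enumerate(x):
                w[i] -= lr * (pred - label) * xi
    return w

def train_error(points, w):
    wrong = 0
    for label, x in points:
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if (1.0 if margin > 0 else 0.0) != label:
            wrong += 1
    return wrong / float(len(points))

random.seed(0)
# Linearly separable toy data: label is 1 iff x0 + x1 > 1; last feature is a bias term
data = []
for _ in range(200):
    x = [random.random(), random.random(), 1.0]
    data.append((1.0 if x[0] + x[1] > 1 else 0.0, x))

for niters in (1, 10, 100):
    w = train(data, niters)
    print("iterations=%d train_error=%.3f" % (niters, train_error(data, w)))
```

With too few passes the model underfits; past a point, extra passes only add runtime, which is why capping the iteration count is worth checking.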

Matei

On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:

> Hi All, 
> 
> I am new to Spark and I am trying to run LogisticRegression (with SGD) using 
> MLlib on a beefy single machine with about 128 GB RAM. The dataset has about 
> 80M rows with only 4 features, so it barely occupies 2 GB on disk.
> 
> I am running the code using all 8 cores with 20G memory using
> spark-submit --executor-memory 20G --master local[8] logistic_regression.py 
> 
> It seems to take about 3.5 hours without caching and over 5 hours with 
> caching.
> 
> What is the recommended use for Spark on a beefy single machine?
> 
> Any suggestions will help!
> 
> Regards, 
> Krishna
> 
> 
> Code sample:
> ---------------------------------------------------------------------------------------------------------------------
> # Imports needed for the sample to run standalone
> import sys, time
> from pyspark import SparkContext
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.classification import LogisticRegressionWithSGD
> 
> sc = SparkContext(appName="LogisticRegressionExample")  # app name is arbitrary
> 
> # Dataset
> d = sys.argv[1]
> data = sc.textFile(d)
> 
> # Load and parse the data
> # ----------------------------------------------------------------------------------------------------------
> def parsePoint(line):
>     values = [float(x) for x in line.split(',')]
>     return LabeledPoint(values[0], values[1:])
> _parsedData = data.map(parsePoint)
> parsedData = _parsedData.cache()
> results = {}
> 
> # Spark
> # ----------------------------------------------------------------------------------------------------------
> start_time = time.time()
> # Build the gl_model
> niters = 10
> spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
> 
> # Evaluate the gl_model on training data
> labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
> # Note: lambda tuple unpacking like `lambda (v, p):` is Python 2 only
> trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
> 

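As an aside, the parsing step in the quoted code can be sanity-checked without a Spark installation. A minimal sketch in plain Python, where the sample CSV line is invented and a `namedtuple` stands in for MLlib's `LabeledPoint`:

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint (label + feature vector)
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

def parsePoint(line):
    # First CSV field is the label, the rest are features
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

p = parsePoint("1.0,0.5,2.0,3.5,4.0")
print(p.label, p.features)
```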