Hi Krishna,

Specifying executor memory in local mode has no effect, because all of the threads run inside the same JVM. You can either try --driver-memory 60g or start a standalone server.
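For example, something like this should work (60g is just a rough number for your 128GB box, and local[8] matches the command you posted):

    spark-submit --driver-memory 60g --master local[8] logistic_regression.py

I have also put a short, untested sketch of the repartition/iterations changes Matei suggested at the bottom of this mail.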
Best,
Xiangrui

On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
> take that long, even on a single executor. Besides what Matei
> suggested, could you also verify the executor memory shown at
> http://localhost:4040 in the Executors tab. It is very likely the
> executors do not have enough memory. In that case, caching may be
> slower than reading directly from disk. -Xiangrui
>
> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Ah, is the file gzipped by any chance? We can't decompress gzipped files
>> in parallel, so they get processed by a single task.
>>
>> It may also be worth looking at the application UI (http://localhost:4040)
>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
>> and 2) how many parallel tasks run in each iteration.
>>
>> Matei
>>
>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>
>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
>> running only 10 iterations.
>>
>> The MLlib version of logistic regression doesn't seem to use all the cores
>> on my machine.
>>
>> Regards,
>> Krishna
>>
>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>
>>> Are you using the logistic_regression.py in examples/src/main/python or
>>> examples/src/main/python/mllib? The first one is an example of writing
>>> logistic regression by hand and won't be as efficient as the MLlib one. I
>>> suggest trying the MLlib one.
>>>
>>> You may also want to check how many iterations it runs — by default I
>>> think it runs 100, which may be more than you need.
>>>
>>> Matei
>>>
>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
>>> > using MLlib on a beefy single machine with about 128GB RAM. The dataset
>>> > has about 80M rows with only 4 features, so it barely occupies 2GB on disk.
>>> >
>>> > I am running the code using all 8 cores with 20G memory using
>>> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>>> >
>>> > It seems to take about 3.5 hours without caching and over 5 hours with
>>> > caching.
>>> >
>>> > What is the recommended use for Spark on a beefy single machine?
>>> >
>>> > Any suggestions will help!
>>> >
>>> > Regards,
>>> > Krishna
>>> >
>>> > Code sample:
>>> > ------------------------------------------------------------------
>>> > import sys
>>> > import time
>>> >
>>> > from pyspark import SparkContext
>>> > from pyspark.mllib.regression import LabeledPoint
>>> > from pyspark.mllib.classification import LogisticRegressionWithSGD
>>> >
>>> > sc = SparkContext(appName="LogisticRegressionWithSGD")
>>> >
>>> > # Dataset
>>> > d = sys.argv[1]
>>> > data = sc.textFile(d)
>>> >
>>> > # Load and parse the data: each line is "label,feature1,...,feature4"
>>> > # ------------------------------------------------------------------
>>> > def parsePoint(line):
>>> >     values = [float(x) for x in line.split(',')]
>>> >     return LabeledPoint(values[0], values[1:])
>>> >
>>> > _parsedData = data.map(parsePoint)
>>> > parsedData = _parsedData.cache()
>>> > results = {}
>>> >
>>> > # Spark
>>> > # ------------------------------------------------------------------
>>> > start_time = time.time()
>>> >
>>> > # Build the model
>>> > niters = 10
>>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>>> >
>>> > # Evaluate the model on training data
>>> > labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
>>> > trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
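PS: Here is a minimal, untested sketch of the two changes discussed in the thread: repartitioning right after reading (so a gzipped file, which loads as a single partition, gets spread across all 8 local cores) and passing the iteration count explicitly instead of relying on the default of 100. The partition count of 8 is just a guess to match local[8], and the variable and function names are taken from your code sample:

    # Spread a gzipped (single-partition) input across all 8 local cores
    # before parsing and caching.
    data = sc.textFile(sys.argv[1]).repartition(8)
    parsedData = data.map(parsePoint).cache()

    # Train with an explicit iteration count (the default is 100).
    spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=10)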