80M rows by 4 features should be about 2.5GB uncompressed (80M x 4 doubles x
8 bytes). 10 iterations shouldn't take that long, even on a single executor.
Besides what Matei suggested, could you also verify the executor memory on the
Executors tab at http://localhost:4040? It is very likely the executors do not
have enough memory, in which case caching may be slower than reading directly
from disk. -Xiangrui
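A minimal sketch (not from the thread) of one way around the memory situation
Xiangrui describes; the app name and file path below are placeholders. If the
Executors tab shows less memory than the cached RDD needs, persisting with
MEMORY_AND_DISK lets partitions that do not fit spill to local disk instead of
being dropped and recomputed on every iteration.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lr-memory-check")         # hypothetical app name
data = sc.textFile("/path/to/data.csv")              # placeholder path
parsed = data.map(lambda line: [float(x) for x in line.split(',')])
# Spill to local disk when memory runs short, rather than dropping partitions:
parsed.persist(StorageLevel.MEMORY_AND_DISK)
parsed.count()   # materializes the cache; then check the Storage tab at :4040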

On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
> parallel so they get processed by a single task.
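(A sketch of the usual workaround, not part of Matei's message, assuming the
input really is a single .gz file; the path and partition count below are
placeholders. Since gzip is not splittable, the file is read by one task, but
repartitioning right after the read restores parallelism for the later
iterations.)

from pyspark import SparkContext

sc = SparkContext(appName="gzip-repartition")    # hypothetical app name
raw = sc.textFile("/path/to/data.csv.gz")        # a .gz file comes back as a single partition
points = raw.repartition(8).cache()              # spread rows across 8 partitions, one per core
points.count()                                   # first action pays the decompress + shuffle cost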
>
> It may also be worth looking at the application UI (http://localhost:4040)
> to see 1) whether all the data fits in memory in the Storage tab (maybe it
> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
> and 2) how many parallel tasks run in each iteration.
>
> Matei
>
> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>
> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark, and I am
> running only 10 iterations.
>
> The MLlib version of logistic regression doesn't seem to use all the cores
> on my machine.
>
> Regards,
> Krishna
>
>
>
> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>>
>> Are you using the logistic_regression.py in examples/src/main/python or
>> examples/src/main/python/mllib? The first one is an example of writing
>> logistic regression by hand and won’t be as efficient as the MLlib one. I
>> suggest trying the MLlib one.
>>
>> You may also want to check how many iterations it runs — by default I
>> think it runs 100, which may be more than you need.
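(An aside, not from Matei's message: with the MLlib API the iteration count is
an explicit keyword argument, so it is easy to keep it at 10. The app name and
the tiny toy dataset below are placeholders, only there to make the snippet
self-contained.)

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="lr-iterations")       # hypothetical app name
# Two toy points, just to make the snippet runnable on its own.
parsedData = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                             LabeledPoint(1.0, [1.0, 0.0])])
# Train for exactly 10 iterations rather than relying on a default.
model = LogisticRegressionWithSGD.train(parsedData, iterations=10)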
>>
>> Matei
>>
>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
>> > using MLlib on a beefy single machine with about 128GB of RAM. The dataset
>> > has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
>> >
>> > I am running the code on all 8 cores with 20G of memory using:
>> >
>> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>> >
>> > It seems to take about 3.5 hours without caching and over 5 hours with
>> > caching.
>> >
>> > What is the recommended way to use Spark on a beefy single machine?
>> >
>> > Any suggestions will help!
>> >
>> > Regards,
>> > Krishna
>> >
>> >
>> > Code sample:
>> >
>> > ---------------------------------------------------------------------------------------------------------------------
>> > # Imports and SparkContext setup (not shown in the original snippet)
>> > import sys
>> > import time
>> >
>> > from pyspark import SparkContext
>> > from pyspark.mllib.classification import LogisticRegressionWithSGD
>> > from pyspark.mllib.regression import LabeledPoint
>> >
>> > sc = SparkContext(appName="LogisticRegressionWithSGD")
>> >
>> > # Dataset: path to the CSV file is passed as the first argument
>> > d = sys.argv[1]
>> > data = sc.textFile(d)
>> >
>> > # Load and parse the data: first column is the label, the rest are features
>> > # ----------------------------------------------------------------------------------------------------------
>> > def parsePoint(line):
>> >     values = [float(x) for x in line.split(',')]
>> >     return LabeledPoint(values[0], values[1:])
>> >
>> > parsedData = data.map(parsePoint).cache()
>> > results = {}
>> >
>> > # Train the model
>> > # ----------------------------------------------------------------------------------------------------------
>> > start_time = time.time()
>> > niters = 10
>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>> >
>> > # Evaluate the model on the training data (Python 2 tuple-unpacking lambda)
>> > labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
>> > trainErr = (labelsAndPreds.filter(lambda (v, p): v != p).count()
>> >             / float(parsedData.count()))
>> >
>>
>
>
