Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55539838

I ran a simple logistic regression performance test on my local machine (an Ubuntu desktop with 8 GB of RAM). I used two data sizes: 2m records, which was not memory constrained, and 10m records, which was memory constrained (generating log messages such as `CacheManager: Not enough space to cache partition`). I tested without this patch, with this patch, and with a modified version of this patch that persists the deserialized objects using `MEMORY_ONLY_SER`. Here are the results (each reported runtime, in seconds, is the mean of 3 runs):

2m records:

| configuration | runtime (s) |
| --- | --- |
| master | 47.9099563758 |
| w/ patch | 32.1143682798 |
| w/ `MEMORY_ONLY_SER` | 79.4589416981 |

10m records:

| configuration | runtime (s) |
| --- | --- |
| master | 2130.3178509871 |
| w/ patch | 3232.856136322 |
| w/ `MEMORY_ONLY_SER` | 2772.3923886617 |

In the in-memory case, this patch provides a 33% speedup over master, while the `MEMORY_ONLY_SER` version is 66% slower than master. In the test case with insufficient memory to keep all of the `cache()`-ed training rdd partitions cached at once, this patch is 52% slower than master while `MEMORY_ONLY_SER` is 30% slower. I'm not that familiar with the typical mllib memory profile. Do you think the in-memory result here would be similar to a real world run?

Finally, here are the scripts I used. Let me know if they seem reasonable. The data generation was roughly inspired by your mllib perf test in spark-perf.

Data generation:

```python
import random

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

class NormalGenerator:
    """Samples from a normal distribution with randomly chosen parameters."""
    def __init__(self):
        self.mu = random.random()
        self.sigma = random.random()

    def __call__(self, rnd):
        return rnd.normalvariate(self.mu, self.sigma)

class PointGenerator:
    """Generates 5-feature labeled points, drawing features from one of two
    per-label sets of distributions."""
    def __init__(self):
        self.generators = [[NormalGenerator() for _ in range(5)] for _ in range(2)]

    def __call__(self, rnd):
        label = rnd.choice([0, 1])
        return LabeledPoint(float(label), [g(rnd) for g in self.generators[label]])

pointGenerator = PointGenerator()
sc = SparkContext()

def generatePoints(n):
    # Generate n points across 10 partitions, each with a deterministic seed.
    def generateData(index):
        rnd = random.Random(hash(str(index)))
        for _ in range(n / 10):
            yield pointGenerator(rnd)
    points = sc.parallelize(range(10), 10).flatMap(generateData)
    print points.count()
    points.saveAsPickleFile('logistic%.0e' % n)

generatePoints(int(2e6))
generatePoints(int(1e7))
```

Test:

```python
import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])

# Time the training run, including any caching the patch performs internally.
start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + repr(time.time() - start)
print model.weights
```
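For anyone who wants to poke at the storage-level comparison without rebuilding the patch, here is a minimal sketch of the idea: persist the input RDD explicitly from the driver before training. This is only an approximation of what the patch does (the patch caches internally during training), so treat it as illustrative rather than equivalent.

```python
# Sketch only: approximates the MEMORY_ONLY_SER variant by persisting the
# input RDD from the driver. The patch itself caches internally, so this is
# not an exact reproduction of the numbers above.
import sys
import time

from pyspark import SparkContext, StorageLevel
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])

# Swap in StorageLevel.MEMORY_ONLY to compare against plain cache().
points.persist(StorageLevel.MEMORY_ONLY_SER)

start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + repr(time.time() - start)
```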