Github user staple commented on the pull request:

    https://github.com/apache/spark/pull/2362#issuecomment-55539838

I ran a simple logistic regression performance test on my local machine (an Ubuntu desktop with 8 GB of RAM). I used two data sizes: 2m records, which was not memory constrained, and 10m records, which was memory constrained (generating log messages such as `CacheManager: Not enough space to cache partition`). I tested without this patch, with this patch, and with a modified version of this patch that persists the deserialized objects using `MEMORY_ONLY_SER`. Here are the results (each reported runtime, in seconds, is the mean of 3 runs):

2m records:

| configuration | runtime (s) |
| --- | --- |
| master | 47.9099563758 |
| w/ patch | 32.1143682798 |
| w/ `MEMORY_ONLY_SER` | 79.4589416981 |

10m records:

| configuration | runtime (s) |
| --- | --- |
| master | 2130.3178509871 |
| w/ patch | 3232.856136322 |
| w/ `MEMORY_ONLY_SER` | 2772.3923886617 |

In the in-memory case, this patch provides a 33% speedup over master, while the `MEMORY_ONLY_SER` version is 66% slower than master. In the test case with insufficient memory to keep all of the `cache()`-ed training rdd partitions cached at once, this patch is 52% slower than master while `MEMORY_ONLY_SER` is 30% slower. I'm not that familiar with the typical mllib memory profile. Do you think the in-memory result here would be similar to a real world run?

Finally, here are the scripts I used. Let me know if they seem reasonable. The data generation was roughly inspired by your mllib perf test in spark-perf.

Data generation:

```python
import random

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

class NormalGenerator:
    """Samples from a normal distribution with randomly chosen parameters."""
    def __init__(self):
        self.mu = random.random()
        self.sigma = random.random()

    def __call__(self, rnd):
        return rnd.normalvariate(self.mu, self.sigma)

class PointGenerator:
    """Generates 5-feature labeled points, drawing features from one of two
    per-label sets of distributions."""
    def __init__(self):
        self.generators = [[NormalGenerator() for _ in range(5)] for _ in range(2)]

    def __call__(self, rnd):
        label = rnd.choice([0, 1])
        return LabeledPoint(float(label), [g(rnd) for g in self.generators[label]])

pointGenerator = PointGenerator()
sc = SparkContext()

def generatePoints(n):
    # Generate n points across 10 partitions, each with a deterministic seed.
    def generateData(index):
        rnd = random.Random(hash(str(index)))
        for _ in range(n / 10):
            yield pointGenerator(rnd)
    points = sc.parallelize(range(10), 10).flatMap(generateData)
    print points.count()
    points.saveAsPickleFile('logistic%.0e' % n)

generatePoints(int(2e6))
generatePoints(int(1e7))
```

Test:

```python
import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])

# Time the training run, including any caching the patch performs internally.
start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + repr(time.time() - start)
print model.weights
```
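For anyone who wants to poke at the storage-level comparison without rebuilding the patch, here is a minimal sketch of the idea: persist the input RDD explicitly from the driver before training. This is only an approximation of what the patch does (the patch caches internally during training), so treat it as illustrative rather than equivalent.

```python
# Sketch only: approximates the MEMORY_ONLY_SER variant by persisting the
# input RDD from the driver. The patch itself caches internally, so this is
# not an exact reproduction of the numbers above.
import sys
import time

from pyspark import SparkContext, StorageLevel
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext()
points = sc.pickleFile(sys.argv[1])

# Swap in StorageLevel.MEMORY_ONLY to compare against plain cache().
points.persist(StorageLevel.MEMORY_ONLY_SER)

start = time.time()
model = LogisticRegressionWithSGD.train(points, 100)
print 'Runtime: ' + repr(time.time() - start)
```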