[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy Freeman updated SPARK-3995:
----------------------------------
    Description:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); Anaconda is a popular distribution, so the bug is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000), 5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
    self.initRandomGenerator(split)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
    self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy, a random seed larger than 2 ** 32 - 1 would silently get truncated to 32 bits. NumPy fixed this in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c), so an out-of-range seed now raises the error shown above. But sampling from {{(0, sys.maxint)}} often yields ints larger than 2 ** 32 - 1, which effectively breaks sampling operations in PySpark (unless the seed is set manually).

I am putting a PR together now (the fix is very simple!).

> [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
> -------------------------------------------------------------
>
>                 Key: SPARK-3995
>                 URL: https://issues.apache.org/jira/browse/SPARK-3995
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Jeremy Freeman
>            Priority: Critical
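
For illustration, the kind of fix the description hints at might look roughly like the sketch below. It assumes the only change needed is to cap the default seed at the largest value NumPy accepts, taken from the error message (4294967295). The names {{choose_seed}} and {{MAX_SEED}} are hypothetical and are not taken from the actual PR.

```python
import random

# Upper bound from the NumPy error message: "Seed must be between 0 and 4294967295".
MAX_SEED = 2 ** 32 - 1


def choose_seed(seed=None):
    """Return the caller's seed, or a random default within [0, MAX_SEED].

    Hypothetical helper for illustration only; not the actual PySpark patch.
    """
    if seed is not None:
        # A manually supplied seed is passed through unchanged.
        return seed
    # random.randint is inclusive on both ends, so the default always fits in 32 bits.
    return random.randint(0, MAX_SEED)
```

With this approach a manually supplied seed is still used as given, so an out-of-range value passed by the caller would still be rejected by NumPy 1.9. The sketch uses a literal bound rather than {{sys.maxint}}, which exists only in Python 2.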