Re: PySpark: Overusing allocated cores / too many processes

2017-09-27 Thread Fabian Böhnlein
It ended up being unintended multi-threading of numpy <https://stackoverflow.com/questions/17053671/python-how-do-you-stop-numpy-from-multithreading>, solved by export MKL_NUM_THREADS=1.
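A minimal sketch (not from the thread itself) of applying the same limit per executor through Spark's executorEnv settings instead of a shell export; the app name and the extra OPENBLAS/OMP variables are assumptions:

    from pyspark import SparkConf, SparkContext

    # Pin the BLAS/OpenMP thread pools to one thread per Python worker so numpy
    # does not spawn more threads than the cores allocated to the executor.
    conf = (SparkConf()
            .setAppName("limit-numpy-threads")  # assumed app name
            .set("spark.executorEnv.MKL_NUM_THREADS", "1")
            .set("spark.executorEnv.OPENBLAS_NUM_THREADS", "1")
            .set("spark.executorEnv.OMP_NUM_THREADS", "1"))
    sc = SparkContext(conf=conf)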

PySpark: Overusing allocated cores / too many processes

2017-09-26 Thread Fabian Böhnlein
Hi all, the above topic has been mentioned before on this list between March and June 2016, again mentioned

PySpark Pickle reading does not find module

2016-02-23 Thread Fabian Böhnlein
Hi all, how can I make a module/class visible to sc.pickleFile? It seems to be missing from the environment after an import in the driver PySpark context. The module is available when writing, but reading in a different SparkContext than the one that wrote it fails. The imports are the same in both. Any ideas
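The archived preview stops there without a resolution. A hedged sketch of one common workaround: ship the file that defines the pickled class to the executors with sc.addPyFile so it can be resolved during deserialization. The module name and paths below are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="read-pickled-objects")

    # Ship the file that defines the pickled class so executors can import it
    # when sc.pickleFile deserializes the records (paths are placeholders).
    sc.addPyFile("/path/to/mymodule.py")
    import mymodule  # same import on the driver side

    rdd = sc.pickleFile("hdfs:///data/objects.pickle")
    print(rdd.take(1))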

Re: Scala types to StructType

2016-02-15 Thread Fabian Böhnlein
...syntax. Please tell me if that helped. On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein <fabian.boehnl...@gmail.com> wrote: Hi all, is there a way to create a Spark SQL Row schema based on Sc

Scala types to StructType

2016-02-11 Thread Fabian Böhnlein
Hi all, is there a way to create a Spark SQL Row schema based on Scala data types without writing a manual mapping? That's the only example I can find which doesn't already require spark.sql.types.DataType as input, but it requires defining them as Strings. * val struct = (new

HiveContext.read.orc - buffer size not respected after setting it

2015-12-09 Thread Fabian Böhnlein
Hello everyone, I'm hitting the below exception when reading an ORC file with the default HiveContext after setting hive.exec.orc.default.buffer.size to 1517137. See below for details. Is there another relevant buffer parameter, or another place where I could set it? Any other ideas what's going
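The preview cuts off before any resolution. A sketch, assuming the setting is forwarded through HiveContext.setConf before the read; whether the ORC reader actually honors it there is exactly the open question of the thread, and the table path is a placeholder:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-buffer-size")
    sqlContext = HiveContext(sc)

    # Forward the Hive/ORC buffer setting through the SQL context before reading.
    sqlContext.setConf("hive.exec.orc.default.buffer.size", "1517137")

    df = sqlContext.read.orc("hdfs:///warehouse/my_table_orc")  # placeholder path
    df.printSchema()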

Re: How to add jars to standalone pyspark program

2015-04-28 Thread Fabian Böhnlein
Can you specify 'running via PyCharm'? How are you executing the script, with spark-submit? In PySpark I guess you used --jars databricks-csv.jar. With spark-submit you might need the additional --driver-class-path databricks-csv.jar. Neither parameter can be set via the SparkConf object.
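A sketch (an assumption, not part of the original reply) of one way to get those flags into a script launched from an IDE such as PyCharm rather than spark-submit: set PYSPARK_SUBMIT_ARGS before the SparkContext is created. The jar location is a placeholder:

    import os

    # Inject the submit-time flags before pyspark starts the JVM gateway; the
    # trailing "pyspark-shell" marker is needed on Spark 1.4+ so the launcher
    # still starts a normal PySpark application.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--jars /path/to/databricks-csv.jar "
        "--driver-class-path /path/to/databricks-csv.jar "
        "pyspark-shell"
    )

    from pyspark import SparkContext
    sc = SparkContext(appName="standalone-with-csv-jar")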