Re: [Announcement] Cloud data lake conference with heavy focus on open source

2020-07-07 Thread Ashley Hoff
Interesting. You've piqued my interest. Will the sessions be available after the conference? (I'm in the wrong timezone to see this during daylight hours.) On Wed, Jul 8, 2020 at 2:40 AM ldazaa11 wrote: > Hello Sparkers, > > If you’re interested in how Spark is being applied in cloud data

Re: Mocking pyspark read writes

2020-07-07 Thread Jörn Franke
Write to a local temp directory via file:// ? > On 07.07.2020 at 20:07, Dark Crusader wrote: > > Hi everyone, > > I have a function which reads and writes a parquet file from HDFS. When I'm > writing a unit test for this function, I want to mock this read & write. > > How do you achieve

Mocking pyspark read writes

2020-07-07 Thread Dark Crusader
Hi everyone, I have a function which reads and writes a parquet file from HDFS. When I'm writing a unit test for this function, I want to mock this read & write. How do you achieve this? Any help would be appreciated. Thank you.
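A minimal sketch of the suggestion made in the reply (pointing the function at a local temp directory via file:// instead of HDFS), assuming the function under test takes its input and output paths as arguments. The names transform_parquet, local_uri and example_test are hypothetical and do not come from the thread; the SparkSession is passed in by the caller, so the sketch itself imports nothing from pyspark.

```python
import tempfile
from pathlib import Path


def transform_parquet(spark, input_path, output_path):
    """Hypothetical function under test: read a parquet file and write it back."""
    df = spark.read.parquet(input_path)
    df.write.mode("overwrite").parquet(output_path)


def local_uri(directory, name):
    """Build a file:// URI for `name` inside a local directory."""
    return (Path(directory).resolve() / name).as_uri()


def example_test(spark):
    """Exercise transform_parquet against a local temp directory instead of HDFS.

    `spark` is assumed to be a local SparkSession created by the test setup.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = local_uri(tmp, "input.parquet")
        dst = local_uri(tmp, "output.parquet")

        # Seed a small input file, run the function, then read the result back.
        spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).write.parquet(src)
        transform_parquet(spark, src, dst)
        assert spark.read.parquet(dst).count() == 2
```

This only works if the HDFS paths are not hard-coded inside the function; if they are, passing them in as parameters (or reading them from configuration) is the change that makes the test possible.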

[Announcement] Cloud data lake conference with heavy focus on open source

2020-07-07 Thread ldazaa11
Hello Sparkers, If you’re interested in how Spark is being applied in cloud data lake environments, then you should check out a new 1-day LIVE, virtual conference on July 30. This conference is called Subsurface and the focus is technical talks tailored specifically for data architects and

Re: When does SparkContext.defaultParallelism have the correct value?

2020-07-07 Thread Sean Owen
If not set explicitly with spark.default.parallelism, it will default to the number of cores currently available (minimum 2). At the very start, some executors haven't completed registering, which I think explains why it goes up after a short time. (In the case of dynamic allocation it will change
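As a rough illustration of the behaviour described in this reply, the sketch below polls sc.defaultParallelism until the value stops changing. The helper name wait_for_stable_parallelism and its parameters are hypothetical, and a stable reading is only a heuristic: it does not prove that every executor has registered, and under dynamic allocation the value can still change later. Setting spark.default.parallelism explicitly avoids the question altogether.

```python
import time


def wait_for_stable_parallelism(sc, checks=3, interval_s=1.0, timeout_s=30.0,
                                sleep=time.sleep, clock=time.monotonic):
    """Return sc.defaultParallelism once it has repeated `checks` times in a row.

    Falls back to the most recent reading if `timeout_s` elapses first.
    `sc` is assumed to be a SparkContext (anything with a defaultParallelism
    attribute works, which keeps the helper easy to test).
    """
    deadline = clock() + timeout_s
    last = sc.defaultParallelism
    stable = 1
    while stable < checks and clock() < deadline:
        sleep(interval_s)
        current = sc.defaultParallelism
        if current == last:
            stable += 1
        else:
            # The value moved, so start counting again from this reading.
            last, stable = current, 1
    return last
```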

ANALYZE command not supported on Spark 2.3.2?

2020-07-07 Thread daniel123
Does anyone know if ANALYZE TABLE is supported on Spark 2.3.2? The command doesn't appear in the documentation (spark.apache.org/docs/2.3.2/sql-programming-guide.html), although we can launch it, with strange results. The ANALYZE TABLE job takes hours and doesn't launch any executors, it just runs in

how to disable hivemetastore connection

2020-07-07 Thread iamabug
Hi community, I am running hundreds of Spark jobs at the same time, which causes the number of Hive Metastore connections to be very high (> 1K). Since the jobs do not really use HMS, I would like to disable that. I have tried setting