Hello all,
We noticed a HUGE difference between using PySpark and Spark in Scala. The same test runs:

* on my work computer in +-350 seconds (PySpark)
* on my home computer in +-130 seconds (PySpark, Windows Defender enabled)
* on my home computer in +-105 seconds (PySpark, Windows Defender disabled)
* on my home computer in +-7 seconds (the equivalent Scala code)

The script:

    def setUp(self):
        self.left = self.parallelize([
            ('Wim', 46),
            ('Klaas', 18)
        ]).toDF('name: string, age: int')
        self.right = self.parallelize([
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int')

    def test_simple_union(self):
        sut = self.left.union(self.right)
        self.assertDatasetEquals(sut, self.parallelize([
            ('Wim', 46),
            ('Klaas', 18),
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int'))
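For context, `parallelize` and `assertDatasetEquals` come from our test base class, which is not shown above. Below is a minimal self-contained sketch of what that base class does; the helper bodies are assumptions about its behavior (a wrapper around SparkContext.parallelize and an order-insensitive row comparison), not the actual code, but they are enough to run the test methods above as a subclass:

    import unittest
    from pyspark.sql import SparkSession

    class SparkTestCase(unittest.TestCase):
        """Minimal stand-in for the test base class; assumed behavior only."""

        @classmethod
        def setUpClass(cls):
            # A local SparkSession; creating it also enables RDD.toDF.
            cls.spark = (SparkSession.builder
                         .master('local[*]')
                         .appName('union-test')
                         .getOrCreate())

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def parallelize(self, data):
            # Assumed to wrap SparkContext.parallelize, returning an RDD
            # on which .toDF('name: string, age: int') can be chained.
            return self.spark.sparkContext.parallelize(data)

        def assertDatasetEquals(self, actual, expected):
            # Assumed to compare the collected rows, ignoring row order.
            self.assertEqual(sorted(actual.collect()),
                             sorted(expected.collect()))

The setUp and test_simple_union methods above then live in a subclass of SparkTestCase, and the timings quoted were measured on that test.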