
Hello all,


We noticed a HUGE performance difference between PySpark and Spark in Scala.
The same job runs:

  *   on my work computer in ~350 seconds
  *   on my home computer in ~130 seconds (Windows Defender enabled)
  *   on my home computer in ~105 seconds (Windows Defender disabled)
  *   on my home computer as Scala code in ~7 seconds

The PySpark test:
def setUp(self):
    # parallelize and assertDatasetEquals are helpers from our test
    # base class, not PySpark itself.
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')

    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.parallelize([
        ('Wim', 46),
        ('Klaas', 18),
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int'))
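
For reference, parallelize and assertDatasetEquals are not PySpark APIs; they presumably come from a custom unittest base class. A minimal sketch of what such a harness could look like (the class name and session settings here are assumptions, not the author's actual code):

import unittest

from pyspark.sql import SparkSession

class SparkTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkSession shared by every test in the class
        # (hypothetical setup; the actual harness may differ).
        cls.spark = SparkSession.builder.master('local[*]').getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def parallelize(self, rows):
        # Distribute plain Python tuples as an RDD; .toDF(...) works on
        # the result because an active SparkSession is present.
        return self.spark.sparkContext.parallelize(rows)

    def assertDatasetEquals(self, actual, expected):
        # Order-insensitive comparison: collect both DataFrames and
        # compare the sorted rows.
        self.assertEqual(sorted(actual.collect()), sorted(expected.collect()))

Sorting before comparing keeps the assertion independent of row and partition ordering.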

