Hi, I am currently working on my first Spark app (Spark 2.4.3, running standalone on a single-node cluster configured with 6 workers). I am coding this app in Python/PySpark.
The app creates a daily delta data set from a TSQL database (Sybase ASE). This data set returns new, updated, and deleted records. The process is: load a reference Parquet file into a DataFrame, load the table from TSQL into a second DataFrame and hash it, then create three separate DataFrames holding the three kinds of change. These last three DataFrames are then unioned together, repartitioned into 1 partition, and written out as Parquet.

I have noticed a bit of weirdness around the connections to the TSQL server/database/table. Occasionally it will query the table with anything up to 6 separate connections (all running the same query - Select <all tables> from....). At other times it makes only 1 or 2 connections (though if I restrict it back to 1 worker, it is consistently only 1 connection, as I would have expected).

How does Spark determine how many connections to make? Is it only worker-dependent (which I originally thought)? Or is there something else determining how many times to query the data? Ideally, I'd like to restrict the number of database calls as much as I can. Any insight would be appreciated.

Thanks
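For reference, the new/updated/deleted split is conceptually something like the following (a plain-Python sketch with made-up keys and values, not my actual Spark code - in the real job each side is a DataFrame keyed on the primary key with a hash of the remaining columns):

```python
import hashlib

def row_hash(row):
    # Hash the non-key columns joined into one string.
    return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()

# Yesterday's reference data and today's extract, keyed by primary key.
reference = {1: row_hash(("a",)), 2: row_hash(("b",))}
current = {2: row_hash(("b2",)), 3: row_hash(("c",))}

# Key present today but not in the reference -> new record.
new = {k for k in current if k not in reference}
# Key present in the reference but gone today -> deleted record.
deleted = {k for k in reference if k not in current}
# Key present in both, but the hash changed -> updated record.
updated = {k for k in current if k in reference and current[k] != reference[k]}
```

In the real job the three results are unioned and written as a single Parquet partition.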
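For what it's worth, my current understanding of the JDBC read options that control read parallelism is sketched below (URL, table, and column names are placeholders, not my actual config). Without the partitioning options, `spark.read.jdbc` should use a single partition and therefore a single connection; with them, Spark opens roughly one connection per partition:

```python
# Hypothetical JDBC read options; partitionColumn must be a numeric,
# date, or timestamp column, and lowerBound/upperBound bound its range.
jdbc_options = {
    "url": "jdbc:sybase:Tds:myhost:5000/mydb",  # placeholder URL
    "dbtable": "my_table",                      # placeholder table
    "partitionColumn": "id",
    "lowerBound": "1",
    "upperBound": "1000000",
    "numPartitions": "6",  # up to 6 concurrent connections to the DB
}

# In the actual app this would be:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```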