Which metastore are you using?

On Thu, Apr 25, 2019 at 09:02, Juho Autio <juho.au...@rovio.com> wrote:
> Would anyone be able to answer this question about the non-optimal
> implementation of insertInto?
>
> On Thu, Apr 18, 2019 at 4:45 PM Juho Autio <juho.au...@rovio.com> wrote:
>
>> Hi,
>>
>> My job writes ~10 partitions with insertInto. With the same input /
>> output data, the total duration of the job varies greatly depending on
>> how many partitions the target table has.
>>
>> Target table with 10 partitions:
>> 1 min 30 s
>>
>> Target table with ~10,000 partitions:
>> 13 min 0 s
>>
>> It seems that Spark always fetches the full list of partitions in the
>> target table. While this happens, the cluster is essentially idle: the
>> driver is busy listing partitions.
>>
>> Here's a thread dump of the driver taken during such an idle period:
>> https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20
>>
>> Is there any way to optimize this currently? Is this a known issue? Are
>> there any plans to improve it?
>>
>> My code is essentially:
>>
>> spark = SparkSession.builder \
>>     .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER') \
>>     .config('hive.exec.dynamic.partition', 'true') \
>>     .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
>>     .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
>>     .enableHiveSupport() \
>>     .getOrCreate()
>>
>> out_df.write \
>>     .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
>>     .insertInto(target_table_name, overwrite=True)
>>
>> The table was originally created from Spark with saveAsTable.
>>
>> Does Spark even need to know anything about the existing partitions,
>> though? As a manual workaround I could write the files directly to the
>> partition locations, first deleting any existing files in each target
>> partition, and then call the metastore with ALTER TABLE ... ADD IF NOT
>> EXISTS PARTITION. That doesn't require any prior knowledge of the
>> existing partitions.
>>
>> Thanks.
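For reference, here is a minimal sketch of the manual workaround described at the end of the quoted mail, assuming a single string partition column dt and a Hive-style directory layout. out_df and the target table come from the thread; target_path, partition_values, and the literal table name target_table are illustrative assumptions, not from the original mails. This bypasses insertInto entirely, so it never triggers the full partition listing:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Assumed table root location; in practice read it from the table metadata.
    target_path = 's3://my-bucket/warehouse/target_table'

    # partition_values: the ~10 partition values this job produces (assumption).
    for dt in partition_values:
        part_path = '{}/dt={}'.format(target_path, dt)
        # mode('overwrite') clears this one partition directory before writing,
        # which covers the "delete existing files first" step of the workaround.
        out_df.where(out_df.dt == dt).drop('dt') \
            .write.mode('overwrite').parquet(part_path)
        # Register the partition in the metastore; this call does not require
        # listing the table's existing partitions.
        spark.sql(
            "ALTER TABLE target_table ADD IF NOT EXISTS "
            "PARTITION (dt='{}') LOCATION '{}'".format(dt, part_path)
        )

Note that unlike partitionOverwriteMode=dynamic, this path skips the output committer's atomicity guarantees for each partition, so it trades some safety for avoiding the partition listing.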