dr0ptp4kt added a comment.
TL;DR: this is about 45% done.

This week I worked on Spark runs that were performing poorly and often hanging or crashing. Last night I got this running better: it now produces a reduction (the equivalent of `val_triples_only_used_by_sas` from https://people.wikimedia.org/~andrewtavis-wmde/T342111_spark_sa_subgraph_metrics.html ) in 8 minutes in one pass, instead of 3 hours or, worse, something longer followed by an indefinite hang or crash.

Two things made the difference.

First, higher resource limits (this seems obvious, but it doesn't always help) and attempting to prevent Spark from doing broadcast joins. Judging by the DAGs in the Spark web UI it still tries to do them, but at least it doesn't seem to do them at bad times.

    "spark.driver.memory": "16g",
    "spark.driver.cores": 2,
    "spark.executor.memory": "12g",
    "spark.executor.cores": 4,
    "spark.executor.memoryOverhead": "4g",
    "spark.sql.shuffle.partitions": 512,
    'spark.dynamicAllocation.maxExecutors': 128,
    'spark.locality.wait': '1s',  # test 0
    'spark.sql.autoBroadcastJoinThreshold': -1

Second, removing `cache()` calls and setting some join tables up as their own DataFrames. In practice this likely means more disk-based merge behavior on the executors for huge joins, but it works better.

I'm interested in exploring bucketing as an optimization strategy, but may forgo it when producing the table, as it doesn't seem necessary at the moment. It may, however, be useful in the produced table for people doing further join operations, so I'm thinking about it.

I had the small reduction writing to a Parquet directory in HDFS last night. Next I'll see how performant and reliable writing a larger data set is, and will report back here. From there I'll port from Python to Scala.
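For anyone wanting to try the same settings, here is a minimal sketch of how they might be applied when building a PySpark session. The setting names and values are the ones quoted above; the names `SPARK_CONF` and `build_session` and the application name are hypothetical, and the actual job may well pass its configuration differently (for example through a notebook or job-submission wrapper).

```python
# Settings quoted in the comment above, collected in one dict.
SPARK_CONF = {
    "spark.driver.memory": "16g",
    "spark.driver.cores": 2,
    "spark.executor.memory": "12g",
    "spark.executor.cores": 4,
    "spark.executor.memoryOverhead": "4g",
    "spark.sql.shuffle.partitions": 512,
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.locality.wait": "1s",
    # -1 turns off size-based automatic broadcast joins.
    "spark.sql.autoBroadcastJoinThreshold": -1,
}


def build_session(app_name="sa_subgraph_reduction", conf=SPARK_CONF):
    """Build a SparkSession with the given settings (hypothetical helper)."""
    # Imported lazily so the dict above can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        # Values are passed as strings, which Spark parses per setting.
        builder = builder.config(key, str(value))
    return builder.getOrCreate()
```

Note that driver-side settings such as `spark.driver.memory` only take effect if they are supplied before the driver JVM starts, so in some launch modes they need to go on the submit command rather than in the session builder.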
TASK DETAIL https://phabricator.wikimedia.org/T347989