dr0ptp4kt added a comment.

  TL;DR: this is about 45% done.
  
  This week I worked on Spark runs that performed poorly and often hung or 
crashed. Last night I got this running better: a reduction (the equivalent of 
`val_triples_only_used_by_sas` from 
https://people.wikimedia.org/~andrewtavis-wmde/T342111_spark_sa_subgraph_metrics.html
 ) now completes in 8 minutes in one pass, instead of taking 3 hours or, worse, 
running even longer and then hanging indefinitely or crashing.
  
  Two things were key. First, higher resource limits (this seems obvious, but 
raising limits doesn't always help) together with an attempt to stop Spark from 
using broadcast joins (judging by the DAGs in the Spark web UI it still tries 
to do them, but at least it no longer seems to do them at bad times).
  
    "spark.driver.memory": "16g",
    "spark.driver.cores": 2,
    "spark.executor.memory": "12g",
    "spark.executor.cores": 4,
    "spark.executor.memoryOverhead": "4g",
    "spark.sql.shuffle.partitions": 512,
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.locality.wait": "1s",  # test 0
    "spark.sql.autoBroadcastJoinThreshold": -1,
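  
  For anyone wanting to try the same settings, here is a minimal sketch of one 
way to apply them. The helper names `to_submit_args` and `build_session` and 
the application name are hypothetical, not taken from the actual job. Driver 
settings such as `spark.driver.memory` generally only take effect if they are 
set before the driver JVM starts, so the `spark-submit` form may be the safer 
of the two.

```python
# Settings copied from the comment above; everything else is illustrative.
SPARK_CONF = {
    "spark.driver.memory": "16g",
    "spark.driver.cores": 2,
    "spark.executor.memory": "12g",
    "spark.executor.cores": 4,
    "spark.executor.memoryOverhead": "4g",
    "spark.sql.shuffle.partitions": 512,
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.locality.wait": "1s",
    # -1 disables automatic broadcast joins based on table size.
    "spark.sql.autoBroadcastJoinThreshold": -1,
}


def to_submit_args(conf):
    """Render the settings as a list of spark-submit `--conf key=value` flags."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(conf, app_name="subgraph-metrics-sketch"):
    """Build a SparkSession with the settings applied (requires pyspark)."""
    # Imported lazily so that to_submit_args works without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, str(value))
    return builder.getOrCreate()


if __name__ == "__main__":
    print(" ".join(to_submit_args(SPARK_CONF)))
```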
  
  Second, removing the `cache()` calls and defining some of the join tables as 
their own DataFrames. In practice this likely means more disk-based merge 
behavior on the executors for huge joins, but it works better. I'm interested 
in exploring bucketing as an optimization strategy, but may forgo it for 
producing the table, as it doesn't seem necessary at the moment. It may, 
however, be useful in the produced table for people doing further join 
operations, so I am thinking about this.
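  
  As a rough illustration of what bucketing the output could look like in 
PySpark, here is a sketch using the `DataFrameWriter` methods `bucketBy`, 
`sortBy` and `saveAsTable`. The function name, table name, key column and 
bucket count are all assumptions for the example; nothing here has been chosen 
for the real table yet.

```python
def write_bucketed(df, table_name, key_column, num_buckets=512):
    """Write `df` as a Parquet table bucketed and sorted by `key_column`.

    Spark only supports bucketBy together with saveAsTable, not with a plain
    path-based save, so the output has to be a metastore table.
    """
    (
        df.write
        .bucketBy(num_buckets, key_column)  # hash rows into a fixed number of buckets
        .sortBy(key_column)                 # sort rows within each bucket
        .mode("overwrite")
        .format("parquet")
        .saveAsTable(table_name)
    )
```

  A later join between two tables bucketed the same way on the join key can 
often skip the shuffle for that join, which is the potential benefit for 
downstream users.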
  
  Last night I had the small reduction writing to a Parquet directory in HDFS. 
Next I will check how performant and reliable it is to write a larger data 
set, and will report back here. From there I'll port from Python to Scala.

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331