dr0ptp4kt added a subscriber: RKemper.
dr0ptp4kt added a comment.
I ran the current version of the code as follows:

  spark3-submit --master yarn --driver-memory 16G --executor-memory 12G --executor-cores 4 \
    --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g \
    --conf spark.sql.shuffle.partitions=512 --conf spark.dynamicAllocation.maxExecutors=128 \
    --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 \
    --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
    --name wikibase-rdf-statements-spark \
    ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
    --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=wikidata_main \
    --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main \
    --num-partitions 1024

  spark3-submit --master yarn --driver-memory 16G --executor-memory 12G --executor-cores 4 \
    --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g \
    --conf spark.sql.shuffle.partitions=512 --conf spark.dynamicAllocation.maxExecutors=128 \
    --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 \
    --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
    --name wikibase-rdf-statements-spark \
    ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
    --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=scholarly_articles \
    --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol \
    --num-partitions 1024

Then I updated the group ownership so the output is readable by the analytics-search-users group:

  hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main
  hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol

From stat1006 it is possible to use the already-present `hdfs-rsync` (a script fronting a Java utility) to copy the produced files, like this:

  hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol/ file:/destination/to/nt_wd_schol_gzips/
  hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main/ file:/destination/to/nt_wd_main_gzips/

Note: each directory holds 1,024 gzip files of roughly 100 MB each, plus or minus a few MB. The Spark routine randomly samples the data before sorting it into partitions, and although every partition has data, there is mild skew, so the files don't all contain exactly the same number of records. A sketch of this sampling-based partitioning follows below.

@bking / @RKemper / @dcausse and I will discuss more this week.
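For reference on the skew note: here is a minimal Scala sketch, not the actual NTripleGenerator implementation, of how a job like this could render triples as N-Triples lines and write ~1,024 gzip files using a sampling-based partitioner such as Spark's repartitionByRange. The column names (subject, predicate, object) and the object name are assumptions for illustration only.

  // Minimal sketch (NOT the real NTripleGenerator): writes N-Triples lines
  // to ~1024 gzip files. Column names here are assumed for illustration.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, concat_ws, lit}

  object NTripleWriteSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("nt-write-sketch").getOrCreate()

      // Read one partition of the split RDF table (hypothetical schema).
      val triples = spark.read.table("discovery.wikibase_rdf_scholarly_split")
        .where(col("snapshot") === "20231016")
        .where(col("wiki") === "wikidata")
        .where(col("scope") === "wikidata_main")

      // Render each row as one N-Triples line: "<s> <p> <o> ."
      val lines = triples.select(
        concat_ws(" ", col("subject"), col("predicate"), col("object"), lit(".")).as("line"))

      // repartitionByRange picks partition boundaries from a random sample of
      // the data, so every output file gets data but sizes vary slightly --
      // consistent with the mild skew described above.
      lines.repartitionByRange(1024, col("line"))
        .write
        .option("compression", "gzip")
        .text("hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main")
    }
  }

Because the range boundaries come from a sample rather than a full sort of the data, the per-file record counts land near, but not exactly at, the average, which matches the 100 MB +/- a few MB observation.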