dr0ptp4kt added a comment.
**About Amazon Neptune**

Amazon Neptune was set to import using the simpler N-Triples file format with its serverless configuration at 128 NCUs (about 256 GB of RAM, with some attendant CPU). We don't use N-Triples files in our existing import process, but it is the sort of format used in the graph split imports.

```
curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1.cluster-cnim20k6c0mh.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "HIGH",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'
```

This required a number of grants: I had to make my personal bucket hosting the file listable and readable, as well as the objects within it (chained IAM grants are possible, but they take a bit of work and require somewhat complicated STS setup). It also appeared necessary to create the VPC endpoint as described in the documentation.

This was started at 1:30 PM CT on Monday, February 26, 2024. Note that this is the lexemes dump. I'm trying here to verify that it goes faster with 128 NCUs than with 32 NCUs, because if it does, that will be useful for the bigger dump.
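As a sketch of the kind of S3 access this needed, a bucket policy along these lines would make the bucket and its objects listable and readable to the loader role. This JSON is hypothetical: the role ARN mirrors the placeholder above, and in practice the grants were configured by hand rather than copied from this exact policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "NeptuneLoaderRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::blazegraphdump",
        "arn:aws:s3:::blazegraphdump/*"
      ]
    }
  ]
}
```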
```
curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "8ace45ed-2989-4fd4-aa19-d13b9a59e824"
    }
}

curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/8ace45ed-2989-4fd4-aa19-d13b9a59e824'

{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 2142,
            "startTime" : 1708975752,
            "totalRecords" : 163715491,
            "totalDuplicates" : 141148,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
```

Now, for the full Wikidata load. This was started at about 2:20 PM CT on Monday, February 26, 2024.

```
curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-all.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "54dc9f5a-6e3c-428d-8897-180e10c96dbf"
    }
}

curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'
```

As a frame of reference, over 9B records imported in a bit over 26 hours.
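As a quick back-of-the-envelope check of those rates (figures taken from the lexemes LOAD_COMPLETED payload and the 9B-records-in-26-hours observation above; rounded integer arithmetic only):

```shell
# Lexemes dump: 163,715,491 records in 2,142 seconds of loader time
echo $((163715491 / 2142))           # prints 76431 (records/second)

# Full dump so far: roughly 9B records in about 26 hours
echo $((9000000000 / (26 * 3600)))   # prints 96153 (records/second)
```

So the bigger load was sustaining a somewhat higher record rate than the lexemes run, presumably helped by the longer steady-state portion of the import.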
This is in the ballpark of what's discussed in https://hal.science/hal-03132794/document , and remember this is serverless and is probably using lower clock speed vCPUs and possibly lower rated RAM. A higher powered provisioned instance is likely to go faster.

```
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_IN_PROGRESS" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_IN_PROGRESS",
            "totalTimeSpent" : 101809,
            "startTime" : 1708978871,
            "totalRecords" : 9312600000,
            "totalDuplicates" : 145923094,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
```

Following was the final outcome: 63 hours for this very full dump.

```
ubuntu@ip-172-31-23-68:~$ curl 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 227384,
            "startTime" : 1708978871,
            "totalRecords" : 19558358339,
            "totalDuplicates" : 341740320,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
```

This style of query works fine, although query caching and pre-warmup would be advisable through other means if ever running such a service in the real world.
```
curl -X POST --data-binary 'query=select ?s ?p ?o where {?s ?p ?o} limit 10' 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/sparql'
```

Next, we try a provisioned instance using a high powered db.x2iedn.24xlarge. The dropdown in the Create Database dialog lists it as 128 vCPU and 4,096 GB of RAM, but its info panel says otherwise: "Instance class db.x2iedn.24xlarge | Current generation | vCPU 96 | Memory 3072 GiB". Indeed, the latter is what the db.x2iedn.24xlarge is said to be at https://aws.amazon.com/ec2/instance-types/ . This is not the sort of machine we'd run for more than an import, as the costs would be prohibitive, but it is worth exploring. https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-general-basic.html#best-practices-loader-tempinstance describes precisely how to go about doing this sort of thing.

This was started at about 2:39 PM on February 29, 2024:

```
curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-all.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "75b910c5-25d5-4257-8e9f-8a91f8c2df88"
    }
}

watch -n 60 curl -G 'https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/75b910c5-25d5-4257-8e9f-8a91f8c2df88'
```

After running for a while, we just stopped it. At this pace it would take about 14 hours to import 9 billion triples. That would be faster than serverless, but not so much so as to justify continuing with this particular import. CPU utilization when using this powerful provisioned node was hovering around 20%, which is lower than expected.
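A rough check of that ~14-hour estimate, using the totalRecords and totalTimeSpent from the stop-point status payload shown below:

```shell
# Provisioned instance pace: 1,311,570,000 records in 7,720 seconds
rate=$((1311570000 / 7720))
echo $rate                              # prints 169892 (records/second)

# Extrapolated wall time to load ~9 billion triples at that rate
echo $((9000000000 / rate / 3600))      # prints 14 (hours)
```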
Keep in mind that serverless was using 50%-70+% once it got going. Here was the point where we stopped it.

```
ubuntu@ip-172-31-23-68:~$ curl 'https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/75b910c5-25d5-4257-8e9f-8a91f8c2df88'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_IN_PROGRESS" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_IN_PROGRESS",
            "totalTimeSpent" : 7720,
            "startTime" : 1709239199,
            "totalRecords" : 1311570000,
            "totalDuplicates" : 8806377,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
```

Note that, with the exception of storage, there was an allowance for Amazon Neptune serverless: up to 750 free compute hours are available for an account in the first month of use (this appeared to be fanned out across multiple NCUs somehow). The policies for this sort of usage may be subject to change, so any attempt to replicate this should pay close attention to hourly billing and budget for the worst case (e.g., up to $500/day during import). Post-import needs could vary dramatically from use case to use case, ranging from comparatively affordable for such a big database to comparatively more expensive. It is possible to migrate workloads and adjust configuration after import.

TASK DETAIL
https://phabricator.wikimedia.org/T359062
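For a sense of where a figure like $500/day comes from, here is a hypothetical cost sketch. The ~$0.1608 per NCU-hour rate is an assumption, not taken from this task; check current Amazon Neptune serverless pricing before relying on it.

```shell
# ASSUMPTION: ~$0.1608 per NCU-hour (verify against current Neptune pricing).
# 128 NCUs running around the clock for one day:
awk 'BEGIN { printf "%.0f\n", 128 * 24 * 0.1608 }'   # prints 494 (USD/day)
```

That lands right around the $500/day ceiling mentioned above, which is why import-phase budgeting matters at this NCU count.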