dr0ptp4kt added a comment.

  **About Amazon Neptune**
  
  Amazon Neptune was set to import the simpler N-Triples file format, using its serverless configuration at 128 NCUs (about 256 GB of RAM, with some attendant CPU). We don't use N-Triples files in our existing import process, but it is the sort of format used in the graph split imports.
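  
  For a sense of the format, N-Triples is one complete triple per line, each line terminated with a period. An illustrative (hypothetical) line from the lexemes dump might look something like:
  
    <http://www.wikidata.org/entity/L42> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#Lexeme> .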
  
    curl -v -X POST \
        -H 'Content-Type: application/json' \
        https://db-neptune-1.cluster-cnim20k6c0mh.us-west-2.neptune.amazonaws.com:8182/loader -d '
        {
          "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
          "format" : "ntriples",
          "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3",
          "region" : "us-west-2",
          "failOnError" : "FALSE",
          "parallelism" : "HIGH",
          "updateSingleCardinalityProperties" : "FALSE",
          "queueRequest" : "TRUE"
        }'
  
  This required a bunch of grants, and I had to make my personal bucket hosting the file listable and readable, as well as the objects within it listable and readable (it's possible to do chained IAM grants instead, but that is a bit of work and requires somewhat complicated STS setup). It appeared that it was also necessary to create the VPC endpoint as described in the documentation.
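  
  For reference, the S3 gateway endpoint can be created along these lines (a sketch; the VPC and route table IDs here are placeholders, not the actual ones used):
  
    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-EXAMPLE \
        --service-name com.amazonaws.us-west-2.s3 \
        --route-table-ids rtb-EXAMPLE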
  
  This was started at 1:30 PM CT on Monday, February 26, 2024. Note that this is the lexemes dump; the goal here is to verify that 128 NCUs imports faster than 32 NCUs, because if it does, that will be useful for the bigger dump.
  
    curl -v -X POST \
        -H 'Content-Type: application/json' \
        https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
        {
          "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
          "format" : "ntriples",
          "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
          "region" : "us-west-2",
          "failOnError" : "FALSE",
          "parallelism" : "OVERSUBSCRIBE",
          "updateSingleCardinalityProperties" : "FALSE",
          "queueRequest" : "TRUE"
        }'
    
    
    {
        "status" : "200 OK",
        "payload" : {
            "loadId" : "8ace45ed-2989-4fd4-aa19-d13b9a59e824"
        }
    }
    
    curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/8ace45ed-2989-4fd4-aa19-d13b9a59e824'
    
    
    {
        "status" : "200 OK",
        "payload" : {
            "feedCount" : [
                {
                    "LOAD_COMPLETED" : 1
                }
            ],
            "overallStatus" : {
                "fullUri" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
                "runNumber" : 1,
                "retryNumber" : 0,
                "status" : "LOAD_COMPLETED",
                "totalTimeSpent" : 2142,
                "startTime" : 1708975752,
                "totalRecords" : 163715491,
                "totalDuplicates" : 141148,
                "parsingErrors" : 0,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 0
            }
        }
    }
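  
  For scale, that is 163,715,491 records in 2,142 seconds (about 36 minutes), or roughly 76,000 records per second for the lexemes dump.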
  
  Now, for the full Wikidata load. This was started at about 2:20 PM CT on 
Monday, February 26, 2024.
  
    curl -v -X POST \
        -H 'Content-Type: application/json' \
        https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
        {
          "source" : "s3://blazegraphdump/latest-all.nt.bz2",
          "format" : "ntriples",
          "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
          "region" : "us-west-2",
          "failOnError" : "FALSE",
          "parallelism" : "OVERSUBSCRIBE",
          "updateSingleCardinalityProperties" : "FALSE",
          "queueRequest" : "TRUE"
        }'
    
    {
        "status" : "200 OK",
        "payload" : {
            "loadId" : "54dc9f5a-6e3c-428d-8897-180e10c96dbf"
        }
    }
    
    
    curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'
  
  As a frame of reference, over 9B records imported in a bit over 26 hours. This is in the ballpark of what's discussed in https://hal.science/hal-03132794/document , and remember that this is serverless and probably running on lower clock speed vCPUs and possibly lower rated RAM. A higher powered provisioned instance is likely to go faster.
  
    {
        "status" : "200 OK",
        "payload" : {
            "feedCount" : [
                {
                    "LOAD_IN_PROGRESS" : 1
                }
            ],
            "overallStatus" : {
                "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
                "runNumber" : 1,
                "retryNumber" : 0,
                "status" : "LOAD_IN_PROGRESS",
                "totalTimeSpent" : 101809,
                "startTime" : 1708978871,
                "totalRecords" : 9312600000,
                "totalDuplicates" : 145923094,
                "parsingErrors" : 0,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 0
            }
        }
    }
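  
  The status response above shows 9,312,600,000 records loaded in 101,809 seconds of load time, a sustained rate of roughly 91,000 records per second at that point.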
  
  The following was the final outcome: about 63 hours for this very full dump.
  
    ubuntu@ip-172-31-23-68:~$ curl 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'
    {
        "status" : "200 OK",
        "payload" : {
            "feedCount" : [
                {
                    "LOAD_COMPLETED" : 1
                }
            ],
            "overallStatus" : {
                "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
                "runNumber" : 1,
                "retryNumber" : 0,
                "status" : "LOAD_COMPLETED",
                "totalTimeSpent" : 227384,
                "startTime" : 1708978871,
                "totalRecords" : 19558358339,
                "totalDuplicates" : 341740320,
                "parsingErrors" : 0,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 0
            }
        }
    }
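  
  For throughput context: 19,558,358,339 records in 227,384 seconds of load time works out to a sustained rate of roughly 86,000 records per second across the whole import.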
  
  This style of query works fine, although query caching and pre-warming would be advisable through other means if ever running such a service in the real world.
  
    curl -X POST --data-binary 'query=select ?s ?p ?o where {?s ?p ?o} limit 10' \
        'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/sparql'
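  
  As a hedged sanity check after such a load (assuming the same endpoint), a simple count query can confirm the triple count, though a full count at this scale is itself an expensive query:
  
    curl -X POST --data-binary 'query=SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
        'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/sparql'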
  
  Next, we try a provisioned instance using a high powered db.x2iedn.24xlarge. The dropdown in the Create Database dialog claims 128 vCPU and 4,096 GB of RAM, but its info panel says otherwise: instance class db.x2iedn.24xlarge, current generation, 96 vCPU, 3,072 GiB of memory. The info panel matches what the db.x2iedn.24xlarge is said to be at https://aws.amazon.com/ec2/instance-types/ . This is not the sort of machine we'd run for more than an import, as the costs would be prohibitive, but it is worth exploring. https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-general-basic.html#best-practices-loader-tempinstance describes precisely how to go about this sort of temporary-instance load.
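  
  In that temporary-instance pattern, the writer is scaled back down once the load completes; a sketch (the instance identifier and target class here are hypothetical):
  
    aws neptune modify-db-instance \
        --db-instance-identifier db-neptune-2-instance-1 \
        --db-instance-class db.r6g.4xlarge \
        --apply-immediately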
  
  This was started at about 2:39 PM on February 29, 2024:
  
    curl -v -X POST \
        -H 'Content-Type: application/json' \
        https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
        {
          "source" : "s3://blazegraphdump/latest-all.nt.bz2",
          "format" : "ntriples",
          "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
          "region" : "us-west-2",
          "failOnError" : "FALSE",
          "parallelism" : "OVERSUBSCRIBE",
          "updateSingleCardinalityProperties" : "FALSE",
          "queueRequest" : "TRUE"
        }'
    
    
    {
        "status" : "200 OK",
        "payload" : {
            "loadId" : "75b910c5-25d5-4257-8e9f-8a91f8c2df88"
        }
    }
    
    
    watch -n 60 curl -G 'https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/75b910c5-25d5-4257-8e9f-8a91f8c2df88'
  
  After running for a while, we simply stopped it. At this pace it would have taken about 14 hours to import 9 billion triples. That would be faster than serverless, but not so much so as to justify continuing with this particular import. CPU utilization on this powerful provisioned node was hovering around 20%, which is lower than expected; keep in mind that serverless was using 50-70+% once it got going. Here was the point where we stopped it.
  
    ubuntu@ip-172-31-23-68:~$ curl 'https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/75b910c5-25d5-4257-8e9f-8a91f8c2df88'
    {
        "status" : "200 OK",
        "payload" : {
            "feedCount" : [
                {
                    "LOAD_IN_PROGRESS" : 1
                }
            ],
            "overallStatus" : {
                "fullUri" : "s3://blazegraphdump/latest-all.nt.bz2",
                "runNumber" : 1,
                "retryNumber" : 0,
                "status" : "LOAD_IN_PROGRESS",
                "totalTimeSpent" : 7720,
                "startTime" : 1709239199,
                "totalRecords" : 1311570000,
                "totalDuplicates" : 8806377,
                "parsingErrors" : 0,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 0
            }
        }
    }
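  
  Checking the arithmetic: 1,311,570,000 records in 7,720 seconds is roughly 170,000 records per second, so 9 billion triples would take about 14.5 hours at that pace, consistent with the estimate above. The stop itself can go through the same loader endpoint; a sketch using the loadId from above:
  
    curl -X DELETE \
        'https://db-neptune-2.cluster-cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/75b910c5-25d5-4257-8e9f-8a91f8c2df88'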
  
  Note that, with the exception of storage, there was an allowance for Amazon Neptune serverless: up to 750 free compute hours are available for an account in the first month of use of Amazon Neptune serverless (this appeared to be fanned out across multiple NCUs somehow). The policies for this sort of usage may be subject to change, so anyone attempting to replicate this should pay close attention to hourly billing and budget accordingly (e.g., up to $500/day during import). Needs after an import could vary dramatically from use case to use case, ranging from comparatively affordable for such a big database to comparatively more expensive. It is possible to migrate workloads and adjust configuration after import.
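  
  For example, the serverless capacity range can be narrowed after import to bound ongoing cost; a sketch (the cluster identifier and capacity bounds here are illustrative):
  
    aws neptune modify-db-cluster \
        --db-cluster-identifier db-neptune-1 \
        --serverless-v2-scaling-configuration MinCapacity=1,MaxCapacity=32 \
        --apply-immediately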
