Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?
Hey guys,

you will have a much easier time if you use :BaseKB

http://basekb.com/

The fact is that 50% of Freebase is junk, and you can't afford to load that. It pains me to see people having these problems when I load :BaseKB on a repeatable basis on relatively modest hardware. In fact, you can get it pre-loaded and be running in minutes:

https://aws.amazon.com/marketplace/pp/B00KRKRYW0

Take it from me, this is the difference between "just works" and days of suffering. Make it easy for yourself.

On Fri, Aug 22, 2014 at 11:44 AM, Jörn Hees wrote:

> Hi Hugh,
>
> thanks for the reply.
>
> I know 32 GB is probably not much considering the size of the dumps, but
> it's the size limit of our VMs :(
> So I'd be willing to live with a bit slower import and response times if I
> can still leave it on a VM.
>
> On 22 Aug 2014, at 17:51, Hugh Williams wrote:
>
> > What I would not expect though is for the memory consumption to continue
> > to increase until the server is killed due to oom error which would imply a
> > possible memory leak, which is why I recommend building with the develop/7
> > build where there have been improvements in memory management.
>
> I currently just used the stable 7.1.0 release. I'll try with the dev
> build again and report back...
>
> Cheers,
> Jörn

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF

___
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?
Hi Hugh,

thanks for the reply.

I know 32 GB is probably not much considering the size of the dumps, but it's the size limit of our VMs :(
So I'd be willing to live with a bit slower import and response times if I can still leave it on a VM.

On 22 Aug 2014, at 17:51, Hugh Williams wrote:

> What I would not expect though is for the memory consumption to continue to
> increase until the server is killed due to oom error which would imply a
> possible memory leak, which is why I recommend building with the develop/7
> build where there have been improvements in memory management.

I currently just used the stable 7.1.0 release. I'll try with the dev build again and report back...

Cheers,
Jörn
Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?
Hi Jörn,

Are you running the Virtuoso open source git [1] stable or develop/7 branch? I would recommend performing the load with the develop/7 branch if it is not already being used.

From analysis that development performed in-house earlier this year, the latest Freebase datasets need about 13,000,000 buffers, i.e. about 105 GB of RAM, to load entirely in memory; having to go to disk significantly reduces the load rate. This is because the dataset contains many large literal values and thus does not compress very well, and it also contains a lot of duplicate data, so even though it is only about 1.9 billion triples as you have seen the actual as you have observed also.

What I would not expect, though, is for the memory consumption to continue to increase until the server is killed due to an OOM error, which would imply a possible memory leak. That is why I recommend building with the develop/7 branch, where there have been improvements in memory management.

To speed up the load you should consider using faster disks (ideally SSDs) as a trade-off for insufficient memory when loading the dataset, and also database striping [2] for improved parallel I/O access to the database files if possible. Another option would be to load the dataset in 4 parts, which should give the leave enough

On our LOD Cache Cloud server [3], a 4-node cluster with 768 GB RAM and 60+ billion triples, the Freebase datasets loaded in about 1.7 hrs:

    SQL> select min(ll_started) as start, max(ll_done) as finish,
           datediff('second', min(ll_started), max(ll_done)) as delta
         from load_list
         where ll_graph like 'http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-11-17-00-00.gz';

    start                finish               delta
    TIMESTAMP            TIMESTAMP            INTEGER
    ___

    2013.12.2 22:34.9 0  2013.12.3 0:16.24 0  6135

    1 Rows. -- 74 msec.
    SQL>

On the single-server database we tested on, with 105 GB RAM, it loaded in about 2 hrs.

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.
// http://www.openlinksw.com/
Weblog -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter -- http://twitter.com/OpenLink
Google+ -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

[1] http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSGitUsage
[2] http://docs.openlinksw.com/virtuoso/dbadm.html#ini_Striping
[3] http://lod.openlinksw.com

On 22 Aug 2014, at 14:41, Jörn Hees wrote:

> Hi,
>
> TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more
> RAM than configured.
> [...]
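As a rough sanity check on the figure of 13,000,000 buffers for about 105 GB quoted in this reply, the sketch below converts between buffer counts and RAM. It assumes that each buffer holds one 8 KB page and it ignores any per-buffer overhead; the page-size constant and the helper names are assumptions made for illustration and do not come from the thread.

```python
# Rough buffer/RAM arithmetic. PAGE_BYTES is an assumption (8 KB pages,
# no per-buffer overhead); the helper names are hypothetical.
PAGE_BYTES = 8 * 1024


def buffers_to_gb(buffers: int, page_bytes: int = PAGE_BYTES) -> float:
    """RAM in decimal gigabytes needed to hold `buffers` pages."""
    return buffers * page_bytes / 1e9


def gb_to_buffers(gigabytes: float, page_bytes: int = PAGE_BYTES) -> int:
    """Whole number of pages that fit into `gigabytes` of RAM."""
    return int(gigabytes * 1e9 // page_bytes)


if __name__ == "__main__":
    needed_gb = buffers_to_gb(13_000_000)  # working set quoted in the reply
    cache_buffers = gb_to_buffers(25)      # ~25 GB used on the 32 GB VM
    print(f"13,000,000 buffers ~ {needed_gb:.1f} GB")
    print(f"25 GB holds ~ {cache_buffers:,} buffers")
    print(f"share of working set in RAM ~ {cache_buffers / 13_000_000:.0%}")
```

Under these assumptions the 13,000,000 buffers come to roughly 106 GB, in line with the quoted 105 GB, and a cache of about 25 GB covers a bit under a quarter of that working set, which would fit the heavy disk I/O reported on the 32 GB VM.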
[Virtuoso-users] importing Freebase RDF dump: slows down, memleak?
Hi,

TLDR: When importing the Freebase RDF dump, Virtuoso seems to consume way more RAM than configured.

I'm trying to load the Freebase RDF dump ( https://developers.google.com/freebase/data ) into a clean Virtuoso OpenSource 7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ GB HD free.
The dump file contains 2,656,580,382 rows (even though the page claims 1.9 billion triples; maybe outdated or dups).
Before attempting to load the whole Freebase dump, I loaded the basekb.com dump, which contained 1,205,456,739 triples, into the store, which was already filled with DBpedia, without any noticeable problem.

The Freebase dump rdf_loader_run() import starts with rapid IO rates (read and write bursts of several 100 MB/s) and quickly consumes ~25 GB of RAM as configured. It then continues to slowly consume more and more RAM, ~1 MB/minute. As this goes on, the IO rates slowly drop to some KB/s read and no or very rare writes. htop at this point shows that the process spends nearly all its time on IO wait. After a couple of days Virtuoso is finally killed by the kernel when it has consumed all RAM of the machine and wants even more.

I already tried adding 16 GB swap. This didn't help but made the machine completely unresponsive after 4 days (sshd seems to have been swapped out and never came back over a couple of hours of retries to ssh into the VM).

Whether the VM runs Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.

A colleague reports that the import works fine on a 256 GB RAM, 8 core machine with settings for 64 GB: it takes about 1 day to import, and the final DB is ~130 GB. Mine never gets to > 100 GB before Virtuoso is killed.

The instance is set up following my tutorial
http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
just substitute the DBpedia datasets with the Freebase triple dump and Wikidata links.

The virtuoso.ini values are set as suggested for 32 GB of RAM; there's nothing else running on the VM:

    [Database]
    MaxCheckpointRemap = 2000   // also tried with 62500, so ~1/4th of NumberOfBuffers as in the blogpost

    [Parameters]
    ;; Uncomment next two lines if there is 32 GB system memory free
    NumberOfBuffers = 272
    MaxDirtyBuffers = 200

As I already tried a lot of things but can't get this to work, I'd be thankful for feedback or for someone looking into why Virtuoso is consuming all of the RAM.

Cheers,
Jörn
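To put a number on the slow growth described in this report (about 1 MB/minute beyond the configured cache), one option is to sample the resident set size of the server process at intervals and fit a rate. The sketch below is one way to do that on Linux by reading VmRSS from /proc/<pid>/status; the function names are hypothetical, the process ID has to be supplied by the caller, and the projection assumes growth stays linear, which the thread does not establish.

```python
import re
from typing import List, Tuple


def parse_vmrss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kb(handle.read())


def growth_mb_per_minute(samples: List[Tuple[float, int]]) -> float:
    """Average RSS growth between the first and last (unix_seconds, rss_kb) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples taken at different times")
    return (rss1 - rss0) / 1024 / ((t1 - t0) / 60)


def minutes_until(limit_kb: int, rss_kb: int, mb_per_minute: float) -> float:
    """Minutes until RSS reaches limit_kb, assuming growth stays linear."""
    if mb_per_minute <= 0:
        return float("inf")
    return (limit_kb - rss_kb) / 1024 / mb_per_minute
```

A caller would append `(time.time(), read_rss_kb(pid))` to a list every minute or so and pass the list to `growth_mb_per_minute`. Plugging in the figures from the report (about 25 GB used, 32 GB total, about 1 MB/minute, with GB read as GiB), `minutes_until` gives roughly 7,000 minutes, i.e. about five days, which is the same order of magnitude as the days-long runs described before the kernel kills the process.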