Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Paul Houle
Hey guys,  you will have a much easier time if you use :BaseKB

http://basekb.com/

The fact is that 50% of Freebase is junk,  and you can't afford to load
that.

It pains me to see people having these problems when I load :BaseKB on a
repeatable basis on relatively modest hardware,  and in fact,  you can get
it pre-loaded and be running in minutes:

https://aws.amazon.com/marketplace/pp/B00KRKRYW0

Take it from me,  this is the difference between "just works" and days of
suffering.  Make it easy for yourself.


On Fri, Aug 22, 2014 at 11:44 AM, Jörn Hees  wrote:

> Hi Hugh,
>
> thanks for the reply.
>
> I know 32 GB is probably not much considering the size of the dumps, but
> it's the size limit of our VMs :(
> So i'd be willing to live with a bit slower import and response times if i
> can still leave it on a VM.
>
> On 22 Aug 2014, at 17:51, Hugh Williams  wrote:
>
> > What I would not expect though is for the memory consumption to continue
> > to increase until the server is killed due to oom error which would imply a
> > possible memory leak, which is why I recommend building with the develop/7
> > build where there have been improvements in memory management.
>
> I currently just used the stable 7.1.0 release. I'll try with the dev
> build again and report back...
>
> Cheers,
> Jörn
>
>
>
> --
> Slashdot TV.
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> ___
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype    ontolo...@gmail.com


Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Jörn Hees
Hi Hugh,

thanks for the reply.

I know 32 GB is probably not much considering the size of the dumps, but it's 
the size limit of our VMs :(
So i'd be willing to live with a bit slower import and response times if i can 
still leave it on a VM.

On 22 Aug 2014, at 17:51, Hugh Williams  wrote:

> What I would not expect though is for the memory consumption to continue to 
> increase until the server is killed due to oom error which would imply a 
> possible memory leak, which is why I recommend building with the develop/7  
> build where there have been improvements in memory management.

I currently just used the stable 7.1.0 release. I'll try with the dev build 
again and report back...

Cheers,
Jörn




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Hugh Williams
Hi Jörn,

Are you running the Virtuoso open source git [1] stable or develop/7 branch? I 
would recommend that the load be performed with the develop/7 branch if this is 
not already being used.

From analysis that development performed in-house earlier this year, it was 
found that the latest Freebase datasets need about 13,000,000 buffers, i.e. 
about 105GB RAM, to load in memory without having to use disk; falling back to 
disk significantly reduces the load rate.  This is because the dataset contains 
many large literal values, and thus does not compress very well, and also a lot 
of duplicate data, so even though it is only about 1.9 billion triples, the 
actual row count in the dump is higher, as you have also observed.
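
As a rough check of the figure above, the buffer count can be converted to 
memory. This is only a sketch: it assumes each buffer holds one 8 KB database 
page (an assumption, not stated in this thread) and ignores any per-buffer 
overhead; `buffers_to_gb` is a hypothetical helper name.

```python
PAGE_BYTES = 8 * 1024  # assumed size of one buffer page; not stated in the thread


def buffers_to_gb(number_of_buffers, page_bytes=PAGE_BYTES):
    """Return the approximate memory footprint in decimal gigabytes."""
    return number_of_buffers * page_bytes / 1e9


if __name__ == "__main__":
    # 13,000,000 buffers, the figure quoted above
    print(round(buffers_to_gb(13_000_000), 1))  # prints 106.5
```

Under that assumption the result is close to the "about 105GB" quoted above.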

What I would not expect though is for the memory consumption to continue to 
increase until the server is killed due to an oom error, which would imply a 
possible memory leak. That is why I recommend building from the develop/7 
branch, where there have been improvements in memory management.

To speed up the load you should consider using faster disks (ideally SSDs) as 
a trade-off for insufficient memory when loading the dataset, and also database 
striping [2] for improved parallel I/O access to the database files if 
possible. Another option would be to load the dataset in 4 parts, which should 
give the leave enough
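
One possible way to produce four parts is sketched here. It assumes the dump is 
a gzipped, line-oriented file in which every line is a complete triple (an 
assumption about the file format), and it distributes lines round-robin rather 
than in contiguous chunks; `split_dump` and the output file names are 
hypothetical.

```python
import gzip
from itertools import cycle


def split_dump(src_path, n_parts=4, prefix="part"):
    """Distribute the lines of a gzipped dump round-robin over n_parts gzipped files."""
    out_paths = [f"{prefix}-{i}.nt.gz" for i in range(n_parts)]
    outs = [gzip.open(p, "wt", encoding="utf-8") for p in out_paths]
    try:
        with gzip.open(src_path, "rt", encoding="utf-8") as src:
            # zip stops as soon as the source file is exhausted
            for line, out in zip(src, cycle(outs)):
                out.write(line)
    finally:
        for out in outs:
            out.close()
    return out_paths
```

Each resulting file could then be registered and loaded separately.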

On our LOD Cache Cloud server [3], which is a 4 node cluster with 768GB RAM and 
60+ billion triples loaded, the Freebase datasets loaded in about 1.7 hrs:

SQL> select min(ll_started) as start, max(ll_done) as finish,
       datediff('second', min(ll_started), max(ll_done)) as delta
     from load_list
     where ll_graph like
       'http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-11-17-00-00.gz';
start                finish               delta
TIMESTAMP            TIMESTAMP            INTEGER
________________________________________________

2013.12.2 22:34.9 0  2013.12.3 0:16.24 0  6135

1 Rows. -- 74 msec.
SQL>

On the single server database we were testing on, with 105GB RAM, it loaded in 
about 2 hrs.
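
For scale, the `delta` value from the query output can be turned into hours and 
an average load rate. This sketch assumes the "about 1.9 billion triples" 
mentioned earlier in this message applies to the dump that was loaded (an 
assumption; the exact count for that dump is not given here). `summarize_load` 
is a hypothetical helper name.

```python
def summarize_load(delta_seconds, triples):
    """Return (duration in hours, average triples loaded per second)."""
    hours = delta_seconds / 3600
    rate = triples / delta_seconds
    return hours, rate


if __name__ == "__main__":
    hours, rate = summarize_load(6135, 1.9e9)  # 1.9e9 is an assumed triple count
    print(f"{hours:.1f} hrs, about {rate:,.0f} triples/s")
```

The 6135 seconds correspond to the "about 1.7 hrs" stated above.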

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.  //  http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

[1] http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSGitUsage
[2] http://docs.openlinksw.com/virtuoso/dbadm.html#ini_Striping
[3] http://lod.openlinksw.com

On 22 Aug 2014, at 14:41, Jörn Hees  wrote:

> Hi,
> 
> TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more 
> RAM than configured.
> 
> i'm trying to load the Freebase RDF dump ( 
> https://developers.google.com/freebase/data ) into a clean Virtuoso 
> OpenSource 7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ 
> GB HD free.
> The dump file contains 2,656,580,382 rows (even though the page claims 1.9 
> billion triples, maybe outdated or dups).
> Before attempting to load the whole Freebase dump, i loaded the basekb.com 
> dump which contained 1,205,456,739 triples into the store which was already 
> filled with DBpedia without any noticeable problem.
> 
> The Freebase dump rdf_loader_run() import starts with rapid IO rates (several 
> 100MB/s read and write bursts) and quickly consumes ~ 25 GB of RAM as 
> configured. It then continues to slowly consume more and more RAM ~ 1 
> MB/minute. As this goes on, the IO rates slowly drop down to some KB/s read 
> and no / very very rare writes. htop at this point shows that the process 
> spends nearly all its time on IO wait. After a couple of days Virtuoso is 
> finally killed by the kernel when it consumed all RAM of the machine and 
> wants even more.
> 
> I already tried adding 16 GB swap. This didn't help but made the machine 
> completely unresponsive after 4 days (sshd seems to have been swapped out and 
> never came back over a couple of hour long retries to ssh into the VM).
> 
> Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.
> 
> A colleague is reporting that the import works fine on a 256 GB RAM, 8 core 
> machine with settings for 64 GB... takes about 1 day to import, the final DB 
> is ~ 130 GB. Mine never gets to > 100 GB before Virtuoso is killed.
> 
> 
> The instance is set up following my tutorial 
> http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
>  just substitute the DBpedia Datasets with the Freebase triple dump and 
> Wikidata links.
> 
> The virtuoso.ini values are set as suggested for 32 GB of RAM, there's 
> nothing else running on the VM:
> [Database]
> MaxCheckpointRemap  = 2000  // also tried with 62500, so ~1/4th 
> of NumberOfBuffers as in the blogpost
> [Parameters]
> ;; Uncomment next

[Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Jörn Hees
Hi,

TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more 
RAM than configured.

i'm trying to load the Freebase RDF dump ( 
https://developers.google.com/freebase/data ) into a clean Virtuoso OpenSource 
7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ GB HD free.
The dump file contains 2,656,580,382 rows (even though the page claims 1.9 
billion triples, maybe outdated or dups).
Before attempting to load the whole Freebase dump, i loaded the basekb.com dump 
which contained 1,205,456,739 triples into the store which was already filled 
with DBpedia without any noticeable problem.

The Freebase dump rdf_loader_run() import starts with rapid IO rates (several 
100MB/s read and write bursts) and quickly consumes ~ 25 GB of RAM as 
configured. It then continues to slowly consume more and more RAM ~ 1 
MB/minute. As this goes on, the IO rates slowly drop down to some KB/s read and 
no / very very rare writes. htop at this point shows that the process spends 
nearly all its time on IO wait. After a couple of days Virtuoso is finally 
killed by the kernel when it consumed all RAM of the machine and wants even 
more.
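
As a back-of-the-envelope check of the numbers above, the time until the 
machine runs out of RAM can be estimated. This is a sketch that treats the 
~1 MB/minute growth as constant and ignores memory used by the OS and by swap; 
`days_until_exhausted` is a hypothetical helper name.

```python
def days_until_exhausted(total_gb, used_gb, growth_mb_per_min):
    """Estimate days until RAM is exhausted at a constant growth rate."""
    remaining_mb = (total_gb - used_gb) * 1024
    minutes = remaining_mb / growth_mb_per_min
    return minutes / (60 * 24)


if __name__ == "__main__":
    # 32 GB machine, ~25 GB in use after the initial phase, ~1 MB/minute growth
    print(round(days_until_exhausted(32, 25, 1), 1))  # prints 5.0
```

That is in the same range as the "couple of days" observed before the kernel 
kills the process.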

I already tried adding 16 GB swap. This didn't help but made the machine 
completely unresponsive after 4 days (sshd seems to have been swapped out and 
never came back over a couple of hour long retries to ssh into the VM).

Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.

A colleague is reporting that the import works fine on a 256 GB RAM, 8 core 
machine with settings for 64 GB... takes about 1 day to import, the final DB is 
~ 130 GB. Mine never gets to > 100 GB before Virtuoso is killed.


The instance is set up following my tutorial 
http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
 just substitute the DBpedia Datasets with the Freebase triple dump and 
Wikidata links.

The virtuoso.ini values are set as suggested for 32 GB of RAM, there's nothing 
else running on the VM:
[Database]
MaxCheckpointRemap  = 2000  // also tried with 62500, so ~1/4th of 
NumberOfBuffers as in the blogpost
[Parameters]
;; Uncomment next two lines if there is 32 GB system memory free
NumberOfBuffers  = 272
MaxDirtyBuffers  = 200


As I already tried a lot of things but can't get this to work, i'd be thankful 
for feedback or someone looking into why virtuoso is consuming all of the RAM.

Cheers,
Jörn

