Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-09-08 Thread Jörn Hees

On 2 Sep 2014, at 00:01, Hugh Williams  wrote:

>>> Development indicate your suggestion is not without merit but 
>>> implementation is not as simple as it may seem as the indexes are not all 
>>> sequential, but something like that could possibly be implemented. It is 
>>> suggested you could try dropping the indexes on RDF_QUAD table,  load the 
>>> Freebase datasets and then recreate indexes after loading,  which would 
>>> require a smaller working set that would better fit into the 32GB RAM 
>>> available. The commands for dropping the necessary indexes are:
>>> 
>>> drop index rdf_quad_pogs;
>>> drop index rdf_quad_sp;
>>> drop index rdf_quad_op;
>>> drop index rdf_quad_gs;
>>> 
>>> and the respective indexes can then be recreated as detailed at:
>>> 
>>> 
>>> http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFPerformanceTuning?#RDF%20Index%20Scheme
>>> 
>>> Note you need to recreate the column-wise indexes being v7. Let us know how 
>>> this works for you.
>> 
>> Cool, will try.
> 
> [Hugh] OK, let us know the outcome ...

After 5 days:

[Sun Sep  7 23:43:31 2014] virtuoso-t[11495]: segfault at  ip 
008c3a8e sp 7f4ec2f79d20 error 7 in virtuoso-t[40+b47000]

yay...

Performance also breaks down at some point with dropped indexes and 2 
run_rdf_loaders as suggested.
I'll append the output of `select * from DB.DBA.LOAD_LIST;` but for now i give 
up...

Cheers,
Jörn


ll_file                                                                                              ll_graph                 ll_state  ll_started                   ll_done                      ll_host  ll_work_time  ll_error
VARCHAR NOT NULL                                                                                     VARCHAR                  INTEGER   TIMESTAMP                    TIMESTAMP                    INTEGER  INTEGER       VARCHAR
___

/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.aa.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:20.13 927132000  2014.9.2 17:34.28 11121000   0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ab.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:20.59 818941000  2014.9.2 17:35.42 994563000  0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ac.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:34.28 46244000   2014.9.2 17:52.1 4205        0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ad.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:35.43 10491000   2014.9.2 17:53.58 266217000  0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ae.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:52.1 4551        2014.9.2 18:11.45 21522000   0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.af.nt.gz  http://rdf.freebase.com  2         2014.9.2 17:53.58 27075      2014.9.2 18:14.13 26648      0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ag.nt.gz  http://rdf.freebase.com  2         2014.9.2 18:11.45 25765000   2014.9.2 18:29.32 312824000  0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ah.nt.gz  http://rdf.freebase.com  2         2014.9.2 18:14.13 27152      2014.9.2 18:34.51 216078000  0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ai.nt.gz  http://rdf.freebase.com  2         2014.9.2 18:29.32 321036000  2014.9.2 18:54.37 54526000   0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.aj.nt.gz  http://rdf.freebase.com  2         2014.9.2 18:34.51 220487000  2014.9.2 18:54.44 130952000  0        NULL          NULL
/usr/local/data/datasets/remote/freebase/2014-08-20/splitted/freebase-rdf-2014-08-17-00-00.ak.nt.gz
  http

Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-09-01 Thread Hugh Williams
Jörn,

On 1 Sep 2014, at 16:44, Jörn Hees  wrote:

> 
> On 1 Sep 2014, at 17:15, Hugh Williams  wrote:
> 
>> [Hugh] Did you let the load continue or was it stopped ?
> 
> yupp, i let it continue but it was killed for out of memory 2.5 days after my 
> last mail... :-/
> 
> 
>> Development indicate your suggestion is not without merit but implementation 
>> is not as simple as it may seem as the indexes are not all sequential, but 
>> something like that could possibly be implemented. It is suggested you could 
>> try dropping the indexes on RDF_QUAD table,  load the Freebase datasets and 
>> then recreate indexes after loading,  which would require a smaller working 
>> set that would better fit into the 32GB RAM available. The commands for 
>> dropping the necessary indexes are:
>> 
>>  drop index rdf_quad_pogs;
>>  drop index rdf_quad_sp;
>>  drop index rdf_quad_op;
>>  drop index rdf_quad_gs;
>> 
>> and the respective indexes can then be recreated as detailed at:
>> 
>>  
>> http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFPerformanceTuning?#RDF%20Index%20Scheme
>> 
>> Note you need to recreate the column-wise indexes being v7. Let us know how 
>> this works for you.
> 
> Cool, will try.

[Hugh] OK, let us know the outcome ...


> 
>> Note you can also use the ld_meter scripts we provided for monitoring the 
>> Virtuoso Bulk loader activity as detailed at:
>> 
>>  
>> http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideLDMeterUtility
>> 
>> Also, how many "rdf_loader_run()" processes do you have  running when 
>> performing the load, as for v7 we recommend running  Number of Core * 0.4 
>> for best performance typically ?
> 
> Thanks, didn't know these. I'll probably not run multiple rdf_loaders at the 
> same time as deactivating the indices, etc. (i assume it's meant for cases 
> where you have enough RAM and aren't invalidating even more of the cache 
> hierarchy by several processes competing?)

[Hugh] You should run multiple rdf_loader_run() processes as there are many 
datasets to load and you want to achieve maximum platform utilisation (mainly 
optimum use of cores for parallel loading of triples) during the load.

Regards
Hugh
> 
> Cheers,
> Jörn
> 
> 
> --
> Slashdot TV.  
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> ___
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-09-01 Thread Jörn Hees

On 1 Sep 2014, at 17:15, Hugh Williams  wrote:

> [Hugh] Did you let the load continue or was it stopped ?

yupp, i let it continue but it was killed for out of memory 2.5 days after my 
last mail... :-/


> Development indicate your suggestion is not without merit but implementation 
> is not as simple as it may seem as the indexes are not all sequential, but 
> something like that could possibly be implemented. It is suggested you could 
> try dropping the indexes on RDF_QUAD table,  load the Freebase datasets and 
> then recreate indexes after loading,  which would require a smaller working 
> set that would better fit into the 32GB RAM available. The commands for 
> dropping the necessary indexes are:
> 
>   drop index rdf_quad_pogs;
>   drop index rdf_quad_sp;
>   drop index rdf_quad_op;
>   drop index rdf_quad_gs;
> 
> and the respective indexes can then be recreated as detailed at:
> 
>   
> http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFPerformanceTuning?#RDF%20Index%20Scheme
> 
> Note you need to recreate the column-wise indexes being v7. Let us know how 
> this works for you.

Cool, will try.

> Note you can also use the ld_meter scripts we provided for monitoring the 
> Virtuoso Bulk loader activity as detailed at:
> 
>   
> http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideLDMeterUtility
> 
> Also, how many "rdf_loader_run()" processes do you have  running when 
> performing the load, as for v7 we recommend running  Number of Core * 0.4 for 
> best performance typically ?

Thanks, didn't know these. I'll probably not run multiple rdf_loaders at the 
same time as deactivating the indices, etc. (i assume it's meant for cases 
where you have enough RAM and aren't invalidating even more of the cache 
hierarchy by several processes competing?)

Cheers,
Jörn




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-09-01 Thread Hugh Williams



Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.  //  http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

On 25 Aug 2014, at 14:33, Jörn Hees  wrote:

> Hi again,
> 
> On 22 Aug 2014, at 17:44, Jörn Hees  wrote:
> 
>> On 22 Aug 2014, at 17:51, Hugh Williams  wrote:
>> 
>>> What I would not expect though is for the memory consumption to continue to 
>>> increase until the server is killed due to oom error which would imply a 
>>> possible memory leak, which is why I recommend building with the develop/7  
>>> build where there have been improvement in memory management.
>> 
>> I currently just used the stable 7.1.0 release. I'll try with the dev build 
>> again and report back...
> 
> So i'm running the import on a fresh dev build since my last email and i'm 
> now at a total memory consumption of 31218/32177 MB (buffers: 15 MB, cache: 
> remaining ~700 MB).
> 
> The Virtuoso process has allocated 31.5 GB (VIRT), 30.1 GB (RES) and 3.812 MB 
> (SHR) Memory.
> 
> I'm not sure if i really have to run the importer till it's killed for out of 
> memory (as i said it becomes pretty slow after a while and is currently only 
> seeking around with 200 KB/s) or if this is enough already. As 
> NumberOfBuffers is set to 2720000 as recommended i guess that anything above 
> 21 GB is suspicious... we're at > 31 GB now.
> 
> 
> I've also split up the input file into 100M line chunks so that i can track 
> the progress a bit better...
> 14 of these are completely loaded now, so 1.4 G triples, the 15th is 
> currently running.
> These are the start times as reported in DB.DBA.LOAD_LIST. I added a column 
> for loaded triples (not necessarily unique):
> 2014.8.22 19:59 0
> 2014.8.22 20:09 100M
> 2014.8.22 20:22 200M
> 2014.8.22 20:39 300M
> 2014.8.22 20:53 400M
> 2014.8.22 21:11 500M
> 2014.8.22 21:31 600M
> 2014.8.22 22:03 700M
> 2014.8.22 22:39 800M
> 2014.8.22 23:32 900M
> 2014.8.23 00:17 1G
> 2014.8.23 02:47 1.1G
> 2014.8.23 08:51 1.2G
> 2014.8.23 18:02 1.3G
> 2014.8.24 16:16 1.4G
> 
> The import times for 100M triples seem to be roughly about:
> - 10 minutes initially
> - 30 minutes after 600M loaded triples
> - 45 minutes after 900M triples
> - 2h:30 after 1G triples (I'm guessing that this is when the set Memory-Limit 
> is hit)
> - 6h after 1.1G triples
> - 10h after 1.2G triples
> - 22h after 1.3G triples
> - >22h after 1.4G triples
> 
> The last 4 lines sadly don't give me the impression that this scales nearly 
> linearly after virtuoso runs out of fast random access memory and has to rely 
> on block storage :-/ Is there maybe a setting which allows virtuoso to fall 
> back to a merge-sort like approach like creating sorted temp dbs and then 
> merging them bottom up? Wouldn't this scale way beyond the available RAM 
> sizes and not cause the seek&wait pattern i observe?!?
> 
> 
> Anything else i can do to help to debug this? Can i stop the import?

[Hugh] Did you let the load continue or was it stopped? Development indicate 
your suggestion is not without merit but implementation is not as simple as it 
may seem as the indexes are not all sequential, but something like that could 
possibly be implemented. It is suggested you could try dropping the indexes on 
the RDF_QUAD table, load the Freebase datasets and then recreate the indexes after 
loading, which would require a smaller working set that would better fit into 
the 32GB RAM available. The commands for dropping the necessary indexes are:

drop index rdf_quad_pogs;
drop index rdf_quad_sp;
drop index rdf_quad_op;
drop index rdf_quad_gs;

and the respective indexes can then be recreated as detailed at:


http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtRDFPerformanceTuning?#RDF%20Index%20Scheme

Note that, this being v7, you need to recreate the column-wise indexes. Let us 
know how this works for you. Note you can also use the ld_meter scripts we 
provided for monitoring the Virtuoso Bulk loader activity as detailed at:


http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideLDMeterUtility

Also, how many rdf_loader_run() processes do you have running when performing 
the load? For v7 we typically recommend running (number of cores * 0.4) of them 
for best performance.
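
The "number of cores * 0.4" rule of thumb above can be turned into a small helper. This is only a sketch: the function name is made up for illustration, and rounding to the nearest whole session (with a minimum of one) is an assumption, since the recommendation itself does not say how to round.

```python
def recommended_loader_count(cores: int, factor: float = 0.4) -> int:
    """Suggested number of concurrent rdf_loader_run() sessions.

    The 0.4 factor is the v7 rule of thumb quoted in this thread.
    Rounding to the nearest integer and never going below 1 is an
    assumption made for this sketch.
    """
    return max(1, round(cores * factor))


if __name__ == "__main__":
    for cores in (4, 8, 16):
        print(f"{cores} cores -> {recommended_loader_count(cores)} loader session(s)")
```

For the 4-core VM discussed in this thread that gives 2 sessions.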

Regards
Hugh

> 
> Cheers,
> Jörn
> 




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-25 Thread Jörn Hees
Hi again,

On 22 Aug 2014, at 17:44, Jörn Hees  wrote:

> On 22 Aug 2014, at 17:51, Hugh Williams  wrote:
> 
>> What I would not expect though is for the memory consumption to continue to 
>> increase until the server is killed due to oom error which would imply a 
>> possible memory leak, which is why I recommend building with the develop/7  
>> build where there have been improvement in memory management.
> 
> I currently just used the stable 7.1.0 release. I'll try with the dev build 
> again and report back...

So i'm running the import on a fresh dev build since my last email and i'm now 
at a total memory consumption of 31218/32177 MB (buffers: 15 MB, cache: 
remaining ~700 MB).

The Virtuoso process has allocated 31.5 GB (VIRT), 30.1 GB (RES) and 3.812 MB 
(SHR) Memory.

I'm not sure if i really have to run the importer till it's killed for out of 
memory (as i said it becomes pretty slow after a while and is currently only 
seeking around with 200 KB/s) or if this is enough already. As NumberOfBuffers 
is set to 2720000 as recommended i guess that anything above 21 GB is 
suspicious... we're at > 31 GB now.


I've also split up the input file into 100M line chunks so that i can track the 
progress a bit better...
14 of these are completely loaded now, so 1.4 G triples, the 15th is currently 
running.
These are the start times as reported in DB.DBA.LOAD_LIST. I added a column for 
loaded triples (not necessarily unique):
2014.8.22 19:59 0
2014.8.22 20:09 100M
2014.8.22 20:22 200M
2014.8.22 20:39 300M
2014.8.22 20:53 400M
2014.8.22 21:11 500M
2014.8.22 21:31 600M
2014.8.22 22:03 700M
2014.8.22 22:39 800M
2014.8.22 23:32 900M
2014.8.23 00:17 1G
2014.8.23 02:47 1.1G
2014.8.23 08:51 1.2G
2014.8.23 18:02 1.3G
2014.8.24 16:16 1.4G

The import times for 100M triples seem to be roughly about:
- 10 minutes initially
- 30 minutes after 600M loaded triples
- 45 minutes after 900M triples
- 2h:30 after 1G triples (I'm guessing that this is when the set Memory-Limit 
is hit)
- 6h after 1.1G triples
- 10h after 1.2G triples
- 22h after 1.3G triples
- >22h after 1.4G triples
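
The per-chunk durations summarised above can be recomputed from the start times listed earlier in this mail. The following is a minimal sketch; the variable names are illustrative, and it assumes each chunk's load time is simply the gap between two consecutive start times.

```python
from datetime import datetime

# Start times of the 100M-line chunks, copied from the list in this mail.
START_TIMES = [
    "2014.8.22 19:59", "2014.8.22 20:09", "2014.8.22 20:22",
    "2014.8.22 20:39", "2014.8.22 20:53", "2014.8.22 21:11",
    "2014.8.22 21:31", "2014.8.22 22:03", "2014.8.22 22:39",
    "2014.8.22 23:32", "2014.8.23 00:17", "2014.8.23 02:47",
    "2014.8.23 08:51", "2014.8.23 18:02", "2014.8.24 16:16",
]

starts = [datetime.strptime(s, "%Y.%m.%d %H:%M") for s in START_TIMES]

# Minutes between consecutive start times, i.e. time spent on each chunk.
minutes_per_chunk = [
    int((later - earlier).total_seconds() // 60)
    for earlier, later in zip(starts, starts[1:])
]

if __name__ == "__main__":
    for index, minutes in enumerate(minutes_per_chunk):
        print(f"{index * 100:>5}M -> {(index + 1) * 100:>5}M triples: "
              f"{minutes // 60}h {minutes % 60:02d}m")
```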

The last 4 lines sadly don't give me the impression that this scales nearly 
linearly after virtuoso runs out of fast random access memory and has to rely 
on block storage :-/ Is there maybe a setting which allows virtuoso to fall 
back to a merge-sort like approach like creating sorted temp dbs and then 
merging them bottom up? Wouldn't this scale way beyond the available RAM sizes 
and not cause the seek&wait pattern i observe?!?


Anything else i can do to help to debug this? Can i stop the import?

Cheers,
Jörn




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Paul Houle
Hey guys,  you will have a much easier time if you use :BaseKB

http://basekb.com/

The fact is that 50% of Freebase is junk,  and you can't afford to load
that.

It pains me to see people having these problems when I load :BaseKB on a
repeatable basis on relatively modest hardware,  and in fact,  you can get
it pre-loaded and be running in minutes

https://aws.amazon.com/marketplace/pp/B00KRKRYW0

Take it from me,  this is the difference between "just works" and days of
suffering.  Make it easy for yourself.


On Fri, Aug 22, 2014 at 11:44 AM, Jörn Hees  wrote:

> Hi Hugh,
>
> thanks for the reply.
>
> I know 32 GB is probably not much considering the size of the dumps, but
> it's the size limit of our VMs :(
> So i'd be willing to live with a bit slower import and response times if i
> can still leave it on a VM.
>
> On 22 Aug 2014, at 17:51, Hugh Williams  wrote:
>
> > What I would not expect though is for the memory consumption to continue
> > to increase until the server is killed due to oom error which would imply a
> > possible memory leak, which is why I recommend building with the develop/7
> > build where there have been improvement in memory management.
>
> I currently just used the stable 7.1.0 release. I'll try with the dev
> build again and report back...
>
> Cheers,
> Jörn
>
>
>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype    ontolo...@gmail.com


Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Jörn Hees
Hi Hugh,

thanks for the reply.

I know 32 GB is probably not much considering the size of the dumps, but it's 
the size limit of our VMs :(
So i'd be willing to live with a bit slower import and response times if i can 
still leave it on a VM.

On 22 Aug 2014, at 17:51, Hugh Williams  wrote:

> What I would not expect though is for the memory consumption to continue to 
> increase until the server is killed due to oom error which would imply a 
> possible memory leak, which is why I recommend building with the develop/7  
> build where there have been improvement in memory management.

I currently just used the stable 7.1.0 release. I'll try with the dev build 
again and report back...

Cheers,
Jörn




Re: [Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Hugh Williams
Hi Jörn,

Are you running the Virtuoso open source git [1] stable or develop 7 branch ? I 
would recommend the load be performed with the develop/7 branch if this is not 
already being used.

From analysis development have performed in-house earlier this year, it was 
found the latest Freebase datasets need about 13,000,000 buffers, i.e. about 105GB 
RAM, to load in memory and not have to use disk, which would significantly reduce 
the load rate.  This is because the dataset contains many large literal values and 
thus does not compress very well and also a lot of duplicate data, so even 
though it is only about 1.9 billion triples as you have seen the actual as you 
have observed also.
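
As a rough cross-check of the buffer figure above, the sketch below converts a buffer count into memory. It assumes each buffer holds one 8 KB database page and ignores any per-buffer bookkeeping overhead, so it is a lower bound and not an exact sizing formula; the function names are illustrative.

```python
PAGE_BYTES = 8 * 1024  # assumption: one 8 KB database page per buffer


def buffers_to_gb(buffers: int) -> float:
    """Raw page memory in decimal gigabytes (10**9 bytes)."""
    return buffers * PAGE_BYTES / 1e9


def buffers_to_gib(buffers: int) -> float:
    """Raw page memory in binary gibibytes (2**30 bytes)."""
    return buffers * PAGE_BYTES / 2**30


if __name__ == "__main__":
    buffers = 13_000_000  # figure quoted above for the Freebase load
    print(f"{buffers:,} buffers ~ {buffers_to_gb(buffers):.1f} GB "
          f"({buffers_to_gib(buffers):.1f} GiB) of page data")
```

Under that assumption 13,000,000 buffers come out in the same range as the "about 105GB" quoted above. The same helper can be applied to whatever NumberOfBuffers value is configured in virtuoso.ini.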

What I would not expect though is for the memory consumption to continue to 
increase until the server is killed due to an OOM error, which would imply a 
possible memory leak. This is why I recommend building from the develop/7 
branch, where there have been improvements in memory management.

To speed up the load you should consider using faster disks (ideally SSDs) as a 
trade-off for insufficient memory when loading the dataset, and also database 
striping [2] for improved parallel I/O access to the database files if 
possible. Another option would be to load the dataset in 4 parts, which should 
give the leave enough

On our LOD Cache Cloud server [3], which is a 4-node cluster with 768GB RAM and 
60+ billion triples, the Freebase datasets loaded in about 1.7 hrs:

SQL> select min(ll_started) as start, max(ll_done) as finish, 
datediff('second', min(ll_started), max(ll_done)) as delta from load_list where 
ll_graph like 
'http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-11-17-00-00.gz';
start                finish               delta
TIMESTAMP            TIMESTAMP            INTEGER
___

2013.12.2 22:34.9 0  2013.12.3 0:16.24 0  6135

1 Rows. -- 74 msec.
SQL>

On the single server database we tested on with 105GB RAM it loaded in about 
2hrs.
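
The delta reported by the query above can be reproduced outside of isql. This is a sketch under two assumptions: the timestamps are read as "year.month.day hour:minute.second" followed by a fractional field that is ignored here, and the helper name is made up for illustration.

```python
from datetime import datetime


def parse_isql_timestamp(text: str) -> datetime:
    """Parse a timestamp as printed in the isql output above.

    Assumed layout: 'Y.M.D H:MM.S <fraction>'. The trailing fraction
    field is dropped, which is fine for second-level differences.
    """
    date_part, time_part = text.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y.%m.%d %H:%M.%S")


start = parse_isql_timestamp("2013.12.2 22:34.9 0")
finish = parse_isql_timestamp("2013.12.3 0:16.24 0")
delta_seconds = int((finish - start).total_seconds())

if __name__ == "__main__":
    print(f"{delta_seconds} s = {delta_seconds / 3600:.2f} h")
```

With this reading the difference matches the 6135 seconds in the delta column, i.e. about 1.7 hours.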

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.  //  http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

[1] http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSGitUsage
[2] http://docs.openlinksw.com/virtuoso/dbadm.html#ini_Striping
[3] http://lod.openlinksw.com

On 22 Aug 2014, at 14:41, Jörn Hees  wrote:

> Hi,
> 
> TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more 
> RAM than configured.
> 
> i'm trying to load the Freebase RDF dump ( 
> https://developers.google.com/freebase/data ) into a clean Virtuoso 
> OpenSource 7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ 
> GB HD free.
> The dump file contains 2,656,580,382 rows (even though the page claims 1.9 
> billion triples, maybe outdated or dups).
> Before attempting to load the whole Freebase dump, i loaded the basekb.com 
> dump which contained 1,205,456,739 triples into the store which was already 
> filled with DBpedia without any noticeable problem.
> 
> The Freebase dump rdf_loader_run() import starts with rapid IO rates (several 
> 100MB/s read and write bursts) and quickly consumes ~ 25 GB of RAM as 
> configured. It then continues to slowly consume more and more RAM ~ 1 
> MB/minute. As this goes on, the IO rates slowly drop down to some KB/s read 
> and no / very very rare writes. htop at this point shows that the process 
> spends nearly all its time on IO wait. After a couple of days Virtuoso is 
> finally killed by the kernel when it consumed all RAM of the machine and 
> wants even more.
> 
> I already tried adding 16 GB swap. This didn't help but made the machine 
> completely unresponsive after 4 days (sshd seems to have been swapped out and 
> never came back over a couple of hour long retries to ssh into the VM).
> 
> Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.
> 
> A colleague is reporting that the import works fine on a 256 GB RAM, 8 core 
> machine with settings for 64 GB... takes about 1 day to import, the final DB 
> is ~ 130 GB. Mine never gets to > 100 GB before Virtuoso is killed.
> 
> 
> The instance is set up following my tutorial 
> http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
>  just substitute the DBpedia Datasets with the Freebase triple dump and 
> Wikidata links.
> 
> The virtuoso.ini values are set as suggested for 32 GB of RAM, there's 
> nothing else running on the VM:
> [Database]
> MaxCheckpointRemap  = 2000  // also tried with 62500, so ~1/4th 
> of NumberOfBuffers as in the blogpost
> [Parameters]
> ;; Uncomment next

[Virtuoso-users] importing Freebase RDF dump: slows down, memleak?

2014-08-22 Thread Jörn Hees
Hi,

TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more 
RAM than configured.

i'm trying to load the Freebase RDF dump ( 
https://developers.google.com/freebase/data ) into a clean Virtuoso OpenSource 
7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ GB HD free.
The dump file contains 2,656,580,382 rows (even though the page claims 1.9 
billion triples, maybe outdated or dups).
Before attempting to load the whole Freebase dump, i loaded the basekb.com dump 
which contained 1,205,456,739 triples into the store which was already filled 
with DBpedia without any noticeable problem.

The Freebase dump rdf_loader_run() import starts with rapid IO rates (several 
100MB/s read and write bursts) and quickly consumes ~ 25 GB of RAM as 
configured. It then continues to slowly consume more and more RAM ~ 1 
MB/minute. As this goes on, the IO rates slowly drop down to some KB/s read and 
no / very very rare writes. htop at this point shows that the process spends 
nearly all its time on IO wait. After a couple of days Virtuoso is finally 
killed by the kernel when it consumed all RAM of the machine and wants even 
more.

I already tried adding 16 GB swap. This didn't help but made the machine 
completely unresponsive after 4 days (sshd seems to have been swapped out and 
never came back over a couple of hour long retries to ssh into the VM).

Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.

A colleague is reporting that the import works fine on a 256 GB RAM, 8 core 
machine with settings for 64 GB... takes about 1 day to import, the final DB is 
~ 130 GB. Mine never gets to > 100 GB before Virtuoso is killed.


The instance is set up following my tutorial 
http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
 just substitute the DBpedia Datasets with the Freebase triple dump and 
Wikidata links.

The virtuoso.ini values are set as suggested for 32 GB of RAM, there's nothing 
else running on the VM:
[Database]
MaxCheckpointRemap  = 2000  // also tried with 62500, so ~1/4th of 
NumberOfBuffers as in the blogpost
[Parameters]
;; Uncomment next two lines if there is 32 GB system memory free
NumberOfBuffers  = 2720000
MaxDirtyBuffers  = 2000000


As I already tried a lot of things but can't get this to work, i'd be thankful 
for feedback or someone looking into why virtuoso is consuming all of the RAM.

Cheers,
Jörn

