Re: [Virtuoso-users] full-text indexing after mass import (was: rdf_loader_run logs "missed delete of name id cache" messages)

2015-11-22 Thread Hugh Williams
Jörn,

> On 23 Nov 2015, at 01:46, Jörn Hees  wrote:
> 
> Hi Hugh,
> 
> thanks again for your replies...
> 
>> On 22 Nov 2015, at 01:46, Hugh Williams  wrote:
>> 
>>> What puzzles me is that after import and several checkpoints and restarts, 
>>> just leaving the DB idle without any queries (see below) it seems to become 
>>> busy.
>>> I guess it does some kind of "re-organization" and i'd mostly like to find 
>>> out how i can tell it "do it now, take all resources you want, don't care 
>>> if anyone is waiting, admin override, full speed ;)".
>>> That would allow me to then have that static state of the DB which i can 
>>> back-up and replay if things go wrong or someone wants an old version, 
>>> leaving us with "ready to use" backups, and not such that first start some 
>>> lengthy "re-organization after mass import".
>>> 
>>> The mentioned "re-organization state" now seems to be over after leaving 
>>> the DB switched on and idle for the last couple of days.
>> 
>> [Hugh] Does your database have Full Text indexing enabled? It is a 
>> scheduled background task that would take time to complete on a newly loaded 
>> large database like yours, see:
>> 
>>  
>> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
> 
> I really think that this could be it, as by default there seems to be an 
> "all" index.

[Hugh] If you installed the Virtuoso Faceted Browser then the FT index would be 
enabled and run as a scheduled job.
> 
> Reading the doc page, i have two remaining questions:
> 
> After a normal `rdf_loader_run()`, would a 
> `DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();` be sufficient to get a complete 
> full-text index? Or do i have to run `DB.DBA.RDF_OBJ_FT_RECOVER();` in those 
> cases and will otherwise never arrive at a complete free-text index (not even 
> after the background tasks finished?)?

[Hugh] The scheduler will run `DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();`, so you 
can wait for it to run or run it manually yourself.
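
For illustration only, a minimal Python sketch of how a manual run could be 
scripted by building an isql command line. The port 1111, the dba user and the 
password value are assumptions about a default local install, and the helper 
name `isql_exec_args` is hypothetical. Like the isql invocation quoted 
elsewhere on this list, it passes the password on the command line.

```python
from typing import List


def isql_exec_args(port: str, user: str, password: str, statement: str) -> List[str]:
    """Build an argument list that runs a single statement through isql."""
    return ["isql", port, user, password, "VERBOSE=OFF", f"exec={statement}"]


# Port, user and password are placeholders (assumptions), not values from the thread.
args = isql_exec_args("1111", "dba", "secret", "DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();")

# Print the command with the password masked.
print(" ".join(args[:3]), "***", " ".join(args[4:]))

# To actually execute it against a running server:
#   import subprocess; subprocess.run(args, check=True)
```

The same helper could be given `DB.DBA.RDF_OBJ_FT_RECOVER();` as the statement.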

> If i have to, a mention of this around 
> http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloadinglod
>  would be nice.
> 
> I ran `DB.DBA.RDF_OBJ_FT_RECOVER();` on a small instance with just the 
> DBpedia core (~ 430 M triples) and it seems to only use 2 - 3 CPUs with very 
> little IO. The whole importing of that dataset only took 1:30 hours, but the 
> full-text indexing is still running after 3 hours now... Is there any way to 
> go full speed at the cost of locking the whole DB or something?

[Hugh] I will have to check with development, as I am not aware of a param to 
control CPU usage; I would have thought it should run with full platform 
utilisation …

Regards
Hugh

> 
> Cheers,
> Jörn
> 
> 
> 


--
___
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users


[Virtuoso-users] full-text indexing after mass import (was: rdf_loader_run logs "missed delete of name id cache" messages)

2015-11-22 Thread Jörn Hees
Hi Hugh,

thanks again for your replies...

> On 22 Nov 2015, at 01:46, Hugh Williams  wrote:
> 
>> What puzzles me is that after import and several checkpoints and restarts, 
>> just leaving the DB idle without any queries (see below) it seems to become 
>> busy.
>> I guess it does some kind of "re-organization" and i'd mostly like to find 
>> out how i can tell it "do it now, take all resources you want, don't care if 
>> anyone is waiting, admin override, full speed ;)".
>> That would allow me to then have that static state of the DB which i can 
>> back-up and replay if things go wrong or someone wants an old version, 
>> leaving us with "ready to use" backups, and not such that first start some 
>> lengthy "re-organization after mass import".
>> 
>> The mentioned "re-organization state" now seems to be over after leaving the 
>> DB switched on and idle for the last couple of days.
> 
> [Hugh] Does your database have Full Text indexing enabled? It is a 
> scheduled background task that would take time to complete on a newly loaded 
> large database like yours, see:
> 
>   
> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext

I really think that this could be it, as by default there seems to be an "all" 
index.

Reading the doc page, i have two remaining questions:

After a normal `rdf_loader_run()`, would a 
`DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();` be sufficient to get a complete 
full-text index? Or do i have to run `DB.DBA.RDF_OBJ_FT_RECOVER();` in those 
cases and will otherwise never arrive at a complete free-text index (not even 
after the background tasks finished?)?
If i have to, a mention of this around 
http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloadinglod 
would be nice.

I ran `DB.DBA.RDF_OBJ_FT_RECOVER();` on a small instance with just the DBpedia 
core (~ 430 M triples) and it seems to only use 2 - 3 CPUs with very little IO. 
The whole importing of that dataset only took 1:30 hours, but the full-text 
indexing is still running after 3 hours now... Is there any way to go full 
speed at the cost of locking the whole DB or something?

Cheers,
Jörn






Re: [Virtuoso-users] full-text indexing after mass import (was: rdf_loader_run logs "missed delete of name id cache" messages)

2015-11-22 Thread Jörn Hees
Hi,

> On 23 Nov 2015, at 03:36, Hugh Williams  wrote:
> 
>>> 
>>> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>> 
>> I really think that this could be it, as by default there seems to be an 
>> "all" index.
> 
> [Hugh] If you installed the Virtuoso Faceted Browser then the FT index would 
> be enabled and run as a scheduled job.

It's a "vanilla" virtuoso-server as installed from the .deb packages of the 
7.2.1 sources, the only VAD package installed is the conductor.


>> 
>> Reading the doc page, i have two remaining questions:
>> 
>> After a normal `rdf_loader_run()`, would a 
>> `DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();` be sufficient to get a complete 
>> full-text index? Or do i have to run `DB.DBA.RDF_OBJ_FT_RECOVER();` in those 
>> cases and will otherwise never arrive at a complete free-text index (not 
>> even after the background tasks finished?)?
> 
> [Hugh] The scheduler will run `DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();`, so you 
> can wait for it to run or run it manually yourself.


Yes, that's what i understood, but on 
http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext 
there's this remark:

> One problem related to free-text indexing of DB.DBA.RDF_QUAD is that some 
> applications (e.g. those that import billions of triples) may set off 
> triggers. This will make free-text index data incomplete. Calling procedure 
> DB.DBA.RDF_OBJ_FT_RECOVER () will insert all missing free-text index items by 
> dropping and re-inserting every existing free-text index rule.

So i'm asking: is `rdf_loader_run();` one of those "applications" which 
deactivate some triggers leading to an incomplete full-text index and need 
`DB.DBA.RDF_OBJ_FT_RECOVER();` to be called?

Or are the only two interesting alternatives for me waiting or calling 
`DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();`?


>> If i have to, a mention of this around 
>> http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloadinglod
>>  would be nice.
>> 
>> I ran `DB.DBA.RDF_OBJ_FT_RECOVER();` on a small instance with just the 
>> DBpedia core (~ 430 M triples) and it seems to only use 2 - 3 CPUs with very 
>> little IO. The whole importing of that dataset only took 1:30 hours, but the 
>> full-text indexing is still running after 3 hours now... Is there any way to 
>> go full speed at the cost of locking the whole DB or something?
> 
> [Hugh] I will have to check with development, as I am not aware of a param to 
> control CPU usage; I would have thought it should run with full platform 
> utilisation

Thanks,

Jörn




Re: [Virtuoso-users] output:format "NT" question

2015-11-22 Thread Davis, Daniel (NIH/NLM) [C]
Thanks Hugh,

Changing that just got me to a maximum size vector error.   Turns out there are 
many takes on saving this to files using http_nt_triple(), string_output(), 
string_to_file() through stored procedures, and that got me through.

The release of my URI preserving algorithm is Dec. 3rd, and after that I'll 
lurk until NLM makes a decision on whether to change this to production, and 
maybe add additional ontologies.

I wanted to take the opportunity to thank you for your patient support as I 
learned Virtuoso more deeply.   It's been a project; you've helped.

-Dan

From: Hugh Williams [hwilli...@openlinksw.com]
Sent: Saturday, November 21, 2015 8:37 PM
To: Davis, Daniel (NIH/NLM) [C]
Cc: virtuoso-users@lists.sourceforge.net
Subject: Re: [Virtuoso-users] output:format "NT" question

Hi Daniel,

A default virtuoso.ini file has "ResultSetMaxRows = 10000" in the "[SPARQL]" 
section, so I imagine your initial script was hitting this restriction on the 
maximum size of a SPARQL result set ...
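
As a rough way to check the setting, here is a minimal Python sketch that 
reads it from INI text. The fragment below is a stand-in for a real 
virtuoso.ini (an assumption, with an illustrative value), and the helper name 
`result_set_max_rows` is hypothetical.

```python
import configparser

# Sample fragment standing in for a real virtuoso.ini; the value is illustrative.
SAMPLE_INI = """
[SPARQL]
ResultSetMaxRows = 10000
"""


def result_set_max_rows(ini_text: str) -> int:
    """Return the ResultSetMaxRows value from the [SPARQL] section."""
    # Treat ';' after a value as the start of an inline comment.
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(ini_text)
    return parser.getint("SPARQL", "ResultSetMaxRows")


limit = result_set_max_rows(SAMPLE_INI)
print(f"SPARQL result sets are capped at {limit} rows")
```

A cap at this level would be consistent with an export that stops at roughly 
that many triples.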

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.  //  http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

> On 16 Nov 2015, at 19:11, Davis, Daniel (NIH/NLM) [C]  
> wrote:
>
> Guys,
>
> I still don’t know why it doesn’t work, but I’ve adapted 
> https://www.mail-archive.com/virtuoso-users@lists.sourceforge.net/msg03950.html 
> and 
> http://joaorosilva.no-ip.org/wiki/doku.php/mainblog:dump_and_load_graphs_in_virtuoso 
> to my own purposes and they seem to work well.
>
> From: Davis, Daniel (NIH/NLM) [C]
> Sent: Monday, November 16, 2015 1:09 PM
> To: virtuoso-users@lists.sourceforge.net
> Subject: output:format "NT" question
>
> I have a simple script to export triples, but it does not seem to get all 
> of them, only about 10,000. I cannot guess what the problem may be.
> Any advice is appreciated.
>
> isql  dba "$PASSWORD" BANNER=OFF BLOBS=ON VERBOSE=OFF ECHO=OFF PROMPT=OFF \
>   TIMEOUT=0 >fullmesh.nt <<EOF
> SPARQL
> define output:format "NT"
> PREFIX meshv: 
> CONSTRUCT { ?s ?p ?o }
> WHERE {
>   GRAPH  {
>     ?s ?p ?o
>   }
> };
> EOF
>
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>




Re: [Virtuoso-users] output:format "NT" question

2015-11-22 Thread Hugh Williams
Hi Dan,

OK, what is the actual maximum vector size error you are getting, and what is 
the size of the result set you are typically seeking to save to a file? 
Generally, if the result set is large, processing it in chunks would be 
required.
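
To illustrate the chunking idea, here is a minimal Python sketch that only 
generates paged SPARQL queries using ORDER BY with LIMIT/OFFSET. It is one 
possible approach, not necessarily how the stored-procedure scripts mentioned 
in this thread work. The graph IRI, the row count and the helper name 
`paged_queries` are hypothetical.

```python
from typing import Iterator


def paged_queries(graph_iri: str, total_rows: int, chunk_size: int) -> Iterator[str]:
    """Yield one SPARQL SELECT per chunk, paging with LIMIT/OFFSET."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total_rows, chunk_size):
        # ORDER BY keeps the row order stable so that pages do not overlap.
        yield (
            f"SELECT ?s ?p ?o WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }} "
            f"ORDER BY ?s ?p ?o LIMIT {chunk_size} OFFSET {offset}"
        )


# Hypothetical graph IRI and sizes, for illustration only.
queries = list(paged_queries("http://example.org/graph", total_rows=25, chunk_size=10))
for q in queries:
    print(q)
```

Each generated query would be sent separately and its rows appended to the 
output file, with the chunk size kept below any configured result-set cap.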

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.  //  http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

