[Owlim-discussion] Loading a Large Triple Store using OWLIM-SE

2013-03-28 Thread Joshua Greben
Hello all,

I am new to this list and to OWLIM-SE and was wondering if anyone could offer 
advice for loading a large triple store. I am trying to load 670M triples into 
a repository using the openrdf-sesame workbench under tomcat6 on a single linux 
VM with 64-bit hardware and 64GB of memory.  

My JVM has the following: -Xms32g -Xmx32g -XX:MaxPermSize=256m
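
(For reference, I am setting these through Tomcat's setenv.sh; a minimal
sketch, assuming a stock Tomcat 6 layout:)

    # $CATALINA_HOME/bin/setenv.sh -- sourced by catalina.sh at startup
    export JAVA_OPTS="-Xms32g -Xmx32g -XX:MaxPermSize=256m"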

Here is the log info for my repository configuration:

...
[INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured parameter 
'entity-id-size' to '32'
[INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured parameter 
'enable-context-index' to 'false'
[INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured parameter 
'entity-index-size' to '1'
[INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured parameter 
'tuple-index-memory' to '1600m'
[INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Configured parameter 
'cache-memory' to '3200m'
[INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pages for 
tuples: 83886
[INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pages for 
predicates: 0
[INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Configured parameter 
'storage-folder' to 'storage'
[INFO ] 2013-03-27 13:57:00,741 [repositories/BFWorks_STF] Configured parameter 
'in-memory-literal-properties' to 'false'
[INFO ] 2013-03-27 13:57:00,742 [repositories/BFWorks_STF] Configured parameter 
'repository-type' to 'file-repository'

The loading came to a standstill after 19 hours and tomcat threw an 
OutOfMemoryError: GC overhead limit exceeded. 
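
(In case it helps with diagnosis next time, I am considering adding the
standard HotSpot diagnostic flags; a sketch, with the dump path being an
assumption:)

    # Capture GC activity and a heap dump on the next OutOfMemoryError
    export JAVA_OPTS="$JAVA_OPTS -verbose:gc \
        -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"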

My question is: what is the application doing with all this memory, and did I 
configure my instance correctly for this load to finish? I also see a lot of 
entries in the main log such as this:

[WARN ] 2013-03-28 08:50:59,114 [repositories/BFWorks_STF] [Rio error] 
Unescaped backslash in: L\'ambassadrice (314764886, -1)

Could these "Rio errors" be contributing to my troubles? I was also wondering 
if there was a way to configure logging to be able to track the application's 
progress. Right now these warnings are the only way I can tell how far the 
loading has progressed.
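
(For now my crude progress check is to grep for those warnings; a sketch, with
the log path being an assumption for my setup -- the first number in the
parentheses is the line of the input file reached so far:)

    # Show the most recent Rio warning, i.e. the furthest input line parsed
    grep 'Rio error' /path/to/logs/main.log | tail -n 1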

Advice from anyone who has experience successfully loading a large triplestore 
is much appreciated! Thanks in advance!

- Josh


Joshua Greben
Library Systems Programmer & Analyst
Stanford University Libraries
(650) 714-1937
jgre...@stanford.edu


___
Owlim-discussion mailing list
Owlim-discussion@ontotext.com
http://ontomail.semdata.org/cgi-bin/mailman/listinfo/owlim-discussion


Re: [Owlim-discussion] Loading a Large Triple Store using OWLIM-SE

2013-03-29 Thread Joshua Greben
Thanks for the advice! 

I used the spreadsheet and was able to size the application correctly. 17 hours 
later my RDF/XML triple file is 80% loaded. It looks like it might still take 
up to another 11 hours to finish, but again, this is based on my reading of 
"unescaped backslash" errors that are logged and timestamped with the file line 
number. 

I am still running this under tomcat using the workbench because the curl 
command threw the following error: MALFORMED DATA: Element type "http:" must be 
followed by either attribute specifications, ">" or "/>". It seems that the 
workbench application is better able to handle these errors. I might try it 
again later using   curl --data-urlencode -T /path/to/data/data.nt ... to see 
if that helps, but I just wanted to get something running overnight.
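
(That MALFORMED DATA error looks like the server running my N-Triples file
through its RDF/XML parser, so on the retry I will send an explicit content
type; a sketch, reusing the repository URL pattern from the reply below --
text/plain is the MIME type Sesame registers for N-Triples:)

    curl -X POST -H "Content-Type: text/plain" \
         -T /path/to/data/data.nt \
         localhost:8080/openrdf-sesame/repositories/BFWorks_STF/statements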

Thanks again!

- Josh

On Mar 28, 2013, at 2:51 PM, Marek Šurek wrote:

> Hi,
> if you want to see the progress of loading, there is an option to use the 
> standard "curl" command instead of openrdf-workbench. It gives you some 
> information about what has already been loaded.
> To load files into OWLIM (from a .trig file), run this command in your linux 
> shell:
> 
> curl -X POST -H "Content-Type:application/x-trig" -T 
> /path/to/data/datafile.trig 
> localhost:8080/openrdf-sesame/repositories/repository-name/statements
> 
> If you have XML-style data, change the content type to application/rdf+xml. 
> 
> 
> If you load a big amount of data, I recommend using configuration.xls, which 
> is part of OWLIM-SE.zip. It can help you set up the datastore properly.
> 
> Hope this will help.
> 
> Best regards,
> Marek
> 
> From: Joshua Greben 
> To: owlim-discussion@ontotext.com 
> Sent: Thursday, 28 March 2013, 22:30
> Subject: [Owlim-discussion] Loading a Large Triple Store using OWLIM-SE
> 
> Hello all,
> 
> I am new to this list and to OWLIM-SE and was wondering if anyone could offer 
> advice for loading a large triple store. I am trying to load 670M triples 
> into a repository using the openrdf-sesame workbench under tomcat6 on a 
> single linux VM with 64-bit hardware and 64GB of memory.  
> 
> My JVM has the following: -Xms32g -Xmx32g -XX:MaxPermSize=256m
> 
> Here is the log info for my repository configuration:
> 
> ...
> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
> parameter 'entity-id-size' to '32'
> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
> parameter 'enable-context-index' to 'false'
> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
> parameter 'entity-index-size' to '1'
> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
> parameter 'tuple-index-memory' to '1600m'
> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Configured 
> parameter 'cache-memory' to '3200m'
> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pages for 
> tuples: 83886
> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pages for 
> predicates: 0
> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Configured 
> parameter 'storage-folder' to 'storage'
> [INFO ] 2013-03-27 13:57:00,741 [repositories/BFWorks_STF] Configured 
> parameter 'in-memory-literal-properties' to 'false'
> [INFO ] 2013-03-27 13:57:00,742 [repositories/BFWorks_STF] Configured 
> parameter 'repository-type' to 'file-repository'
> 
> The loading came to a standstill after 19 hours and tomcat threw an 
> OutOfMemoryError: GC overhead limit exceeded. 
> 
> My question is: what is the application doing with all this memory, and did 
> I configure my instance correctly for this load to finish? I also see a lot 
> of entries in the main log such as this:
> 
>   [WARN ] 2013-03-28 08:50:59,114 [repositories/BFWorks_STF] [Rio error] 
> Unescaped backslash in: L\'ambassadrice (314764886, -1)
> 
> Could these "Rio errors" be contributing to my troubles? I was also wondering 
> if there was a way to configure logging to be able to track the application's 
> progress. Right now these warnings are the only way I can tell how far the 
> loading has progressed.
> 
> Advice from anyone who has experience successfully loading a large 
> triplestore is much appreciated! Thanks in advance!
> 
> - Josh
> 
> 
> Joshua Greben
> Library Systems Programmer & Analyst
> Stanford University Libraries
> (650) 714-1937
> jgre...@stanford.edu
> 
> 
> 
> ___
> Owlim-discussion mailing list
> Owlim-discussion@ontotext.com
> http://ontomail.semdata.org/cgi-bin/mailman/listinfo/owlim-discussion
> 
> 

___
Owlim-discussion mailing list
Owlim-discussion@ontotext.com
http://ontomail.semdata.org/cgi-bin/mailman/listinfo/owlim-discussion


Re: [Owlim-discussion] Loading a Large Triple Store using OWLIM-SE

2013-04-09 Thread Joshua Greben
Hi Barry,

Following your advice, I ran the load using the example.sh script, pointing it 
at my repository on localhost:8080. The load ran fine for 7 hours, but then it 
gave up with the following error in the main log:

[ERROR] 2013-04-08 20:13:18,019 [repositories/BFWorks_STF] Error while 
handling request (500): java.net.SocketTimeoutException: Read timed out

I noticed that tomcat's connectionTimeout param was at the default (20sec.) so 
I considered increasing it to 10 minutes. Any advice on this?
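
(The change I have in mind is on the HTTP connector in
$CATALINA_HOME/conf/server.xml; a sketch -- connectionTimeout is in
milliseconds, so 600000 = 10 minutes:)

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="600000"
               redirectPort="8443" />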

Also, since this error happened, I am unable to do anything with the repository 
except view the Contexts in Repository (via the workbench). When I try to clear 
the contexts to start over from scratch, it takes a very long time and then I 
end up getting:

javax.servlet.ServletException: 
org.openrdf.repository.RepositoryException: java.io.EOFException

At this point I am forced to kill the tomcat process and delete the repository 
forcibly.


I then tried creating a repository using the sesame_owlim console, but I keep 
getting:

ERROR: No template called BFWorks.ttl found in 
/storage/openrdf-sesame-console/templates

even though I have a BFWorks.ttl file in that directory.
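
(One thing I will double-check: as I understand it, the console's create
command takes the template name without the .ttl extension, so the session
should look roughly like this sketch, with the server URL assumed:)

    ./console.sh
    > connect http://localhost:8080/openrdf-sesame
    > create BFWorks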

Any help/advice is appreciated.

 -Josh

On Mar 29, 2013, at 8:46 AM, Barry Bishop wrote:

> Hello Marek, Stefano,
> 
> There is a little bit of information here about how to load a lot of data 
> (the problems being that the Sesame workbench/browser will time out if it 
> takes too long and OWLIM uses a lot of memory if the transaction size is too 
> big):
> 
> https://confluence.ontotext.com/display/OWLIMv53/OWLIM+FAQ#OWLIMFAQ-HowdoIloadlargeamountsofdataintoOWLIMSEorOWLIMEnterprise%3F
> 
> There is also some information here about using the demonstrator program that 
> comes with OWLIM to do this:
> 
> https://confluence.ontotext.com/display/OWLIMv53/OWLIM-SE+Configuration#OWLIM-SEConfiguration-Bulkdataloading
> 
> The latter would be my preferred approach, because it allows you to control 
> parsing errors in your data, e.g. skip errors or stop, validate literals, etc.
> 
> I hope this helps,
> barry
> 
> Barry Bishop
> OWLIM Product Manager
> Ontotext AD
> Tel: +43 650 2000 237
> email: barry.bis...@ontotext.com
> skype: bazbishop
> www.ontotext.com
> On 03/28/2013 10:51 PM, Marek Šurek wrote:
>> Hi,
>> if you want to see the progress of loading, there is an option to use the 
>> standard "curl" command instead of openrdf-workbench. It gives you some 
>> information about what has already been loaded.
>> To load files into OWLIM (from a .trig file), run this command in your linux 
>> shell:
>> 
>> curl -X POST -H "Content-Type:application/x-trig" -T 
>> /path/to/data/datafile.trig 
>> localhost:8080/openrdf-sesame/repositories/repository-name/statements
>> 
>> If you have XML-style data, change the content type to application/rdf+xml. 
>> 
>> 
>> If you load a big amount of data, I recommend using configuration.xls, 
>> which is part of OWLIM-SE.zip. It can help you set up the datastore properly.
>> 
>> Hope this will help.
>> 
>> Best regards,
>> Marek
>> 
>> From: Joshua Greben 
>> To: owlim-discussion@ontotext.com 
>> Sent: Thursday, 28 March 2013, 22:30
>> Subject: [Owlim-discussion] Loading a Large Triple Store using OWLIM-SE
>> 
>> Hello all,
>> 
>> I am new to this list and to OWLIM-SE and was wondering if anyone could 
>> offer advice for loading a large triple store. I am trying to load 670M 
>> triples into a repository using the openrdf-sesame workbench under tomcat6 
>> on a single linux VM with 64-bit hardware and 64GB of memory.  
>> 
>> My JVM has the following: -Xms32g -Xmx32g -XX:MaxPermSize=256m
>> 
>> Here is the log info for my repository configuration:
>> 
>> ...
>> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
>> parameter 'entity-id-size' to '32'
>> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
>> parameter 'enable-context-index' to 'false'
>> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
>> parameter 'entity-index-size' to '1'
>> [INFO ] 2013-03-27 13:57:00,720 [repositories/BFWorks_STF] Configured 
>> parameter 'tuple-index-memory' to '1600m'
>> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Configured 
>> parameter 'cache-memory' to '3200m'
>> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pages for 
>> tuples: 83886
>> [INFO ] 2013-03-27 13:57:00,721 [repositories/BFWorks_STF] Cache pa

Re: [Owlim-discussion] Loading a Large Triple Store using OWLIM-SE

2013-04-16 Thread Joshua Greben
Barry,

Thanks for your continued help. I have since been using the configuration 
provided by the owlim-se-configurator spreadsheet, which gave me the following 
JAVA_OPTS; I also entered these values when creating the repository:

-Xmx56930m 
-Dentity-index-size=10155 
-Dcache-memory=25187m 
-Dtuple-index-memory=25187m 
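
For reference, a sketch of how I am applying these (the setenv.sh location
assumes a stock Tomcat 6 layout; my understanding is that OWLIM-SE reads the
-D system properties as overrides for the repository parameters):

    # $CATALINA_HOME/bin/setenv.sh
    export JAVA_OPTS="-Xmx56930m -Dentity-index-size=10155 \
        -Dcache-memory=25187m -Dtuple-index-memory=25187m"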

The load runs for approximately 12 hours and then errors out with:

Failed to load 'BFWorks_STF_clean.nt' (N-Triples).
org.openrdf.rio.RDFHandlerException: 
org.openrdf.repository.RepositoryException: java.net.ConnectException: 
Connection refused

I have experienced this a few times already, so I have since increased tomcat's 
connectionTimeout, the default web application timeout, and the openrdf-sesame 
application's timeout all to 120 minutes to see if this helps, and I am now 
starting another load.

Here is a link http://esp-dev-jgreben.stanford.edu:8080/owlimloads to a graph 
that I created based on the logs from my last load (and the assumption that it 
loads 500,000 statements per POST). Based on this graph it looks like it will 
take several days (or weeks) to reach 670 million statements, as the loading 
stops being linear at around 120 million statements.
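
(The data points came from the Rio warnings; a sketch of the extraction, with
the log path assumed -- the first and last warnings give two time/input-line
samples, enough to eyeball the rate:)

    grep 'Rio error' /path/to/logs/main.log | sed -n '1p;$p' \
      | awk '{ n = $(NF-1); gsub(/[(,]/, "", n); print $3, $4, n }'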

I just want to know if I am on the right track, or if I should do any more 
memory tuning or increase/decrease the entity index size.

Thanks again.

- Josh

On Apr 9, 2013, at 11:58 PM, Barry Bishop wrote:

> Hi Joshua,
> 
> Sorry to hear that you are still having problems loading data. Looking more 
> closely, I think you have a less than optimal memory configuration:
> 
> Java heap 32G
> 'tuple-index-memory' to '1600m'
> 'cache-memory' to '3200m'
> 
> I suggest you increase the last two parameters to something more like '10G' 
> or possibly even 15G for loading.
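> 
> For example, a sketch of the revised options (I am assuming here that the 
> parameters can be passed as -D system properties):
> 
>     export JAVA_OPTS="-Xms32g -Xmx32g -XX:MaxPermSize=256m \
>         -Dtuple-index-memory=10g -Dcache-memory=10g"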
> 
> More comments inline:
> 
> On 04/09/2013 09:47 PM, Joshua Greben wrote:
>> Hi Barry,
>> 
>> Following your advice, I ran the load using the example.sh script, pointing 
>> it at my repository on localhost:8080. The load ran fine for 7 hours, but 
>> then it gave up with the following error in the main log:
>> 
>>  [ERROR] 2013-04-08 20:13:18,019 [repositories/BFWorks_STF] Error while 
>> handling request (500): java.net.SocketTimeoutException: Read timed out
> 
> I don't have the full stack trace, but I guess this is because successive 
> commit operations are taking longer and longer (not much memory for the 
> cache) and eventually one takes too long and this error occurs.
> 
>> 
>> I noticed that tomcat's connectionTimeout param was at the default (20sec.) 
>> so I considered increasing it to 10 minutes. Any advice on this?
> 
> I don't think this will hurt, so I agree that increasing this would be a good 
> idea.
> 
>> 
>> Also, since this error happened, I am unable to do anything with the 
>> repository except view the Contexts in Repository (via the workbench). When 
>> I try to clear the contexts to start over from scratch, it takes a very long 
>> time and then I end up getting:
>> 
>>  javax.servlet.ServletException: 
>> org.openrdf.repository.RepositoryException: java.io.EOFException
> 
> A full stack trace would be really useful here.
> 
>> 
>> At this point I am forced to kill the tomcat process and delete the 
>> repository forcibly.
>> 
> 
> It could be that OWLIM is still busily trying to commit a large transaction 
> with materialisation of inferences (lots of random index lookups), so killing 
> tomcat would quite possibly leave the storage files in an inconsistent 
> state.
> 
>> 
>> I then tried creating a repository using the sesame_owlim console, but I 
>> keep getting:
>> 
>>  ERROR: No template called BFWorks.ttl found in 
>> /storage/openrdf-sesame-console/templates
>> 
>> even though I have a BFWorks.ttl file in that directory.
> 
> Not sure about this one. I believe it is the client (not the server) that 
> needs to be able to load this template file. Is there a permissions problem? 
> Are you overriding the default location for loading template files?
> 
>> 
>> Any help/advice is appreciated.
>> 
>>  -Josh
> 
> All the best,
> barry
> 
>> 
>> On Mar 29, 2013, at 8:46 AM, Barry Bishop wrote:
>> 
>>> Hello Marek, Stefano,
>>> 
>>> There is a little bit of information here about how to load a lot of data 
>>> (the problems being that the Sesame workbench/browser will time out if it 
>>> takes too long and OWLIM uses a lot of memory if the transaction size is 
>>> too big):
>>> 
>>> https://confluence.ontotext.com/display/OWLIMv53/