Re: [Owlim-discussion] Questions about strange triple insertion rate changes. Sorting does not seem to help.

2011-07-28 Thread Jerven Bolleman

Hi Ivan,

I think it's a bit different. There is a much higher uniqueness factor in 
the 650 million dataset, i.e. lots of small "sub-graphs" without 
connections between them. In comparison, the 330 million dataset is much 
more likely to have the same 'po' shared between different 's'.


i.e.

uniref:1 :member uniparc:B
uniref:2 :member uniparc:B
uniref:3 :member uniparc:B

(On average I expect that one 'po' set is shared between 2.5 's'.)

The problem is that I can't easily sort the data such that 'o' or 'p' 
end up grouped close to each other. In the example above I know that 
there will be a distance in the input file of roughly 100 million 
triples between uniref:1 and uniref:2. In other words, the input order 
is currently spo instead of pso. So the sort that I implemented on 
1,000,000-statement chunks won't help, because it won't move uniref:1 
and uniref:2 closer together. So the only way to beat this is to 
increase random-access speed. However, I don't understand why the cache 
size has such an effect on this dataset. I would have guessed that the 
7 GB tupleIndexMemory would have been 'enough'. Yet I also don't 
understand why 15 GB would work, as the pos.index and pso.index on disk 
are 19.7 GB combined.


Regards,
Jerven

On 07/28/2011 04:51 PM, Ivan Peikov wrote:

Hi Jerven,

You have arrived at the correct conclusion: it's all about the order of the
input triples. An optimal (or near-optimal) order of the input triples is
achieved when they are sorted by predicate and then by subject or object.

When data is inserted in this optimal order, chances are that OWLIM's cache
will be used optimally and page hits will prevail, which leads to less
disk I/O and cache fill-up.

Having said that, I would guess that the 650M dataset, or parts of it, already
exhibits close-to-optimal order and thus uses less cache than the 330M
dataset. Therefore adding more cache memory doesn't make any difference in
this scenario (650M, nicely ordered).


Hope that explains the mystery!


Cheers,
Ivan


On Thursday 28 July 2011 17:06:45 Jerven Bolleman wrote:

Dear OWLIM developers,

So this morning I started a new loading run, this time with a larger
tupleIndexMemory setting: 15 GB instead of the earlier 7 GB. This loads
the slow dataset in about 6 hours 15 minutes. The funny thing is that we
have 330 million triples that are much slower to insert than another set
of 650 million triples, leading me to conclude that, even without
reasoning, the kind of triples one inserts, and in what pattern, can
make a very large difference to loading performance. The questions now
are: what kind of pattern would be optimal for loading performance, and
why does the size of the tupleIndexMemory make such a large difference
for the 330 million triple dataset but nearly none for the 650 million
triple dataset?

Regards,
Jerven



___
OWLIM-discussion mailing list
OWLIM-discussion@ontotext.com
http://ontotext.com/mailman/listinfo/owlim-discussion


Re: [Owlim-discussion] Questions about strange triple insertion rate changes. Sorting does not seem to help.

2011-07-28 Thread Ivan Peikov
Hi Jerven,

You have arrived at the correct conclusion: it's all about the order of the
input triples. An optimal (or near-optimal) order of the input triples is
achieved when they are sorted by predicate and then by subject or object.

When data is inserted in this optimal order, chances are that OWLIM's cache
will be used optimally and page hits will prevail, which leads to less
disk I/O and cache fill-up.

Having said that, I would guess that the 650M dataset, or parts of it, already
exhibits close-to-optimal order and thus uses less cache than the 330M
dataset. Therefore adding more cache memory doesn't make any difference in
this scenario (650M, nicely ordered).


Hope that explains the mystery!


Cheers,
Ivan


On Thursday 28 July 2011 17:06:45 Jerven Bolleman wrote:
> Dear OWLIM developers,
> 
> So this morning I started a new loading run, this time with a larger
> tupleIndexMemory setting: 15 GB instead of the earlier 7 GB. This loads
> the slow dataset in about 6 hours 15 minutes. The funny thing is that
> we have 330 million triples that are much slower to insert than another
> set of 650 million triples, leading me to conclude that, even without
> reasoning, the kind of triples one inserts, and in what pattern, can
> make a very large difference to loading performance. The questions now
> are: what kind of pattern would be optimal for loading performance, and
> why does the size of the tupleIndexMemory make such a large difference
> for the 330 million triple dataset but nearly none for the 650 million
> triple dataset?
> 
> Regards,
> Jerven
> 


Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-27 Thread Jerven Bolleman

Hi Barry,

Don't feel too bad. I actually made a big mistake: in changing the queue
from a LinkedBlockingQueue to a PriorityBlockingQueue, I forgot that the
priority queue is unbounded.

Which is a problem! I changed that and will see how it goes.
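For anyone hitting the same trap: PriorityBlockingQueue ignores any capacity hint, so back-pressure has to be added externally. A minimal sketch (not the actual loader code; the class name is made up for illustration) using a Semaphore:

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.Semaphore;

// Sketch: PriorityBlockingQueue is unbounded, so a producer thread can run
// far ahead of the consumer and exhaust the heap. A Semaphore restores the
// back-pressure that a bounded LinkedBlockingQueue gives for free.
class BoundedPriorityQueue<E extends Comparable<E>> {
    private final PriorityBlockingQueue<E> queue = new PriorityBlockingQueue<E>();
    private final Semaphore slots;

    BoundedPriorityQueue(int capacity) {
        this.slots = new Semaphore(capacity);
    }

    void put(E e) throws InterruptedException {
        slots.acquire();    // blocks the producer when the queue is "full"
        queue.put(e);       // never blocks; the bound is enforced above
    }

    E take() throws InterruptedException {
        E e = queue.take(); // blocks the consumer when the queue is empty
        slots.release();    // frees a slot for the producer
        return e;
    }

    int size() {
        return queue.size();
    }
}
```

The producer thread blocks in put() once `capacity` statements are in flight, which is exactly the behavior the unbounded queue lost.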

Regards,
Jerven
On 07/27/2011 10:36 AM, Barry Bishop wrote:

Hi Jerven,

Seems I gave you a bum-steer. Sorry about that.

Nevertheless, we look forward to any other insight you can bring.

Thanks,
barry

On 27/07/11 10:30, Jerven Bolleman wrote:

Hi Damyan,

You were right; in the end, sorting the statements did not change a
single thing. I will keep looking for the source of the problem, which
I now suspect is outside OWLIM itself.

The memory behavior is a bit suspicious and I will have a look at that.

Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:

Hi Jerven,

I'm not sure it would work; it all depends on the way you compare the
statements. But no matter how you actually do that comparison, the order
in which we index them will differ from it, since ours is based on the
value of the entity ID assigned to a particular RDF node, which reflects
the chronological order in which the nodes first appear in the data set ...

regards,
Damyan Ognyanov
Ontotext AD

On 26.7.2011 г. 15:08 ч., Jerven Bolleman wrote:

Hi Damyan,

Thanks for this information. I have changed the way I generate the
triples for insertion into OWLIM.

I read my raw RDF source in one thread, generating statements. The
statements pass via a blocking queue to a second thread, which inserts
them into OWLIM via a Sail connection.

I changed the blocking queue into a priority blocking queue with the
following sorting: first sort on predicate, then sort on subject.

Is that the ordering that you would suggest, or does another one make
more sense?

The run is going to take quite a few hours, so I will let you know what
comes of it tomorrow morning.

Regards,
Jerven










Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-27 Thread Barry Bishop

Hi Jerven,

Seems I gave you a bum-steer. Sorry about that.

Nevertheless, we look forward to any other insight you can bring.

Thanks,
barry

On 27/07/11 10:30, Jerven Bolleman wrote:

Hi Damyan,

You were right; in the end, sorting the statements did not change a
single thing. I will keep looking for the source of the problem, which
I now suspect is outside OWLIM itself.

The memory behavior is a bit suspicious and I will have a look at that.

Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:

Hi Jerven,

I'm not sure it would work; it all depends on the way you compare the
statements. But no matter how you actually do that comparison, the order
in which we index them will differ from it, since ours is based on the
value of the entity ID assigned to a particular RDF node, which reflects
the chronological order in which the nodes first appear in the data set ...

regards,
Damyan Ognyanov
Ontotext AD

On 26.7.2011 г. 15:08 ч., Jerven Bolleman wrote:

Hi Damyan,

Thanks for this information. I have changed the way I generate the
triples for insertion into OWLIM.

I read my raw RDF source in one thread, generating statements. The
statements pass via a blocking queue to a second thread, which inserts
them into OWLIM via a Sail connection.

I changed the blocking queue into a priority blocking queue with the
following sorting: first sort on predicate, then sort on subject.

Is that the ordering that you would suggest, or does another one make
more sense?

The run is going to take quite a few hours, so I will let you know what
comes of it tomorrow morning.

Regards,
Jerven










Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-27 Thread Jerven Bolleman

Hi Damyan,

You were right; in the end, sorting the statements did not change a
single thing. I will keep looking for the source of the problem, which
I now suspect is outside OWLIM itself.

The memory behavior is a bit suspicious and I will have a look at that.

Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:

Hi Jerven,

I'm not sure it would work; it all depends on the way you compare the
statements. But no matter how you actually do that comparison, the order
in which we index them will differ from it, since ours is based on the
value of the entity ID assigned to a particular RDF node, which reflects
the chronological order in which the nodes first appear in the data set ...

regards,
Damyan Ognyanov
Ontotext AD

On 26.7.2011 г. 15:08 ч., Jerven Bolleman wrote:

Hi Damyan,

Thanks for this information. I have changed the way I generate the
triples for insertion into OWLIM.

I read my raw RDF source in one thread, generating statements. The
statements pass via a blocking queue to a second thread, which inserts
them into OWLIM via a Sail connection.

I changed the blocking queue into a priority blocking queue with the
following sorting: first sort on predicate, then sort on subject.

Is that the ordering that you would suggest, or does another one make
more sense?

The run is going to take quite a few hours, so I will let you know what
comes of it tomorrow morning.

Regards,
Jerven









Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-26 Thread Jerven Bolleman

Hi Damyan,

This sorting will tend to group statements with the same
predicate-subject closer together in the insert order. When we access a
node that already exists, it has to be looked up in the B-tree; the idea
is to benefit as much as possible from that lookup.

In any case it seems to have a beneficial effect on loading time. The
fast dataset seems even faster: inserting the first 6 million entries
(records, not triples) takes 51 minutes with the sorting approach but
took 85 minutes with the unsorted approach, all other settings being
equal. Of course, I can't yet be sure that this holds over time; it
might well be an artifact of how our data is ordered in the first place.
I will know more tomorrow.
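For reference, the predicate-then-subject ordering amounts to a comparator like the following sketch (a plain triple class stands in for the Sesame Statement interface here, purely for illustration):

```java
import java.util.Comparator;

// Illustrative stand-in for an RDF statement; the real loader compares
// Sesame Statement objects instead of this class.
class Triple {
    final String s, p, o;
    Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

class PredicateSubjectOrder implements Comparator<Triple> {
    // Sort on predicate first, then on subject, so statements sharing a
    // predicate-subject prefix end up adjacent in the insert order.
    public int compare(Triple a, Triple b) {
        int c = a.p.compareTo(b.p);
        return (c != 0) ? c : a.s.compareTo(b.s);
    }
}
```

Handing such a comparator to the priority queue (or sorting each chunk with it) is what groups all statements for one predicate together, subject-sorted within the group.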


Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:

Hi Jerven,

not sure if it would work - it all depends on the way you compare the
statements. But no matter how you actually do that comparison, the order
in which we index them will be different to that, since we use one that
is based on the value of the entity ID assigned to a particular RDF node
and it actually reflects the chronological order in which the nodes
appear in the data set ...

regards,
Damyan Ognyanov
Ontotext AD

On 26.7.2011 г. 15:08 ч., Jerven Bolleman wrote:

Hi Damyan,

Thanks for this information. I changed the way that I generate the
triples for insertion by owlim.

I read my raw rdf source in one thread generating statements.
The statements pass via blocking queue into a second thread which
inserts them via a sail connection into owlim.

I changed the blocking queue into a priority blocking queue.
With the following sorting. First sort on predicate then sort on subject.

Is that the ordering that you would suggest or does an other one make
more sense?

The run is going to take quite a few hours so will let you know what
comes out of this tomorrow morning.

Regards,
Jerven



Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-26 Thread Damyan Ognyanov

Hi Jerven,

A lot of data to look at ... I haven't examined it very closely at this 
point, but I'd like to give you some comments on a possible explanation 
of the behavior you are observing ...


First, the statement storage for each of the main indices (po, pso) is 
organized as a B+ tree, where each component of a statement (subject, 
predicate, etc.) is a 40-bit integer representing the ID of the RDF 
node.


Those IDs are assigned when a new node first appears in the data as a 
component of some RDF statement. So the new nodes introduced by your 
import process get larger integer values, and because those are most 
probably either subjects or objects of the new statements, those 
statements are always put at the end of a particular subsection of the 
B-trees (depending on the ordering used).


Because of the caching, it is more probable that the pages about to be 
altered are found in the cache (the new data sits at the end of some 
sub-sequence, thanks to the natural ordering by ID we use), so the 
whole process runs smoothly.


When the data you are inserting starts to reuse already existing nodes, 
the statements you are storing are no longer at the end of such a 
subsection of the tree; they tend to be distributed randomly across the 
whole tree, and the cache becomes inefficient, mainly because the page 
where you are about to place a new statement is most probably not in 
the cache and needs to be read from disk. Yes, it is cached at that 
point, but you will probably not need it again after that single 
operation, so the seek/read cost becomes an issue and the data import 
slows down greatly.


One way to reduce that cost is to tweak the order of the statements you 
are importing (this needs some pre-processing to rearrange them in a 
particular way so that the caching starts working efficiently).


For instance, you may add the statements in batches organized by a 
particular predicate; that way, (most of) the whole tree subsection 
that holds all the statements with that predicate will be cached, so 
even if the statements within it are a bit random, they will always 
fall within that section.
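A possible sketch of that pre-processing, grouping statements into per-predicate batches before commit (the class and method names here are hypothetical illustration, not OWLIM API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: buffer statements grouped by predicate so that each
// commit touches one B-tree subsection at a time, keeping it hot in the cache.
class PredicateBatcher {
    private final Map<String, List<String[]>> batches =
            new LinkedHashMap<String, List<String[]>>();

    void add(String s, String p, String o) {
        List<String[]> batch = batches.get(p);
        if (batch == null) {
            batch = new ArrayList<String[]>();
            batches.put(p, batch);
        }
        batch.add(new String[] { s, p, o });
    }

    // One list per predicate, in first-seen order; the caller would insert
    // and commit each list through its Sail connection (not shown here).
    List<List<String[]>> drain() {
        List<List<String[]>> out = new ArrayList<List<String[]>>(batches.values());
        batches.clear();
        return out;
    }
}
```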


In any case, I will look more closely at the data you sent to see if 
something else related to that sudden drop in storage throughput comes 
to mind.

Many thanks for the detailed info you sent,

Regards,
Damyan Ognyanov
Ontotext AD.

On 26.7.2011 г. 09:49 ч., Jerven Bolleman wrote:

Dear Owlim developers,

I am trying to load all the UniProt data on a 64 GB RAM machine. I have 
a case where I am very pleased with the loading speed of a billion 
triples, but then it just flatlines. I have included a set of graphs 
which show the relevant behavior and statistics on this machine; maybe 
you could have a look at them. You might think that this is due to 
performance dropping off after loading a billion triples, but I have 
the same problem the other way round. See the attachment "For 
discussion.png", where performance first flatlines, taking 32 hours to 
insert 300 million triples, before recovering and loading a billion 
triples in 5 hours. This is with an empty ruleset.


What could cause this behaviour?

Regards,
Jerven



Some data from owlim.properties as written after a sync (insertion of 
http://www.ontotext.com/owlim/system#flush)


Uniform owlim image
NumberOfStatements=1039890206
NumberOfExplicitStatements=1039890206
NumberOfEntities=403958694
VersionId=40
BNodeCounter=238


And all the relevant statistics.

 Original Message 
Subject: JMX values for ptx-serv01.vital-it.ch at 2011-07-26
Date: Tue, 26 Jul 2011 06:38:24 + (GMT)
From: nore...@uniprot.org
To: uuw_...@isb-sib.ch



Started on 2011-07-25 04:02
JVM Java HotSpot(TM) 64-Bit Server VM Sun Microsystems Inc.(20.1-b02)
Runtime name 6...@ptx-serv01.vital-it.ch
JVM Arguments
-Dwar=/data/sparql_uuw/expasy4j-sparql/dist/expasy4j-sparql.war
-Djava.util.logging.config.file=tomcat/conf/logging.properties
-Duser.timezone=GMT
-Xms45G
-Xmx55G
-XX:+HeapDumpOnOutOfMemoryError
-Djava.io.tmpdir=/data/tmp
-Djava.awt.headless=true
-Dexpasy_sparql_entityIndexsize=1147483647
-Dexpasy_sparql_cacheMemory=7G
-Dexpasy_sparql_tupleIndexMemory=5G
-Dshutdown.port=8081
-Dhttp.port=8080
-Dsecure.port=8082
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=6969
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dexpasy_sparql_commitSize=100
-Duniprot.singlethreaded=true
-Dexpasy_sparql_path=/data
-Dexpasy_sparql_ruleset=empty
-Dexpasy_sparql_journaling=false
-Dexpasy_sparql_repositoryFragments=1
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Dcatalina.base=tomcat/
-Dcatalina.home=tomcat/
Running 

Re: [Owlim-discussion] Questions about strange triple insertion rate changes

2011-07-26 Thread Barry Bishop

Hi Jerven,

Thanks for the incredibly detailed information, great stuff. 
Unfortunately, we have two developers on holiday and one sick, so we 
are rather understaffed this week. As soon as time allows, we will have 
a close look at what's going on here.

I can't understand why the different datasets load at such radically 
different speeds.


Quick question: What kind of disks do you have HDDs or SSDs?

This might make a difference depending on how sorted your datasets are. 
If the input statements are highly disordered, then more disk seeks 
will be required, which causes bigger problems for HDDs than for SSDs. 
There are two main indices with different orderings (PSO, POS), so this 
can never be fully optimised, but sorting the input statements by 
predicate-subject for each commit has been shown to improve things.
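One way to apply that per commit is to buffer each chunk of statements and sort it before flushing; a hedged sketch with plain string triples (SortedCommitBuffer and the flush callback are illustrative names, not OWLIM API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch: collect statements up to commitSize, sort the chunk by predicate
// then subject, and hand it to a flush callback that would do the actual
// Sail inserts and commit (omitted here).
class SortedCommitBuffer {
    interface Flush { void commit(List<String[]> chunk); }

    private final int commitSize;
    private final Flush flush;
    private final List<String[]> buffer = new ArrayList<String[]>();

    SortedCommitBuffer(int commitSize, Flush flush) {
        this.commitSize = commitSize;
        this.flush = flush;
    }

    void add(String s, String p, String o) {
        buffer.add(new String[] { s, p, o });
        if (buffer.size() >= commitSize) {
            flushNow();
        }
    }

    void flushNow() {
        // Predicate first (index 1), then subject (index 0).
        Collections.sort(buffer, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                int c = a[1].compareTo(b[1]);
                return (c != 0) ? c : a[0].compareTo(b[0]);
            }
        });
        flush.commit(new ArrayList<String[]>(buffer));
        buffer.clear();
    }
}
```

Only each commit-sized chunk gets ordered this way; as noted earlier in the thread, that does not help when related statements are hundreds of millions of triples apart in the input.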


Talk to you soon,
barry

--
Barry Bishop
OWLIM Product Manager
Ontotext AD
Tel: +43 650 2000 237
email: barry.bis...@ontotext.com
www.ontotext.com


On 26/07/11 08:49, Jerven Bolleman wrote:

Dear Owlim developers,

I am trying to load all the UniProt data on a 64 GB RAM machine. I have 
a case where I am very pleased with the loading speed of a billion 
triples, but then it just flatlines. I have included a set of graphs 
which show the relevant behavior and statistics on this machine; maybe 
you could have a look at them. You might think that this is due to 
performance dropping off after loading a billion triples, but I have 
the same problem the other way round. See the attachment "For 
discussion.png", where performance first flatlines, taking 32 hours to 
insert 300 million triples, before recovering and loading a billion 
triples in 5 hours. This is with an empty ruleset.


What could cause this behaviour?

Regards,
Jerven



Some data from owlim.properties as written after a sync (insertion of 
http://www.ontotext.com/owlim/system#flush)


Uniform owlim image
NumberOfStatements=1039890206
NumberOfExplicitStatements=1039890206
NumberOfEntities=403958694
VersionId=40
BNodeCounter=238


And all the relevant statistics.

 Original Message 
Subject: JMX values for ptx-serv01.vital-it.ch at 2011-07-26
Date: Tue, 26 Jul 2011 06:38:24 + (GMT)
From: nore...@uniprot.org
To: uuw_...@isb-sib.ch



Started on 2011-07-25 04:02
JVM Java HotSpot(TM) 64-Bit Server VM Sun Microsystems Inc.(20.1-b02)
Runtime name 6...@ptx-serv01.vital-it.ch
JVM Arguments
-Dwar=/data/sparql_uuw/expasy4j-sparql/dist/expasy4j-sparql.war
-Djava.util.logging.config.file=tomcat/conf/logging.properties
-Duser.timezone=GMT
-Xms45G
-Xmx55G
-XX:+HeapDumpOnOutOfMemoryError
-Djava.io.tmpdir=/data/tmp
-Djava.awt.headless=true
-Dexpasy_sparql_entityIndexsize=1147483647
-Dexpasy_sparql_cacheMemory=7G
-Dexpasy_sparql_tupleIndexMemory=5G
-Dshutdown.port=8081
-Dhttp.port=8080
-Dsecure.port=8082
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=6969
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dexpasy_sparql_commitSize=100
-Duniprot.singlethreaded=true
-Dexpasy_sparql_path=/data
-Dexpasy_sparql_ruleset=empty
-Dexpasy_sparql_journaling=false
-Dexpasy_sparql_repositoryFragments=1
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Dcatalina.base=tomcat/
-Dcatalina.home=tomcat/
Running on Linux 2.6.18-238.9.1.el5
Number of processors 16
Total memory 43564793856
Free memory in JVM 23332798048
Used memory in JVM 43564793856
Uptime 52571417
System properties
java.vm.version 20.1-b02
expasy_sparql_entityIdSize 40
java.vendor.url http://java.sun.com/
sun.jnu.encoding UTF-8
java.vm.info mixed mode
user.dir /data/sparql_uuw/expasy4j-sparql
java.awt.headless true
sun.cpu.isalist
java.awt.graphicsenv sun.awt.X11GraphicsEnvironment
com.sun.management.jmxremote.authenticate false
sun.os.patch.level unknown
catalina.useNaming true
uniprot_internal false
com.sun.management.jmxremote.ssl false
java.io.tmpdir /data/tmp
user.home /home/jbollema
expasy_sparql_repositoryFragments 1
java.awt.printerjob sun.print.PSPrinterJob
java.version 1.6.0_26
file.encoding.pkg sun.io
package.access sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
expasy_sparql_enablePredicateList false
java.vendor.url.bug http://java.sun.com/cgi-bin/bugreport.cgi
file.encoding UTF-8
line.separator
sun.java.command org.apache.catalina.startup.Bootstrap start
expasy_sparql_entityIndexsize 1147483647
java.vm.specification.vendor Sun Microsystems Inc.
java.util.logging.manager org.apache.juli.ClassLoaderLogManager
tomcat.util.buf.StringCache.byte.enabled true
catalina.home /data/sparql_uuw/expasy4j-sparql/tomcat
java.vm.vendor Sun Microsystems Inc.