Re: [Owlim-discussion] Questions about strange triple insertion rate changes. Sorting does not seem to help.
Hi Ivan,

I think it's a bit different. There is a much higher uniqueness factor in the 650 million dataset, i.e. lots of small "sub-graphs" without connections between them. In comparison, the 330 million dataset is much more likely to have the same 'po' shared between different 's', e.g.

uniref:1 :member uniparc:B
uniref:2 :member uniparc:B
uniref:3 :member uniparc:B

(On average I expect that one 'po' set is shared between 2.5 's'.)

The problem is I can't really easily sort the data such that 'o' or 'p' end up being grouped close to each other. In the example above I know that there will be a distance in the input file of +/- 100 million triples between uniref:1 and uniref:2. In other words, the input order is currently spo instead of pso. So the sort that I implemented on 1,000,000-statement chunks won't help, because it won't move uniref:1 and uniref:2 closer together. So the only way to beat this is to increase random access speed.

However, I don't understand why the cache size has such an effect on this dataset. I would have guessed that the 7GB tupleIndexMemory would have been 'enough'. Yet I don't understand why 15GB would work, as the pos.index and pso.index on disk are 19.7 GB in combination.

Regards,
Jerven

On 07/28/2011 04:51 PM, Ivan Peikov wrote:
Hi Jerven, You have arrived at the correct conclusion that it's all about the input triple order. An optimal (or near-optimal) order of the input triples would be achieved if they are sorted by predicate and then by subject or object. When this optimal insert order is provided, chances are that OWLIM's cache will be used optimally and page hits will prevail, which leads to less disk I/O and cache fill-up. Having said that, I would guess that the 650M dataset (or parts of it) already exhibits close-to-optimal order and thus uses less cache than the 330M dataset. Therefore adding more cache memory doesn't make any difference in this scenario (650M, nicely ordered). Hope that explains the mystery!
Cheers, Ivan

On Thursday 28 July 2011 17:06:45 Jerven Bolleman wrote:
Dear OWLIM developers, So this morning I started a new loading run, this time with a larger tupleIndexMemory setting: 15GB in this case instead of the earlier 7GB. This loads the slow dataset in about 6 hours 15 minutes. The funny thing is that we have 330 million triples that are much slower to insert than another set of 650 million triples, leading me to conclude that even when not using reasoning, the kind of triples one inserts and in what kind of pattern can make a very large difference to the loading performance. The question now is: what kind of pattern would be optimal for loading performance? And why does the size of the tupleIndexMemory make such a large difference for the 330 million triple dataset but near no difference for the 650 million triple dataset?

Regards,
Jerven

___
OWLIM-discussion mailing list
OWLIM-discussion@ontotext.com
http://ontotext.com/mailman/listinfo/owlim-discussion
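The predicate-then-subject target order Ivan describes can be sketched for an N-Triples file, where each line holds one "&lt;s&gt; &lt;p&gt; &lt;o&gt; ." statement. This is only an illustrative in-memory sort, not OWLIM code; at the 650M-triple scale discussed here an external merge sort (or a disk-based sort utility) would be needed instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sort a (small) N-Triples file into predicate-subject-object order.
// N-Triples puts one "<s> <p> <o> ." statement per line, so we can key
// on the whitespace-separated fields directly.
public class PsoSort {
    // Field 0 = subject, 1 = predicate, 2 = object (plus trailing " .").
    static String field(String line, int i) {
        return line.split(" ", 3)[i];
    }

    public static List<String> sortPso(List<String> lines) {
        return lines.stream()
                .sorted(Comparator.comparing((String l) -> field(l, 1)) // predicate first
                        .thenComparing(l -> field(l, 0))                // then subject
                        .thenComparing(l -> field(l, 2)))               // then object
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Usage: PsoSort <input.nt> <output.nt> -- only viable for files
        // that fit in memory.
        List<String> sorted = sortPso(Files.readAllLines(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), sorted);
    }
}
```

Applied to Jerven's example, this would bring the three uniref:N :member uniparc:B statements adjacent even though they are ~100 million triples apart in spo order, which is exactly what the 1,000,000-statement chunk sort cannot achieve.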
Re: [Owlim-discussion] Questions about strange triple insertion rate changes. Sorting does not seem to help.
Hi Jerven,

You have arrived at the correct conclusion that it's all about the input triple order. An optimal (or near-optimal) order of the input triples would be achieved if they are sorted by predicate and then by subject or object. When this optimal insert order is provided, chances are that OWLIM's cache will be used optimally and page hits will prevail, which leads to less disk I/O and cache fill-up. Having said that, I would guess that the 650M dataset (or parts of it) already exhibits close-to-optimal order and thus uses less cache than the 330M dataset. Therefore adding more cache memory doesn't make any difference in this scenario (650M, nicely ordered). Hope that explains the mystery!

Cheers,
Ivan

On Thursday 28 July 2011 17:06:45 Jerven Bolleman wrote:
> Dear OWLIM developers,
>
> So this morning I started a new loading run, this time with a larger
> tupleIndexMemory setting: 15GB in this case instead of the earlier 7GB.
> This loads the slow dataset in about 6 hours 15 minutes. The funny
> thing is that we have 330 million triples that are much slower to insert
> than another set of 650 million triples, leading me to conclude that
> even when not using reasoning, the kind of triples one inserts and in
> what kind of pattern can make a very large difference to the loading
> performance. The question now is: what kind of pattern would be optimal
> for loading performance? And why does the size of the tupleIndexMemory
> make such a large difference for the 330 million triple dataset but near
> no difference for the 650 million triple dataset?
>
> Regards,
> Jerven
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Barry,

Don't feel too bad. I actually made a big mistake: in changing the queue from a LinkedBlockingQueue to a PriorityBlockingQueue I forgot that the priority queue is unbounded, which is a problem! I changed that and will see how it goes.

Regards,
Jerven

On 07/27/2011 10:36 AM, Barry Bishop wrote:
Hi Jerven, Seems I gave you a bum steer. Sorry about that. Nevertheless, we look forward to any other insight you can bring. Thanks, barry

On 27/07/11 10:30, Jerven Bolleman wrote:
Hi Damyan, You were right: in the end, sorting the statements did not change a single thing. I will keep looking for the source of the problems, which I now suspect is outside OWLIM itself. The memory behavior is a bit suspicious and I will have a look at that. Regards, Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:
Hi Jerven, I am not sure if it would work - it all depends on the way you compare the statements. But no matter how you actually do that comparison, the order in which we index them will be different, since we use one based on the value of the entity ID assigned to a particular RDF node, which reflects the chronological order in which the nodes appear in the dataset. Regards, Damyan Ognyanov, Ontotext AD

On 26.7.2011 15:08, Jerven Bolleman wrote:
Hi Damyan, Thanks for this information. I changed the way that I generate the triples for insertion by OWLIM. I read my raw RDF source in one thread, generating statements. The statements pass via a blocking queue into a second thread, which inserts them via a sail connection into OWLIM. I changed the blocking queue into a priority blocking queue with the following sorting: first sort on predicate, then sort on subject. Is that the ordering that you would suggest, or does another one make more sense? The run is going to take quite a few hours, so I will let you know what comes out of this tomorrow morning.
Regards,
Jerven
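The pitfall Jerven ran into is that PriorityBlockingQueue is always unbounded (its constructor argument is only an initial capacity, not a limit), so a fast producer thread can fill the heap. One common way to bound it is to pair the queue with a Semaphore; this is a minimal sketch under that assumption, with hypothetical class and method names, not code from Jerven's loader:

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.Semaphore;

// A PriorityBlockingQueue bounded via a Semaphore: the producer must
// acquire a permit before inserting, and the consumer releases one after
// removing, so at most 'capacity' elements are ever queued.
public class BoundedPriorityQueue<E> {
    private final PriorityBlockingQueue<E> queue = new PriorityBlockingQueue<>();
    private final Semaphore permits;

    public BoundedPriorityQueue(int capacity) {
        this.permits = new Semaphore(capacity);
    }

    // Blocks while the queue already holds 'capacity' elements.
    public void put(E e) throws InterruptedException {
        permits.acquire();
        queue.put(e);
    }

    // Removes the smallest element, freeing a slot for the producer.
    public E take() throws InterruptedException {
        E e = queue.take();
        permits.release();
        return e;
    }

    public int size() {
        return queue.size();
    }
}
```

With this wrapper, the statement-generating thread blocks once the queue is full instead of outrunning the OWLIM insertion thread, while the consumer still drains elements in predicate-subject priority order.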
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Jerven,

Seems I gave you a bum steer. Sorry about that. Nevertheless, we look forward to any other insight you can bring.

Thanks,
barry

On 27/07/11 10:30, Jerven Bolleman wrote:
Hi Damyan, You were right: in the end, sorting the statements did not change a single thing. I will keep looking for the source of the problems, which I now suspect is outside OWLIM itself. The memory behavior is a bit suspicious and I will have a look at that. Regards, Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:
Hi Jerven, I am not sure if it would work - it all depends on the way you compare the statements. But no matter how you actually do that comparison, the order in which we index them will be different, since we use one based on the value of the entity ID assigned to a particular RDF node, which reflects the chronological order in which the nodes appear in the dataset. Regards, Damyan Ognyanov, Ontotext AD

On 26.7.2011 15:08, Jerven Bolleman wrote:
Hi Damyan, Thanks for this information. I changed the way that I generate the triples for insertion by OWLIM. I read my raw RDF source in one thread, generating statements. The statements pass via a blocking queue into a second thread, which inserts them via a sail connection into OWLIM. I changed the blocking queue into a priority blocking queue with the following sorting: first sort on predicate, then sort on subject. Is that the ordering that you would suggest, or does another one make more sense? The run is going to take quite a few hours, so I will let you know what comes out of this tomorrow morning.

Regards,
Jerven
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Damyan,

You were right: in the end, sorting the statements did not change a single thing. I will keep looking for the source of the problems, which I now suspect is outside OWLIM itself. The memory behavior is a bit suspicious and I will have a look at that.

Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:
Hi Jerven, I am not sure if it would work - it all depends on the way you compare the statements. But no matter how you actually do that comparison, the order in which we index them will be different, since we use one based on the value of the entity ID assigned to a particular RDF node, which reflects the chronological order in which the nodes appear in the dataset. Regards, Damyan Ognyanov, Ontotext AD

On 26.7.2011 15:08, Jerven Bolleman wrote:
Hi Damyan, Thanks for this information. I changed the way that I generate the triples for insertion by OWLIM. I read my raw RDF source in one thread, generating statements. The statements pass via a blocking queue into a second thread, which inserts them via a sail connection into OWLIM. I changed the blocking queue into a priority blocking queue with the following sorting: first sort on predicate, then sort on subject. Is that the ordering that you would suggest, or does another one make more sense? The run is going to take quite a few hours, so I will let you know what comes out of this tomorrow morning.

Regards,
Jerven
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Damyan,

This sorting will tend to group statements with the same predicate and subject closer together in the insert order. The idea is that when we access a node that already exists, it needs to be looked up in the B-tree, and we want to benefit as much as possible from that lookup. In any case it seems to have a beneficial effect on loading time: the fast dataset seems even faster. Inserting the first 6 million entries (records, not triples) takes 51 minutes with the sorting approach but took 85 minutes with the non-sorted approach, all other settings being equal. Of course I can't be sure that this holds over time, and it might well be an artifact of how our data is ordered in the first place. I will know more tomorrow.

Regards,
Jerven

On 07/26/2011 02:47 PM, Damyan Ognyanov wrote:
Hi Jerven, I am not sure if it would work - it all depends on the way you compare the statements. But no matter how you actually do that comparison, the order in which we index them will be different, since we use one based on the value of the entity ID assigned to a particular RDF node, which reflects the chronological order in which the nodes appear in the dataset. Regards, Damyan Ognyanov, Ontotext AD

On 26.7.2011 15:08, Jerven Bolleman wrote:
Hi Damyan, Thanks for this information. I changed the way that I generate the triples for insertion by OWLIM. I read my raw RDF source in one thread, generating statements. The statements pass via a blocking queue into a second thread, which inserts them via a sail connection into OWLIM. I changed the blocking queue into a priority blocking queue with the following sorting: first sort on predicate, then sort on subject. Is that the ordering that you would suggest, or does another one make more sense? The run is going to take quite a few hours, so I will let you know what comes out of this tomorrow morning.
Regards,
Jerven
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Jerven,

That is a lot of data to look at... Though I haven't examined it very closely at this point, I'd like to give you some comments on a possible explanation for the behavior you are observing.

First, the statement storage for each of the main indices (POS, PSO) is organized as a B+ tree, where each component of the statement (subject, predicate, etc.) is a 40-bit integer representing the ID of the RDF node. Those IDs are assigned when a new node first appears in the data as a component of some RDF statement. So the new nodes introduced by your import process get greater integer values, and because those are most probably either subjects or objects of the new statements, those statements are always placed at the end of a particular subsection of the B-trees (depending on the ordering used). Because of the caching, it is then more probable that the pages about to be altered are found in the cache (the new data sits at the end of some sub-sequence thanks to the natural ordering by ID), so the whole process runs smoothly.

When the data you are importing starts to reuse already existing nodes, the statements you are storing are no longer at the end of such a subsection of the tree and tend to be randomly distributed across the whole tree. The cache then becomes inefficient, mainly because the page where you are about to place the new statement is most probably not in it and needs to be read from disk. Yes, it is cached at that point, but you will probably not need it again after that single operation, so the seek/read cost becomes an issue and the data import slows down greatly. One way to reduce that cost is to tweak the order of the statements you are importing (this needs some pre-processing to rearrange them in a particular way so that the caching starts working efficiently).
For instance, you may start adding the statements in batches organized by a particular predicate - that way the whole tree subsection that holds all the statements with that predicate will be cached (most of it), so even if the statements are a bit random they will always be part of that section. In any case, I will look closely at the data you sent to see if something else related to that sudden drop in storage throughput pops into my mind. Many thanks for the detailed info you sent.

Regards,
Damyan Ognyanov
Ontotext AD

On 26.7.2011 09:49, Jerven Bolleman wrote:
Dear OWLIM developers, I am trying to load all the UniProt data on a 64GB RAM machine. I have a case where I am very pleased with the loading speed of a billion triples, but then it just flatlines. I have included a set of graphs which show the relevant behavior and statistics on this machine. Maybe you could have a look at it. You might think that this is due to performance dropping off after loading a billion triples, but I have the same problem the other way round (see attachment "For discussion.png"), where performance first flatlines, taking 32 hours to insert 300 million triples, before recovering and loading a billion triples in 5 hours. This is with an empty ruleset. What could cause this behaviour?

Regards, Jerven

Some data from owlim.properties as written after a sync (insertion of http://www.ontotext.com/owlim/system#flush):

Uniform owlim image
NumberOfStatements=1039890206
NumberOfExplicitStatements=1039890206
NumberOfEntities=403958694
VersionId=40
BNodeCounter=238

And all the relevant statistics.
Original Message
Subject: JMX values for ptx-serv01.vital-it.ch at 2011-07-26
Date: Tue, 26 Jul 2011 06:38:24 + (GMT)
From: nore...@uniprot.org
To: uuw_...@isb-sib.ch

Started on 2011-07-25 04:02
JVM: Java HotSpot(TM) 64-Bit Server VM, Sun Microsystems Inc. (20.1-b02)
Runtime name: 6...@ptx-serv01.vital-it.ch
JVM Arguments:
-Dwar=/data/sparql_uuw/expasy4j-sparql/dist/expasy4j-sparql.war
-Djava.util.logging.config.file=tomcat/conf/logging.properties
-Duser.timezone=GMT
-Xms45G
-Xmx55G
-XX:+HeapDumpOnOutOfMemoryError
-Djava.io.tmpdir=/data/tmp
-Djava.awt.headless=true
-Dexpasy_sparql_entityIndexsize=1147483647
-Dexpasy_sparql_cacheMemory=7G
-Dexpasy_sparql_tupleIndexMemory=5G
-Dshutdown.port=8081
-Dhttp.port=8080
-Dsecure.port=8082
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=6969
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dexpasy_sparql_commitSize=100
-Duniprot.singlethreaded=true
-Dexpasy_sparql_path=/data
-Dexpasy_sparql_ruleset=empty
-Dexpasy_sparql_journaling=false
-Dexpasy_sparql_repositoryFragments=1
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Dcatalina.base=tomcat/
-Dcatalina.home=tomcat/
Running
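Damyan's suggestion of committing per-predicate batches could be sketched as follows. The Triple record and class names here are hypothetical stand-ins (a real loader would carry Sesame/Sail Statement objects), and the sketch assumes a recent JDK with record support:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Group pending statements by predicate so each committed batch touches
// one subsection of the POS/PSO trees, keeping that subsection's pages
// hot in the cache instead of scattering writes across the whole tree.
public class PredicateBatcher {
    public record Triple(String s, String p, String o) {}

    // Returns one batch per predicate; TreeMap yields the batches in
    // predicate order, matching the on-disk index ordering.
    public static List<List<Triple>> batchByPredicate(List<Triple> triples) {
        Map<String, List<Triple>> byPredicate = triples.stream()
                .collect(Collectors.groupingBy(
                        Triple::p, TreeMap::new, Collectors.toList()));
        return new ArrayList<>(byPredicate.values());
    }
}
```

Each inner list would then be inserted and committed as one unit, so statements sharing a predicate arrive together even when the raw input interleaves predicates.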
Re: [Owlim-discussion] Questions about strange triple insertion rate changes
Hi Jerven,

Thanks for the incredibly detailed information, great stuff. Unfortunately, we have two developers on holiday and one sick, so we are rather understaffed this week. As soon as time allows we will have a close look at what's going on here. I can't understand why the different datasets load at such radically different speeds.

Quick question: what kind of disks do you have, HDDs or SSDs? This might make a difference depending on how sorted your datasets are. If the input statements are highly disordered then more disk seeks will be required, which causes bigger problems for HDDs than for SSDs. There are two main indices with different orderings (PSO, POS), so this can never be fully optimised, but sorting input statements by predicate-subject for each commit has been shown to improve things.

Talk to you soon,
barry

--
Barry Bishop
OWLIM Product Manager
Ontotext AD
Tel: +43 650 2000 237
email: barry.bis...@ontotext.com
www.ontotext.com

On 26/07/11 08:49, Jerven Bolleman wrote:
Dear OWLIM developers, I am trying to load all the UniProt data on a 64GB RAM machine. I have a case where I am very pleased with the loading speed of a billion triples, but then it just flatlines. I have included a set of graphs which show the relevant behavior and statistics on this machine. Maybe you could have a look at it. You might think that this is due to performance dropping off after loading a billion triples, but I have the same problem the other way round (see attachment "For discussion.png"), where performance first flatlines, taking 32 hours to insert 300 million triples, before recovering and loading a billion triples in 5 hours. This is with an empty ruleset. What could cause this behaviour?
Regards, Jerven

Some data from owlim.properties as written after a sync (insertion of http://www.ontotext.com/owlim/system#flush):

Uniform owlim image
NumberOfStatements=1039890206
NumberOfExplicitStatements=1039890206
NumberOfEntities=403958694
VersionId=40
BNodeCounter=238

And all the relevant statistics.

Original Message
Subject: JMX values for ptx-serv01.vital-it.ch at 2011-07-26
Date: Tue, 26 Jul 2011 06:38:24 + (GMT)
From: nore...@uniprot.org
To: uuw_...@isb-sib.ch

Started on 2011-07-25 04:02
JVM: Java HotSpot(TM) 64-Bit Server VM, Sun Microsystems Inc. (20.1-b02)
Runtime name: 6...@ptx-serv01.vital-it.ch
JVM Arguments:
-Dwar=/data/sparql_uuw/expasy4j-sparql/dist/expasy4j-sparql.war
-Djava.util.logging.config.file=tomcat/conf/logging.properties
-Duser.timezone=GMT
-Xms45G
-Xmx55G
-XX:+HeapDumpOnOutOfMemoryError
-Djava.io.tmpdir=/data/tmp
-Djava.awt.headless=true
-Dexpasy_sparql_entityIndexsize=1147483647
-Dexpasy_sparql_cacheMemory=7G
-Dexpasy_sparql_tupleIndexMemory=5G
-Dshutdown.port=8081
-Dhttp.port=8080
-Dsecure.port=8082
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=6969
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dexpasy_sparql_commitSize=100
-Duniprot.singlethreaded=true
-Dexpasy_sparql_path=/data
-Dexpasy_sparql_ruleset=empty
-Dexpasy_sparql_journaling=false
-Dexpasy_sparql_repositoryFragments=1
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Dcatalina.base=tomcat/
-Dcatalina.home=tomcat/

Running on Linux 2.6.18-238.9.1.el5
Number of processors: 16
Total memory: 43564793856
Free memory in JVM: 23332798048
Used memory in JVM: 43564793856
Uptime: 52571417

System properties:
java.vm.version = 20.1-b02
expasy_sparql_entityIdSize = 40
java.vendor.url = http://java.sun.com/
sun.jnu.encoding = UTF-8
java.vm.info = mixed mode
user.dir = /data/sparql_uuw/expasy4j-sparql
java.awt.headless = true
sun.cpu.isalist =
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
com.sun.management.jmxremote.authenticate = false
sun.os.patch.level = unknown
catalina.useNaming = true
uniprot_internal = false
com.sun.management.jmxremote.ssl = false
java.io.tmpdir = /data/tmp
user.home = /home/jbollema
expasy_sparql_repositoryFragments = 1
java.awt.printerjob = sun.print.PSPrinterJob
java.version = 1.6.0_26
file.encoding.pkg = sun.io
package.access = sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
expasy_sparql_enablePredicateList = false
java.vendor.url.bug = http://java.sun.com/cgi-bin/bugreport.cgi
file.encoding = UTF-8
line.separator =
sun.java.command = org.apache.catalina.startup.Bootstrap start
expasy_sparql_entityIndexsize = 1147483647
java.vm.specification.vendor = Sun Microsystems Inc.
java.util.logging.manager = org.apache.juli.ClassLoaderLogManager
tomcat.util.buf.StringCache.byte.enabled = true
catalina.home = /data/sparql_uuw/expasy4j-sparql/tomcat
java.vm.vendor = Sun Microsystems Inc.