> Brian, Frank, which versions are you running?

I'm running the same version as Brian.

tdb:  0.8.9
arq:  2.8.7
jena: 2.6.4

I tried disabling anti-virus - no difference. I tried bumping up the heap
to 3G - no difference.

I tried running without setting direct mode, but using Brian's change,
i.e., model.add() instead of model.read(). This runs slower than with both
Brian's change and direct mode, but faster than running with direct mode
without Brian's change:

Only direct mode:                   100,000 graphs in 2,080.58 sec
Only model.add():                   100,000 graphs in 1,246.25 sec
Both direct mode and model.add():   100,000 graphs in   812.30 sec

Brian, have you tried running it without setting direct mode? I'm
wondering if there's a problem with memory-mapped I/O on Windows?

Thanks,
Frank.
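For reference, here is a minimal, self-contained sketch of the two load
paths being compared above: parsing the RDF/XML stream straight into the
TDB named model (the "model.read()" variant) versus parsing into an
in-memory model first and then adding it (the "model.add()" variant,
Brian's change). It assumes the Jena 2.6.x / TDB 0.8.x API used in this
thread; TDB_DIR, getGraph(i) and the graph URI are placeholders standing
in for the corresponding parts of the test program quoted further down,
and direct vs. memory-mapped file mode is configured separately and is
not shown here.

[[
import java.io.InputStream;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.shared.Lock;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LoadPathSketch {
    // Placeholder: an empty directory for the TDB database.
    static final String TDB_DIR = "C:/temp/tdb";
    static final String GRAPH_BASE =
        "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/";

    public static void main(String[] args) throws Exception {
        Dataset dataset = TDBFactory.createDataset(TDB_DIR);

        for (int i = 1; i <= 100000; i++) {
            InputStream instream = getGraph(i);   // one RDF/XML graph (~100 triples)

            // "model.add()" variant: parse into a memory model outside the lock ...
            Model m = ModelFactory.createDefaultModel();
            m.read(instream, null);

            dataset.getLock().enterCriticalSection(Lock.WRITE);
            try {
                Model model = dataset.getNamedModel(GRAPH_BASE + i);
                // "model.read()" variant parses straight into the TDB model:
                //     model.read(instream, null);
                // "model.add()" variant adds the already-parsed memory model:
                model.add(m);
            } finally {
                dataset.getLock().leaveCriticalSection();
            }
            instream.close();

            if (i % 100 == 0)
                System.out.println(i + " at: " + new java.util.Date());
        }
    }

    // Placeholder: the real test program supplies an RDF/XML document here.
    static InputStream getGraph(int i) {
        throw new UnsupportedOperationException("supply the RDF/XML input for graph " + i);
    }
}
]]

In the timings above, the "model.add()" variant is the one that avoids
most of the per-read slowdown.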
Andy Seaborne <[email protected]> wrote on 03/28/2011 04:17:49 AM:

> On 27/03/11 18:56, Brian McBride wrote:
> > On 25/03/2011 18:25, Frank Budinsky wrote:
> > [...]
> >> One interesting difference between my slow running case and Brian's, is
> >> that mine always starts out pretty fast, but gradually slows down. Notice
> >> that it was taking about 2 seconds per 100 at the start, but it was up to
> >> 15 seconds at 39800. I didn't let it run to completion this time, but I
> >> remember from a previous run that it was taking about 2 min per 100 at
> >> the end. Any idea why it might slow down over time, like this, when not
> >> using direct mode?
> >>
> >> Anyway, the next thing I tried was to change the code as Brian suggested.
> >> That also had a big impact for me:
> >
> > I have a hypothesis for why.
> >
> > When ARP, the RDF/XML parser, finishes reading a file it issues a
> > finishedRead event to the event handler in the graph it is reading into.
> >
> > If the TDB default event handler then did something (like sync to disk?)
> > that the memory model does not, this could explain the difference in
> > performance. I have put a profiler on the test program and it reports
> > that the test program is spending a lot more time in
> > BlockManagerFile.force() when it is reading directly into TDB than when
> > it is going via a memory model. So there is some evidence that this is
> > what is happening.
> >
> > I haven't been able to track down the block manager code actually in use
> > as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
> > the top of his head whether this is plausible.
>
> s/block manager/event manager/
>
> Could be - the model.add(model) will go via the BulkUpdateHandler (I
> think). TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler
> for insertion.
>
> Could you try putting a break point in dataset.sync and see what the
> call stack is when it gets hit? That'll tell you who is causing the sync.
>
> There used to be (up to v 0.8.9? not in the last snapshot build) a sync
> wrapper that sync()ed every n'000 triples added. It's not in the
> development codebase. All hidden implicit syncs should now be removed.
> They were causing problems for a user who was tracking whether the DB
> on disk was dirty or not.
>
> Brian, Frank, which versions are you running?
>
> > as I'm having trouble checking ARQ out of SVN,
>
> What sort of trouble?
>
>      Andy
>
> >
> > Brian
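Following Andy's point that the hidden implicit syncs are being removed,
the caller is expected to decide when to flush to disk. Below is a minimal
sketch of doing that explicitly during a bulk load; it assumes
TDB.sync(Dataset) is available in the release being used, and the loop
body, location and sync interval are illustrative rather than part of the
original test program.

[[
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;

public class ExplicitSyncSketch {
    public static void main(String[] args) {
        // Placeholder location; point this at the TDB_DIR used by the test program.
        Dataset dataset = TDBFactory.createDataset("C:/temp/tdb");

        for (int i = 1; i <= 100000; i++) {
            // ... load graph i as in the test program ...

            // Flush on the caller's schedule (here every 10,000 graphs)
            // rather than relying on an implicit sync per file read.
            if (i % 10000 == 0)
                TDB.sync(dataset);
        }

        TDB.sync(dataset);   // final flush before exit
    }
}
]]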
> >> Max mem: 1,820M
> >> DIRECT mode
> >> log4j:WARN No appenders could be found for logger
> >> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> >> log4j:WARN Please initialize the log4j system properly.
> >>> Starting test: Fri Mar 25 09:56:03 PDT 2011
> >>> Initial number of indexed graphs: 0
> >> 100 at: Fri Mar 25 09:56:06 PDT 2011
> >> 200 at: Fri Mar 25 09:56:07 PDT 2011
> >> 300 at: Fri Mar 25 09:56:09 PDT 2011
> >> 400 at: Fri Mar 25 09:56:10 PDT 2011
> >> 500 at: Fri Mar 25 09:56:11 PDT 2011
> >> ...
> >> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> >> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> >> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> >> 100000 at: Fri Mar 25 10:09:29 PDT 2011
> >>> Done at: Fri Mar 25 10:09:35 PDT 2011
> >> 100,000 graphs in 812.30 sec
> >>
> >> With that change, I'm now only about 30% slower than Andy's number. Maybe
> >> that's attributable to Windows vs Linux or the hardware differences. I'm
> >> running it on:
> >>
> >> Intel Xeon E5345 2.33 GHz (2 processors)
> >> 8 GB RAM
> >> 300 GB HDD
> >> Windows 7 Enterprise SP1
> >>
> >> Does anybody know how a Xeon 5345 should compare to i5 or i7 processors,
> >> or how much difference there might be between Linux and Windows 7?
> >>
> >> Thanks again for your help.
> >>
> >> Frank
> >>
> >> Brian McBride <[email protected]> wrote on 03/25/2011 12:15:30 PM:
> >>
> >>> Hi Andy, Frank,
> >>>
> >>> On 25/03/2011 14:13, Andy Seaborne wrote:
> >>>> I ran today:
> >>>>
> >>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/
> >>>> reports/ReportOutOfMemoryManyGraphsTDB.java
> >>>
> >>> [[
> >>> I would very much appreciate it if others on this mailing list could
> >>> also give it a quick try. I'd like to know if it's just me (and my
> >>> colleagues), or is there some kind of pattern to explain this huge
> >>> difference.
> >>> ]]
> >>>
> >>> Having a Windows 7 box and a couple of spare minutes I tried running
> >>> Andy's version of the test code. I got performance akin to what Frank
> >>> reports.
> >>>
> >>> [[
> >>> Max mem: 1,820M
> >>> DIRECT mode
> >>> > Starting test: Fri Mar 25 16:06:36 GMT 2011
> >>> WARN  [main] (SetupTDB.java:755) - No BGP optimizer
> >>> > Initial number of indexed graphs: 0
> >>> 100 at: Fri Mar 25 16:07:17 GMT 2011
> >>> 200 at: Fri Mar 25 16:08:05 GMT 2011
> >>> 300 at: Fri Mar 25 16:08:56 GMT 2011
> >>> ...
> >>> ]]
> >>>
> >>> However, it runs a lot faster if, instead of reading the input stream
> >>> directly into TDB, I read it into a memory model first and then add that
> >>> model to a TDB model. Then I get performance similar to that which Andy
> >>> reports, i.e.
> >>>
> >>> [[
> >>> Max mem: 1,820M
> >>> DIRECT mode
> >>> > Starting test: Fri Mar 25 15:55:48 GMT 2011
> >>> WARN  [main] (SetupTDB.java:755) - No BGP optimizer
> >>> > Initial number of indexed graphs: 0
> >>> 100 at: Fri Mar 25 15:55:51 GMT 2011
> >>> 200 at: Fri Mar 25 15:55:51 GMT 2011
> >>> 300 at: Fri Mar 25 15:55:52 GMT 2011
> >>> ...
> >>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
> >>> > Done at: Fri Mar 25 16:04:55 GMT 2011
> >>> 100,000 graphs in 547.00 sec
> >>> ]]
> >>>
> >>> Frank, you might try replacing
> >>>
> >>> [[
> >>>     InputStream instream = getGraph(i); // the RDF graph to load
> >>>     dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>     try {
> >>>         Model model = dataset.getNamedModel(
> >>>             "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>         model.read(instream, null);
> >>>         //model.close();
> >>>     } finally { dataset.getLock().leaveCriticalSection() ; }
> >>>     if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>     instream.close();
> >>> }
> >>> ]]
> >>>
> >>> with
> >>>
> >>> [[
> >>>     InputStream instream = getGraph(i); // the RDF graph to load
> >>>     Model m = ModelFactory.createDefaultModel();
> >>>     m.read(instream, null);
> >>>
> >>>     dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>     try {
> >>>         Model model = dataset.getNamedModel(
> >>>             "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>         // model.read(instream, null);
> >>>         model.add(m);
> >>>         //model.close();
> >>>     } finally { dataset.getLock().leaveCriticalSection() ; }
> >>>     if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>     instream.close();
> >>> }
> >>> ]]
> >>>
> >>> I am running, I think, the latest Maven versions of Jena etc. on Windows 7
> >>> on an Intel i7 with 8GB. Tests were run in Eclipse.
> >>>
> >>> Windows 7 SP1
> >>> Jena 2.6.4
> >>> ARQ 2.8.7
> >>> TDB 0.8.9
> >>>
> >>> Hoping this might be helpful,
> >>> Brian
> >>>
> >>>> in Eclipse
> >>>> in direct mode.
> >>>>
> >>>> It has some configuration choices you might like to try.
> >>>>
> >>>> Max mem: 910M
> >>>> DIRECT mode
> >>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
> >>>>> Initial number of indexed graphs: 0
> >>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
> >>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
> >>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
> >>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
> >>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
> >>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
> >>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
> >>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
> >>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
> >>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
> >>>> ....
> >>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
> >>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
> >>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
> >>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
> >>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
> >>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
> >>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
> >>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
> >>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
> >>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
> >>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
> >>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
> >>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
> >>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
> >>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
> >>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
> >>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
> >>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
> >>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
> >>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
> >>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
> >>>> 100,000 graphs in 601.98 sec
> >>>>
> >>>> On 25/03/11 13:50, Frank Budinsky wrote:
> >>>>> Hi Andy and all,
> >>>>>
> >>>>> I finally managed to get a relatively powerful machine set up
> >>>>
> >>>> Details?
> >>>>
> >>>>> and I reran the test program I sent you, but unfortunately, it still
> >>>>> runs orders of magnitude slower than the numbers you got when you
> >>>>> tried it. The hardware I used this time is similar to what you are
> >>>>> using; the only significant difference is that it's running Windows 7
> >>>>> instead of Ubuntu. I know Linux is somewhat faster than Windows, but I
> >>>>> don't expect we can blame Microsoft for a change from 573.87 seconds
> >>>>> to about 4 hours :-)
> >>>>>
> >>>>> Are you sure that your numbers are correct? How big was the TDB
> >>>>> database on disk at the end of the run?
> >>>>
> >>>> 3.9G    DB1
> >>>>
> >>>>> Do you have any other ideas what may be wrong with my configuration?
> >>>>
> >>>> Windows server or desktop? (server is better at I/O)
> >>>> 64 bit?
> >>>> Is there a virus checker?
> >>>>
> >>>>> I would very much appreciate it if others on this mailing list could
> >>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>> difference.
> >>>>>
> >>>>> Here is the simple test program again (inlined this time, since Apache
> >>>>> seems to throw away attachments). To run it, just change the TDB_DIR
> >>>>> constant to some empty directory. The test program loads 100,000 data
> >>>>> graphs (about 100 triples each -> 10M triples total). It prints a
> >>>>> message on the console at every 100, so if it's taking seconds for
> >>>>> each println, you'll know very quickly that it will take hours to run,
> >>>>> instead of a few minutes, like Andy has seen.
> >>>>>
> >>>>> Thanks,
> >>>>> Frank.
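One small instrumentation tweak that may help anyone reproducing the
numbers above: since the symptom is a per-100 rate that degrades over the
run (about 2 seconds per 100 at the start, minutes per 100 near the end),
printing the elapsed time of each batch of 100 makes the slowdown easier
to quantify than wall-clock timestamps alone. A minimal sketch; the loop
body is a placeholder for the test program's load code and the class name
is illustrative.

[[
public class BatchTimingSketch {
    public static void main(String[] args) {
        long batchStart = System.currentTimeMillis();

        for (int i = 1; i <= 100000; i++) {
            // ... load graph i exactly as in the test program ...

            if (i % 100 == 0) {
                long now = System.currentTimeMillis();
                // A steadily growing value here confirms the gradual slowdown.
                System.out.println(i + " graphs, last 100 took "
                        + (now - batchStart) / 1000.0 + " sec");
                batchStart = now;
            }
        }
    }
}
]]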
