> Brian, Frank, which versions are you running?

I'm running the same version as Brian.

tdb:  0.8.9
arq:  2.8.7
jena: 2.6.4

I tried disabling anti-virus - no difference. I tried bumping up the heap
to 3G - no difference.

I tried running without setting direct mode, but using Brian's change,
i.e., model.add() instead of model.read(). This runs slower than with both
Brian's change and direct mode, but faster than running with direct mode
without Brian's change:

Only direct mode:                   100,000 graphs in 2,080.58 sec
Only model.add():                   100,000 graphs in 1,246.25 sec
Both direct mode and model.add():   100,000 graphs in   812.30 sec

Brian, have you tried running it without setting direct mode? I'm
wondering if there's a problem with memory-mapped I/O on Windows?

Thanks,
Frank.
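For reference, here is a minimal, self-contained sketch of the two load
paths being compared above: parsing the RDF/XML stream straight into the
TDB named model (the "model.read()" variant) versus parsing into an
in-memory model first and then adding it (the "model.add()" variant,
Brian's change). It assumes the Jena 2.6.x / TDB 0.8.x API used in this
thread; TDB_DIR, getGraph(i) and the graph URI are placeholders standing
in for the corresponding parts of the test program quoted further down,
and direct vs. memory-mapped file mode is configured separately and is
not shown here.

[[
import java.io.InputStream;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.shared.Lock;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LoadPathSketch {
    // Placeholder: an empty directory for the TDB database.
    static final String TDB_DIR = "C:/temp/tdb";
    static final String GRAPH_BASE =
        "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/";

    public static void main(String[] args) throws Exception {
        Dataset dataset = TDBFactory.createDataset(TDB_DIR);

        for (int i = 1; i <= 100000; i++) {
            InputStream instream = getGraph(i);   // one RDF/XML graph (~100 triples)

            // "model.add()" variant: parse into a memory model outside the lock ...
            Model m = ModelFactory.createDefaultModel();
            m.read(instream, null);

            dataset.getLock().enterCriticalSection(Lock.WRITE);
            try {
                Model model = dataset.getNamedModel(GRAPH_BASE + i);
                // "model.read()" variant parses straight into the TDB model:
                //     model.read(instream, null);
                // "model.add()" variant adds the already-parsed memory model:
                model.add(m);
            } finally {
                dataset.getLock().leaveCriticalSection();
            }
            instream.close();

            if (i % 100 == 0)
                System.out.println(i + " at: " + new java.util.Date());
        }
    }

    // Placeholder: the real test program supplies an RDF/XML document here.
    static InputStream getGraph(int i) {
        throw new UnsupportedOperationException("supply the RDF/XML input for graph " + i);
    }
}
]]

In the timings above, the "model.add()" variant is the one that avoids
most of the per-read slowdown.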
Andy Seaborne <[email protected]> wrote on 03/28/2011 04:17:49 AM:

> On 27/03/11 18:56, Brian McBride wrote:
> > On 25/03/2011 18:25, Frank Budinsky wrote:
> > [...]
> >> One interesting difference between my slow running case and Brian's, is
> >> that mine always starts out pretty fast, but gradually slows down. Notice
> >> that it was taking about 2 seconds per 100 at the start, but it was up to
> >> 15 seconds at 39800. I didn't let it run to completion this time, but I
> >> remember from a previous run that it was taking about 2 min per 100 at
> >> the end. Any idea why it might slow down over time, like this, when not
> >> using direct mode?
> >>
> >> Anyway, the next thing I tried was to change the code as Brian suggested.
> >> That also had a big impact for me:
> >
> > I have a hypothesis for why.
> >
> > When ARP, the RDF/XML parser, finishes reading a file it issues a
> > finishedRead event to the event handler in the graph it is reading into.
> >
> > If the TDB default event handler then did something (like sync to disk?)
> > that the memory model does not, this could explain the difference in
> > performance. I have put a profiler on the test program and it reports
> > that the test program is spending a lot more time in
> > BlockManagerFile.force() when it is reading directly into TDB than when
> > it is going via a memory model. So there is some evidence that this is
> > what is happening.
> >
> > I haven't been able to track down the block manager code actually in use
> > as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
> > the top of his head whether this is plausible.
>
> s/block manager/event manager/
>
> Could be - the model.add(model) will go via the BulkUpdateHandler (I
> think). TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler
> for insertion.
>
> Could you try putting a break point in dataset.sync and see what the
> call stack is when it gets hit? That'll tell you who is causing the sync.
>
> There used to be (up to v 0.8.9? not in the last snapshot build) a sync
> wrapper that sync()ed every n'000 triples added. It's not in the
> development codebase. All hidden implicit syncs should now be removed.
> They were causing problems for a user who was tracking whether the DB
> on disk was dirty or not.
>
> Brian, Frank, which versions are you running?
>
> > as I'm having trouble checking ARQ out of SVN,
>
> What sort of trouble?
>
>      Andy
>
> >
> > Brian
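Following Andy's point that the hidden implicit syncs are being removed,
the caller is expected to decide when to flush to disk. Below is a minimal
sketch of doing that explicitly during a bulk load; it assumes
TDB.sync(Dataset) is available in the release being used, and the loop
body, location and sync interval are illustrative rather than part of the
original test program.

[[
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;

public class ExplicitSyncSketch {
    public static void main(String[] args) {
        // Placeholder location; point this at the TDB_DIR used by the test program.
        Dataset dataset = TDBFactory.createDataset("C:/temp/tdb");

        for (int i = 1; i <= 100000; i++) {
            // ... load graph i as in the test program ...

            // Flush on the caller's schedule (here every 10,000 graphs)
            // rather than relying on an implicit sync per file read.
            if (i % 10000 == 0)
                TDB.sync(dataset);
        }

        TDB.sync(dataset);   // final flush before exit
    }
}
]]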
> >> Max mem: 1,820M
> >> DIRECT mode
> >> log4j:WARN No appenders could be found for logger
> >> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> >> log4j:WARN Please initialize the log4j system properly.
> >>> Starting test: Fri Mar 25 09:56:03 PDT 2011
> >>> Initial number of indexed graphs: 0
> >> 100 at: Fri Mar 25 09:56:06 PDT 2011
> >> 200 at: Fri Mar 25 09:56:07 PDT 2011
> >> 300 at: Fri Mar 25 09:56:09 PDT 2011
> >> 400 at: Fri Mar 25 09:56:10 PDT 2011
> >> 500 at: Fri Mar 25 09:56:11 PDT 2011
> >> ...
> >> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> >> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> >> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> >> 100000 at: Fri Mar 25 10:09:29 PDT 2011
> >>> Done at: Fri Mar 25 10:09:35 PDT 2011
> >> 100,000 graphs in 812.30 sec
> >>
> >> With that change, I'm now only about 30% slower than Andy's number. Maybe
> >> that's attributable to Windows vs Linux or the hardware differences. I'm
> >> running it on:
> >>
> >> Intel Xeon E5345 2.33 GHz (2 processors)
> >> 8 GB RAM
> >> 300 GB HDD
> >> Windows 7 Enterprise SP1
> >>
> >> Does anybody know how a Xeon 5345 should compare to i5 or i7 processors,
> >> or how much difference there might be between Linux and Windows 7?
> >>
> >> Thanks again for your help.
> >>
> >> Frank
> >>
> >> Brian McBride <[email protected]> wrote on 03/25/2011 12:15:30 PM:
> >>
> >>> Hi Andy, Frank,
> >>>
> >>> On 25/03/2011 14:13, Andy Seaborne wrote:
> >>>> I ran today:
> >>>>
> >>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/
> >>>> reports/ReportOutOfMemoryManyGraphsTDB.java
> >>>
> >>> [[
> >>> I would very much appreciate it if others on this mailing list could
> >>> also give it a quick try. I'd like to know if it's just me (and my
> >>> colleagues), or is there some kind of pattern to explain this huge
> >>> difference.
> >>> ]]
> >>>
> >>> Having a Windows 7 box and a couple of spare minutes I tried running
> >>> Andy's version of the test code. I got performance akin to what Frank
> >>> reports.
> >>>
> >>> [[
> >>> Max mem: 1,820M
> >>> DIRECT mode
> >>> > Starting test: Fri Mar 25 16:06:36 GMT 2011
> >>> WARN  [main] (SetupTDB.java:755) - No BGP optimizer
> >>> > Initial number of indexed graphs: 0
> >>> 100 at: Fri Mar 25 16:07:17 GMT 2011
> >>> 200 at: Fri Mar 25 16:08:05 GMT 2011
> >>> 300 at: Fri Mar 25 16:08:56 GMT 2011
> >>> ...
> >>> ]]
> >>>
> >>> However, it runs a lot faster if, instead of reading the input stream
> >>> directly into TDB, I read it into a memory model first and then add that
> >>> model to a TDB model. Then I get performance similar to that which Andy
> >>> reports, i.e.
> >>>
> >>> [[
> >>> Max mem: 1,820M
> >>> DIRECT mode
> >>> > Starting test: Fri Mar 25 15:55:48 GMT 2011
> >>> WARN  [main] (SetupTDB.java:755) - No BGP optimizer
> >>> > Initial number of indexed graphs: 0
> >>> 100 at: Fri Mar 25 15:55:51 GMT 2011
> >>> 200 at: Fri Mar 25 15:55:51 GMT 2011
> >>> 300 at: Fri Mar 25 15:55:52 GMT 2011
> >>> ...
> >>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
> >>> > Done at: Fri Mar 25 16:04:55 GMT 2011
> >>> 100,000 graphs in 547.00 sec
> >>> ]]
> >>>
> >>> Frank, you might try replacing
> >>>
> >>> [[
> >>>     InputStream instream = getGraph(i); // the RDF graph to load
> >>>     dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>     try {
> >>>         Model model = dataset.getNamedModel(
> >>>             "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>         model.read(instream, null);
> >>>         //model.close();
> >>>     } finally { dataset.getLock().leaveCriticalSection() ; }
> >>>     if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>     instream.close();
> >>> }
> >>> ]]
> >>>
> >>> with
> >>>
> >>> [[
> >>>     InputStream instream = getGraph(i); // the RDF graph to load
> >>>     Model m = ModelFactory.createDefaultModel();
> >>>     m.read(instream, null);
> >>>
> >>>     dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>     try {
> >>>         Model model = dataset.getNamedModel(
> >>>             "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>         // model.read(instream, null);
> >>>         model.add(m);
> >>>         //model.close();
> >>>     } finally { dataset.getLock().leaveCriticalSection() ; }
> >>>     if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>     instream.close();
> >>> }
> >>> ]]
> >>>
> >>> I am running, I think, the latest Maven versions of Jena etc. on Windows 7
> >>> on an Intel i7 with 8GB. Tests were run in Eclipse.
> >>>
> >>> Windows 7 SP1
> >>> Jena 2.6.4
> >>> ARQ 2.8.7
> >>> TDB 0.8.9
> >>>
> >>> Hoping this might be helpful,
> >>> Brian
> >>>
> >>>> in Eclipse
> >>>> in direct mode.
> >>>>
> >>>> It has some configuration choices you might like to try.
> >>>>
> >>>> Max mem: 910M
> >>>> DIRECT mode
> >>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
> >>>>> Initial number of indexed graphs: 0
> >>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
> >>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
> >>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
> >>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
> >>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
> >>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
> >>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
> >>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
> >>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
> >>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
> >>>> ....
> >>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
> >>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
> >>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
> >>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
> >>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
> >>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
> >>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
> >>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
> >>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
> >>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
> >>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
> >>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
> >>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
> >>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
> >>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
> >>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
> >>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
> >>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
> >>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
> >>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
> >>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
> >>>> 100,000 graphs in 601.98 sec
> >>>>
> >>>> On 25/03/11 13:50, Frank Budinsky wrote:
> >>>>> Hi Andy and all,
> >>>>>
> >>>>> I finally managed to get a relatively powerful machine set up
> >>>>
> >>>> Details?
> >>>>
> >>>>> and I reran the test program I sent you, but unfortunately, it still
> >>>>> runs orders of magnitude slower than the numbers you got when you
> >>>>> tried it. The hardware I used this time is similar to what you are
> >>>>> using; the only significant difference is that it's running Windows 7
> >>>>> instead of Ubuntu. I know Linux is somewhat faster than Windows, but I
> >>>>> don't expect we can blame Microsoft for a change from 573.87 seconds
> >>>>> to about 4 hours :-)
> >>>>>
> >>>>> Are you sure that your numbers are correct? How big was the TDB
> >>>>> database on disk at the end of the run?
> >>>>
> >>>> 3.9G    DB1
> >>>>
> >>>>> Do you have any other ideas what may be wrong with my configuration?
> >>>>
> >>>> Windows server or desktop? (server is better at I/O)
> >>>> 64 bit?
> >>>> Is there a virus checker?
> >>>>
> >>>>> I would very much appreciate it if others on this mailing list could
> >>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>> difference.
> >>>>>
> >>>>> Here is the simple test program again (inlined this time, since Apache
> >>>>> seems to throw away attachments). To run it, just change the TDB_DIR
> >>>>> constant to some empty directory. The test program loads 100,000 data
> >>>>> graphs (about 100 triples each -> 10M triples total). It prints a
> >>>>> message on the console at every 100, so if it's taking seconds for
> >>>>> each println, you'll know very quickly that it will take hours to run,
> >>>>> instead of a few minutes, like Andy has seen.
> >>>>>
> >>>>> Thanks,
> >>>>> Frank.
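One small instrumentation tweak that may help anyone reproducing the
numbers above: since the symptom is a per-100 rate that degrades over the
run (about 2 seconds per 100 at the start, minutes per 100 near the end),
printing the elapsed time of each batch of 100 makes the slowdown easier
to quantify than wall-clock timestamps alone. A minimal sketch; the loop
body is a placeholder for the test program's load code and the class name
is illustrative.

[[
public class BatchTimingSketch {
    public static void main(String[] args) {
        long batchStart = System.currentTimeMillis();

        for (int i = 1; i <= 100000; i++) {
            // ... load graph i exactly as in the test program ...

            if (i % 100 == 0) {
                long now = System.currentTimeMillis();
                // A steadily growing value here confirms the gradual slowdown.
                System.out.println(i + " graphs, last 100 took "
                        + (now - batchStart) / 1000.0 + " sec");
                batchStart = now;
            }
        }
    }
}
]]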
