Hi,
apologies for not being able to provide more details on this.

At the moment I do not have the sources of this program (yes, sad, but
true), so I am not 100% sure of what's going on.
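However, while I was executing this Java program I noticed that the
/tmp/DataBag- files were just a few KB each (i.e. someone has set the
spill-to-disk threshold too low).
I think this is causing the OutOfMemoryError below: with so many small
spill files, the final merge presumably opens a buffered reader on every
one of them at the same time: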

java.lang.OutOfMemoryError: Java heap space
        at org.openjena.atlas.io.CharStreamBuffered.<init>(CharStreamBuffered.java:55)
        at org.openjena.atlas.io.PeekReader.make(PeekReader.java:83)
        at org.openjena.atlas.io.PeekReader.make(PeekReader.java:72)
        at org.openjena.atlas.io.PeekReader.makeUTF8(PeekReader.java:98)
        at org.openjena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:53)
        at com.hp.hpl.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:79)
        at org.openjena.riot.SerializationFactoryFinder$1.createDeserializer(SerializationFactoryFinder.java:60)
        at org.openjena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:208)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterSort$SortedBindingIterator.initializeIterator(QueryIterSort.java:104)
        at org.openjena.atlas.iterator.IteratorDelayedInitialization.init(IteratorDelayedInitialization.java:37)
        at org.openjena.atlas.iterator.IteratorDelayedInitialization.hasNext(IteratorDelayedInitialization.java:47)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:65)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
        at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
    ...

To avoid this situation we could put a limit on the number of files
we merge at the same time, say 100 for example. If there are more
than 100 /tmp/DataBag- files to merge, we merge the first 100 into a
single file, then the next 100, and so on, until at the last step we
have fewer than 100 files to merge. I did something similar in the
preMerge() method here:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/src/main/java/org/apache/jena/tdbloader2/MultiThreadedSortedDataBag.java
Ignore the multi-threading bit; we could just take preMerge().
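
To make the idea concrete, here is a minimal sketch of a bounded fan-in
merge. It is not the code from MultiThreadedSortedDataBag and not the
Jena implementation: the class name BoundedFanInMerge, the methods
mergeBatch/preMerge/writeRun/readRun and the maxFanIn parameter are all
made up for illustration. It also assumes each spill file is plain text
with one record per line, already sorted lexicographically, whereas the
real DataBag files hold serialized bindings ordered by the query's
comparator.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge sorted spill files with a bounded fan-in. */
final class BoundedFanInMerge {

    /** One open spill file plus the line currently at its head. */
    private static final class Run {
        final BufferedReader reader;
        String head;
        Run(BufferedReader reader, String head) { this.reader = reader; this.head = head; }
    }

    /** Merges the given sorted files into one new sorted temp file, then deletes the inputs. */
    static Path mergeBatch(List<Path> batch) {
        List<BufferedReader> opened = new ArrayList<>();
        try {
            Path out = Files.createTempFile("premerge-", ".tmp");
            // Min-heap keyed on each run's current head line.
            PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
            try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                for (Path p : batch) {
                    BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                    opened.add(r);
                    String first = r.readLine();
                    if (first != null) heap.add(new Run(r, first));
                }
                while (!heap.isEmpty()) {
                    Run run = heap.poll();
                    writer.write(run.head);
                    writer.newLine();
                    run.head = run.reader.readLine();
                    if (run.head != null) heap.add(run);
                }
            } finally {
                for (BufferedReader r : opened) r.close();
            }
            for (Path p : batch) Files.delete(p);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Merges in batches of at most maxFanIn files until at most maxFanIn files remain. */
    static List<Path> preMerge(List<Path> runs, int maxFanIn) {
        if (maxFanIn < 2) throw new IllegalArgumentException("maxFanIn must be at least 2");
        List<Path> current = new ArrayList<>(runs);
        while (current.size() > maxFanIn) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFanIn) {
                List<Path> batch = current.subList(i, Math.min(i + maxFanIn, current.size()));
                // A leftover single file needs no merging; carry it to the next pass.
                next.add(batch.size() == 1 ? batch.get(0) : mergeBatch(batch));
            }
            current = next;
        }
        return current;
    }

    /** Demo helper: writes the given (already sorted) lines to a new temp file. */
    static Path writeRun(List<String> lines) {
        try {
            Path p = Files.createTempFile("run-", ".tmp");
            Files.write(p, lines, StandardCharsets.UTF_8);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: reads a whole file back as a list of lines. */
    static List<String> readRun(Path p) {
        try {
            return Files.readAllLines(p, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> runs = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            runs.add(writeRun(Arrays.asList("a" + i, "b" + i, "c" + i)));
        }
        // Never more than 3 files open at once, in the pre-merge or in the final merge.
        Path result = mergeBatch(preMerge(runs, 3));
        System.out.println(readRun(result));
        Files.delete(result);
    }
}
```

The trade-off in this sketch is extra passes over the data in exchange
for never holding more than maxFanIn readers (and their buffers) open at
the same time.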

What do you think?

Paolo
