Hi Paolo,

I think you're right that pre-merging is needed to avoid holding too many resources open at once (file handles and reader buffers), and your stack trace seems to support this: the OutOfMemoryError is thrown while allocating a reader buffer (CharStreamBuffered.<init>) from SortedDataBag.iterator(). The Hadoop developers also seem to agree with you that 100 is a good fan-out size [1].

I've created JENA-157 to track this issue. The preMerge() method you wrote looks pretty good.

-Stephen

[1] http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/InternalSortedBag.java?view=markup

On Sun, Nov 6, 2011 at 10:35 AM, Paolo Castagna <[email protected]> wrote:
> Hi,
> apologies for not being able to provide more details on this.
>
> At the moment I do not have the sources of this (yes, sad, but true),
> therefore I am not 100% sure of what's going on.
> However, while I was executing this Java program I noticed the files
> /tmp/DataBag- were just a few KB each (i.e. someone has set the spill
> on disk threshold too low).
> I think this is causing the OutOfMemoryError:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.openjena.atlas.io.CharStreamBuffered.<init>(CharStreamBuffered.java:55)
>     at org.openjena.atlas.io.PeekReader.make(PeekReader.java:83)
>     at org.openjena.atlas.io.PeekReader.make(PeekReader.java:72)
>     at org.openjena.atlas.io.PeekReader.makeUTF8(PeekReader.java:98)
>     at org.openjena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:53)
>     at com.hp.hpl.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:79)
>     at org.openjena.riot.SerializationFactoryFinder$1.createDeserializer(SerializationFactoryFinder.java:60)
>     at org.openjena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:208)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterSort$SortedBindingIterator.initializeIterator(QueryIterSort.java:104)
>     at org.openjena.atlas.iterator.IteratorDelayedInitialization.init(IteratorDelayedInitialization.java:37)
>     at org.openjena.atlas.iterator.IteratorDelayedInitialization.hasNext(IteratorDelayedInitialization.java:47)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:65)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>     at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
>     ...
>
> To avoid this situation we might have a limit on the number of files
> we merge at the same time, say 100 for example... so, if there are
> more than 100 /tmp/DataBag- files to merge we merge the first 100
> first, and so on... until at the last step we have fewer than 100
> files to merge. I did something similar in the preMerge() method here:
> https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/src/main/java/org/apache/jena/tdbloader2/MultiThreadedSortedDataBag.java
> Ignore the multi threading bit... we could just take the preMerge().
>
> What do you think?
>
> Paolo
>
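
For illustration only, here is a rough sketch of the bounded fan-out idea discussed in this thread. It is not the preMerge() from MultiThreadedSortedDataBag and not Jena code: the class name PreMergeSketch, the methods mergeBatch() and preMerge(), and the fanOut parameter are all made up for the example, and it assumes each spill file is a plain text file holding one already-sorted record per line (the real SortedDataBag serializes bindings instead). The point it tries to show is that no more than fanOut readers are ever open at the same time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names come from Jena.
final class PreMergeSketch {

    /** One open spill file plus the line currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader, String head) {
            this.reader = reader;
            this.head = head;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** k-way merge of already-sorted files into one new sorted temp file. */
    public static Path mergeBatch(List<Path> batch) throws IOException {
        Path out = Files.createTempFile("PreMerge-", ".tmp");
        PriorityQueue<Run> queue = new PriorityQueue<>();
        List<BufferedReader> opened = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // Open one reader per file in the batch; this is the bounded resource.
            for (Path p : batch) {
                BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                opened.add(r);
                String first = r.readLine();
                if (first != null) {
                    queue.add(new Run(r, first));
                }
            }
            // Repeatedly emit the smallest head line and refill from the same file.
            while (!queue.isEmpty()) {
                Run run = queue.poll();
                writer.write(run.head);
                writer.newLine();
                run.head = run.reader.readLine();
                if (run.head != null) {
                    queue.add(run);
                }
            }
        } finally {
            for (BufferedReader r : opened) {
                r.close();
            }
        }
        return out;
    }

    /**
     * Merges spill files in batches of fanOut until at most fanOut files remain,
     * so the final merge never needs more than fanOut open readers.
     * Assumption of this sketch: the input spill files may be deleted once merged.
     */
    public static List<Path> preMerge(List<Path> spillFiles, int fanOut) throws IOException {
        if (fanOut < 2) {
            throw new IllegalArgumentException("fanOut must be at least 2");
        }
        List<Path> files = new ArrayList<>(spillFiles);
        while (files.size() > fanOut) {
            List<Path> batch = new ArrayList<>(files.subList(0, fanOut));
            files.subList(0, fanOut).clear();
            files.add(mergeBatch(batch));   // the merged run rejoins the queue of files
            for (Path p : batch) {
                Files.delete(p);            // merged inputs are no longer needed
            }
        }
        return files;
    }
}
```

With a fan-out of 100, calling preMerge(spillFiles, 100) would leave at most 100 files for the final merging iterator, whatever the spill threshold was. Each pass rewrites the batched data once, so a very low spill threshold still costs extra I/O; the sketch only bounds the number of handles and buffers held at once.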
