Re: QueryIterSort and OutOfMemoryError

Stephen Allen Sun, 06 Nov 2011 15:40:20 -0800

Hi Paolo,

I think you're right that pre-merging is needed to avoid using too
many resources at once (file handles and reader buffers), and your
stack trace seems to indicate this.  The Hadoop developers also seem to
agree with you that 100 is a good fan-out size [1].


I've created JENA-157 to track this issue.  The preMerge() method you
wrote looks pretty good.

-Stephen

[1] http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/InternalSortedBag.java?view=markup


On Sun, Nov 6, 2011 at 10:35 AM, Paolo Castagna
<[email protected]> wrote:
> Hi,
> apologies for not being able to provide more details on this.
>
> At the moment I do not have the sources of this (yes, sad, but true),
> therefore I am not 100% sure of what's going on.
> However, while I was executing this Java program I noticed the files
> /tmp/DataBag- were just a few KB each (i.e. someone has set the
> spill-to-disk threshold too low).
> I think this is causing the OutOfMemoryError:
>
> java.lang.OutOfMemoryError: Java heap space
>        at org.openjena.atlas.io.CharStreamBuffered.<init>(CharStreamBuffered.java:55)
>        at org.openjena.atlas.io.PeekReader.make(PeekReader.java:83)
>        at org.openjena.atlas.io.PeekReader.make(PeekReader.java:72)
>        at org.openjena.atlas.io.PeekReader.makeUTF8(PeekReader.java:98)
>        at org.openjena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:53)
>        at com.hp.hpl.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:79)
>        at org.openjena.riot.SerializationFactoryFinder$1.createDeserializer(SerializationFactoryFinder.java:60)
>        at org.openjena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:208)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterSort$SortedBindingIterator.initializeIterator(QueryIterSort.java:104)
>        at org.openjena.atlas.iterator.IteratorDelayedInitialization.init(IteratorDelayedInitialization.java:37)
>        at org.openjena.atlas.iterator.IteratorDelayedInitialization.hasNext(IteratorDelayedInitialization.java:47)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:65)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>        at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>        at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
>    ...
>
> To avoid this situation we could limit the number of files we merge
> at the same time, say to 100: if there are more than 100
> /tmp/DataBag- files to merge, we merge the first 100, then the next
> 100, and so on, until at the last step we have fewer than 100 files
> to merge. I did something similar in the preMerge() method here:
> https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/src/main/java/org/apache/jena/tdbloader2/MultiThreadedSortedDataBag.java
> Ignore the multi threading bit... we could just take the preMerge().
>
> What do you think?
>
> Paolo
>
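
For anyone following the thread, here is a rough sketch of the bounded
fan-in idea discussed above. It is not the preMerge() from the linked
MultiThreadedSortedDataBag nor anything that exists in ARQ: the class
name BoundedFanInMerge, the method names and the maxFanIn parameter
are all made up for illustration, and sorted in-memory lists stand in
for the spilled /tmp/DataBag- files so the example stays
self-contained. The point is only that each merge pass keeps at most
maxFanIn runs open at once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch only: sorted in-memory lists stand in for the spilled files. */
final class BoundedFanInMerge {

    /** One open run plus the element currently at its head. */
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(Iterator<T> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Plain k-way merge: one cursor (think: one file handle and buffer) per run. */
    static <T> List<T> merge(List<List<T>> runs, Comparator<? super T> cmp) {
        PriorityQueue<Cursor<T>> heap =
            new PriorityQueue<>((a, b) -> cmp.compare(a.head, b.head));
        for (List<T> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor<>(run.iterator()));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();      // run with the smallest head
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();     // advance and re-queue the run
                heap.add(c);
            }
        }
        return out;
    }

    /** Merge groups of at most maxFanIn runs until at most maxFanIn runs remain. */
    static <T> List<List<T>> preMerge(List<List<T>> runs, int maxFanIn,
                                      Comparator<? super T> cmp) {
        if (maxFanIn < 2) {
            throw new IllegalArgumentException("maxFanIn must be at least 2");
        }
        List<List<T>> current = new ArrayList<>(runs);
        while (current.size() > maxFanIn) {
            List<List<T>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFanIn) {
                int end = Math.min(i + maxFanIn, current.size());
                next.add(merge(current.subList(i, end), cmp));
            }
            current = next;
        }
        return current;
    }

    /** Full merge that never has more than maxFanIn runs open at once. */
    static <T> List<T> mergeAll(List<List<T>> runs, int maxFanIn,
                                Comparator<? super T> cmp) {
        return merge(preMerge(runs, maxFanIn, cmp), cmp);
    }
}
```

With real spill files, each pass would write its merged output to a
new temporary file and delete its inputs, so the price of the bound
is extra disk I/O in exchange for a fixed ceiling on open handles and
reader buffers.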
