Hi Stephen

Stephen Allen wrote:
Hi Paolo,

I think you're right that pre-merging is needed to prevent using too
many resources at once (file handles and reader buffers), and your
stacktrace seems to indicate this.  The Hadoop developers also seem to
agree with you that 100 is a good fan out size [1].

That is where I get the 100 as default value. :-)

I've created JENA-157 to track this issue.  The preMerge() method you
wrote looks pretty good.

Thanks for opening the issue.

Paolo


-Stephen

[1] http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/InternalSortedBag.java?view=markup


On Sun, Nov 6, 2011 at 10:35 AM, Paolo Castagna
<[email protected]> wrote:
Hi,
apologies for not being able to provide more details on this.

At the moment I do not have the sources of this (yes, sad, but true),
therefore I am not 100% sure of what's going on.
However, while I was executing this Java program I noticed the files
/tmp/DataBag- were just a few KB each (i.e. someone has set the
spill-to-disk threshold too low).
I think this is causing the OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
       at org.openjena.atlas.io.CharStreamBuffered.<init>(CharStreamBuffered.java:55)
       at org.openjena.atlas.io.PeekReader.make(PeekReader.java:83)
       at org.openjena.atlas.io.PeekReader.make(PeekReader.java:72)
       at org.openjena.atlas.io.PeekReader.makeUTF8(PeekReader.java:98)
       at org.openjena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:53)
       at com.hp.hpl.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:79)
       at org.openjena.riot.SerializationFactoryFinder$1.createDeserializer(SerializationFactoryFinder.java:60)
       at org.openjena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:208)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIterSort$SortedBindingIterator.initializeIterator(QueryIterSort.java:104)
       at org.openjena.atlas.iterator.IteratorDelayedInitialization.init(IteratorDelayedInitialization.java:37)
       at org.openjena.atlas.iterator.IteratorDelayedInitialization.hasNext(IteratorDelayedInitialization.java:47)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:65)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
       at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
       at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
   ...

To avoid this situation we could limit the number of files we merge at
the same time, say 100 for example. If there are more than 100
/tmp/DataBag- files to merge, we merge them in groups of at most 100
into new, larger files, and repeat until no more than 100 files are
left for the final merge. I did something similar in the preMerge()
method here:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/src/main/java/org/apache/jena/tdbloader2/MultiThreadedSortedDataBag.java
Ignore the multi threading bit... we could just take the preMerge().
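
To make the idea concrete, here is a minimal sketch of the bounded fan-in
pre-merge. It is not the preMerge() from the link above and not the
SortedDataBag code: the class name PreMergeSketch, its methods, and the
use of in-memory lists in place of /tmp/DataBag- spill files are all
assumptions made for illustration. In a real implementation each merged
group would be written to a new temporary file and its inputs closed and
deleted, so that at most maxFanIn readers (and their buffers) are open at
any one time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: each List<T> stands in for one sorted spill file.
public class PreMergeSketch {

    // One cursor per run; stands in for an open reader on a spill file.
    private static final class Cursor<T> {
        final List<T> run;
        int pos = 0;
        Cursor(List<T> run) { this.run = run; }
        T head() { return run.get(pos); }
        boolean advance() { return ++pos < run.size(); }
    }

    // K-way merge of already-sorted runs, using a heap ordered by each run's head.
    public static <T> List<T> mergeRuns(List<List<T>> runs, Comparator<? super T> cmp) {
        Comparator<Cursor<T>> byHead = (a, b) -> cmp.compare(a.head(), b.head());
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(byHead);
        for (List<T> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor<>(run));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();
            out.add(c.head());
            if (c.advance()) {
                heap.add(c); // re-insert with its new head element
            }
        }
        return out;
    }

    // Merge groups of at most maxFanIn runs, pass after pass,
    // until at most maxFanIn runs remain.
    public static <T> List<List<T>> preMerge(List<List<T>> runs, int maxFanIn,
                                             Comparator<? super T> cmp) {
        if (maxFanIn < 2) {
            throw new IllegalArgumentException("maxFanIn must be at least 2");
        }
        List<List<T>> current = new ArrayList<>(runs);
        while (current.size() > maxFanIn) {
            List<List<T>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFanIn) {
                int end = Math.min(i + maxFanIn, current.size());
                next.add(mergeRuns(current.subList(i, end), cmp));
            }
            current = next;
        }
        return current;
    }

    // Pre-merge first, then do the final merge over the remaining runs.
    public static <T> List<T> sortedMerge(List<List<T>> runs, int maxFanIn,
                                          Comparator<? super T> cmp) {
        return mergeRuns(preMerge(runs, maxFanIn, cmp), cmp);
    }
}
```

With 250 spill files and a limit of 100, one pre-merge pass produces 3
larger runs (100 + 100 + 50 inputs), and the final merge then opens only
those 3.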

What do you think?

Paolo

