Is this on latest master / branch-1.5?

Out of the box we reserve only 16% (0.2 * 0.8) of the heap for execution
(e.g. aggregate, join) / shuffle sorting. With a 3GB heap, that's about
480MB. So each of your 32 tasks gets 480MB / 32 = 15MB, and each operator
reserves at least one page for execution. If your page size is 4MB, it only
takes 3 operators to use up a task's memory.
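
To make the arithmetic concrete, here it is as a standalone Scala sketch (not
Spark code; the heap size, task count and page size are just the numbers from
this thread):

object ExecutionMemoryEstimate {
  def main(args: Array[String]): Unit = {
    val heapBytes = 3000L * 1024 * 1024             // -Xmx3G, treated as 3000MB as above
    val shuffleMemoryFraction = 0.2                 // spark.shuffle.memoryFraction default
    val shuffleSafetyFraction = 0.8                 // spark.shuffle.safetyFraction default
    val executionBytes = (heapBytes * shuffleMemoryFraction * shuffleSafetyFraction).toLong
    val activeTasks = 32                            // observed in the failing run
    val perTaskBytes = executionBytes / activeTasks
    val pageBytes = 4L * 1024 * 1024                // the 4194304-byte page in the stack trace
    println(s"execution pool = ${executionBytes >> 20} MB, " +
      s"per task = ${perTaskBytes >> 20} MB, " +
      s"4MB pages per task = ${perTaskBytes / pageBytes}")
    // prints: execution pool = 480 MB, per task = 15 MB, 4MB pages per task = 3
  }
}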

The thing is, the page size is dynamically determined -- and in your case it
should be smaller than 4MB:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
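
Roughly, the heuristic at that line looks like the sketch below (paraphrased
from memory, so the min/max bounds and the safety factor here are my
assumptions, not the authoritative code):

object PageSizeSketch {
  // Round n up to the next power of 2 (n itself if it is already one).
  def nextPowerOf2(n: Long): Long =
    if (n <= 1) 1L else java.lang.Long.highestOneBit(n - 1) << 1

  def defaultPageSize(executionMemoryBytes: Long, cores: Int): Long = {
    val minPageSize = 1L * 1024 * 1024      // assumed lower bound: 1MB
    val maxPageSize = 64L * 1024 * 1024     // assumed upper bound: 64MB
    val safetyFactor = 16                   // assumed headroom so several operators fit per task
    val size = nextPowerOf2(executionMemoryBytes / cores / safetyFactor)
    math.min(maxPageSize, math.max(minPageSize, size))
  }

  def main(args: Array[String]): Unit = {
    val executionBytes = 480L * 1024 * 1024              // the pool estimated above
    println(defaultPageSize(executionBytes, 32) >> 20)   // prints 1 (MB), well below 4MB
  }
}

If that is right, the default page for this pool and 32 slots should come out
around 1MB, which is why an explicit 4MB setting somewhere would explain the
failures.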

Maybe there is a place in the Maven tests where we explicitly set the page
size (spark.buffer.pageSize) to 4MB? If yes, we need to find it and just
remove it.
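
If there is such an override, it would look something like this (purely
illustrative; I haven't confirmed whether or where it is set):

import org.apache.spark.SparkConf

// Hypothetical explicit override to search for in the test setup; removing it
// would let the dynamic heuristic above pick the page size instead.
val conf = new SparkConf().set("spark.buffer.pageSize", "4m")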


On Mon, Sep 14, 2015 at 4:16 AM, Pete Robbins <robbin...@gmail.com> wrote:

> I keep hitting errors running the tests on 1.5 such as
>
>
> - join31 *** FAILED ***
>   Failed to execute query using catalyst:
>   Error: Job aborted due to stage failure: Task 9 in stage 3653.0 failed 1
> times, most recent failure: Lost task 9.0 in stage 3653.0 (TID 123363,
> localhost): java.io.IOException: Unable to acquire 4194304 bytes of memory
>       at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>
>
> This is using the command
> build/mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver  test
>
>
> I don't see these errors in any of the AMPLab Jenkins builds. Do those
> builds have any configuration/environment that I may be missing? My build
> is running with whatever defaults are in the top-level pom.xml, e.g. -Xmx3G.
>
> I can make these tests pass by setting spark.shuffle.memoryFraction=0.6 in
> the HiveCompatibilitySuite rather than the default 0.2 value.
>
> Trying to analyze what is going on with the test, it appears to be related
> to the number of active tasks, which seems to rise to 32; the
> ShuffleMemoryManager then allows less memory per task even though most of
> those tasks do not have any memory allocated to them.
>
> Has anyone seen issues like this before?
>
