Hey folks, increasing tez.task.scale.memory.reserve-fraction to 0.8 worked
for small jobs.

I will come back with a more detailed breakdown to make sure I'm doing
things properly.
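
In the meantime, here is a rough sketch of how I currently read the setting, based on the explanations quoted below: the reserve fraction is heap set aside for the Processor, and only the remainder is divided among the inputs and outputs. The proportional scale-down is my own simplification, not the actual WeightedScalingMemoryDistributor logic, and the 4096 MB heap, the buffer names and the request sizes are made-up numbers for illustration.

```python
def distributable_mb(heap_mb, reserve_fraction):
    """Heap left for input/output buffers once the reserve is set aside."""
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return heap_mb * (1.0 - reserve_fraction)


def scale_requests(requests_mb, available_mb):
    """Scale buffer requests down proportionally if they exceed the budget.

    Assumption: plain proportional scaling, with no per-component weights.
    """
    total = sum(requests_mb.values())
    if total <= available_mb:
        return dict(requests_mb)
    factor = available_mb / total
    return {name: size * factor for name, size in requests_mb.items()}


heap_mb = 4096  # hypothetical JVM heap size
requests = {"shuffle_input": 2048, "sorted_output": 1024}  # hypothetical requests

# Compare the default reserve (0.3) with the value that worked for me (0.8).
for fraction in (0.3, 0.8):
    available = distributable_mb(heap_mb, fraction)
    granted = scale_requests(requests, available)
    print(fraction, round(available), {k: round(v) for k, v in granted.items()})
```

If this reading is right, raising the fraction to 0.8 shrinks what the shuffle buffers may hold in memory, which would explain why the fetchers stopped running out of heap.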

Thanks for the quick responses!

Kostas

On Thu, Nov 6, 2014 at 10:15 PM, Siddharth Seth <[email protected]> wrote:

> - hive-dev, +tez-dev
>
> Do you know at what stage of the processing the OOM occurs? What other
> processing has happened so far? Ideally, if this was just part of the
> Inputs being initialized, you should not have seen an OOM.
> In all likelihood, the Processor started using some memory (which by
> default is counted as 30% of the JVM heap). You could try modifying this
> setting (tez.task.scale.memory.reserve-fraction could be set higher than
> 0.3 (30%) for starters).
>
> The logs will definitely help in figuring out what is happening. A heap
> dump would be even better.
>
> On Thu, Nov 6, 2014 at 1:01 PM, Gopal V <[email protected]> wrote:
>
> > On 11/6/14, 11:09 AM, Kostas Tzoumas wrote:
> >
> >> I am running into the same error [1] with plain Tez (not Hive):
> >>
> >> Any advice on what configuration parameters I should start looking at?
> >>
> >
> > Both issues are related to the Tez memory distributor
> > (InitialMemoryAllocator) impl used.
> >
> > http://tez.apache.org/releases/0.5.1/tez-runtime-library-javadocs/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.html
> >
> > This divides memory up between different inputs and outputs, so that the
> > overall memory usage is capped without hitting GC issues.
> >
> > Suma's issue was probably that tez-0.4 (ergo, hive-13) didn't have a
> > memory distributor implementation.
> >
> > http://people.apache.org/~gopalv/tpch-plans/q8_national_market_share.svg
> >
> > This means that Reducer_4 in that plan can divvy up memory between these
> > buffers.
> >
> > OrderedGroupedKVInputConfig::setShuffleBufferFraction() allows this
> > particular tuning per input edge.
> >
> > For a shuffle JOIN, you can tune the left and right hand side of the
> > buffers, as well as make reservations for the actual map-join in memory,
> > so that the plan's cost information can help the memory scheduling Tez has.
> >
> > Cheers,
> > Gopal
> >
> >
> >
> >> [1] java.lang.OutOfMemoryError: Java heap space
> >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> >> at org.apache.tez.runtime.library.common.shuffle.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> >> at org.apache.tez.runtime.library.common.shuffle.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:139)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:713)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:485)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:394)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:189)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:71)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> On Tue, Aug 26, 2014 at 4:26 PM, Suma Shivaprasad <
> >> [email protected]> wrote:
> >>
> >>  Am using Tez 0.4.0 and counters for the query run are as below
> >>>
> >>> 2014-08-26 14:06:41,203 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(171)) - org.apache.tez.common.counters.DAGCounter:
> >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    NUM_FAILED_TASKS: 67
> >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    NUM_KILLED_TASKS: 312
> >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    TOTAL_LAUNCHED_TASKS: 259
> >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    DATA_LOCAL_TASKS: 59
> >>> 2014-08-26 14:06:41,205 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    RACK_LOCAL_TASKS: 27
> >>> 2014-08-26 14:06:41,207 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(171)) - File System Counters:
> >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILE: BYTES_READ: 0
> >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILE: BYTES_WRITTEN: 3201156949
> >>> 2014-08-26 14:06:41,208 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILE: READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILE: LARGE_READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILE: WRITE_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    HDFS: BYTES_READ: 30052072845
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    HDFS: BYTES_WRITTEN: 0
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    HDFS: READ_OPS: 768
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    HDFS: LARGE_READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    HDFS: WRITE_OPS: 0
> >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(171)) - org.apache.tez.common.counters.TaskCounter:
> >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    GC_TIME_MILLIS: 148639
> >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    CPU_MILLISECONDS: 1420020
> >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    PHYSICAL_MEMORY_BYTES: 304725393408
> >>> 2014-08-26 14:06:41,211 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    VIRTUAL_MEMORY_BYTES: 440084279296
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    COMMITTED_HEAP_BYTES: 337806557184
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    INPUT_RECORDS_PROCESSED: 722420718
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    OUTPUT_RECORDS: 144488481
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES: 6876509984
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES_WITH_OVERHEAD: 7165487118
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    OUTPUT_BYTES_PHYSICAL: 3201154197
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(171)) -
> >>> org.apache.hadoop.hive.ql.exec.FilterOperator$Counter:
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    FILTERED: 863123081
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    PASSED: 215782564
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(171)) -
> >>> org.apache.hadoop.hive.ql.exec.MapOperator$Counter:
> >>> 2014-08-26 14:06:41,212 INFO  [Thread-13]: exec.Task
> >>> (TezTask.java:execute(173)) -    DESERIALIZE_ERRORS: 0
> >>>
> >>> Thanks
> >>> Suma
> >>>
> >>>
> >>> On Tue, Aug 26, 2014 at 7:47 PM, Suma Shivaprasad <
> >>> [email protected]> wrote:
> >>>
> >>> > Trying to run a query on Tez with the following configurations
> >>> >
> >>> >
> >>> > set hive.tez.container.size=5120
> >>> > set mapreduce.map.child.java.opts=-Xmx5120M
> >>> > set hive.tez.java.opts=-Xmx4096M
> >>> > set hive.auto.convert.join.noconditionaltask.size=805306000
> >>> > set tez.am.resource.memory.mb=5120
> >>> > set tez.am.java.opts=-Xmx4096M
> >>> >
> >>> > The above config settings were set after running
> >>> > https://github.com/hortonworks/hdp-configuration-utils/blob/master/2.1/hdp-configuration-utils.py
> >>> > to get the right memory configs.
> >>> >
> >>> > Tried with both
> >>> >
> >>> > set tez.runtime.io.sort.mb=512
> >>> > set mapreduce.task.io.sort.mb=512
> >>> >
> >>> > and
> >>> >
> >>> > set tez.runtime.io.sort.mb=2048
> >>> > set mapreduce.task.io.sort.mb=2048
> >>> >
> >>> >
> >>> > The query I am trying to run is
> >>> >
> >>> > select sum(tab1.m1),sum(tab1.m2)
> >>> > from tab1 join tab2 dm on tab1.col1=tab2.col1
> >>> > where tab1.dt = '2014-06-01'
> >>> > and tab2.col2 = '..'
> >>> > and tab2.col3 IN ('..')
> >>> > group by TAB1.col1
> >>> >
> >>> > where TAB1.col1 has high cardinality (around 700-800 million).
> >>> >
> >>> > And it's going OOM during the shuffle phase.
> >>> >
> >>> >  errorMessage=Fetch failed
> >>> > Container released by application,
> >>> > AttemptID:attempt_1407396011310_1577_1_01_000000_4 Info:Error:
> >>> > exceptionThrown=java.lang.OutOfMemoryError: Java heap space
> >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> >>> > at org.apache.tez.runtime.library.shuffle.common.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> >>> > at org.apache.tez.runtime.library.shuffle.common.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:137)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.fetchInputs(Fetcher.java:252)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:184)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:59)
> >>> > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>> > at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> >
> >>> > Please advise whether the configurations look OK. Do I need to change
> >>> > anything?
> >>> >
> >>> >
> >>> >
> >>> > Thanks
> >>> > Suma
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>
> >>
> >
>
