Hey folks, increasing tez.task.scale.memory.reserve-fraction to 0.8 worked for small jobs.
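
For anyone following along, here is a rough sketch of what raising that fraction does under a simplified model. It assumes a fixed fraction of the task heap is set aside and the remaining heap is shared among the inputs and outputs, with requests scaled down proportionally when they do not fit. The function name, the 4096 MB heap, and the request sizes are all hypothetical; this is not the actual Tez implementation.

```python
def scale_requests(heap_bytes, reserve_fraction, requests):
    """Scale per-component memory requests to fit the non-reserved heap.

    Simplified model (an assumption, not the Tez code): a fixed fraction of
    the heap is set aside, and if the requests exceed what is left, every
    request is scaled down by the same factor.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    available = int(heap_bytes * (1.0 - reserve_fraction))
    total = sum(requests.values())
    if total <= available:
        # Everything fits, so each component gets what it asked for.
        return dict(requests)
    factor = available / total
    return {name: int(size * factor) for name, size in requests.items()}


MB = 1024 * 1024
heap = 4096 * MB  # hypothetical task JVM started with -Xmx4096M
# Hypothetical buffer requests from one input and one output.
requests = {"shuffle_input": 2800 * MB, "sort_output": 512 * MB}

for fraction in (0.3, 0.8):
    grants = scale_requests(heap, fraction, requests)
    print(fraction, {name: size // MB for name, size in grants.items()})
```

In this model a higher reserve fraction shrinks the buffers granted to the shuffle input, which leaves more of the heap for everything that is not accounted for up front.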
I will come back with a more detailed breakdown to make sure I'm doing
things properly.

Thanks for the quick responses!

Kostas

On Thu, Nov 6, 2014 at 10:15 PM, Siddharth Seth <[email protected]> wrote:

> - hive-dev, +tez-dev
>
> Do you know at what stage of the processing the OOM occurs? What other
> processing has happened so far? Ideally, if this was just part of the
> Inputs being initialized, you should not have seen an OOM.
> In all likelihood, the Processor started using some memory (which by
> default is counted as 30% of the JVM heap). You could try modifying this
> setting (tez.task.scale.memory.reserve-fraction could be set higher than
> 0.3 (30%) for starters).
>
> The logs will definitely help in figuring out what is happening. A heap
> dump would be even better.
>
> On Thu, Nov 6, 2014 at 1:01 PM, Gopal V <[email protected]> wrote:
>
> > On 11/6/14, 11:09 AM, Kostas Tzoumas wrote:
> >
> >> I am running into the same error [1] with plain Tez (not Hive):
> >>
> >> Any advice on what configuration parameters I should start looking at?
> >>
> >
> > Both issues are related to the Tez memory distributor
> > (InitialMemoryAllocator) impl used.
> >
> > http://tez.apache.org/releases/0.5.1/tez-runtime-library-javadocs/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.html
> >
> > This divides memory up between different inputs and outputs, so that the
> > overall memory usage is capped without hitting GC issues.
> >
> > Suma's issue was probably that tez-0.4 (ergo, hive-13) didn't have a
> > memory distributor implementation.
> >
> > http://people.apache.org/~gopalv/tpch-plans/q8_national_market_share.svg
> >
> > This means that Reducer_4 in that plan can divvy up memory between these
> > buffers.
> >
> > OrderedGroupedKVInputConfig::setShuffleBufferFraction() allows this
> > particular tuning per input edge.
> >
> > For a shuffle JOIN, you can tune the left and right hand side of the
> > buffers, as well as make reservations for the actual map-join in memory,
> > so that the plan's cost information can help the memory scheduling Tez has.
> >
> > Cheers,
> > Gopal
> >
> >
> >> [1] java.lang.OutOfMemoryError: Java heap space
> >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> >> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> >> at org.apache.tez.runtime.library.common.shuffle.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> >> at org.apache.tez.runtime.library.common.shuffle.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:139)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:713)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:485)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:394)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:189)
> >> at org.apache.tez.runtime.library.common.shuffle.Fetcher.call(Fetcher.java:71)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> On Tue, Aug 26, 2014 at 4:26 PM, Suma Shivaprasad <[email protected]> wrote:
> >>
> >>> Am using Tez 0.4.0 and counters for the query run are as below
> >>>
> >>> 2014-08-26 14:06:41,203 INFO [Thread-13]: exec.Task (TezTask.java:execute(171)) - org.apache.tez.common.counters.DAGCounter:
> >>> 2014-08-26 14:06:41,205 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - NUM_FAILED_TASKS: 67
> >>> 2014-08-26 14:06:41,205 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - NUM_KILLED_TASKS: 312
> >>> 2014-08-26 14:06:41,205 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - TOTAL_LAUNCHED_TASKS: 259
> >>> 2014-08-26 14:06:41,205 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - DATA_LOCAL_TASKS: 59
> >>> 2014-08-26 14:06:41,205 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - RACK_LOCAL_TASKS: 27
> >>> 2014-08-26 14:06:41,207 INFO [Thread-13]: exec.Task (TezTask.java:execute(171)) - File System Counters:
> >>> 2014-08-26 14:06:41,208 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILE: BYTES_READ: 0
> >>> 2014-08-26 14:06:41,208 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILE: BYTES_WRITTEN: 3201156949
> >>> 2014-08-26 14:06:41,208 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILE: READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILE: LARGE_READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILE: WRITE_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - HDFS: BYTES_READ: 30052072845
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - HDFS: BYTES_WRITTEN: 0
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - HDFS: READ_OPS: 768
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - HDFS: LARGE_READ_OPS: 0
> >>> 2014-08-26 14:06:41,209 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - HDFS: WRITE_OPS: 0
> >>> 2014-08-26 14:06:41,211 INFO [Thread-13]: exec.Task (TezTask.java:execute(171)) - org.apache.tez.common.counters.TaskCounter:
> >>> 2014-08-26 14:06:41,211 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - GC_TIME_MILLIS: 148639
> >>> 2014-08-26 14:06:41,211 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - CPU_MILLISECONDS: 1420020
> >>> 2014-08-26 14:06:41,211 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - PHYSICAL_MEMORY_BYTES: 304725393408
> >>> 2014-08-26 14:06:41,211 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - VIRTUAL_MEMORY_BYTES: 440084279296
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - COMMITTED_HEAP_BYTES: 337806557184
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - INPUT_RECORDS_PROCESSED: 722420718
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - OUTPUT_RECORDS: 144488481
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - OUTPUT_BYTES: 6876509984
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - OUTPUT_BYTES_WITH_OVERHEAD: 7165487118
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - OUTPUT_BYTES_PHYSICAL: 3201154197
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(171)) - org.apache.hadoop.hive.ql.exec.FilterOperator$Counter:
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - FILTERED: 863123081
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - PASSED: 215782564
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(171)) - org.apache.hadoop.hive.ql.exec.MapOperator$Counter:
> >>> 2014-08-26 14:06:41,212 INFO [Thread-13]: exec.Task (TezTask.java:execute(173)) - DESERIALIZE_ERRORS: 0
> >>>
> >>> Thanks
> >>> Suma
> >>>
> >>>
> >>> On Tue, Aug 26, 2014 at 7:47 PM, Suma Shivaprasad <[email protected]> wrote:
> >>>
> >>> > Trying to run a query on Tez with the following configurations
> >>> >
> >>> > set hive.tez.container.size=5120
> >>> > set mapreduce.map.child.java.opts=-Xmx5120M
> >>> > set hive.tez.java.opts=-Xmx4096M
> >>> > set hive.auto.convert.join.noconditionaltask.size=805306000
> >>> > set tez.am.resource.memory.mb=5120
> >>> > set tez.am.java.opts=-Xmx4096M
> >>> >
> >>> > The above config settings were set after running
> >>> > https://github.com/hortonworks/hdp-configuration-utils/blob/master/2.1/hdp-configuration-utils.py
> >>> > to get the right memory configs
> >>> >
> >>> > Tried with both
> >>> >
> >>> > set tez.runtime.io.sort.mb=512
> >>> > set mapreduce.task.io.sort.mb=512
> >>> >
> >>> > and
> >>> >
> >>> > set tez.runtime.io.sort.mb=2048
> >>> > set mapreduce.task.io.sort.mb=2048
> >>> >
> >>> >
> >>> > The query I am trying to run is
> >>> >
> >>> > select sum(tab1.m1),sum(tab1.m2)
> >>> > from tab1 join tab2 dm on tab1.col1=tab2.col1
> >>> > where tab1.dt = '2014-06-01'
> >>> > and tab2.col2 = '..'
> >>> > and tab2.col3 IN ('..')
> >>> > group by TAB1.col1
> >>> >
> >>> > where TAB1.col1 has high cardinality (around 700-800 million)
> >>> >
> >>> > And it's going OOM during the shuffle phase.
> >>> >
> >>> > errorMessage=Fetch failed
> >>> > Container released by application,
> >>> > AttemptID:attempt_1407396011310_1577_1_01_000000_4 Info:Error:
> >>> > exceptionThrown=java.lang.OutOfMemoryError: Java heap space
> >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
> >>> > at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
> >>> > at org.apache.tez.runtime.library.shuffle.common.MemoryFetchedInput.<init>(MemoryFetchedInput.java:38)
> >>> > at org.apache.tez.runtime.library.shuffle.common.impl.SimpleFetchedInputAllocator.allocate(SimpleFetchedInputAllocator.java:137)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.fetchInputs(Fetcher.java:252)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:184)
> >>> > at org.apache.tez.runtime.library.shuffle.common.Fetcher.call(Fetcher.java:59)
> >>> > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>> > at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> >
> >>> > Please advise whether the configurations look OK. Do I need to change
> >>> > anything?
> >>> >
> >>> > Thanks
> >>> > Suma
