hi Patrick,
Usually a DISTINCT is preferred on the primary key columns rather than across the entire row - the underlying problem is what is typically referred to as SKEWNESS in the traditional RDBMS world. Doing it on an array column typically adds to the woes.
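[Editor's note: a minimal shell-level sketch of the key-columns-vs-whole-row distinction above, with made-up toy data; the column names are borrowed from the table schema Patrick posts later in the thread.]

```shell
# Toy rows shaped like the schema quoted later in the thread
# (c1 string, c2 bigint, c3 array<map<string,string>>; payload abbreviated).
cd "$(mktemp -d)"
printf 'a,1,[{"k":"v1"}]\na,1,[{"k":"v2"}]\nb,2,[{"k":"v3"}]\n' > rows.csv

# Dedup on the key fields (c1, c2) only: with -u, sort compares just the
# -k fields, so the two "a,1,..." rows collapse to one regardless of c3.
sort -t, -k1,2 -u rows.csv
```

Which of the two "a,1" payloads survives is unspecified - which mirrors the semantic question a DISTINCT over non-key columns raises in SQL as well.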
A typical workaround I have used in the past is to fall back to Unix and do the distinct with uniq or something similar, as that easily handles file sizes up to 2 GB (yes, I split bigger files and then merge them back together). But that should not stop us from using a released feature - maybe a JIRA can be filed to get it fixed.

regards
Dev

On Tue, Mar 12, 2019 at 12:43 AM Patrick Duin <patd...@gmail.com> wrote:
>
> Venkatesh:
> Increasing the memory: I've tried even bigger settings; that made the
> error appear after roughly twice as much time.
>
> Dev:
> So I know which table is giving the issue. Following your previous
> suggestion I did a SELECT DISTINCT * FROM DELTA, which caused the same
> issue, so I think the DISTINCT is the likely cause. This would make sense,
> since my original query worked with UNION ALL.
>
> I'll try the simplest query I can reduce it to, with loads of memory, and
> see if that gets anywhere. Other pointers are much appreciated.
>
> Thanks for the help!
>
> Op ma 11 mrt. 2019 om 19:35 schreef Devopam Mittra <devo...@gmail.com>:
>
>> hi Patrick,
>> If it sounds worth trying, please do the same:
>>
>> 1. Create a physical table from table 1 (with the filter clause).
>> 2. Create a physical table from table 2 (with the filter clause).
>> 3. Create interim table 2_1 with the DISTINCT clause.
>> 4. Create interim table 2_2 with the UNION clause.
>> 5. Do an INSERT OVERWRITE into the target table from 2_2 above.
>>
>> This should help isolate the failing step.
>> Yes, increasing the number of mappers or the memory helps - but I am
>> usually averse to that until I have exhausted all other options.
>>
>> regards
>> Dev
>>
>> On Mon, Mar 11, 2019 at 9:21 PM Patrick Duin <patd...@gmail.com> wrote:
>>
>>> Very good question. Yes, that gives the same problem.
>>>
>>> Op ma 11 mrt.
2019 om 16:28 schreef Devopam Mittra <devo...@gmail.com>:
>>>
>>>> Can you please try doing SELECT DISTINCT * FROM DELTA into a physical
>>>> table first?
>>>> regards
>>>> Dev
>>>>
>>>> On Mon, Mar 11, 2019 at 7:59 PM Patrick Duin <patd...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running into an OOM issue trying to do a UNION over a bunch of
>>>>> AVRO files.
>>>>>
>>>>> The query is something like this:
>>>>>
>>>>> with gold as (select * from table1 where local_date='2019-01-01'),
>>>>> delta as (select * from table2 where local_date='2019-01-01')
>>>>> insert overwrite table table3 partition (local_date)
>>>>> select * from gold
>>>>> union distinct
>>>>> select * from delta;
>>>>>
>>>>> UNION ALL works. The data size is in the low gigabytes and I'm running
>>>>> on 6 nodes with 16 GB each (I've tried larger, and set the memory
>>>>> settings higher, but that just postpones the error).
>>>>>
>>>>> Mappers fail with errors (the stack traces are not all the same):
>>>>>
>>>>> 2019-03-11 13:37:22,381 ERROR [main] org.apache.hadoop.mapred.YarnChild:
>>>>> Error running child : java.lang.OutOfMemoryError: GC overhead limit
>>>>> exceeded
>>>>> at org.apache.hadoop.io.Text.setCapacity(Text.java:268)
>>>>> at org.apache.hadoop.io.Text.set(Text.java:224)
>>>>> at org.apache.hadoop.io.Text.set(Text.java:214)
>>>>> at org.apache.hadoop.io.Text.<init>(Text.java:93)
>>>>> at
>>>>> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.copyObject(WritableStringObjectInspector.java:36)
>>>>> at
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:418)
>>>>> at
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:442)
>>>>> at
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:428)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:152)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:144)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.copyKey(KeyWrapperFactory.java:121)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:805)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
>>>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.UnionOperator.process(UnionOperator.java:148)
>>>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>>>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
>>>>> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
>>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>>>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
>>>>>
>>>>> I've tried Hive 2.3.2 and Hive 2.3.4, on both the Tez and MR engines.
>>>>>
>>>>> I've tried running with more and with fewer mappers, always hitting OOM.
>>>>>
>>>>> I'm running a similar query on different (much larger) data without
>>>>> issues, so I suspect something in the actual data.
>>>>>
>>>>> The table schema is this:
>>>>> c1 string
>>>>> c2 bigint
>>>>> c3 array<map<string,string>>
>>>>> local_date string
>>>>>
>>>>> I've narrowed it down, and (not surprisingly) the third column seems
>>>>> to be the cause of the issue; if I remove it, the union works fine again.
>>>>>
>>>>> Has anyone had similar experiences? Any pointers on how to tackle
>>>>> this?
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Patrick
>>>>>
>>>>
>>>> --
>>>> Devopam Mittra
>>>> Life and Relations are not binary
>>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>

--
Devopam Mittra
Life and Relations are not binary
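[Editor's note: the Unix fallback Dev describes at the top of the thread - split a large extract, sort the pieces, then merge while dropping duplicates - could look roughly like this. A minimal sketch with made-up file names and toy data, not the exact commands used.]

```shell
# Split a big extract into chunks, sort each chunk, then merge the
# pre-sorted chunks in one pass while suppressing duplicate lines.
cd "$(mktemp -d)"
printf 'row2\nrow1\nrow2\nrow3\nrow1\n' > extract.txt

split -l 2 extract.txt chunk_                  # chunks of ~2 lines each
for f in chunk_*; do sort "$f" -o "$f"; done   # sort each chunk in place
sort -m -u chunk_* > distinct.txt              # merge + dedup the chunks

cat distinct.txt   # prints row1, row2, row3 (one per line)
```

`sort -m` only merges inputs that are already sorted, so the per-chunk sorts keep peak memory bounded by the chunk size - the same reason Dev's 2 GB split-and-merge approach works.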