Hi Patrick,
Usually a DISTINCT is done on the primary-key columns rather than on the
entire row - deduplicating whole rows runs into what the traditional RDBMS
world refers to as skewness.
Doing it on an array column typically adds to the woes, since every element
has to be deep-copied and hashed.
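
For illustration, a sketch using the column names from the schema in your
original mail (I am assuming c1 and c2 identify a row - please adjust the
keys as needed). It deduplicates on the scalar columns only, so the array is
never hashed or compared:

-- keep one arbitrary row per key instead of DISTINCT over the full row
SELECT c1, c2, c3, local_date
FROM (
    SELECT c1, c2, c3, local_date,
           row_number() OVER (PARTITION BY c1, c2, local_date ORDER BY c2) AS rn
    FROM table2
) t
WHERE rn = 1;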

A workaround I have used in the past is to fall back to Unix and deduplicate
there with sort -u (plain uniq only drops adjacent duplicates, so the input
must be sorted anyway). That comfortably handles files up to about 2 GB; for
bigger inputs I split the file, deduplicate the pieces, and merge the
results, as sketched below.
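
A rough sketch of that flow (file names are made up; it assumes the rows
have been exported as plain text, one record per line):

# split into manageable pieces, dedupe each piece, then do a sorted merge
split -l 10000000 exported_rows.txt chunk_
for f in chunk_*; do sort -u "$f" > "$f.sorted"; done
sort -mu chunk_*.sorted > deduped_rows.txt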

That said, a manual workaround should not be needed for a released feature.
Maybe a JIRA can be filed to get it fixed.

regards
Dev

On Tue, Mar 12, 2019 at 12:43 AM Patrick Duin <patd...@gmail.com> wrote:

>
> Venkatesh:
> Increasing the memory: I've tried even bigger settings; that only made the
> error appear after roughly twice as long.
>
> Dev:
> So I know which table is giving the issue. Following your previous
> suggestion I did a SELECT DISTINCT * FROM DELTA, which caused the same
> issue, so I think the DISTINCT is the likely culprit. That would make
> sense, since my original query worked with UNION ALL.
>
> I'll try the simplest query I can reduce it to, with loads of memory, and
> see if that gets anywhere. Other pointers are much appreciated.
>
> Thanks for the help!
>
>
>
>
>
> Op ma 11 mrt. 2019 om 19:35 schreef Devopam Mittra <devo...@gmail.com>:
>
>> Hi Patrick,
>> If it sounds worth trying, please do the following (a HiveQL sketch
>> follows the list):
>>
>> 1. Create a physical table from table1 (with the filter clause).
>> 2. Create a physical table from table2 (with the filter clause).
>> 3. Create interim table 2_1 with the DISTINCT clause.
>> 4. Create interim table 2_2 with the UNION clause.
>> 5. Do an INSERT OVERWRITE into the target table from 2_2 above.
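>>
>> A rough sketch of those steps (table names are made up, and I'm assuming
>> the DISTINCT in step 3 is taken over the filtered table2 snapshot):
>>
>> CREATE TABLE gold_snap   AS SELECT * FROM table1 WHERE local_date='2019-01-01';
>> CREATE TABLE delta_snap  AS SELECT * FROM table2 WHERE local_date='2019-01-01';
>> CREATE TABLE interim_2_1 AS SELECT DISTINCT * FROM delta_snap;
>> CREATE TABLE interim_2_2 AS
>>   SELECT * FROM gold_snap
>>   UNION
>>   SELECT * FROM interim_2_1;
>> -- may also need: SET hive.exec.dynamic.partition.mode=nonstrict;
>> INSERT OVERWRITE TABLE table3 PARTITION (local_date)
>> SELECT * FROM interim_2_2;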
>>
>> This should help isolate the failing step.
>> Yes, increasing the number of mappers or the memory can help - but I am
>> usually averse to that until I have exhausted all other options.
>>
>> regards
>> Dev
>>
>>
>> On Mon, Mar 11, 2019 at 9:21 PM Patrick Duin <patd...@gmail.com> wrote:
>>
>>> Very good question. Yes, that does give the same problem.
>>>
>>> Op ma 11 mrt. 2019 om 16:28 schreef Devopam Mittra <devo...@gmail.com>:
>>>
>>>> Can you please try doing a SELECT DISTINCT * FROM DELTA into a physical
>>>> table first?
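>>>> For example (a sketch - the staging table name is made up, and "delta"
>>>> is expanded to its definition from your query):
>>>>
>>>> CREATE TABLE delta_distinct AS
>>>> SELECT DISTINCT * FROM table2 WHERE local_date='2019-01-01';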
>>>> regards
>>>> Dev
>>>>
>>>>
>>>> On Mon, Mar 11, 2019 at 7:59 PM Patrick Duin <patd...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running into an OOM issue trying to do a union on a bunch of AVRO
>>>>> files.
>>>>>
>>>>> The query is something like this:
>>>>>
>>>>> with gold as ( select * from table1 where local_date='2019-01-01'),
>>>>>      delta as ( select * from table2 where local_date='2019-01-01')
>>>>> insert overwrite table table3 partition (local_date)
>>>>> select * from gold
>>>>> union distinct
>>>>> select * from delta;
>>>>>
>>>>> UNION ALL works. The data size is in the low gigabytes and I'm running
>>>>> on six 16 GB nodes (I've tried larger nodes and higher memory settings,
>>>>> but that just postpones the error).
>>>>>
>>>>> Mappers fail with errors (the stack traces are not all the same):
>>>>>
>>>>> 2019-03-11 13:37:22,381 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>   at org.apache.hadoop.io.Text.setCapacity(Text.java:268)
>>>>>   at org.apache.hadoop.io.Text.set(Text.java:224)
>>>>>   at org.apache.hadoop.io.Text.set(Text.java:214)
>>>>>   at org.apache.hadoop.io.Text.<init>(Text.java:93)
>>>>>   at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.copyObject(WritableStringObjectInspector.java:36)
>>>>>   at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:418)
>>>>>   at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:442)
>>>>>   at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:428)
>>>>>   at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:152)
>>>>>   at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:144)
>>>>>   at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.copyKey(KeyWrapperFactory.java:121)
>>>>>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:805)
>>>>>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
>>>>>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
>>>>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>>   at org.apache.hadoop.hive.ql.exec.UnionOperator.process(UnionOperator.java:148)
>>>>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>>   at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>>>>>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>>>>   at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
>>>>>   at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
>>>>>   at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
>>>>>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
>>>>>
>>>>> I've tried Hive 2.3.2 and Hive 2.3.4, with both the Tez and MR engines.
>>>>>
>>>>> I've tried running with more and fewer mappers, always hitting OOM.
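>>>>>
>>>>> The knobs varied were along these lines (illustrative values only, not
>>>>> my exact settings):
>>>>>
>>>>> set mapreduce.input.fileinputformat.split.maxsize=268435456; -- mapper count
>>>>> set mapreduce.map.memory.mb=8192;
>>>>> set mapreduce.map.java.opts=-Xmx6g;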
>>>>>
>>>>> I'm running a similar query on different (much larger) data without
>>>>> issues, so I suspect something in the actual data.
>>>>>
>>>>> The table schema is this:
>>>>> c1    string                      
>>>>> c2    bigint                      
>>>>> c3    array<map<string,string>>   
>>>>> local_date  string
>>>>>
>>>>>
>>>>> I've narrowed it down, and (not surprisingly) the third column seems to
>>>>> be the cause of the issue. If I remove it, the union works just fine.
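>>>>>
>>>>> A sketch of the narrowed-down variant that succeeds (same CTEs as in
>>>>> the query above, with the array column c3 simply left out):
>>>>>
>>>>> select c1, c2 from gold
>>>>> union distinct
>>>>> select c1, c2 from delta;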
>>>>>
>>>>> Has anyone had similar experiences? Any pointers on how to tackle this?
>>>>>
>>>>> Kind regards,
>>>>>
>>>>>  Patrick
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Devopam Mittra
>>>> Life and Relations are not binary
>>>>
>>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>

-- 
Devopam Mittra
Life and Relations are not binary
