Hi Gopal,

Thanks for your comment.

Yes, Kylin generated the query. I'm using Kylin 1.5.3.

But I'm still not sure how I can fix the problem. I'm a beginner with Hive and
Kylin. Can the problem be fixed just by changing the Hive or Kylin settings?

The total data is about 1 billion rows. I'm trying to build a base cube first
and then handle the daily increments. Should I split the 1 billion rows into
hundreds of pieces and then build the cube?


Thanks,

Minghao Feng

________________________________
From: Gopal Vijayaraghavan <go...@hortonworks.com> on behalf of Gopal 
Vijayaraghavan <gop...@apache.org>
Sent: Wednesday, August 17, 2016 11:10:45 AM
To: user@hive.apache.org
Subject: Re: hive throws ConcurrentModificationException when executing insert 
overwrite table


> This problem has blocked me for a whole week, does anybody have any ideas?

This might be a race condition here.

<https://github.com/apache/hive/blob/master/shims/common/src/main/java/org/apache/hadoop/hive/io/HdfsUtils.java#L68>


The list returned by aclStatus.getEntries() is being modified without being
copied first (oddly enough, with Kerberos it might be okay).


>> >= '1970-01-01 01:00:00' AND TBL_HIS_UWIP_SCAN_PROM.START_TIME <
>>'2010-01-01 01:00:00') DISTRIBUTE BY RAND();

Did Kylin generate this query? This pattern is a known cause of data loss at
runtime.

DISTRIBUTE BY RAND() loses data when map tasks fail and are re-executed,
because the random distribution sends rows to different reducers on the retry.
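
For illustration only (the full select list and the target table aren't in this
quote, so the SELECT * and the bucket count of 100 below are just placeholders),
the safer pattern is to distribute by a deterministic expression of the row:

  SELECT *
  FROM TBL_HIS_UWIP_SCAN_PROM
  WHERE TBL_HIS_UWIP_SCAN_PROM.START_TIME >= '1970-01-01 01:00:00'
    AND TBL_HIS_UWIP_SCAN_PROM.START_TIME < '2010-01-01 01:00:00'
  -- deterministic: a re-executed map task sends every row to the same
  -- reducer as the first attempt, so nothing is lost or duplicated
  DISTRIBUTE BY pmod(hash(TBL_HIS_UWIP_SCAN_PROM.START_TIME), 100);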

>        at org.apache.hadoop.hdfs.DFSClient.setAcl(DFSClient.java:3242)
...
>        at org.apache.hadoop.hive.io.HdfsUtils.setFullFileStatus(HdfsUtils.java:126)

> An interesting thing is that if I narrow down the 'where' to make the
>select query only return about 300,000 lines, the insert SQL can be
>completed successfully.

Producing exactly one output file will fix the issue, since only a single file
move then touches the ACL list.
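
If you go that route, one way (untested here, and assuming the statement keeps
a reduce stage and is not writing dynamic partitions) is to pin the job to a
single reducer just for this INSERT:

  -- one reducer means the reduce stage writes exactly one output file
  set mapreduce.job.reduces=1;
  -- then run the same INSERT OVERWRITE TABLE ... SELECT ... as before

Remember to set it back to -1 (let Hive decide) afterwards, or every other
query in the session will also run with a single reducer.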

Cheers,
Gopal