[ https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
J.P Feng closed SPARK-18077.
----------------------------
    Resolution: Won't Fix

I will open another issue, since there are some mistakes in this one.

> Run insert overwrite statements in spark to overwrite a partitioned table is very slow
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-18077
>                 URL: https://issues.apache.org/jira/browse/SPARK-18077
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: spark 2.0
>                      hive 2.0.1
>                      driver memory: 4g
>                      total executors: 4
>                      executor memory: 10g
>                      total cores: 13
>            Reporter: J.P Feng
>              Labels: hive, insert, sparkSQL
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hello, all. I am seeing some strange behavior in my project.
>
> There is a table:
>
>     CREATE TABLE `login4game` (`account_name` string, `role_id` string,
>       `server_id` string, `recdate` string)
>     PARTITIONED BY (`pt` string, `dt` string) STORED AS ORC;
>
> and another table:
>
>     CREATE TABLE `tbllog_login` (`server` string, `role_id` bigint,
>       `account_name` string, `happened_time` int)
>     PARTITIONED BY (`pt` string, `dt` string);
>
> ----------------------------------------------------------------------------------------------
> Test-1:
>
> I ran the following SQL in spark-shell or spark-sql (before running it,
> partition (pt='mix_en', dt='2016-10-21') of table login4game already held a
> lot of data):
>
>     INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
>     SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
>     FROM tbllog_login WHERE pt='mix_en' AND dt='2016-10-21';
>
> It takes a long time. Below is part of the log:
>
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================> (144 + 8) / 200] 15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
> [Stage 5:=========================================> (152 + 8) / 200] 15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
> [Stage 5:============================================> (160 + 8) / 200] 15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]
> [Stage 5:================================================> (176 + 8) / 200] 15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> [Stage 5:================================================> (177 + 8) / 200] 15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> [Stage 5:====================================================> (192 + 8) / 200] 15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs]
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
>
> I can see that the old data is moved to .Trash. After that, nothing is
> logged for about 10 minutes, and then the log continues:
>
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
> ....
> 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
>
> After that, the SQL finishes. But it took over 10 minutes to handle 257128
> rows of data.
>
> ---------------------------------------------------------------------------------------------
> Test-2:
>
> If I first remove all the data of partition (pt='mix_en', dt='2016-10-21')
> in table login4game, and then run:
>
>     ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');
>
>     INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
>     SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
>     FROM tbllog_login WHERE pt='mix_en' AND dt='2016-10-21';
>
> the logs are the same:
>
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> [Stage 5:=======================================> (144 + 8) / 200] 15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
> [Stage 5:=========================================> (152 + 8) / 200] 15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
> [Stage 5:============================================> (160 + 8) / 200] 15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]
> [Stage 5:================================================> (176 + 8) / 200] 15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> [Stage 5:================================================> (177 + 8) / 200] 15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> [Stage 5:====================================================> (192 + 8) / 200] 15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs]
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> .......
> Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
> /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
>
> But after the move to trash, I do not have to wait 10 minutes; the next log
> appears immediately. In about 20 seconds the SQL finishes, and I get the
> same result as in the run above (the drop-partition statement takes about 5
> seconds).
>
> ----------------------------------------------------------------------------------------------
> Test-3:
>
> Now, if I run the same SQL in Hive itself (not via spark-sql or spark-shell,
> but in beeline or the hive client): before running, I load the same data
> into partition (pt='mix_en', dt='2016-10-21') of login4game, and then run:
>
>     INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
>     SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
>     FROM tbllog_login WHERE pt='mix_en' AND dt='2016-10-21';
>
> After running, I get the same result in about 20 seconds.
>
> ==========================================================
> Now, the question: why is it so slow when I use spark-sql to execute an
> insert overwrite statement on a partitioned table that already contains
> data (if there is no data in the selected partitions, it runs fast)?
>
> Important points:
> 1. Moving the old data to trash takes a lot of time when there is much data.
> 2. Running the same SQL in beeline is fast.
> 3. Dropping the data from the selected partitions before executing the
>    insert overwrite statement speeds up the execution.
>
> Is there anyone who could help me? Thanks a lot!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
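
The workaround in point 3 of the report above (drop the partition before overwriting it) can be sketched as the following two-step sequence. This is a sketch under the report's own table and partition names, not an official fix; whether the DROP itself still routes the old files through .Trash depends on the cluster's trash configuration (e.g. the Hadoop setting fs.trash.interval):

```sql
-- Step 1: drop the target partition first, so the subsequent overwrite
-- finds no existing files to move to .Trash one by one.
ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');

-- Step 2: the overwrite now only has to write the new files.
INSERT OVERWRITE TABLE login4game PARTITION (pt='mix_en', dt='2016-10-21')
SELECT DISTINCT account_name, role_id, server, '1476979200' AS recdate
FROM tbllog_login
WHERE pt='mix_en' AND dt='2016-10-21';
```

Later Hive versions also offer the table property 'auto.purge' (set via ALTER TABLE ... SET TBLPROPERTIES ('auto.purge'='true')) to skip the trash on drop/overwrite, but its availability in the Hive 2.0.1 / Spark 2.0 setup reported here is not guaranteed.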