[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-2274.
-------------------------------------
    Resolution: Fixed
 Fix Version/s: 1.1.0

The attached query uses left outer joins, which were executed very slowly until 1.1. Please reopen this issue if the problem persists on a newer version of Spark.

> spark SQL query hang up sometimes
> ---------------------------------
>
>                 Key: SPARK-2274
>                 URL: https://issues.apache.org/jira/browse/SPARK-2274
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>         Environment: spark 1.0.0
>            Reporter: jackielihf
>            Assignee: Michael Armbrust
>             Fix For: 1.1.0
>
> When I run a Spark SQL query, it sometimes hangs:
> 1) a simple SQL query works, such as "select * from a left outer join b on a.id=b.id"
> 2) BUT if it has more joins, such as "select * from a left outer join b on a.id=b.id left outer join c on a.id=c.id ...", the spark shell seems to hang.
> spark shell prints:
> scala> hc.hql("select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
> 14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join
bas_secu_info E on > D.secu_id=E.secu_id where D.secu_id is not null limit 5 > 14/06/25 13:32:25 INFO ParseDriver: Parse Completed > 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch > MultiInstanceRelations > 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch > CaseInsensitiveAttributeReferences > 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with > curMem=0, maxMem=311387750 > 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to > memory (estimated size 215.7 KB, free 296.8 MB) > 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with > curMem=220923, maxMem=311387750 > 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to > memory (estimated size 215.8 KB, free 296.5 MB) > 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with > curMem=441894, maxMem=311387750 > 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to > memory (estimated size 215.8 KB, free 296.3 MB) > 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with > curMem=662865, maxMem=311387750 > 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to > memory (estimated size 215.8 KB, free 296.1 MB) > 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with > curMem=883836, maxMem=311387750 > 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to > memory (estimated size 215.8 KB, free 295.9 MB) > 14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for > batch Add exchange > 14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for > batch Prepare Expressions > 14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1 > 14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184 > 14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) > with 2 output partitions (allowLocal=false) > 14/06/25 
13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at > joins.scala:184) > 14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map > at joins.scala:184), which has no missing parents > 14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 > (MappedRDD[7] at map at joins.scala:184) > 14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks > 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in > 3 ms > 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:1 as 4088 bytes in > 2 ms > 14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_1 in memory on > 192.168.56.100:47102 (size: 15.0 MB, free: 281.9 MB) > 14/06/25 13:32:44 INFO SendingConnection: Initiating connection to > [/192.168.56.100:47102] > 14/06/25 13:32:44 INFO SendingConnection: Connected to > [/192.168.56.100:47102], 1 messages pending > 14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_0 in memory on > 192.168.56.100:41322 (size: 15.1 MB, free: 281.9 MB) > 14/06/25 13:32:44 INFO SendingConnection: Initiating connection to > [/192.168.56.100:41322] > 14/06/25 13:32:44 INFO SendingConnection: Connected to > [/192.168.56.100:41322], 1 messages pending > 14/06/25 13:32:49 INFO ConnectionManager: Accepted connection from > [192.168.56.100/192.168.56.100] > 14/06/25 13:32:55 INFO BlockManagerInfo: Removed taskresult_1 on > 192.168.56.100:47102 in memory (size: 15.0 MB, free: 297.0 MB) > 14/06/25 13:32:55 INFO DAGScheduler: Completed ResultTask(0, 1) > 14/06/25 13:32:55 INFO TaskSetManager: Finished TID 1 in 24749 ms on > 192.168.56.100 
(progress: 1/2) > 14/06/25 13:33:10 INFO ConnectionManager: Accepted connection from > [192.168.56.100/192.168.56.100] > 14/06/25 13:33:15 INFO BlockManagerInfo: Removed taskresult_0 on > 192.168.56.100:41322 in memory (size: 15.1 MB, free: 297.0 MB) > 14/06/25 13:33:15 INFO DAGScheduler: Completed ResultTask(0, 0) > 14/06/25 13:33:15 INFO TaskSetManager: Finished TID 0 in 45025 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:15 INFO DAGScheduler: Stage 0 (collect at joins.scala:184) > finished in 45.665 s > 14/06/25 13:33:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/06/25 13:33:16 INFO SparkContext: Job finished: collect at > joins.scala:184, took 45.910877269 s > 14/06/25 13:33:20 INFO MemoryStore: ensureFreeSpace(133210496) called with > curMem=1104807, maxMem=311387750 > 14/06/25 13:33:20 INFO MemoryStore: Block broadcast_5 stored as values to > memory (estimated size 127.0 MB, free 168.9 MB) > 14/06/25 13:33:22 INFO FileInputFormat: Total input paths to process : 1 > 14/06/25 13:33:22 INFO SparkContext: Starting job: collect at joins.scala:184 > 14/06/25 13:33:22 INFO DAGScheduler: Got job 1 (collect at joins.scala:184) > with 2 output partitions (allowLocal=false) > 14/06/25 13:33:22 INFO DAGScheduler: Final stage: Stage 1(collect at > joins.scala:184) > 14/06/25 13:33:22 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:22 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:22 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[12] at map > at joins.scala:184), which has no missing parents > 14/06/25 13:33:22 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 > (MappedRDD[12] at map at joins.scala:184) > 14/06/25 13:33:22 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks > 14/06/25 13:33:22 INFO TaskSetManager: Starting task 1.0:0 as TID 2 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:22 INFO TaskSetManager: Serialized task 1.0:0 as 3705 bytes 
in > 0 ms > 14/06/25 13:33:22 INFO TaskSetManager: Starting task 1.0:1 as TID 3 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:22 INFO TaskSetManager: Serialized task 1.0:1 as 3705 bytes in > 0 ms > 14/06/25 13:33:26 INFO DAGScheduler: Completed ResultTask(1, 1) > 14/06/25 13:33:26 INFO TaskSetManager: Finished TID 3 in 4178 ms on > 192.168.56.100 (progress: 1/2) > 14/06/25 13:33:26 INFO DAGScheduler: Completed ResultTask(1, 0) > 14/06/25 13:33:26 INFO DAGScheduler: Stage 1 (collect at joins.scala:184) > finished in 4.365 s > 14/06/25 13:33:26 INFO SparkContext: Job finished: collect at > joins.scala:184, took 4.437216848 s > 14/06/25 13:33:26 INFO TaskSetManager: Finished TID 2 in 4278 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:26 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks > have all completed, from pool > 14/06/25 13:33:27 INFO MemoryStore: ensureFreeSpace(31747896) called with > curMem=134315303, maxMem=311387750 > 14/06/25 13:33:27 INFO MemoryStore: Block broadcast_6 stored as values to > memory (estimated size 30.3 MB, free 138.6 MB) > 14/06/25 13:33:28 INFO FileInputFormat: Total input paths to process : 1 > 14/06/25 13:33:28 INFO SparkContext: Starting job: collect at joins.scala:184 > 14/06/25 13:33:28 INFO DAGScheduler: Got job 2 (collect at joins.scala:184) > with 2 output partitions (allowLocal=false) > 14/06/25 13:33:28 INFO DAGScheduler: Final stage: Stage 2(collect at > joins.scala:184) > 14/06/25 13:33:28 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:28 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:28 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[17] at map > at joins.scala:184), which has no missing parents > 14/06/25 13:33:28 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2 > (MappedRDD[17] at map at joins.scala:184) > 14/06/25 13:33:28 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks > 14/06/25 13:33:28 INFO TaskSetManager: Starting task 2.0:0 as 
TID 4 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:28 INFO TaskSetManager: Serialized task 2.0:0 as 3627 bytes in > 0 ms > 14/06/25 13:33:28 INFO TaskSetManager: Starting task 2.0:1 as TID 5 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:28 INFO TaskSetManager: Serialized task 2.0:1 as 3627 bytes in > 0 ms > 14/06/25 13:33:30 INFO DAGScheduler: Completed ResultTask(2, 0) > 14/06/25 13:33:30 INFO TaskSetManager: Finished TID 4 in 2285 ms on > 192.168.56.100 (progress: 1/2) > 14/06/25 13:33:30 INFO DAGScheduler: Completed ResultTask(2, 1) > 14/06/25 13:33:30 INFO DAGScheduler: Stage 2 (collect at joins.scala:184) > finished in 2.681 s > 14/06/25 13:33:30 INFO SparkContext: Job finished: collect at > joins.scala:184, took 2.743363946 s > 14/06/25 13:33:30 INFO TaskSetManager: Finished TID 5 in 2610 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks > have all completed, from pool > 14/06/25 13:33:33 INFO MemoryStore: ensureFreeSpace(49064048) called with > curMem=166063199, maxMem=311387750 > 14/06/25 13:33:33 INFO MemoryStore: Block broadcast_7 stored as values to > memory (estimated size 46.8 MB, free 91.8 MB) > 14/06/25 13:33:34 INFO FileInputFormat: Total input paths to process : 1 > 14/06/25 13:33:34 INFO SparkContext: Starting job: collect at joins.scala:184 > 14/06/25 13:33:34 INFO DAGScheduler: Got job 3 (collect at joins.scala:184) > with 2 output partitions (allowLocal=false) > 14/06/25 13:33:34 INFO DAGScheduler: Final stage: Stage 3(collect at > joins.scala:184) > 14/06/25 13:33:34 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:34 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:34 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[22] at map > at joins.scala:184), which has no missing parents > 14/06/25 13:33:34 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 > (MappedRDD[22] at map at joins.scala:184) > 14/06/25 
13:33:34 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks > 14/06/25 13:33:34 INFO TaskSetManager: Starting task 3.0:0 as TID 6 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:34 INFO TaskSetManager: Serialized task 3.0:0 as 3708 bytes in > 1 ms > 14/06/25 13:33:34 INFO TaskSetManager: Starting task 3.0:1 as TID 7 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:34 INFO TaskSetManager: Serialized task 3.0:1 as 3708 bytes in > 1 ms > 14/06/25 13:33:35 INFO DAGScheduler: Completed ResultTask(3, 1) > 14/06/25 13:33:35 INFO TaskSetManager: Finished TID 7 in 608 ms on > 192.168.56.100 (progress: 1/2) > 14/06/25 13:33:35 INFO DAGScheduler: Completed ResultTask(3, 0) > 14/06/25 13:33:35 INFO DAGScheduler: Stage 3 (collect at joins.scala:184) > finished in 0.636 s > 14/06/25 13:33:35 INFO SparkContext: Job finished: collect at > joins.scala:184, took 0.647360428 s > 14/06/25 13:33:35 INFO TaskSetManager: Finished TID 6 in 627 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:35 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks > have all completed, from pool > 14/06/25 13:33:35 INFO MemoryStore: ensureFreeSpace(4471280) called with > curMem=215127247, maxMem=311387750 > 14/06/25 13:33:35 INFO MemoryStore: Block broadcast_8 stored as values to > memory (estimated size 4.3 MB, free 87.5 MB) > 14/06/25 13:33:35 INFO FileInputFormat: Total input paths to process : 1 > 14/06/25 13:33:35 INFO SparkContext: Starting job: count at joins.scala:216 > 14/06/25 13:33:35 INFO DAGScheduler: Got job 4 (count at joins.scala:216) > with 2 output partitions (allowLocal=false) > 14/06/25 13:33:35 INFO DAGScheduler: Final stage: Stage 4(count at > joins.scala:216) > 14/06/25 13:33:35 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:35 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:35 INFO DAGScheduler: Submitting Stage 4 (MappedRDD[28] at map > at joins.scala:214), which has no missing parents > 14/06/25 
13:33:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 > (MappedRDD[28] at map at joins.scala:214) > 14/06/25 13:33:36 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks > 14/06/25 13:33:36 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:36 INFO TaskSetManager: Serialized task 4.0:0 as 4570 bytes in > 0 ms > 14/06/25 13:33:36 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:36 INFO TaskSetManager: Serialized task 4.0:1 as 4570 bytes in > 0 ms > 14/06/25 13:33:47 INFO TaskSetManager: Finished TID 9 in 10869 ms on > 192.168.56.100 (progress: 1/2) > 14/06/25 13:33:47 INFO DAGScheduler: Completed ResultTask(4, 1) > 14/06/25 13:33:47 INFO TaskSetManager: Finished TID 8 in 11240 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:47 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks > have all completed, from pool > 14/06/25 13:33:47 INFO DAGScheduler: Completed ResultTask(4, 0) > 14/06/25 13:33:47 INFO DAGScheduler: Stage 4 (count at joins.scala:216) > finished in 11.191 s > 14/06/25 13:33:47 INFO SparkContext: Job finished: count at joins.scala:216, > took 12.351801444 s > 14/06/25 13:33:47 INFO SparkContext: Starting job: reduce at joins.scala:219 > 14/06/25 13:33:47 INFO DAGScheduler: Got job 5 (reduce at joins.scala:219) > with 2 output partitions (allowLocal=false) > 14/06/25 13:33:47 INFO DAGScheduler: Final stage: Stage 5(reduce at > joins.scala:219) > 14/06/25 13:33:47 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:47 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:47 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[29] at map > at joins.scala:219), which has no missing parents > 14/06/25 13:33:47 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 > (MappedRDD[29] at map at joins.scala:219) > 14/06/25 13:33:47 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 
tasks > 14/06/25 13:33:47 INFO TaskSetManager: Starting task 5.0:0 as TID 10 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:47 INFO TaskSetManager: Serialized task 5.0:0 as 4586 bytes in > 0 ms > 14/06/25 13:33:47 INFO TaskSetManager: Starting task 5.0:1 as TID 11 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:47 INFO TaskSetManager: Serialized task 5.0:1 as 4586 bytes in > 0 ms > 14/06/25 13:33:52 INFO TaskSetManager: Finished TID 10 in 4929 ms on > 192.168.56.100 (progress: 1/2) > 14/06/25 13:33:52 INFO DAGScheduler: Completed ResultTask(5, 0) > 14/06/25 13:33:52 INFO TaskSetManager: Finished TID 11 in 5285 ms on > 192.168.56.100 (progress: 2/2) > 14/06/25 13:33:52 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks > have all completed, from pool > 14/06/25 13:33:52 INFO DAGScheduler: Completed ResultTask(5, 1) > 14/06/25 13:33:52 INFO DAGScheduler: Stage 5 (reduce at joins.scala:219) > finished in 5.236 s > 14/06/25 13:33:52 INFO SparkContext: Job finished: reduce at joins.scala:219, > took 5.306035846 s > 14/06/25 13:33:53 INFO SparkContext: Starting job: count at joins.scala:216 > 14/06/25 13:33:53 INFO DAGScheduler: Got job 6 (count at joins.scala:216) > with 4 output partitions (allowLocal=false) > 14/06/25 13:33:53 INFO DAGScheduler: Final stage: Stage 6(count at > joins.scala:216) > 14/06/25 13:33:53 INFO DAGScheduler: Parents of final stage: List() > 14/06/25 13:33:53 INFO DAGScheduler: Missing parents: List() > 14/06/25 13:33:53 INFO DAGScheduler: Submitting Stage 6 (MappedRDD[34] at map > at joins.scala:214), which has no missing parents > 14/06/25 13:33:53 INFO DAGScheduler: Submitting 4 missing tasks from Stage 6 > (MappedRDD[34] at map at joins.scala:214) > 14/06/25 13:33:53 INFO TaskSchedulerImpl: Adding task set 6.0 with 4 tasks > 14/06/25 13:33:53 INFO TaskSetManager: Starting task 6.0:0 as TID 12 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:53 INFO TaskSetManager: Serialized task 6.0:0 
as 17113 bytes > in 2 ms > 14/06/25 13:33:53 INFO TaskSetManager: Starting task 6.0:1 as TID 13 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:33:53 INFO TaskSetManager: Serialized task 6.0:1 as 17113 bytes > in 4 ms > 14/06/25 13:47:25 INFO TaskSetManager: Starting task 6.0:2 as TID 14 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:47:25 INFO TaskSetManager: Serialized task 6.0:2 as 5974 bytes in > 101 ms > 14/06/25 13:47:25 INFO TaskSetManager: Finished TID 12 in 812264 ms on > 192.168.56.100 (progress: 1/4) > 14/06/25 13:47:25 INFO DAGScheduler: Completed ResultTask(6, 0) > 14/06/25 13:47:25 INFO TaskSetManager: Starting task 6.0:3 as TID 15 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 13:47:25 INFO TaskSetManager: Serialized task 6.0:3 as 5974 bytes in > 6 ms > 14/06/25 13:47:25 INFO TaskSetManager: Finished TID 14 in 462 ms on > 192.168.56.100 (progress: 2/4) > 14/06/25 13:47:25 INFO DAGScheduler: Completed ResultTask(6, 2) > 14/06/25 13:47:25 INFO DAGScheduler: Completed ResultTask(6, 3) > 14/06/25 13:47:25 INFO TaskSetManager: Finished TID 15 in 81 ms on > 192.168.56.100 (progress: 3/4) > 14/06/25 14:11:07 INFO TaskSetManager: Finished TID 13 in 2234716 ms on > 192.168.56.100 (progress: 4/4) > 14/06/25 14:11:07 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks > have all completed, from pool > 14/06/25 14:11:07 INFO DAGScheduler: Completed ResultTask(6, 1) > 14/06/25 14:11:07 INFO DAGScheduler: Stage 6 (count at joins.scala:216) > finished in 2234.742 s > 14/06/25 14:11:07 INFO SparkContext: Job finished: count at joins.scala:216, > took 2234.877469214 s > 14/06/25 14:11:08 INFO SparkContext: Starting job: reduce at joins.scala:219 > 14/06/25 14:11:08 INFO DAGScheduler: Got job 7 (reduce at joins.scala:219) > with 4 output partitions (allowLocal=false) > 14/06/25 14:11:08 INFO DAGScheduler: Final stage: Stage 7(reduce at > joins.scala:219) > 14/06/25 14:11:08 INFO DAGScheduler: Parents of final stage: List() 
> 14/06/25 14:11:08 INFO DAGScheduler: Missing parents: List() > 14/06/25 14:11:08 INFO DAGScheduler: Submitting Stage 7 (MappedRDD[35] at map > at joins.scala:219), which has no missing parents > 14/06/25 14:11:08 INFO DAGScheduler: Submitting 4 missing tasks from Stage 7 > (MappedRDD[35] at map at joins.scala:219) > 14/06/25 14:11:08 INFO TaskSchedulerImpl: Adding task set 7.0 with 4 tasks > 14/06/25 14:11:08 INFO TaskSetManager: Starting task 7.0:0 as TID 16 on > executor 0: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 14:11:08 INFO TaskSetManager: Serialized task 7.0:0 as 17130 bytes > in 10 ms > 14/06/25 14:11:08 INFO TaskSetManager: Starting task 7.0:1 as TID 17 on > executor 1: 192.168.56.100 (PROCESS_LOCAL) > 14/06/25 14:11:08 INFO TaskSetManager: Serialized task 7.0:1 as 17130 bytes > in 4 ms > ==================== > logs stderr: > Spark Executor Command: "/usr/local/java/jdk1.7.0_25/bin/java" "-cp" > "::/usr/local/spark/spark-1.0.0-bin-cdh4/conf:/usr/local/spark/spark-1.0.0-bin-cdh4/lib/spark-assembly-1.0.0-hadoop2.0.0-mr1-cdh4.2.0.jar:/usr/local/spark/spark-1.0.0-bin-cdh4/lib/datanucleus-api-jdo-3.2.1.jar:/usr/local/spark/spark-1.0.0-bin-cdh4/lib/datanucleus-rdbms-3.2.1.jar:/usr/local/spark/spark-1.0.0-bin-cdh4/lib/datanucleus-core-3.2.2.jar" > "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" > "akka.tcp://spark@192.168.56.100:57832/user/CoarseGrainedScheduler" "1" > "192.168.56.100" "1" > "akka.tcp://sparkWorker@192.168.56.100:41139/user/Worker" > "app-20140625133200-0013" > ======================================== > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. 
> 14/06/25 13:32:04 INFO SparkHadoopUtil: Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 14/06/25 13:32:04 INFO SecurityManager: Changing view acls to: root > 14/06/25 13:32:04 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root) > 14/06/25 13:32:06 INFO Slf4jLogger: Slf4jLogger started > 14/06/25 13:32:06 INFO Remoting: Starting remoting > 14/06/25 13:32:07 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkExecutor@192.168.56.100:57820] > 14/06/25 13:32:07 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://sparkExecutor@192.168.56.100:57820] > 14/06/25 13:32:07 INFO CoarseGrainedExecutorBackend: Connecting to driver: > akka.tcp://spark@192.168.56.100:57832/user/CoarseGrainedScheduler > 14/06/25 13:32:07 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@192.168.56.100:41139/user/Worker > 14/06/25 13:32:07 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@192.168.56.100:41139/user/Worker > 14/06/25 13:32:08 INFO CoarseGrainedExecutorBackend: Successfully registered > with driver > 14/06/25 13:32:08 INFO SecurityManager: Changing view acls to: root > 14/06/25 13:32:08 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root) > 14/06/25 13:32:08 INFO Slf4jLogger: Slf4jLogger started > 14/06/25 13:32:08 INFO Remoting: Starting remoting > 14/06/25 13:32:08 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@192.168.56.100:57296] > 14/06/25 13:32:08 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@192.168.56.100:57296] > 14/06/25 13:32:08 INFO SparkEnv: Connecting to MapOutputTracker: > akka.tcp://spark@192.168.56.100:57832/user/MapOutputTracker > 14/06/25 13:32:08 INFO SparkEnv: Connecting to BlockManagerMaster: > akka.tcp://spark@192.168.56.100:57832/user/BlockManagerMaster > 
14/06/25 13:32:08 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140625133208-88c8 > 14/06/25 13:32:08 INFO MemoryStore: MemoryStore started with capacity 297.0 > MB. > 14/06/25 13:32:09 INFO ConnectionManager: Bound socket to port 47102 with id > = ConnectionManagerId(192.168.56.100,47102) > 14/06/25 13:32:09 INFO BlockManagerMaster: Trying to register BlockManager > 14/06/25 13:32:09 INFO BlockManagerMaster: Registered BlockManager > 14/06/25 13:32:09 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-1ab56863-9a5a-43d6-8e8a-8ef50db0a8a4 > 14/06/25 13:32:09 INFO HttpServer: Starting HTTP Server > 14/06/25 13:32:09 WARN Utils: Your hostname, localhost.localdomain resolves > to a loopback address: 127.0.0.1; using 192.168.56.100 instead (on interface > eth1) > 14/06/25 13:32:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 14/06/25 13:32:10 INFO Executor: Using REPL class URI: > http://192.168.56.100:53773 > 14/06/25 13:32:30 INFO CoarseGrainedExecutorBackend: Got assigned task 1 > 14/06/25 13:32:30 INFO Executor: Running task ID 1 > 14/06/25 13:32:31 INFO HttpBroadcast: Started reading broadcast variable 4 > 14/06/25 13:32:32 WARN HiveConf: DEPRECATED: Configuration property > hive.metastore.local no longer has any effect. Make sure to provide a valid > value for hive.metastore.uris if you are connecting to a remote metastore. > 14/06/25 13:32:33 INFO MemoryStore: ensureFreeSpace(379695) called with > curMem=0, maxMem=311387750 > 14/06/25 13:32:33 INFO MemoryStore: Block broadcast_4 stored as values to > memory (estimated size 370.8 KB, free 296.6 MB) > 14/06/25 13:32:33 INFO HttpBroadcast: Reading broadcast variable 4 took > 2.882124397 s > 14/06/25 13:32:35 INFO HadoopRDD: Input split: > hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/bas_secu_info/part-m-00000:14110777+14110778 > 14/06/25 13:32:37 WARN LazyStruct: Extra bytes detected at the end of the > row! 
Ignoring similar problems. > 14/06/25 13:32:44 INFO Executor: Serialized size of result for 1 is 15754830 > 14/06/25 13:32:44 INFO Executor: Storing result for 1 in local BlockManager > 14/06/25 13:32:44 INFO MemoryStore: ensureFreeSpace(15754830) called with > curMem=379695, maxMem=311387750 > 14/06/25 13:32:44 INFO MemoryStore: Block taskresult_1 stored as bytes to > memory (size 15.0 MB, free 281.6 MB) > 14/06/25 13:32:44 INFO BlockManagerMaster: Updated info of block taskresult_1 > 14/06/25 13:32:44 INFO Executor: Finished task ID 1 > 14/06/25 13:32:44 INFO ConnectionManager: Accepted connection from > [192.168.56.100/192.168.56.100] > 14/06/25 13:32:44 INFO SendingConnection: Initiating connection to > [/192.168.56.100:54915] > 14/06/25 13:32:44 INFO SendingConnection: Connected to > [/192.168.56.100:54915], 1 messages pending > 14/06/25 13:32:55 INFO BlockManager: Removing block taskresult_1 > 14/06/25 13:32:55 INFO MemoryStore: Block taskresult_1 of size 15754830 > dropped from memory (free 311008055) > 14/06/25 13:32:55 INFO BlockManagerMaster: Updated info of block taskresult_1 > 14/06/25 13:33:22 INFO CoarseGrainedExecutorBackend: Got assigned task 3 > 14/06/25 13:33:22 INFO Executor: Running task ID 3 > 14/06/25 13:33:22 INFO HttpBroadcast: Started reading broadcast variable 3 > 14/06/25 13:33:22 WARN HiveConf: DEPRECATED: Configuration property > hive.metastore.local no longer has any effect. Make sure to provide a valid > value for hive.metastore.uris if you are connecting to a remote metastore. > 14/06/25 13:33:22 WARN Configuration: io.bytes.per.checksum is deprecated. 
> Instead, use dfs.bytes-per-checksum > 14/06/25 13:33:23 INFO MemoryStore: ensureFreeSpace(511815) called with > curMem=379695, maxMem=311387750 > 14/06/25 13:33:23 INFO MemoryStore: Block broadcast_3 stored as values to > memory (estimated size 499.8 KB, free 296.1 MB) > 14/06/25 13:33:23 INFO HttpBroadcast: Reading broadcast variable 3 took > 0.479206184 s > 14/06/25 13:33:23 INFO HadoopRDD: Input split: > hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_secu/part-m-00000:2542140+2542140 > 14/06/25 13:33:25 INFO Executor: Serialized size of result for 3 is 3155695 > 14/06/25 13:33:25 INFO Executor: Sending result for 3 directly to driver > 14/06/25 13:33:25 INFO Executor: Finished task ID 3 > 14/06/25 13:33:28 INFO CoarseGrainedExecutorBackend: Got assigned task 4 > 14/06/25 13:33:28 INFO Executor: Running task ID 4 > 14/06/25 13:33:28 INFO HttpBroadcast: Started reading broadcast variable 2 > 14/06/25 13:33:28 WARN HiveConf: DEPRECATED: Configuration property > hive.metastore.local no longer has any effect. Make sure to provide a valid > value for hive.metastore.uris if you are connecting to a remote metastore. > 14/06/25 13:33:28 WARN Configuration: io.bytes.per.checksum is deprecated. 
> Instead, use dfs.bytes-per-checksum > 14/06/25 13:33:28 INFO MemoryStore: ensureFreeSpace(511815) called with > curMem=891510, maxMem=311387750 > 14/06/25 13:33:28 INFO MemoryStore: Block broadcast_2 stored as values to > memory (estimated size 499.8 KB, free 295.6 MB) > 14/06/25 13:33:28 INFO HttpBroadcast: Reading broadcast variable 2 took > 0.390923128 s > 14/06/25 13:33:28 INFO HadoopRDD: Input split: > hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_typ/part-m-00000:0+4216842 > 14/06/25 13:33:29 INFO Executor: Serialized size of result for 4 is 5510834 > 14/06/25 13:33:29 INFO Executor: Sending result for 4 directly to driver > 14/06/25 13:33:29 INFO Executor: Finished task ID 4 > 14/06/25 13:33:34 INFO CoarseGrainedExecutorBackend: Got assigned task 7 > 14/06/25 13:33:34 INFO Executor: Running task ID 7 > 14/06/25 13:33:34 INFO HttpBroadcast: Started reading broadcast variable 1 > 14/06/25 13:33:34 WARN HiveConf: DEPRECATED: Configuration property > hive.metastore.local no longer has any effect. Make sure to provide a valid > value for hive.metastore.uris if you are connecting to a remote metastore. > 14/06/25 13:33:34 WARN Configuration: io.bytes.per.checksum is deprecated. 
> Instead, use dfs.bytes-per-checksum
> 14/06/25 13:33:34 INFO MemoryStore: ensureFreeSpace(511815) called with curMem=1403325, maxMem=311387750
> 14/06/25 13:33:34 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 499.8 KB, free 295.1 MB)
> 14/06/25 13:33:34 INFO HttpBroadcast: Reading broadcast variable 1 took 0.305664776 s
> 14/06/25 13:33:34 INFO HadoopRDD: Input split: hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_bas_txt/part-m-00000-1:2098795+2098796
> 14/06/25 13:33:35 INFO Executor: Serialized size of result for 7 is 2120896
> 14/06/25 13:33:35 INFO Executor: Sending result for 7 directly to driver
> 14/06/25 13:33:35 INFO Executor: Finished task ID 7
> 14/06/25 13:33:36 INFO CoarseGrainedExecutorBackend: Got assigned task 9
> 14/06/25 13:33:36 INFO Executor: Running task ID 9
> 14/06/25 13:33:36 INFO HttpBroadcast: Started reading broadcast variable 0
> 14/06/25 13:33:36 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
> 14/06/25 13:33:36 WARN Configuration: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
> 14/06/25 13:33:36 INFO MemoryStore: ensureFreeSpace(511815) called with curMem=1915140, maxMem=311387750
> 14/06/25 13:33:36 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 499.8 KB, free 294.6 MB)
> 14/06/25 13:33:36 INFO HttpBroadcast: Reading broadcast variable 0 took 0.397017543 s
> 14/06/25 13:33:36 INFO HttpBroadcast: Started reading broadcast variable 8
> 14/06/25 13:33:37 INFO MemoryStore: ensureFreeSpace(4471280) called with curMem=2426955, maxMem=311387750
> 14/06/25 13:33:37 INFO MemoryStore: Block broadcast_8 stored as values to memory (estimated size 4.3 MB, free 290.4 MB)
> 14/06/25 13:33:37 INFO HttpBroadcast: Reading broadcast variable 8 took 1.0188118 s
> 14/06/25 13:33:38 INFO HadoopRDD: Input split: hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_bas_update/part-m-00000:8778082+8778083
> 14/06/25 13:33:46 INFO LazyStruct: Missing fields! Expected 20 fields but only got 8! Ignoring similar problems.
> 14/06/25 13:33:47 INFO Executor: Serialized size of result for 9 is 602
> 14/06/25 13:33:47 INFO Executor: Sending result for 9 directly to driver
> 14/06/25 13:33:47 INFO Executor: Finished task ID 9
> 14/06/25 13:33:47 INFO CoarseGrainedExecutorBackend: Got assigned task 10
> 14/06/25 13:33:47 INFO Executor: Running task ID 10
> 14/06/25 13:33:47 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 13:33:47 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 13:33:47 INFO HadoopRDD: Input split: hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_bas_update/part-m-00000:0+8778082
> 14/06/25 13:33:52 INFO Executor: Serialized size of result for 10 is 818
> 14/06/25 13:33:52 INFO Executor: Sending result for 10 directly to driver
> 14/06/25 13:33:52 INFO Executor: Finished task ID 10
> 14/06/25 13:33:53 INFO CoarseGrainedExecutorBackend: Got assigned task 12
> 14/06/25 13:33:53 INFO Executor: Running task ID 12
> 14/06/25 13:33:53 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 13:33:53 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 13:33:53 INFO HttpBroadcast: Started reading broadcast variable 7
> 14/06/25 13:33:58 INFO MemoryStore: ensureFreeSpace(49064048) called with curMem=6898235, maxMem=311387750
> 14/06/25 13:33:58 INFO MemoryStore: Block broadcast_7 stored as values to memory (estimated size 46.8 MB, free 243.6 MB)
> 14/06/25 13:33:58 INFO HttpBroadcast: Reading broadcast variable 7 took 5.00987562 s
> 14/06/25 13:33:58 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 13:33:58 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 13:33:58 INFO HadoopRDD: Input split: hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_bas_update/part-m-00000:0+8778082
> 14/06/25 13:47:25 INFO Executor: Serialized size of result for 12 is 602
> 14/06/25 13:47:25 INFO Executor: Sending result for 12 directly to driver
> 14/06/25 13:47:25 INFO Executor: Finished task ID 12
> 14/06/25 13:47:25 INFO CoarseGrainedExecutorBackend: Got assigned task 14
> 14/06/25 13:47:25 INFO Executor: Running task ID 14
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_7 locally
> 14/06/25 13:47:25 INFO Executor: Serialized size of result for 14 is 602
> 14/06/25 13:47:25 INFO Executor: Sending result for 14 directly to driver
> 14/06/25 13:47:25 INFO Executor: Finished task ID 14
> 14/06/25 13:47:25 INFO CoarseGrainedExecutorBackend: Got assigned task 15
> 14/06/25 13:47:25 INFO Executor: Running task ID 15
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 13:47:25 INFO BlockManager: Found block broadcast_7 locally
> 14/06/25 13:47:25 INFO Executor: Serialized size of result for 15 is 602
> 14/06/25 13:47:25 INFO Executor: Sending result for 15 directly to driver
> 14/06/25 13:47:25 INFO Executor: Finished task ID 15
> 14/06/25 14:11:08 INFO CoarseGrainedExecutorBackend: Got assigned task 17
> 14/06/25 14:11:08 INFO Executor: Running task ID 17
> 14/06/25 14:11:08 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 14:11:08 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 14:11:08 INFO BlockManager: Found block broadcast_7 locally
> 14/06/25 14:11:08 INFO BlockManager: Found block broadcast_0 locally
> 14/06/25 14:11:08 INFO BlockManager: Found block broadcast_8 locally
> 14/06/25 14:11:08 INFO HadoopRDD: Input split: hdfs://localhost.localdomain:8020/user/hive/warehouse/gfinfodb.db/txt_nws_bas_update/part-m-00000:8778082+8778083
> 14/06/25 14:11:21 INFO LazyStruct: Missing fields! Expected 20 fields but only got 8! Ignoring similar problems.
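Note that the query that hangs is a chain of LEFT OUTER JOINs with a filter on the last joined table. The broadcast-variable reads in the log are consistent with Spark 1.0 evaluating outer joins via broadcast nested-loop joins (hash-based outer joins only landed in 1.1, per the Fix Version), which would explain the long stall between 13:33:58 and 13:47:25 on larger inputs. The intended semantics of the query shape can be sketched with sqlite3 on invented toy data (table and column names are abbreviated from the report; the rows are made up for illustration):

```python
# Toy reproduction of the multi-way LEFT OUTER JOIN shape from the report.
# Data and values are invented; only the join/filter structure mirrors the query.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE txt_nws_bas_update (id INTEGER, tit TEXT);
CREATE TABLE txt_nws_bas_txt    (orig_id INTEGER, abst TEXT);
CREATE TABLE txt_nws_typ        (orig_id INTEGER, typ_code TEXT);
CREATE TABLE txt_nws_secu       (orig_id INTEGER, secu_id INTEGER);
INSERT INTO txt_nws_bas_update VALUES (1, 'title-1'), (2, 'title-2');
INSERT INTO txt_nws_bas_txt    VALUES (1, 'abstract-1');
INSERT INTO txt_nws_typ        VALUES (1, 'T01');
INSERT INTO txt_nws_secu       VALUES (1, 100);
""")

rows = cur.execute("""
SELECT A.id, A.tit, B.abst, C.typ_code, D.secu_id
FROM txt_nws_bas_update A
LEFT OUTER JOIN txt_nws_bas_txt B ON A.id = B.orig_id
LEFT OUTER JOIN txt_nws_typ     C ON A.id = C.orig_id
LEFT OUTER JOIN txt_nws_secu    D ON A.id = D.orig_id
WHERE D.secu_id IS NOT NULL
LIMIT 5
""").fetchall()

# id=2 has no txt_nws_secu match, so its NULL secu_id is dropped by the filter;
# only id=1 survives.
print(rows)  # [(1, 'title-1', 'abstract-1', 'T01', 100)]
```

The result is small either way; the reported hang comes from how the joins are executed, not from the answer they produce.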
> 3) and all the queries work fine on shark 0.9.1

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org