I am having a strange problem writing to s3 that I have distilled to this
minimal example:

def jsonRaw = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"

val txt = sc.textFile(inpath)//.coalesce(shards, false)
txt.count

val res = txt.saveAsTextFile(jsonRaw)

val txt2 = sc.textFile(jsonRaw +"/part-*")
txt2.count

txt2.saveAsTextFile(jsonClean)

This code should simply copy files from inpath to jsonRaw and then from
jsonRaw to jsonClean.
This code executes all the way down to the last line where it hangs after
creating the output directory contatining a _temporary_$folder but no
actual files not even temporary ones.

`outputprefix` is and  bucket url, both jsonRaw and jsonClean are in the
same bucket.
Both calls .count succeed and return the same number. This means Spark can
read from inpath and can both read from and write to jsonRaw. Since
jsonClean is in the same bucket as jsonRaw and the final line does create
the directory, I cannot think of any reason why the files should  not be
written. If there were any access or url problems they should already
manifest when writing jsonRaw.

This problem is completely reproduceable with Spark 1.2.1 and 1.3.1
The console output from the last line is

scala> txt0.saveAsTextFile(jsonClean)
15/04/21 22:55:48 INFO storage.BlockManager: Removing broadcast 3
15/04/21 22:55:48 INFO storage.BlockManager: Removing block
broadcast_3_piece0
15/04/21 22:55:48 INFO storage.MemoryStore: Block broadcast_3_piece0 of
size 2024 dropped from memory (free 278251716)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-51-181-81.ec2.internal:45199 in memory (size: 2024.0 B, free:
265.4 MB)
15/04/21 22:55:48 INFO storage.BlockManagerMaster: Updated info of block
broadcast_3_piece0
15/04/21 22:55:48 INFO storage.BlockManager: Removing block broadcast_3
15/04/21 22:55:48 INFO storage.MemoryStore: Block broadcast_3 of size 2728
dropped from memory (free 278254444)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-166-129-153.ec2.internal:46671 in memory (size: 2024.0 B, free:
13.8 GB)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-51-153-34.ec2.internal:51691 in memory (size: 2024.0 B, free: 13.8
GB)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-158-142-155.ec2.internal:54690 in memory (size: 2024.0 B, free:
13.8 GB)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-61-144-7.ec2.internal:44849 in memory (size: 2024.0 B, free: 13.8
GB)
15/04/21 22:55:48 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0
on ip-10-69-77-180.ec2.internal:42417 in memory (size: 2024.0 B, free: 13.8
GB)
15/04/21 22:55:48 INFO spark.ContextCleaner: Cleaned broadcast 3
15/04/21 22:55:49 INFO spark.SparkContext: Starting job: saveAsTextFile at
<console>:38
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Got job 2 (saveAsTextFile at
<console>:38) with 96 output partitions (allowLocal=false)
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Final stage: Stage
2(saveAsTextFile at <console>:38)
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Parents of final stage:
List()
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Missing parents: List()
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Submitting Stage 2
(MapPartitionsRDD[5] at saveAsTextFile at <console>:38), which has no
missing parents
15/04/21 22:55:49 INFO storage.MemoryStore: ensureFreeSpace(22248) called
with curMem=48112, maxMem=278302556
15/04/21 22:55:49 INFO storage.MemoryStore: Block broadcast_4 stored as
values in memory (estimated size 21.7 KB, free 265.3 MB)
15/04/21 22:55:49 INFO storage.MemoryStore: ensureFreeSpace(17352) called
with curMem=70360, maxMem=278302556
15/04/21 22:55:49 INFO storage.MemoryStore: Block broadcast_4_piece0 stored
as bytes in memory (estimated size 16.9 KB, free 265.3 MB)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-51-181-81.ec2.internal:45199 (size: 16.9 KB, free: 265.4
MB)
15/04/21 22:55:49 INFO storage.BlockManagerMaster: Updated info of block
broadcast_4_piece0
15/04/21 22:55:49 INFO spark.SparkContext: Created broadcast 4 from
broadcast at DAGScheduler.scala:839
15/04/21 22:55:49 INFO scheduler.DAGScheduler: Submitting 96 missing tasks
from Stage 2 (MapPartitionsRDD[5] at saveAsTextFile at <console>:38)
15/04/21 22:55:49 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0
with 96 tasks
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
2.0 (TID 192, ip-10-166-129-153.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage
2.0 (TID 193, ip-10-61-144-7.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage
2.0 (TID 194, ip-10-158-142-155.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 3.0 in stage
2.0 (TID 195, ip-10-69-77-180.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 4.0 in stage
2.0 (TID 196, ip-10-51-153-34.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 5.0 in stage
2.0 (TID 197, ip-10-166-129-153.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 6.0 in stage
2.0 (TID 198, ip-10-61-144-7.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 7.0 in stage
2.0 (TID 199, ip-10-158-142-155.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 8.0 in stage
2.0 (TID 200, ip-10-69-77-180.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 9.0 in stage
2.0 (TID 201, ip-10-51-153-34.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 10.0 in
stage 2.0 (TID 202, ip-10-166-129-153.ec2.internal, PROCESS_LOCAL, 1377
bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 11.0 in
stage 2.0 (TID 203, ip-10-61-144-7.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 12.0 in
stage 2.0 (TID 204, ip-10-158-142-155.ec2.internal, PROCESS_LOCAL, 1377
bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 13.0 in
stage 2.0 (TID 205, ip-10-69-77-180.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 14.0 in
stage 2.0 (TID 206, ip-10-51-153-34.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 15.0 in
stage 2.0 (TID 207, ip-10-166-129-153.ec2.internal, PROCESS_LOCAL, 1377
bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 16.0 in
stage 2.0 (TID 208, ip-10-61-144-7.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 17.0 in
stage 2.0 (TID 209, ip-10-158-142-155.ec2.internal, PROCESS_LOCAL, 1377
bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 18.0 in
stage 2.0 (TID 210, ip-10-69-77-180.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO scheduler.TaskSetManager: Starting task 19.0 in
stage 2.0 (TID 211, ip-10-51-153-34.ec2.internal, PROCESS_LOCAL, 1377 bytes)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-61-144-7.ec2.internal:44849 (size: 16.9 KB, free: 13.8
GB)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-69-77-180.ec2.internal:42417 (size: 16.9 KB, free: 13.8
GB)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-158-142-155.ec2.internal:54690 (size: 16.9 KB, free:
13.8 GB)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-166-129-153.ec2.internal:46671 (size: 16.9 KB, free:
13.8 GB)
15/04/21 22:55:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0
in memory on ip-10-51-153-34.ec2.internal:51691 (size: 16.9 KB, free: 13.8
GB)

This feels like I am missing something really basic.

thanks
Daniel

Reply via email to