So upon completion of the two-hour part of the run, the files did not yet exist in
the output directory? One thing that is done serially is deleting any files
remaining under _temporary, so perhaps there was still a lot of data sitting in
_temporary even though the committed data had already been moved.
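If you want to check where the bytes actually are while the job is winding
down, something along these lines from a spark-shell should work (the output
path is a placeholder):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = new Path("gs://my-bucket/output/run-01") // your job's output dir
    val fs = out.getFileSystem(sc.hadoopConfiguration)
    val tmp = new Path(out, "_temporary")

    // bytes still staged under _temporary vs. bytes already committed
    val tmpBytes = if (fs.exists(tmp)) fs.getContentSummary(tmp).getLength else 0L
    val totalBytes = fs.getContentSummary(out).getLength
    println(s"staged: $tmpBytes bytes, committed: ${totalBytes - tmpBytes} bytes")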
TL;DR: extend FileOutputCommitter to eliminate the _temporary staging. There
are some implementations to be found online, typically called
DirectOutputCommitter, e.g. this Spark pull request:
https://github.com/themodernlife/spark/commit/4359664b1d557d55b0579023df809542386d5b8c.
Then tell Spark to use your committer through the job's Hadoop configuration.
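The implementations I've seen for the old mapred API (which saveAsTextFile
uses) boil down to a committer whose commit step is a no-op, so the output
format writes straight to the destination. A minimal sketch, untested, with a
made-up package name; note that without the commit step, a failed or
speculatively re-run task can leave partial output behind:

    package com.example // hypothetical package

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    // Because this is not a FileOutputCommitter, the old mapred
    // FileOutputFormat writes task output directly to the final path,
    // so there is nothing to rename at commit time.
    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}
      override def abortTask(taskContext: TaskAttemptContext): Unit = {}
    }

Then point the job's Hadoop configuration at it:

    sc.hadoopConfiguration.set("mapred.output.committer.class",
      "com.example.DirectOutputCommitter")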
We are running Spark on Google Compute Engine using their One-Click Deploy.
By doing so, we get their Google Cloud Storage connector for Hadoop for free,
meaning we can specify gs:// paths for input and output.
We have jobs that take a couple of hours and end up with ~9k partitions, which
means ~9k output files.
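For reference, the jobs are plain RDD pipelines along these lines (the bucket,
paths, and transformation are placeholders):

    val lines = sc.textFile("gs://my-bucket/input/")   // read through the GCS connector
    val result = lines.map(_.toUpperCase)              // stand-in for the real processing
    // saveAsTextFile writes one part-NNNNN file per partition,
    // so ~9k partitions means ~9k output files
    result.saveAsTextFile("gs://my-bucket/output/run-01")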
This renaming from _temporary to the final location is actually done by the
executors, in parallel, for saveAsTextFile. It should be performed by each
task individually before it returns.
I have seen an issue similar to the one you mention in Hive code, which did
the renaming serially on the driver.
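For the archives, the split between the parallel and the serial part looks
roughly like this (a simplified sketch of the default committer's behaviour,
not the actual Hadoop source):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // commitTask: runs inside each task on an executor, so these renames
    // happen in parallel across the cluster before each task returns
    def commitTask(taskAttemptDir: Path, finalDir: Path, fs: FileSystem): Unit =
      fs.rename(taskAttemptDir, finalDir) // on GCS this is effectively copy + delete

    // cleanup: runs once on the driver after all tasks have finished;
    // deleting whatever is left under _temporary is the serial step
    def cleanupJob(outputDir: Path, fs: FileSystem): Unit =
      fs.delete(new Path(outputDir, "_temporary"), true)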
I'm not sure how to confirm how the moving is happening; however, one of the
jobs I was talking about just completed, with 9k files of 4 MB each.
The Spark UI showed the job as complete after ~2 hours, but the last four hours
of the run were spent just moving the files from _temporary to their final
location. If those four hours went entirely to moving output, that works out to
roughly 0.6 files moved per second.