Re: Best way to merge files from streaming jobs on S3

2016-03-04 Thread Chris Miller
Why does the order matter? coalesce() runs its tasks in parallel, and if
you're just writing to files, I imagine the output lands in whatever order
each task happens to execute. If you want the resulting data sorted, I
imagine you'd need to collect it into some data structure first rather than
writing straight out of the coalesced partitions, sort that structure, and
then write your file.
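The collect-sort-write idea above can be sketched in plain Python (not Spark; the data and names here are hypothetical, just to illustrate the ordering point):

```python
import io

# Hypothetical records produced by three parallel tasks. The order in which
# tasks finish is nondeterministic, so appending per-task gives unsorted output.
partitions = [[5, 1], [9, 2], [7, 3]]

# Collect everything into one structure, sort it, then write a single file.
merged = sorted(x for part in partitions for x in part)

buf = io.StringIO()  # stands in for the real output file
for record in merged:
    buf.write(f"{record}\n")

print(merged)  # → [1, 2, 3, 5, 7, 9]
```

In real Spark you would reach for something like sortBy before the write instead of collecting to the driver, but the principle is the same: impose the order before the file is written.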


--
Chris Miller

On Sat, Mar 5, 2016 at 5:24 AM, jelez  wrote:

> My streaming job is creating files on S3.
> The problem is that those files end up very small if I just write them to
> S3
> directly.
> This is why I use coalesce() to reduce the number of files and make them
> larger.
>
> However, coalesce shuffles data and my job processing time ends up higher
> than sparkBatchIntervalMilliseconds.
>
> I have observed that if I coalesce to a number of partitions equal to the
> number of cores in the cluster, I get less shuffling - but that is
> unsubstantiated.
> Is there any dependency/rule between number of executors, number of cores
> etc. that I can use to minimize shuffling and at the same time achieve
> minimum number of output files per batch?
> What is the best practice?


Best way to merge files from streaming jobs on S3

2016-03-04 Thread jelez
My streaming job is creating files on S3.
The problem is that those files end up very small if I just write them to S3
directly.
This is why I use coalesce() to reduce the number of files and make them
larger.

However, coalesce shuffles data and my job processing time ends up higher
than sparkBatchIntervalMilliseconds.

I have observed that if I coalesce to a number of partitions equal to the
number of cores in the cluster, I get less shuffling - but that is
unsubstantiated.
Is there any dependency/rule between number of executors, number of cores
etc. that I can use to minimize shuffling and at the same time achieve
minimum number of output files per batch?
What is the best practice?
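For context on why coalesce() produces fewer, larger files: with shuffle=False (the default) it merges whole existing partitions into n groups rather than redistributing individual records. A toy model in plain Python (this is not Spark's actual coalescing logic - Spark groups partitions by locality, while this sketch just assigns them round-robin for simplicity):

```python
# Toy model of coalesce(n): whole input partitions are merged into n groups;
# individual records are never redistributed, so there is no full shuffle.
def coalesce(partitions, n):
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)  # assign whole partitions round-robin
    return groups

small = [[1], [2], [3], [4], [5], [6]]  # six tiny per-partition "files"
print(coalesce(small, 2))  # → [[1, 3, 5], [2, 4, 6]]
```

Each output group corresponds to one larger file on S3, which is why matching the target partition count to the available cores keeps all cores busy while still minimizing the file count per batch.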




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-files-from-streaming-jobs-on-S3-tp26400.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org