We are tasked with loading a big file (possibly 2TB) into a data warehouse.
To load it efficiently, we need to split the file into smaller files first.

I don't believe there is a way to do this with Spark: in order for Spark to
distribute the file across the worker nodes, it first has to be split up,
right?

We ended up using a single machine with a single thread to do the
splitting. I just want to make sure I am not missing something obvious.
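For context, the single-threaded approach can be sketched roughly as below. This is an illustrative sketch, not our actual script: it assumes the file is newline-delimited (so records must not be broken across output files), and the paths and chunk size are made up for the example.

```python
import os

def split_file(path, out_dir, chunk_bytes=1024 * 1024 * 1024):
    """Split a newline-delimited file into pieces of roughly chunk_bytes,
    never breaking a record (line) across two output files."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []       # paths of the output chunks, in order
    part = 0         # index of the current chunk
    written = 0      # bytes written to the current chunk so far
    out = None
    with open(path, "rb") as src:
        for line in src:  # iterating binary-mode keeps line endings intact
            if out is None or written >= chunk_bytes:
                if out:
                    out.close()
                name = os.path.join(out_dir, f"part-{part:05d}")
                out = open(name, "wb")
                parts.append(name)
                part += 1
                written = 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return parts
```

Concatenating the parts in order reproduces the original file byte for byte, which is what a warehouse bulk loader typically expects.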

Thanks!

-- 
Paul Henry Tremblay
Attunix
