Re: splitting a huge file

2017-04-24 Thread Steve Loughran

> On 21 Apr 2017, at 19:36, Paul Tremblay  wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. 
> In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for 
> Spark to distribute the file to the worker nodes, it first has to be split 
> up, right? 

If it is in HDFS, it has already been broken up by block size and scattered 
around the filesystem: probably 128/256 MB blocks, each replicated 3x, which 
gives you lots of places to run work local to the data.
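
A quick sketch of what that buys you (untested; the path and numbers are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("read-big-file").getOrCreate()
    val sc = spark.sparkContext

    // One partition per HDFS block: a 2TB file at 256MB blocks arrives as
    // roughly 8000 tasks without you splitting anything yourself.
    val lines = sc.textFile("hdfs:///data/warehouse/bigfile.txt")
    println(lines.getNumPartitions)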

If it's in another FS, different strategies may apply, including no locality at all.

> 
> We ended up using a single machine with a single thread to do the splitting. 
> I just want to make sure I am not missing something obvious.
> 

You don't explicitly need to split the file into smaller files if you can run 
different workers against different parts of the same file; what you do need is 
a way to calculate those parts.

This is what org.apache.hadoop.mapreduce.InputFormat.getSplits() does: you will 
need to define an input format for your data source and provide the split 
calculation.
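
Continuing the sketch above: for plain text the stock TextInputFormat already does 
that calculation, and you only write your own if your records have a custom layout 
(path and split size below are made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // getSplits() cuts at block boundaries; each reader then re-aligns on the
    // next newline, so no record straddles two tasks.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapreduce.input.fileinputformat.split.maxsize",
             (64L * 1024 * 1024).toString)   // optional: force smaller splits

    val records = sc.newAPIHadoopFile(
        "hdfs:///data/warehouse/bigfile.txt",
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      .map(_._2.toString)   // copy out of Hadoop's reused Text objects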

> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix





Re: splitting a huge file

2017-04-21 Thread Roger Marin
If the file is already in HDFS, you can use Spark to read it with a specific
input format (depending on the file type) to split it.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html
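
Something like this, if the end goal really is a set of smaller files (untested 
sketch; paths and partition count are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("split-big-file").getOrCreate()

    // Read the single 2TB file in parallel (one partition per HDFS block),
    // then write it back as many smaller part files for the warehouse loader.
    spark.read.text("hdfs:///data/bigfile.txt")
      .repartition(200)                      // pick to hit your target file size
      .write.text("hdfs:///data/bigfile_parts/")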

On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay wrote:

> We are tasked with loading a big file (possibly 2TB) into a data
> warehouse. In order to do this efficiently, we need to split the file into
> smaller files.
>
> I don't believe there is a way to do this with Spark, because in order for
> Spark to distribute the file to the worker nodes, it first has to be split
> up, right?
>
> We ended up using a single machine with a single thread to do the
> splitting. I just want to make sure I am not missing something obvious.
>
> Thanks!
>
> --
> Paul Henry Tremblay
> Attunix
>


Re: splitting a huge file

2017-04-21 Thread Jörn Franke
What is your DWH technology?
If the file is on HDFS then, depending on the format, Spark can read parts of 
it in parallel.
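
For example (untested; hypothetical paths), the difference the format makes shows 
up directly in the partition count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("check-splits").getOrCreate()

    // Plain text (or bzip2, or most columnar formats) is splittable, so the
    // file fans out across the cluster; a single .gz file is not splittable
    // and ends up in one task.
    val plain = spark.read.text("hdfs:///data/bigfile.txt")
    val gzipped = spark.read.text("hdfs:///data/bigfile.txt.gz")
    println(plain.rdd.getNumPartitions)     // roughly fileSize / blockSize
    println(gzipped.rdd.getNumPartitions)   // 1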

> On 21. Apr 2017, at 20:36, Paul Tremblay  wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. 
> In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for 
> Spark to distribute the file to the worker nodes, it first has to be split 
> up, right? 
> 
> We ended up using a single machine with a single thread to do the splitting. 
> I just want to make sure I am not missing something obvious.
> 
> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix




splitting a huge file

2017-04-21 Thread Paul Tremblay
We are tasked with loading a big file (possibly 2TB) into a data warehouse.
In order to do this efficiently, we need to split the file into smaller
files.

I don't believe there is a way to do this with Spark, because in order for
Spark to distribute the file to the worker nodes, it first has to be split
up, right?

We ended up using a single machine with a single thread to do the
splitting. I just want to make sure I am not missing something obvious.

Thanks!

-- 
Paul Henry Tremblay
Attunix