Hi, 

I think you’re asking the right question; however, you’re assuming he’s on the cloud, and he never mentioned the size of the files. 

It could be that he’s got a lot of small-ish data sets. 1GB is relatively small. 

Again YMMV. 

Personally, if you’re going to use Spark for data engineering, I’d pick Scala first, Java second, then Python; unless you’re a Python developer, in which case go with Python. 

I agree that wanting to have a single file needs to be explained. 


> On Aug 31, 2020, at 10:52 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Why only one file?
> I would go more for files of specific size, eg data is split in 1gb files. 
> The reason is also that if you need to transfer it (eg to other clouds etc) - 
> having a large file of several terabytes is bad.
> 
> It depends on your use case but you might look also at partitions etc.
> 
>> Am 31.08.2020 um 16:17 schrieb Tzahi File <tzahi.f...@ironsrc.com>:
>> 
>> 
>> Hi, 
>> 
>> I would like to develop a process that merges parquet files. 
>> My first intention was to develop it with PySpark using coalesce(1) -  to 
>> create only 1 file. 
>> This process is going to run on a huge amount of files.
>> I wanted your advice on what is the best way to implement it (PySpark isn't 
>> a must).  
>> 
>> 
>> Thanks,
>> Tzahi