Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, I think you’re asking the right question; however, you’re making an assumption that he’s in the cloud, and he never mentioned the size of the file. It could be that he’s got a lot of small-ish data sets. 1 GB is kinda small in relative terms. Again, YMMV. Personally, if you’re going

Re: Merging Parquet Files

2020-08-31 Thread Tzahi File
You are right. In general this job should deal with very small files and create an output file of less than 100 MB. In other cases I would need to create multiple files of around 100 MB. The issue with repartitioning is that decreasing the number of partitions will reduce the ETL's performance, while this job

Re: Merging Parquet Files

2020-08-31 Thread Jörn Franke
Why only one file? I would aim instead for files of a specific size, e.g. data split into 1 GB files. Another reason is that if you need to transfer the data (e.g. to other clouds), having a single file of several terabytes is bad. It depends on your use case, but you might also look at partitioning, etc.
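
A rough sketch of that target-file-size approach in Spark's Scala DataFrame API (the paths, session setup, and partition count are illustrative assumptions, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-parquet").getOrCreate()

    // Read every small parquet file under the input prefix (hypothetical path).
    val df = spark.read.parquet("s3://bucket/input/")

    // Choose the partition count so each output file lands near the target size,
    // e.g. roughly total input bytes / 1 GB; hard-coded here for illustration.
    val numOutputFiles = 8

    // repartition(n) shuffles the data into n roughly equal parts, so the write
    // produces n parquet files instead of one huge file.
    df.repartition(numOutputFiles)
      .write
      .mode("overwrite")
      .parquet("s3://bucket/merged/")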

Merging Parquet Files

2020-08-31 Thread Tzahi File
Hi, I would like to develop a process that merges parquet files. My first intention was to develop it with PySpark using coalesce(1) - to create only 1 file. This process is going to run on a huge number of files. I wanted your advice on the best way to implement it (PySpark isn't a
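
For reference, a minimal sketch of the coalesce(1) idea described above, written against the Scala DataFrame API (the PySpark call has the same shape); the paths are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-parquet-single-file").getOrCreate()

    // coalesce(1) funnels all rows through a single task, so the output is a
    // single parquet file -- fine for small inputs, but it becomes the
    // bottleneck the replies above warn about once the data grows.
    spark.read.parquet("/data/small-files/")
      .coalesce(1)
      .write
      .mode("overwrite")
      .parquet("/data/merged/")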

Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin. I’ll try using the Parquet tools for 1.9 based on the JIRA. If that doesn’t work, I’ll try Kite. Cheers, Ben. On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon wrote: Hi Benjamin, As you might already know, I believe the Hadoop command

Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin, As you might already know, I believe the Hadoop command does not merge column-based formats such as ORC or Parquet but simply concatenates them. I haven't tried this myself, but I remember seeing a JIRA in Parquet -

Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark cluster will not work; it just does not have the resources to do it. I'm trying to do it from the command line without using Spark. I will

Re: Merging Parquet Files

2014-11-25 Thread Michael Armbrust
You'll need to be running a very recent version of Spark SQL, as this feature was just added. On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv danielru...@gmail.com wrote: Hi, Thanks for your reply. I'm trying to do what you suggested but I get: scala> sqlContext.sql("CREATE TEMPORARY TABLE data
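
For context, a sketch of the statement being discussed, assuming the Spark 1.2-era external data source syntax and an illustrative path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext, as in the thread

    // Register a directory of parquet files as a temporary table through the
    // (then brand-new) external data source API.
    sqlContext.sql("""
      CREATE TEMPORARY TABLE data
      USING org.apache.spark.sql.parquet
      OPTIONS (path "/requests_parquet")
    """)

    // The registered table can then be queried with plain SQL.
    sqlContext.sql("SELECT COUNT(*) FROM data").collect()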

Re: Merging Parquet Files

2014-11-24 Thread Michael Armbrust
Parquet does a lot of serial metadata operations on the driver, which makes it really slow when you have a very large number of files (especially if you are reading from something like S3). This is something we are aware of and that I'd really like to improve in 1.3. You might try the (brand new

Merging Parquet Files

2014-11-22 Thread Daniel Haviv
Hi, I'm ingesting a lot of small JSON files and converting them to unified parquet files, but even the unified files are fairly small (~10 MB). I want to run a merge operation every hour on the existing files, but it takes a lot of time for such a small amount of data: about 3 GB spread over 3000

Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Hello, I'm writing a process that ingests JSON files and saves them as parquet files. The process is as such: val sqlContext = new org.apache.spark.sql.SQLContext(sc); val jsonRequests = sqlContext.jsonFile("/requests"); val parquetRequests = sqlContext.parquetFile("/requests_parquet")

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello, I'm writing a process that ingests JSON files and saves them as parquet files. The process is as such:
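
A minimal sketch of that insertInto suggestion against the Spark 1.1-era SchemaRDD API, reusing the paths from Daniel's snippet (the table name and append flag are illustrative assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext._

    val jsonRequests = sqlContext.jsonFile("/requests")
    val parquetRequests = sqlContext.parquetFile("/requests_parquet")

    // Expose the existing parquet data as a table, then append the freshly
    // ingested JSON rows into it (false = append rather than overwrite).
    parquetRequests.registerTempTable("requests_parquet")
    jsonRequests.insertInto("requests_parquet", false)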

Re: Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Very cool, thank you! On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote: You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello,

Re: Merging Parquet Files

2014-11-19 Thread Michael Armbrust
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com wrote: Another problem I have is that I get a lot of small JSON files, and as a result a lot of small parquet files; I'd like to merge the JSON files into a few parquet files. How do I do that? You can use `coalesce` on any
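
For reference, a minimal sketch of using coalesce to cut down the number of output parquet files with the Spark 1.1-era API (the partition count and paths are illustrative assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    // Load the many small JSON files, collapse them into a handful of
    // partitions, and write those partitions out as parquet files.
    sqlContext.jsonFile("/requests")
      .coalesce(4)
      .saveAsParquetFile("/requests_parquet_merged")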