Try coalescing to a small number of partitions, say 2. You could also
dynamically calculate how many partitions to use to obtain an optimal file
size, for example as sketched below.
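A rough sketch in Scala (the helper name, the 128 MB target, and the
assumption that you have already measured the input size in bytes are all
illustrative, not from this thread):

import org.apache.spark.sql.DataFrame

// Coalesce so each output file lands near a target size.
// inputBytes must be measured by the caller (e.g. via the FileSystem API).
def writeCoalesced(df: DataFrame, outputPath: String, inputBytes: Long): Unit = {
  val targetFileBytes = 128L * 1024 * 1024        // ~128 MB per file (assumed target)
  val numPartitions = math.max(1, (inputBytes / targetFileBytes).toInt)
  df.coalesce(numPartitions)                      // fewer partitions => fewer, larger files
    .write
    .parquet(outputPath)
}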
On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote:
How many partitions should there be when streaming? In a streaming process the
data keeps growing in size; is there any configuration to limit the file size
and write to a new file once it exceeds x (let's say 128 MB per file)?
Another question is about query performance against these parquet files.
Try limiting the number of partitions via spark.sql.shuffle.partitions.
This controls the number of files generated.
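For example (spark here is the SparkSession and df an arbitrary DataFrame;
the value 8 is only illustrative):

// Lower the shuffle partition count (default 200) so that any stage
// involving a shuffle produces fewer, larger output files.
spark.conf.set("spark.sql.shuffle.partitions", "8")

// An aggregation now shuffles into 8 partitions, so writing its result
// yields at most 8 parquet files.
df.groupBy("key").count().write.parquet("/tmp/aggregated")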
On 28 Nov 2016 8:29 p.m., "Kevin Tran" wrote:
Hi Denny,
Thank you for your input. I also use 128 MB, but my Spark app still generates
too many files, each only ~14 KB! That's why I'm asking whether anyone who has
hit the same issue has found a solution.
Cheers,
Kevin.
On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee wrote:
Generally, yes: you should aim for larger file sizes because of the overhead
of opening files. Typical guidance is between 64 MB and 1 GB; personally I
usually stick with 128 MB to 512 MB, with parquet's default snappy codec
compression. A good reference is Vida Ha's Data Storage presentation.
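A sketch along those lines (df, the partition count, and the output path are
illustrative assumptions; snappy is already the default parquet codec in
Spark, it is just made explicit here):

df.repartition(16)                      // pick a count that yields ~128-512 MB files
  .write
  .option("compression", "snappy")      // parquet's default codec, stated explicitly
  .parquet("/data/events")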
Hi Everyone,
Does anyone know the best practice for writing parquet files from Spark?
When my Spark app writes data to parquet, the output directory contains heaps
of very small parquet files (such as
e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only
15 KB.