Yes, I think so.

Stefan Panayotov, PhD
spanayo...@outlook.com
spanayo...@comcast.net
spanayo...@gmail.com

-----Original Message-----
From: ilaimalka <ilai.ma...@nielsen.com> 
Sent: Monday, June 8, 2020 9:17 AM
To: user@spark.apache.org
Subject: we control spark file names before we write them - should we 
opensource it?

Hi, as part of our work we needed more control over the names of the files 
written out by Spark. For example, instead of "part-...csv.gz" we want something 
like "15988891_1748330679_20200507124153.tsv.gz", where the first number is 
hardcoded, the second is the value from partitionBy, and the third is a 
timestamp in a provided SimpleDateFormat pattern.
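
A minimal sketch of how a name in that shape could be assembled. It assumes the
pattern "yyyyMMddHHmmss" and a UTC timestamp, neither of which is stated above,
and the class and method names (OutputFileNames, build) are hypothetical; this
is not the DataSource itself.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class OutputFileNames {
    private OutputFileNames() {}

    /**
     * Builds "<prefix>_<partitionValue>_<timestamp><extension>".
     * The timestamp is rendered with the given SimpleDateFormat pattern.
     */
    static String build(String prefix, String partitionValue, Date when,
                        String pattern, String extension) {
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        // Assumption: timestamps are rendered in UTC.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return prefix + "_" + partitionValue + "_" + fmt.format(when) + extension;
    }

    public static void main(String[] args) {
        // Hardcoded prefix, partition value, and the current time.
        System.out.println(
            build("15988891", "1748330679", new Date(), "yyyyMMddHHmmss", ".tsv.gz"));
    }
}
```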

After researching the possibilities at length, we found that the most common 
way is to find those files and rename them *after* the Spark job has finished. 
We tried to find a more efficient way.
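
For comparison, the rename-after-the-job approach can be sketched roughly as
below. This assumes the output sits on a local filesystem and uses java.nio;
on HDFS or an object store the Hadoop FileSystem API would be needed instead.
The class name, the method name, and the "<prefix>_<index><extension>" target
naming are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RenameAfterJob {
    private RenameAfterJob() {}

    /**
     * Renames every "part-*" file directly under outputDir to
     * "<newPrefix>_<index><extension>" and returns the new paths.
     */
    static List<Path> renamePartFiles(Path outputDir, String newPrefix,
                                      String extension) throws IOException {
        // Collect first, so the directory is not modified while it is listed.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // deterministic order for the index

        List<Path> renamed = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            Path target = outputDir.resolve(newPrefix + "_" + i + extension);
            renamed.add(Files.move(parts.get(i), target));
        }
        return renamed;
    }
}
```

The extra listing and renaming pass over the output is the cost we wanted to
avoid.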

We decided to implement a new DataSource, which is actually a wrapper around 
most standard Spark file formats (csv, json, text, parquet, avro) and allows us 
to set the file name before the file is written.

In short, this is how it works:
The DataSource extends FileFormat and implements prepareWrite, which redirects 
to our local FileNameOutputWriterFactory TypeFactory. That in turn redirects to 
the original Spark format's FileNameOutputWriterFactory, which actually does the 
work and, via reflection, can call any implementation to control the file name.
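
The delegation idea can be illustrated with the simplified sketch below. The
interfaces Writer and WriterFactory are hypothetical stand-ins and are NOT
Spark's actual FileFormat or writer-factory API; RenamingWriterFactory and
PrefixRenamer are likewise made-up names. The sketch only shows the pattern: a
wrapping factory rewrites the file name, then hands the path to the original
factory, with the naming logic loaded by reflection.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for a format's writer and writer factory.
interface Writer {
    String path();
}

interface WriterFactory {
    Writer newInstance(String path);
}

// Example naming implementation: swaps a leading "part-" for a fixed prefix.
final class PrefixRenamer implements UnaryOperator<String> {
    @Override
    public String apply(String fileName) {
        return fileName.startsWith("part-")
            ? "15988891_" + fileName.substring("part-".length())
            : fileName;
    }
}

// Wraps the original factory and rewrites the file name before delegating.
final class RenamingWriterFactory implements WriterFactory {
    private final WriterFactory delegate;
    private final UnaryOperator<String> renamer;

    RenamingWriterFactory(WriterFactory delegate, UnaryOperator<String> renamer) {
        this.delegate = delegate;
        this.renamer = renamer;
    }

    // Loads the naming implementation by class name, via reflection.
    @SuppressWarnings("unchecked")
    static RenamingWriterFactory fromClassName(WriterFactory delegate, String className) {
        try {
            Object renamer = Class.forName(className).getDeclaredConstructor().newInstance();
            return new RenamingWriterFactory(delegate, (UnaryOperator<String>) renamer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load renamer " + className, e);
        }
    }

    @Override
    public Writer newInstance(String path) {
        int slash = path.lastIndexOf('/');
        String dir = path.substring(0, slash + 1);   // "" when there is no '/'
        String name = path.substring(slash + 1);
        // The original factory still does the actual writing.
        return delegate.newInstance(dir + renamer.apply(name));
    }
}
```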

The question is - is this interesting/useful enough for the community?
Should we open-source it?
Thanks!

P.S. We posted the same question on the Spark channel of the ASF Slack, if you 
want to discuss it there:
https://the-asf.slack.com/archives/CD5UQDNBA/p1589117451069600



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

