Hi, as part of our work we needed more control over the names of the files
written out by Spark, e.g. instead of "part-...csv.gz" we want to get
something like "15988891_1748330679_20200507124153.tsv.gz", where the
first number is hardcoded, the second is the value from partitionBy and
the third is a timestamp formatted with a provided SimpleDateFormat pattern.
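
For illustration, the naming scheme described above could be assembled roughly like this. This is only a sketch: the function name `build_file_name` and its parameters are hypothetical, and the Python `strftime` pattern `%Y%m%d%H%M%S` stands in for a SimpleDateFormat pattern such as `yyyyMMddHHmmss`, which is an assumption read off the example timestamp.

```python
from datetime import datetime


def build_file_name(prefix, partition_value, ts,
                    ts_format="%Y%m%d%H%M%S", extension="tsv.gz"):
    """Join a fixed prefix, the partition value and a formatted timestamp."""
    return f"{prefix}_{partition_value}_{ts.strftime(ts_format)}.{extension}"


# Values taken from the example file name in this message.
print(build_file_name("15988891", "1748330679", datetime(2020, 5, 7, 12, 41, 53)))
# prints 15988891_1748330679_20200507124153.tsv.gz
```
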
After researching the options at length, we found that the most common
approach is to find those files and rename them *after* the Spark job has
finished. We tried to find a more efficient way.
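
For comparison, the rename-after-the-job approach might look roughly like the following sketch. It assumes the output sits on a local filesystem in the usual `column=value/part-*` partition layout with one part file per partition; the function name `rename_part_files` and its parameters are hypothetical, and on HDFS or an object store the equivalent filesystem API would be needed instead of `pathlib`.

```python
from datetime import datetime
from pathlib import Path


def rename_part_files(output_dir, prefix, ts, extension="tsv.gz"):
    """Rename part-* files inside key=value partition directories.

    Returns the list of new paths. Raises FileExistsError if two part files
    in the same partition would map to the same target name.
    """
    stamp = ts.strftime("%Y%m%d%H%M%S")
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for part in sorted(Path(output_dir).glob("*=*/part-*")):
        # Directory name looks like "id=1748330679"; keep the value only.
        partition_value = part.parent.name.split("=", 1)[1]
        target = part.with_name(f"{prefix}_{partition_value}_{stamp}.{extension}")
        if target.exists():
            raise FileExistsError(f"target already exists: {target}")
        part.rename(target)
        renamed.append(target)
    return renamed
```

The extra pass over the output directory, and the need to handle name collisions after the fact, is the cost that this approach carries.
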
We decided to implement a new DataSource that is essentially a wrapper around
the most common Spark file formats (csv, json, text, parquet, avro) and that
lets us set the file name before the file is written.
In short, this is how it works:
- Datasource extends FileFormat and implements prepareWrite, which redirects
  to our local FileNameOutputWriterFactory
- TypeFactory redirects to the original Spark formats
- FileNameOutputWriterFactory does the actual work and, by reflection, can
  call any implementation to control the file name
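
The reflection step in the last component can be illustrated with a small, language-neutral sketch in Python. It is not the Spark API and not our actual code: `load_callable` and `FileNameWriterFactory` are hypothetical stand-ins that only show the idea of resolving a naming implementation from a configured, fully qualified name and asking it for the file name.

```python
import importlib


def load_callable(qualified_name):
    """Resolve "package.module.attribute" to the object it names."""
    module_name, _, attr_name = qualified_name.rpartition(".")
    return getattr(importlib.import_module(module_name), attr_name)


class FileNameWriterFactory:
    """Hypothetical stand-in for the wrapper's writer factory.

    `namer` is any callable taking (partition_value, extension) and
    returning the file name to use for that partition.
    """

    def __init__(self, namer, extension):
        self.namer = namer
        self.extension = extension

    def target_name(self, partition_value):
        # Delegate the naming decision to the pluggable implementation.
        return self.namer(partition_value, self.extension)
```

With this shape, the naming logic is chosen by configuration, e.g. a hypothetical setting such as `my_namers.fixed_prefix`, and the writer itself does not need to change when the naming rule does.
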
The question is - is this interesting/useful enough for the community?
Should we open-source it?
Thanks!
P.S. We posted the same question in the Spark channel on the ASF Slack, if
you want to discuss it there:
https://the-asf.slack.com/archives/CD5UQDNBA/p1589117451069600