How costly is it for you to move the files after generating them with Spark?
File systems tend to just update some links under the hood rather than copy the data.
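
A minimal sketch of that move step, assuming the output directory sits on a local (or locally mounted) file system and holds exactly one part file, as after coalesce(1). The helper name promote_single_part_file and both paths are hypothetical; on HDFS or an object store the same idea would have to go through that store's own API, where a rename is not always this cheap.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the only part-* file in output_dir to target_path, then delete output_dir.

    target_path must live outside output_dir, because output_dir is removed.
    """
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    # os.replace is a rename: within one file system no data is copied.
    os.replace(parts[0], target_path)
    # Drop whatever is left behind (_SUCCESS, checksum files, ...).
    shutil.rmtree(output_dir)
    return target_path
```

Something like promote_single_part_file("/tmp/out", "/tmp/result.csv") would then leave a single file with the chosen name.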
*Yohann Jardin*
On 2/22/2020 at 11:47 AM, Kshitij wrote:
That's the alternative, of course. But that is costly when we are
dealing with a bunch of files.
Thanks.
On Sat, Feb 22, 2020, 4:15 PM Sebastian Piu <sebastian....@gmail.com> wrote:
I'm not aware of a way to specify the file name on the writer.
Since you'd need to bring all the data into a single node and
write from there to get a single file out, you could simply
move/rename the file that Spark creates, or write the CSV yourself
with your library of preference?
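
A rough sketch of the "write it yourself" route, using Python's standard csv module as the library of preference. The function name write_rows_to_csv is hypothetical. With PySpark the rows could come from df.toLocalIterator() and the header from df.columns, which streams everything through the driver, so this only suits data that a single node can handle.

```python
import csv


def write_rows_to_csv(rows, header, target_path):
    """Write an iterable of row tuples to one CSV file and return the row count."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(target_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count

# Possible use with a Spark DataFrame (assumes pyspark and an existing df):
#   write_rows_to_csv(df.toLocalIterator(), df.columns, "result.csv")
```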
On Sat, 22 Feb 2020 at 10:39, Kshitij <kshtjkm...@gmail.com> wrote:
Is there any way to save it as a raw CSV file, as we do in
pandas? I have a script that uses the CSV file for further
processing.
On Sat, 22 Feb 2020 at 14:31, rahul c <rchannal1...@gmail.com> wrote:
Hi Kshitij,
There are options to suppress the creation of the metadata
files.
Set the properties below and try.
1) To disable Spark's transaction logs, set
"spark.sql.sources.commitProtocolClass =
org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol".
This disables the "committed<TID>" and "started<TID>" files,
but the _SUCCESS, _common_metadata and _metadata files
will still be generated.
2) We can disable the _common_metadata and _metadata files
using "parquet.enable.summary-metadata=false".
3) We can also disable the _SUCCESS file using
"mapreduce.fileoutputcommitter.marksuccessfuljobs=false".
On Sat, 22 Feb 2020, 10:51 AM Kshitij <kshtjkm...@gmail.com> wrote:
Hi,
There is no Spark DataFrame API that writes a single file
instead of a directory as the result of a write
operation.
Both options below will create a directory containing a
randomly named file.
df.coalesce(1).write.csv(<path>)
df.write.csv(<path>)
Instead of creating a directory with the standard files
(_SUCCESS, _committed, _started), I want a single
file with a specified file name.
Thanks