How costly is it for you to move files after generating them with Spark?
File systems tend to just update some links under the hood.
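
On HDFS, for example, a rename is just a metadata update, so moving
the single part file Spark produced is cheap regardless of its size.
A minimal sketch in PySpark, assuming an existing SparkSession named
"spark" and hypothetical paths (it reaches through the internal _jvm
gateway, so treat it as illustrative):

    # Rename the part file Spark wrote into /tmp/output_dir;
    # FileSystem.rename is a metadata-only operation on HDFS.
    from py4j.java_gateway import java_import

    java_import(spark._jvm, "org.apache.hadoop.fs.Path")
    Path = spark._jvm.Path

    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
        spark.sparkContext._jsc.hadoopConfiguration())

    part_file = fs.globStatus(Path("/tmp/output_dir/part-*.csv"))[0].getPath()
    fs.rename(part_file, Path("/tmp/output.csv"))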

*Yohann Jardin*

On 2/22/2020 at 11:47 AM, Kshitij wrote:
That's the alternative, of course. But that is costly when we are dealing with a bunch of files.

Thanks.

On Sat, Feb 22, 2020, 4:15 PM Sebastian Piu <sebastian....@gmail.com> wrote:

    I'm not aware of a way to specify the file name on the writer.
    Since you'd need to bring all the data into a single node and
    write from there to get a single file out, you could simply
    move/rename the file that Spark creates, or write the CSV
    yourself with your library of preference.
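
    For instance, a minimal sketch of the "write it yourself" route,
    assuming the result fits on the driver ("df" and the output path
    are hypothetical):

        import csv

        # collect() pulls every row to the driver, so this is only
        # safe for small results.
        rows = df.collect()
        with open("/tmp/output.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(df.columns)               # header row
            writer.writerows(tuple(r) for r in rows)  # data rows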

    On Sat, 22 Feb 2020 at 10:39, Kshitij <kshtjkm...@gmail.com> wrote:

        Is there any way to save it as a raw CSV file as we do in
        pandas? I have a script that uses the CSV file for further
        processing.
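
        One pandas-style option, sketched under the assumption that
        the data fits in driver memory (the path is hypothetical):

            # toPandas() materializes the whole DataFrame on the
            # driver, then pandas writes a single file.
            df.toPandas().to_csv("/tmp/output.csv", index=False)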

        On Sat, 22 Feb 2020 at 14:31, rahul c <rchannal1...@gmail.com> wrote:

            Hi Kshitij,

            There are options to suppress the metadata files from
            getting created. Set the properties below and try.

            1) To disable the transaction logs of Spark, set
            "spark.sql.sources.commitProtocolClass =
            org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol".
            This disables the "committed<TID>" and "started<TID>"
            files, but the _SUCCESS, _common_metadata and _metadata
            files will still be generated.

            2) We can disable the _common_metadata and _metadata files
            using "parquet.enable.summary-metadata=false".

            3) We can also disable the _SUCCESS file using
            "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".

            On Sat, 22 Feb, 2020, 10:51 AM Kshitij <kshtjkm...@gmail.com> wrote:

                Hi,

                There is no Spark DataFrame API that writes/creates a
                single file instead of a directory as the result of a
                write operation.

                Both options below will create a directory with a
                randomly named file.

                    df.coalesce(1).write.csv(<path>)

                    df.write.csv(<path>)


                Instead of creating a directory with standard files
                (_SUCCESS, _committed, _started), I want a single file
                with the file name specified.


                Thanks
