> Is there any way to save it as a raw CSV file, as we do in pandas? I have a
> script that uses the CSV file for further processing.

I did write such a function in Scala. Please take a look at
https://github.com/EDS-APHP/spark-etl/blob/master/spark-csv/src/main/scala/CSVTool.scala
and see writeCsvToLocal.

It first writes the CSV to HDFS, and then merges every CSV part file into
one local CSV with a header.
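
For readers who don't want to open the link, the general idea could look
roughly like the sketch below. This is not the actual CSVTool code; the
function and parameter names are made up for illustration:

import java.io.{BufferedOutputStream, FileOutputStream}

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: write the CSV to a temporary HDFS directory, then concatenate
// the part files into a single local CSV, keeping only the first header.
def writeCsvToLocalSketch(spark: SparkSession, df: DataFrame,
                          hdfsTmpDir: String, localFile: String): Unit = {
  // 1) Let Spark write the CSV (with header) to an HDFS directory.
  df.write.option("header", "true").mode("overwrite").csv(hdfsTmpDir)

  val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val out = new BufferedOutputStream(new FileOutputStream(localFile))
  try {
    // 2) Append every part file to the local file, dropping the duplicated
    //    header from every part except the first one.
    val parts = fs.globStatus(new Path(hdfsTmpDir + "/part-*"))
      .map(_.getPath).sortBy(_.getName)
    parts.zipWithIndex.foreach { case (part, i) =>
      val in = fs.open(part)
      try {
        val lines = scala.io.Source.fromInputStream(in).getLines()
        val keep  = if (i == 0) lines else lines.drop(1)
        keep.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
      } finally in.close()
    }
  } finally out.close()
}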


Kshitij <kshtjkm...@gmail.com> writes:

> Is there any way to save it as a raw CSV file, as we do in pandas? I have a
> script that uses the CSV file for further processing.
>
> On Sat, 22 Feb 2020 at 14:31, rahul c <rchannal1...@gmail.com> wrote:
>
>> Hi Kshitij,
>>
>> There are options to suppress the metadata files from getting created.
>> Set the properties below and try.
>>
>> 1) To disable the transaction logs of Spark, set
>> "spark.sql.sources.commitProtocolClass =
>> org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol".
>> This will disable the "committed<TID>" and "started<TID>" files, but the
>> _SUCCESS, _common_metadata and _metadata files will still be generated.
>>
>> 2) We can disable the _common_metadata and _metadata files using
>> "parquet.enable.summary-metadata=false".
>>
>> 3) We can also disable the _SUCCESS file using
>> "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".
>>
>> On Sat, 22 Feb, 2020, 10:51 AM Kshitij, <kshtjkm...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> There is no Spark DataFrame API that writes/creates a single file
>>> instead of a directory as the result of a write operation.
>>>
>>> Both of the options below will create a directory with a random file name.
>>>
>>> df.coalesce(1).write.csv(<path>)
>>>
>>>
>>>
>>> df.write.csv(<path>)
>>>
>>>
>>> Instead of creating a directory with the standard files (_SUCCESS, _committed,
>>> _started), I want a single file with the file name specified.
>>>
>>>
>>> Thanks
>>>
>>
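
For reference, setting the three properties rahul lists above when building
the SparkSession could look like the minimal sketch below (whether each one
takes effect depends on the Spark and Hadoop versions in use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-without-metadata-files")
  // 1) disable the "committed<TID>" / "started<TID>" transaction-log files
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
  // 2) disable the _common_metadata and _metadata summary files
  .config("parquet.enable.summary-metadata", "false")
  // 3) disable the _SUCCESS marker file
  .config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
  .getOrCreate()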


--
nicolas paris

