Hi Gezim,

My execution plan for the dataframe written to HDFS is a union of 140
child dataframes, none of which are materialized at the time of the
write. It is not saving the file that takes time; it is materializing
the dataframes. My solution is to materialize each child dataframe by
saving it to HDFS first. Unioning the pre-materialized children and
saving that union to HDFS is then very fast.
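
A minimal sketch of the idea (names like child_dfs, base_path, and
csv_path are placeholders, not from my actual job):

    from functools import reduce

    base_path = '/tmp/children'  # hypothetical staging location on HDFS

    # Step 1: materialize each child dataframe by writing it out, so the
    # final plan is not a union of 140 unevaluated queries.
    for i, child in enumerate(child_dfs):
        child.write.mode('overwrite').parquet('%s/%d' % (base_path, i))

    # Step 2: read the materialized children back and union them.
    parts = [spark.read.parquet('%s/%d' % (base_path, i))
             for i in range(len(child_dfs))]
    union_df = reduce(lambda a, b: a.union(b), parts)

    # Step 3: the final write now only scans pre-computed files.
    union_df.write.mode('overwrite').option('header', 'true').csv(csv_path)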

Hope this helps!

On Tue, Mar 26, 2019 at 1:50 PM Gezim Sejdiu <g.sej...@gmail.com> wrote:

> Hi Lian,
>
> I was following the thread since one of my students had the same issue.
> The problem occurred when trying to save a large XML dataset to HDFS:
> due to a connectivity timeout between Spark and HDFS, the output could
> not be written.
> I also suggested that he do what @Apostolos recommended in the previous
> mail and use saveAsTextFile instead (I haven't received any result or
> reply since my suggestion).
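>
> A rough sketch of that alternative (my guess at the intent; the
> row-to-line conversion below is simplified, does not handle quoting or
> embedded delimiters, and output_path is a placeholder):
>
>     df.rdd.map(lambda row: ','.join(str(c) for c in row)) \
>         .saveAsTextFile(output_path)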
>
> Seeing the last commit date "*Jan 10, 2017*" on the
> databricks/spark-csv [1] project, I'm not sure how compatible it still
> is with Spark 2.x, even though there is a *note* about it in the README
> file.
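>
> For reference, on Spark 2.x the CSV source is built in, so the external
> package can be skipped entirely (a sketch, assuming a dataframe df and
> a placeholder output_path):
>
>     df.write.mode('overwrite').option('header', 'true').csv(output_path)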
>
> Would it be possible for you to share your solution with us (in case
> the project is already open-sourced) so that we can have a look at it?
>
> Many thanks in advance.
>
> Best regards,
> [1]. https://github.com/databricks/spark-csv
>
> On Tue, Mar 26, 2019 at 1:09 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Thanks guys for the reply.
>>
>> The execution plan shows a giant query. After applying divide and
>> conquer, the save is quick.
>>
>> On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli...@gmail.com>
>> wrote:
>>
>>> Hi Lian,
>>> Since you are using repartition(1), do you want to decrease the number
>>> of partitions? If so, have you tried using coalesce instead?
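>>>
>>> For example (a minimal sketch based on the snippet quoted below;
>>> coalesce(1) merges partitions without the full shuffle that
>>> repartition(1) triggers):
>>>
>>>     df.coalesce(1).write.format('com.databricks.spark.csv') \
>>>         .mode('overwrite').options(header='true').save(csv)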
>>>
>>> Kathleen
>>>
>>> On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Writing a csv to HDFS takes about 1 hour:
>>>>
>>>>
>>>> df.repartition(1).write.format('com.databricks.spark.csv') \
>>>>     .mode('overwrite').options(header='true').save(csv)
>>>>
>>>> The generated csv file is only about 150 KB. The job uses 3 containers
>>>> (13 cores, 23 GB of memory).
>>>>
>>>> Other people have reported similar issues, but I haven't seen a good
>>>> explanation or solution.
>>>>
>>>> Any clue is highly appreciated! Thanks.
>>>>
>>>>
>>>>
>
> --
>
> _____________
>
> *Gëzim Sejdiu*
>
>
>
> *PhD Student & Research Associate*
>
> *SDA, University of Bonn*
>
> *Endenicher Allee 19a, 53115 Bonn, Germany*
>
> *https://gezimsejdiu.github.io/*
>
> GitHub <https://github.com/GezimSejdiu> | Twitter
> <https://twitter.com/Gezim_Sejdiu> | LinkedIn
> <https://www.linkedin.com/in/g%C3%ABzim-sejdiu-08b1761b> | Google Scholar
> <https://scholar.google.de/citations?user=Lpbwr9oAAAAJ>
>
>
