I second that. We have been bitten too many times by coalesce impacting
upstream stages in unintended ways, so I avoid coalesce on write altogether.

I prefer to use repartition (and take the shuffle hit) before writing
(especially if you are writing out partitioned data), or, if possible, to use
adaptive query execution to avoid producing too many files in the first place.
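
For example, something like this (just a sketch: the column name, output path
and the spark session are placeholders, and the AQE settings assume Spark 3.0+):

  import org.apache.spark.sql.functions.col

  // repartition on the output partition column right before a partitioned
  // write, so each partition folder ends up with a small number of files
  df.repartition(col("event_date"))
    .write
    .partitionBy("event_date")
    .orc("/path/to/output")

  // or, on Spark 3.0+, let adaptive query execution coalesce shuffle
  // partitions instead of tuning this by hand
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")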

On Wed, Jun 24, 2020 at 9:09 AM Bobby Evans <reva...@gmail.com> wrote:

> First, you need to be careful with coalesce. It will impact upstream
> processing, so if you are doing a lot of computation in the last stage
> before the repartition, coalesce will make the problem worse, because all
> of that computation will happen in a single thread instead of being
> spread out.
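>
> For example, roughly (just a sketch; df, someExpensiveTransform and the
> output paths are placeholders for whatever your job actually does):
>
>   // coalesce(1): no shuffle, so the expensive upstream work collapses into
>   // the same single task that writes the file
>   someExpensiveTransform(df).coalesce(1).write.orc("/tmp/out_coalesce")
>
>   // repartition(1): adds a shuffle boundary, so the expensive work keeps
>   // its parallelism and only the final write runs as a single task
>   someExpensiveTransform(df).repartition(1).write.orc("/tmp/out_repartition")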
>
> My guess is that it has something to do with writing your output files.
> Writing ORC and/or Parquet is not cheap: it does a lot of compression and
> statistics calculations. I am also not sure why, but from what I have seen
> they do not scale very linearly as more data is put into a single file.
> You might also be doing the repartition too early. There should be some
> statistics on the SQL page of the UI where you can see which stages took
> a long time; that should point you in the right direction.
>
> On Tue, Jun 23, 2020 at 5:06 PM German SM <germanschia...@gmail.com>
> wrote:
>
>> Hi,
>>
>> When reducing partitions, it is better to use coalesce because it doesn't
>> need to shuffle the data.
>>
>> dataframe.coalesce(1)
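>>
>> For example, right before the write (the output path here is just a
>> placeholder):
>>
>>   dataframe.coalesce(1).write.orc("/path/to/output")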
>>
>> On Tue, Jun 23, 2020, 23:54, Hichki <harish.vs...@gmail.com> wrote:
>>
>>> Hello Team,
>>>
>>>
>>>
>>> I am new to the Spark environment. I have converted a Hive query to Spark
>>> Scala. Now I am loading data and doing performance testing. Below are the
>>> details for loading three weeks of data. The cluster-level small-file
>>> average size is set to 128 MB.
>>>
>>>
>>>
>>> 1. The new temp table I am loading data into is ORC formatted, since the
>>> current Hive table is stored as ORC.
>>>
>>> 2. Each partition folder of the Hive table is 200 MB in size.
>>>
>>> 3. I am using repartition(1) in the Spark code so that it creates one
>>> 200 MB part file in each partition folder (to avoid the small-file issue);
>>> the write is roughly as sketched after this list. With this, the job
>>> completes in 23 to 26 minutes.
>>>
>>> 4. If I don't use repartition(), the job completes in 12 to 13 minutes,
>>> but the problem with this approach is that it creates 800 part files
>>> (each <128 MB) in each partition folder.
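>>>
>>> Roughly, the write looks like this (a simplified sketch; the column and
>>> path names below are placeholders, not my real ones):
>>>
>>>   df.repartition(1)
>>>     .write
>>>     .mode("overwrite")
>>>     .partitionBy("part_col")
>>>     .orc("/path/to/new_temp_table")
>>>
>>>   // dropping the repartition(1) above is what produces the ~800 small
>>>   // part files per partition folder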
>>>
>>>
>>>
>>> I am not quite sure how to reduce the processing time without creating
>>> small files at the same time. Could anyone please help me with this
>>> situation?
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
