This is no longer strictly accurate, since we now do have a "fanout"
writer for Spark. But the original rationale is that it is much more
efficient to keep a single file handle open at a time and write to it
than to open a new file handle every time a Spark task encounters a new
partition. The fanout writer instead opens a handle for each partition
as the writer first sees it, and keeps it open for the rest of the task.
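
For example, a minimal sketch assuming an Iceberg table named
catalog.db.events and a DataFrame df (both names hypothetical);
"fanout-enabled" is the Iceberg Spark write option:

    // Enable the fanout writer for this write only. One handle stays
    // open per partition seen, so no local sort on partition columns
    // is required before writing.
    df.writeTo("catalog.db.events")
      .option("fanout-enabled", "true")
      .append()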

Now that said, the required sort for the default writer is only a local
sort. For the best performance in producing as few files as possible,
write distribution mode "hash" will force a real shuffle but eliminate
this issue by ensuring each Spark task writes to a single partition (or
a small, ordered set of partitions). We need to update this document to
cover distribution modes, especially since "hash" will soon be the new
default and this information is mostly relevant for manual tuning.
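
Setting that looks something like the following (the table name is
hypothetical; write.distribution-mode is the Iceberg table property,
with valid values "none", "hash", and "range"):

    // Request a hash shuffle on partition values before the write,
    // so each Spark task ends up handling a single partition or a
    // small set of partitions.
    spark.sql(
      "ALTER TABLE catalog.db.events " +
      "SET TBLPROPERTIES ('write.distribution-mode' = 'hash')")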

If your data is already organized the way you want, setting the
distribution mode to "none" will avoid this shuffle. If you don't mind
multiple file handles being open at the same time, you can also enable
the fanout writer option. With distribution mode "none" and the fanout
writer together you basically write in the fastest way possible, at the
expense of memory at write time, and possibly many small files if your
data isn't already organized.
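
A rough sketch of combining the two, with the same hypothetical table
and DataFrame names as above:

    // Skip the pre-write shuffle entirely...
    spark.sql(
      "ALTER TABLE catalog.db.events " +
      "SET TBLPROPERTIES ('write.distribution-mode' = 'none')")

    // ...and let the fanout writer keep one open handle per partition,
    // trading write-time memory (and possibly file count) for speed.
    df.writeTo("catalog.db.events")
      .option("fanout-enabled", "true")
      .append()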

On Tue, Mar 7, 2023 at 9:46 PM Manu Zhang <[email protected]> wrote:

> Hi all,
>
> As per
> https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables,
> sort is required for Spark writing to a partitioned table. Does anyone know
> the reason behind it? If this is to avoid creating too many small files,
> isn't shuffle/repartition sufficient?
>
> Thanks,
> Manu
>
>
