https://github.com/apache/iceberg/issues/7037
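
For anyone landing on this thread later, the settings discussed below can be applied roughly like this (a sketch only; the catalog and table names are made up, and defaults may differ by Iceberg version):

```sql
-- Hash-distribute rows before writing so each Spark task handles a single
-- partition (or a single set of partitions). Catalog/table are hypothetical.
ALTER TABLE my_catalog.db.events
SET TBLPROPERTIES ('write.distribution-mode' = 'hash');

-- If the data is already organized the way you want, skip the shuffle:
-- ALTER TABLE my_catalog.db.events
-- SET TBLPROPERTIES ('write.distribution-mode' = 'none');
```

On the DataFrame side, the fan-out writer can be enabled per write with
`df.writeTo("my_catalog.db.events").option("fanout-enabled", "true").append()`,
trading write-time memory for the ability to write unsorted data without a
local sort.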

On Tue, Apr 25, 2023 at 1:52 PM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Thanks! It would be great if we could update the doc to avoid confusion.
>
> On Tue, Apr 25, 2023 at 11:47 AM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> We have implemented this natively in Spark, and explicit sorts are no
>> longer required. Iceberg takes into account both the partition and sort key
>> in the table to request a distribution and ordering from Spark. This should
>> be supported for both batch and micro-batch writes.
>>
>> - Anton
>>
>> On Apr 25, 2023, at 11:05 AM, Pucheng Yang <py...@pinterest.com.INVALID>
>> wrote:
>>
>> Hi, to confirm:
>>
>> In the doc,
>> https://iceberg.apache.org/docs/1.0.0/spark-writes/#writing-to-partitioned-tables,
>> it says "Explicit sort is necessary because Spark doesn’t allow Iceberg to
>> request a sort before writing as of Spark 3.0. SPARK-23889
>> <https://issues.apache.org/jira/browse/SPARK-23889> is filed to enable
>> Iceberg to require specific distribution & sort order to Spark."
>>
>> I found that all relevant JIRAs in SPARK-23889
>> <https://issues.apache.org/jira/browse/SPARK-23889> are resolved in
>> Spark 3.2.0. Does that mean we no longer need an explicit sort from
>> Spark 3.2.0 onward?
>>
>> Thanks
>>
>> On Tue, Mar 7, 2023 at 8:10 PM Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>>> This is no longer accurate, since we now have a "fan-out" writer for
>>> Spark. But originally the idea was that it is far more efficient to
>>> open a single file handle at a time and write to it than to open a new
>>> file handle for every new partition encountered within the same Spark
>>> task. The fan-out writer simply opens a new handle whenever it sees a
>>> new partition.
>>>
>>> Now that said, this is a local sort required only by the default writer.
>>> For the best results in producing as few files as possible, write
>>> distribution mode "hash" will force a real shuffle but eliminate this
>>> issue by making sure each Spark task writes to a single partition (or a
>>> single set of partitions) in order. We need to update this document to
>>> cover distribution modes, especially since hash will soon be the new
>>> default and this information is basically for manual tuning only.
>>>
>>> If your data is already organized the way you want, setting the
>>> distribution mode to "none" will avoid this shuffle. If you don't mind
>>> multiple file handles being open at the same time, you can set the
>>> fan-out writer option. With "none" and the fan-out writer you will
>>> basically write in the fastest way possible, at the expense of memory at
>>> write time and possibly generating many files if your data isn't
>>> organized.
>>>
>>> On Tue, Mar 7, 2023 at 9:46 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As per
>>>> https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables,
>>>> a sort is required when Spark writes to a partitioned table. Does anyone
>>>> know the reason behind it? If this is to avoid creating too many small
>>>> files, isn't a shuffle/repartition sufficient?
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>>
>>
