rdblue commented on a change in pull request #1523:
URL: https://github.com/apache/iceberg/pull/1523#discussion_r496903557
##########
File path: site/docs/spark.md
##########
@@ -519,6 +519,59 @@ data.writeTo("prod.db.table")
.createOrReplace()
```
+## Writing against partitioned table
+
+Iceberg requires the data to be sorted according to the partition spec in prior to write against partitioned table.
Review comment:
You're right that the query fails. Although the data is partitioned to
group by category, Spark uses hash partitioning, so there is no guarantee
that each partition contains only one category. If each partition did, it
would work. Some partitions are going to contain more than one category, and a
partition is a task. Sorting by id mixes the categories together, which causes
the failure. I think it would also fail without the sort, because the data
coming from the map side of the shuffle would also be mixed together.
For repartitioning, it appears that Spark will cluster the data, but you
actually need a sort to do it. However, there are other cases where you don't
need to add an explicit sort, such as when you can guarantee that only one
partition is written by a job (e.g., writing an aggregate table for one day) or
when the input data already aligns with the partitioning of the output.
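To illustrate the pattern discussed above, here is a hedged sketch in the same Scala/DataFrame style as the doc's existing snippets. It assumes a hypothetical table `prod.db.table` partitioned by a `category` column and a DataFrame `data` with that column; the point is that an explicit sort on the partition column clusters each category's rows together within each write task, so the writer sees one partition value at a time:

```scala
// Sketch only: assumes `prod.db.table` is partitioned by `category`
// and `data` is a DataFrame containing a `category` column.

// Global sort by the partition column: rows for each category end up
// contiguous, so each write task handles one category at a time.
data.orderBy("category")
    .writeTo("prod.db.table")
    .append()

// Repartitioning alone is not enough; add a sort within each partition
// so rows are still clustered by category inside every task.
data.repartition(col("category"))
    .sortWithinPartitions("category")
    .writeTo("prod.db.table")
    .append()
```

This is not the only valid layout; as noted above, a job that writes a single output partition, or input that already matches the table's partitioning, needs no explicit sort.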
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]