rdblue commented on a change in pull request #1523:
URL: https://github.com/apache/iceberg/pull/1523#discussion_r496903557
##########
File path: site/docs/spark.md
##########
@@ -519,6 +519,59 @@ data.writeTo("prod.db.table")
.createOrReplace()
```
+## Writing against partitioned table
+
+Iceberg requires the data to be sorted according to the partition spec in prior to write against partitioned table.
Review comment:
You're right that the query fails. Although the data is partitioned to
group by category, Spark uses hash partitioning, so there is no guarantee
that each partition contains only one category. If each partition did, it
would work. Some partitions are going to contain more than one category, and a
partition is a task. Sorting by id mixes the categories together, which causes
the failure. I think it would also fail without the sort, because the data
coming from the map side of the shuffle would also be mixed together.
For repartitioning, it appears that Spark will cluster the data, but you
actually need a sort to do it. However, there are other cases where you don't
need to add an explicit sort, such as when you can guarantee that only one
partition is written by a job (e.g., writing an aggregate table for one day) or
when the input data already aligns with the partitioning of the output.
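To illustrate the pattern discussed above, here is a hedged sketch in the same Scala/DataFrame style as the doc's existing snippets. It assumes a hypothetical table `prod.db.table` partitioned by a `category` column and a DataFrame `data` with that column; the point is that an explicit sort on the partition column clusters each category's rows together within each write task, so the writer sees one partition value at a time:

```scala
// Sketch only: assumes `prod.db.table` is partitioned by `category`
// and `data` is a DataFrame containing a `category` column.

// Global sort by the partition column: rows for each category end up
// contiguous, so each write task handles one category at a time.
data.orderBy("category")
    .writeTo("prod.db.table")
    .append()

// Repartitioning alone is not enough; add a sort within each partition
// so rows are still clustered by category inside every task.
data.repartition(col("category"))
    .sortWithinPartitions("category")
    .writeTo("prod.db.table")
    .append()
```

This is not the only valid layout; as noted above, a job that writes a single output partition, or input that already matches the table's partitioning, needs no explicit sort.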
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]