Re: tradeoffs between serializable vs snapshot isolation for single writer

Ryan Blue Thu, 04 May 2023 14:53:21 -0700

Hi Nirav,

If you only have one writer, then there is no performance cost to using
serializable. The isolation level only matters if you're retrying commits
and need to validate them. For serializable, Iceberg validates that the
result of the operation is unchanged: that is, no changes committed to the
table since the operation started affect the result. Since that's clearly
the case with only one writer, there is no performance cost.

Serializability is not necessarily achieved with pessimistic concurrency.
The definition is that when transactions run in parallel, there is some
ordering where you would get the same result if they were to run
sequentially. For Iceberg, that ordering is the lineage of snapshots in the
table, and we achieve the guarantee by allowing a commit only if you would
get the same result had the operation run exactly at commit time.
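
To make the definition concrete, here is a small sketch that checks whether
an observed outcome matches at least one sequential ordering of the
transactions. It is a toy model, not Iceberg code: the table is just a set
of file names, and is_serializable, apply_serially, append_a and append_b
are hypothetical names made up for the illustration.

```python
from itertools import permutations


def apply_serially(initial, transactions):
    """Run the transactions one after another, starting from `initial`."""
    state = initial
    for txn in transactions:
        state = txn(state)
    return state


def is_serializable(initial, transactions, observed):
    """True if some sequential ordering of the transactions yields `observed`."""
    return any(
        apply_serially(initial, order) == observed
        for order in permutations(transactions)
    )


# Hypothetical transactions: each maps a set of file names to a new set.
def append_a(files):
    return files | {"A"}


def append_b(files):
    return files | {"B"}


s1 = frozenset({"X"})

# Both appends are visible: matches either sequential order.
both_visible = is_serializable(s1, [append_a, append_b], frozenset({"X", "A", "B"}))

# B's append was lost: no sequential order produces this state.
lost_update = is_serializable(s1, [append_a, append_b], frozenset({"X", "A"}))

print(both_visible, lost_update)
```

The brute-force search over orderings is only there to mirror the
definition. Nothing in it requires locks, which is the point: the guarantee
is about the outcome, not about how the concurrency is controlled.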

Here's an example. Say I have two concurrent writers appending files A and
B to a table at snapshot s1. Both write the data in parallel to A and B,
then attempt to commit at the same time. One writer wins (say, B) to
produce s2 and the other retries. After the retry, A is committed in s3
because there is no conflict with s2 where B was committed. On the other
hand, if both writers were trying to modify overlapping existing data, say
by rewriting file X, then only one transaction would succeed.
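
The example above can be simulated with a toy optimistic commit loop. This
is only a sketch under simplifying assumptions, not Iceberg's
implementation: ToyTable, swap, commit and CommitConflict are hypothetical
names, and the only validation performed is that every file an operation
removes still exists in the snapshot it is being committed on top of.

```python
class CommitConflict(Exception):
    """Raised when validation fails against the latest snapshot."""


class ToyTable:
    """A table modeled as a list of snapshots; each snapshot is a set of files."""

    def __init__(self, files):
        self.snapshots = [frozenset(files)]  # snapshot ids start at 1 (s1)

    @property
    def current_id(self):
        return len(self.snapshots)

    def files(self, snapshot_id=None):
        return self.snapshots[(snapshot_id or self.current_id) - 1]

    def swap(self, expected_id, new_files):
        """Compare-and-swap: succeed only if nothing was committed since expected_id."""
        if expected_id != self.current_id:
            return False
        self.snapshots.append(frozenset(new_files))
        return True


def commit(table, base_id, added=(), removed=(), max_attempts=3):
    """Try to commit; on a lost race, refresh the base, re-validate, and retry."""
    added, removed = set(added), set(removed)
    for _ in range(max_attempts):
        base_files = table.files(base_id)
        # Simplified validation: the files being rewritten must still exist.
        missing = removed - base_files
        if missing:
            raise CommitConflict(f"files no longer in table: {sorted(missing)}")
        if table.swap(base_id, (base_files - removed) | added):
            return table.current_id
        base_id = table.current_id  # someone else committed first; retry
    raise CommitConflict("too many retries")


# Two appends that both started from s1: B wins, A retries and lands in s3.
table = ToyTable({"X"})
s2 = commit(table, base_id=1, added={"B"})
s3 = commit(table, base_id=1, added={"A"})

# Two rewrites of file X that both started from s1: only the first succeeds.
rewrites = ToyTable({"X"})
commit(rewrites, base_id=1, added={"X1"}, removed={"X"})
try:
    commit(rewrites, base_id=1, added={"X2"}, removed={"X"})
    second_rewrite_failed = False
except CommitConflict:
    second_rewrite_failed = True

print(s2, s3, sorted(table.files()), second_rewrite_failed)
```

With a single writer the swap never loses a race, so the retry and
re-validation path is never taken.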

Ryan

On Thu, May 4, 2023 at 12:19 PM Nirav Patel <nira...@gmail.com> wrote:

> I am trying to ingest data into an Iceberg table using Spark streaming.
> There are not multiple writers to the same data at the moment. According
> to the Iceberg API
> <https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/IsolationLevel.html#:%7E:text=Both%20of%20them%20provide%20a,environments%20with%20many%20concurrent%20writers.>,
> the default isolation level for a table is serializable. I want to
> understand: if only a single application (a single Spark streaming job in
> my case) is writing to the Iceberg table, is there any advantage or
> disadvantage to using serializable rather than snapshot isolation? Is
> there any performance impact of using serializable when only one
> application is writing to the table? Also, it seems Iceberg allows all
> writers to write into a snapshot and uses OCC to decide if one needs to
> retry because it was late. In that case, how is it serializable at all?
> Isn't serializability achieved via pessimistic concurrency control? I
> would like to understand how Iceberg implements the serializable
> isolation level and how it differs from snapshot isolation.
>
> Thanks
>

-- 
Ryan Blue
Tabular
