Re: Partition Loss Policies issues

Denis Magda Fri, 02 Nov 2018 14:34:28 -0700

Hi Stan,

Thanks for the extensive analysis. Alex G. could you please step in and
share your thoughts. Seems it's time to revisit IEP-4 and prioritize the
gaps.


--
Denis

On Wed, Oct 31, 2018 at 7:56 AM Stanislav Lukyanov <[email protected]>
wrote:

> Hi Igniters,
>
> I've been looking into various scenarios of Partition Loss Policies usage
> recently,
> and found a number of issues in the current implementation.
>
> I'll start with an overview, but if you'd like to dive to a proposal I
> have right now then please
> feel free to scroll down to TLDR.
>
> The list of issues is below:
> https://issues.apache.org/jira/browse/IGNITE-10041: Partition loss
> policies work incorrectly with BLT
> https://issues.apache.org/jira/browse/IGNITE-10043: Lost partitions list
> is reset when only one server is alive in the cluster
> https://issues.apache.org/jira/browse/IGNITE-9841: SQL doesn't take lost
> partitions into account when persistence is enabled
> https://issues.apache.org/jira/browse/IGNITE-10057: SQL queries hang
> during rebalance if there are LOST partitions
> https://issues.apache.org/jira/browse/IGNITE-9902: ScanQuery doesn't take
> lost partitions into account
> https://issues.apache.org/jira/browse/IGNITE-10059: Local scan query
> against LOST partition fails
> https://issues.apache.org/jira/browse/IGNITE-10044: LOST partition is
> marked as OWNING after the owner rejoins with existing persistent data
> https://issues.apache.org/jira/browse/IGNITE-10058: resetLostPartitions()
> leaves an additional copy of a partition in the cluster
>
> I'm sure this is not a complete list, but this is what I could find by
> tackling how queries and
> persistence interact with current handling of partition loss.
>
> It seems that the issues - from this list and some other fixed recently -
> can be split into three categories
> - corner case bugs - there are always some, and we can fix them as they
> show up
> - handling of lost partitions by different APIs - while JCache API handles
> lost partitions fine,
> SQL and Scan queries have known issues; other APIs, such different types
> of queries, DataStreamer,
> etc probably need to have more testing
> - Partition Loss Policices + BLT (
> https://issues.apache.org/jira/browse/IGNITE-10041) - BLT seems
> to be fundametally conflicting with the pre-existing semantics of
> partition loss
>
> While the former two categories can be solved case-by-case, the last one
> needs a wider design effort.
> We need to reimagine our partition loss semantics for BLT, and change
> behavior accordingly.
> For now, most of the features don't really work for a cache with BLT, with
> only READ_WRITE_SAFE and
> READ_ONLY_SAFE working correctly (good thing - these two are the most
> useful policies anyway).
>
> TLDR: we have issues with partition loss policices, and the largest one is
> that BLT semantics
> conflict with most partition lost policices, and we need to address this
> somehow.
>
> What I suggest to do right now:
> 1. Deny the configurations that don't work - e.g. just throw an exception
> if a cache starts
> with BLT and PartitionLossPolicy.IGNORE or others.
> 2. Change default PartitionLossPolicy to READ_WRITE_SAFE *for persistent
> caches only*.
> This is what effectively in place for the persistent caches already (since
> IGNORE semantics are
> not supported), so this shouldn't bring a lot of compatibility issues.
>
> I believe doing this will at least help us to protect the users from
> unexpected/inconsistent behavior.
> Actual design changes can be done later, e.g. as a part of IEP 4 Phase 2/3.
>
> WDYT?
>
> Thanks,
> Stan
>
>

Re: Partition Loss Policies issues

Reply via email to