merlimat opened a new pull request #2959: Fixed race condition in schema initialization in partitioned topics
URL: https://github.com/apache/pulsar/pull/2959
 
 
   ### Motivation
   
   There is a race condition when producers and consumers connect to a new partitioned topic concurrently and both try to initialize the schema.
   
   As a result, consumers get a subscribe error (they succeed when the application retries).
   
   The exception looks like this:
   
   ```
   1541537636157/test-pythonpartitiontopictest-output-edot-partition-1][test-subs-edot] Failed to create consumer: No such ledger exists
   java.util.concurrent.CompletionException: org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_181]
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$15(BookkeeperSchemaStorage.java:441) ~[org.apache.pulsar-pulsar-broker-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
   ```
   
   The main issue is that `getOrCreateSchemaLocator()` creates the z-node with a dummy marker (ledgerId=-1), then creates a new ledger, and finally updates the z-node with the real ledger id.
   Consumers that read the z-node between the first and last step can see it pointing to ledger -1, hence the error.
   
   ### Modifications
   
    * Added more information to the BK exception reporting (e.g. which operation we are trying to do and the ledger id).
    * Removed `getOrCreateSchemaLocator()`. Instead, we do get(), then create the ledger, and then try to create the z-node with the real ledger id, so no incomplete state is ever visible.
    * Handle concurrent create conflicts (e.g. across multiple brokers) by retrying from the get operation.
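
The flow described in the modifications above can be sketched roughly as follows. This is a simplified in-memory model, not the code in `BookkeeperSchemaStorage`: the class `SchemaLocatorSketch`, its methods, and the example path are hypothetical, a `ConcurrentHashMap` stands in for the z-node store (with `putIfAbsent` playing the role of a create that fails if the node already exists), and an `AtomicLong` stands in for ledger creation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory model of the get -> create ledger -> create z-node flow.
class SchemaLocatorSketch {
    // Stand-in for the z-node store: schema path -> ledger id.
    private final Map<String, Long> znodes = new ConcurrentHashMap<>();
    // Stand-in for ledger creation; real ledger ids are never negative here.
    private final AtomicLong nextLedgerId = new AtomicLong(0);

    private long createLedger() {
        return nextLedgerId.getAndIncrement();
    }

    long getOrCreateLedger(String path) {
        while (true) {
            // 1. get(): if a locator already exists, use it.
            Long existing = znodes.get(path);
            if (existing != null) {
                return existing;
            }
            // 2. Create the ledger first, before anything is published.
            long ledgerId = createLedger();
            // 3. Create the z-node with the real ledger id. putIfAbsent
            //    returns null only if this caller created the entry.
            if (znodes.putIfAbsent(path, ledgerId) == null) {
                return ledgerId;
            }
            // 4. Another caller created the z-node first: retry from get().
            //    (Cleaning up the unused ledger is left out of this sketch.)
        }
    }

    public static void main(String[] args) {
        SchemaLocatorSketch storage = new SchemaLocatorSketch();
        long a = storage.getOrCreateLedger("schemas/topic-partition-0");
        long b = storage.getOrCreateLedger("schemas/topic-partition-0");
        System.out.println("same ledger: " + (a == b));
    }
}
```

Because the entry is only ever written with an id returned by `createLedger()`, a concurrent reader in this model either finds no entry or finds a real ledger id; the `-1` placeholder never exists.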

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
