Hi all,
I drafted a PIP about configurable data source priority for offloaded
messages. The newest version is at
https://gist.github.com/Renkai/e5be927404fbfd8289e7703c55812b1c
and the current version is posted below. I hope someone can help review it
and make it an official PIP.
Motivation
Currently, once data in Pulsar has been offloaded to the second storage layer,
it can still exist in BookKeeper for a period of time, but the client will
read it directly from the second layer.
This may lead to several problems:
Reads from the second layer have different performance characteristics, which
may lead users to wrong estimates if they don't know which layer they are
reading from.
The second layer may be managed by a team other than the Pulsar management
team (for example, an independent HDFS management team), which may apply its
own quota or authorization policies to users.
The second layer's storage can be unbounded in theory, so if a user
accidentally sets a cursor to a wrong time, a lot of resources are wasted.
So it's better to make the data source configurable when data exists in both
layers. The following options may be enough:
BOOKKEEPER_ONLY
BOOKKEEPER_FIRST
OFFLOADED_ONLY
OFFLOADED_FIRST
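As a rough illustration, the four options could be modeled as a Java enum. The
name TieredReadPriority is taken from the signature proposed later in this
mail; the two helper methods (prefersBookkeeper, allowsFallback) are
hypothetical and only show how each option splits into "which layer first" and
"may the other layer be tried".

```java
/** Sketch of the four proposed options; the helper methods are hypothetical. */
enum TieredReadPriority {
    BOOKKEEPER_ONLY(true, false),
    BOOKKEEPER_FIRST(true, true),
    OFFLOADED_ONLY(false, false),
    OFFLOADED_FIRST(false, true);

    private final boolean prefersBookkeeper;
    private final boolean allowsFallback;

    TieredReadPriority(boolean prefersBookkeeper, boolean allowsFallback) {
        this.prefersBookkeeper = prefersBookkeeper;
        this.allowsFallback = allowsFallback;
    }

    /** True if BookKeeper is the layer that is tried first. */
    boolean prefersBookkeeper() {
        return prefersBookkeeper;
    }

    /** True if the other layer may be tried when the preferred one fails. */
    boolean allowsFallback() {
        return allowsFallback;
    }
}
```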
Background
Which layer the broker reads from is currently decided by
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl#getLedgerHandle(long
ledgerId)
<https://github.com/apache/pulsar/blob/a3584309017f1894a05b05c695c42e7aa8b7c3a7/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1521>
which has only one parameter, ledgerId, and chooses the offloaded ledger
handle as soon as the ledger has been offloaded. If the chosen handle fails,
the whole getLedgerHandle call fails.
Implementation
The tiered read priority should be settable per namespace or per topic. The
command line usage should look like
pulsar-admin namespaces --set-tiered-read-priority tenant/namespace priority-policy
pulsar-admin topics --set-tiered-read-priority tenant/namespace/topic priority-policy
If not configured, OFFLOADED_FIRST should be used by default, which results in
the same behavior as the current version.
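A minimal sketch of how the effective priority might be resolved. It assumes
that a topic-level setting overrides a namespace-level one, which this proposal
does not state explicitly; the class and method names are hypothetical, and
null stands for "not configured".

```java
final class TieredReadPolicySketch {
    enum TieredReadPriority { BOOKKEEPER_ONLY, BOOKKEEPER_FIRST, OFFLOADED_ONLY, OFFLOADED_FIRST }

    /** Default from the proposal: keeps today's behavior. */
    static final TieredReadPriority DEFAULT_PRIORITY = TieredReadPriority.OFFLOADED_FIRST;

    /**
     * Resolves the priority to use for a topic.
     * Assumption: topic level wins over namespace level; null means "not set".
     */
    static TieredReadPriority resolve(TieredReadPriority topicLevel,
                                      TieredReadPriority namespaceLevel) {
        if (topicLevel != null) {
            return topicLevel;
        }
        if (namespaceLevel != null) {
            return namespaceLevel;
        }
        return DEFAULT_PRIORITY;
    }

    private TieredReadPolicySketch() {
    }
}
```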
The corresponding ManagedLedger should then be aware of which priority option
is in use, and the signature of the getLedgerHandle method should be changed
to
CompletableFuture<ReadHandle> getLedgerHandle(
        long ledgerId, TieredReadPriority priority)
For BOOKKEEPER_ONLY and OFFLOADED_ONLY, the ManagedLedger will use the
corresponding ReadHandle directly. For BOOKKEEPER_FIRST and OFFLOADED_FIRST,
the ManagedLedger will try the preferred layer first and fall back to the other
layer if that fails, whether because the ledger does not exist in the preferred
layer or because of a network, disk, or authorization problem with it.
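The selection and fallback rules above can be sketched as follows. This is not
the actual ManagedLedgerImpl code: the type parameter H stands in for
ReadHandle, and the two suppliers are hypothetical stand-ins for opening the
BookKeeper handle and the offloaded handle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class TieredReadFallbackSketch {
    enum TieredReadPriority { BOOKKEEPER_ONLY, BOOKKEEPER_FIRST, OFFLOADED_ONLY, OFFLOADED_FIRST }

    /** Chooses which layer(s) to open according to the configured priority. */
    static <H> CompletableFuture<H> open(TieredReadPriority priority,
                                         Supplier<CompletableFuture<H>> bookkeeper,
                                         Supplier<CompletableFuture<H>> offloaded) {
        switch (priority) {
            case BOOKKEEPER_ONLY:
                return bookkeeper.get();
            case OFFLOADED_ONLY:
                return offloaded.get();
            case BOOKKEEPER_FIRST:
                return withFallback(bookkeeper, offloaded);
            case OFFLOADED_FIRST:
                return withFallback(offloaded, bookkeeper);
            default:
                throw new IllegalArgumentException("Unknown priority: " + priority);
        }
    }

    /** Opens the preferred layer; on any failure, tries the other layer. */
    static <H> CompletableFuture<H> withFallback(Supplier<CompletableFuture<H>> preferred,
                                                 Supplier<CompletableFuture<H>> other) {
        CompletableFuture<H> result = new CompletableFuture<>();
        preferred.get().whenComplete((handle, error) -> {
            if (error == null) {
                result.complete(handle);
                return;
            }
            // The preferred layer failed for any reason: try the other layer.
            other.get().whenComplete((fallbackHandle, fallbackError) -> {
                if (fallbackError == null) {
                    result.complete(fallbackHandle);
                    return;
                }
                // Both layers failed: report the second error, keep the first attached.
                if (fallbackError != error) {
                    fallbackError.addSuppressed(error);
                }
                result.completeExceptionally(fallbackError);
            });
        });
        return result;
    }

    private TieredReadFallbackSketch() {
    }
}
```

With BOOKKEEPER_FIRST and a failing BookKeeper supplier, the returned future
completes with the offloaded handle; with BOOKKEEPER_ONLY, the same failure is
passed straight to the caller.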