asafm commented on issue #16680:
URL: https://github.com/apache/pulsar/issues/16680#issuecomment-1195377352
Let me see if I understand correctly.
Today, when an entry filter receives an entry, it gets an Entry that has:
```java
public interface Entry {
    /**
     * @return the data
     */
    byte[] getData();
    byte[] getDataAndRelease();
    /**
     * @return the entry length in bytes
     */
    int getLength();
    /**
     * @return the data buffer for the entry
     */
    ByteBuf getDataBuffer();
    /**
     * @return the position at which the entry was stored
     */
    Position getPosition();
    /**
     * @return ledgerId of the position
     */
    long getLedgerId();
    /**
     * @return entryId of the position
     */
    long getEntryId();
    /**
     * Release the resources (data) allocated for this entry and recycle if
     * all the resources are deallocated (ref-count of data reached to 0).
     */
    boolean release();
}
```
The `Entry` interface doesn't tell you whether this is a batched entry.
You also get a `FilterContext`:
```java
@Data
public class FilterContext {
    private Subscription subscription;
    private MessageMetadata msgMetadata;
    private Consumer consumer;
}
```
and in `MessageMetadata`, you have
```protobuf
// differentiate single and batch message metadata
optional int32 num_messages_in_batch = 11 [default = 1];
```
This lets you know whether the entry is batched.
The filter developer can then decide which class to use to deserialize the
entry's byte array into a list of separate messages.
So currently, when the entry is batched, the filter author can act on it
only by paying the cost of deserializing it, right?
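To make the batch check concrete, here is a minimal sketch. `MetadataView` is a hypothetical stand-in for the two `MessageMetadata` accessors I'm assuming here (`hasNumMessagesInBatch()` and `getNumMessagesInBatch()`), so the snippet compiles without Pulsar on the classpath. Since the field defaults to 1, I'm also assuming that the presence of the field, rather than its value, is what marks an entry as batched.

```java
/** Sketch only: shows how a filter might tell batched entries apart. */
final class BatchCheckSketch {
    /** Hypothetical stand-in for the relevant MessageMetadata accessors. */
    interface MetadataView {
        boolean hasNumMessagesInBatch();
        int getNumMessagesInBatch();
    }

    /** Builds a stand-in; a null argument models "field not set". */
    static MetadataView metadata(Integer numMessagesInBatch) {
        return new MetadataView() {
            @Override
            public boolean hasNumMessagesInBatch() {
                return numMessagesInBatch != null;
            }

            @Override
            public int getNumMessagesInBatch() {
                // Mirrors the protobuf default of 1 when the field is unset.
                return numMessagesInBatch == null ? 1 : numMessagesInBatch;
            }
        };
    }

    /** A batch holding a single message still counts as batched here. */
    static boolean isBatched(MetadataView md) {
        return md.hasNumMessagesInBatch();
    }
}
```

Anything beyond this check (looking at the individual messages) requires deserializing the payload, which is the cost mentioned above.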
You're saying we can alter the clients (all clients) to extract specific
properties from each message and place those property values in the message
metadata of the batched entry. The filter can then use the values to decide
whether to reject or accept the entry.
The problem is that if the messages in a batch have different values for a
given property, the filter author can't return a single reject or accept for
the entry, since some messages would be rejected and some accepted.
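The mixed-batch problem can be sketched as follows. This is an illustration only: `Decision` and `decide` are hypothetical names, not part of the entry filter API, and the per-message values are passed in directly rather than read from metadata.

```java
import java.util.List;
import java.util.function.Predicate;

/** Sketch only: one decision per entry versus per-message property values. */
final class EntryDecisionSketch {
    enum Decision { ACCEPT, REJECT, UNDECIDABLE }

    /**
     * valuesInBatch holds the value of one property for each message in the batch;
     * accepts is the filter's per-message rule.
     */
    static Decision decide(List<String> valuesInBatch, Predicate<String> accepts) {
        boolean anyAccepted = false;
        boolean anyRejected = false;
        for (String value : valuesInBatch) {
            if (accepts.test(value)) {
                anyAccepted = true;
            } else {
                anyRejected = true;
            }
        }
        if (anyAccepted && anyRejected) {
            // Mixed batch: no single per-entry answer is right for every message.
            return Decision.UNDECIDABLE;
        }
        // An empty batch falls through to REJECT in this sketch.
        return anyAccepted ? Decision.ACCEPT : Decision.REJECT;
    }
}
```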
So the only solution offered in this suggestion is to change the way
messages are batched: records are collected into the same batch only if they
have the same values for the properties configured to be extracted.
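As I understand it, the changed batching would behave roughly like the sketch below. `GroupByPropertiesBatcher` and `Msg` are hypothetical names for illustration and not the client's actual batch container; size and time thresholds are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: groups messages into batches by configured property values. */
final class GroupByPropertiesBatcher {
    /** Minimal stand-in for a message: its properties plus a payload. */
    static final class Msg {
        final Map<String, String> properties;
        final String payload;

        Msg(Map<String, String> properties, String payload) {
            this.properties = properties;
            this.payload = payload;
        }
    }

    private final List<String> groupByProperties;
    // One open batch per distinct combination of property values.
    private final Map<List<String>, List<Msg>> openBatches = new LinkedHashMap<>();

    GroupByPropertiesBatcher(List<String> groupByProperties) {
        this.groupByProperties = new ArrayList<>(groupByProperties);
    }

    /** Adds the message to the batch whose key matches its property values. */
    void add(Msg msg) {
        List<String> key = new ArrayList<>();
        for (String name : groupByProperties) {
            key.add(msg.properties.get(name)); // null if the property is absent
        }
        openBatches
                .computeIfAbsent(Collections.unmodifiableList(key), k -> new ArrayList<>())
                .add(msg);
    }

    /** Number of batches currently buffered in memory. */
    int openBatchCount() {
        return openBatches.size();
    }

    /** Returns all open batches and clears the buffer. */
    Map<List<String>, List<Msg>> flush() {
        Map<List<String>, List<Msg>> out = new LinkedHashMap<>(openBatches);
        openBatches.clear();
        return out;
    }
}
```

With a single configured property `region` and messages carrying `eu`, `us`, `eu`, this keeps two open batches instead of one. In this sketch the number of open batches grows with the number of distinct value combinations seen before a flush.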
If my understanding is correct, I have a few notes:
1. What I wrote above is not clearly stated in the PIP. IMO the PIP needs to
be modified to include that explanation.
2. As I wrote before, you are changing the core batching behavior of the
client, and the user needs to be fully aware of this and its implications. As
proposed, a user could configure `batchedFilterProperties` without understanding
how it alters the batching behavior. One option may be to rename
`batchedFilterProperties` to `batchGroupByProperties` so users know the
batching behavior is changing. I wouldn't use the term `filter` here, since
at this stage there is no direct link to a filter.
3. Don't you need to introduce new knobs to control memory usage? Until now,
you collected records into a batch and sent it. Now you collect records into
multiple batches until a certain threshold is reached. Won't this consume
more memory? How can I control this as a user?