[GitHub] [pulsar] asafm commented on issue #16680: PIP-191: Support batched message using entry filter

GitBox Tue, 26 Jul 2022 04:46:14 -0700


asafm commented on issue #16680:
URL: https://github.com/apache/pulsar/issues/16680#issuecomment-1195377352


   Let me see if I understand correctly.
   
   Today, when an entry filter receives an entry, it gets an Entry that has:
   ```java
   public interface Entry {
   
       /**
        * @return the data
        */
       byte[] getData();
   
       byte[] getDataAndRelease();
   
       /**
        * @return the entry length in bytes
        */
       int getLength();
   
       /**
        * @return the data buffer for the entry
        */
       ByteBuf getDataBuffer();
   
       /**
        * @return the position at which the entry was stored
        */
       Position getPosition();
   
       /**
        * @return ledgerId of the position
        */
       long getLedgerId();
   
       /**
        * @return entryId of the position
        */
       long getEntryId();
   
       /**
        * Release the resources (data) allocated for this entry and recycle if 
all the resources are deallocated (ref-count
        * of data reached to 0).
        */
       boolean release();
   }
   ```
   The `Entry` interface doesn't tell you whether this is a batched entry.
   
   You also get FilterContext:
   ```java
   @Data
   public class FilterContext {
       private Subscription subscription;
       private MessageMetadata msgMetadata;
       private Consumer consumer;
   }
   ```
   
   And in `MessageMetadata`, you have:
   ```protobuf
       // differentiate single and batch message metadata
       optional int32 num_messages_in_batch = 11 [default = 1];
   ```
   
   This lets you know whether the entry is batched.
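
   To make that concrete, here is a minimal sketch of the check. `Metadata` is a hypothetical stand-in for the generated `MessageMetadata` class, not the real Pulsar type, and treating an explicitly set `num_messages_in_batch` as the batch marker is my assumption:

   ```java
   import java.util.OptionalInt;

   public class BatchCheck {
       /** Hypothetical stand-in for the generated MessageMetadata class. */
       public static final class Metadata {
           private final OptionalInt numMessagesInBatch;

           public Metadata(OptionalInt numMessagesInBatch) {
               this.numMessagesInBatch = numMessagesInBatch;
           }

           /** Mirrors the protobuf default of 1 when the field is unset. */
           public int messageCount() {
               return numMessagesInBatch.orElse(1);
           }

           public boolean hasBatchCount() {
               return numMessagesInBatch.isPresent();
           }
       }

       /** Assumption: an explicitly set field marks a batched entry. */
       public static boolean isBatched(Metadata metadata) {
           return metadata.hasBatchCount();
       }

       public static void main(String[] args) {
           Metadata single = new Metadata(OptionalInt.empty());
           Metadata batch = new Metadata(OptionalInt.of(5));
           System.out.println(isBatched(single) + " / " + single.messageCount());
           System.out.println(isBatched(batch) + " / " + batch.messageCount());
       }
   }
   ```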
   
   The filter author can then decide which class to use to deserialize the entry's byte array into a list of separate messages.
   
   
   So currently, if the entry is batched, the filter author can act on it only by paying the cost of deserializing it, right?
   
   You're saying we can alter the clients (all clients) to extract specific properties from each message and place those property values in the message metadata of the batched entry. The filter can then use those values to decide whether to accept or reject the entry.
   The problem is that if the messages in a batch have different values for a given property, the filter author can't return a single accept or reject for the entry, since some messages would be accepted and others rejected.
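
   A small sketch of that problem, assuming the filter applies the same per-message predicate to every property value in the batch. `BatchVerdict`, `Verdict`, and `decide` are hypothetical names, not part of the entry filter API:

   ```java
   import java.util.List;
   import java.util.function.Predicate;

   public class BatchVerdict {
       public enum Verdict { ACCEPT, REJECT, MIXED }

       /**
        * Applies the per-message predicate to each property value in a batch.
        * MIXED means no single entry-level answer exists.
        * An empty batch is reported as REJECT in this sketch.
        */
       public static Verdict decide(List<String> values, Predicate<String> accepts) {
           boolean anyAccepted = false;
           boolean anyRejected = false;
           for (String value : values) {
               if (accepts.test(value)) {
                   anyAccepted = true;
               } else {
                   anyRejected = true;
               }
           }
           if (anyAccepted && anyRejected) {
               return Verdict.MIXED;
           }
           return anyAccepted ? Verdict.ACCEPT : Verdict.REJECT;
       }
   }
   ```

   With values `eu, eu` and a predicate that accepts `eu`, the entry can be accepted as a whole; with `eu, us` the result is `MIXED`, which an entry-level filter has no way to express.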
   
   So the only solution offered in this suggestion is to change the way messages are batched and collect the records into a batch only if they have the same values for the properties configured to be extracted.
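
   As I understand it, the client-side change would look roughly like the sketch below. This is only an illustration of the grouping idea under my own assumptions, not the PIP's implementation; `GroupByBatcher` and `Msg` are hypothetical names:

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class GroupByBatcher {
       /** Hypothetical message: a payload plus string properties. */
       public static final class Msg {
           final String payload;
           final Map<String, String> properties;

           public Msg(String payload, Map<String, String> properties) {
               this.payload = payload;
               this.properties = properties;
           }
       }

       private final List<String> groupByProperties;
       // One open batch per distinct combination of property values.
       private final Map<List<String>, List<Msg>> openBatches = new LinkedHashMap<>();

       public GroupByBatcher(List<String> groupByProperties) {
           this.groupByProperties = new ArrayList<>(groupByProperties);
       }

       /** Adds the message to the batch whose key matches its property values. */
       public void add(Msg msg) {
           List<String> key = new ArrayList<>();
           for (String name : groupByProperties) {
               key.add(msg.properties.get(name)); // null when the property is absent
           }
           openBatches.computeIfAbsent(key, k -> new ArrayList<>()).add(msg);
       }

       public int openBatchCount() {
           return openBatches.size();
       }

       /** Returns all open batches and starts over with none. */
       public Map<List<String>, List<Msg>> flush() {
           Map<List<String>, List<Msg>> out = new LinkedHashMap<>(openBatches);
           openBatches.clear();
           return out;
       }
   }
   ```

   Note that every distinct combination of values keeps its own open batch until it is flushed, so the number of batches held at once grows with the number of distinct values.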
   
   If my understanding is correct, I have a few notes:
   
   1. None of what I wrote above is clearly stated in the PIP. IMO the PIP needs to be modified to reflect that explanation.
   2. As I wrote before, you are changing the core batching behavior of the client. The user needs to be fully aware of this and its implications. As proposed, a user can configure `batchedFilterProperties` without understanding that it alters the batching behavior. One option may be to rename `batchedFilterProperties` to `batchGroupByProperties` so they will know the batching behavior is changing. I wouldn't use the term `filter` here, since at this stage there is no direct link to the filter.
   3. Don't you need to introduce new knobs to control the memory? Up until now, you collected records into a single batch and sent it. Now you collect into multiple batches until a certain threshold is reached. Won't this consume more memory? How can I control this as a user?
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
