Hi Sohaib,

I have been rather busy these days. Sorry for replying so late!

Not sure what “deadline” you are referring to. If this is part of a campaign, I 
have to admit I am not aware of the regulations, or of what kind of help I 
should offer while staying fair to others who may raise similar issues.

Regards!

Zhanhui Li


> On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> 
> Hi guys,
> 
> Would be nice to have some feedback on this as the deadline is not too far :)
> 
> Thanks,
> Sohaib
> 
> Regards,
> Sohaib Iftikhar
> 
> -- Man is still the most extraordinary computer of all.--
> 
> 
> On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1...@gmail.com>
> wrote:
> Thank you for the pointers to the code. This was super helpful. The multiple 
> keys can probably be serialized better than separating them with a space but 
> that is already legacy I suppose.
> 
> First, filters like bloom or cuckoo are probabilistic: they can return false 
> positives. They can help make things faster but definitely cannot be used as 
> the only solution. Hence, in the end, we will still need a persistent 
> keystore/distributed set. My plan was to make this keystore distributed 
> (with Raft guarantees, etc.). The keystore can also hold a persistent filter 
> on its end. If a broker crashes, it can renew/refresh its filter from the 
> keystore, which eliminates the crash problems that you mention. The problem 
> here could be in maintaining filter performance when keys are removed from 
> the keystore (e.g. sliding windows, as mentioned in my previous mail). 
> Periodically refreshing the filters can help solve this, but I am open to 
> suggestions on how to make this better.
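A minimal sketch of the filter-plus-keystore idea in the paragraph above, assuming a plain Bloom filter and an in-memory set standing in for the persistent, replicated keystore. `FilteredKeyStore` and all of its methods are hypothetical names, not RocketMQ APIs. It shows why removals are awkward (a plain Bloom filter cannot clear bits) and how a rebuild from the keystore covers both a broker restart and a periodic refresh.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: a Bloom filter in front of an authoritative key store. */
final class FilteredKeyStore {
    private final int bits;
    private final int hashes;
    private BitSet filter;
    // Stand-in for the persistent, replicated keystore discussed in the thread.
    private final Set<String> store = new HashSet<>();

    FilteredKeyStore(int bits, int hashes) {
        this.bits = bits;
        this.hashes = hashes;
        this.filter = new BitSet(bits);
    }

    /** Returns true if the key is new and was recorded, false if it is a duplicate. */
    boolean addIfAbsent(String key) {
        // Only a "maybe" from the filter requires the more expensive store lookup.
        if (mightContain(filter, key) && store.contains(key)) {
            return false;
        }
        store.add(key);
        mark(filter, key);
        return true;
    }

    /** Removes a key from the store only; its filter bits stay set until a rebuild. */
    void remove(String key) {
        store.remove(key);
    }

    /** Rebuilds the filter from the store, e.g. after a restart or on a timer. */
    void rebuildFilter() {
        BitSet fresh = new BitSet(bits);
        for (String key : store) {
            mark(fresh, key);
        }
        filter = fresh;
    }

    boolean filterMightContain(String key) {
        return mightContain(filter, key);
    }

    private void mark(BitSet target, String key) {
        for (int i = 0; i < hashes; i++) {
            target.set(index(key, i));
        }
    }

    private boolean mightContain(BitSet source, String key) {
        for (int i = 0; i < hashes; i++) {
            if (!source.get(index(key, i))) {
                return false; // definitely never added
            }
        }
        return true; // possibly added; false positives are expected
    }

    // Double hashing: the i-th probe is derived from two base hashes.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, bits);
    }

    public static void main(String[] args) {
        FilteredKeyStore keys = new FilteredKeyStore(1024, 3);
        System.out.println(keys.addIfAbsent("order-1")); // true: first time seen
        System.out.println(keys.addIfAbsent("order-1")); // false: duplicate
    }
}
```

Between a `remove` and the next `rebuildFilter`, stale bits only cost extra store lookups; correctness still comes from the store.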
> 
> I think implementing a distributed set on the client cluster has its caveats. 
> The way I understand RocketMQ is that we do not have control over the 
> disk space/memory on the client end, so we probably only have a constant 
> amount. A distributed set on the client would also need to be persistent, 
> e.g. to survive a client restart or recovery. This basically means we need a 
> keystore on the client instead of the broker cluster. This probably puts too 
> much responsibility on the client cluster. A different approach would be to 
> ensure that the offsets are always in sync with the broker. Since the broker 
> only serves unique messages (based on the proposed solution on the 
> producer/broker end), all we need to ensure is that a client does not consume 
> messages with the same offset twice.
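A minimal sketch of the offset-based check described above, assuming messages on one queue arrive in increasing offset order. `OffsetGuard` and `shouldConsume` are hypothetical names, and the in-memory map stands in for offsets that a real consumer would persist and keep in sync with the broker.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: skip messages whose offset was already consumed on a queue. */
final class OffsetGuard {
    // Highest consumed offset per queue id (assumption: kept durable elsewhere).
    private final Map<Integer, Long> lastConsumed = new HashMap<>();

    /** Returns true if the message should be processed, false if it is a replay. */
    boolean shouldConsume(int queueId, long offset) {
        Long last = lastConsumed.get(queueId);
        if (last != null && offset <= last) {
            return false; // same or older offset: already consumed
        }
        lastConsumed.put(queueId, offset);
        return true;
    }

    public static void main(String[] args) {
        OffsetGuard guard = new OffsetGuard();
        System.out.println(guard.shouldConsume(0, 41)); // true
        System.out.println(guard.shouldConsume(0, 41)); // false: redelivered offset
    }
}
```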
> 
> Please suggest improvements if this does not look like the correct approach. 
> Also would be great if someone can come up with a completely different 
> approach so that we can weigh up pros and cons.
> 
> Thanks for reading this through and looking forward to your opinions.
> 
> Regards,
> Sohaib
> 
> Regards,
> Sohaib Iftikhar
> 
> -- Man is still the most extraordinary computer of all.--
> 
> 
> On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
> Hi Sohaib,
> 
> Regarding multiple-key support, the following pointers should clear up your 
> doubt:
> The org.apache.rocketmq.common.message.Message class has overloaded setKeys 
> methods, allowing you to set multiple keys via a string (separated by 
> space… sorry, we have not yet unified all separators; I hope this does not 
> confuse you) or a collection.
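A small standalone illustration of the space-separated encoding described above, and of why a key that itself contains the separator is a problem. It does not call RocketMQ: `KeySeparatorDemo`, `join`, and `split` are hypothetical names, and the single-space separator is taken from the description in this mail.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: keys flattened into one string and split back again. */
final class KeySeparatorDemo {
    // Assumption from the mail: multiple keys are separated by a space.
    static final String SEPARATOR = " ";

    static String join(List<String> keys) {
        return String.join(SEPARATOR, keys);
    }

    static List<String> split(String joined) {
        return Arrays.asList(joined.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Two keys survive the round trip.
        System.out.println(split(join(Arrays.asList("order-1", "user-7")))); // [order-1, user-7]
        // A key containing a space comes back as two keys.
        System.out.println(split(join(Arrays.asList("order 1", "user-7")))); // [order, 1, user-7]
    }
}
```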
> 
> 
> When the broker builds the index for a message with multiple keys, multiple 
> index entries are inserted into the index file. 
> See org.apache.rocketmq.store.index.IndexService#buildIndex
> 
> 
> In terms of eliminating message duplication, personally, I wish we had an 
> exactly-once semantic covering the whole cluster and the complete 
> send-store-consume process. A rough idea is to route each message, based on 
> its unique key, to a broker chosen by a fixed rule; the serving broker then 
> ensures uniqueness of the message by that key (as you said, with a 
> bloom filter, cuckoo filter, etc.). Things might look simple, but issues 
> arise when the cluster is experiencing membership changes. For 
> example, what if a broker crashes? We might need to propagate the 
> bloom-filter bitset synchronously to other brokers serving the same topics. 
> What if a new broker joins the cluster and starts to serve? I do not mean 
> this is too complex to implement. Rather, this is a pretty interesting topic 
> and a fancy feature to have. Alternatively, we might defer eliminating 
> duplicates to the consumption phase using some kind of distributed set. For 
> sure, the idea I am proposing faces the same challenges, including 
> membership changes. 
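A minimal sketch of the routing rule described above, assuming the simplest possible rule: hash the key onto the current broker list. `KeyRouter`, `route`, `countMoved`, and the broker names are hypothetical. The point is only to make the membership-change issue concrete: once the list changes, some keys are routed to a different broker.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: pick the broker responsible for a message key. */
final class KeyRouter {
    /** Routes a key to one broker by hashing it onto the current broker list. */
    static String route(String key, List<String> brokers) {
        return brokers.get(Math.floorMod(key.hashCode(), brokers.size()));
    }

    /** Counts how many of the keys k0..k(n-1) change broker between two memberships. */
    static int countMoved(int n, List<String> before, List<String> after) {
        int moved = 0;
        for (int i = 0; i < n; i++) {
            String key = "k" + i;
            if (!route(key, before).equals(route(key, after))) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> three = Arrays.asList("broker-a", "broker-b", "broker-c");
        List<String> four = Arrays.asList("broker-a", "broker-b", "broker-c", "broker-d");
        System.out.println("keys that moved: " + countMoved(100, three, four));
    }
}
```

Every moved key lands on a broker whose filter has never seen it, which is why the filter state would have to be propagated or rebuilt on membership changes.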
> 
> Guys of dev board, any insights on this issue?
> 
> Zhanhui Li
> 
> 
>> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
>> 
>> Hi Zhanhui,
>> 
>> I have a doubt about these multiple keys. If I am wrong in any of the
>> assumptions I make please point it out.
>> 
>> If there is support for multiple keys I cannot see this in the code. The
>> class Message only stores a single key in the property map against the
>> property name "KEYS". Is this also done in the same way as tags? That is,
>> are different keys separated with ' || '? So basically, as a user of the
>> producer API, it is the user's responsibility to ensure that the
>> different keys are separated with the correct separator. I can see an
>> obvious problem here. What if the key contains this special character
>> ' || '? But maybe this event is rare and hence this is not important. Could
>> you point me to some source/doc that explains this part? I was looking at
>> the index section of rocketmq-store but I have not been able to understand
>> the indexing process completely for now. I will keep reading the source to
>> get a better idea.
>> 
>> Moving on to the implementation details. Here is a broad idea of one
>> possible way to approach it.
>> 
>> The attempt is to remove duplicate messages. In this issue, I would like to
>> aim at eliminating duplicate messages at the producer/broker end. For now,
>> we do not concern ourselves with the duplicate messages happening due to
>> unwritten consumer offsets as these two issues have different solutions.
>> One way to solve this problem at the producer/broker end could be to have a
>> distributed key store that stores the messages. We can make it configurable
>> such that this distributed store stores all messages or works as a sliding
>> window keeping only the messages from the last X seconds specified by the
>> user. We can have a layer on top to check set membership such as a bloom
>> filter or a cuckoo filter
>> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) to help
>> performance. Every message pushed in by a producer is first checked against
>> the filter and, in case of a positive result, against this key store.
>> If the message is found then it is discarded. This helps remove duplicates
>> completely from a producer perspective. The core of this idea is the
>> distributed key store which would be completely separate from the current
>> message storage. Since the concept of a distributed key store or a
>> key/value store is not novel, there are two ways to do this:
>> 1. Implement it ourselves. This would be high effort but no external
>> dependencies.
>> 2. Use a key-value store such as Redis (which already has timeouts and
>> persistence but a large memory footprint) or some other disk-based storage
>> for set membership. This would include an external dependency but
>> development time will reduce significantly for such a solution.
>> I am inclined towards implementing it by myself as this would avoid
>> dependencies on other products especially since RocketMQ is currently a
>> self-reliant system. In addition, my past experience with building such a
>> store should also come in handy.
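A minimal sketch of the sliding-window variant of the key store described above, assuming a single process and timestamps that never go backwards. `SlidingWindowKeys` is a hypothetical name; a real store would be distributed and persistent, and the time is passed in explicitly here only to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: remember message keys only for the last windowMillis. */
final class SlidingWindowKeys {
    private static final class Entry {
        final String key;
        final long seenAt;

        Entry(String key, long seenAt) {
            this.key = key;
            this.seenAt = seenAt;
        }
    }

    private final long windowMillis;
    private final Deque<Entry> order = new ArrayDeque<>(); // oldest entry first
    private final Set<String> live = new HashSet<>();

    SlidingWindowKeys(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the key was not seen within the window, false for a duplicate. */
    boolean addIfAbsent(String key, long nowMillis) {
        evictExpired(nowMillis);
        if (!live.add(key)) {
            return false; // duplicate; the original timestamp is kept, not refreshed
        }
        order.addLast(new Entry(key, nowMillis));
        return true;
    }

    // Drops every key whose first sighting is at least windowMillis old.
    private void evictExpired(long nowMillis) {
        while (!order.isEmpty() && nowMillis - order.peekFirst().seenAt >= windowMillis) {
            live.remove(order.removeFirst().key);
        }
    }

    public static void main(String[] args) {
        SlidingWindowKeys window = new SlidingWindowKeys(1000);
        System.out.println(window.addIfAbsent("order-1", 0));    // true
        System.out.println(window.addIfAbsent("order-1", 500));  // false: inside the window
        System.out.println(window.addIfAbsent("order-1", 1000)); // true: window has passed
    }
}
```

Setting the window to a very large value approximates the "store all messages" configuration mentioned above, at the cost of unbounded memory in this in-memory form.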
>> 
>> I would like to hear the opinions of the development community on this
>> approach, along with any suggested improvements. Looking forward to your
>> responses.
>> 
>> ====<question unrelated to issue>=====
>> To increase my familiarity with the code base and to help prove that I am
>> familiar with the tools and technologies in place it would be great if I
>> could be pointed to some low effort issues that I could help out with. In
>> case there are no 'newbie' issues available I could help improve the
>> comments inside the codebase. I noticed some source files with no
>> explanations which can be documented via comments to help onboard a new
>> contributor faster.
>> ====</question unrelated to issue>=====
>> 
>> Thanks a lot for reading this through and looking forward to your opinions.
>> 
>> Regards,
>> Sohaib
>> 
>> 
>> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
>> 
>>> Hi Sohaib,
>>> 
>>> Happy to know you are interested in RocketMQ.
>>> 
>>> First, let me answer questions you raised.
>>> 
>>> — can there be multiple tags?
>>> No. At present, the storage engine allows a single tag only. Subscriptions
>>> may use a combination of tags. The current model should meet your
>>> business needs. If not, please let us know.
>>> 
>>> 
>>> — key (Similar question to above.)
>>> RocketMQ builds index using message keys. A single message may have
>>> multiple keys.
>>> 
>>> — About redundant message
>>> From my understanding, you are trying to eliminate duplicate messages.
>>> True, there are various causes of message duplication, ranging from
>>> message delivery to consumption. Discussion on this topic is warmly
>>> welcome. If you have any ideas to contribute on this issue, the developer
>>> board is happy to discuss them.
>>> 
>>> Zhanhui Li
>>> 
>>> 
>>> 
>>> 
>>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
>>>> 
>>>> My earlier email message seems to have gotten lost. So I will try again.
>>>> Please see the original message for the discussion.
>>>> 
>>>> Regards,
>>>> Sohaib Iftikhar
>>>> 
>>>> -- Man is still the most extraordinary computer of all.--
>>>> 
>>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am interested in working on this issue
>>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of GSOC-18. I
>>>>> have a few questions about it. I am not sure if this discussion needs to
>>>>> be on the JIRA issue or here. Feel free to correct me if this is the wrong
>>>>> platform. Also, while I have worked with distributed pub-sub systems, I am
>>>>> still fairly new to RocketMQ, so maybe my understanding of it is
>>>>> incorrect. I apologise if that is the case and would be happy to stand
>>>>> corrected.
>>>>> 
>>>>> Following are my questions:
>>>>> 1. What defines a redundant message?
>>>>>   The constructor that I see for a message is as follows:
>>>>>   Message(String topic, String tags, String keys, int flag, byte[] body,
>>>>>   boolean waitStoreMsgOK)
>>>>>   Possible candidates to me are topic, tags (can there be multiple tags?
>>>>>   I could not find an example for this. If yes, how are they separated?),
>>>>>   keys (similar question to above) and of course the body. Is there
>>>>>   something that I have missed in this? Is there something that we do not
>>>>>   need to consider?
>>>>> 2. Is there a timeline on the redundant messages? What I mean by this is:
>>>>> is there a time limit after which a message with similar content is
>>>>> allowed? From what I gather there was no such thing mentioned. This would
>>>>> mean storing all the messages. Depending on the requirements this may or
>>>>> may not be the best solution. It might be desirable that no duplicates are
>>>>> allowed within a certain (sliding) time window. This allows ignoring
>>>>> duplicate messages that were generated very close to each other (or within
>>>>> the window indicated). Depending on this requirement, the implementation
>>>>> may become a little bit more involved.
>>>>> 
>>>>> For now, these are the only questions. I have ideas that need review about
>>>>> possible implementations but I will mention them once the specifications
>>>>> are clear to me. As an end question, I would at some point like to post
>>>>> design ideas for this problem privately to get them reviewed by the
>>>>> development community, but not make them publicly available so that they
>>>>> cannot be plagiarised. What platform/method can I use to do that? Or is
>>>>> submitting a draft to the Google platform the only possible way to
>>>>> accomplish this?
>>>>> 
>>>>> Thanks a lot for reading this through and looking forward to your inputs.
>>>>> 
>>>>> Regards,
>>>>> Sohaib Iftikhar
>>>>> 
>>> 
>>> 
> 
> 
> 
