Hi Yukon, Thanks for your reply. Yes, it would be nice to concretely define the scope of this project as the doc is a bit ambitious for just a summer. Should you (or anyone else) have questions/suggestions/clarifications I'd be glad to discuss more details.
Thanks, Sohaib On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote: > Hi, > > Google doc is better for discussion, your design is great, now we could > discuss more details base on it. > > Any advice is welcome from RocketMQ community. > > Appreciate your efforts. > > Regards, > yukon > > On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <sohaib1...@gmail.com> > wrote: > > > Hi, > > > > @Yukon Thank you for your reply. This clears some doubts. > > > > Sorry for the delay as I was somewhat occupied with another project. I > have > > created an initial design doc. Email is a bit cumbersome for feedback I > > wrote this document in two formats: > > > > 1. In the form of a Google document: > > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP > > 1Q-M6rj3yZde24. > > The document is open for comments to all users without signing in. I > would > > appreciate it if you put your name before the comment so I can identify > who > > to follow up the discussion with. > > > > 2. As a markdown on github: > > https://github.com/sohaibiftikhar/rocketmq/blob/ > gsoc_design/gsoc_design.md > > . > > The comments for this can be made on the commit: > > https://github.com/sohaibiftikhar/rocketmq/commit/ > > dfd55fc69f430fc024217a3b20dde31717334e62 > > > > After I have received a certain amount of feedback I will try to > > incorporate it and put in a subsequent version for review. Please tell me > > which methods suits you better (gdoc or github) for review and we can > > continue with that for the subsequent versions. > > > > Lastly, the document is a couple of pages so I appreciate your patience > and > > your help. > > Looking forward to your opinions. > > > > Thanks, > > Sohaib > > > > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote: > > > > > Hi Sohaib, > > > > > > Sorry for the late reply, we could move this project forward now ~ > > > > > > ``` > > > I would at some point like to post > > > design ideas to this problem privately to get it reviewed by the > > > development community but not make it publicly available so that it > > cannot > > > be plagiarised. > > > ``` > > > > > > You can send your design ideas to me directly or to our PMC list( > > > priv...@rocketmq.apache.org) if you want to make your ideas privately. > > But > > > please don't break away from the community. > > > > > > I hope you have already understood the goal of this project. Now, > > RocketMQ > > > support At-least-once delivery, it's an obvious solution > > > that achieves Exactly-Once by removing duplicated messages. > > > > > > Return to your original questions: > > > > > > 1. What defines a redundant message? > > > > > > A message id will be generated when new a message, so this id can be > used > > > to identify a message. Also, the user could specify a unique > > > business-related property to identify a message. > > > > > > The redundant messages will occur when the network is broken or > > > reconnected, rebalance[1] is triggered, etc. > > > > > > > > > 2. Is their a timeline on the redundant messages? > > > > > > Yes, keep all messages nonredundant is expensive, let's consider this > > > question within a certain time window ~ > > > > > > Looking forward to your design. > > > > > > [1]. > > > https://github.com/apache/rocketmq/blob/master/client/ > > > src/main/java/org/apache/rocketmq/client/impl/consumer/ > > > RebalanceService.java > > > > > > > > > Regards, > > > yukon > > > > > > > > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1...@gmail.com> > > > wrote: > > > > > > > @Zhanhui Thanks for the response. This is not a campaign its just > part > > of > > > > GSoC (https://summerofcode.withgoogle.com/). And community help is > > > gladly > > > > welcomed. In fact, it is recommended :) > > > > > > > > @KaiYuan Thanks for your suggestions. I will come up with a flow > chart > > > for > > > > the proposed solution this weekend. > > > > > > > > Thanks, > > > > Sohaib > > > > > > > > > > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhan...@gmail.com> > > wrote: > > > > > > > > > Hi Sohaib, > > > > > > > > > > I have been sort of busy this these days. Sorry to reply you so > late! > > > > > > > > > > So sure what “deadline” you are referring to. If this is part of a > > > > > campaign, I have to admit I am not aware of the regulations and > what > > > kind > > > > > of help I should offer to maintain fairness considering other > arising > > > > > similar issues. > > > > > > > > > > Regards! > > > > > > > > > > Zhanhui Li > > > > > > > > > > > > > > > > 在 2018年3月1日,上午3:43,Sohaib Iftikhar <sohaib1...@gmail.com> 写道: > > > > > > > > > > > > Hi guys, > > > > > > > > > > > > Would be nice to have some feedback on this as the deadline is > not > > > too > > > > > far :) > > > > > > > > > > > > Thanks, > > > > > > Sohaib > > > > > > > > > > > > Regards, > > > > > > Sohaib Iftikhar > > > > > > > > > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > > > > > > > > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar < > > > > sohaib1...@gmail.com > > > > > <mailto:sohaib1...@gmail.com>> wrote: > > > > > > Thank you for the pointers to the code. This was super helpful. > The > > > > > multiple keys can probably be serialized better than separating > them > > > > with a > > > > > space but that is already legacy I suppose. > > > > > > > > > > > > Firstly filters like bloom or cuckoo are heuristic. They can help > > > make > > > > > things faster but definitely cannot be used as the only solution. > > > Hence, > > > > in > > > > > the end, we will still need a persistent keystore/distributed set. > My > > > > plan > > > > > was to have this keystore as distributed (raft guarantee etc.). The > > > > > keystore can also hold a persistent filter on its end. If a broker > > > > > collapses it can renew/refresh its filter from the keystore. Hence > > > > > eliminating the problems about crashes that you mention. The > problem > > > here > > > > > could be in maintaining performance for filters in case of removals > > > from > > > > > the keystore (for eg: sliding windows as mentioned in my previous > > > mail). > > > > > Periodic refreshal of filters can help solve this but I am open to > > > > > suggestions on how to make this better. > > > > > > > > > > > > I think implementing a distributed set on the client cluster has > > its > > > > > caveats. The way I understand RocketMQ is that we do not have > control > > > > over > > > > > the diskspace/memory on the client end. So we probably only have a > > > > constant > > > > > amount. A distributed set on the client would also need to be > > > persistent. > > > > > For eg: if a client restarts/recovers etc. This basically means we > > > need a > > > > > keystore on the client instead of the broker cluster. This probably > > > puts > > > > > too much responsibility on the client cluster. A different approach > > > would > > > > > be to ensure that the offsets are always in sync with the broker. > > Since > > > > the > > > > > broker only serves unique messages (based on the proposed solution > on > > > the > > > > > producer/broker end) all we need to ensure is that a client does > not > > > > > consume messages with the same offset twice. > > > > > > > > > > > > Please suggest improvements if this does not look like the > correct > > > > > approach. Also would be great if someone can come up with a > > completely > > > > > different approach so that we can weigh up pros and cons. > > > > > > > > > > > > Thanks for reading this through and looking forward to your > > opinions. > > > > > > > > > > > > Regards, > > > > > > Sohaib > > > > > > > > > > > > Regards, > > > > > > Sohaib Iftikhar > > > > > > > > > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > > > > > > > > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com > > > > > <mailto:lizhan...@gmail.com>> wrote: > > > > > > Hi Sohaib, > > > > > > > > > > > > About multiple key support, the following code snippet should > > clarify > > > > > your doubt: > > > > > > org.apache.rocketmq.common.message.Message class has overloaded > > > > setKeys > > > > > methods, allowing your to set multiple keys via string(separated by > > > > > space…sorry, we have not yet unified all separators, hoping this > does > > > not > > > > > confuse you) or collection. > > > > > > > > > > > > > > > > > > When broker tries to build index for the message with multiple > > keys, > > > > > multiple index entries are inserted into the indexing file. > > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex > > > > > > > > > > > > > > > > > > In terms of eliminating message duplication, personally, I wish > we > > > have > > > > > a feature of exactly-once semantic covering the whole cluster and > the > > > > > complete send-store-consume processes. A rough idea is route the > > > message > > > > > according to its unique key to a broker according to a rule; The > > > serving > > > > > broker ensures uniqueness of the message according to the key( as > you > > > > said, > > > > > bloom-filter/cuckoo-filter, etc); Things might looks simple, but > > > issues > > > > > resides in scenarios where cluster is experiencing membership > > changes: > > > > for > > > > > example, what if a broker crashed down? We might need propagate > > > > > bloom-filter bitset synchronously to other brokers having the same > > > > topics; > > > > > What if a new broker joins in the cluster and starts to serve? I do > > not > > > > > mean this is too complex to implement. Instead, this is a pretty > > > > > interesting topic and fancy feature to have. Alternatively, we > might > > > > defer > > > > > eliminating duplicates to the consumption phase using kind of > > > distributed > > > > > set. For sure, my proposing idea suffers the same challenges > > including > > > > > membership changes. > > > > > > > > > > > > Guys of dev board, any insights on this issue? > > > > > > > > > > > > Zhanhui Li > > > > > > > > > > > > > > > > > >> 在 2018年2月26日,上午2:47,Sohaib Iftikhar <sohaib1...@gmail.com > > <mailto: > > > > > sohaib1...@gmail.com>> 写道: > > > > > >> > > > > > >> Hi Zhanhui, > > > > > >> > > > > > >> I have a doubt about these multiple keys. If I am wrong in any > of > > > the > > > > > >> assumptions I make please point it out. > > > > > >> > > > > > >> If there is support for multiple keys I cannot see this in the > > code. > > > > The > > > > > >> class Message only stores a single key in the property map > against > > > the > > > > > >> property name "KEYS". Is this also done in the same ways as > tags? > > > That > > > > > is > > > > > >> different keys are separated with ' || '? So basically as a user > > of > > > > the > > > > > >> producer API it is the user's responsibility to ensure that he > > > > separates > > > > > >> the different keys with the correct separator. I can see an > > obvious > > > > > problem > > > > > >> here. What if the key contains this special character ' || '? > But > > > > maybe > > > > > >> this event is rare and hence this is not important. Could you > > point > > > me > > > > > to > > > > > >> some source/doc that explains this part? I was looking at the > > index > > > > > section > > > > > >> rocketmq-store but I have not been able to understand the > indexing > > > > > process > > > > > >> completely for now. I will keep reading the source to get a > better > > > > idea. > > > > > >> > > > > > >> Moving on to the implementational details. Here is a broad idea > of > > > one > > > > > >> possible way to approach it. > > > > > >> > > > > > >> The attempt is to remove duplicate messages. In this issue, I > > would > > > > > like to > > > > > >> aim at eliminating duplicate messages at the producer/broker > end. > > > For > > > > > now, > > > > > >> we do not concern ourselves with the duplicate messages > happening > > > due > > > > to > > > > > >> unwritten consumer offsets as these two issues have different > > > > solutions. > > > > > >> One way to solve this problem at the producer/broker end could > be > > to > > > > > have a > > > > > >> distributed key store that stores the messages. We can make it > > > > > configurable > > > > > >> such that this distributed store stores all messages or works > as a > > > > > sliding > > > > > >> window keeping only the messages from the last X seconds > specified > > > by > > > > > the > > > > > >> user. We can have a layer on top to check set membership such > as a > > > > bloom > > > > > >> filter or a cuckoo filter ( > > > > > >> https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf < > > > > > https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>) to help > > > > > >> performance. Every message being pushed in by a producer are > > checked > > > > in > > > > > >> first with the filter and in case of a positive result with this > > key > > > > > store. > > > > > >> If the message is found then it is discarded. This helps remove > > > > > duplicates > > > > > >> completely from a producer perspective. The core of this idea is > > the > > > > > >> distributed key store which would be completely separate from > the > > > > > current > > > > > >> message storage. Since the concept of a distributed key store > or a > > > > > >> key/value store is not novel there are two ways to this. > > > > > >> 1. Implement it ourselves. This would be high effort but no > > external > > > > > >> dependencies. > > > > > >> 2. Use a key-value store such as Redis (which already has > timeouts > > > and > > > > > >> persistence but a large memory footprint) or some other > disk-based > > > > > storage > > > > > >> for set membership. This would include an external dependency > but > > > > > >> development time will reduce significantly for such a solution. > > > > > >> I am inclined towards implementing it by myself as this would > > avoid > > > > > >> dependencies on other products especially since RocketMQ is > > > currently > > > > a > > > > > >> self-reliant system. In addition, my past experience with > building > > > > such > > > > > a > > > > > >> store should also come in handy. > > > > > >> > > > > > >> I would like to know the opinions of the development community > on > > > this > > > > > >> approach and to suggest improvements on it. Looking forward to > > your > > > > > >> responses to this. > > > > > >> > > > > > >> ====<question unrelated to issue>===== > > > > > >> To increase my familiarity with the code base and to help prove > > > that I > > > > > am > > > > > >> familiar with the tools and technologies in place it would be > > great > > > > if I > > > > > >> could be pointed to some low effort issues that I could help out > > > with. > > > > > In > > > > > >> case there are no 'newbie' issues available I could help improve > > the > > > > > >> comments inside the codebase. I noticed some source files with > no > > > > > >> explanations which can be documented via comments to help > onboard > > a > > > > new > > > > > >> contributor faster. > > > > > >> ====</question unrelated to issue>===== > > > > > >> > > > > > >> Thanks a lot for reading this through and looking forward to > your > > > > > opinions. > > > > > >> > > > > > >> Regards, > > > > > >> Sohaib > > > > > >> > > > > > >> > > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li < > lizhan...@gmail.com > > > > > <mailto:lizhan...@gmail.com>> wrote: > > > > > >> > > > > > >>> Hi Sohaib, > > > > > >>> > > > > > >>> Happy to know you are interested in RocketMQ. > > > > > >>> > > > > > >>> First, let me answer questions you raised. > > > > > >>> > > > > > >>> — can there be multiple tags? > > > > > >>> No. At present, the storage engine allows single tag only. > > > > > Subscriptions > > > > > >>> are allowed to use combination of tags. The current model > should > > > meet > > > > > your > > > > > >>> business development. If not, please let us know. > > > > > >>> > > > > > >>> > > > > > >>> — key (Similar question to above.) > > > > > >>> RocketMQ builds index using message keys. A single message may > > have > > > > > >>> multiple keys. > > > > > >>> > > > > > >>> — About redundant message > > > > > >>> From my understanding, you are trying to eliminate duplicate > > > > messages. > > > > > >>> True there are various reasons which may cause message > > duplication, > > > > > ranging > > > > > >>> from message delivery and consumption. Discussion on this topic > > is > > > > > warmly > > > > > >>> welcome. Had you had any idea to contribute on this issue, the > > > > > developer > > > > > >>> board is happy to discuss. > > > > > >>> > > > > > >>> Zhanhui Li > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>>> 在 2018年2月24日,上午11:17,Sohaib Iftikhar <sohaib1...@gmail.com > > > <mailto: > > > > > sohaib1...@gmail.com>> 写道: > > > > > >>>> > > > > > >>>> My earlier email message seems to have gotten lost. So I will > > try > > > > > again. > > > > > >>>> Please see the original message for the discussion. > > > > > >>>> > > > > > >>>> Regards, > > > > > >>>> Sohaib Iftikhar > > > > > >>>> > > > > > >>>> -- Man is still the most extraordinary computer of all.-- > > > > > >>>> > > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar < > > > > > sohaib1...@gmail.com <mailto:sohaib1...@gmail.com>> > > > > > >>>> wrote: > > > > > >>>> > > > > > >>>>> Hi, > > > > > >>>>> > > > > > >>>>> I am interested in working on this issue ( > > > > https://issues.apache.org/ > > > > > <https://issues.apache.org/> > > > > > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a few > > > > questions > > > > > for > > > > > >>>>> the same. I am not sure if this discussion needs to be on the > > > JIRA > > > > > >>> issue or > > > > > >>>>> here. Feel free to correct me if this is the wrong platform. > > Also > > > > > while > > > > > >>> I > > > > > >>>>> have worked with distributed pub-sub systems I am still > fairly > > > new > > > > to > > > > > >>>>> Rocket-MQ so maybe my understanding of it is incorrect. I > > > apologise > > > > > if > > > > > >>> that > > > > > >>>>> is the case and would be happy to stand corrected. > > > > > >>>>> > > > > > >>>>> Following are my questions: > > > > > >>>>> 1. What defines a redundant message? > > > > > >>>>> The constructor that I see for a message is as follows: > > > > > >>>>> Message(String topic, String tags, String keys, int flag, > > > byte[] > > > > > >>> body, > > > > > >>>>> boolean waitStoreMsgOK) > > > > > >>>>> Possible candidates to me are topic, tags (can there be > > > multiple > > > > > >>> tags? > > > > > >>>>> I could not find an example for this. If yes how are they > > > > > separated?), > > > > > >>> keys > > > > > >>>>> (Similar question to above.) and of course the body. Is there > > > > > something > > > > > >>>>> that I have missed in this? Is there something that we do not > > > need > > > > to > > > > > >>>>> consider? > > > > > >>>>> 2. Is their a timeline on the redundant messages? What I mean > > by > > > > > this is > > > > > >>>>> that is there a time limit after which a message with similar > > > > > content is > > > > > >>>>> allowed. From what I gather there was no such thing > mentioned. > > > This > > > > > >>> would > > > > > >>>>> mean storing all the messages. Depending on the requirements > > this > > > > > may or > > > > > >>>>> may not be the best solution. It might be desirable that no > > > > > duplicates > > > > > >>> are > > > > > >>>>> needed within a certain time window (sliding). This allows > > > ignoring > > > > > of > > > > > >>>>> duplicate messages that were generated very close to each > other > > > (or > > > > > in > > > > > >>> the > > > > > >>>>> window indicated). Depending on this requirement > implementation > > > may > > > > > >>> become > > > > > >>>>> a little bit more involved. > > > > > >>>>> > > > > > >>>>> For now, these are the only questions. I have ideas that need > > > > review > > > > > >>> about > > > > > >>>>> possible implementations but I will mention them once the > > > > > specifications > > > > > >>>>> are clear to me. As an end question, I would at some point > like > > > to > > > > > post > > > > > >>>>> design ideas to this problem privately to get it reviewed by > > the > > > > > >>>>> development community but not make it publicly available so > > that > > > it > > > > > >>> cannot > > > > > >>>>> be plagiarised. What platform/method can I use to do that? Or > > is > > > > > >>> submitting > > > > > >>>>> a draft to the Google platform the only possible way to > > > accomplish > > > > > this? > > > > > >>>>> > > > > > >>>>> Thanks a lot for reading this through and looking forward to > > your > > > > > >>> inputs. > > > > > >>>>> > > > > > >>>>> Regards, > > > > > >>>>> Sohaib Iftikhar > > > > > >>>>> > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >