@Zhanhui Thanks for the response. This is not a campaign its just part of GSoC (https://summerofcode.withgoogle.com/). And community help is gladly welcomed. In fact, it is recommended :)
@KaiYuan Thanks for your suggestions. I will come up with a flow chart for the proposed solution this weekend. Thanks, Sohaib On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhan...@gmail.com> wrote: > Hi Sohaib, > > I have been sort of busy this these days. Sorry to reply you so late! > > So sure what “deadline” you are referring to. If this is part of a > campaign, I have to admit I am not aware of the regulations and what kind > of help I should offer to maintain fairness considering other arising > similar issues. > > Regards! > > Zhanhui Li > > > > 在 2018年3月1日,上午3:43,Sohaib Iftikhar <sohaib1...@gmail.com> 写道: > > > > Hi guys, > > > > Would be nice to have some feedback on this as the deadline is not too > far :) > > > > Thanks, > > Sohaib > > > > Regards, > > Sohaib Iftikhar > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1...@gmail.com > <mailto:sohaib1...@gmail.com>> wrote: > > Thank you for the pointers to the code. This was super helpful. The > multiple keys can probably be serialized better than separating them with a > space but that is already legacy I suppose. > > > > Firstly filters like bloom or cuckoo are heuristic. They can help make > things faster but definitely cannot be used as the only solution. Hence, in > the end, we will still need a persistent keystore/distributed set. My plan > was to have this keystore as distributed (raft guarantee etc.). The > keystore can also hold a persistent filter on its end. If a broker > collapses it can renew/refresh its filter from the keystore. Hence > eliminating the problems about crashes that you mention. The problem here > could be in maintaining performance for filters in case of removals from > the keystore (for eg: sliding windows as mentioned in my previous mail). > Periodic refreshal of filters can help solve this but I am open to > suggestions on how to make this better. > > > > I think implementing a distributed set on the client cluster has its > caveats. The way I understand RocketMQ is that we do not have control over > the diskspace/memory on the client end. So we probably only have a constant > amount. A distributed set on the client would also need to be persistent. > For eg: if a client restarts/recovers etc. This basically means we need a > keystore on the client instead of the broker cluster. This probably puts > too much responsibility on the client cluster. A different approach would > be to ensure that the offsets are always in sync with the broker. Since the > broker only serves unique messages (based on the proposed solution on the > producer/broker end) all we need to ensure is that a client does not > consume messages with the same offset twice. > > > > Please suggest improvements if this does not look like the correct > approach. Also would be great if someone can come up with a completely > different approach so that we can weigh up pros and cons. > > > > Thanks for reading this through and looking forward to your opinions. > > > > Regards, > > Sohaib > > > > Regards, > > Sohaib Iftikhar > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com > <mailto:lizhan...@gmail.com>> wrote: > > Hi Sohaib, > > > > About multiple key support, the following code snippet should clarify > your doubt: > > org.apache.rocketmq.common.message.Message class has overloaded setKeys > methods, allowing your to set multiple keys via string(separated by > space…sorry, we have not yet unified all separators, hoping this does not > confuse you) or collection. > > > > > > When broker tries to build index for the message with multiple keys, > multiple index entries are inserted into the indexing file. > > See org.apache.rocketmq.store.index.IndexService#buildIndex > > > > > > In terms of eliminating message duplication, personally, I wish we have > a feature of exactly-once semantic covering the whole cluster and the > complete send-store-consume processes. A rough idea is route the message > according to its unique key to a broker according to a rule; The serving > broker ensures uniqueness of the message according to the key( as you said, > bloom-filter/cuckoo-filter, etc); Things might looks simple, but issues > resides in scenarios where cluster is experiencing membership changes: for > example, what if a broker crashed down? We might need propagate > bloom-filter bitset synchronously to other brokers having the same topics; > What if a new broker joins in the cluster and starts to serve? I do not > mean this is too complex to implement. Instead, this is a pretty > interesting topic and fancy feature to have. Alternatively, we might defer > eliminating duplicates to the consumption phase using kind of distributed > set. For sure, my proposing idea suffers the same challenges including > membership changes. > > > > Guys of dev board, any insights on this issue? > > > > Zhanhui Li > > > > > >> 在 2018年2月26日,上午2:47,Sohaib Iftikhar <sohaib1...@gmail.com <mailto: > sohaib1...@gmail.com>> 写道: > >> > >> Hi Zhanhui, > >> > >> I have a doubt about these multiple keys. If I am wrong in any of the > >> assumptions I make please point it out. > >> > >> If there is support for multiple keys I cannot see this in the code. The > >> class Message only stores a single key in the property map against the > >> property name "KEYS". Is this also done in the same ways as tags? That > is > >> different keys are separated with ' || '? So basically as a user of the > >> producer API it is the user's responsibility to ensure that he separates > >> the different keys with the correct separator. I can see an obvious > problem > >> here. What if the key contains this special character ' || '? But maybe > >> this event is rare and hence this is not important. Could you point me > to > >> some source/doc that explains this part? I was looking at the index > section > >> rocketmq-store but I have not been able to understand the indexing > process > >> completely for now. I will keep reading the source to get a better idea. > >> > >> Moving on to the implementational details. Here is a broad idea of one > >> possible way to approach it. > >> > >> The attempt is to remove duplicate messages. In this issue, I would > like to > >> aim at eliminating duplicate messages at the producer/broker end. For > now, > >> we do not concern ourselves with the duplicate messages happening due to > >> unwritten consumer offsets as these two issues have different solutions. > >> One way to solve this problem at the producer/broker end could be to > have a > >> distributed key store that stores the messages. We can make it > configurable > >> such that this distributed store stores all messages or works as a > sliding > >> window keeping only the messages from the last X seconds specified by > the > >> user. We can have a layer on top to check set membership such as a bloom > >> filter or a cuckoo filter ( > >> https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf < > https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>) to help > >> performance. Every message being pushed in by a producer are checked in > >> first with the filter and in case of a positive result with this key > store. > >> If the message is found then it is discarded. This helps remove > duplicates > >> completely from a producer perspective. The core of this idea is the > >> distributed key store which would be completely separate from the > current > >> message storage. Since the concept of a distributed key store or a > >> key/value store is not novel there are two ways to this. > >> 1. Implement it ourselves. This would be high effort but no external > >> dependencies. > >> 2. Use a key-value store such as Redis (which already has timeouts and > >> persistence but a large memory footprint) or some other disk-based > storage > >> for set membership. This would include an external dependency but > >> development time will reduce significantly for such a solution. > >> I am inclined towards implementing it by myself as this would avoid > >> dependencies on other products especially since RocketMQ is currently a > >> self-reliant system. In addition, my past experience with building such > a > >> store should also come in handy. > >> > >> I would like to know the opinions of the development community on this > >> approach and to suggest improvements on it. Looking forward to your > >> responses to this. > >> > >> ====<question unrelated to issue>===== > >> To increase my familiarity with the code base and to help prove that I > am > >> familiar with the tools and technologies in place it would be great if I > >> could be pointed to some low effort issues that I could help out with. > In > >> case there are no 'newbie' issues available I could help improve the > >> comments inside the codebase. I noticed some source files with no > >> explanations which can be documented via comments to help onboard a new > >> contributor faster. > >> ====</question unrelated to issue>===== > >> > >> Thanks a lot for reading this through and looking forward to your > opinions. > >> > >> Regards, > >> Sohaib > >> > >> > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhan...@gmail.com > <mailto:lizhan...@gmail.com>> wrote: > >> > >>> Hi Sohaib, > >>> > >>> Happy to know you are interested in RocketMQ. > >>> > >>> First, let me answer questions you raised. > >>> > >>> — can there be multiple tags? > >>> No. At present, the storage engine allows single tag only. > Subscriptions > >>> are allowed to use combination of tags. The current model should meet > your > >>> business development. If not, please let us know. > >>> > >>> > >>> — key (Similar question to above.) > >>> RocketMQ builds index using message keys. A single message may have > >>> multiple keys. > >>> > >>> — About redundant message > >>> From my understanding, you are trying to eliminate duplicate messages. > >>> True there are various reasons which may cause message duplication, > ranging > >>> from message delivery and consumption. Discussion on this topic is > warmly > >>> welcome. Had you had any idea to contribute on this issue, the > developer > >>> board is happy to discuss. > >>> > >>> Zhanhui Li > >>> > >>> > >>> > >>> > >>>> 在 2018年2月24日,上午11:17,Sohaib Iftikhar <sohaib1...@gmail.com <mailto: > sohaib1...@gmail.com>> 写道: > >>>> > >>>> My earlier email message seems to have gotten lost. So I will try > again. > >>>> Please see the original message for the discussion. > >>>> > >>>> Regards, > >>>> Sohaib Iftikhar > >>>> > >>>> -- Man is still the most extraordinary computer of all.-- > >>>> > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar < > sohaib1...@gmail.com <mailto:sohaib1...@gmail.com>> > >>>> wrote: > >>>> > >>>>> Hi, > >>>>> > >>>>> I am interested in working on this issue (https://issues.apache.org/ > <https://issues.apache.org/> > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a few questions > for > >>>>> the same. I am not sure if this discussion needs to be on the JIRA > >>> issue or > >>>>> here. Feel free to correct me if this is the wrong platform. Also > while > >>> I > >>>>> have worked with distributed pub-sub systems I am still fairly new to > >>>>> Rocket-MQ so maybe my understanding of it is incorrect. I apologise > if > >>> that > >>>>> is the case and would be happy to stand corrected. > >>>>> > >>>>> Following are my questions: > >>>>> 1. What defines a redundant message? > >>>>> The constructor that I see for a message is as follows: > >>>>> Message(String topic, String tags, String keys, int flag, byte[] > >>> body, > >>>>> boolean waitStoreMsgOK) > >>>>> Possible candidates to me are topic, tags (can there be multiple > >>> tags? > >>>>> I could not find an example for this. If yes how are they > separated?), > >>> keys > >>>>> (Similar question to above.) and of course the body. Is there > something > >>>>> that I have missed in this? Is there something that we do not need to > >>>>> consider? > >>>>> 2. Is their a timeline on the redundant messages? What I mean by > this is > >>>>> that is there a time limit after which a message with similar > content is > >>>>> allowed. From what I gather there was no such thing mentioned. This > >>> would > >>>>> mean storing all the messages. Depending on the requirements this > may or > >>>>> may not be the best solution. It might be desirable that no > duplicates > >>> are > >>>>> needed within a certain time window (sliding). This allows ignoring > of > >>>>> duplicate messages that were generated very close to each other (or > in > >>> the > >>>>> window indicated). Depending on this requirement implementation may > >>> become > >>>>> a little bit more involved. > >>>>> > >>>>> For now, these are the only questions. I have ideas that need review > >>> about > >>>>> possible implementations but I will mention them once the > specifications > >>>>> are clear to me. As an end question, I would at some point like to > post > >>>>> design ideas to this problem privately to get it reviewed by the > >>>>> development community but not make it publicly available so that it > >>> cannot > >>>>> be plagiarised. What platform/method can I use to do that? Or is > >>> submitting > >>>>> a draft to the Google platform the only possible way to accomplish > this? > >>>>> > >>>>> Thanks a lot for reading this through and looking forward to your > >>> inputs. > >>>>> > >>>>> Regards, > >>>>> Sohaib Iftikhar > >>>>> > >>> > >>> > > > > > > > >