Hi Yukon,

Thanks for your reply. Yes, it would be nice to concretely define the scope
of this project as the doc is a bit ambitious for just a summer. Should you
(or anyone else) have questions/suggestions/clarifications I'd be glad to
discuss more details.

Thanks,
Sohaib

On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:

> Hi,
>
> The Google doc is better for discussion. Your design is great, and now we
> can discuss more details based on it.
>
> Any advice from the RocketMQ community is welcome.
>
> Appreciate your efforts.
>
> Regards,
> yukon
>
> On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <sohaib1...@gmail.com>
> wrote:
>
> > Hi,
> >
> > @Yukon Thank you for your reply. This clears some doubts.
> >
> > Sorry for the delay; I was somewhat occupied with another project. I have
> > created an initial design doc. Since email is a bit cumbersome for
> > feedback, I wrote this document in two formats:
> >
> > 1. In the form of a Google document:
> > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
> > The document is open for comments to all users without signing in. I
> > would appreciate it if you put your name before the comment so I can
> > identify who to follow up the discussion with.
> >
> > 2. As a markdown file on GitHub:
> > https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
> > Comments on this version can be made on the commit:
> > https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
> >
> > After I have received a certain amount of feedback I will try to
> > incorporate it and put up a subsequent version for review. Please tell me
> > which method suits you better (gdoc or GitHub) for review and we can
> > continue with that for the subsequent versions.
> >
> > Lastly, the document is a couple of pages long, so I appreciate your
> > patience and your help.
> > Looking forward to your opinions.
> >
> > Thanks,
> > Sohaib
> >
> > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> >
> > > Hi Sohaib,
> > >
> > > Sorry for the late reply. We can move this project forward now.
> > >
> > > ```
> > > I would at some point like to post
> > > design ideas to this problem privately to get it reviewed by the
> > > development community but not make it publicly available so that it
> > > cannot be plagiarised.
> > > ```
> > >
> > > You can send your design ideas to me directly or to our PMC list
> > > (priv...@rocketmq.apache.org) if you want to keep your ideas private.
> > > But please don't break away from the community.
> > >
> > > I hope you have already understood the goal of this project. RocketMQ
> > > currently supports at-least-once delivery, so an obvious way to achieve
> > > exactly-once is to remove duplicated messages.
> > >
> > > Return to your original questions:
> > >
> > > 1. What defines a redundant message?
> > >
> > > A message ID is generated when a new message is created, so this ID can
> > > be used to identify a message. Also, the user could specify a unique
> > > business-related property to identify a message.
> > >
> > > Redundant messages occur when the network is broken or reconnected,
> > > when a rebalance[1] is triggered, etc.
> > >
> > >
> > > 2. Is there a timeline on the redundant messages?
> > >
> > > Yes. Keeping all messages non-redundant is expensive, so let's consider
> > > this question within a certain time window.
> > >
> > > Looking forward to your design.
> > >
> > > [1].
> > > https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> > >
> > >
> > > Regards,
> > > yukon
> > >
> > >
> > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1...@gmail.com>
> > > wrote:
> > >
> > > > @Zhanhui Thanks for the response. This is not a campaign; it's just
> > > > part of GSoC (https://summerofcode.withgoogle.com/). Community help
> > > > is gladly welcomed. In fact, it is recommended :)
> > > >
> > > > @KaiYuan Thanks for your suggestions. I will come up with a flow chart
> > > > for the proposed solution this weekend.
> > > >
> > > > Thanks,
> > > > Sohaib
> > > >
> > > >
> > > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhan...@gmail.com>
> > > > > wrote:
> > > >
> > > > > Hi Sohaib,
> > > > >
> > > > > I have been rather busy these days. Sorry for replying so late!
> > > > >
> > > > > I am not sure what "deadline" you are referring to. If this is part
> > > > > of a campaign, I have to admit I am not aware of the regulations or
> > > > > what kind of help I should offer while maintaining fairness,
> > > > > considering other similar issues that may arise.
> > > > >
> > > > > Regards!
> > > > >
> > > > > Zhanhui Li
> > > > >
> > > > >
> > > > > > > On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> > > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > > Would be nice to have some feedback on this as the deadline is
> > > > > > > not too far :)
> > > > > >
> > > > > > Thanks,
> > > > > > Sohaib
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib Iftikhar
> > > > > >
> > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > >
> > > > > >
> > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> > > > > > > Thank you for the pointers to the code. This was super helpful.
> > > > > > > The multiple keys could probably be serialized better than by
> > > > > > > separating them with a space, but that is already legacy I
> > > > > > > suppose.
> > > > > > >
> > > > > > > Firstly, filters like bloom or cuckoo are probabilistic: they can
> > > > > > > report false positives. They can help make things faster but
> > > > > > > definitely cannot be used as the only solution. Hence, in the
> > > > > > > end, we will still need a persistent keystore/distributed set.
> > > > > > > My plan was to have this keystore distributed (with Raft
> > > > > > > guarantees, etc.). The keystore can also hold a persistent
> > > > > > > filter on its end. If a broker crashes, it can renew/refresh its
> > > > > > > filter from the keystore, which eliminates the crash problems
> > > > > > > that you mention. The problem here could be maintaining filter
> > > > > > > performance when keys are removed from the keystore (e.g. the
> > > > > > > sliding windows mentioned in my previous mail). Periodically
> > > > > > > refreshing the filters can help solve this, but I am open to
> > > > > > > suggestions on how to make this better.
> > > > > >
> > > > > > > I think implementing a distributed set on the client cluster has
> > > > > > > its caveats. The way I understand RocketMQ, we do not have
> > > > > > > control over the disk space/memory on the client end, so we
> > > > > > > probably only have a constant amount. A distributed set on the
> > > > > > > client would also need to be persistent, e.g. for when a client
> > > > > > > restarts or recovers. This basically means we need a keystore on
> > > > > > > the client instead of the broker cluster, which probably puts
> > > > > > > too much responsibility on the client cluster. A different
> > > > > > > approach would be to ensure that the offsets are always in sync
> > > > > > > with the broker. Since the broker only serves unique messages
> > > > > > > (based on the proposed solution on the producer/broker end), all
> > > > > > > we need to ensure is that a client does not consume messages
> > > > > > > with the same offset twice.
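
The "do not consume the same offset twice" idea above can be sketched roughly as follows. This is only an illustration, not RocketMQ code: the class name `OffsetGuardSketch`, the method `tryConsume`, and the use of plain integer queue ids are all assumptions, and a real consumer would have to persist this state (or keep it in sync with the broker) for it to survive restarts.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical consumer-side guard; queue ids and offsets are illustrative. */
final class OffsetGuardSketch {
    // Highest offset already consumed, per queue id.
    private final Map<Integer, Long> consumed = new HashMap<>();

    /**
     * Returns true if the message at (queueId, offset) has not been consumed
     * yet and records it; returns false for a repeated or older offset.
     * Assumes messages of one queue are consumed in increasing offset order.
     */
    synchronized boolean tryConsume(int queueId, long offset) {
        Long last = consumed.get(queueId);
        if (last != null && offset <= last) {
            return false; // already handled: skip instead of re-processing
        }
        consumed.put(queueId, offset);
        return true;
    }
}
```

The in-order assumption keeps the state to one number per queue; out-of-order consumption would need a set of offsets per queue instead.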
> > > > > >
> > > > > > > Please suggest improvements if this does not look like the
> > > > > > > correct approach. It would also be great if someone could come
> > > > > > > up with a completely different approach so that we can weigh up
> > > > > > > the pros and cons.
> > > > > > >
> > > > > > > Thanks for reading this through and looking forward to your
> > > > > > > opinions.
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib Iftikhar
> > > > > >
> > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > >
> > > > > >
> > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
> > > > > > Hi Sohaib,
> > > > > >
> > > > > > > About multiple key support, the following should clarify your
> > > > > > > doubt:
> > > > > > > The org.apache.rocketmq.common.message.Message class has
> > > > > > > overloaded setKeys methods, allowing you to set multiple keys
> > > > > > > via a string (separated by spaces; sorry, we have not yet
> > > > > > > unified all separators, I hope this does not confuse you) or a
> > > > > > > collection.
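
A small self-contained sketch of the space-separated key encoding described above, and of why a key that itself contains a space is a problem. It imitates the behaviour but does not use the RocketMQ library: `KeysSketch`, `joinKeys` and `splitKeys` are hypothetical names, and the single-space separator is taken from the description in this mail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Hypothetical helper; it imitates, but is not, RocketMQ's Message class. */
final class KeysSketch {
    // Assumption taken from the mail above: keys are separated by one space.
    static final String KEY_SEPARATOR = " ";

    /** Joins keys into the single string stored under the "KEYS" property. */
    static String joinKeys(Collection<String> keys) {
        return String.join(KEY_SEPARATOR, keys);
    }

    /** Splits the stored string back into keys, as an indexer would have to. */
    static List<String> splitKeys(String stored) {
        List<String> keys = new ArrayList<>();
        for (String key : stored.split(KEY_SEPARATOR)) {
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String stored = joinKeys(Arrays.asList("order-1001", "user 42"));
        // "user 42" contains the separator, so it is split into two keys.
        System.out.println(splitKeys(stored));
    }
}
```

Two keys go in and three come out, which is the ambiguity raised earlier in the thread about separators appearing inside a key.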
> > > > > >
> > > > > >
> > > > > > > When the broker builds the index for a message with multiple
> > > > > > > keys, multiple index entries are inserted into the index file.
> > > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex
> > > > > >
> > > > > >
> > > > > > > In terms of eliminating message duplication, personally, I wish
> > > > > > > we had an exactly-once semantics feature covering the whole
> > > > > > > cluster and the complete send-store-consume process. A rough
> > > > > > > idea is to route each message to a broker according to a rule
> > > > > > > based on its unique key; the serving broker then ensures
> > > > > > > uniqueness of the message according to the key (as you said,
> > > > > > > bloom filter, cuckoo filter, etc.). This might look simple, but
> > > > > > > issues arise in scenarios where the cluster is experiencing
> > > > > > > membership changes. For example, what if a broker crashes? We
> > > > > > > might need to propagate the bloom-filter bitset synchronously to
> > > > > > > other brokers having the same topics. What if a new broker joins
> > > > > > > the cluster and starts to serve? I do not mean this is too
> > > > > > > complex to implement. Instead, this is a pretty interesting
> > > > > > > topic and a fancy feature to have. Alternatively, we might defer
> > > > > > > eliminating duplicates to the consumption phase using some kind
> > > > > > > of distributed set. For sure, my proposed idea suffers from the
> > > > > > > same challenges, including membership changes.
> > > > > >
> > > > > > Guys of dev board, any insights on this issue?
> > > > > >
> > > > > > Zhanhui Li
> > > > > >
> > > > > >
> > > > > > >> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> > > > > >>
> > > > > >> Hi Zhanhui,
> > > > > >>
> > > > > > >> I have a doubt about these multiple keys. If I am wrong in any
> > > > > > >> of the assumptions I make please point it out.
> > > > > > >>
> > > > > > >> If there is support for multiple keys I cannot see this in the
> > > > > > >> code. The class Message only stores a single key in the property
> > > > > > >> map against the property name "KEYS". Is this also done in the
> > > > > > >> same way as tags? That is, different keys are separated with
> > > > > > >> ' || '? So basically it is the responsibility of the user of the
> > > > > > >> producer API to separate the different keys with the correct
> > > > > > >> separator. I can see an obvious problem here. What if the key
> > > > > > >> contains this special character ' || '? But maybe this event is
> > > > > > >> rare and hence this is not important. Could you point me to some
> > > > > > >> source/doc that explains this part? I was looking at the index
> > > > > > >> section of rocketmq-store but I have not been able to understand
> > > > > > >> the indexing process completely for now. I will keep reading the
> > > > > > >> source to get a better idea.
> > > > > >>
> > > > > > >> Moving on to the implementation details. Here is a broad idea of
> > > > > > >> one possible way to approach it.
> > > > > >>
> > > > > > >> The attempt is to remove duplicate messages. In this issue, I
> > > > > > >> would like to aim at eliminating duplicate messages at the
> > > > > > >> producer/broker end. For now, we do not concern ourselves with
> > > > > > >> the duplicate messages caused by unwritten consumer offsets, as
> > > > > > >> these two issues have different solutions.
> > > > > > >> One way to solve this problem at the producer/broker end could
> > > > > > >> be to have a distributed key store that stores the messages. We
> > > > > > >> can make it configurable such that this distributed store either
> > > > > > >> stores all messages or works as a sliding window keeping only
> > > > > > >> the messages from the last X seconds specified by the user. We
> > > > > > >> can have a layer on top to check set membership, such as a bloom
> > > > > > >> filter or a cuckoo filter
> > > > > > >> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), to
> > > > > > >> help performance. Every message pushed in by a producer is
> > > > > > >> checked first against the filter and, in case of a positive
> > > > > > >> result, against this key store. If the message is found then it
> > > > > > >> is discarded. This helps remove duplicates completely from a
> > > > > > >> producer perspective. The core of this idea is the distributed
> > > > > > >> key store, which would be completely separate from the current
> > > > > > >> message storage. Since the concept of a distributed key store or
> > > > > > >> a key/value store is not novel, there are two ways to do this.
> > > > > > >> 1. Implement it ourselves. This would be high effort but have no
> > > > > > >> external dependencies.
> > > > > > >> 2. Use a key-value store such as Redis (which already has
> > > > > > >> timeouts and persistence but a large memory footprint) or some
> > > > > > >> other disk-based storage for set membership. This would add an
> > > > > > >> external dependency but development time would reduce
> > > > > > >> significantly for such a solution.
> > > > > > >> I am inclined towards implementing it myself as this would avoid
> > > > > > >> dependencies on other products, especially since RocketMQ is
> > > > > > >> currently a self-reliant system. In addition, my past experience
> > > > > > >> with building such a store should also come in handy.
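
The filter-then-key-store check with a sliding window, as proposed above, might look roughly like this on a single node. Everything here is an assumption made for illustration: the class `DedupSketch`, the in-memory `HashMap` standing in for the distributed key store, the two hash positions derived from `String.hashCode()`, and the choice to rebuild the filter after expiry (a plain bloom filter cannot delete entries).

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical single-node sketch of the proposed filter + key store check. */
final class DedupSketch {
    private static final int FILTER_BITS = 1 << 16;

    private final BitSet filter = new BitSet(FILTER_BITS);     // bloom-style filter
    private final Map<String, Long> keyStore = new HashMap<>(); // key -> first-seen millis
    private final long windowMillis;

    DedupSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Two cheap bit positions derived from String.hashCode(); chosen for
    // brevity only, not a recommendation for production use.
    private int[] positions(String key) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return new int[] {
            Math.floorMod(h1, FILTER_BITS),
            Math.floorMod(h1 + 31 * h2, FILTER_BITS)
        };
    }

    /** Returns true if the message should be stored, false if it is a duplicate. */
    boolean accept(String key, long nowMillis) {
        expire(nowMillis);
        int[] pos = positions(key);
        boolean maybeSeen = filter.get(pos[0]) && filter.get(pos[1]);
        // A filter hit may be a false positive, so confirm against the key store.
        if (maybeSeen && keyStore.containsKey(key)) {
            return false;
        }
        filter.set(pos[0]);
        filter.set(pos[1]);
        keyStore.put(key, nowMillis);
        return true;
    }

    /** Drops keys older than the window, then rebuilds the filter from the store. */
    private void expire(long nowMillis) {
        boolean removed = false;
        Iterator<Map.Entry<String, Long>> it = keyStore.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > windowMillis) {
                it.remove();
                removed = true;
            }
        }
        if (removed) {
            filter.clear();
            for (String key : keyStore.keySet()) {
                int[] pos = positions(key);
                filter.set(pos[0]);
                filter.set(pos[1]);
            }
        }
    }
}
```

A filter miss skips the key store lookup entirely, which is where the performance benefit would come from; scanning the whole store on every call is a simplification that a real implementation would replace with time-ordered expiry.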
> > > > > >>
> > > > > > >> I would like to know the opinions of the development community
> > > > > > >> on this approach and to hear suggested improvements. Looking
> > > > > > >> forward to your responses to this.
> > > > > > >>
> > > > > > >> ====<question unrelated to issue>=====
> > > > > > >> To increase my familiarity with the code base and to help prove
> > > > > > >> that I am familiar with the tools and technologies in place, it
> > > > > > >> would be great if I could be pointed to some low-effort issues
> > > > > > >> that I could help out with. In case there are no 'newbie' issues
> > > > > > >> available I could help improve the comments inside the codebase.
> > > > > > >> I noticed some source files with no explanations, which could be
> > > > > > >> documented via comments to help onboard a new contributor
> > > > > > >> faster.
> > > > > > >> ====</question unrelated to issue>=====
> > > > > > >>
> > > > > > >> Thanks a lot for reading this through and looking forward to
> > > > > > >> your opinions.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Sohaib
> > > > > >>
> > > > > >>
> > > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Hi Sohaib,
> > > > > >>>
> > > > > >>> Happy to know you are interested in RocketMQ.
> > > > > >>>
> > > > > >>> First, let me answer questions you raised.
> > > > > >>>
> > > > > >>> — can there be multiple tags?
> > > > > > >>> No. At present, the storage engine allows a single tag only.
> > > > > > >>> Subscriptions are allowed to use a combination of tags. The
> > > > > > >>> current model should meet your business needs. If not, please
> > > > > > >>> let us know.
> > > > > >>>
> > > > > >>>
> > > > > >>> — key (Similar question to above.)
> > > > > > >>> RocketMQ builds its index using message keys. A single message
> > > > > > >>> may have multiple keys.
> > > > > >>>
> > > > > >>> — About redundant message
> > > > > > >>> From my understanding, you are trying to eliminate duplicate
> > > > > > >>> messages. True, there are various reasons which may cause
> > > > > > >>> message duplication, ranging from message delivery to
> > > > > > >>> consumption. Discussion on this topic is warmly welcome. If you
> > > > > > >>> have any ideas to contribute on this issue, the developer board
> > > > > > >>> is happy to discuss them.
> > > > > >>>
> > > > > >>> Zhanhui Li
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > > >>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> > > > > >>>>
> > > > > > >>>> My earlier email message seems to have gotten lost. So I will
> > > > > > >>>> try again.
> > > > > > >>>> Please see the original message for the discussion.
> > > > > >>>>
> > > > > >>>> Regards,
> > > > > >>>> Sohaib Iftikhar
> > > > > >>>>
> > > > > >>>> -- Man is still the most extraordinary computer of all.--
> > > > > >>>>
> > > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1...@gmail.com>
> > > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > > >>>>> I am interested in working on this issue
> > > > > > >>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of
> > > > > > >>>>> GSOC-18. I have a few questions about it. I am not sure if this
> > > > > > >>>>> discussion needs to be on the JIRA issue or here. Feel free to
> > > > > > >>>>> correct me if this is the wrong platform. Also, while I have
> > > > > > >>>>> worked with distributed pub-sub systems, I am still fairly new
> > > > > > >>>>> to RocketMQ, so maybe my understanding of it is incorrect. I
> > > > > > >>>>> apologise if that is the case and would be happy to stand
> > > > > > >>>>> corrected.
> > > > > >>>>>
> > > > > > >>>>> Following are my questions:
> > > > > > >>>>> 1. What defines a redundant message?
> > > > > > >>>>>   The constructor that I see for a message is as follows:
> > > > > > >>>>>   Message(String topic, String tags, String keys, int flag,
> > > > > > >>>>>   byte[] body, boolean waitStoreMsgOK)
> > > > > > >>>>>   Possible candidates to me are topic, tags (can there be
> > > > > > >>>>>   multiple tags? I could not find an example for this. If yes,
> > > > > > >>>>>   how are they separated?), keys (similar question to above)
> > > > > > >>>>>   and of course the body. Is there something that I have missed
> > > > > > >>>>>   in this? Is there something that we do not need to consider?
> > > > > > >>>>> 2. Is there a timeline on the redundant messages? What I mean
> > > > > > >>>>>   by this is: is there a time limit after which a message with
> > > > > > >>>>>   similar content is allowed? From what I gather, no such thing
> > > > > > >>>>>   was mentioned. This would mean storing all the messages.
> > > > > > >>>>>   Depending on the requirements this may or may not be the best
> > > > > > >>>>>   solution. It might be desirable that no duplicates are
> > > > > > >>>>>   allowed within a certain (sliding) time window. This allows
> > > > > > >>>>>   ignoring duplicate messages that were generated very close to
> > > > > > >>>>>   each other (or in the window indicated). Depending on this
> > > > > > >>>>>   requirement, the implementation may become a little bit more
> > > > > > >>>>>   involved.
> > > > > >>>>>
> > > > > > >>>>> For now, these are the only questions. I have ideas that need
> > > > > > >>>>> review about possible implementations but I will mention them
> > > > > > >>>>> once the specifications are clear to me. As an end question, I
> > > > > > >>>>> would at some point like to post design ideas to this problem
> > > > > > >>>>> privately to get it reviewed by the development community but
> > > > > > >>>>> not make it publicly available so that it cannot be
> > > > > > >>>>> plagiarised. What platform/method can I use to do that? Or is
> > > > > > >>>>> submitting a draft to the Google platform the only possible
> > > > > > >>>>> way to accomplish this?
> > > > > > >>>>>
> > > > > > >>>>> Thanks a lot for reading this through and looking forward to
> > > > > > >>>>> your inputs.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Sohaib Iftikhar
> > > > > >>>>>
> > > > > >>>
> > > > > >>>
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
