date:20190107

Re: PR Milestone policy

2019-01-07 Thread Gian Merlino

My feeling is that setting a milestone on PRs before they're merged is a
way of making their authors feel more included. I don't necessarily see a
problem with setting milestones optimistically and then, when a release
branch is about to be cut (based on the timed release date), we bulk-update
anything that hasn't been merged yet to the next milestone.
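
As a rough sketch of the bulk-update step described above (not an existing Druid script): the `PullRequest` record, the PR numbers, and the milestone name `0.15.0` are hypothetical, and a real version would read and write milestones through the GitHub API rather than an in-memory list.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical, simplified stand-in for a GitHub pull request.
    number: int
    milestone: Optional[str]
    merged: bool


def roll_over_unmerged(
    prs: List[PullRequest], cutting: str, next_milestone: str
) -> List[PullRequest]:
    """Move every unmerged PR tagged with `cutting` to `next_milestone`.

    Merged PRs and PRs on other (or no) milestones are left untouched.
    """
    return [
        replace(pr, milestone=next_milestone)
        if pr.milestone == cutting and not pr.merged
        else pr
        for pr in prs
    ]


prs = [
    PullRequest(101, "0.14.0", merged=True),   # already merged: keeps 0.14.0
    PullRequest(102, "0.14.0", merged=False),  # optimistic tag: rolls over
    PullRequest(103, None, merged=False),      # never tagged: unchanged
]
rolled = roll_over_unmerged(prs, cutting="0.14.0", next_milestone="0.15.0")
```

Run once when the release branch is cut, this keeps the optimistic tagging cheap: nothing has to be re-triaged by hand.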

However, there are other ways to make authors feel more included. If we end
up adopting a more formalized proposal process, that would help too. (It
should be easier for people to comment on proposals than on PRs, since there
isn't a need to read code.)

I guess I'm not really fussy either way on this one.

On Wed, Dec 12, 2018 at 10:27 PM 邱明明  wrote:

> I agree with Jonathan.
> On Thu, Dec 13, 2018 at 1:05 PM, Jay Nash wrote:
> >
> > Dear all,
> > I am just a bystander on the Druid list, but I would like to contribute
> > code to Druid some day because it is great and we use it at my company.
> > It sounds like consensus was reached that GitHub milestones should be used
> > less frequently, and that a vote is proposed to change this. Is this correct?
> >
> > Regards,
> > Jay
> >
> > On 2018/12/12 00:39:29, Jonathan Wei  wrote:
> > > > After a PR has been reviewed and merged, I think we should tag it
> > > > with the upcoming milestone to make life easier for release
> > > > managers, for all PRs.
> > > >
> > > > Regarding unresolved PRs:
> > > >
> > > > > I advocate for not assigning milestones to any non-bug (or otherwise
> > > > > "critical") PRs, including "feature", non-refactoring PRs.
> > > >
> > > > That seems like a reasonable policy to me, based on the points Roman
> > > > made in the thread.
> > > >
> > > > On Tue, Dec 11, 2018 at 1:13 AM Julian Hyde wrote:
> > > >
> > > > > Well, see if you can get consensus around such a policy. Other Druid
> > > > > folks, please speak up if you agree or disagree.
> > > > >
> > > > > > On Dec 8, 2018, at 8:02 AM, Roman Leventov wrote:
> > > > > >
> > > > > > It's not exactly and not only that. I advocate for not assigning
> > > > > > milestones to any non-bug (or otherwise "critical") PRs, including
> > > > > > "feature", non-refactoring PRs.
> > > > > >
> > > > > > On Fri, 7 Dec 2018 at 19:29, Julian Hyde wrote:
> > > > > >
> > > > > >> Consensus.
> > > > > >>
> > > > > >> We resolve debates by going into them knowing that we need to find
> > > > > >> consensus. A vote is a last step to prove that consensus exists, and
> > > > > >> in most cases is not necessary.
> > > > > >>
> > > > > >> Reading between the lines, it sounds as if you and FJ have a
> > > > > >> difference of opinion about refactoring changes. Two extreme
> > > > > >> positions would be (1) we don't accept changes that only refactor
> > > > > >> code, (2) and I assert my right to contribute a refactoring change
> > > > > >> at any point in the project lifecycle. A debate that starts with
> > > > > >> those positions is never going to reach consensus. A better starting
> > > > > >> point might be "I would like to make the following change because I
> > > > > >> believe it would be beneficial. How could I best structure it / time
> > > > > >> it to minimize impact?"
> > > > > >> On Fri, Dec 7, 2018 at 9:19 AM Roman Leventov wrote:
> > > > >
> > > > > >>> I would like like learn what is the Apache way to resolve debates.
> > > > > >>> But you are right, this question probably doesn't deserve that.
> > > > > >>> Thanks for guidance Julian.
> > > > >
> > > > > >>> On Fri, 7 Dec 2018 at 16:43, Julian Hyde wrote:
> > > > >
> > > > May I suggest that a vote is not the solution. In this discussion I
> > > > see two people beating each other over the head with policy.
> > > > >
> > > > Let’s strive to operate according to the Apache way. Accept
> > > > contributions on merit in a timely manner. Avoid the urge to
> > > > “project manage”.
> > > > >
> > > > Julian
> > > > >
> > > > > On Dec 7, 2018, at 07:03, Roman Leventov wrote:
> > > > >>
> > > > > The previous consensus community decision seems to be to not use PR
> > > > > milestones for any PRs except bugs. To change this policy, probably
> > > > > there should be a committer (or PPMC?) vote.
> > > > >>
> > > > >> On Thu, 6 Dec 2018 at 20:49, Julian Hyde wrote:
> > > > >>>
> > > > >> FJ,
> > > > >>>
> > > > >> What you are proposing sounds suspiciously like project management.
> > > > >> If a contributor makes a contribution, that contribution should be
> > > > >> given a fair review in a timely fashion and be committed based on its
> > > > >> merits. You overstate the time-sensitivity of contributions. I would
> > > > >> imagine that there are only a few days preceding each release where
> > > > >> stability is a major concern. At any othe

Re: Off list major development

2019-01-07 Thread Gian Merlino

It sounds like splitting design from code review is a common theme in a few
of the posts here. How does everyone feel about making a point of
encouraging design reviews to be done as issues, separate from the pull
request, with the expectations that (1) the design review issue
("proposal") should generally appear somewhat _before_ the pull request;
(2) pull requests should _not_ have design review happen on them, meaning
there should no longer be PRs with design review tags, and we should move
the design review approval process to the issue rather than the PR?

For (1), even if we encourage design review discussions to start before a
pull request appears, I don't see an issue with them running concurrently
for a while at some point.

On Thu, Jan 3, 2019 at 5:35 PM Jonathan Wei  wrote:

> Thanks for raising these concerns!
>
> My initial thoughts:
> - I agree that separation of design review and code-level review for major
> changes would be more efficient
>
> - I agree that a clear, more formalized process for handling major changes
> would be helpful for contributors:
>   - Define what is considered a major change
>   - Define a standard proposal structure, KIP-style proposal format sounds
> good to me
>
> - I think it's too rigid to have a policy of "no code at all with the
> initial proposal"
>   - Code samples can be useful references for understanding aspects of a
> design
>   - In some cases it's necessary to run experiments to fully understand a
> problem and determine an appropriate design, or to determine whether
> something is even worth doing before committing to the work of fleshing out
> a proposal, prototype code is a natural outcome of that and I'm not against
> someone providing such code for reference
>   - I tend to view design/code as things that are often developed
> simultaneously in an intertwined way
>
> > Let's not be naive this is very rare that a contributor will accept that
> his work is to be thrown, usually devs takes coding as personal creation
> and they get attached to it.
>
> If we have a clear review process that emphasizes the need for early
> consensus building, with separate design and code review, then I feel we've
> done enough and don't need a hard rule against having some code linked with
> the initial proposal. If a potential contributor then still wants to go
> ahead and write a lot of code that may be rejected or change significantly,
> the risks were made clear.
>
> > Once code is written hard to think abstract.
>
> I can see the validity of the concern, but I personally don't see it as a
> huge risk. My impression from the Druid PR reviews I've seen is that our
> reviewers are able to keep abstract design vs. implementation details
> separate and consider alternate designs when reviewing.
>
> To summarize I think it's probably enough to have a policy along the lines
> of:
>  - Create more formalized guidelines for proposals and what changes require
> proposals
>  - Separate design and code review for major changes, with design review
> first, code-level review after reaching consensus on the design.
>  - Code before the design review is completed is just for reference, not
> regarded as a candidate for review/merging.
>
> - Jon
>
>
> On Thu, Jan 3, 2019 at 12:48 PM Slim Bouguerra wrote:
>
> > On Thu, Jan 3, 2019 at 12:16 PM Clint Wylie wrote:
> >
> > > I am definitely biased in this matter as an owner of another large PR
> > > that wasn't preceded by a direct proposal or dev list discussion, and
> > > in general I agree that proposal first is usually better, but I think
> > > in some rarer cases approaching a problem code first *is* the most
> > > appropriate way to have a discussion.
> >
> >
> > I am wondering here what is the case where code first is better?
> > In general when you are writing code you have an idea about what you want
> > to change, why you want to change and why you want to change it.
> > I do not see what is wrong with sharing this primitive ideas and thoughts
> > as an abstract proposal (at least to avoid overlapping)
> >
> > > I see nothing wrong with it so long as the author
> > > accepts that the PR is treated as a combined proposal and proof of
> > > concept, and fair game to be radically changed via discussion or even
> > > rejected, which sounds like Gian's attitude on the matter and is mine
> > > as well with my compression stuff.
> >
> >
> > Let's not be naive this is very rare that a contributor will accept that
> > his work is to be thrown, usually devs takes coding as personal creation
> > and they get attached to it.
> > To my point you can take a look on some old issue in the Druid forum
> > https://github.com/apache/incubator-druid/pull/3755#issuecomment-265667690
> > and am sure other communities have similar problems.
> > So leaving the door open to some side cases is not a good idea in my
> > opinion and will lead to similar issue in the future.
> >
> > This seems to me especially likely to happ

Re: Off list major development

2019-01-07 Thread Julian Hyde

Small contributions don’t need any design review, whereas large contributions 
need significant review. I don’t think we should require an additional step for 
those (many) small contributions. But who decides whether a contribution fits 
into the small or large category?

I think the solution is for authors to log a case (or send an email to dev) 
before they start work on any contribution. Then committers can request a more 
heavy-weight process if they think it is needed.

Julian


> On Jan 7, 2019, at 11:24 AM, Gian Merlino  wrote:
> 
> It sounds like splitting design from code review is a common theme in a few
> of the posts here. How does everyone feel about making a point of
> encouraging design reviews to be done as issues, separate from the pull
> request, with the expectations that (1) the design review issue
> ("proposal") should generally appear somewhat _before_ the pull request;
> (2) pull requests should _not_ have design review happen on them, meaning
> there should no longer be PRs with design review tags, and we should move
> the design review approval process to the issue rather than the PR.
> 
> For (1), even if we encourage design review discussions to start before a
> pull request appears, I don't see an issue with them running concurrently
> for a while at some point.
> 
> On Thu, Jan 3, 2019 at 5:35 PM Jonathan Wei  wrote:
> 
>> Thanks for raising these concerns!
>> 
>> My initial thoughts:
>> - I agree that separation of design review and code-level review for major
>> changes would be more efficient
>> 
>> - I agree that a clear, more formalized process for handling major changes
>> would be helpful for contributors:
>>  - Define what is considered a major change
>>  - Define a standard proposal structure, KIP-style proposal format sounds
>> good to me
>> 
>> - I think it's too rigid to have a policy of "no code at all with the
>> initial proposal"
>>  - Code samples can be useful references for understanding aspects of a
>> design
>>  - In some cases it's necessary to run experiments to fully understand a
>> problem and determine an appropriate design, or to determine whether
>> something is even worth doing before committing to the work of fleshing out
>> a proposal, prototype code is a natural outcome of that and I'm not against
>> someone providing such code for reference
>>  - I tend to view design/code as things that are often developed
>> simultaneously in an intertwined way
>> 
>>> Let's not be naive this is very rare that a contributor will accept that
>> his work is to be thrown, usually devs takes coding as personal creation
>> and they get attached to it.
>> 
>> If we have a clear review process that emphasizes the need for early
>> consensus building, with separate design and code review, then I feel we've
>> done enough and don't need a hard rule against having some code linked with
>> the initial proposal. If a potential contributor then still wants to go
>> ahead and write a lot of code that may be rejected or change significantly,
>> the risks were made clear.
>> 
>>> Once code is written hard to think abstract.
>> 
>> I can see the validity of the concern, but I personally don't see it as a
>> huge risk. My impression from the Druid PR reviews I've seen is that our
>> reviewers are able to keep abstract design vs. implementation details
>> separate and consider alternate designs when reviewing.
>> 
>> To summarize I think it's probably enough to have a policy along the lines
>> of:
>> - Create more formalized guidelines for proposals and what changes require
>> proposals
>> - Separate design and code review for major changes, with design review
>> first, code-level review after reaching consensus on the design.
>> - Code before the design review is completed is just for reference, not
>> regarded as a candidate for review/merging.
>> 
>> - Jon
>> 
>> 
>> On Thu, Jan 3, 2019 at 12:48 PM Slim Bouguerra wrote:
>> 
>>> On Thu, Jan 3, 2019 at 12:16 PM Clint Wylie wrote:
>>> 
>>>> I am definitely biased in this matter as an owner of another large PR
>>>> that wasn't preceded by a direct proposal or dev list discussion, and
>>>> in general I agree that proposal first is usually better, but I think
>>>> in some rarer cases approaching a problem code first *is* the most
>>>> appropriate way to have a discussion.
>>> 
>>> 
>>> I am wondering here what is the case where code first is better?
>>> In general when you are writing code you have an idea about what you want
>>> to change, why you want to change and why you want to change it.
>>> I do not see what is wrong with sharing this primitive ideas and thoughts
>>> as an abstract proposal (at least to avoid overlapping)
>>> 
>>>> I see nothing wrong with it so long as the author
>>>> accepts that the PR is treated as a combined proposal and proof of
>>>> concept, and fair game to be radically changed via discussion or even
>>>> rejected, which sounds like Gian's attitude on the matter and is mine
>>>> as w

Re: Off list major development

2019-01-07 Thread Gian Merlino

I don't think there's a need to raise issues for every change: a small bug
fix or doc fix should just go straight to PR. (GitHub PRs show up as issues
in the issue-search UI/API, so it's not like this means the patch has no
corresponding issue -- in a sense the PR _is_ the issue.)

I do think it makes sense to encourage potential contributors to write to
the dev list or raise an issue if they aren't sure if something would need
to go through a more heavy weight process.

Fwiw we do have a set of 'design review' criteria already (we had a
discussion about this a couple years ago) at:
http://druid.io/community/#getting-your-changes-accepted. So we wouldn't be
starting from zero on defining that. We set it up back when we were trying
to _streamline_ our process -- we used to require two non-author +1s for
_every_ change, even minor ones. The introduction of design review criteria
was meant to classify which PRs need that level of review and which ones
are minor and can be merged with less review. I do think it helped with
getting minor PRs merged more quickly. The list of criteria is:

- Major architectural changes or API changes
- HTTP requests and responses (e.g. a new HTTP endpoint)
- Interfaces for extensions
- Server configuration (e.g. altering the behavior of a config property)
- Emitted metrics
- Other major changes, judged by the discretion of Druid committers

Some of it is subjective, but it has been in place for a while, so it's at
least something we are relatively familiar with.
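
To make the criteria above concrete, here is a minimal sketch of how a triage helper might apply them. The area names and the `needs_design_review` function are hypothetical labels invented for illustration; they are not Druid tooling or GitHub labels, and the subjective last criterion is modeled as a plain committer flag.

```python
# Hypothetical area names, one per objective criterion in the list above.
DESIGN_REVIEW_AREAS = frozenset({
    "architecture_or_api",   # major architectural changes or API changes
    "http_api",              # HTTP requests and responses
    "extension_interface",   # interfaces for extensions
    "server_configuration",  # server configuration
    "emitted_metrics",       # emitted metrics
})


def needs_design_review(touched_areas, committer_flagged_major=False):
    """Return True if a change should go through design review.

    `committer_flagged_major` stands in for the subjective criterion
    ("other major changes, judged by the discretion of Druid committers").
    """
    return committer_flagged_major or bool(DESIGN_REVIEW_AREAS & set(touched_areas))


doc_fix = needs_design_review({"docs"})
metric_change = needs_design_review({"docs", "emitted_metrics"})
judgment_call = needs_design_review(set(), committer_flagged_major=True)
```

The objective criteria reduce to a set intersection; only the last one needs a human in the loop, which is where the subjectivity mentioned above comes in.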

On Mon, Jan 7, 2019 at 11:32 AM Julian Hyde  wrote:

> Small contributions don’t need any design review, whereas large
> contributions need significant review. I don’t think we should require an
> additional step for those (many) small contributions. But who decides
> whether a contribution fits into the small or large category?
>
> I think the solution is for authors to log a case (or send an email to
> dev) before they start work on any contribution. Then committers can
> request a more heavy-weight process if they think it is needed.
>
> Julian
>
>
> > On Jan 7, 2019, at 11:24 AM, Gian Merlino  wrote:
> >
> > It sounds like splitting design from code review is a common theme in a
> few
> > of the posts here. How does everyone feel about making a point of
> > encouraging design reviews to be done as issues, separate from the pull
> > request, with the expectations that (1) the design review issue
> > ("proposal") should generally appear somewhat _before_ the pull request;
> > (2) pull requests should _not_ have design review happen on them, meaning
> > there should no longer be PRs with design review tags, and we should move
> > the design review approval process to the issue rather than the PR.
> >
> > For (1), even if we encourage design review discussions to start before a
> > pull request appears, I don't see an issue with them running concurrently
> > for a while at some point.
> >
> > On Thu, Jan 3, 2019 at 5:35 PM Jonathan Wei  wrote:
> >
> >> Thanks for raising these concerns!
> >>
> >> My initial thoughts:
> >> - I agree that separation of design review and code-level review for
> major
> >> changes would be more efficient
> >>
> >> - I agree that a clear, more formalized process for handling major
> changes
> >> would be helpful for contributors:
> >>  - Define what is considered a major change
> >>  - Define a standard proposal structure, KIP-style proposal format
> sounds
> >> good to me
> >>
> >> - I think it's too rigid to have a policy of "no code at all with the
> >> initial proposal"
> >>  - Code samples can be useful references for understanding aspects of a
> >> design
> >>  - In some cases it's necessary to run experiments to fully understand a
> >> problem and determine an appropriate design, or to determine whether
> >> something is even worth doing before committing to the work of fleshing
> out
> >> a proposal, prototype code is a natural outcome of that and I'm not
> against
> >> someone providing such code for reference
> >>  - I tend to view design/code as things that are often developed
> >> simultaneously in an intertwined way
> >>
> >>> Let's not be naive this is very rare that a contributor will accept
> that
> >> his work is to be thrown, usually devs takes coding as personal creation
> >> and they get attached to it.
> >>
> >> If we have a clear review process that emphasizes the need for early
> >> consensus building, with separate design and code review, then I feel
> we've
> >> done enough and don't need a hard rule against having some code linked
> with
> >> the initial proposal. If a potential contributor then still wants to go
> >> ahead and write a lot of code that may be rejected or change
> significantly,
> >> the risks were made clear.
> >>
> >>> Once code is written hard to think abstract.
> >>
> >> I can see the validity of the concern, but I personally don't see it as
> a
> >> huge risk. My impression from the Druid PR reviews I've seen is that our
> >> reviewers are able

Re: Off list major development

2019-01-07 Thread Julian Hyde

Statically, yes, GitHub PRs are the same as GitHub cases. But dynamically, they 
are different, because you can only log a PR when you have finished work.

A lot of other Apache projects use JIRA, so there is a clear distinction 
between cases and contributions. JIRA cases, especially when logged early in 
the lifecycle of a contribution, become long-running conversation threads with 
a lot of community participation. If Druid chose to do so, GitHub cases 
could be the same.

Be careful that you do not treat “potential contributors” (by which I presume 
you mean non-committers) differently from committers and PMC members. Anyone 
starting a major piece of work should follow the same process. (Experienced 
committers probably have a somewhat better idea what work will turn out to be 
“major”, so they get a little more leeway.)

Julian


> On Jan 7, 2019, at 12:10 PM, Gian Merlino  wrote:
> 
> I don't think there's a need to raise issues for every change: a small bug
> fix or doc fix should just go straight to PR. (GitHub PRs show up as issues
> in the issue-search UI/API, so it's not like this means the patch has no
> corresponding issue -- in a sense the PR _is_ the issue.)
> 
> I do think it makes sense to encourage potential contributors to write to
> the dev list or raise an issue if they aren't sure if something would need
> to go through a more heavy weight process.
> 
> Fwiw we do have a set of 'design review' criteria already (we had a
> discussion about this a couple years ago) at:
> http://druid.io/community/#getting-your-changes-accepted. So we wouldn't be
> starting from zero on defining that. We set it up back when we were trying
> to _streamline_ our process -- we used to require two non-author +1s for
> _every_ change, even minor ones. The introduction of design review criteria
> was meant to classify which PRs need that level of review and which ones
> are minor and can be merged with less review. I do think it helped with
> getting minor PRs merged more quickly. The list of criteria is,
> 
> - Major architectural changes or API changes
> - HTTP requests and responses (e. g. a new HTTP endpoint)
> - Interfaces for extensions
> - Server configuration (e. g. altering the behavior of a config property)
> - Emitted metrics
> - Other major changes, judged by the discretion of Druid committers
> 
> Some of it is subjective, but it has been in place for a while, so it's at
> least something we are relatively familiar with.
> 
> On Mon, Jan 7, 2019 at 11:32 AM Julian Hyde  wrote:
> 
>> Small contributions don’t need any design review, whereas large
>> contributions need significant review. I don’t think we should require an
>> additional step for those (many) small contributions. But who decides
>> whether a contribution fits into the small or large category?
>> 
>> I think the solution is for authors to log a case (or send an email to
>> dev) before they start work on any contribution. Then committers can
>> request a more heavy-weight process if they think it is needed.
>> 
>> Julian
>> 
>> 
>>> On Jan 7, 2019, at 11:24 AM, Gian Merlino  wrote:
>>> 
>>> It sounds like splitting design from code review is a common theme in a
>> few
>>> of the posts here. How does everyone feel about making a point of
>>> encouraging design reviews to be done as issues, separate from the pull
>>> request, with the expectations that (1) the design review issue
>>> ("proposal") should generally appear somewhat _before_ the pull request;
>>> (2) pull requests should _not_ have design review happen on them, meaning
>>> there should no longer be PRs with design review tags, and we should move
>>> the design review approval process to the issue rather than the PR.
>>> 
>>> For (1), even if we encourage design review discussions to start before a
>>> pull request appears, I don't see an issue with them running concurrently
>>> for a while at some point.
>>> 
>>> On Thu, Jan 3, 2019 at 5:35 PM Jonathan Wei  wrote:
>>> 
>>>> Thanks for raising these concerns!
>>>> 
>>>> My initial thoughts:
>>>> - I agree that separation of design review and code-level review for
>>>> major changes would be more efficient
>>>> 
>>>> - I agree that a clear, more formalized process for handling major
>>>> changes would be helpful for contributors:
>>>>  - Define what is considered a major change
>>>>  - Define a standard proposal structure, KIP-style proposal format
>>>> sounds good to me
>>>> 
>>>> - I think it's too rigid to have a policy of "no code at all with the
>>>> initial proposal"
>>>>  - Code samples can be useful references for understanding aspects of a
>>>> design
>>>>  - In some cases it's necessary to run experiments to fully understand a
>>>> problem and determine an appropriate design, or to determine whether
>>>> something is even worth doing before committing to the work of fleshing
>>>> out a proposal, prototype code is a natural outcome of that and I'm not
>>>> against someone providing such code for refere

Re: Watermarks!

2019-01-07 Thread Gian Merlino

For Kafka, maybe something that tells you if all committed data is actually
loaded, & what offset has been committed up to? Would there be any problems
caused by the fact that only the most recent commit is saved in the DB?

Is this feature connected at all to an ask I have heard from a few people:
that there be an option to fail a query (or at least include a special
response header) if some segments in the interval are unavailable? (Which,
currently, the broker can't know since it doesn't know details about all
available segments.)

Btw, at your site do you have any plans to migrate to Kafka indexing?

On Wed, Jan 2, 2019 at 5:37 PM Charles Allen wrote:

> Hi all!
>
> https://github.com/apache/incubator-druid/pull/6799
>
> A contribution is up that includes a neat feature we have been using
> internally called Watermarks. Basically when operating a large scale and
> multi-tenant system, it is handy to be able to monitor how 'well behaved'
> the data is with regard to history. This is commonly used to spot holes in
> data, and to help give hints to data consumers in a lambda environment on
> when data has been run through a thorough check (batch job) vs a best
> effort sketch of the results which may or may not handle late data well
> (streaming intake).
>
> Unfortunately i'm not really sure what meta-data would be handy to have for
> the kafka indexing service, so I'd love input there as well if anyone knows
> of any "watermarks" that would make sense for it.
>
> Since the extension was written to be a stand alone service, it can remain
> as an extension forever if desired. An alternative I would like to propose
> is that the primitives for the watermark feature be added to core druid,
> and the extension points be added to their respective places (mysql
> extension and google extension to name two explicitly).
>
> Let me know what you think!
> Charles Allen
>

Re: Watermarks!

2019-01-07 Thread Charles Allen

I'll answer the last question first:

Many data groups are processed via Airflow, so having a batch component
compatible with Airflow is more impactful than being able to live stream
data as it stands right now. I'm constantly on the lookout for a use case
where druid streaming is a good fit for a solution (as opposed to
Graphite/grafana, or even potentially prometheus) but haven't found one yet
where the overhead for maintaining the extra realtime and streaming system
is worth the payout. From a technology investment point of view, a Beam
compatible sink (which we have an internal one based on tranquility for
streaming sinks) might end up working. I am interested to see if the KIS
features can be leveraged to work with systems outside of kafka. Also of
great interest is to see if the "resources per task" can be made more
tunable instead of being a single cookie cutter footprint. The need for
huge resources during the final merge-and-push phase compared to the
incremental intake phase is also a major pain point and cause of
inefficiency for Druid streaming stuff.

Watermarking *could* tell if segments are unavailable (i.e. a whole hour of
data is missing) and fail the query accordingly if the watermark cursor was
not advanced beyond the interval end. I have not attempted to put such an
interrupt into the query layer though. It is a very intriguing idea. In
general the cursors work by monitoring the segment availability
announcements and watching for certain criteria to be met before advancing.
A very simple example here would be to halt a watermark's progression until
at least *some* data for a time range is available in some segment
somewhere. A more advanced cursor would have a concept of "completeness"
and only advance the watermark once some time range has reached some
"complete" criteria (number of events, or signal from external system could
make sense).
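
A minimal sketch of the two cursor styles described above, assuming hourly buckets numbered 0, 1, 2, ... and a per-hour event count. This is not the code from the PR: `advance_watermark`, `covers`, and the sample counts are invented for illustration.

```python
from typing import Callable, Dict


def advance_watermark(
    watermark: int,
    end: int,
    events_per_hour: Dict[int, int],
    is_ready: Callable[[int], bool],
) -> int:
    """Advance hour by hour while each hour satisfies `is_ready`.

    The watermark is the exclusive end of the contiguous range of hours
    that met the criterion; it halts at the first hour that does not.
    """
    while watermark < end and is_ready(events_per_hour.get(watermark, 0)):
        watermark += 1
    return watermark


def covers(watermark: int, interval_end: int) -> bool:
    """True if the watermark has advanced to (or past) the interval end."""
    return watermark >= interval_end


# Hypothetical per-hour event counts; hour 2 is a hole in the data.
counts = {0: 120, 1: 80, 2: 0, 3: 95}

# Simple cursor: *some* data must exist for the hour.
any_data = advance_watermark(0, 4, counts, lambda n: n > 0)

# "Completeness" cursor: at least 100 events must exist for the hour.
complete = advance_watermark(0, 4, counts, lambda n: n >= 100)
```

An automated check, or a query layer that wanted to fail on missing data, would call something like `covers(any_data, interval_end)` before trusting results for the interval.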

The nice thing here is also with automated checks, which can wait until the
watermark has progressed before querying the druid cluster for some data.

Hopefully that answers some questions,
Charles Allen

On Mon, Jan 7, 2019 at 12:50 PM Gian Merlino  wrote:

> For Kafka, maybe something that tells you if all committed data is actually
> loaded, & what offset has been committed up to? Would there by any problems
> caused by the fact that only the most recent commit is saved in the DB?
>
> Is this feature connected at all to an ask I have heard from a few people:
> that there be an option to fail a query (or at least include a special
> response header) if some segments in the interval are unavailable? (Which,
> currently, the broker can't know since it doesn't know details about all
> available segments.)
>
> Btw, at your site do you have any plans to migrate to Kafka indexing?
>
> On Wed, Jan 2, 2019 at 5:37 PM Charles Allen wrote:
>
> > Hi all!
> >
> > https://github.com/apache/incubator-druid/pull/6799
> >
> > A contribution is up that includes a neat feature we have been using
> > internally called Watermarks. Basically when operating a large scale and
> > multi-tenant system, it is handy to be able to monitor how 'well behaved'
> > the data is with regard to history. This is commonly used to spot holes
> in
> > data, and to help give hints to data consumers in a lambda environment on
> > when data has been run through a thorough check (batch job) vs a best
> > effort sketch of the results which may or may not handle late data well
> > (streaming intake).
> >
> > Unfortunately i'm not really sure what meta-data would be handy to have
> for
> > the kafka indexing service, so I'd love input there as well if anyone
> knows
> > of any "watermarks" that would make sense for it.
> >
> > Since the extension was written to be a stand alone service, it can
> remain
> > as an extension forever if desired. An alternative I would like to
> propose
> > is that the primitives for the watermark feature be added to core druid,
> > and the extension points be added to their respective places (mysql
> > extension and google extension to name two explicitly).
> >
> > Let me know what you think!
> > Charles Allen
> >
>

Guava compat

2019-01-07 Thread Charles Allen

Hi all!

Just FYI https://github.com/apache/incubator-druid/pull/6815 is up which is
hopefully the last change needed to get old AND recent versions of Guava
working with druid

Pointers on implementing a new ShardSpec

2019-01-07 Thread Julian Jaffe

Hey all,

Are there any major caveats or gotchas I should be aware of when
implementing a new ShardSpec? The context here is that we have a datasource
that is the combined result of multiple input jobs. We're trying to do
write-side joining by having all of the jobs write segments for the same
intervals (e.g. partitioning on both partition number and source pipeline).
For now, I've modified the Spark-Druid batch ingestor (
https://github.com/metamx/druid-spark-batch) to run in our various
pipelines and to write out segments with identifier form
`dataSource_startInterval_endInterval_version_sourceName_partitionNum. This
is working without issue for loading, querying, and deleting data, but the
metadata API reports the incorrect segment identifier, since it
reconstructs the identifier instead of reading from metadata (e.g. it
reports segment identifiers of the form
`dataSource_startInterval_endInterval_version_partitionNum`). Both because
we'd like this to be fully supported, and because we imagine that this
feature may be useful to others, I'd like to implement this via a ShardSpec.
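
To illustrate the mismatch described above, here is a small sketch of the two identifier forms. It is not Druid's actual identifier code: `segment_id`, the datasource name, the version string, and the pipeline names are hypothetical, and only the two underscore-joined forms quoted above are modeled.

```python
from typing import Optional


def segment_id(
    data_source: str,
    start: str,
    end: str,
    version: str,
    partition_num: int,
    source_name: Optional[str] = None,
) -> str:
    """Join identifier parts with underscores.

    With `source_name` this gives the custom form written by the modified
    ingestor; without it, the form the metadata API reconstructs.
    """
    parts = [data_source, start, end, version]
    if source_name is not None:
        parts.append(source_name)
    parts.append(str(partition_num))
    return "_".join(parts)


start, end = "2019-01-01T00:00:00.000Z", "2019-01-02T00:00:00.000Z"

# Two pipelines writing partition 0 of the same interval and version.
written_a = segment_id("events", start, end, "v1", 0, source_name="pipelineA")
written_b = segment_id("events", start, end, "v1", 0, source_name="pipelineB")

# What a reconstruction that ignores the source name would report for both.
reported = segment_id("events", start, end, "v1", 0)
```

Under these assumptions the written identifiers are distinct, while the reconstructed one is the same for both segments, which is why the source pipeline has to be carried by the ShardSpec rather than only by the identifier string.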

Julian

Re: Druid 0.14 timing

2019-01-07 Thread Benedict Jin




On 2019/01/04 21:06:40, Gian Merlino  wrote: 
> It feels like 0.13.0 was just recently released, but it was branched off
> back in October, and it has almost been 3 months since then. How do we feel
> about doing an 0.14 branch cut at the end of January (Thu Jan 31) - going
> back to the every 3 months cycle?
> 
> For this release, based on the feedback we got from the Incubator vote last
> time, we'll need to fix up the LICENSE and NOTICE issues that were flagged
> but waved through for our first release. (Justin said he would have -1'd
> based on that if it was anything beyond a first release.)
+1

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org
