Re: Druid 0.12.2 release vote

2018-08-08 Thread Gian Merlino
That being said, Charles I am definitely looking forward to your report of
what the upgrade from 0.11 -> 0.12.2-rc1 is like in your cluster!

On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino  wrote:

> My thought is that recently we have started doing small bug-fix releases
> more often (0.12.1 and 0.12.2 were both small releases) and I think it
> makes sense to continue this practice. It makes sense to get them out
> quickly, since shipping bug fixes is good. IMO trying to validate bug fix
> releases within the customary Apache style 72 hour voting period is a good
> goal.
>
> On the other hand we do strive to put out high quality releases, and we
> don't want bug fix releases to introduce regressions. Testing every single
> patch in real clusters is an important part of that. All I can do is
> encourage people running real clusters to deploy RCs as fast as they can!
> Fwiw, we have already incorporated all the 0.12.2 patches into our Imply
> distro of Druid and already have a good number of users running them. So my +1
> earlier incorporated knowledge that the patches have been validated in that
> way.
>
> I agree with Jihoon that we will probably end up doing an 0.12.3 soon, to
> fix the issues he mentioned and a couple of others as well.
>
> On Wed, Aug 8, 2018 at 12:07 PM Jihoon Son  wrote:
>
>> Charles, thank you for doing performance evaluation! Performance numbers
>> are always good and helpful.
>>
>> However, IMO, any kind of performance degradation shouldn't be a blocker
>> for this release. 0.12.2 is a minor release and contains only bug fixes.
>> https://github.com/apache/incubator-druid/pull/5878/files is the only one
>> tagged with 'Performance', but it can be regarded more as a code bug
>> than an architectural performance issue.
>>
>> Instead, those kinds of performance tests should be performed per major
>> release to catch any kind of unexpected performance change. They can be a
>> blocker if we find any performance regression.
>>
>> Also, if you find any performance regression for this release, we will
>> probably make another minor release. I think some bug fixes (e.g.,
>> https://github.com/apache/incubator-druid/issues/6124,
>> https://github.com/apache/incubator-druid/issues/6123) are also worth
>> including in the minor release.
>>
>> What do you think?
>>
>> Best,
>> Jihoon
>>
>> On Wed, Aug 8, 2018 at 10:18 AM Charles Allen
>>  wrote:
>>
>> > I'm hoping to have some numbers for any performance changes or other
>> > impacts in the next few days (rollouts on big clusters take a long
>> time). I
>> > am neutral until the numbers come in. Preliminary indicators show no
>> > significant regression since the 0.11.x series. More data is expected
>> to be
>> > available in a few days as rollout completes.
>> >
>> >
>> >
>> > On Wed, Aug 8, 2018 at 9:10 AM Himanshu  wrote:
>> >
>> > > +1 , thanks for coordinating it.
>> > >
>> > > On Tue, Aug 7, 2018 at 8:05 PM, Gian Merlino  wrote:
>> > >
>> > > > +1. Thank you Jihoon for running this release.
>> > > >
>> > > > On Tue, Aug 7, 2018 at 10:04 AM Jihoon Son 
>> > wrote:
>> > > >
>> > > > > Sure,
>> > > > >
>> > > > > the release note is available here:
>> > > > > https://github.com/apache/incubator-druid/issues/6116.
>> > > > >
>> > > > > Best,
>> > > > > Jihoon
>> > > > >
>> > > > > On Tue, Aug 7, 2018 at 10:02 AM Charles Allen > >
>> > > > wrote:
>> > > > >
>> > > > > > ((don't let this ask block the release))
>> > > > > >
>> > > > > > Is there a way to get a preview of what the release notice will
>> > look
>> > > > > like?
>> > > > > >
>> > > > > > On Mon, Aug 6, 2018 at 3:38 PM Fangjin Yang 
>> > > wrote:
>> > > > > >
>> > > > > > > +1
>> > > > > > >
>> > > > > > > On Mon, Aug 6, 2018 at 3:03 PM, Jihoon Son <
>> jihoon...@apache.org
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > Druid 0.12.2-rc1 (http://druid.io/downloads.html) is
>> > > > > > > > available now, and I think it's time to vote on the 0.12.2
>> > > > > > > > release. Please note that 0.12.2 is not an ASF release.
>> > > > > > > >
>> > > > > > > > Here is my +1.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Jihoon
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: Druid 0.12.2 release vote

2018-08-08 Thread Gian Merlino
Thanks, it will be nice to see!

On Wed, Aug 8, 2018 at 1:15 PM Charles Allen 
wrote:

> I don't think it should be a blocker to release, but I have to run perf
> tests for rollouts anyways so I figured I'd publish what I find :-P
>
> Cheers,
> Charles Allen
>
>
> On Wed, Aug 8, 2018 at 12:33 PM Gian Merlino  wrote:
>
> > That being said, Charles I am definitely looking forward to your report
> of
> > what the upgrade from 0.11 -> 0.12.2-rc1 is like in your cluster!
> >
> > On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino  wrote:
> >
> > > My thought is that recently we have started doing small bug-fix
> releases
> > > more often (0.12.1 and 0.12.2 were both small releases) and I think it
> > > makes sense to continue this practice. It makes sense to get them out
> > > quickly, since shipping bug fixes is good. IMO trying to validate bug
> fix
> > > releases within the customary Apache style 72 hour voting period is a
> > good
> > > goal.
> > >
> > > On the other hand we do strive to put out high quality releases, and we
> > > don't want bug fix releases to introduce regressions. Testing every
> > single
> > > patch in real clusters is an important part of that. All I can do is
> > > encourage people running real clusters to deploy RCs as fast as they
> can!
> > > Fwiw, we have already incorporated all the 0.12.2 patches into our
> Imply
> > > distro of Druid and already have a good number of users running them. So
> my
> > +1
> > > earlier incorporated knowledge that the patches have been validated in
> > that
> > > way.
> > >
> > > I agree with Jihoon that we will probably end up doing an 0.12.3 soon,
> to
> > > fix the issues he mentioned and a couple of others as well.
> > >
> > > On Wed, Aug 8, 2018 at 12:07 PM Jihoon Son 
> wrote:
> > >
> > >> Charles, thank you for doing performance evaluation! Performance
> numbers
> > >> are always good and helpful.
> > >>
> > >> However, IMO, any kind of performance degradation shouldn't be a
> blocker
> > >> for this release. 0.12.2 is a minor release and contains only bug
> fixes.
> > >> https://github.com/apache/incubator-druid/pull/5878/files is the only
> > one
> > >> tagged with 'Performance', but it can be regarded more as a code bug
> > >> than an architectural performance issue.
> > >>
> > >> Instead, those kinds of performance tests should be performed per
> major
> > >> release to catch any kind of unexpected performance change. They can
> > be a
> > >> blocker if we find any performance regression.
> > >>
> > >> Also, if you find any performance regression for this release, we will
> > >> probably make another minor release. I think some bug fixes (e.g.,
> > >> https://github.com/apache/incubator-druid/issues/6124,
> > >> https://github.com/apache/incubator-druid/issues/6123) are also worth
> > >> including in the minor release.
> > >>
> > >> What do you think?
> > >>
> > >> Best,
> > >> Jihoon
> > >>
> > >> On Wed, Aug 8, 2018 at 10:18 AM Charles Allen
> > >>  wrote:
> > >>
> > >> > I'm hoping to have some numbers for any performance changes or other
> > >> > impacts in the next few days (rollouts on big clusters take a long
> > >> time). I
> > >> > am neutral until the numbers come in. Preliminary indicators show no
> > >> > significant regression since the 0.11.x series. More data is
> expected
> > >> to be
> > >> > available in a few days as rollout completes.
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Aug 8, 2018 at 9:10 AM Himanshu 
> wrote:
> > >> >
> > >> > > +1 , thanks for coordinating it.
> > >> > >
> > >> > > On Tue, Aug 7, 2018 at 8:05 PM, Gian Merlino 
> > wrote:
> > >> > >
> > >> > > > +1. Thank you Jihoon for running this release.
> > >> > > >
> > >> > > > On Tue, Aug 7, 2018 at 10:04 AM Jihoon Son <
> jihoon...@apache.org>
> > >> > wrote:
> > >> > > >
> > >> > > > > Sure,

Re: Druid 0.12.2 release vote

2018-08-09 Thread Gian Merlino
Nice!!

Although I don't see the graphic attached, maybe the mailing list ate it?

On Wed, Aug 8, 2018 at 4:15 PM Charles Allen 
wrote:

> Blue is 0.12.2 with some minor backports not perf related. Red is from the
> 0.11.x series. This is effectively a bucketed PDF of the query times for a
> live cluster with Timeseries queries as self-reported by historical nodes.
> I mentioned elsewhere I'm not convinced query/time is a good proxy for user
> experience, but it does provide a good baseline for comparisons between
> versions. Low query times are suspected to be due to some aggressive caching
> or complete node misses (the node has very little data for that time range for
> that datasource). And high query-time outliers are often the result of bad GC.
>
> On our side there is a new java version going out with the 0.12.2
> deployment, so it is unclear how much of the change is attributable to the new
> java version and how much to the druid jars or other config changes.
> Overall things seem to consistently display a small % improvement in the
> mean with our internal 0.12.2 release. This is good!
>
> Cheers,
> Charles Allen
>
> [image: Screen Shot 2018-08-08 at 4.01.24 PM.png]
>
>
> On Wed, Aug 8, 2018 at 3:11 PM David Lim  wrote:
>
>> +1, thank you!
>>
>> On Wed, Aug 8, 2018 at 3:16 PM Jonathan Wei  wrote:
>>
>> > +1, thanks Jihoon!
>> >
>> > On Wed, Aug 8, 2018 at 1:18 PM, Jihoon Son 
>> wrote:
>> >
>> > > Awesome! Thanks Charles!
>> > >
>> > > Jihoon
>> > >
>> > > On Wed, Aug 8, 2018 at 1:16 PM Gian Merlino  wrote:
>> > >
>> > > > Thanks, it will be nice to see!
>> > > >
>> > > > On Wed, Aug 8, 2018 at 1:15 PM Charles Allen <
>> charles.al...@snap.com
>> > > > .invalid>
>> > > > wrote:
>> > > >
>> > > > > I don't think it should be a blocker to release, but I have to run
>> > perf
>> > > > > tests for rollouts anyways so I figured I'd publish what I find
>> :-P
>> > > > >
>> > > > > Cheers,
>> > > > > Charles Allen
>> > > > >
>> > > > >
>> > > > > On Wed, Aug 8, 2018 at 12:33 PM Gian Merlino 
>> > wrote:
>> > > > >
>> > > > > > That being said, Charles I am definitely looking forward to your
>> > > report
>> > > > > of
>> > > > > > what the upgrade from 0.11 -> 0.12.2-rc1 is like in your
>> cluster!
>> > > > > >
>> > > > > > On Wed, Aug 8, 2018 at 12:30 PM Gian Merlino 
>> > > wrote:
>> > > > > >
>> > > > > > > My thought is that recently we have started doing small
>> bug-fix
>> > > > > releases
>> > > > > > > more often (0.12.1 and 0.12.2 were both small releases) and I
>> > think
>> > > > it
>> > > > > > > makes sense to continue this practice. It makes sense to get
>> them
>> > > out
>> > > > > > > quickly, since shipping bug fixes is good. IMO trying to
>> validate
>> > > bug
>> > > > > fix
>> > > > > > > releases within the customary Apache style 72 hour voting
>> period
>> > > is a
>> > > > > > good
>> > > > > > > goal.
>> > > > > > >
>> > > > > > > On the other hand we do strive to put out high quality
>> releases,
>> > > and
>> > > > we
>> > > > > > > don't want bug fix releases to introduce regressions. Testing
>> > every
>> > > > > > single
>> > > > > > > patch in real clusters is an important part of that. All I
>> can do
>> > > is
>> > > > > > > encourage people running real clusters to deploy RCs as fast
>> as
>> > > they
>> > > > > can!
>> > > > > > > Fwiw, we have already incorporated all the 0.12.2 patches into
>> > our
>> > > > > Imply
>> > > > > > > distro of Druid and already have a good number of users running
>> > them.
>> > > So
>> > > > > my
>> > > > > > +1
>> > > > > > > earlier incorporated knowledge that the patches have been
>> > validated in that way.
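
The "bucketed PDF of query times" comparison described in this thread can be sketched in a few lines. The helper name and all latency numbers below are invented for illustration, not Charles's actual data:

```python
# Sketch of a "bucketed PDF" (normalized histogram) of query latencies, the
# kind of cross-version comparison described above. All numbers are made up.
from collections import Counter

def bucketed_pdf(latencies_ms, bucket_ms=50):
    """Fraction of queries falling into each bucket_ms-wide latency bucket."""
    counts = Counter(bucket_ms * (t // bucket_ms) for t in latencies_ms)
    n = len(latencies_ms)
    return {bucket: c / n for bucket, c in sorted(counts.items())}

old = [40, 60, 75, 120, 480]  # pretend 0.11.x query times (ms)
new = [35, 55, 70, 110, 460]  # pretend 0.12.2 query times (ms)

pdf_old, pdf_new = bucketed_pdf(old), bucketed_pdf(new)
mean_delta = sum(new) / len(new) - sum(old) / len(old)
print(round(mean_delta, 1))  # -9.0 (ms): a small improvement in the mean
```

As the thread notes, the outlier buckets (caching hits at the low end, GC pauses at the high end) are why a mean alone is a weak proxy for user experience; comparing the whole bucketed distribution across versions is more informative.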

Re: Nightly build!

2018-08-09 Thread Gian Merlino
I found http://www.apache.org/legal/release-policy.html#host-rc which looks
like the policy we should follow if we start doing nightly builds. I guess
we shouldn't archive nightly builds, since there would be too many.

On Thu, Aug 9, 2018 at 6:02 PM Jihoon Son  wrote:

> Hi all,
>
> Nightly build would be useful for folks who want to stay on the bleeding
> edge of Druid. I'm thinking of adding a Jenkins job to
> https://builds.apache.org/ that checks every hour for changes in the master
> branch and, if there are any, produces a new build. Once the build succeeds, the
> binary is archived, so that we can add a link to the binary to our web
> page. If the build fails, the notification would be sent to
> comm...@druid.apache.org.
>
> Welcome any thoughts.
>
> Best,
> Jihoon
>
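
The hourly check Jihoon proposes could be sketched roughly as below. `latest_master_sha` and `should_build` are hypothetical helper names, not part of any existing build script, and the real implementation would live in a Jenkins job configuration:

```python
# Rough sketch (hypothetical, not the actual Jenkins job) of the hourly
# "rebuild only if master moved" check described above.
import subprocess

def latest_master_sha(repo_dir: str) -> str:
    """Fetch origin/master and return its current commit SHA (assumes git on PATH)."""
    subprocess.run(["git", "fetch", "origin", "master"], cwd=repo_dir, check=True)
    out = subprocess.run(
        ["git", "rev-parse", "origin/master"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

def should_build(current_sha: str, last_built_sha: str) -> bool:
    """Trigger a new build only when master has new commits since the last one."""
    return current_sha != last_built_sha
```

On success the resulting binary would be archived and linked from the web page; on failure a notification would go to the commits list, as described above.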


Re: Docs for 'master'

2018-08-10 Thread Gian Merlino
That sounds nice, I am all for it. It should ideally be automated, maybe
with Jenkins.

On Fri, Aug 10, 2018 at 5:13 PM Jihoon Son  wrote:

> Hi all,
>
> We currently have the following system for docs.
>
> - http://druid.io/docs/{version}: docs for a specific version
> - http://druid.io/docs/latest: latest docs. These docs are basically based
> on the latest release, but it can contain more recent docs which are not
> released yet.
>
> This system sometimes confuses people because 'latest' can be
> interpreted as the 'latest release'. So, I'm proposing new docs for the
> 'master' branch which always show the most recent docs. We might call this
> 'dev' or something better. Then, we would have the following docs:
>
> - http://druid.io/docs/{version}: docs for a specific version
> - http://druid.io/docs/latest: docs for the latest release
> - http://druid.io/docs/dev: docs for the master branch
>
> IMO, this system would have the benefits of less confusion as well as
> quicker publishing of recent documents. Quick document publishing is important
> for developers as well as users because we also need to refer to documents to
> test/improve/develop/review some features.
>
> Welcome any thoughts.
>
> Best,
> Jihoon
>


Re: CLA still required?

2018-08-17 Thread Gian Merlino
I think since the source is migrated now, what sounds right to me is to
accept Apache CLAs/SGAs for new committers, corporate contributors, and
major code transfers (like any other Apache project). And I think we
probably don't need to keep collecting our old CLAs, especially not for
minor contributions. Happy to get input from other people on this as it is
not my area of expertise.

On Tue, Aug 14, 2018 at 5:09 PM Jonathan Wei  wrote:

> Now that we've migrated the sources (but are still incubating), should we
> still ask new contributors to sign http://druid.io/community/cla.html?
>
>
> On Fri, Jun 1, 2018 at 12:52 PM, Gian Merlino  wrote:
>
> > Yes we are still collecting them, although once we are fully migrated to
> > ASF, then we won't anymore (as per ASF policy - as I understand it - CLAs
> > are only required for committers).
> >
> > On Fri, Jun 1, 2018 at 6:27 AM, Pierre Lacave 
> wrote:
> >
> > > Hi,
> > >
> > > With the incubation ongoing, do you still require CLA signed for
> > > contributions?
> > >
> > > Thanks
> > >
> >
>


Re: Druid as a OLTP solution

2018-08-19 Thread Gian Merlino
Hi Shushant,

Druid is definitely not intended for OLTP processing. It's generally meant
for storing, querying, and analyzing event streams. Check out
http://druid.io/ for some more info about what Druid is good at.

On Sat, Aug 18, 2018 at 10:19 PM Shushant Arora 
wrote:

>  Hi
>
> I have a requirement where I have large-scale data and each record has a
> mandatory visitorId field and one or more optional dimensions,
> and records with the same visitorId can arrive after a gap of a few days. So I
> have records in the system like
> visitorId1,Timestamp1,Dim1=val1
> visitorId1,Timestamp1+20days,Dim2=val2
>
> Now I have a requirement to count distinct visitorIds where Dim1=val1 and
> Dim2=val2 (i.e., any row of the same visitor has Dim1=val1 and any row of the
> same visitor has Dim2=val2).
>
>
> Can I do groupBy visitorId and filter on 2 dimensions Dim1=val1 and
> Dim2=val2, both being in separate rows, and count on visitorId?
> Since druid needs a Timestamp in each event, I used currentTs for that.
> Will this return a result of 1?
>
> Or is druid not meant for OLTP processing?
>
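
The cross-row condition in Shushant's question (Dim1=val1 on one row, Dim2=val2 on a different row of the same visitor) can be illustrated in plain Python. This is not Druid query syntax; the rows are made up from the examples in the email:

```python
# Plain-Python illustration (not a Druid query) of the condition in the
# question: count distinct visitorIds that have Dim1=val1 on any row and
# Dim2=val2 on any (possibly different) row.
from collections import defaultdict

rows = [
    ("visitorId1", "Timestamp1", {"Dim1": "val1"}),
    ("visitorId1", "Timestamp1+20days", {"Dim2": "val2"}),
    ("visitorId2", "Timestamp2", {"Dim1": "val1"}),  # never sees Dim2=val2
]

# Collect every (dimension, value) pair ever seen for each visitor.
dims_seen = defaultdict(set)
for visitor_id, _ts, dims in rows:
    dims_seen[visitor_id].update(dims.items())

# A visitor matches if the required pairs appear across *any* of its rows,
# not necessarily the same row.
matching = [v for v, seen in sorted(dims_seen.items())
            if {("Dim1", "val1"), ("Dim2", "val2")} <= seen]

print(len(matching))  # 1 -- only visitorId1 satisfies both conditions
```

This group-then-filter-per-visitor shape is a batch/OLAP aggregation pattern, which is consistent with Gian's point that Druid targets analytical queries over event streams rather than OLTP.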


Re: CLA still required?

2018-08-19 Thread Gian Merlino
I think it couldn't hurt. I am not sure if it's necessary. I believe that
in general, Apache doesn't require CLAs for patches submitted by
non-committers, as long as they are submitted by the author and with clear
intent to contribute.

On Fri, Aug 17, 2018 at 5:09 PM Pierre-Emile Ferron 
wrote:

> For corporate contributors, if we already signed the old corporate CLA,
> will we need to sign again but with the Apache CLA at
> https://www.apache.org/licenses/cla-corporate.txt ?
>
> On Fri, Aug 17, 2018 at 2:41 PM, Gian Merlino  wrote:
>
> > I think since the source is migrated now, what sounds right to me is to
> > accept Apache CLAs/SGAs for new committers, corporate contributors, and
> > major code transfers (like any other Apache project). And I think we
> > probably don't need to keep collecting our old CLAs, especially not for
> > minor contributions. Happy to get input from other people on this as it
> is
> > not my area of expertise.
> >
> > On Tue, Aug 14, 2018 at 5:09 PM Jonathan Wei  wrote:
> >
> > > Now that we've migrated the sources (but are still incubating), should
> we
> > > still ask new contributors to sign http://druid.io/community/cla.html?
> > >
> > >
> > > On Fri, Jun 1, 2018 at 12:52 PM, Gian Merlino  wrote:
> > >
> > > > Yes we are still collecting them, although once we are fully migrated
> > to
> > > > ASF, then we won't anymore (as per ASF policy - as I understand it -
> > CLAs
> > > > are only required for committers).
> > > >
> > > > On Fri, Jun 1, 2018 at 6:27 AM, Pierre Lacave 
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > With the incubation ongoing, do you still require CLA signed for
> > > > > contributions?
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>


Towards 0.13 (Apache release)

2018-08-29 Thread Gian Merlino
Hi everyone,

As we continue towards 0.13 I started looking into the "great renaming" (of
all packages from io.druid -> org.apache.druid) and am getting a PR ready.
I know Slim is working on
https://github.com/apache/incubator-druid/pull/6215 too (automated license
checking and some header fixups).

Other than these Apache related items, we have 26 open issues/PRs in the
0.13.0 milestone: https://github.com/apache/incubator-druid/milestone/25.
Is this everything we want to include? Is anything there we should bump to
the next release? Is anything _not_ there that needs to be added?

Let's figure out when we can target a code freeze -- the start of the RC
train for our first Apache release!!


Re: Towards 0.13 (Apache release)

2018-08-29 Thread Gian Merlino
I just raised https://github.com/apache/incubator-druid/pull/6266. I think
for sanity's sake, I would really appreciate it if we got this one merged
before merging any other PRs. (It will conflict with 100% of other PRs)

On Wed, Aug 29, 2018 at 9:34 AM Gian Merlino  wrote:

> Hi everyone,
>
> As we continue towards 0.13 I started looking into the "great renaming"
> (of all packages from io.druid -> org.apache.druid) and am getting a PR
> ready. I know Slim is working on
> https://github.com/apache/incubator-druid/pull/6215 too (automated
> license checking and some header fixups).
>
> Other than these Apache related items, we have 26 open issues/PRs in the
> 0.13.0 milestone: https://github.com/apache/incubator-druid/milestone/25.
> Is this everything we want to include? Is anything there we should bump to
> the next release? Is anything _not_ there that needs to be added?
>
> Let's figure out when we can target a code freeze -- the start of the RC
> train for our first Apache release!!
>


Re: Towards 0.13 (Apache release)

2018-08-30 Thread Gian Merlino
That PR is merged now! If anyone here still has outstanding PRs that are
now in conflict with master, try running this before merging master, it
really helps git out.

  git config --local merge.renameLimit 5000

My experience was that even a patch with a few dozen changed files merged
pretty cleanly, after setting this config. I just had a few conflicts to
resolve in imports.

On Wed, Aug 29, 2018 at 4:09 PM Gian Merlino  wrote:

> I just raised https://github.com/apache/incubator-druid/pull/6266. I
> think for sanity's sake, I would really appreciate it if we got this one
> merged before merging any other PRs. (It will conflict with 100% of other
> PRs)
>
> On Wed, Aug 29, 2018 at 9:34 AM Gian Merlino  wrote:
>
>> Hi everyone,
>>
>> As we continue towards 0.13 I started looking into the "great renaming"
>> (of all packages from io.druid -> org.apache.druid) and am getting a PR
>> ready. I know Slim is working on
>> https://github.com/apache/incubator-druid/pull/6215 too (automated
>> license checking and some header fixups).
>>
>> Other than these Apache related items, we have 26 open issues/PRs in the
>> 0.13.0 milestone: https://github.com/apache/incubator-druid/milestone/25.
>> Is this everything we want to include? Is anything there we should bump to
>> the next release? Is anything _not_ there that needs to be added?
>>
>> Let's figure out when we can target a code freeze -- the start of the RC
>> train for our first Apache release!!
>>
>


Re: Towards 0.13 (Apache release)

2018-09-04 Thread Gian Merlino
Hi Qiu,

It's in master, so that means it will be included in 0.13.0 (which hasn't
forked from master yet).

On Mon, Sep 3, 2018 at 10:22 AM qiumingming.2...@bytedance.com <
qiumingming.2...@bytedance.com> wrote:

>
>
> On 2018/08/29 23:09:41, Gian Merlino  wrote:
> > I just raised https://github.com/apache/incubator-druid/pull/6266. I
> think
> > for sanity's sake, I would really appreciate it if we got this one merged
> > before merging any other PRs. (It will conflict with 100% of other PRs)
> >
> > On Wed, Aug 29, 2018 at 9:34 AM Gian Merlino  wrote:
> >
> > > Hi everyone,
> > >
> > > As we continue towards 0.13 I started looking into the "great renaming"
> > > (of all packages from io.druid -> org.apache.druid) and am getting a PR
> > > ready. I know Slim is working on
> > > https://github.com/apache/incubator-druid/pull/6215 too (automated
> > > license checking and some header fixups).
> > >
> > > Other than these Apache related items, we have 26 open issues/PRs in
> the
> > > 0.13.0 milestone:
> https://github.com/apache/incubator-druid/milestone/25.
> > > Is this everything we want to include? Is anything there we should
> bump to
> > > the next release? Is anything _not_ there that needs to be added?
> > >
> > > Let's figure out when we can target a code freeze -- the start of the
> RC
> > > train for our first Apache release!!
> > >
> > Hi, I think https://github.com/apache/incubator-druid/pull/6202 should
> be added to the 0.13.0 milestone.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Druid 0.12.3-rc1 available

2018-09-05 Thread Gian Merlino
Thanks Jon!!

How does everyone feel about starting the vote on this next Monday?

On Tue, Sep 4, 2018 at 12:41 PM Jonathan Wei  wrote:

> We're happy to announce our next release candidate, Druid 0.12.3-rc1!
>
> Druid 0.12.3 is a non-ASF release. It contains stability improvements and
> bug fixes from 6 contributors. Major improvements include a more stable
> Kafka indexing service and several query bug fixes.
>
> Everyone in the community is invited to help out with the upcoming release
> by downloading this candidate and evaluating it.
>
> Draft release notes are at:
> https://github.com/druid-io/druid/issues/6288
>
> Documentation for this release candidate is at:
> http://druid.io/docs/0.12.3-rc1/
>
> You can download the release candidate here:
> http://druid.io/downloads.html
>
> Please file GitHub issues if you find any problems:
> https://github.com/druid-io/druid/issues/new
>
> Thanks everyone who contributed!
>


Flaky tests

2018-09-06 Thread Gian Merlino
Our CI has been super flaky lately: it seems rare that a PR is able to pass
without a few retries. In an effort to try to help I added a new label
"Flaky tests" and tagged all the open issues that look related to flaky
tests. I also closed a few that have been open for a long time and I don't
recall seeing in a while. They are all here:

https://github.com/apache/incubator-druid/labels/Flaky%20test

In a few cases I edited the titles so they all have the specific test case
that failed (the class and method). I think it helps to have one issue per
method, that way we can track them separately.

Non-scientifically I seem to be noticing these four often these days:

1) https://github.com/apache/incubator-druid/issues/6296
2) https://github.com/apache/incubator-druid/issues/2373
3) https://github.com/apache/incubator-druid/issues/6311
4) https://github.com/apache/incubator-druid/issues/6312

Please, if we can, let's spend some time looking into what is going on with
these tests. We will thank ourselves when it makes PR flow smoother!


Re: Druid 0.12.3 release vote

2018-09-17 Thread Gian Merlino
+1, thanks Jon!

On Tue, Sep 11, 2018 at 11:11 AM Jonathan Wei  wrote:

> Hi all,
>
> I'm going ahead and opening the vote for the 0.12.3 release.
>
> Please chime in with your vote once you've had a chance to test the release
> candidate.
>
> Thanks,
> Jon
>


Re: First Apache release of Druid

2018-09-17 Thread Gian Merlino
Hi Julian,

I am surprised to read that you feel the project hasn't come up with a plan
for an Apache release yet. I feel like we do have a plan. I wonder if your
message means that our plan is no good, or just that it isn't clear.

From my perspective, as a community, we have decided that our next release
from master (0.13) is going to be an Apache release. And we're treating it
the same way we've treated all our other from-master releases in the past
(0.10, 0.11, 0.12, etc). That is to say, we have tagged a set of issues
with the release number (
https://github.com/apache/incubator-druid/milestone/25) and we are working
to get that list down to zero so we can start doing RCs: either by
finishing the tasks or by punting them to future releases. We have some
extra Apache stuff in this release, and have an "Apache" label in github
that we've been tagging those issues and PRs with. Some relevant changes
include the following,

1) https://github.com/apache/incubator-druid/pull/5976 (Update license
headers.)
2) https://github.com/apache/incubator-druid/pull/6266 (Rename io.druid to
org.apache.druid.)
3) https://github.com/apache/incubator-druid/pull/6215 (Adding licenses and
enable apache-rat-plugin.)

Based on the tempo so far, I am hoping that we will get this release
branched off and start doing RCs later in September.

We haven't modified our NOTICE file yet, although I think we'll need to,
based on what I've seen on the Incubator site. If you have any advice about
what's the minimal set of tasks we should get done before starting to
generate and vote on RCs, that would be helpful towards getting it done
faster.


On Mon, Sep 17, 2018 at 10:45 AM Julian Hyde  wrote:

> Druid has been in incubation for several months and has not yet produced
> an Apache release. There were initially some issues with IP transfer that
> prevented that release, but they are now solved. The release is becoming
> urgent, because the code is still not been released under the Apache
> license. Can the project please come up with a plan for that release?
>
> I have seen the following in other incubating projects. They want their
> first release to be a “major” release, and then they start asking product
> managers to dictate the content and timing of the release, and they ask
> their marketing people what they could do to make it a “big splash". Don’t
> do that. A release is nothing more than a snapshot of whatever is on the
> master branch. Releases must be driven by the community.
>
> The first Apache release is always more effort than it seems. My advice is
> to start as soon as possible, and make its goals as limited as possible.
>
> Julian (wearing my “mentor” hat)
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: First Apache release of Druid

2018-09-19 Thread Gian Merlino
I think we want this release to be seen by the user community as a 'real'
release, and in that case it seems best to satisfy both sets of
requirements at once. The legal requirements: confirming we have all our
ducks in a row Apache/IP-wise. And the community requirements: producing a
release that is at least as high quality as our users have become
accustomed to in the past, and follows past precedent for deciding what
goes in a release and what doesn't. If that means more work for us, then in
my view, so be it. We will strive to prepare something for the IPMC to take
a look at as soon as possible. I think it's doable on a relatively short
timeframe from now.

On Mon, Sep 17, 2018 at 2:07 PM Julian Hyde  wrote:

> An ASF release is fundamentally a legal transaction. Issuing a bunch of
> source code under a license, having made sure that the contributions to
> that source code are in order. There’s not a strict requirement that it
> works, or even compiles.
>
> Now obviously we, as diligent software engineers who are hoping to build a
> community, want to deliver something that is functional, well-documented
> and a delight to use. But those are all secondary to the purpose of
> launching a blob of intellectual property into the world.
>
> Be advised that the “issues” that the IPMC will find with your release
> will likely have nothing to do with code bugs, testing or documentation.
> So, you need to find a balance between the technical tasks and the other
> aspects of the release process.
>
> To your question. It makes a lot of sense to rename java packages before
> the release, for the benefit of Druid’s community. But it’s not an absolute
> requirement.
>
> Julian
>
>
> > On Sep 17, 2018, at 1:16 PM, Xavier Léauté  wrote:
> >
> > Julian, maybe the requirements for an ASF release aren't clear to
> everyone.
> > It seems we are trying to move all our artifacts to be under org.apache
> in
> > order to meet ASF requirements for a release. Doing so would imply a
> major
> > release for us since those changes wouldn't be backwards compatible. Are
> > you saying that we would be able to do a release without renaming
> artifacts?
> >
> > On Mon, Sep 17, 2018 at 12:42 PM Julian Hyde  jh...@apache.org>> wrote:
> >
> >> I’m probably guilty of not spending enough time reading through dev@
> >> archives to find the plans. I hadn’t figured out that the first ASF
> release
> >> was going to be a major release (i.e. numbered 0.x) or that the release
> >> cadence for such releases is about every six months. Sorry about that.
> >>
> >> I saw this thread [1] but the end-of-September timescale isn’t explicit.
> >>
> >> It may be challenging if your first Apache release is also a major
> release
> >> (e.g. the two rounds of voting take a while, especially if each vote
> fails
> >> a couple of times). So, if you are planning say a beta release before a
> >> 0.13 then that might be a better first apache release.
> >>
> >> Julian
> >>
> >> [1]
> >>
> https://lists.apache.org/thread.html/e6a378201f7e7ab6da2493fe6ee4ae276768c461ea5c676a953d8139@%3Cdev.druid.apache.org%3E
> >>
> >>
> >>> On Sep 17, 2018, at 12:19 PM, Gian Merlino  wrote:
> >>>
> >>> Hi Julian,
> >>>
> >>> I am surprised to read that you feel the project hasn't come up with a
> >> plan
> >>> for an Apache release yet. I feel like we do have a plan. I wonder if
> >> your
> >>> message means that our plan is no good, or just that it isn't clear.
> >>>
> >>> From my perspective, as a community, we have decided that our next
> >> release
> >>> from master (0.13) is going to be an Apache release. And we're treating
> >> it
> >>> the same way we've treated all our other from-master releases in the
> past
> >>> (0.10, 0.11, 0.12, etc). That is to say, we have tagged a set of issues
> >>> with the release number (
> >>> https://github.com/apache/incubator-druid/milestone/25) and we are
> >> working
> >>> to get that list down to zero so we can start doing RCs: either by
> >>> finishing the tasks or by punting them to future releases. We have some
>

Re: Towards 0.13 (Apache release)

2018-09-19 Thread Gian Merlino
Hi all,

The current list of issues for 0.13.0 is:
https://github.com/apache/incubator-druid/milestone/25. In the interests of
moving towards 0.13.0 soon, I suggest we should start moving things out to
0.13.1 unless they can be implemented and reviewed within the next week
(or, unless they are regressions from 0.12, in which case we must fix them
for 0.13.0). See also the thread "First Apache release of Druid" for
motivation on why we want to get this done soon.

On Tue, Sep 4, 2018 at 11:43 AM Gian Merlino  wrote:

> Hi Qiu,
>
> It's in master, so that means it will be included in 0.13.0 (which hasn't
> forked from master yet).
>
> On Mon, Sep 3, 2018 at 10:22 AM qiumingming.2...@bytedance.com <
> qiumingming.2...@bytedance.com> wrote:
>
>>
>>
>> On 2018/08/29 23:09:41, Gian Merlino  wrote:
>> > I just raised https://github.com/apache/incubator-druid/pull/6266. I
>> think
>> > for sanity's sake, I would really appreciate it if we got this one
>> merged
>> > before merging any other PRs. (It will conflict with 100% of other PRs)
>> >
>> > On Wed, Aug 29, 2018 at 9:34 AM Gian Merlino  wrote:
>> >
>> > > Hi everyone,
>> > >
>> > > As we continue towards 0.13 I started looking into the "great
>> renaming"
>> > > (of all packages from io.druid -> org.apache.druid) and am getting a
>> PR
>> > > ready. I know Slim is working on
>> > > https://github.com/apache/incubator-druid/pull/6215 too (automated
>> > > license checking and some header fixups).
>> > >
>> > > Other than these Apache related items, we have 26 open issues/PRs in
>> the
>> > > 0.13.0 milestone:
>> https://github.com/apache/incubator-druid/milestone/25.
>> > > Is this everything we want to include? Is anything there we should
>> bump to
>> > > the next release? Is anything _not_ there that needs to be added?
>> > >
>> > > Let's figure out when we can target a code freeze -- the start of the
>> RC
>> > > train for our first Apache release!!
>> > >
>> > Hi, I think https://github.com/apache/incubator-druid/pull/6202 should
>> > be added to the 0.13.0 milestone.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
>> For additional commands, e-mail: dev-h...@druid.apache.org
>>
>>


Re: Unique Sketch aggregations and bias correction

2018-09-24 Thread Gian Merlino
I have not. The original HLL paper does have some points in it about bias
corrections for small cardinalities, and I am not sure if those are
implemented in Druid's HLL implementation.
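For context, the small-range correction in the original Flajolet et al. paper replaces the raw harmonic-mean estimate with linear counting over the zero-valued registers when the estimate is small. A minimal sketch of that correction (the register layout and alpha constant are illustrative, not Druid's actual HyperLogLogCollector):

```python
import math

def hll_estimate(registers):
    """Estimate cardinality from HLL registers, applying the small-range
    (linear counting) correction from the original HLL paper."""
    m = len(registers)
    # Standard alpha_m approximation, valid for m >= 128.
    alpha = 0.7213 / (1 + 1.079 / m)
    # Raw harmonic-mean estimator.
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * m:
        zeros = registers.count(0)
        if zeros > 0:
            # Linear counting is much less biased at low cardinalities.
            return m * math.log(m / zeros)
    return raw
```

With all registers zero this returns 0, and with a handful of registers touched it tracks the number of zero registers rather than the noisy raw estimate.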

On Mon, Sep 24, 2018 at 8:49 AM Charles Allen
 wrote:

> https://github.com/apache/incubator-druid/pull/5712 adds some great
> functionality to the Datasketches hooks in Druid.
>
> One thing noted in
>
> https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html
> is the severe bias the druid HLL implementation shows at ~5k uniques being
> fed in. This is something we've seen in a severe way internally, where a
> bias of a few percent makes a big difference in results. As such, I'm
> curious if anyone has done any research into simple bias correction to
> try to minimize the error seen in the outputs around that cardinality range?
>
> Cheers,
> Charles Allen
>


Re: subscribe to druid

2018-09-28 Thread Gian Merlino
Hey Dayue,

You can subscribe to the list by emailing "dev-subscr...@druid.apache.org".

On Thu, Sep 27, 2018 at 7:14 PM Dayue Gao  wrote:

>
>


Re: Please add me to dev subscriber list

2018-09-29 Thread Gian Merlino
Hey Panner,

You can subscribe to the list by emailing "dev-subscr...@druid.apache.org".

On Sat, Sep 29, 2018 at 1:15 AM Panner selvam Velmyl <
pannerselvam.vel...@gmail.com> wrote:

>
>


Re: TODOs for documentation

2018-10-01 Thread Gian Merlino
My feeling is that if a TODO is so small that it's not worth raising a
Github issue about it, then it's best to take some time to implement it in
the original patch. If that is too much of a burden, then it's a good
indication that it is not in fact a small task (and in that case, go ahead
and create the issue).

On Mon, Oct 1, 2018 at 11:02 AM Roman Leventov  wrote:

> The Druid project has a policy of creating issues on Github instead of TODOs
> in code, but I feel that this is overkill for small documentation-only
> tasks. I suggest allowing TODOs that ask just for writing or rewriting some
> text (including internal Javadoc documentation), but not development of any
> logic, code experiments, benchmarking, etc.
>
> The question has arisen here:
> https://github.com/apache/incubator-druid/pull/6370#discussion_r219694721
>


Re: Druid Developer Contribution

2018-10-03 Thread Gian Merlino
Hi Ravi,

All of us use either MacOS or Linux for development, so that issue might be
Windows related (it does use different end-of-line markers from what
MacOS/Linux use).
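If it is a line-ending problem, setting `git config core.autocrlf input` before cloning may help on Windows. To see exactly which files trip the check, a rough stand-in for checkstyle's NewlineAtEndOfFile rule (this is an illustration, not checkstyle itself):

```python
import os

def files_missing_final_newline(root, exts=(".java",)):
    """Walk a source tree and list files that do not end with a newline
    byte, roughly what checkstyle's NewlineAtEndOfFile rule flags."""
    offenders = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            # CRLF endings still pass ("\r\n" ends in "\n"); a file with
            # no trailing newline at all does not.
            if data and not data.endswith(b"\n"):
                offenders.append(path)
    return offenders
```

Running it over the checkout should reproduce the same file list as the maven errors above.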

On Wed, Oct 3, 2018 at 4:19 AM Ravi Kumar Gadagotti <
ravikumargadago...@gmail.com> wrote:

> Hi,
>
> I am getting the following error message when I am compiling using the
> maven or eclipse, I am not sure where to modify the POM to get it to work.
>
> And one more question is, can I edit the code in windows or should I do it
> Linux/Unix or mac os?
>
> [ERROR]
>
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\common\config\NullHandling.java:0:
> File does not end with a newline. [NewlineAtEndOfFile]
> [ERROR]
>
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\common\config\NullValueHandlingConfig.java:0:
> File does not end with a newline. [NewlineAtEndOfFile]
> [ERROR]
>
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\concurrent\ConcurrentAwaitableCounter.java:0:
> File does not end with a newline. [NewlineAtEndOfFile]
> [ERROR]
>
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\guice\annotations\ExtensionPoint.java:0:
> File does not end with a newline. [NewlineAtEndOfFile]
>
> On Wed, Oct 3, 2018 at 7:08 AM Ravi Kumar Gadagotti <
> ravikumargadago...@gmail.com> wrote:
>
> > Hi,
> >
> > I am getting the following error message when I am compiling using the
> > maven or eclipse, I am not sure where to modify the POM to get it to
> work.
> >
> > [ERROR]
> >
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\common\config\NullHandling.java:0:
> > File does not end with a newline. [NewlineAtEndOfFile]
> > [ERROR]
> >
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\common\config\NullValueHandlingConfig.java:0:
> > File does not end with a newline. [NewlineAtEndOfFile]
> > [ERROR]
> >
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\concurrent\ConcurrentAwaitableCounter.java:0:
> > File does not end with a newline. [NewlineAtEndOfFile]
> > [ERROR]
> >
> G:\incubator-druid-master\incubator-druid-master\java-util\src\main\java\org\apache\druid\guice\annotations\ExtensionPoint.java:0:
> > File does not end with a newline. [NewlineAtEndOfFile]
> >
> > On Wed, Oct 3, 2018 at 1:48 AM Surekha Saharan  >
> > wrote:
> >
> >> Hi Ravi,
> >>
> >> After you git cloned the project, are you able to build it using "mvn
> >> clean install -DskipTests" before importing into eclipse? Also make sure
> >> your
> >> eclipse java compiler and runtime are set to version 1.8
> >>
> >> In case you are open to using IntelliJ, there are some guidelines here:
> >> https://github.com/apache/incubator-druid/blob/master/INTELLIJ_SETUP.md
> >>
> >> Good luck,
> >> Surekha
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Oct 2, 2018 at 4:49 PM Ravi Kumar Gadagotti <
> >> ravikumargadago...@gmail.com> wrote:
> >>
> >> > Hi Surekha,
> >> >
> >> > Hope you are free to answer my silly questions...
> >> >
> >> > I just downloaded the source code and imported it to eclipse and when
> I
> >> am
> >> > updating the project, I am getting so many errors that I cannot even
> >> > compile or update the project in eclipse using maven. If possible can
> I
> >> get
> >> > any help from any of the existing developers for initial setup?
> >> >
> >> > Thanks,
> >> > Ravi Kumar Gadagotti.
> >> >
> >> > On Tue, Oct 2, 2018 at 3:13 PM Surekha Saharan <
> >> surekha.saha...@imply.io>
> >> > wrote:
> >> >
> >> > > Hi Ravi,
> >> > >
> >> > > It's great that you are interested in contributing to Druid!
> >> > >
> >> > > You can do the following to get started:
> >> > > - Checkout the community page here  http://druid.io/community/
> >> > > - Subscribe to the dev list
> >> > > - Checkout the open issues here
> >> > > https://github.com/apache/incubator-druid/issues/ (may be the ones
> >> that
> >> > > are
> >> > > marked easy)
> >> > > - Check this on contributing guidelines :
> >> > >
> https://github.com/apache/incubator-druid/blob/master/CONTRIBUTING.md
> >> > >
> >> > > Good luck,
> >> > > Surekha
> >> > >
> >> > > On Tue, Oct 2, 2018 at 12:04 PM Ravi Kumar Gadagotti <
> >> > > ravikumargadago...@gmail.com> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > My name is Ravi Kumar Gadagotti, and I want to contribute code to
> >> the
> >> > > druid
> >> > > > community and I have no idea where to start and how to start, I
> am a
> >> > java
> >> > > > developer with good amount of experience in Java as well as hadoop
> >> so
> >> > > > please let me know how can I help or contribute to this project.
> >> > > >
> >> > > > Thanks,
> >> > > > Ravi Kumar Gadagotti.
> >> > > >
> >> > >
> >> >
> >>
> >
>


Re: Netty 4.1.x

2018-10-05 Thread Gian Merlino
It sounds good to me.

BTW, we still use netty 3.x for http-client and so it's pretty pervasive.
It coexists with netty 4.x (the packages are different) so there isn't a
conflict at that level. But if we wanted to, like, _fully_ upgrade to netty
4.1.x then it'd involve porting over the http-client.

On Fri, Oct 5, 2018 at 11:20 AM Charles Allen
 wrote:

> https://github.com/apache/incubator-druid/pull/6417 proposes upgrading to
> netty 4.1.x
>
> A lot of the prior issues are likely resolved. Things like java-util are
> part of the druid repository now, the dependent libraries which were still
> using 4.0.x are upgraded (in the PR) to ones using 4.1.x, and Spark's
> latest major version (2.3.x) has netty 4.1
>
> I propose giving netty 4.1.x another shot.
>
> Sound good?
>
> Charles Allen
>


Re: [VOTE] Tranquility 0.8.3 release

2018-10-16 Thread Gian Merlino
Tranquility isn't an Apache project (yet?). It is one of Druid's companion
projects, like pydruid and RDruid, that live in separate git repos with an
independent release process. What is being voted on is the latest commit in
github. Unlike Druid we have typically not done release candidates or very
formal release processes in general for the companion projects. They have a
smaller feel to them and some of them have just a single maintainer, or
even no active maintainer.

It might make sense to migrate some or all of them to Apache at some point.
There hasn't been much discussion about it so I am not sure if there is
really consensus on what to do about them. For now I guess we are
continuing with the 'classic' process for them.

On Mon, Oct 15, 2018 at 8:21 PM Julian Hyde  wrote:

> Can someone please clarify what is going on here. Am I correct that
> Tranquility is not an Apache project? Who is allowed to vote for this
> release - Druid PPMC members?
>
> What is being voted upon? A particular set of artifacts to be released,
> the latest commit in github, or something else? (If it’s not an Apache
> release, I guess I shouldn’t complain that the vote doesn’t follow Apache
> protocol.)
>
> Julian
>
>
> > On Oct 15, 2018, at 7:51 PM, David Lim  wrote:
> >
> > +1
> >
> > On Mon, Oct 15, 2018 at 6:08 PM Fangjin Yang  wrote:
> >
> >> +1
> >>
> >> On Mon, Oct 15, 2018 at 4:40 PM Jihoon Son 
> wrote:
> >>
> >>> +1
> >>>
> >>> Thanks Jon!
> >>>
> >>> Jihoon
> >>>
> >>> On Tue, Oct 16, 2018 at 5:22 AM Jonathan Wei 
> wrote:
> >>>
>  Hi all,
> 
>  I'd like to open a vote for a new Tranquility release, 0.8.3. The new
>  release would have the following improvements and bug fixes:
> 
>  Improvements:
>  * Update Curator and Scala. (#213)
>  * support rollup function in druid 0.9.2 (#210)
>  * Allow customization of zookeeper path through properties. (#215)
>  * Update MMX libraries and replace scala_tools.time (#220)
>  * Exclude deps with *GPL licenses. (#223)
>  * expose sslContext and prefer tlsPort if present (#257)
>  * Support Basic HTTP auth with druid, TLS support for server (#277)
> 
>  Bug fixes:
>  * remove data type and input row parser type binding (#193)
>  * Change default host/port for DruidNode and FlinkBeam (#266)
>  * Thread-safe samza BeamProducer (#228)
> 
>  Notably, this release would allow Tranquility to work with TLS-secured
>  Druid clusters and support Basic HTTP user/pass authentication.
> 
>  Thanks,
>  Jon
> 
> >>>
> >>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: ICLA

2018-10-16 Thread Gian Merlino
That is the understanding we've been applying to Druid itself. My
understanding of ASF policy is that committers need ICLAs, and other
contributors only need "clear intent to contribute", which is established
if the PR author == the code author.

On Tue, Oct 16, 2018 at 12:18 PM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> Hey,
>
> While I'm maintaining PyDruid, I'm wondering whether I should still be
> asking all contributors for an ICLA.  From my understanding, the ASF
> requires an ICLA only for committers, not all contributors (is that
> right?).
>
> Max
>


Re: [VOTE] Release Apache Druid (incubating) 0.13.0 [RC1]

2018-10-30 Thread Gian Merlino
-1 because there are issues tagged for 0.13.0 that are not part of this
release:

- https://github.com/apache/incubator-druid/pull/6512
- https://github.com/apache/incubator-druid/pull/6514
- https://github.com/apache/incubator-druid/pull/6516
- https://github.com/apache/incubator-druid/pull/6508
- https://github.com/apache/incubator-druid/pull/6520
- https://github.com/apache/incubator-druid/issues/6546

I also noticed a doc problem that I haven't filed an issue for yet, which
is that the tutorials download section isn't accurate any longer. It says
to use http://static.druid.io/, but we probably won't put the artifacts
there. And it says to do "cd druid-#{DRUIDVERSION}" but now the directory
name is "apache-druid" not "druid".

I tried to go through an Apache style verification process anyway. Here's
what I looked at.

Source release:
- GPG signature and SHA512 are ok
- Tarball name and structure looks ok
- LICENSE, NOTICE, and DISCLAIMER are present
- Code builds and tests pass (by running "mvn package")
- Cloned a fresh Druid repo, checked out druid-0.13.0-incubating-rc1
(acf15b42778d3a84638193a3b07c6814cf2f35a2), and compared it to the source
release. The source release has two extra files: git.version (expected)
and extensions-core/protobuf-extensions/dependency-reduced-pom.xml
(unexpected). Perhaps the latter should be removed. The source release is
_missing_ a few files that I am ok with them missing, since they don't seem
to be necessary in a source release (.gitignore, .idea, .travis.yml,
eclipse.importorder, eclipse_formatting.xml, publications, upload.sh).

Binary "release" (not really a release, from what I understand, but still
important):
- GPG signature and SHA512 are ok
- Tarball name and structure looks ok. It expands to
"apache-druid-0.13.0-incubating" not "apache-druid-0.13.0-incubating-bin"
but IMO that is fine.
- Did the "tutorial-batch" quickstart and verified data could be loaded and
queried.
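The tarball-vs-git-tag comparison above can also be scripted; a rough sketch using Python's filecmp (directory names are illustrative, not the exact commands used here):

```python
import filecmp
import os

def tree_diff(release_dir, repo_dir):
    """Return (only_in_release, only_in_repo): the file lists that
    'diff -ru' would report as existing on one side only."""
    only_left, only_right = [], []

    def walk(dc, prefix=""):
        only_left.extend(os.path.join(prefix, n) for n in dc.left_only)
        only_right.extend(os.path.join(prefix, n) for n in dc.right_only)
        for name, sub in dc.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(release_dir, repo_dir))
    return only_left, only_right
```

Against an unpacked source release and a checkout of the RC tag, this should surface the same extras (e.g. git.version) and omissions (e.g. .gitignore) described above.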

On Tue, Oct 23, 2018 at 11:56 AM Julian Hyde  wrote:

> You’re right. After I imported the keys, using
>
> $ gpg --recv-keys 58B5D669D2FFD83B37D88DF8BB64B3727183DE56
>
> it worked. I could have done ‘gpg --import KEYS’ also.
>
> Changing my vote to +0 (binding). I would give +1 but the artifacts I am
> asked to review contains a bin.tar.gz and I have no idea how to review
> binary artifacts. Maybe some other reviewers can chime in.
>
> I do know how to find out online how to build Druid. I think it is
> important that src.tar.gz is self-contained, and that includes the
> necessary build instructions. (Just as I really appreciate when a box of
> pasta has “boil for 12 minutes” written on the side.)
>
> I don’t know whether Apache has guidelines for PaxHeaders. I’m just saying
> it looked weird to me. And by the way, emacs tar-mode choked on the tar
> file. Not a blocker, just friction.
>
> There is a way to add headers to .md files. And I claim there is just as
> much creativity in these as in source code. Please fix in the next release.
>
> Julian
>
>
> > On Oct 22, 2018, at 10:04 PM, David Lim  wrote:
> >
> > Hi Julian,
> >
> > I believe the PaxHeader files are the result of extracting a tarball
> built
> > with POSIX tar with a GNU or other variant. I believe it is a result of
> > this configuration:
> >
> https://github.com/apache/incubator-druid/blob/master/distribution/pom.xml#L195
> >
> > Does Apache have guidelines on what variant of tar should be used in
> > generating release artifacts?
> >
> > Also, just thought I would note that we do have published documentation
> on
> > building from source here:
> >
> https://github.com/apache/incubator-druid/blob/master/docs/content/development/build.md
> > - but there are a few statements that should be updated, hence my comment
> > that I'll make sure the docs get updated.
> >
> > Regards,
> > David
> >
> >
> > On Mon, Oct 22, 2018 at 10:20 PM David Lim  wrote:
> >
> >> Hi Julian,
> >>
> >> Thank you for the thorough review!
> >>
> >> For the GPG key, my understanding is that it was expected that users
> would
> >> fetch the key by either running 'gpg --import KEYS' on the KEYS file (as
> >> per https://www.apache.org/dev/release-signing.html#keys-policy) or by
> >> importing it from the Apache phonebook (
> >> https://people.apache.org/keys/committer/davidlim.asc) or grabbing it
> >> from a well-known key server (e.g.
> >> http://pgp.mit.edu/pks/lookup?search=davidlim%40apache.org&op=index).
> Did
> >> this not work for you?
> >>
> >>> src.tar.gz file contains files such as
> >>
> ./PaxHeaders.X/apache-druid-0.13.0-incubating-src_indexing-service_src_test_java_org_apache_druid_s
> >> I will check to see whether these files are expected or not.
> >>
> >> The files you identified from the diff against the git tag are all
> either
> >> expected to be omitted or were generated as part of the source
> packaging.
> >>
> >> For instructions on building the release from source, I mentioned the
> >> command in the original post but it may have been som

Re: Metrics updates in Release Notes

2018-10-31 Thread Gian Merlino
Why not also tag those with "Release Notes"? It makes it a lot easier for
release managers to do their jobs if they just have to look at one label.
(Or two, I guess: "release notes" and "incompatible". But I would be down
to merge them.)

On Wed, Oct 31, 2018 at 9:26 AM David Lim  wrote:

> Thanks Roman. I'm helping with the release this time so I will check the
> PRs with that label and include them in the release notes as appropriate.
>
> As far as I know, there isn't any document like that, but I agree it would
> be quite useful.
>
> On Wed, Oct 31, 2018 at 9:04 AM Roman Leventov 
> wrote:
>
> > It's suggested that the person that prepares Druid Release Notes (I think
> > it's Jon usually) goes through all PRs labelled "Area - Metrics/Event
> > Emitting" (
> >
> >
> https://github.com/apache/incubator-druid/pulls?q=is%3Apr+sort%3Aupdated-desc+label%3A%22Area+-+Metrics%2FEvent+Emitting%22+is%3Aclosed+milestone%3A0.13.0
> > )
> > along with "Release Notes", to present this information in the release
> > notes.
> >
> > BTW I wonder, is there a document in the repository or elsewhere that
> > describes the release process?
> >
>


Re: Metrics updates in Release Notes

2018-10-31 Thread Gian Merlino
I don't think we have a doc about how to do a release, but yeah it would be
great to have it. Dave, would you be able to put it together while you
manage this release? I am sure it will differ substantially from what we've
done in the past, because of the new Apache-ified stuff.

On Wed, Oct 31, 2018 at 10:07 AM Gian Merlino  wrote:

> Why not also tag those with "Release Notes"? It makes it a lot easier for
> release managers to do their jobs if they just have to look at one label.
> (Or two, I guess: "release notes" and "incompatible". But I would be down
> to merge them.)
>
> On Wed, Oct 31, 2018 at 9:26 AM David Lim  wrote:
>
>> Thanks Roman. I'm helping with the release this time so I will check the
>> PRs with that label and include them in the release notes as appropriate.
>>
>> As far as I know, there isn't any document like that, but I agree it would
>> be quite useful.
>>
>> On Wed, Oct 31, 2018 at 9:04 AM Roman Leventov 
>> wrote:
>>
>> > It's suggested that the person that prepares Druid Release Notes (I
>> think
>> > it's Jon usually) goes through all PRs labelled "Area - Metrics/Event
>> > Emitting" (
>> >
>> >
>> https://github.com/apache/incubator-druid/pulls?q=is%3Apr+sort%3Aupdated-desc+label%3A%22Area+-+Metrics%2FEvent+Emitting%22+is%3Aclosed+milestone%3A0.13.0
>> > )
>> > along with "Release Notes", to present this information in the release
>> > notes.
>> >
>> > BTW I wonder, is there a document in the repository or elsewhere that
>> > describes the release process?
>> >
>>
>


Re: Metrics updates in Release Notes

2018-10-31 Thread Gian Merlino
Why is "metrics / event emitting" special, though? Why shouldn't we ask the
release manager to look at _all_ tags just in case? (Which is clearly too
much burden for a release manager -- I'm trying to make an argument, I
guess, that it's fair to push some of the burden to the committer that
originally merged the PR.)

On Wed, Oct 31, 2018 at 10:44 AM Roman Leventov 
wrote:

> I think people will often forget to put both tags, so the person who does
> the release should check the tag Metrics/Event Emitting anyway, just in
> case.
>
> On Wed, 31 Oct 2018 at 18:09, Gian Merlino  wrote:
>
> > I don't think we have a doc about how to do a release, but yeah it would
> be
> > great to have it. Dave, would you be able to put it together while you
> > manage this release? I am sure it will differ substantially from what
> we've
> > done in the past, because of the new Apache-ified stuff.
> >
> > On Wed, Oct 31, 2018 at 10:07 AM Gian Merlino  wrote:
> >
> > > Why not also tag those with "Release Notes"? It makes it a lot easier
> for
> > > release managers to do their jobs if they just have to look at one
> label.
> > > (Or two, I guess: "release notes" and "incompatible". But I would be
> down
> > > to merge them.)
> > >
> > > On Wed, Oct 31, 2018 at 9:26 AM David Lim  wrote:
> > >
> > >> Thanks Roman. I'm helping with the release this time so I will check
> the
> > >> PRs with that label and include them in the release notes as
> > appropriate.
> > >>
> > >> As far as I know, there isn't any document like that, but I agree it
> > would
> > >> be quite useful.
> > >>
> > >> On Wed, Oct 31, 2018 at 9:04 AM Roman Leventov 
> > >> wrote:
> > >>
> > >> > It's suggested that the person that prepares Druid Release Notes (I
> > >> think
> > >> > it's Jon usually) goes through all PRs labelled "Area -
> Metrics/Event
> > >> > Emitting" (
> > >> >
> > >> >
> > >>
> >
> https://github.com/apache/incubator-druid/pulls?q=is%3Apr+sort%3Aupdated-desc+label%3A%22Area+-+Metrics%2FEvent+Emitting%22+is%3Aclosed+milestone%3A0.13.0
> > >> > )
> > >> > along with "Release Notes", to present this information in the
> release
> > >> > notes.
> > >> >
> > >> > BTW I wonder, is there a document in the repository or elsewhere that
> > >> > describes the release process?
> > >> >
> > >>
> > >
> >
>


Re: SegmentId PR

2018-11-12 Thread Gian Merlino
I could take a look after 0.13.0 is released. Right now things related to
that are the main things I am spending my Druid-related time on.

I haven't read most of the diff yet, but I was wondering, is there a reason
you make a new class instead of using SegmentIdentifier? They are slightly
different (one has a ShardSpec and one just has the partition num) but I am
wondering if these need to be two different classes or not.

On Mon, Nov 12, 2018 at 4:51 AM Roman Leventov  wrote:

> Could somebody please provide design review of "Introduce SegmentId class"
> PR : https://github.com/apache/incubator-druid/pull/6370? This is an
> important improvement, and many other improvements and bugs fixes are
> blocked on it. Despite the "Development Blocker" tag (which was meant to give
> PRs a priority), nobody reviewed this PR for almost two months, except Egor,
> who works at the same company as I do.
>


Re: [VOTE] Release Apache Druid (incubating) 0.13.0 [RC3]

2018-11-20 Thread Gian Merlino
When voting please mention what you did to verify the release (see
http://www.apache.org/legal/release-policy.html#release-approval, search on
page for "Before casting +1 binding votes, individuals are required to").

On Tue, Nov 20, 2018 at 1:03 AM Fangjin Yang  wrote:

> +1
>
> On Fri, Nov 16, 2018 at 11:07 PM David Lim  wrote:
>
> > Hi all,
> >
> > I have created a build for Apache Druid (incubating) 0.13.0, release
> > candidate 3.
> >
> > A list of the patches applied can be found here:
> >   Since rc1:
> >
> >
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A%3C2018-11-17T06+
> >   Since rc2:
> >
> >
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A2018-11-16T05..2018-11-17T06+
> >
> > Thanks to everyone who has contributed to this release! You can read the
> > proposed release notes here:
> > https://github.com/apache/incubator-druid/issues/6442
> >
> > The release candidate has been tagged in GitHub as
> > druid-0.13.0-incubating-rc3 (56c97b2), available here:
> >
> >
> https://github.com/apache/incubator-druid/releases/tag/druid-0.13.0-incubating-rc3
> >
> > The artifacts to be voted on are located here:
> >
> >
> https://dist.apache.org/repos/dist/dev/incubator/druid/0.13.0-incubating-rc3/
> >
> > A staged Maven repository is available for review at:
> > https://repository.apache.org/content/repositories/orgapachedruid-1001/
> >
> > Release artifacts are signed with the key [7183DE56]:
> > https://people.apache.org/keys/committer/davidlim.asc (also available
> > here:
> > http://pgp.mit.edu/pks/lookup?search=davidlim%40apache.org&op=index)
> >
> > This key and the key of other committers can also be found in the
> project's
> > KEYS file here:
> > https://dist.apache.org/repos/dist/dev/incubator/druid/KEYS
> >
> > (If you are a committer, please feel free to add your own key to that
> file
> > by following the instructions in the file's header.)
> >
> > Please review the proposed artifacts and vote. As this is our first
> release
> > under the Apache Incubator program, note that Apache has specific
> > requirements that must be met before +1 binding votes can be cast by PMC
> > members. Please refer to the policy at
> > http://www.apache.org/legal/release-policy.html#policy for more details.
> >
> > As part of the validation process, the release artifacts can be generated
> > from source by running: mvn clean install -Papache-release,dist,rat
> >
> > This vote will be open for at least 72 hours. The vote will pass if a
> > majority of at least three +1 PMC votes are cast.
> >
> > Once the vote has passed, the second stage vote will be called on the
> > Apache Incubator mailing list to get approval from the Incubator PMC.
> >
> > [ ] +1 Release this package as Apache Druid (incubating) 0.13.0
> > [ ]  0 I don't feel strongly about it, but I'm okay with the release
> > [ ] -1 Do not release this package because...
> >
> > Thanks!
> >
>


Re: Sync up this week

2018-11-20 Thread Gian Merlino
IMO, minutes would be good. We did recordings in the past and I thought
they were a bit of a struggle (it was sometimes tough to get them working
right, since different people hosted from week to week, we didn't always
use a consistent hosting platform, and we would end up having conversations
that weren't recorded as a result…)

I suppose that means the first order of business for every hangout should
be to nominate a note taker for that day.

On Tue, Nov 13, 2018 at 7:52 PM Julian Hyde  wrote:

> Apache doesn’t have a policy on video per se, but we do have a policy that
> conversations should be open to all. Due to time-zones etc. some people may
> not be able to attend the dev syncs, and those people must not feel
> excluded.
>
> If consent is preventing you from recording the dev syncs, perhaps you can
> state in the invite that attendance implies consent?
>
> Or do what the Drill community does: send out minutes after each meeting.
> See “Hangout” on https://drill.apache.org/community-resources/.
>
> Also remember that in Apache, decisions must be made on-list. The dev sync
> is a great place to have discussions and share information, but you must
> bring the discussion onto the dev list if there are decisions to be made.
>
> Julian
>
>
> > On Nov 13, 2018, at 4:43 PM, Charles Allen 
> wrote:
> >
> > We had been recording it into youtube, but something broke and the
> youtube
> > channel didn't work anymore. No one had the time or willpower to fix it.
> I
> > know I won't be able to release video that I record without getting a few
> > forms filled out by everyone involved. I don't know if ASF has any policy
> > on releasing videos.
> >
> >
> >
> > On Tue, Nov 13, 2018 at 4:33 PM Jun Rao  wrote:
> >
> >> Hi, Charles,
> >>
> >> Are those dev sync meetings being recorded? It would be useful to make
> the
> >> recordings or at least the meeting notes available for public access
> (e.g.
> >> Apache wiki).
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Tue, Nov 13, 2018 at 9:15 AM Charles Allen
> >>  wrote:
> >>
> >>> Hi all!
> >>>
> >>> I have an off-site today so will not be able to host the sync up. Is
> >> anyone
> >>> else able to host?
> >>>
> >>> Thank you,
> >>> Charles Allen
> >>>
> >>
>
>


Re: [VOTE] Release Apache Druid (incubating) 0.13.0 [RC3]

2018-11-27 Thread Gian Merlino
+1

Source release:

- GPG signature and SHA512 are ok
- Tarball name is ok
- git.version file looks ok (references tag druid-0.13.0-incubating-rc3)
- LICENSE, NOTICE, and DISCLAIMER are present
- Tarball contents match git tag druid-0.13.0-incubating-rc3 (no unexpected
extra files, no critical files missing, all others match according to diff
-ru)
- Code builds and tests pass (by running "mvn package")

Binary:

- GPG signature and SHA512 are ok
- Tarball name and structure are ok
- Was able to start the services and run through the "tutorial-batch" and
"tutorial-query" tutorials

On Fri, Nov 16, 2018 at 11:07 PM David Lim  wrote:

> Hi all,
>
> I have created a build for Apache Druid (incubating) 0.13.0, release
> candidate 3.
>
> A list of the patches applied can be found here:
>   Since rc1:
>
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A%3C2018-11-17T06+
>   Since rc2:
>
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A2018-11-16T05..2018-11-17T06+
>
> Thanks to everyone who has contributed to this release! You can read the
> proposed release notes here:
> https://github.com/apache/incubator-druid/issues/6442
>
> The release candidate has been tagged in GitHub as
> druid-0.13.0-incubating-rc3 (56c97b2), available here:
>
> https://github.com/apache/incubator-druid/releases/tag/druid-0.13.0-incubating-rc3
>
> The artifacts to be voted on are located here:
>
> https://dist.apache.org/repos/dist/dev/incubator/druid/0.13.0-incubating-rc3/
>
> A staged Maven repository is available for review at:
> https://repository.apache.org/content/repositories/orgapachedruid-1001/
>
> Release artifacts are signed with the key [7183DE56]:
> https://people.apache.org/keys/committer/davidlim.asc (also available
> here:
> http://pgp.mit.edu/pks/lookup?search=davidlim%40apache.org&op=index)
>
> This key and the key of other committers can also be found in the project's
> KEYS file here:
> https://dist.apache.org/repos/dist/dev/incubator/druid/KEYS
>
> (If you are a committer, please feel free to add your own key to that file
> by following the instructions in the file's header.)
>
> Please review the proposed artifacts and vote. As this is our first release
> under the Apache Incubator program, note that Apache has specific
> requirements that must be met before +1 binding votes can be cast by PMC
> members. Please refer to the policy at
> http://www.apache.org/legal/release-policy.html#policy for more details.
>
> As part of the validation process, the release artifacts can be generated
> from source by running: mvn clean install -Papache-release,dist,rat
>
> This vote will be open for at least 72 hours. The vote will pass if a
> majority of at least three +1 PMC votes are cast.
>
> Once the vote has passed, the second stage vote will be called on the
> Apache Incubator mailing list to get approval from the Incubator PMC.
>
> [ ] +1 Release this package as Apache Druid (incubating) 0.13.0
> [ ]  0 I don't feel strongly about it, but I'm okay with the release
> [ ] -1 Do not release this package because...
>
> Thanks!
>


Re: Weekly dev sync minutes (2018-11-27)

2018-11-27 Thread Gian Merlino
Thanks, Dave!

On Tue, Nov 27, 2018 at 10:51 AM David Lim  wrote:

> Attendees: David Lim, Jihoon Son, Atul, Clint Wylie, Eyal Yurman, Roman
> Leventov
>
> Following the discussion after last week's sync, we will be taking meeting
> minutes of the weekly dev sync going forward so that everyone in the
> community can be informed. At the start of each meeting, the moderator
> should issue a call for a volunteer who will be responsible for taking the
> minutes and posting them to the dev mailing list afterwards.
>
> On the call today, we decided that posting the minutes to the dev mailing
> list (as opposed to adding them to a wiki, Google docs, etc.) would be
> beneficial through increased visibility and community engagement. We may
> also want to create an index page with links to the mail archive so that
> previous minutes can be located easily.
>
> Other matters that were brought up:
>
> 0.13.0-rc3 has been out for about 10 days now with no major issues
> reported. Imply has been running that build for the past week; Atul
> mentioned that he has deployed 0.13.0-rc1 to their cluster and it has been
> running without issues (they plan to upgrade to rc3 this week).
>
> If no issues come up in the next few days, we will present 0.13.0-rc3 to
> the incubator PMC for their vote on releasing it as Apache Druid 0.13.0.
>
> Roman wanted to highlight issue
> https://github.com/apache/incubator-druid/issues/ which identifies the
> PasswordProvider abstraction as a potential source of race condition
> related issues. We decided on the call that this is not a blocker for
> 0.13.0.
>


Re: Podling Report Reminder - December 2018

2018-11-29 Thread Gian Merlino
Is anyone willing to volunteer to pick up December's report?

On Thu, Nov 29, 2018 at 1:12 PM  wrote:

> Dear podling,
>
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
>
> The board meeting is scheduled for Wed, 19 December 2018, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, December 05).
>
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
>
> Candidate names should not be made public before people are actually
> elected, so please do not include the names of potential committers or
> PPMC members in your report.
>
> Thanks,
>
> The Apache Incubator PMC
>
> Submitting your Report
>
> --
>
> Your report should contain the following:
>
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
> the project or necessarily of its field
> *   A list of the three most important issues to address in the move
> towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
> aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> *   How does the podling rate their own maturity.
>
> This should be appended to the Incubator Wiki page at:
>
> https://wiki.apache.org/incubator/December2018
>
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
>
> Mentors
> ---
>
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
>
> Incubator PMC
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Resume making timely releases

2018-12-04 Thread Gian Merlino
Agree. I think the only reason we stopped was the ASF migration.

On Tue, Dec 4, 2018 at 5:15 AM Roman Leventov  wrote:

> I suggest resuming making quarterly releases. 0.13 branch was created
> between October 12 and 19, as far as I can see. Then 0.14 branch should be
> created between 12 and 19 of January.
>


Re: [VOTE] Release Apache Druid (incubating) 0.13.0 [RC4]

2018-12-04 Thread Gian Merlino
+1

- Verified signatures and checksums of both src and bin packages.
- Source tarball matches tag.
- Source tarball builds and tests pass.
- Ran through quickstart on binary tarball.

On Mon, Dec 3, 2018 at 5:57 PM benedictjin2...@gmail.com <
benedictjin2...@gmail.com> wrote:

>
>
> On 2018/12/02 09:50:55, David Lim  wrote:
> > Hi all,
> >
> > I have created a build for Apache Druid (incubating) 0.13.0, release
> > candidate 4.
> >
> > A list of the patches applied can be found here:
> >   Since rc1:
> >
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A%3C2018-11-17T06+
> >   Since rc2:
> >
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A2018-11-16T05..2018-11-17T06+
> >   Since rc3:
> >
> https://github.com/apache/incubator-druid/pulls?utf8=%E2%9C%93&q=is%3Apr+base%3A0.13.0-incubating+merged%3A2018-11-30..2018-12-02+
> >
> > Thanks to everyone who has contributed to this release! You can read the
> > proposed release notes here:
> > https://github.com/apache/incubator-druid/issues/6442
> >
> > The release candidate has been tagged in GitHub as
> > druid-0.13.0-incubating-rc4 (cf15aac), available here:
> >
> https://github.com/apache/incubator-druid/releases/tag/druid-0.13.0-incubating-rc4
> >
> > The artifacts to be voted on are located here:
> >
> https://dist.apache.org/repos/dist/dev/incubator/druid/0.13.0-incubating-rc4
> >
> > A staged Maven repository is available for review at:
> > https://repository.apache.org/content/repositories/orgapachedruid-1002/
> >
> > Release artifacts are signed with the key [7183DE56]:
> > https://people.apache.org/keys/committer/davidlim.asc (also available
> here:
> > http://pgp.mit.edu/pks/lookup?search=davidlim%40apache.org&op=index)
> >
> > This key and the key of other committers can also be found in the
> project's
> > KEYS file here:
> https://dist.apache.org/repos/dist/dev/incubator/druid/KEYS
> >
> > Please review the proposed artifacts and vote. As this is our first
> release
> > under the Apache Incubator program, note that Apache has specific
> > requirements that must be met before +1 binding votes can be cast by PMC
> > members. Please refer to the policy at
> > http://www.apache.org/legal/release-policy.html#policy for more details.
> >
> > As part of the validation process, the release artifacts can be generated
> > from source by running: mvn clean install -Papache-release,dist,rat
> >
> > This vote will be open for at least 72 hours. The vote will pass if a
> > majority of at least three +1 PMC votes are cast.
> >
> > Once the vote has passed, the second stage vote will be called on the
> > Apache Incubator mailing list to get approval from the Incubator PMC.
> >
> > [ ] +1 Release this package as Apache Druid (incubating) 0.13.0
> > [ ]  0 I don't feel strongly about it, but I'm okay with the release
> > [ ] -1 Do not release this package because...
> >
> > Starting with my +1 (binding)
> >
> > Thanks!
> > +1
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: [ANNOUNCE] Apache Druid (incubating) 0.13.0 released

2018-12-18 Thread Gian Merlino
A big milestone!!

Thanks Dave for capably wearing the release manager hat, and thanks to
everyone else that contributed!!

On Tue, Dec 18, 2018 at 12:37 PM David Lim  wrote:

> We're happy to announce the release of Apache Druid (incubating) 0.13.0!
>
> Druid 0.13.0-incubating contains over 400 new features,
> performance/stability/documentation improvements, and bug fixes from 81
> contributors. It is the first release of Druid in the Apache Incubator
> program. Major new features and improvements include:
>
>   - native parallel batch indexing
>   - automatic segment compaction
>   - system schema tables
>   - improved indexing task status, statistics, and error reporting
>   - SQL-compatible null handling
>   - result-level broker caching
>   - ingestion from RDBMS
>   - Bloom filter support
>   - additional SQL result formats
>   - additional aggregators (stringFirst/stringLast, ArrayOfDoublesSketch,
> HllSketch)
>   - support for multiple grouping specs in groupBy query
>   - mutual TLS support
>   - HTTP-based worker management
>   - broker backpressure
>   - maxBytesInMemory ingestion tuning configuration
>   - materialized views (community extension)
>   - parser for Influx Line Protocol (community extension)
>   - OpenTSDB emitter (community extension)
>
> Source and binary distributions can be downloaded from:
> https://druid.apache.org/downloads.html
>
> Release notes are at:
>
> https://github.com/apache/incubator-druid/releases/tag/druid-0.13.0-incubating
>
> A big thank you to all the contributors in this milestone release!
>


Re: Drop 0. from the version

2018-12-20 Thread Gian Merlino
I think it's a good point. Culturally we have been willing to break
extension APIs for relatively small benefits. But we have generally been
unwilling to make breaking changes on the operations side quite so
liberally. Also, most cluster operators don't have their own custom
extensions, in my experience. So it does make sense to differentiate them.
I'm not sure how best to differentiate them, though. It could be
done through the version number (only increment the major version for
operations breaking changes) or it could be done through an "upgrading"
guide in the documentation (increment the major version for operations or
extension breaking changes, but, have a guide that tells people which
versions have operations breaking changes to aid in upgrades).

Coming back to the question in the subject of your mail: IMO, for
"graduation" out of 0.x, we should talk as a community about what that
means to us. It is a milestone that on the one hand, doesn't mean much, but
on the other hand, can be deeply symbolic. Some things that it has meant to
other projects:

1) Production readiness. Obviously Druid is well past this. If this is what
dropping the 0. means, then we should do it immediately.

2) Belief that the APIs have become relatively stable. Like you said, the
extension APIs don't seem particularly close to stable, but maybe that's
okay. However, the pace of breaking changes on the operations and query
side for non-experimental features has been relatively calm for the past
couple of years, so if we focus on that then we can make a case here.

3) Completeness of vision. This one is the most interesting to me. I
suspect that different people in the community have different visions for
Druid. It is also the kind of project that may never truly be complete in
vision (in principle, the platform could become a competitive data
warehouse, search engine, etc, …). For what it's worth, my vision of Druid
for the next year at least involves robust stream ingestion being a first
class ingestion method (Kafka / Kinesis indexing service style) and SQL
being a first class query language. These are both, today, still
experimental features. So are lookups. All of these 3 features, from what I
can see, are quite popular amongst Druid users despite being experimental.
For a 'completeness of vision' based 1.0 I would want to lift all of those
out of experimental status and, for SQL in particular, to have its
functionality rounded out a bit more (to support the native query features
it doesn't currently support, like multi-value dimensions, datasketches,
etc).

4) Marketing / timing. Like, doing a 1.0 around the time we graduate from
the Incubator. Not sure how much this really matters, but projects do it
sometimes.

Another question is, how often do we intend to rev the version? At the rate
we're going, we rev 2-3 major versions a year. Would we intend to keep that
up, or slow it down by making more of an effort to avoid breaking changes?

On Thu, Dec 20, 2018 at 2:17 PM Roman Leventov 
wrote:

> It may also make sense to distinguish "operations" breaking changes from
> API breaking changes. Operations breaking changes establish the minimum
> cadence of Druid cluster upgrades, that allow rolling Druid versions back
> and forward. I. e. it's related to segment format, the format of the data
> kept in ZooKeeper and the SQL database, or events such as stopping support
> of ZooKeeper for certain things (e. g. forcing using of HTTP
> announcements). So Druid cluster operators cannot update Druid from version
> X to version Z skipping the version Y, if both Y and Z have some operations
> breaking changes. (Any such changes should support rollback options at
> least until the next version with operations breaking changes.)
>
> API breaking changes are just changes in Druid extensions APIs. Druid
> cluster operators could skip any number of releases with such breaking
> changes, as long as their extension's code is updated for the latest
> version of API.
>
> On Thu, 20 Dec 2018 at 20:03, Roman Leventov  wrote:
>
> > It doesn't seem to me that Druid API is going to stabilize in the near
> > future (if ever), because there are so many extension points and
> something
> > is broken in every release. On the other hand, Druid is not Hadoop or
> > Spark, which have application APIs. Druid's API is for extensions, not
> > applications. It is used by people who are closer to Druid development
> and
> > fixing their extensions is routine.
> >
> > With that, I think it makes sense to drop "0." from the Druid version and
> > call it Druid 14, Druid 15, etc.
> >
>


Re: Bug report!

2018-12-20 Thread Gian Merlino
Hey Mike,

I would look to Hive to fix this - it should be able to handle either a 0
or 0.0 in the response equally well. I suppose I wouldn't consider it to be
a bug in Druid.
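To illustrate the client-side fix being suggested, a JSON consumer can normalize numeric fields instead of assuming a particular rendering. This is a generic sketch (the field names are hypothetical, and it is neither Druid nor Hive code) that coerces any JSON number to a float so that `0` and `0.0` are treated the same:

```python
import json

def normalize_doubles(row, double_fields):
    """Coerce the given fields to float so that 0 and 0.0 are equivalent."""
    for field in double_fields:
        value = row.get(field)
        # Some serializers render 0.0 as 0; accept both int and float
        # (bool is excluded because it is a subclass of int in Python).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            row[field] = float(value)
    return row

row = json.loads('{"metric": 0, "count": 3}')
normalize_doubles(row, ["metric"])  # row["metric"] is now a float
```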

On Mon, Dec 17, 2018 at 10:15 AM mike  wrote:

> Hello,  Could anybody give me a hand?
>
> Recently, I upgraded my Druid server from 0.9.2 to the latest 0.12.3,
> and I ran into trouble when I ran my previous application against the
> new Druid. It worked fine on 0.9.2. After checking the logs, I found
> that for double fields in my schema, it returned 0 instead of 0.0 in the
> query result, which made my JSON parser unhappy and threw an exception
> something like this:
>
> 18/12/11 19:36:37 ERROR thriftserver.SparkExecuteStatementOperation: Error
> running hive query:
> org.apache.hive.service.cli.HiveSQLException:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 11.0 (TID 11, localhost): java.lang.ClassCastException: scala.math.BigInt
> cannot be cast to java.lang.Double
> at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
> at
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
> at
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
>
>
> Could anybody respond to me, fix this bug, or help me get out of this
> trouble? Thanks a million!
>
> Michael


Re: Drop 0. from the version

2018-12-21 Thread Gian Merlino
I'm not too fussy around whether we do a 1.0 or simply drop the 0. and have
it be a 14.0 or 15.0 or 16.0 or wherever we are at the time we do it. I
also like the quarterly cadence of release-from-master we had before we got
blocked on the ASF transition, and would like to pick that back up again
(with the next branch cut from master at the end of January, since we did
the 0.13.0 branch cut in late October).

Seems to me that good points of discussion are, what should we use as the
rule for incrementing the major version? Do we do what we've been doing
(incrementing whenever there's either an incompatible change in extension
APIs, or in query APIs, or when necessary to preserve the ability to always
be able to roll forward/back one major release). Or do we do something else
(Roman seems to be suggesting dropping extension APIs from consideration).

And also, what does 1.0 or 14.0 or 15.0 or what-have-you mean to us? Is it
something that should be tied to ASF graduation? Completeness of vision?
Stability of APIs or operational characteristics? Something else? You are
right that it is sort of a marketing/mentality thing, so it's an
opportunity for us to declare that we feel Druid has reached some
milestone. My feeling at this time is probably ASF graduation or
completeness of vision (see my earlier mail for thoughts there) are the
ones that make most sense to me.
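The roll-forward/roll-back constraint mentioned above can be stated as a simple rule: an upgrade (or rollback) path is only safe if no hop skips a major version. A hypothetical sketch of that check, under the simplifying assumption that every major release may contain operations-breaking changes:

```python
def upgrade_path_is_safe(versions):
    """Check that each hop in an upgrade/rollback path moves at most one
    major version, per the roll-forward/roll-back compatibility rule."""
    majors = [int(v.split(".")[0]) for v in versions]
    return all(abs(b - a) <= 1 for a, b in zip(majors, majors[1:]))

upgrade_path_is_safe(["13.0", "14.1", "15.0"])  # consecutive majors: safe
upgrade_path_is_safe(["13.0", "15.0"])          # skips 14.x: not safe
```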

On Fri, Dec 21, 2018 at 10:41 AM Charles Allen  wrote:

> Is there any feeling in the community that the logic behind the releases
> needs to change?
>
> If so then I think we should discuss what that release cadence needs to
> look like.
>
> If not then dropping the 0. prefix is a marketing / mental item. Kind of
> like the 3.x->4.x Linux kernel upgrade. If this is the case then would we
> even want to go with 1.x? I think Roman's proposal would work fine in this
> case. Where we just call it Apache Druid 14 (or 15 or whatever it is when
> we get there) and just keep the same logic for when we release stuff, which
> has been something like:
>
> For an X.Y release, going to an X.? release should be very straightforward
> for anyone running stock Druid.
> For an X.Y release, going to an (X+1).? release, or from an (X+1).? release
> back to X.Y, should be feasible. It might require running a tool supported
> by the community.
> For an X.Y release, going to an (X+2).? or an (X-2).? release is not
> supported. Some things that will not have migration tools might print
> warning logs saying that the functionality will change (should we change
> these to alerts?)
>
> If this sounds reasonable then jumping straight to Apache Druid 14 on the
> first official apache release would make a lot of sense.
>
> Cheers,
> Charles Allen
>
>
> On Thu, Dec 20, 2018 at 11:07 PM Gian Merlino  wrote:
>
> > I think it's a good point. Culturally we have been willing to break
> > extension APIs for relatively small benefits. But we have generally been
> > unwilling to make breaking changes on the operations side quite so
> > liberally. Also, most cluster operators don't have their own custom
> > extensions, in my experience. So it does make sense to differentiate
> them.
> > I'm not sure how it makes sense to differentiate them, though. It could
> be
> > done through the version number (only increment the major version for
> > operations breaking changes) or it could be done through an "upgrading"
> > guide in the documentation (increment the major version for operations or
> > extension breaking changes, but, have a guide that tells people which
> > versions have operations breaking changes to aid in upgrades).
> >
> > Coming back to the question in the subject of your mail: IMO, for
> > "graduation" out of 0.x, we should talk as a community about what that
> > means to us. It is a milestone that on the one hand, doesn't mean much,
> but
> > on the other hand, can be deeply symbolic. Some things that it has meant
> to
> > other projects:
> >
> > 1) Production readiness. Obviously Druid is well past this. If this is
> what
> > dropping the 0. means, then we should do it immediately.
> >
> > 2) Belief that the APIs have become relatively stable. Like you said, the
> > extension APIs don't seem particularly close to stable, but maybe that's
> > okay. However, the pace of breaking changes on the operations and query
> > side for non-experimental features has been relatively calm for the past
> > couple of years, so if we focus on that then we can make a case here.
> >
> > 3) Completeness of vision. This one is the most interesting to me. I
> > suspect that different people in the community have different visions for
> > Druid. It is also the kin

Re: Off list major development

2019-01-02 Thread Gian Merlino
In this particular case: please consider the PR as a proposal. Don't feel
like just because there is code there that takes a certain approach, that
the approach is somehow sacred. I had to implement something to crystallize
my own thinking about how the problem could be approached. I won't be
disappointed if, as a community, we decide a different direction is better
and the code all gets thrown away. That's one of the reasons that I removed
the 0.14.0 milestone that was added to the patch. (I don't want to rush it,
nor do I think that's a good idea.)

In general: Sounds like we could do with some more formalization around
what a proposal looks like, which sorts of changes need one, and when in
the dev cycle it is appropriate. FWIW I think Kafka's process is more or
less fine, and would be okay with adopting it for Druid if people like it.
Right now our standards for what requires a "design review" are very
similar to the Kafka community standards for what requires a KIP, so we
have some familiarity with those concepts. However we don't separate PR
review and proposal discussion as strictly as they do, which seems to be
the foundation for the feeling of exclusion that is being felt here.

Separately: I just redid the description on
https://github.com/apache/incubator-druid/pull/6794 to be more proposal-y.
I followed the KIP style:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals.
Please refresh the page and see if it looks more useful.

Gian

On Wed, Jan 2, 2019 at 10:52 AM Julian Hyde  wrote:

> Slim,
>
> I agree with your points that offline development is bad for community.
> But I don’t think you need much mentor help. You have raised valid issues
> and the Druid community needs to decide what its development practices
> should be.
>
> Julian
>
>
> > On Jan 2, 2019, at 10:29 AM, Slim Bouguerra  wrote:
> >
> > Hello everyone and hope you all have very good holidays.
> >
> > First, this email is not directed at the author or the PR
> > https://github.com/apache/incubator-druid/pull/6794 itself, but I see
> > this PR as a perfect example.
> >
> > One of the foundations of the Apache Way, or what I would simply call
> > open-source community-driven development, is that "technical decisions
> > are discussed, decided, and archived publicly."
> > This means that big technical changes such as the one brought by #6794
> > should have started as a proposal and a round of discussions about the
> > major design changes, not as 11K lines of code.
> > I believe such openness will promote a lot of good benefits, such as:
> >
> > - ensures community health and growth.
> > - ensures everyone can participate, not only the author and his
> > co-workers.
> > - ensures that the project is driven by the community and not a given
> > company or an individual.
> > - ensures that there is consensus (not saying 100% agreement;) however it
> > means that all individuals will accept the current progress on the
> project
> > until some better proposal is put forth.
> >
> > Personally, such a big offline PR makes me feel excluded and doesn't
> > give me a sense that I belong to a community at all.
> >
> > To prevent such off-list development, I think as a Druid community we
> > need to stick to the Apache way: “If it didn’t happen on the mailing
> > list, it didn’t happen.”
> >
> > I would appreciate it if one of the Apache mentors could help with this.
> > Thanks
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Writing a Druid extension

2019-01-02 Thread Gian Merlino
Some other comments,

For 3) it is not safe to assume that QueryLifecycleFactory::runSimple
always returns org.apache.druid.data.input.Row. It does for groupBy queries
but not for other query types. The SQL layer has a bunch of code to adapt
the various query type's return types into the uniform format required by
SQL; check out QueryMaker.

For 4) consider if you only need to support Avatica, or if you want to
support the basic HTTP endpoint too (SqlResource).

On Wed, Jan 2, 2019 at 10:34 AM Charles Allen
 wrote:

> We have a functional gRPC extension for brokers internally. Let me see if I
> can get approval for releasing it.
>
> For the explicit answers:
>
> 1) Guava 16
>
> Yep, druid is stuck on it due to hadoop.
> https://github.com/apache/incubator-druid/pull/5413 is the only outstanding
> issue I know of that would allow a very wide swath of Guava versions to be
> used. Once a solution for the same-thread executor service gets into place,
> then you should be able to modify your local deployment to whatever guava
> version fits with your indexing config.
>
> 2) Group By thread processing
>
> You picked the hardest one here :) there is all kinds of multi-threaded fun
> that can show up when dealing with group by queries. If you want a good
> dive into this I suggest checking out
> https://github.com/apache/incubator-druid/pull/6629 which will put you
> straight into the weeds of it all.
>
> 3) Yielder / Sequence type safety
>
> Yeah... I don't have any good info there other than "things aren't
> currently broken". There are some really nasty and hacky type casts related
> to by segment sequences if you start digging around the code.
>
> 4) Calcite Proto
>
> This is a great question. I imagine getting a Calcite Proto SQL endpoint
> setup in an extension wouldn't be too hard, but have not tried such a
> thing. This one would probably be worth having its own discussion thread
> (maybe an issue?) on how to handle.
>
> You are on the right track!
> Charles Allen
>
> On Sat, Dec 29, 2018 at 11:59 PM Nikita Dolgov wrote:
>
> > I was experimenting with a Druid extension prototype and encountered some
> > difficulties. The experiment is to build something like
> > https://github.com/apache/incubator-druid/issues/3891 with gRPC.
> >
> > (1) Guava version
> >
> > Druid relies on 16.0.1 which is a very old version (~4 years). My only
> > guess is another transitive dependency (Hadoop?) requires it. The
> earliest
> > version used by gRPC from three years ago was 19.0. So my first question
> is
> > if there are any plans for upgrading Guava any time soon.
> >
> > (2) Druid thread model for query execution
> >
> > I played a little with calling
> > org.apache.druid.server.QueryLifecycleFactory::runSimple under debugger.
> > The stack trace was rather deep to reverse engineer easily so I'd like to
> > ask directly instead. Would it be possible to briefly explain how many
> > threads (and from which thread pool) it takes on a broker node to
> process,
> > say, a GroupBy query?
> >
> > At the very least I'd like to know if calling
> > QueryLifecycleFactory::runSimple on a thread from some "query processing
> > pool" is better than doing it on the IO thread that received the query.
> >
> > (3) Yielder
> >
> > Is it safe to assume that QueryLifecycleFactory::runSimple always returns
> > a Yielder ? QueryLifecycle omits generic
> > types rather liberally when dealing with Sequence instances.
> >
> > (4) Calcite integration
> >
> > Presumably Avatica has an option of using protobuf encoding for the
> > returned results. Is it true that Druid cannot use it?
> > On a related note, any chance there was something written down about
> > org.apache.druid.sql.calcite ?
> >
> > Thank you
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


Re: Batch Ingestion

2019-01-02 Thread Gian Merlino
Hey Satya,

The easiest way to ingest data is to ask Druid to pull it from Kafka (
http://druid.io/docs/latest/tutorials/tutorial-kafka.html) or Hadoop (
http://druid.io/docs/latest/tutorials/tutorial-batch-hadoop.html).
Tranquility Server (
https://github.com/druid-io/tranquility/blob/master/docs/server.md) is the
closest thing to what you are looking for, but I'd prefer going with the
Kafka method at this point, since it has a number of advantages. See
https://imply.io/post/exactly-once-streaming-ingestion for more details on
that.
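For a concrete starting point, a Kafka ingestion is configured by submitting a supervisor spec to the Overlord, along these lines. This is a trimmed sketch: the datasource, topic, columns, and broker address are placeholders, and field names should be checked against the tutorial linked above for your Druid version.

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": { "dimensions": ["page", "user"] }
      }
    },
    "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "topic": "events",
    "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
  }
}
```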

On Wed, Jan 2, 2019 at 3:44 PM sat...@gmail.com  wrote:

> I am looking for details on how I can ingest data into a Druid cluster. I
> need to collect the data and push it via REST. From the documentation it
> appears the HTTPFirehose would be the best option, but I am not able to
> find more details on it.
>
> Can anyone point me to sample config and detailed steps on how to
> configure the same?
>
> I would very much appreciate your help.
>
> Thanks
>
>
>


Re: Off list major development

2019-01-02 Thread Gian Merlino
One of the advantages I see with a more formal process (like Kafka KIPs) is
that it levels the playing field a bit and sets some ground rules for
working together. In a way it can help encourage contributions by making it
clear what is expected of potential contributors.

We have a design review process today that is not as formal as KIPs, but is
somewhat heavier than the one you describe. Maybe we could tweak our
current one by starting to do design reviews separately from PRs. i.e., for
anything that meets our 'design review' criteria, do that on the dev list
or in a separate issue, and keep the PR focused on code-level stuff. That
way we don't end up trying to do both at once. And it makes it easier to
start talking about design before the code is ready, which would be better.

On Wed, Jan 2, 2019 at 6:10 PM Julian Hyde  wrote:

> It’s really hard to say no to a contribution when someone has put in a
> significant amount of work.
>
> The following approach is simple and works really well: Before you start
> work, log a case, describing the problem. When you have some ideas about
> design, add those to the case. When you have a code branch, add its URL to
> the case. And so forth. At any point in the proceedings, people can chime
> in with their opinions.
>
> In my opinion, a formal “design review” process is not necessary. Just
> build consensus iteratively, by starting the conversation early in the
> process.
>
> Julian
>
>
> > On Jan 2, 2019, at 12:37 PM, Gian Merlino  wrote:
> >
> > In this particular case: please consider the PR as a proposal. Don't feel
> > like just because there is code there that takes a certain approach, that
> > the approach is somehow sacred. I had to implement something to
> crystallize
> > my own thinking about how the problem could be approached. I won't be
> > disappointed if, as a community, we decide a different direction is
> better
> > and the code all gets thrown away. That's one of the reasons that I
> removed
> > the 0.14.0 milestone that was added to the patch. (I don't want to rush
> it,
> > nor do I think that's a good idea.)
> >
> > In general: Sounds like we could do with some more formalization around
> > what a proposal looks like, which sorts of changes need one, and when in
> > the dev cycle it is appropriate. FWIW I think Kafka's process is more or
> > less fine, and would be okay with adopting it for Druid if people like
> it.
> > Right now our standards for what requires a "design review" are very
> > similar to the Kafka community standards for what requires a KIP, so we
> > have some familiarity with those concepts. However we don't separate PR
> > review and proposal discussion as strictly as they do, which seems to be
> > the foundation for the feeling of exclusion that is being felt here.
> >
> > Separately: I just redid the description on
> > https://github.com/apache/incubator-druid/pull/6794 to be more
> proposal-y.
> > I followed the KIP style:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> .
> > Please refresh the page and see if it looks more useful.
> >
> > Gian
> >
> > On Wed, Jan 2, 2019 at 10:52 AM Julian Hyde  wrote:
> >
> >> Slim,
> >>
> >> I agree with your points that offline development is bad for community.
> >> But I don’t think you need much mentor help. You have raised valid
> issues
> >> and the Druid community needs to decide what its development practices
> >> should be.
> >>
> >> Julian
> >>
> >>
> >>> On Jan 2, 2019, at 10:29 AM, Slim Bouguerra  wrote:
> >>>
> >>> Hello everyone and hope you all have very good holidays.
> >>>
> >>> First, this email is not directed at the author or the PR
> >>> https://github.com/apache/incubator-druid/pull/6794 itself, but I see
> >>> this PR as a perfect example.
> >>>
> >>> One of the foundations of the Apache Way, or what I would simply call
> >>> open source community-driven development, is that "Technical decisions
> >>> are discussed, decided, and archived publicly."
> >>> Which means that big technical changes such as the one brought by #6794
> >>> should have started as a proposal and a round of discussions about the
> >>> major design changes, not as 11K lines of code.
> >>> I believe such openness will bring a lot of good benefits, such as:
> >>>
> >>

Druid 0.14 timing

2019-01-04 Thread Gian Merlino
It feels like 0.13.0 was just recently released, but it was branched off
back in October, and it has almost been 3 months since then. How do we feel
about doing an 0.14 branch cut at the end of January (Thu Jan 31) - going
back to the every 3 months cycle?

For this release, based on the feedback we got from the Incubator vote last
time, we'll need to fix up the LICENSE and NOTICE issues that were flagged
but waved through for our first release. (Justin said he would have -1'd
based on that if it was anything beyond a first release.)


Re: HTTPFirehose

2019-01-04 Thread Gian Merlino
I have, but mostly for toy stuff (like demos). It reads a data file from an
HTTP server and puts it through the normal batch ingestion flow.

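For the record, the HttpFirehose sits inside a native batch "index" task. A trimmed sketch of the relevant portion might look like the following; the URI is a placeholder, and the exact field names should be checked against the docs for your Druid version:

```json
{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "http",
        "uris": ["https://example.com/data/sample-events.json.gz"]
      }
    }
  }
}
```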

On Fri, Jan 4, 2019 at 4:06 PM sat...@gmail.com  wrote:

> Hi,
>
> Has anyone used HTTPFirehose to ingest the data in Druid? If so can you
> please share the details.
>
> Thanks
>
>
>
>
>


Re: PR Milestone policy

2019-01-07 Thread Gian Merlino
My feeling is that setting a milestone on PRs before they're merged is a
way of making their authors feel more included. I don't necessarily see a
problem with setting milestones optimistically and then, when a release
branch is about to be cut (based on the timed release date), we bulk-update
anything that hasn't been merged yet to the next milestone.

However, there are other ways to make authors feel more included. If we end
up doing a more formalized proposal process then this helps too. (It should
be easier for people to comment on proposals than on PRs, since there isn't
a need to read code.)

I guess I'm not really fussy either way on this one.

On Wed, Dec 12, 2018 at 10:27 PM 邱明明  wrote:

> I agree with Jonathan.
> Jay Nash wrote on Thursday, December 13, 2018 at 1:05 PM:
> >
> > Dear all,
> > I am just a bystander on the Druid list, but I would like to contribute
> > code to Druid some day because it is very good; we use it at my company.
> > It sounds like consensus was reached that GitHub milestones should be
> > used less frequently, and that a vote is proposed to change this. Is this
> > correct?
> >
> > Regards,
> > Jay
> >
> > On 2018/12/12 00:39:29, Jonathan Wei  wrote:
> > > After a PR has been reviewed and merged, I think we should tag it with
> > > the upcoming milestone to make life easier for release managers, for
> > > all PRs.
> > >
> > > Regarding unresolved PRs:
> > >
> > > > I advocate for not assigning milestones to any non-bug (or otherwise
> > > > "critical") PRs, including "feature", non-refactoring PRs.
> > >
> > > That seems like a reasonable policy to me, based on the points Roman
> > > made in the thread.
> > >
> > > On Tue, Dec 11, 2018 at 1:13 AM Julian Hyde  wrote:
> > >
> > > > Well, see if you can get consensus around such a policy. Other Druid
> > > > folks, please speak up if you agree or disagree.
> > > >
> > > > > On Dec 8, 2018, at 8:02 AM, Roman Leventov  wrote:
> > > > >
> > > > > It's not exactly and not only that. I advocate for not assigning
> > > > > milestones to any non-bug (or otherwise "critical") PRs, including
> > > > > "feature", non-refactoring PRs.
> > > > >
> > > > > On Fri, 7 Dec 2018 at 19:29, Julian Hyde  wrote:
> > > > >
> > > > >> Consensus.
> > > > >>
> > > > >> We resolve debates by going into them knowing that we need to find
> > > > >> consensus. A vote is a last step to prove that consensus exists,
> > > > >> and in most cases is not necessary.
> > > > >>
> > > > >> Reading between the lines, it sounds as if you and FJ have a
> > > > >> difference of opinion about refactoring changes. Two extreme
> > > > >> positions would be (1) we don't accept changes that only refactor
> > > > >> code, and (2) I assert my right to contribute a refactoring change
> > > > >> at any point in the project lifecycle. A debate that starts with
> > > > >> those positions is never going to reach consensus. A better
> > > > >> starting point might be "I would like to make the following change
> > > > >> because I believe it would be beneficial. How could I best
> > > > >> structure it / time it to minimize impact?"
> > > > >>
> > > > >> On Fri, Dec 7, 2018 at 9:19 AM Roman Leventov  wrote:
> > > > >>
> > > > >>> I would like to learn what is the Apache way to resolve debates.
> > > > >>> But you are right, this question probably doesn't deserve that.
> > > > >>> Thanks for the guidance, Julian.
> > > > >>>
> > > > >>> On Fri, 7 Dec 2018 at 16:43, Julian Hyde  wrote:
> > > > >>>
> > > > >>>> May I suggest that a vote is not the solution. In this
> > > > >>>> discussion I see two people beating each other over the head
> > > > >>>> with policy.
> > > > >>>>
> > > > >>>> Let's strive to operate according to the Apache way. Accept
> > > > >>>> contributions on merit in a timely manner. Avoid the urge to
> > > > >>>> "project manage".
> > > > >>>>
> > > > >>>> Julian
> > > > >>>>
> > > > >>>> > On Dec 7, 2018, at 07:03, Roman Leventov  wrote:
> > > > >>>> >
> > > > >>>> > The previous consensus community decision seems to be to not
> > > > >>>> > use PR milestones for any PRs except bugs. To change this
> > > > >>>> > policy, probably there should be a committer (or PPMC?) vote.
> > > > >>>> >
> > > > >>>> > On Thu, 6 Dec 2018 at 20:49, Julian Hyde  wrote:
> > > > >>>> >
> > > > >>>> >> FJ,
> > > > >>>> >>
> > > > >>>> >> What you are proposing sounds suspiciously like project
> > > > >>>> >> management. If a contributor makes a contribution, that
> > > > >>>> >> contribution should be given a fair review in a timely
> > > > >>>> >> fashion and be committed based on its merits. You overstate
> > > > >>>> >> the time-sensitivity of contributions. I would imagine that
> > > > >>>> >> there are only a few days preceding each release where
> > > > >>>> >> stability is a major concern. At any othe

Re: Off list major development

2019-01-07 Thread Gian Merlino
> > Let's not be naive this is very rare that a contributor will accept that
> > his work is to be thrown, usually devs take coding as personal creation
> > and they get attached to it.
> > To my point you can take a look on some old issue in the Druid forum
> >
> https://github.com/apache/incubator-druid/pull/3755#issuecomment-265667690
> >  and am sure other communities have similar problems.
> >  So leaving the door open to some side cases is not a good idea in my
> > opinion and will lead to similar issue in the future.
> >
> > This seems to me especially likely to happen in cases
> > > where an approach still needs to be proven a viable idea *to the
> author*,
> > > so that a much more productive discussion can be had in the first
> place.
> > >
> > > I think there is a trade off, I don't think we want to discourage
> > > experimentation by walling it off behind mandatory discussions before
> it
> > > can even start, but I do think formalizing the process for large
> changes
> > is
> > > a good thing, especially since we probably can't expect the wider
> > community
> > > to have the same attitude towards a large PR getting discarded as a
> > > committer might. I think the Kafka approach is reasonable, a bit more
> > > formal than our design review process but not overbearing.
> >
> >
> > Can you please explain what is overbearing? What can be changed to make
> > it easy?
> > Most of the points are kind of the actual questions that you would want
> > to address beforehand anyway, isn't it?
> >
> >
> > > Going code first
> > > should be in general discouraged, but when it does happen, it seems
> like
> > > opening DIP/an issue/starting a mailing list thread or whatever we go
> > with
> > > to have a more high level design discussion alongside the reference PR
> > > could alleviate some of these complaints?
> >
> >
> > What are the complaints ?
> >
> >
> > > +1 for "DIP" heh, I think making
> > > them in the form of github issues is probably appropriate, with a dev
> > list
> > > thread to announce them perhaps?
> > >
> >
> > I think a GitHub issue with a [Proposal] header, like
> > https://github.com/apache/incubator-druid/issues/4349, is good to me.
> >
> > Thanks!
> >
> >
> > > On Thu, Jan 3, 2019 at 10:28 AM Slim Bouguerra 
> wrote:
> > >
> > > > Thanks everyone for interacting with this thread.
> > > >
> > > > The fact that I, Roman, Jihoon, and others in the past (FJ
> > > >
> > >
> >
> https://groups.google.com/forum/#!msg/druid-user/gkUEsAYIfBA/6B2GJdLkAgAJ)
> > > > raised this point indicates that PRs without a proposal are indeed an
> > > issue
> > > > and we need to solve it.
> > > >
> > > > Something similar to KIPs, maybe called DIPs, is fine with me.
> > > > What i strive to see is the following:
> > > >
> > > > [Step 1] Formalize what kind of work needs a formal proposal. I
> > > > think Roman and Jihoon have already covered that pretty well. I am
> > > > +1 on that.
> > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/9b30d893bdb6bb633cf6a9a700183ffb5b98f115330531a55328ac77@%3Cdev.druid.apache.org%3E
> > > >
> > > > I am strongly in favor of the separation of Proposal Review and
> (later)
> > > > Code review PRs. My  main reasons:
> > > > Most importantly, code reviewing will introduce a lot of noise and
> > > > will ultimately make the GitHub page unreadable.
> > > > Avoid overlapping of work.
> > > > Once code is written, it is hard to think abstractly.
> > > > A separate page for design review can always be used later as a
> > > > design document that is readable and code-free-ish.
> > > > As I said, the goal of this first round is to see if the community
> > > > agrees with such a change, then make the design process more
> > > > inclusive so that other contributors can submit counter-proposals.
> > > >
> > > > [Step 2] If everybody agrees on that point, Step 2 is to define
> > > > which medium is used to publish a primitive form of a CODE-FREE
> > > > abstract proposal containing at least the following bullet points.
> > > > - The problem description and motivation
> > > > - Overview of the proposed change
> > > > - Operational impact (compatibility/ plans to upgrades) publi

Re: Off list major development

2019-01-07 Thread Gian Merlino
I don't think there's a need to raise issues for every change: a small bug
fix or doc fix should just go straight to PR. (GitHub PRs show up as issues
in the issue-search UI/API, so it's not like this means the patch has no
corresponding issue -- in a sense the PR _is_ the issue.)

I do think it makes sense to encourage potential contributors to write to
the dev list or raise an issue if they aren't sure if something would need
to go through a more heavy weight process.

Fwiw we do have a set of 'design review' criteria already (we had a
discussion about this a couple years ago) at:
http://druid.io/community/#getting-your-changes-accepted. So we wouldn't be
starting from zero on defining that. We set it up back when we were trying
to _streamline_ our process -- we used to require two non-author +1s for
_every_ change, even minor ones. The introduction of design review criteria
was meant to classify which PRs need that level of review and which ones
are minor and can be merged with less review. I do think it helped with
getting minor PRs merged more quickly. The list of criteria is,

- Major architectural changes or API changes
- HTTP requests and responses (e. g. a new HTTP endpoint)
- Interfaces for extensions
- Server configuration (e. g. altering the behavior of a config property)
- Emitted metrics
- Other major changes, judged by the discretion of Druid committers

Some of it is subjective, but it has been in place for a while, so it's at
least something we are relatively familiar with.

On Mon, Jan 7, 2019 at 11:32 AM Julian Hyde  wrote:

> Small contributions don’t need any design review, whereas large
> contributions need significant review. I don’t think we should require an
> additional step for those (many) small contributions. But who decides
> whether a contribution fits into the small or large category?
>
> I think the solution is for authors to log a case (or send an email to
> dev) before they start work on any contribution. Then committers can
> request a more heavy-weight process if they think it is needed.
>
> Julian
>
>
> > On Jan 7, 2019, at 11:24 AM, Gian Merlino  wrote:
> >
> > It sounds like splitting design from code review is a common theme in a
> few
> > of the posts here. How does everyone feel about making a point of
> > encouraging design reviews to be done as issues, separate from the pull
> > request, with the expectations that (1) the design review issue
> > ("proposal") should generally appear somewhat _before_ the pull request;
> > (2) pull requests should _not_ have design review happen on them, meaning
> > there should no longer be PRs with design review tags, and we should move
> > the design review approval process to the issue rather than the PR.
> >
> > For (1), even if we encourage design review discussions to start before a
> > pull request appears, I don't see an issue with them running concurrently
> > for a while at some point.
> >
> > On Thu, Jan 3, 2019 at 5:35 PM Jonathan Wei  wrote:
> >
> >> Thanks for raising these concerns!
> >>
> >> My initial thoughts:
> >> - I agree that separation of design review and code-level review for
> major
> >> changes would be more efficient
> >>
> >> - I agree that a clear, more formalized process for handling major
> changes
> >> would be helpful for contributors:
> >>  - Define what is considered a major change
> >>  - Define a standard proposal structure, KIP-style proposal format
> sounds
> >> good to me
> >>
> >> - I think it's too rigid to have a policy of "no code at all with the
> >> initial proposal"
> >>  - Code samples can be useful references for understanding aspects of a
> >> design
> >>  - In some cases it's necessary to run experiments to fully understand a
> >> problem and determine an appropriate design, or to determine whether
> >> something is even worth doing before committing to the work of fleshing
> out
> >> a proposal, prototype code is a natural outcome of that and I'm not
> against
> >> someone providing such code for reference
> >>  - I tend to view design/code as things that are often developed
> >> simultaneously in an intertwined way
> >>
> >>> Let's not be naive this is very rare that a contributor will accept
> that
> >> his work is to be thrown, usually devs takes coding as personal creation
> >> and they get attached to it.
> >>
> >> If we have a clear review process that emphasizes the need for early
> >> consensus building, with separate design and code review, then I feel
>

Re: Watermarks!

2019-01-07 Thread Gian Merlino
For Kafka, maybe something that tells you if all committed data is actually
loaded, & what offset has been committed up to? Would there be any problems
caused by the fact that only the most recent commit is saved in the DB?

Is this feature connected at all to an ask I have heard from a few people:
that there be an option to fail a query (or at least include a special
response header) if some segments in the interval are unavailable? (Which,
currently, the broker can't know since it doesn't know details about all
available segments.)

Btw, at your site do you have any plans to migrate to Kafka indexing?

On Wed, Jan 2, 2019 at 5:37 PM Charles Allen 
wrote:

> Hi all!
>
> https://github.com/apache/incubator-druid/pull/6799
>
> A contribution is up that includes a neat feature we have been using
> internally called Watermarks. Basically when operating a large scale and
> multi-tenant system, it is handy to be able to monitor how 'well behaved'
> the data is with regard to history. This is commonly used to spot holes in
> data, and to help give hints to data consumers in a lambda environment on
> when data has been run through a thorough check (batch job) vs a best
> effort sketch of the results which may or may not handle late data well
> (streaming intake).
>
> Unfortunately I'm not really sure what meta-data would be handy to have for
> the kafka indexing service, so I'd love input there as well if anyone knows
> of any "watermarks" that would make sense for it.
>
> Since the extension was written to be a stand alone service, it can remain
> as an extension forever if desired. An alternative I would like to propose
> is that the primitives for the watermark feature be added to core druid,
> and the extension points be added to their respective places (mysql
> extension and google extension to name two explicitly).
>
> Let me know what you think!
> Charles Allen
>


Re: Pointers on implementing a new ShardSpec

2019-01-08 Thread Gian Merlino
Hey Julian,

There aren't any gotchas that I can think of other than the fact that they
are not super well documented, and you might miss some features if you're
just skimming the code. A couple points that might matter,

1) PartitionChunk is what allows a shard spec to contribute to the
definition of whether a set of segments for a time chunk is "complete".
It's an important concept since the broker will not query segment sets
unless the chunk is complete. The way the completeness check works is
basically that the broker will get all the ShardSpecs for all the segments
in a time chunk, order them by partitionNum, generate the partition chunks,
and check if (a) the first one is a starter based on "isStart", (b) any
subsequent ones until the end [based on "isEnd"] abut the previous one
[based on "abuts"]. Some ShardSpecs use nonsensical-at-first-glance logic
from these methods to short circuit the completeness checks: time chunks
with LinearShardSpecs are _always_ considered complete. Time chunks with
NumberedShardSpecs can have "partitionNum" go beyond "partitions", and are
considered complete if the first "partitions" number of segments are
present.

2) "getDomainDimensions" and "possibleInDomain" are optional, but useful
for powering broker-side segment pruning.

3) All segments in the same time chunk must have the same kind of
ShardSpec. However, it can vary from time chunk to time chunk within a
datasource.
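The completeness walk described in point 1 can be sketched as follows. This is a toy model, not Druid's real interfaces (which involve PartitionChunk and a PartitionHolder); the Chunk class and the NumberedShardSpec-style fields are simplifying assumptions that just illustrate the ordered start/abuts/end check:

```java
import java.util.Comparator;
import java.util.List;

public class CompletenessSketch {
    // Toy model of the pieces the broker's completeness check relies on:
    // start/end markers and adjacency between neighboring partitions.
    static class Chunk {
        final int partitionNum;
        final int partitions;  // total expected, NumberedShardSpec-style

        Chunk(int partitionNum, int partitions) {
            this.partitionNum = partitionNum;
            this.partitions = partitions;
        }

        boolean isStart() { return partitionNum == 0; }
        boolean isEnd() { return partitionNum == partitions - 1; }
        boolean abuts(Chunk prev) { return partitionNum == prev.partitionNum + 1; }
    }

    // Mirrors the walk described above: order by partitionNum, require a
    // start, adjacency between each pair of neighbors, and an end.
    static boolean isComplete(List<Chunk> chunks) {
        if (chunks.isEmpty()) return false;
        chunks.sort(Comparator.comparingInt(c -> c.partitionNum));
        if (!chunks.get(0).isStart()) return false;
        for (int i = 1; i < chunks.size(); i++) {
            if (!chunks.get(i).abuts(chunks.get(i - 1))) return false;
        }
        return chunks.get(chunks.size() - 1).isEnd();
    }

    public static void main(String[] args) {
        List<Chunk> full = new java.util.ArrayList<>(
            java.util.Arrays.asList(new Chunk(0, 3), new Chunk(1, 3), new Chunk(2, 3)));
        List<Chunk> gap = new java.util.ArrayList<>(
            java.util.Arrays.asList(new Chunk(0, 3), new Chunk(2, 3)));
        System.out.println(isComplete(full));  // true: 0,1,2 of 3 present
        System.out.println(isComplete(gap));   // false: partition 1 missing
    }
}
```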

On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe 
wrote:

> Hey all,
>
> Are there any major caveats or gotchas I should be aware of when
> implementing a new ShardSpec? The context here is that we have a datasource
> that is the combined result of multiple input jobs. We're trying to do
> write-side joining by having all of the jobs write segments for the same
> intervals (e.g. partitioning on both partition number and source pipeline).
> For now, I've modified the Spark-Druid batch ingestor (
> https://github.com/metamx/druid-spark-batch) to run in our various
> pipelines and to write out segments with identifier form
> `dataSource_startInterval_endInterval_version_sourceName_partitionNum`. This
> is working without issue for loading, querying, and deleting data, but the
> metadata API reports the incorrect segment identifier, since it
> reconstructs the identifier instead of reading from metadata (e.g. it
> reports segment identifiers of the form
> `dataSource_startInterval_endInterval_version_partitionNum`). Both because
> we'd like this to be fully supported, and because we imagine that this
> feature may be useful to others, I'd like to implement this via a
> ShardSpec.
>
> Julian
>


Re: Off list major development

2019-01-08 Thread Gian Merlino
I think for us, choosing to use GitHub issues as discussion threads for
potential 'major' contributions would be a good idea, especially if we
encourage people to start them before PRs show up. Definitely agree that
all contributors should go through the same process -- I couldn't see it
working well any other way.

On Mon, Jan 7, 2019 at 12:23 PM Julian Hyde  wrote:

> Statically, yes, GitHub PRs are the same as GitHub cases. But dynamically,
> they are different, because you can only log a PR when you have finished
> work.
>
> A lot of other Apache projects use JIRA, so there is a clear distinction
> between cases and contributions. JIRA cases, especially when logged early
> in the lifecycle of a contribution, become long-running conversation
> threads with a lot of community participation. If the Druid chose to do so,
> GitHub cases could be the same.
>
> Be careful that you do not treat “potential contributors” (by which I
> presume you mean non-committers) differently from committers and PMC
> members. Anyone starting a major piece of work should follow the same
> process. (Experienced committers probably have a somewhat better idea what
> work will turn out to be “major”, so they get a little more leeway.)
>
> Julian
>
>
> > On Jan 7, 2019, at 12:10 PM, Gian Merlino  wrote:
> >
> > I don't think there's a need to raise issues for every change: a small
> bug
> > fix or doc fix should just go straight to PR. (GitHub PRs show up as
> issues
> > in the issue-search UI/API, so it's not like this means the patch has no
> > corresponding issue -- in a sense the PR _is_ the issue.)
> >
> > I do think it makes sense to encourage potential contributors to write to
> > the dev list or raise an issue if they aren't sure if something would
> need
> > to go through a more heavy weight process.
> >
> > Fwiw we do have a set of 'design review' criteria already (we had a
> > discussion about this a couple years ago) at:
> > http://druid.io/community/#getting-your-changes-accepted. So we
> wouldn't be
> > starting from zero on defining that. We set it up back when we were
> trying
> > to _streamline_ our process -- we used to require two non-author +1s for
> > _every_ change, even minor ones. The introduction of design review
> criteria
> > was meant to classify which PRs need that level of review and which ones
> > are minor and can be merged with less review. I do think it helped with
> > getting minor PRs merged more quickly. The list of criteria is,
> >
> > - Major architectural changes or API changes
> > - HTTP requests and responses (e. g. a new HTTP endpoint)
> > - Interfaces for extensions
> > - Server configuration (e. g. altering the behavior of a config property)
> > - Emitted metrics
> > - Other major changes, judged by the discretion of Druid committers
> >
> > Some of it is subjective, but it has been in place for a while, so it's
> at
> > least something we are relatively familiar with.
> >
> > On Mon, Jan 7, 2019 at 11:32 AM Julian Hyde  wrote:
> >
> >> Small contributions don’t need any design review, whereas large
> >> contributions need significant review. I don’t think we should require
> an
> >> additional step for those (many) small contributions. But who decides
> >> whether a contribution fits into the small or large category?
> >>
> >> I think the solution is for authors to log a case (or send an email to
> >> dev) before they start work on any contribution. Then committers can
> >> request a more heavy-weight process if they think it is needed.
> >>
> >> Julian
> >>
> >>
> >>> On Jan 7, 2019, at 11:24 AM, Gian Merlino  wrote:
> >>>
> >>> It sounds like splitting design from code review is a common theme in a
> >> few
> >>> of the posts here. How does everyone feel about making a point of
> >>> encouraging design reviews to be done as issues, separate from the pull
> >>> request, with the expectations that (1) the design review issue
> >>> ("proposal") should generally appear somewhat _before_ the pull
> request;
> >>> (2) pull requests should _not_ have design review happen on them,
> meaning
> >>> there should no longer be PRs with design review tags, and we should
> move
> >>> the design review approval process to the issue rather than the PR.
> >>>
> >>> For (1), even if we encourage design review discussions to start
> before a
> >>> pull request appears, I don't 

Re: Off list major development

2019-01-10 Thread Gian Merlino
t; > "Proposal" and something like "API Review". I think a single "Design
> > > Review" tag handles it well.
> > >
> > > Gian mentioned an idea that PRs that follow a "Design Review" proposal
> > > issue shouldn't be "Design Review" themselves. I don't agree with
> this, I
> > > think that actual code and performance data are important inputs that
> > > should be re-evaluated at least by two people. I even think that it's
> > very
> > > desirable that at least two people read _each line of production code_
> in
> > > large PRs, although it's not what was done historically in Druid,
> because
> > > large bodies of newly added code, with whole new classes and subsystems
> > > added, are also coincidentally tested worse than already existing
> classes
> > > and subsystems, including in production. It seems to me that those huge
> > > code influxes are a major source of bugs that could later take years to
> > > squeeze out from the codebase.
> > >
> > > On Wed, 9 Jan 2019 at 08:24, Clint Wylie  wrote:
> > >
> > >> Apologies for the delayed response.
> > >>
> > >> On Thu, Jan 3, 2019 at 12:48 PM Slim Bouguerra <
> > slim.bougue...@gmail.com>
> > >> wrote:
> > >>
> > >> > I am wondering here what is the case where code first is better?
> > >> >
> > >>
> > >> I don't think it's wrong to share ideas as early as possible, and
> after
> > >> this discussion I think I am in favor of it too. I just meant that I
> > don't
> > >> think it's always necessarily the most productive discussion until
> code
> > >> exists sometimes, with the types of thing I am thinking of are almost
> > >> entirely limited to cases where things might sound good to anyone on
> > paper
> > >> but in reality need a large amount of experiments conducted and
> > >> observations collected to determine that something is actually worth
> > >> doing,
> > >> which I imagine is mainly things like reworking internals for
> > performance
> > >> improvements.
> > >>
> > >> In the case of my combined proposal PR, I needed to prove that the
> > thing I
> > >> was working on was a good idea... and it wasn't, directly. But I came
> > >> up with another idea during the course of the experiment that turned
> > >> into something
> > >> compelling, so an initial proposal would have looked quite a lot
> > different
> > >> than what I ended up with. Once I had proven to myself that it was a
> > good
> > >> idea, then I was comfortable sharing with the wider community. I'm not
> > >> certain how this would play out in an always-proposal-first model:
> > >> maybe the first proposal exists, I personally reject it after
> > >> updating it with experiment results showing it's a bad idea, then
> > >> continue experimenting and raise a new one once the experiments start
> > >> looking promising?
> > >>
> > >>
> > >> > Let's not be naive: it is very rare that a contributor will accept
> > >> > that their work is to be thrown away; devs usually take coding as a
> > >> > personal creation and get attached to it.
> > >> >
> > >>
> > >> I agree. Just because a handful of the committers have this attitude,
> > >> it isn't fair to expect the wider community to share it; that's why I
> > >> am in favor of formalizing the process.
> > >>
> > >> > Can you please explain what is overbearing? What can be changed to
> > >> > make it easy?
> > >> > Most of the points are the actual questions that you want to
> > >> > address beforehand anyway, aren't they?
> > >> >
> > >>
> > >> Sorry for the confusion; I said it's "not overbearing". I think it's
> > >> fine.
> > >>
> > >> What are the complaints ?
> > >>
> > >>
> > >> Aren't this and other previous threads complaints about opening a
> > >> large PR without a proposal? :) I just mean that formalizing the
> > >> process, even if a proposal has a reference PR opened with it near
> > >> concurrently, could

Re: Pointers on implementing a new ShardSpec

2019-01-11 Thread Gian Merlino
Do you mean the modifications to the metamx/druid-spark-batch project or
the new ShardSpec you're working on?

On Thu, Jan 10, 2019 at 3:09 PM Julian Jaffe 
wrote:

> Thanks for the detailed pointers, Gian! In light of the ongoing discussion
> around on-list development, does this seem like something that's worthwhile
> to anyone else in the community?
>
> On Tue, Jan 8, 2019 at 10:32 AM Gian Merlino  wrote:
>
> > Hey Julian,
> >
> > There aren't any gotchas that I can think of other than the fact that
> they
> > are not super well documented, and you might miss some features if you're
> > just skimming the code. A couple points that might matter,
> >
> > 1) PartitionChunk is what allows a shard spec to contribute to the
> > definition of whether a set of segments for a time chunk is "complete".
> > It's an important concept since the broker will not query segment sets
> > unless the chunk is complete. The way the completeness check works is
> > basically that the broker will get all the ShardSpecs for all the
> segments
> > in a time chunk, order them by partitionNum, generate the partition
> chunks,
> > and check if (a) the first one is a starter based on "isStart", (b) any
> > subsequent ones until the end [based on "isEnd"] abut the previous one
> > [based on "abuts"]. Some ShardSpecs use nonsensical-at-first-glance logic
> > from these methods to short circuit the completeness checks: time chunks
> > with LinearShardSpecs are _always_ considered complete. Time chunks with
> > NumberedShardSpecs can have "partitionNum" go beyond "partitions", and
> are
> > considered complete if the first "partitions" number of segments are
> > present.
> >
> > 2) "getDomainDimensions" and "possibleInDomain" are optional, but useful
> > for powering broker-side segment pruning.
> >
> > 3) All segments in the same time chunk must have the same kind of
> > ShardSpec. However, it can vary from time chunk to time chunk within a
> > datasource.
> >
> > On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe  >
> > wrote:
> >
> > > Hey all,
> > >
> > > Are there any major caveats or gotchas I should be aware of when
> > > implementing a new ShardSpec? The context here is that we have a
> > datasource
> > > that is the combined result of multiple input jobs. We're trying to do
> > > write-side joining by having all of the jobs write segments for the
> same
> > > intervals (e.g. partitioning on both partition number and source
> > pipeline).
> > > For now, I've modified the Spark-Druid batch ingestor (
> > > https://github.com/metamx/druid-spark-batch) to run in our various
> > > pipelines and to write out segments with identifier form
> > > `dataSource_startInterval_endInterval_version_sourceName_partitionNum`.
> > This
> > > is working without issue for loading, querying, and deleting data, but
> > the
> > > metadata API reports the incorrect segment identifier, since it
> > > reconstructs the identifier instead of reading from metadata (e.g. it
> > > reports segment identifiers of the form
> > > `dataSource_startInterval_endInterval_version_partitionNum`). Both
> > because
> > > we'd like this to be fully supported, and because we imagine that this
> > > feature may be useful to others, I'd like to implement this via a
> > > ShardSpec.
> > >
> > > Julian
> > >
> >
>
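
The completeness check Gian describes above can be sketched as follows. This is an editorial illustration in Python, not Druid's actual (Java) implementation; `PartitionChunk` mirrors the interface mentioned in the thread, and the consecutive-number `abuts` rule is a stand-in for whatever a concrete ShardSpec defines.

```python
# Editorial sketch of the broker's time-chunk completeness check:
# order chunks by partitionNum, require the first to be a start, each
# following chunk to abut its predecessor, and the last to be an end.
from dataclasses import dataclass

@dataclass
class PartitionChunk:
    partition_num: int
    is_start: bool
    is_end: bool

    def abuts(self, prev):
        # Stand-in rule: chunks abut when partition numbers are
        # consecutive. A LinearShardSpec-style spec would effectively make
        # every check pass, so its time chunks are always "complete".
        return self.partition_num == prev.partition_num + 1

def is_complete(chunks):
    chunks = sorted(chunks, key=lambda c: c.partition_num)
    if not chunks or not chunks[0].is_start:
        return False
    for prev, cur in zip(chunks, chunks[1:]):
        if not cur.abuts(prev):
            return False
    return chunks[-1].is_end
```

A set missing a middle partition fails the `abuts` step, which is why the broker will not query that time chunk until the gap is filled.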


Re: Pointers on implementing a new ShardSpec

2019-01-14 Thread Gian Merlino
IMO, if you think it makes sense to contribute an interface to it too (some
option that triggers it) then I'd say write up a short proposal (motivation
/ proposed change as a GitHub issue) as the first step towards
contributing. I don't think I understand it well enough from your
description so an issue helps with that. I also don't feel that we've
totally decided on what the process should be there, but, that's the
direction the conversation seems to be going in the other thread as to how
new changes should start.

On Fri, Jan 11, 2019 at 1:26 PM Julian Jaffe 
wrote:

> The new shard spec
>
> On Fri, Jan 11, 2019 at 8:37 AM Gian Merlino  wrote:
>
> > Do you mean the modifications to the metamx/druid-spark-batch project or
> > the new ShardSpec you're working on?
> >
> > On Thu, Jan 10, 2019 at 3:09 PM Julian Jaffe
>  > >
> > wrote:
> >
> > > Thanks for the detailed pointers, Gian! In light of the ongoing
> > discussion
> > > around on-list development, does this seem like something that's
> > worthwhile
> > > to anyone else in the community?
> > >
> > > On Tue, Jan 8, 2019 at 10:32 AM Gian Merlino  wrote:
> > >
> > > > Hey Julian,
> > > >
> > > > There aren't any gotchas that I can think of other than the fact that
> > > they
> > > > are not super well documented, and you might miss some features if
> > you're
> > > > just skimming the code. A couple points that might matter,
> > > >
> > > > 1) PartitionChunk is what allows a shard spec to contribute to the
> > > > definition of whether a set of segments for a time chunk is
> "complete".
> > > > It's an important concept since the broker will not query segment
> sets
> > > > unless the chunk is complete. The way the completeness check works is
> > > > basically that the broker will get all the ShardSpecs for all the
> > > segments
> > > > in a time chunk, order them by partitionNum, generate the partition
> > > chunks,
> > > > and check if (a) the first one is a starter based on "isStart", (b)
> any
> > > > subsequent ones until the end [based on "isEnd"] abut the previous
> one
> > > > [based on "abuts"]. Some ShardSpecs use nonsensical-at-first-glance
> > logic
> > > > from these methods to short circuit the completeness checks: time
> > chunks
> > > > with LinearShardSpecs are _always_ considered complete. Time chunks
> > with
> > > > NumberedShardSpecs can have "partitionNum" go beyond "partitions",
> and
> > > are
> > > > considered complete if the first "partitions" number of segments are
> > > > present.
> > > >
> > > > 2) "getDomainDimensions" and "possibleInDomain" are optional, but
> > useful
> > > > for powering broker-side segment pruning.
> > > >
> > > > 3) All segments in the same time chunk must have the same kind of
> > > > ShardSpec. However, it can vary from time chunk to time chunk within
> a
> > > > datasource.
> > > >
> > > > On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe
> >  > > >
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > Are there any major caveats or gotchas I should be aware of when
> > > > > implementing a new ShardSpec? The context here is that we have a
> > > > datasource
> > > > > that is the combined result of multiple input jobs. We're trying to
> > do
> > > > > write-side joining by having all of the jobs write segments for the
> > > same
> > > > > intervals (e.g. partitioning on both partition number and source
> > > > pipeline).
> > > > > For now, I've modified the Spark-Druid batch ingestor (
> > > > > https://github.com/metamx/druid-spark-batch) to run in our various
> > > > > pipelines and to write out segments with identifier form
> > > > >
> > `dataSource_startInterval_endInterval_version_sourceName_partitionNum`.
> > > > This
> > > > > is working without issue for loading, querying, and deleting data,
> > but
> > > > the
> > > > > metadata API reports the incorrect segment identifier, since it
> > > > > reconstructs the identifier instead of reading from metadata (e.g.
> it
> > > > > reports segment identifiers of the form
> > > > > `dataSource_startInterval_endInterval_version_partitionNum`). Both
> > > > because
> > > > > we'd like this to be fully supported, and because we imagine that
> > this
> > > > > feature may be useful to others, I'd like to implement this via a
> > > > > ShardSpec.
> > > > >
> > > > > Julian
> > > > >
> > > >
> > >
> >
>


Experimental feature 'graduation' in 0.14

2019-01-14 Thread Gian Merlino
I'd like to propose graduating a couple of features out of 'experimental'
status in 0.14. Both are popular features (judging by mailing list & github
issue/PR activity). Both have been around for a while and have attained a
good level of quality and stability of API & behavior. I believe removing
the 'experimental' banner from these features would more accurately reflect
reality, and be a good signal to the user community.

1) Kafka indexing service. First introduced in Druid 0.9.1, it went through
a major protocol change in Druid 0.12.0 that added incremental publishing,
& 'mixing' of data from different partitions. Subjectively, quality appears
to be getting more solid, based on frequency of bug reports and also based
on our own experiences running this in production. Finally, I believe it is
already much more robust than Tranquility, the only 'stable' alternative.

2) Druid SQL. First introduced in Druid 0.10.0. It isn't feature complete
yet (multi-value dimensions, datasketches, etc, remain unsupported) but the
API and behavior have been generally stable. No major issues around memory
/ performance / etc regressions relative to native Druid queries are
outstanding. IMO, it is well on its way to becoming a first class way to
query Druid, and it is a good time to remove the 'experimental' banner.


Re: Experimental feature 'graduation' in 0.14

2019-01-15 Thread Gian Merlino
I am ok with officially deprecating or sunsetting Tranquility. Sunsetting
may make more sense since a few of us are still committed to fixing
critical bugs with it, but I don't know of anyone that is actively working
on new features.

It was never part of the main Druid repo, so IMO all that really needs to
be done is updating the Druid webpage (which references Tranquility in a
few places), and the Tranquility repo itself, to say that Tranquility is
deprecated/sunsetted and the recommended alternative is either the Kafka or
Kinesis indexing service. We should include some rationale as to why that
decision was made, too. To me the rationale boils down to: KIS style
ingestion doesn't have a windowPeriod restriction, doesn't lose data when
tasks fail, is lower footprint when reading from an external stream hub
like Kafka/Kinesis, and has generally proven to be easier to set up and
debug.

On Tue, Jan 15, 2019 at 4:03 AM Dylan Wylie  wrote:

> Can attest on our clusters KIS has performed well enough to be considered
> non-experimental.
>
> As part of its promotion, can we consider "officially" deprecating
> Tranquility? Its status is a little uncertain post-apache and if there's
> consensus on deprecating it it'd be good opportunity to collate what work
> needs done to get there.
>
>
>
> On Tue, 15 Jan 2019 at 11:31, 邱明明  wrote:
>
> > +1.
> >
> > Gian Merlino  于2019年1月15日周二 上午6:18写道:
> > >
> > > I'd like to propose graduating a couple of features out of
> 'experimental'
> > > status in 0.14. Both are popular features (judging by mailing list &
> > github
> > > issue/PR activity). Both have been around for a while and have
> attained a
> > > good level of quality and stability of API & behavior. I believe
> removing
> > > the 'experimental' banner from these features would more accurately
> > reflect
> > > reality, and be a good signal to the user community.
> > >
> > > 1) Kafka indexing service. First introduced in Druid 0.9.1, it went
> > through
> > > a major protocol change in Druid 0.12.0 that added incremental
> > publishing,
> > > & 'mixing' of data from different partitions. Subjectively, quality
> > appears
> > > to be getting more solid, based on frequency of bug reports and also
> > based
> > > on our own experiences running this in production. Finally- I believe
> it
> > is
> > > already much more robust than Tranquility, the only 'stable'
> alternative.
> > >
> > > 2) Druid SQL. First introduced in Druid 0.10.0. It isn't feature
> complete
> > > yet (multi-value dimensions, datasketches, etc, remain unsupported) but
> > the
> > > API and behavior have been generally stable. No major issues around
> > memory
> > > / performance / etc regressions relative to native Druid queries are
> > > outstanding. IMO, it is well on its way to becoming a first class way
> to
> > > query Druid, and it is a good time to remove the 'experimental' banner.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


Re: Experimental feature 'graduation' in 0.14

2019-01-15 Thread Gian Merlino
Hi Mat,

Ah, right. IMO https://github.com/apache/incubator-druid/pull/6742 is a
decent workaround towards making #6176 less of a problem. It would prevent
incorrect results from happening (the broker will not start up its http
server & announce itself, and so it won't get picked up by clients, if it
never got the initialization event). If paired with monitoring that
restarts unhealthy brokers, the issue should be fully worked-around in
practice.

Even though there's an (imo) viable workaround, it would still be good to
fix the root cause of #6176. I just raised
https://github.com/apache/incubator-druid/pull/6862 to update Curator and
see if that helps -- there is a bug fixed in the latest release that looks
like it could cause the behavior we're seeing (
https://issues.apache.org/jira/browse/CURATOR-476).

My feeling is that it's still reasonable to remove the experimental label
from Druid SQL in 0.14, especially since #6742 will make SQL and native
queries behave at parity (initialization getting missed will delay broker
startup for _both_ cases). So in that sense they are at least on the same
footing. And hopefully #6862 will fix them both, together.

On Tue, Jan 15, 2019 at 7:56 AM Pierre-Emile Ferron 
wrote:

> A remaining issue with SQL is
> https://github.com/apache/incubator-druid/issues/6176
>
> We've seen it happen several times in production on 0.12, where thankfully
> SQL doesn't power anything critical. The current workarounds are:
> 1. Restart the broker. Obviously not a good solution.
> 2. Migrate to HTTP segment discovery. I'm fine with that, and we are
> actually planning to do it soon in our clusters, but I'm still concerned
> about other Druid users—the default setting is still ZK, which means that
> SQL would still have this issue by default.
>
> Before marking SQL as non-experimental, I'd suggest either fixing the root
> cause, or making HTTP segment discovery the default and then explicitly
> deprecating ZK segment discovery.
>
>
> On Mon, Jan 14, 2019 at 2:18 PM Gian Merlino  wrote:
>
> > I'd like to propose graduating a couple of features out of 'experimental'
> > status in 0.14. Both are popular features (judging by mailing list &
> github
> > issue/PR activity). Both have been around for a while and have attained a
> > good level of quality and stability of API & behavior. I believe removing
> > the 'experimental' banner from these features would more accurately
> reflect
> > reality, and be a good signal to the user community.
> >
> > 1) Kafka indexing service. First introduced in Druid 0.9.1, it went
> through
> > a major protocol change in Druid 0.12.0 that added incremental
> publishing,
> > & 'mixing' of data from different partitions. Subjectively, quality
> appears
> > to be getting more solid, based on frequency of bug reports and also
> based
> > on our own experiences running this in production. Finally- I believe it
> is
> > already much more robust than Tranquility, the only 'stable' alternative.
> >
> > 2) Druid SQL. First introduced in Druid 0.10.0. It isn't feature complete
> > yet (multi-value dimensions, datasketches, etc, remain unsupported) but
> the
> > API and behavior have been generally stable. No major issues around
> memory
> > / performance / etc regressions relative to native Druid queries are
> > outstanding. IMO, it is well on its way to becoming a first class way to
> > query Druid, and it is a good time to remove the 'experimental' banner.
> >
>


Re: Sync up this week

2019-01-16 Thread Gian Merlino
Thanks, Jihoon!

On Wed, Jan 16, 2019 at 3:37 PM Jihoon Son  wrote:

> Sorry, a bit of a late note for the last dev sync.
>
> Attendees: Charles Allen, Jon Wei, Jihoon Son, Atul Mohan, Dylan Wylie,
> Clint Wylie
>
> - Charles pre-reported a bug in HLLCollector (
> https://github.com/apache/incubator-druid/pull/6865).
> - Jon asked if there are any questions about his new packaging proposal (
> https://github.com/apache/incubator-druid/issues/6838).
> - Jihoon asked people to take a look at his segment locking proposal (
> https://github.com/apache/incubator-druid/issues/6319) and had a minor
> discussion to give Charles a rough overview.
>
> Best,
> Jihoon
>
> On Tue, Jan 15, 2019 at 9:59 AM Charles Allen
>  wrote:
>
> > To join the video meeting, click this link:
> > https://meet.google.com/ozi-rtfg-ags
> > Otherwise, to join by phone, dial +1 442-666-1256 and enter this PIN:
> 6867#
> > To view more phone numbers, click this link:
> > https://tel.meet/ozi-rtfg-ags?hs=5
> >
> > Cheers!
> >
>


Re: Include Empty Buckets at Granularity Defined Queries

2019-01-17 Thread Gian Merlino
Hey Furkan,

For timeseries there's a "skipEmptyBuckets" parameter that you can set to
true or false.

For other query types, empty buckets are always skipped.

On Wed, Jan 16, 2019 at 10:58 PM Furkan KAMACI 
wrote:

> Hi,
>
> As I understand it, the granularity field determines how data gets
> bucketed across the time dimension, i.e. how it gets aggregated by hour,
> day, minute, etc.
>
> Here is the related tutorial about it:
> http://druid.io/docs/latest/querying/granularities.html
>
> However, as far as I can see, all the empty buckets are discarded. Is
> there any option to include empty buckets in the result?
>
> Kind Regards,
> Furkan KAMACI
>
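
For reference, a native timeseries query using the flag Gian mentions might look like the sketch below. This is a hedged, editorial example: the datasource, interval, and aggregator are invented, and the flag is passed via the query context here, so verify against the docs for your Druid version.

```python
# Sketch of a native timeseries query with "skipEmptyBuckets" set.
import json

timeseries_query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2018-01-01/2018-01-02"],
    "aggregations": [{"type": "count", "name": "rows"}],
    # True: buckets with no rows are omitted from the result.
    # False (default): every granularity bucket appears, zero-filled.
    "context": {"skipEmptyBuckets": True},
}

print(json.dumps(timeseries_query, indent=2))
```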


Re: The etiquette of poking people on Github and the policy when people stop responding

2019-01-24 Thread Gian Merlino
The timelines you outlined seem quite slow. Especially "if there are enough
approvals, a PR could be merged not sooner than in two weeks since they
left the last review comment". IMO, rather than delaying patches by so
long, a better way to be courteous of a reviewer being too busy to review
in a timely manner is to seek review from some other committer that has
more time.

In an extreme case, where the patch has issues that a reviewer feels must
be addressed before the patch is merged, but the reviewer does not have
time to hold up their end of that conversation, they can veto (
https://www.apache.org/foundation/voting.html#Veto), accompanied by a
justification. This is a pretty blunt tool and should not be used much. But
it is there.

I'm still optimistic that the discussion we've been having around proposals
is a good way to address a desire for reviewers to have their say, in a way
that doesn't slow down the development process so much. By starting
conversations about changes earlier, it is a way for contributors to come
together and agree on the general shape of changes before there is a bunch
of code to discuss. Hopefully that also makes code review more efficient in
terms of contributors' time, reviewers' time, and amount of calendar days
that PRs are open for.

On Thu, Jan 24, 2019 at 3:41 AM Roman Leventov  wrote:

> To foster calmness, respect, and consideration of people's busy schedules I
> suggest the following etiquette:
>
> - When someone showed up in a PR and left some review comments, but didn't
> explicitly approve the PR, poke them with comments like "@username do you
> have more comments?" not sooner than in one week since they left the last
> review comment.
> - Poke people no more frequently than once per week.
> - If there are enough approvals, a PR could be merged not sooner than in
> two weeks since they left the last review comment.
> - If someone created a valuable PR but then stopped responding to review
> comments and there are conflicts or tests/CI don't pass, a PR could be
> recreated by another person not sooner than in three weeks since the
> author's last activity in the PR. Their authorship should be preserved by
> cherry-picking their commits, squashing them locally, rebasing on top of
> current upstream master, creating a new PR and choosing "Rebase and merge"
> option when merging the new PR instead of the default "Squash and merge".
>


Re: The etiquette of poking people on Github and the policy when people stop responding

2019-01-25 Thread Gian Merlino
If enough other committers have already reviewed and accepted a patch, I
don't think it's fair to the author or to those other reviewers for
committing to be delayed by two weeks because another committer doesn't
have time to finish reviewing, but wants others to wait for them anyway. A
couple of days, sure, but two weeks is quite a long time. It would be
potentially longer in practice, since with your proposal the two-week clock
would start fresh if the reviewer had some more follow-up comments.

Presumably the not-yet-finished reviewer's motivation for wanting others to
wait for them is that they have some set of concerns that they feel other
reviewers may not be examining. IMO, it'd be better to ask the reviewers
that do have more time to take those concerns into account rather than
blocking the commit.

And of course - even after patches are committed, we still have release
votes and the concept of release blocker issues as a final gate to allow
people to express an opinion that particular code should not be released
without changes. So the commit of a patch itself is not the only gate that
exists before code is released.

On Fri, Jan 25, 2019 at 3:43 AM Roman Leventov 
wrote:

> The times that I suggested are IMO minimally reasonable for not provoking a
> sense of rush and anxiety in people being poked.
>
> If we assume that people adhere to the guideline "do not comment on a pull
> request unless you are willing to follow up on the edits" from here:
>
> https://github.com/druid-io/druid-io.github.io/blob/src/community/index.md#general-committer-guidelines
> and don't forget about the PRs they started reviewing, poking appears to be
> pointless. Poking and expecting the reviewer to respond "I didn't forget
> about it, I'll continue my review soon" is not a good use of either
> person's time and IMO shouldn't be the common practice.
>
> Proposal reviews are good but the code should be reviewed thoroughly too.
> Even if the proposal got enough approvals, the PR with the actual code
> shouldn't be rushed into the codebase.
>
> On Fri, 25 Jan 2019 at 04:05, Gian Merlino  wrote:
>
> > The timelines you outlined seem quite slow. Especially "if there are
> enough
> > approvals, a PR could be merged not sooner than in two weeks since they
> > left the last review comment". IMO, rather than delaying patches by so
> > long, a better way to be courteous of a reviewer being too busy to review
> > in a timely manner is to seek review from some other committer that has
> > more time.
> >
> > In an extreme case, where the patch has issues that a reviewer feels must
> > be addressed before the patch is merged, but the reviewer does not have
> > time to hold up their end of that conversation, they can veto (
> > https://www.apache.org/foundation/voting.html#Veto), accompanied by a
> > justification. This is a pretty blunt tool and should not be used much.
> But
> > it is there.
> >
> > I'm still optimistic that the discussion we've been having around
> proposals
> > is a good way to address a desire for reviewers to have their say, in a
> way
> > that doesn't slow down the development process so much. By starting
> > conversations about changes earlier, it is a way for contributors to come
> > together and agree on the general shape of changes before there is a
> bunch
> > of code to discuss. Hopefully that also makes code review more efficient
> in
> > terms of contributors' time, reviewers' time, and amount of calendar days
> > that PRs are open for.
> >
> > On Thu, Jan 24, 2019 at 3:41 AM Roman Leventov 
> > wrote:
> >
> > > To foster calmness, respect, and consideration of people's busy
> > schedules I
> > > suggest the following etiquette:
> > >
> > > - When someone showed up in a PR and left some review comments, but
> > didn't
> > > explicitly approve the PR, poke them with comments like "@username do
> > you
> > > have more comments?" not sooner than in one week since they left the
> last
> > > review comment.
> > > - Poke people no more frequently than once per week.
> > > - If there are enough approvals, a PR could be merged not sooner than
> in
> > > two weeks since they left the last review comment.
> > > - If someone created a valuable PR but then stopped responding to
> review
> > > comments and there are conflicts or tests/CI don't pass, a PR could be
> > > recreated by another person not sooner than in three weeks since the
> > > author's last activity in the PR. Their authorship should be preserved
> by
> > > cherry-picking their commits, squashing them locally, rebasing on top
> of
> > > current upstream master, creating a new PR and choosing "Rebase and
> > merge"
> > > option when merging the new PR instead of the default "Squash and
> merge".
> > >
> >
>


Re: HAS ISSUE

2019-01-25 Thread Gian Merlino
Hey Mingwen,

This looks like it's related to the Hive/Druid integration, so it might be
a better question for the Hive mailing list. (The code for that integration
lives in the Hive project.)

On Tue, Jan 22, 2019 at 11:29 PM mingwen@analyticservice.net <
mingwen@analyticservice.net> wrote:

> Hi,
>
> CREATE EXTERNAL TABLE druid_table_1
> STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
> TBLPROPERTIES ("druid.datasource" = "wikipedia");
>
> When I run this, I get the error below. Please help me. Thanks.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/hive/metastore/DefaultHiveMetaHook
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:225)
> at
> org.apache.hadoop.hive.ql.parse.StorageFormat.fillStorageFormat(StorageFormat.java:64)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10906)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10144)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10225)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10110)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:223)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:560)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1358)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1475)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1287)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1277)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:226)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.hive.metastore.DefaultHiveMetaHook
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
>
>
> mingwen@analyticservice.net
>


Re: Indexing Arbitrary Key/Value Data

2019-01-25 Thread Gian Merlino
Hey Furkan,

There isn't currently an out of the box parser in Druid that can do what
you are describing. But it is an interesting feature to think about. Today
you could implement this using a custom parser (instead of using the
builtin json/avro/etc parsers, write an extension that implements an
InputRowParser, and you can do anything you want, including automatic
flattening of nested data).

In terms of how this might be done out of the box in the future, I can
think of a few ideas.

1) Have some way to define an "automatic flatten spec". Maybe something
that systematically flattens in a particular way: in your example, perhaps
it'd automatically create fields like "world.0.hey" and "world.1.tree".

2) A repetition and definition level scheme similar to Parquet:
https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html.
It sounds like this could be more natural and lend itself to better
compression than (1).

3) Create a new column type designed to store json-like data, although
presumably in some more optimized form. Add some query-time functionality
for extracting values from it. Use this for storing the original input
data. This would only really make sense if you had rollup disabled. In this
case, the idea would be that you would store an entire ingested object in
this new kind of column, and extract some subset fields for faster access
into traditional dimension and metric columns.
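
Idea (1) can be sketched quickly. This is a hypothetical illustration of what an "automatic flatten spec" might produce, following the `world.0.hey` naming in the example above; nothing here is an actual Druid API.

```python
# Hypothetical "automatic flatten" in the spirit of idea (1): eagerly
# flatten nested JSON into dotted column names, indexing array elements
# by position.
def auto_flatten(value, prefix=""):
    """Flatten nested dicts/lists into a {dotted_path: leaf_value} map."""
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            out.update(auto_flatten(child, prefix + str(key) + "."))
        return out
    if isinstance(value, list):
        out = {}
        for i, child in enumerate(value):
            out.update(auto_flatten(child, prefix + str(i) + "."))
        return out
    return {prefix[:-1]: value}  # leaf: strip the trailing dot

event = {"world": [{"hey": "there"}, {"tree": "apple"}]}
print(auto_flatten(event))  # {'world.0.hey': 'there', 'world.1.tree': 'apple'}
```

Ideas (2) and (3) would trade this eager, name-expanding flattening for storage-level schemes that keep the nested structure in a more compressible or queryable form.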

On Wed, Jan 9, 2019 at 8:08 AM Furkan KAMACI  wrote:

> Hi Dylan,
>
> Indexing such data flattened works for my case. I've checked that
> documentation before, and this example from it is similar to my need:
>
> "world": [{"hey": "there"}, {"tree": "apple"}]
>
> However, I don't know what the keys will be at indexing time. Such a
> configuration is handled via this in the documentation:
>
> ...
> {
>   "type": "path",
>   "name": "world-hey",
>   "expr": "$.world[0].hey"
> },
> {
>   "type": "path",
>   "name": "worldtree",
>   "expr": "$.world[1].tree"
> }
> ...
>
> However, I can't define a parseSpec for it as far as I know. I need something
> that works like schemaless mode, or lets me define a RegEx, for example?
>
> Kind Regards,
> Furkan KAMACI
>
> On Wed, Jan 9, 2019 at 5:45 PM Dylan Wylie  wrote:
>
> > Hey Furkan,
> >
> > Druid can index flat arrays (multi-value dimensions) but not arrays of
> > objects. There is the ability to flatten objects on ingestion using
> > JSONPath. See http://druid.io/docs/latest/ingestion/flatten-json
> >
> > Best regards,
> > Dylan
> >
> > On Wed, 9 Jan 2019 at 14:23, Furkan KAMACI 
> wrote:
> >
> > > Hi All,
> > >
> > > I can index such data with Druid:
> > >
> > > {"ts":"2018-01-01T02:35:45Z","appToken":"guid", "eventName":"app-open",
> > > "key1":"value1"}
> > >
> > > via this configuration:
> > >
> > > "parser" : {
> > > "type" : "string",
> > > "parseSpec" : {
> > >   "format" : "json",
> > >   "timestampSpec" : {
> > > "format" : "iso",
> > > "column" : "ts"
> > >   },
> > >   "dimensionsSpec" : {
> > > "dimensions": [
> > >   "appToken",
> > >   "eventName",
> > >   "key1"
> > > ]
> > >   }
> > > }
> > >   }
> > >
> > > However, I would also want to index such data:
> > >
> > > {
> > >   "ts":"2018-01-01T03:35:45Z",
> > >   "appToken":"guid",
> > >   "eventName":"app-open",
> > >   "properties":[{"randomKey1":"randomValue1"},
> > > {"randomKey2":"randomValue2"}]
> > > }
> > >
> > > where properties is an array and the members of that array have some
> > > arbitrary keys and values.
> > >
> > > How can I do that?
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> >
>


Re: Druid Auto Field Type Detection

2019-01-25 Thread Gian Merlino
Hey Furkan,

Right now when Druid detects dimensions (so-called "schemaless" mode, what
you get when you have an empty dimensions list at ingestion time), it
assumes they are all strings. It would definitely be better if it did some
analysis on the incoming data and chose the most appropriate type. I think
the main consideration here is that Druid has to pick a type as soon as it
sees a new column, but it might not get it right just by looking at the
first record. Imagine some JSON data where you have a field that is the
number 3 for the first row Druid sees, but the string "foo" in the second.
The right type would be string, but Druid wouldn't know that when it gets
the first row.

Maybe it would work to do some mechanism where auto-detected fields are
ingested as strings initially into IncrementalIndex, and then potentially
converted to a different type when written to disk.
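As a rough sketch of that persist-time refinement (hypothetical, not Druid code): after buffering everything as strings, infer the narrowest type that accepts every value seen for the column before writing it to disk.

```python
def infer_column_type(values):
    """Pick 'long' if every value parses as an integer, 'double' if every
    value parses as a number, otherwise fall back to 'string'."""
    def parses(value, cast):
        try:
            cast(value)
            return True
        except (TypeError, ValueError):
            return False

    if all(parses(v, int) for v in values):
        return "long"
    if all(parses(v, float) for v in values):
        return "double"
    return "string"

print(infer_column_type(["1", "2", "3"]))   # long
print(infer_column_type(["1", "2.5"]))      # double
print(infer_column_type(["3", "foo"]))      # string
```

The last case is the scenario described above: a column that starts out looking numeric (3) but later contains "foo" correctly ends up as a string, because the decision is deferred until all buffered rows are visible.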

On Thu, Jan 10, 2019 at 12:43 AM Furkan KAMACI 
wrote:

> Hi All,
>
> I can define auto type detection for timestamp as follows:
>
> "timestampSpec" : {
>  "format" : "auto",
>  "column" : "ts"
> }
>
> In similar manner, I cannot detect field type via parseSpec. I mean:
>
>
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}
>
>
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}
>
> Both properties-key1 and properties-key2 are indexed as String. I expect to
> index properties-key2 as an Integer in Druid.
>
> So, is there any mechanism in Druid for auto field type
> detection for a newly created field? If not, I would like to implement such
> a feature.
>
> Kind Regards,
> Furkan KAMACI
>


Re: script, GPL, container question

2019-01-25 Thread Gian Merlino
For Q1 the legal guidance as I understand it is that we can provide users
with instructions for how to get optional (L)GPL dependencies, but we can't
distribute them ourselves. Putting the mysql-connector in a Docker image
does feel like distribution…

Q2 is an interesting question. I wonder if Apache has a policy on official
or semiofficial Docker containers that touches on the possibly thorny
licensing questions. It seems that they do exist for other projects,
though: https://hub.docker.com/u/apache. The Zeppelin one, for example, is
based on ubuntu so it must have plenty of GPL stuff in it:
https://hub.docker.com/r/apache/zeppelin/dockerfile. And it is presented on
the Zeppelin page as an official thing:
https://zeppelin.apache.org/docs/0.7.0/install/docker.html.

I dunno, it feels weird to me, and I am searching for evidence of these
issues having been explicitly discussed by other projects but have not
found it yet.

On Wed, Jan 23, 2019 at 3:13 PM Don Bowman  wrote:

> Re: PR https://github.com/apache/incubator-druid/pull/6896
>
> I am trying to make a container that works for folks so that we can get the
> Kubernetes helm chart off the ground.
>
> A question has arisen in the PR around the 'mysql-connector-java' which is
> GPL.
>
> Q1. A script checked into the druid container build repo that is Apache
> license, does anyone have a concern if it has a line 'wget ...
> mysql-connector.java' [link
> <
> https://github.com/apache/incubator-druid/pull/6896#pullrequestreview-195673343
> >
> to question]
> Q2. Given that the container ultimately has GPL things in it (e.g. bash) is
> there a problem releasing it? [link
>  >
> to Q]
>
> I'll work on the other comments in the PR.
>
> --don
>


Re: script, GPL, container question

2019-01-25 Thread Gian Merlino
I found this discussion of Docker images for Apache projects:
https://issues.apache.org/jira/browse/LEGAL-270. It looks like Apache Infra
has been maintaining https://hub.docker.com/u/apache and Apache Legal is
aware of it and didn't see any obvious problems. But I don't see anyone
discussing the question of GPLed components that come from the base image.

On Fri, Jan 25, 2019 at 10:17 AM Gian Merlino  wrote:

> For Q1 the legal guidance as I understand it is that we can provide users
> with instructions for how to get optional (L)GPL dependencies, but we can't
> distribute them ourselves. Putting the mysql-connector in a Docker image
> does feel like distribution…
>
> Q2 is an interesting question. I wonder if Apache has a policy on official
> or semiofficial Docker containers that touches on the possibly thorny
> licensing questions. It seems that they do exist for other projects,
> though: https://hub.docker.com/u/apache. The Zeppelin one, for example,
> is based on ubuntu so it must have plenty of GPL stuff in it:
> https://hub.docker.com/r/apache/zeppelin/dockerfile. And it is presented
> on the Zeppelin page as an official thing:
> https://zeppelin.apache.org/docs/0.7.0/install/docker.html.
>
> I dunno, it feels weird to me, and I am searching for evidence of these
> issues having been explicitly discussed by other projects but have not
> found it yet.
>
> On Wed, Jan 23, 2019 at 3:13 PM Don Bowman  wrote:
>
>> Re: PR https://github.com/apache/incubator-druid/pull/6896
>>
>> I am trying to make a container that works for folks so that we can get
>> the
>> Kubernetes helm chart off the ground.
>>
>> A question has arisen in the PR around the 'mysql-connector-java' which is
>> GPL.
>>
>> Q1. A script checked into the druid container build repo that is Apache
>> license, does anyone have a concern if it has a line 'wget ...
>> mysql-connector.java' [link
>> <
>> https://github.com/apache/incubator-druid/pull/6896#pullrequestreview-195673343
>> >
>> to question]
>> Q2. Given that the container ultimately has GPL things in it (e.g. bash)
>> is
>> there a problem releasing it? [link
>> <
>> https://github.com/apache/incubator-druid/pull/6896#discussion_r250316937
>> >
>> to Q]
>>
>> I'll work on the other comments in the PR.
>>
>> --don
>>
>


Re: script, GPL, container question

2019-01-25 Thread Gian Merlino
For Q1 - I have the feeling that if we asked the powers that be at Apache
for an opinion on bundling the MySQL connector that they would not be fans
of the idea -- mostly because it is not allowed for binary tarball
releases, and I don't see why it would be different for Docker releases. So
because of that, I wouldn't be comfortable bundling the mysql-connector jar
unless we actually _did_ ask the powers that be, and they said it's ok. The
powers that be are probably either ASF Legal or the Incubator PMC.

For Q2 - IMO precedent established by other projects means this is not
likely an issue. Probably because of the mere-aggregation reason you
brought up.

If we want to release official Docker images then this will also need to be
incorporated into our build process. Don are you interested in researching
/ proposing how that might be done? Anyone else - should we have a
discussion about whether we want official Docker images as part of our
release process? Personally, if it doesn't impose much additional burden on
release managers, and the image is something that is easy to run in
production in a way that we are comfortable supporting as a community, I am
for it. (I haven't reviewed the PR enough to have an opinion on that.)

On Fri, Jan 25, 2019 at 12:53 PM Don Bowman  wrote:

> On Fri, 25 Jan 2019 at 13:17, Gian Merlino  wrote:
>
> > For Q1 the legal guidance as I understand it is that we can provide users
> > with instructions for how to get optional (L)GPL dependencies, but we
> can't
> > distribute them ourselves. Putting the mysql-connector in a Docker image
> > does feel like distribution…
> >
> > Q2 is an interesting question. I wonder if Apache has a policy on
> official
> > or semiofficial Docker containers that touches on the possibly thorny
> > licensing questions. It seems that they do exist for other projects,
> > though: https://hub.docker.com/u/apache. The Zeppelin one, for example,
> is
> > based on ubuntu so it must have plenty of GPL stuff in it:
> > https://hub.docker.com/r/apache/zeppelin/dockerfile. And it is presented
> > on
> > the Zeppelin page as an official thing:
> > https://zeppelin.apache.org/docs/0.7.0/install/docker.html.
> >
> > I dunno, it feels weird to me, and I am searching for evidence of these
> > issues having been explicitly discussed by other projects but have not
> > found it yet.
> >
> >
> >
> GPL does not attach by mere aggregation. [see GPL FAQ
> <https://www.gnu.org/licenses/gpl-faq.en.html#MereAggregation>]
> All Linux is GPL, and all the containers are Linux for all the other Apache
> Foundation things (maven, httpd, ...). Even Debian has bash in it, which is
> GPL.
>
> so I can either:
>
> a) continue as is. I want to get this on dockerhub auto-built, that's what
> the script does now. BTW, it downloads the GPL code from the Maven repository,
> which is also run by apache.
> b) remove it, support postgres only.
>
> both are ok w/ me I suppose.
>


Re: script, GPL, container question

2019-01-25 Thread Gian Merlino
I did some looking around and found this:
https://issues.apache.org/jira/browse/INFRA-12781. It seems that the Apache
Infra folks will set up DockerHub configs for projects like us, in such a
way that they build automatically off release tags. So it does seem pretty
easy; it looks like once we have a working Dockerfile we just need to raise
an INFRA ticket to get them to set that up.

> how do i drive to consensus on the mysql thing?

One way is try to encourage more people to chime in here (you are already
doing this by writing emails). Another is to reach out and ask the
Incubator PMC (email gene...@incubator.apache.org) or Apache Legal (raise a
LEGAL ticket in JIRA) for advice. I would probably go for the Incubator PMC
first, since the audience is a bit larger and this may have come up before.

On Fri, Jan 25, 2019 at 1:24 PM Don Bowman  wrote:

> On Fri, 25 Jan 2019 at 16:07, Gian Merlino  wrote:
>
> > For Q1 - I have the feeling that if we asked the powers that be at Apache
> > for an opinion on bundling the MySQL connector that they would not be
> fans
> > of the idea -- mostly because it is not allowed for binary tarball
> > releases, and I don't see why it would be different for Docker releases.
> So
> > because of that, I wouldn't be comfortable bundling the mysql-connector
> jar
> > unless we actually _did_ ask the powers that be, and they said it's ok.
> The
> > powers that be are probably either ASF Legal or the Incubator PMC.
> >
> > For Q2 - IMO precedent established by other projects means this is not
> > likely an issue. Probably because of the mere-aggregation reason you
> > brought up.
> >
> > If we want to release official Docker images then this will also need to
> be
> > incorporated into our build process. Don are you interested in
> researching
> > / proposing how that might be done? Anyone else - should we have a
> > discussion about whether we want official Docker images as part of our
> > release process? Personally, if it doesn't impose much additional burden
> on
> > release managers, and the image is something that is easy to run in
> > production in a way that we are comfortable supporting as a community, I
> am
> > for it. (I haven't reviewed the PR enough to have an opinion on that.)
> >
> >
> >
> It's pretty trivial to let DockerHub run its own pipeline on any {tag |
> merge} in git.
> It does this automatically.
> Or, its not too hard to have travis have some keys to do a push and it in
> turn is gated
> by a {tag | merge}.
>
> some projects build the release on each 'tag' created.
> some build on tag matching pattern
> some build on any merge commit to master.
>
> IMO w/o a release of the container to a public repo this is pointless; it's what
> people
> expect.
>
> I don't have any particular domain knowledge in the area but am willing to
> do
> some work if it is identified as needing done.
>
> how do i drive to consensus on the mysql thing?
>


Re: The etiquette of poking people on Github and the policy when people stop responding

2019-01-28 Thread Gian Merlino
I don't think it's irresponsible to start a review and not be able to
finish it promptly. But drawing the process out can feel frustrating to
other committers that have already finished reviewing, like they are being
told that their review is not good enough. I think it's a matter of degree.
Two weeks sounds very long to me but two or three days sounds reasonable.
Part of the rationale for this is that PR review is an iterative process.
If each iteration could take two weeks then a patch might not be committed
for months. (This happens sometimes, and it is sad, and sometimes I've been
on the guilty end of it, and it's something I think we should try to avoid
by endeavoring to speed up review cycles.)

It's a totally different situation if nobody else has reviewed a patch yet.
In that case a reviewer reviewing things with longer cycles isn't blocking
anything. They are actually helping a lot by being the only person willing
to review the patch at all. In this case I think the etiquette and timings
you suggested are more reasonable.

I guess that reviewers prioritize the newest PRs first because of how the
GitHub UI works. By default it sorts PRs by created date, newest first. And
it doesn't have an option to sort by "most time elapsed since review".
Maybe we should make our own review console that sorts the PRs differently?
Or shoot for PR-zero (like inbox-zero): close all PRs that haven't had
comments in 6 months and try to review all the others as fast as possible.

On Mon, Jan 28, 2019 at 8:43 AM Roman Leventov 
wrote:

> On Fri, 25 Jan 2019 at 23:12, Gian Merlino  wrote:
>
> > If enough other committers have already reviewed and accepted a patch, I
> > don't think it's fair to the author or to those other reviewers for
> > committing to be delayed by two weeks because another committer doesn't
> > have time to finish reviewing, but wants others to wait for them anyway.
>
>
> It sounds like it's implied that the reviewer is irresponsible because he
> started reviewing a PR but "doesn't have time to finish reviewing". In
> fact, a reviewer could *have* time to finish reviewing and is willing to
> finish the review, but they don't have time *tomorrow*. A reviewer could
> have different duties and could slice only Y hours for reviews of Druid PRs
> every X days. X/(Y * number of PRs the reviewer is interested in at the
> moment) is how often (in days) a reviewer could pay attention to specific
> PR. I think we should respect that for some people, at least at some times,
> this value could grow to about 7. Saying that we shouldn't wait for those
> people creates a bias favoring most involved developers, and it's not
> necessarily a good bias, because sometimes outsider (or half-outsider)
> perspective is valuable.
>
> If we do releases every 3 months and the time between creating a release
> branch and the final release candidate is at least 3 weeks (historically),
> why is there an urge to commit anything faster?
>
> IMO the real source of unfairness in reviews is that reviewers generally
> prioritize the newest PRs rather than PRs that await for reviews the
> longest. The probability that somebody starts to review a PR decreases with
> time, while it should increase.
>


Re: Proposal to shade Guava manually in Druid

2019-01-29 Thread Gian Merlino
Interesting proposal - I commented on the issue. It sounds like a good idea.

On Tue, Jan 29, 2019 at 7:22 AM Roman Leventov  wrote:

> https://github.com/apache/incubator-druid/issues/6942
>


Re: The etiquette of poking people on Github and the policy when people stop responding

2019-01-29 Thread Gian Merlino
> I disagree with Roman's suggestions. If a PR has enough votes, we should
> trust the committers approving the PR and move forward.

FWIW, I do think it's good to be courteous and give other reviewers a day
or two to either follow up on a review or decide to leave the decision to
the reviewers that have already chimed in. I just think that allowing two
weeks for that is excessive and would lead to PRs languishing in the queue
even more than they do now.

On Mon, Jan 28, 2019 at 1:30 PM Fangjin Yang  wrote:

> I disagree with Roman's suggestions. If a PR has enough votes, we should
> trust the committers approving the PR and move forward.
>
> On Mon, Jan 28, 2019 at 9:28 AM Gian Merlino  wrote:
>
> > I don't think it's irresponsible to start a review and not be able to
> > finish it promptly. But drawing the process out can feel frustrating to
> > other committers that have already finished reviewing, like they are
> being
> > told that their review is not good enough. I think it's a matter of
> degree.
> > Two weeks sounds very long to me but two or three days sounds reasonable.
> > Part of the rationale for this is that PR review is an iterative process.
> > If each iteration could take two weeks then a patch might not be
> committed
> > for months. (This happens sometimes, and it is sad, and sometimes I've
> been
> > on the guilty end of it, and it's something I think we should try to
> avoid
> > by endeavoring to speed up review cycles.)
> >
> > It's a totally different situation if nobody else has reviewed a patch
> yet.
> > In that case a reviewer reviewing things with longer cycles isn't
> blocking
> > anything. They are actually helping a lot by being the only person
> willing
> > to review the patch at all. In this case I think the etiquette and
> timings
> > you suggested are more reasonable.
> >
> > I guess that reviewers prioritize the newest PRs first because of how the
> > GitHub UI works. By default it sorts PRs by created date, newest first.
> And
> > it doesn't have an option to sort by "most time elapsed since review".
> > Maybe we should make our own review console that sorts the PRs
> differently?
> > Or shoot for PR-zero (like inbox-zero): close all PRs that haven't had
> > comments in 6 months and try to review all the others as fast as
> possible.
> >
> > On Mon, Jan 28, 2019 at 8:43 AM Roman Leventov 
> > wrote:
> >
> > > On Fri, 25 Jan 2019 at 23:12, Gian Merlino  wrote:
> > >
> > > > If enough other committers have already reviewed and accepted a
> patch,
> > I
> > > > don't think it's fair to the author or to those other reviewers for
> > > > committing to be delayed by two weeks because another committer
> doesn't
> > > > have time to finish reviewing, but wants others to wait for them
> > anyway.
> > >
> > >
> > > It sounds like it's implied that the reviewer is irresponsible because
> he
> > > started reviewing a PR but "doesn't have time to finish reviewing". In
> > > fact, a reviewer could *have* time to finish reviewing and is willing
> to
> > > finish the review, but they don't have time *tomorrow*. A reviewer
> could
> > > have different duties and could slice only Y hours for reviews of Druid
> > PRs
> > > every X days. X/(Y * number of PRs the reviewer is interested in at the
> > > moment) is how often (in days) a reviewer could pay attention to
> specific
> > > PR. I think we should respect that for some people, at least at some
> > times,
> > > this value could grow to about 7. Saying that we shouldn't wait for
> those
> > > people creates a bias favoring most involved developers, and it's not
> > > necessarily a good bias, because sometimes outsider (or half-outsider)
> > > perspective is valuable.
> > >
> > > If we do releases every 3 months and the time between creating a
> release
> > > branch and the final release candidate is at least 3 weeks
> > (historically),
> > > why is there an urge to commit anything faster?
> > >
> > > IMO the real source of unfairness in reviews is that reviewers
> generally
> > > prioritize the newest PRs rather than PRs that await for reviews the
> > > longest. The probability that somebody starts to review a PR decreases
> > with
> > > time, while it should increase.
> > >
> >
>


Re: Contributing an extension

2019-01-29 Thread Gian Merlino
Hi Eyal,

I'll take a look too. For some reason I missed this when you first posted
it, but it is very interesting work, and looks like it could be part of a
path to supporting generic windowed aggregations in Druid SQL. (Moving
average, cumulative sum, and so on)

On Tue, Jan 29, 2019 at 7:07 PM Jihoon Son  wrote:

> Sorry for delayed review. I'll take a look once the 0.14.0 release is
> finished.
>
> Jihoon
>
> On Tue, Jan 29, 2019 at 3:58 PM Eyal Yurman
>  wrote:
>
> > Hi,
> >
> > A few months ago I worked on open sourcing an extension we've been using
> > internally for a couple of years.
> >
> > https://github.com/apache/incubator-druid/pull/6430
> > https://github.com/apache/incubator-druid/issues/6320
> >
> > This has been upvoted by a few community users, so I was happy to put in
> > the effort.
> >
> > Unfortunately, I haven't found any committer who could spend the time
> > reviewing this work...
> >
> > Is there anyone who could do that sometime next month?
> >
> > I could always release it under a separate repository, but since I am
> > committed to continuing maintaining and expanding the module, I'd prefer
> to
> > keep it as part of the Druid codebase.
> >
> > Thanks!
> >
> > Eyal.
> >
>


Re: Off list major development

2019-01-30 Thread Gian Merlino
I think it'd also be nice to tweak a couple parts of the KIP template
(Motivation; Public Interfaces; Proposed Changes; Compatibility,
Deprecation, and Migration Plan; Test Plan; Rejected Alternatives). A
couple people have suggested adding a "Rationale" section, how about adding
that and removing "Rejected alternatives" -- rolling them in together? And
dropping "test plan", since IMO that discussion can be deferred to the PR
itself, when there is code ready. Finally, adding "future work", detailing
where this change might lead us.

So in particular the template I am suggesting would be something like this.

1) Motivation: A description of the problem.
2) Proposed changes: Should usually be the longest section. Should include
any changes that are proposed to user-facing interfaces (configuration
parameters, JSON query/ingest specs, SQL language, emitted metrics, and so
on).
3) Rationale: A discussion of why this particular solution is the best one.
One good way to approach this is to discuss other alternative solutions
that you considered and decided against. This should also include a
discussion of any specific benefits or drawbacks you are aware of.
4) Operational impact: Is anything going to be deprecated or removed by
this change? Is there a migration path that cluster operators need to be
aware of? Will there be any effect on the ability to do a rolling upgrade,
or to do a rolling _downgrade_ if an operator wants to switch back to a
previous version?
5) Future work: A discussion of things that you believe are out of scope
for the particular proposal but would be nice follow-ups. It helps show
where a particular change could be leading us. There isn't any commitment
that the proposal author will actually work on this stuff. It is okay if
this section is empty.

On Wed, Jan 30, 2019 at 3:14 PM Jihoon Son  wrote:

> Thanks Eyal and Jon for starting the discussion about making a template!
>
> The KIP template looks good, but I would like to add one more.
> The current template is:
>
> - Motivation
> - Public Interfaces
> - Proposed Changes
> - Compatibility, Deprecation, and Migration Plan
> - Test Plan
> - Rejected Alternatives
>
> It includes almost everything required for proposals, but I think it's
> missing why the author chose the proposed changes.
> So, I think it would be great if we can add 'Rationale' or 'Expected
> benefits and drawbacks'.
> People might include it by themselves in 'Motivation' or 'Proposed
> Changes', but it would be good if there's an explicit section to describe
> it.
>
> Best,
> Jihoon
>


Re: Slow download of segments from deep storage

2019-01-30 Thread Gian Merlino
I believe today, if you use the (experimental) HTTP-based load queues, they
will parallelize segment downloads. Adding similar functionality for the
ZK-based load queues would definitely be useful though, since at this time
nobody seems to be actively driving a migration to HTTP-based load queues
being enabled by default.

On Wed, Jan 30, 2019 at 7:20 PM Samarth Jain  wrote:

> We noticed that it takes a long time for the historicals to download
> segments from deep storage (in our case S3). Looking closer at the code in
> ZKCoordinator, I noticed that the segment download is happening in a single
> threaded fashion. This download happens in the SingleThreadedExecutor
> service used by the PathChildrenCache. Looking at the commentary on
> https://github.com/apache/incubator-druid/issues/4421 and
> https://github.com/apache/incubator-druid/issues/3202, the executor
> service
> used in PathChildrenCache can only be single threaded.
>
> My proposal is to use a multi threaded ExecutorService that will be used to
> take action on the events to perform the download. The role of the
> single-threaded ExecutorService in PathChildrenCache will simply be to delegate
> the download task to this new executor service.
>
> Does that sound feasible? IMO, if this happens to be functionally correct,
> it should significantly reduce the time it takes historicals to
> download all the assigned segments.
>
> I would be more than happy to contribute this enhancement to the community.
>
> Thanks,
> Samarth
>
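Samarth's delegation idea above could be sketched as follows (an illustrative Python model of the pattern only — the real change would live in Java around ZkCoordinator): a single-threaded dispatcher, standing in for the PathChildrenCache callback thread, merely hands each download off to a multi-threaded pool.

```python
import concurrent.futures
import time

# Pool that actually performs the (slow) downloads concurrently.
download_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)
# Single-threaded dispatcher, like the PathChildrenCache event thread.
dispatcher = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def download_segment(segment_id):
    time.sleep(0.1)  # stand-in for the S3 fetch
    return segment_id

def on_child_event(segment_id, results):
    # The dispatcher thread returns immediately; the download itself
    # runs on the multi-threaded pool.
    results.append(download_pool.submit(download_segment, segment_id))

results = []
for seg in range(16):
    dispatcher.submit(on_child_event, seg, results).result()

done = [f.result() for f in results]
print(sorted(done))  # all 16 segment ids; wall time ~0.2s rather than ~1.6s
download_pool.shutdown()
dispatcher.shutdown()
```

With 8 workers, the 16 sleeps overlap in two batches, which is the speedup being proposed: the event thread is never blocked on a download.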


Re: Forbiddenapis Plugin

2019-01-31 Thread Gian Merlino
I get those sometimes with generated sources -- typically doing a "mvn
clean" beforehand clears it up. We might be able to add exclusions for the
generated source directories in order to avoid the need to do this.
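If a `mvn clean` alone doesn't clear it, the forbiddenapis Maven plugin accepts Ant-style class-file exclusion patterns; a hedged sketch (the pattern is illustrative — it would need to match wherever the generated or instrumented classes actually land):

```xml
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <configuration>
    <excludes>
      <!-- skip classes produced from generated sources (illustrative pattern) -->
      <exclude>**/generated/**</exclude>
    </excludes>
  </configuration>
</plugin>
```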

On Thu, Jan 31, 2019 at 5:15 AM Furkan KAMACI 
wrote:

> I tried to run the forbiddenapis plugin on Druid. However, I get these errors
> but don't know where they actually point:
>
> [INFO] Scanning classes for violations...
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.BinaryEvalOpExprBase (Expr.java,
> method body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.LongExpr (Expr.java, method body of
> '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.FunctionExpr (Expr.java, method
> body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.data.input.impl.InputRowParser
> (InputRowParser.java, method body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.BinAndExpr (Expr.java, method body
> of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.java.util.common.concurrent.Execs
> (Execs.java, method body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.BinOrExpr (Expr.java, method body
> of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.StringExpr (Expr.java, method body
> of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.DoubleExpr (Expr.java, method body
> of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.UnaryMinusExpr (Expr.java, method
> body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.UnaryNotExpr (Expr.java, method
> body of '$$$reportNull$$$0(int)')
> [ERROR] Forbidden method invocation:
> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default
> locale]
> [ERROR]   in org.apache.druid.math.expr.IdentifierExpr (Expr.java, method
> body of '$$$reportNull$$$0(int)')
> [ERROR] Scanned 714 class file(s) for forbidden API invocations (in 0.65s),
> 12 error(s).
>
> Do you have any idea?
>
> Kind Regards,
> Furkan KAMACI
>


Re: Experimental feature 'graduation' in 0.14

2019-01-31 Thread Gian Merlino
It sounds like the general feeling is +1 on Kafka and maybe wait another
release for SQL. I will do a PR to mark Kafka ingest as non-experimental,
then, and on SQL we'll see whether #6742 and #6862 look solid in 0.14.

On Tue, Jan 15, 2019 at 8:39 AM Gian Merlino  wrote:

> Hi Mat,
>
> Ah, right. IMO https://github.com/apache/incubator-druid/pull/6742 is a
> decent workaround towards making #6176 less of a problem. It would prevent
> incorrect results from happening (the broker will not start up its http
> server & announce itself, and so it won't get picked up by clients, if it
> never got the initialization event). If paired with monitoring that
> restarts unhealthy brokers, the issue should be fully worked-around in
> practice.
>
> Even though there's an (imo) viable workaround, it would still be good to
> fix the root cause of #6176. I just raised
> https://github.com/apache/incubator-druid/pull/6862 to update Curator and
> see if that helps -- there is a bug fixed in the latest release that looks
> like it could cause the behavior we're seeing (
> https://issues.apache.org/jira/browse/CURATOR-476).
>
> My feeling is that it's still reasonable to remove the experimental label
> from Druid SQL in 0.14, especially since #6742 will make SQL and native
> queries behave at parity (initialization getting missed will delay broker
> startup for _both_ cases). So in that sense they are at least on the same
> footing. And hopefully #6862 will fix them both, together.
>
> On Tue, Jan 15, 2019 at 7:56 AM Pierre-Emile Ferron 
> wrote:
>
>> A remaining issue with SQL is
>> https://github.com/apache/incubator-druid/issues/6176
>>
>> We've seen it happen several times in production on 0.12, where thankfully
>> SQL doesn't power anything critical. The current workarounds are:
>> 1. Restart the broker. Obviously not a good solution.
>> 2. Migrate to HTTP segment discovery. I'm fine with that, and we are
>> actually planning to do it soon in our clusters, but I'm still concerned
>> about other Druid users—the default setting is still ZK, which means that
>> SQL would still have this issue by default.
>>
>> Before marking SQL as non-experimental, I'd suggest either fixing the root
>> cause, or making HTTP segment discovery the default and then explicitly
>> deprecating ZK segment discovery.
>>
>>
>> On Mon, Jan 14, 2019 at 2:18 PM Gian Merlino  wrote:
>>
>> > I'd like to propose graduating a couple of features out of
>> 'experimental'
>> > status in 0.14. Both are popular features (judging by mailing list &
>> github
>> > issue/PR activity). Both have been around for a while and have attained
>> a
>> > good level of quality and stability of API & behavior. I believe
>> removing
>> > the 'experimental' banner from these features would more accurately
>> reflect
>> > reality, and be a good signal to the user community.
>> >
>> > 1) Kafka indexing service. First introduced in Druid 0.9.1, it went
>> through
>> > a major protocol change in Druid 0.12.0 that added incremental
>> publishing,
>> > & 'mixing' of data from different partitions. Subjectively, quality
>> appears
>> > to be getting more solid, based on frequency of bug reports and also
>> based
>> > on our own experiences running this in production. Finally- I believe
>> it is
>> > already much more robust than Tranquility, the only 'stable'
>> alternative.
>> >
>> > 2) Druid SQL. First introduced in Druid 0.10.0. It isn't feature
>> complete
>> > yet (multi-value dimensions, datasketches, etc, remain unsupported) but
>> the
>> > API and behavior have been generally stable. No major issues around
>> memory
>> > / performance / etc regressions relative to native Druid queries are
>> > outstanding. IMO, it is well on its way to becoming a first class way to
>> > query Druid, and it is a good time to remove the 'experimental' banner.
>> >
>>
>
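
[Editor's note: for readers following the ZK-vs-HTTP segment discovery discussion above — switching a cluster to HTTP-based discovery is a runtime property change. A sketch of the relevant settings (property names per the Druid docs of this era; verify against your version):]

```properties
# Brokers discover segments over HTTP instead of ZooKeeper
druid.serverview.type=http
# Coordinator assigns segments to historicals over HTTP
druid.coordinator.loadqueuepeon.type=http
```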


Re: Off list major development

2019-01-31 Thread Gian Merlino
If it's not clear - I am agreeing with Jihoon and Slim that a separate
"Rationale" section makes sense in addition to a couple other suggested
tweaks.

On Wed, Jan 30, 2019 at 3:46 PM Gian Merlino  wrote:

> I think it'd also be nice to tweak a couple parts of the KIP template
> (Motivation; Public Interfaces; Proposed Changes; Compatibility,
> Deprecation, and Migration Plan; Test Plan; Rejected Alternatives). A
> couple people have suggested adding a "Rationale" section, how about adding
> that and removing "Rejected alternatives" -- rolling them in together? And
> dropping "test plan", since IMO that discussion can be deferred to the PR
> itself, when there is code ready. Finally, adding "future work", detailing
> where this change might lead us.
>
> So in particular the template I am suggesting would be something like this.
>
> 1) Motivation: A description of the problem.
> 2) Proposed changes: Should usually be the longest section. Should include
> any changes that are proposed to user-facing interfaces (configuration
> parameters, JSON query/ingest specs, SQL language, emitted metrics, and so
> on).
> 3) Rationale: A discussion of why this particular solution is the best
> one. One good way to approach this is to discuss other alternative
> solutions that you considered and decided against. This should also include
> a discussion of any specific benefits or drawbacks you are aware of.
> 4) Operational impact: Is anything going to be deprecated or removed by
> this change? Is there a migration path that cluster operators need to be
> aware of? Will there be any effect on the ability to do a rolling upgrade,
> or to do a rolling _downgrade_ if an operator wants to switch back to a
> previous version?
> 5) Future work: A discussion of things that you believe are out of scope
> for the particular proposal but would be nice follow-ups. It helps show
> where a particular change could be leading us. There isn't any commitment
> that the proposal author will actually work on this stuff. It is okay if
> this section is empty.
>
> On Wed, Jan 30, 2019 at 3:14 PM Jihoon Son  wrote:
>
>> Thanks Eyal and Jon for starting the discussion about making a template!
>>
>> The KIP template looks good, but I would like to add one more.
>> The current template is:
>>
>> - Motivation
>> - Public Interfaces
>> - Proposed Changes
>> - Compatibility, Deprecation, and Migration Plan
>> - Test Plan
>> - Rejected Alternatives
>>
>> It includes almost everything required for proposals, but I think it's
>> missing why the author chose the proposed changes.
>> So, I think it would be great if we can add 'Rationale' or 'Expected
>> benefits and drawbacks'.
>> People might include it by themselves in 'Motivation' or 'Proposed
>> Changes', but it would be good if there's an explicit section to describe
>> it.
>>
>> Best,
>> Jihoon
>>
>


Re: Forbiddenapis Plugin

2019-01-31 Thread Gian Merlino
Good question. I'm not sure. They are at least doing String.format on
_something_ without an explicit locale.

On Thu, Jan 31, 2019 at 9:36 AM Charles Allen
 wrote:

> Is this indicative of latent bugs the generated sources have?
>
> On Thu, Jan 31, 2019 at 8:55 AM Gian Merlino  wrote:
>
> > I get those sometimes with generated sources -- typically doing a "mvn
> > clean" beforehand clears it up. We might be able to add exclusions for
> the
> > generated source directories in order to avoid the need to do this.
> >
> > On Thu, Jan 31, 2019 at 5:15 AM Furkan KAMACI 
> > wrote:
> >
> > > I tried to run the forbiddenapis plugin on Druid. However I get these
> > > errors but don't know where they actually point:
> > >
> > > [INFO] Scanning classes for violations...
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.BinaryEvalOpExprBase
> (Expr.java,
> > > method body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.LongExpr (Expr.java, method
> body
> > of
> > > '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.FunctionExpr (Expr.java, method
> > > body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.data.input.impl.InputRowParser
> > > (InputRowParser.java, method body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.BinAndExpr (Expr.java, method
> > body
> > > of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.java.util.common.concurrent.Execs
> > > (Execs.java, method body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.BinOrExpr (Expr.java, method
> body
> > > of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.StringExpr (Expr.java, method
> > body
> > > of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.DoubleExpr (Expr.java, method
> > body
> > > of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.UnaryMinusExpr (Expr.java,
> method
> > > body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.UnaryNotExpr (Expr.java, method
> > > body of '$$$reportNull$$$0(int)')
> > > [ERROR] Forbidden method invocation:
> > > java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > default
> > > locale]
> > > [ERROR]   in org.apache.druid.math.expr.IdentifierExpr (Expr.java,
> method
> > > body of '$$$reportNull$$$0(int)')
> > > [ERROR] Scanned 714 class file(s) for forbidden API invocations (in
> > 0.65s),
> > > 12 error(s).
> > >
> > > Do you have any idea?
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> >
>
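
[Editor's note: the exclusion idea mentioned above could be expressed in the forbiddenapis Maven plugin configuration roughly like this — a sketch only; the glob patterns are assumptions about where generated classes land in this codebase:]

```xml
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <configuration>
    <!-- Skip classes compiled from generated sources, so stale or
         tool-generated bytecode doesn't trip the locale checks. -->
    <excludes>
      <exclude>**/generated/**</exclude>
      <exclude>**/generated-sources/**</exclude>
    </excludes>
  </configuration>
</plugin>
```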


Re: Off list major development

2019-02-01 Thread Gian Merlino
Sure, I can see the value of putting it in the template to encourage
thinking about it. Sounds good to me.

On Thu, Jan 31, 2019 at 2:47 PM Jonathan Wei  wrote:

> That structure sounds good:
> - expanding rejected alternatives to a broader rationale section sounds
> good
> - I like "operational impact" as suggested by Slim and Gian more than the
> corresponding KIP terminology
> - Future work is a good addition
>
> For test plan, I don't have a very strong opinion on this, but I'm thinking
> it could make sense as an optional section (if someone has one, that's
> cool, if not, that's cool too, but perhaps having it present in the
> template would encourage ppl to think about testing strategies early on if
> they aren't already)
>
>
> On Thu, Jan 31, 2019 at 2:17 PM Jihoon Son  wrote:
>
> > Thanks Gian.
> > The suggested template looks good to me.
> >
> > Jihoon
> >
> > On Thu, Jan 31, 2019 at 9:27 AM Gian Merlino  wrote:
> >
> > > If it's not clear - I am agreeing with Jihoon and Slim that a separate
> > > "Rationale" section makes sense in addition to a couple other suggested
> > > tweaks.
> > >
> > > On Wed, Jan 30, 2019 at 3:46 PM Gian Merlino  wrote:
> > >
> > > > I think it'd also be nice to tweak a couple parts of the KIP template
> > > > (Motivation; Public Interfaces; Proposed Changes; Compatibility,
> > > > Deprecation, and Migration Plan; Test Plan; Rejected Alternatives). A
> > > > couple people have suggested adding a "Rationale" section, how about
> > > adding
> > > > that and removing "Rejected alternatives" -- rolling them in
> together?
> > > And
> > > > dropping "test plan", since IMO that discussion can be deferred to
> the
> > PR
> > > > itself, when there is code ready. Finally, adding "future work",
> > > detailing
> > > > where this change might lead us.
> > > >
> > > > So in particular the template I am suggesting would be something like
> > > this.
> > > >
> > > > 1) Motivation: A description of the problem.
> > > > 2) Proposed changes: Should usually be the longest section. Should
> > > include
> > > > any changes that are proposed to user-facing interfaces
> (configuration
> > > > parameters, JSON query/ingest specs, SQL language, emitted metrics,
> and
> > > so
> > > > on).
> > > > 3) Rationale: A discussion of why this particular solution is the
> best
> > > > one. One good way to approach this is to discuss other alternative
> > > > solutions that you considered and decided against. This should also
> > > include
> > > > a discussion of any specific benefits or drawbacks you are aware of.
> > > > 4) Operational impact: Is anything going to be deprecated or removed
> by
> > > > this change? Is there a migration path that cluster operators need to
> > be
> > > > aware of? Will there be any effect on the ability to do a rolling
> > > upgrade,
> > > > or to do a rolling _downgrade_ if an operator wants to switch back
> to a
> > > > previous version?
> > > > 5) Future work: A discussion of things that you believe are out of
> > scope
> > > > for the particular proposal but would be nice follow-ups. It helps
> show
> > > > where a particular change could be leading us. There isn't any
> > commitment
> > > > that the proposal author will actually work on this stuff. It is okay
> > if
> > > > this section is empty.
> > > >
> > > > On Wed, Jan 30, 2019 at 3:14 PM Jihoon Son 
> > wrote:
> > > >
> > > >> Thanks Eyal and Jon for starting the discussion about making a
> > template!
> > > >>
> > > >> The KIP template looks good, but I would like to add one more.
> > > >> The current template is:
> > > >>
> > > >> - Motivation
> > > >> - Public Interfaces
> > > >> - Proposed Changes
> > > >> - Compatibility, Deprecation, and Migration Plan
> > > >> - Test Plan
> > > >> - Rejected Alternatives
> > > >>
> > > >> It includes almost everything required for proposals, but I think
> it's
> > > >> missing why the author chose the proposed changes.
> > > >> So, I think it would be great if we can add 'Rationale' or 'Expected
> > > >> benefits and drawbacks'.
> > > >> People might include it by themselves in 'Motivation' or 'Proposed
> > > >> Changes', but it would be good if there's an explicit section to
> > > describe
> > > >> it.
> > > >>
> > > >> Best,
> > > >> Jihoon
> > > >>
> > > >
> > >
> >
>


Re: Off list major development

2019-02-01 Thread Gian Merlino
I think we should clarify the process too. Might I suggest,

1) Add a GitHub issue template with proposal headers and some description
of what each section should be, so people can fill them in easily.
2) Suggest that for any change that would need a design review per
http://druid.io/community/, the author also creates a proposal issue
following that template. It can be very short if the change is simple. The
design discussion should take place on the proposal issue, and the code
review should take place on the PR. A +1 on either the issue or the PR
would be considered a +1 for the design, while only a +1 on the PR would be
considered a +1 for the code itself.
3) Update http://druid.io/community/ and our CONTRIBUTING.md with guidance
about (2) and encouraging that the proposal issues are created early in the
dev cycle.

I am thinking of "suggest" rather than "require" in (2) so we can start
slow and see how we like this process before making it mandatory.

On Fri, Feb 1, 2019 at 2:22 AM Clint Wylie  wrote:

> +1 for proposal template.
>
> Do we also need to clarify the process that goes along with the proposals?
> (It seems clear to me that we've reached consensus on wanting a proposal
> process, but less clear whether we have a shared picture of, or consensus
> on, the process itself.) Things like when voting happens,
> appropriate PR timing, voting period, announcements to dev list,
> significance of silence (implicit +1 or -1?), etc. Even if just adapting
> Kafka's I think it might be a good idea to lay it out in this thread.
>
> Beyond putting reference to this stuff in top level github readme and on
> the website, is there anything more we should do to guide people that are
> thinking about contributing to use the proposal process?
>
> On Thu, Jan 31, 2019 at 2:47 PM Jonathan Wei  wrote:
>
> > That structure sounds good:
> > - expanding rejected alternatives to a broader rationale section sounds
> > good
> > - I like "operational impact" as suggested by Slim and Gian more than the
> > corresponding KIP terminology
> > - Future work is a good addition
> >
> > For test plan, I don't have a very strong opinion on this, but I'm
> thinking
> > it could make sense as an optional section (if someone has one, that's
> > cool, if not, that's cool too, but perhaps having it present in the
> > template would encourage ppl to think about testing strategies early on
> if
> > they aren't already)
> >
> >
> > On Thu, Jan 31, 2019 at 2:17 PM Jihoon Son  wrote:
> >
> > > Thanks Gian.
> > > The suggested template looks good to me.
> > >
> > > Jihoon
> > >
> > > On Thu, Jan 31, 2019 at 9:27 AM Gian Merlino  wrote:
> > >
> > > > If it's not clear - I am agreeing with Jihoon and Slim that a
> separate
> > > > "Rationale" section makes sense in addition to a couple other
> suggested
> > > > tweaks.
> > > >
> > > > On Wed, Jan 30, 2019 at 3:46 PM Gian Merlino 
> wrote:
> > > >
> > > > > I think it'd also be nice to tweak a couple parts of the KIP
> template
> > > > > (Motivation; Public Interfaces; Proposed Changes; Compatibility,
> > > > > Deprecation, and Migration Plan; Test Plan; Rejected
> Alternatives). A
> > > > > couple people have suggested adding a "Rationale" section, how
> about
> > > > adding
> > > > > that and removing "Rejected alternatives" -- rolling them in
> > together?
> > > > And
> > > > > dropping "test plan", since IMO that discussion can be deferred to
> > the
> > > PR
> > > > > itself, when there is code ready. Finally, adding "future work",
> > > > detailing
> > > > > where this change might lead us.
> > > > >
> > > > > So in particular the template I am suggesting would be something
> like
> > > > this.
> > > > >
> > > > > 1) Motivation: A description of the problem.
> > > > > 2) Proposed changes: Should usually be the longest section. Should
> > > > include
> > > > > any changes that are proposed to user-facing interfaces
> > (configuration
> > > > > parameters, JSON query/ingest specs, SQL language, emitted metrics,
> > and
> > > > so
> > > > > on).
> > > > > 3) Rationale: A discussion of why this particular solution is the
> > best
> > > > > one. One good way to approach this is to discuss othe

Re: Forbiddenapis Plugin

2019-02-02 Thread Gian Merlino
Would it change any behaviors? I'm not familiar with the difference between
the two locales in practice.
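
[Editor's note: an illustrative snippet, not from the Druid codebase — the default locale affects things like the decimal separator, which is why forbiddenapis flags locale-less String.format. For number formatting, ENGLISH and ROOT behave the same; ROOT is simply the conventional "no locale" choice. Locale.GERMANY stands in below for "a default locale you didn't expect":]

```java
import java.util.Locale;

public class LocaleFormatDemo {
    public static void main(String[] args) {
        double value = 1234.5;
        // With no Locale argument, String.format uses the JVM's default
        // locale, so output depends on the machine it runs on.
        System.out.println(String.format(Locale.GERMANY, "%f", value)); // 1234,500000
        // Locale.ROOT pins the output to locale-independent conventions.
        System.out.println(String.format(Locale.ROOT, "%f", value));    // 1234.500000
    }
}
```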

On Sat, Feb 2, 2019 at 5:09 AM Furkan KAMACI  wrote:

> Hi,
>
> By the way, Locale.ENGLISH is used to fix such cases. However, we may
> prefer Locale.ROOT?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Jan 31, 2019 at 9:50 PM Julian Hyde  wrote:
>
> > I’ve seen a similar problem in Calcite, which also uses forbiddenApis. It
> > only seems to occur in a “bad build”; when you do “mvn clean” the
> problems
> > disappear.
> >
> > My hypothesis is that the code is generated by javac, for example for the
> > messages from “assert”, or when concatenating string literals separated
> by
> > “+", and it really is not something to worry about.
> >
> > Julian
> >
> >
> > > On Jan 31, 2019, at 9:40 AM, Gian Merlino  wrote:
> > >
> > > Good question. I'm not sure. They are at least doing String.format on
> > > _something_ without an explicit locale.
> > >
> > > On Thu, Jan 31, 2019 at 9:36 AM Charles Allen
> > >  wrote:
> > >
> > >> Is this indicative of latent bugs the generated sources have?
> > >>
> > >> On Thu, Jan 31, 2019 at 8:55 AM Gian Merlino  wrote:
> > >>
> > >>> I get those sometimes with generated sources -- typically doing a
> "mvn
> > >>> clean" beforehand clears it up. We might be able to add exclusions
> for
> > >> the
> > >>> generated source directories in order to avoid the need to do this.
> > >>>
> > >>> On Thu, Jan 31, 2019 at 5:15 AM Furkan KAMACI <
> furkankam...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> I tried to run the forbiddenapis plugin on Druid. However I get these
> > >>>> errors but don't know where they actually point:
> > >>>>
> > >>>> [INFO] Scanning classes for violations...
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.math.expr.BinaryEvalOpExprBase
> > >> (Expr.java,
> > >>>> method body of '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.math.expr.LongExpr (Expr.java, method
> > >> body
> > >>> of
> > >>>> '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.math.expr.FunctionExpr (Expr.java,
> > method
> > >>>> body of '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.data.input.impl.InputRowParser
> > >>>> (InputRowParser.java, method body of '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.math.expr.BinAndExpr (Expr.java,
> method
> > >>> body
> > >>>> of '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.java.util.common.concurrent.Execs
> > >>>> (Execs.java, method body of '$$$reportNull$$$0(int)')
> > >>>> [ERROR] Forbidden method invocation:
> > >>>> java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses
> > >>> default
> > >>>> locale]
> > >>>> [ERROR]   in org.apache.druid.math.e

Re: Druid 0.14 timing

2019-02-04 Thread Gian Merlino
Sounds good to me - thanks Jon.

On Sun, Feb 3, 2019 at 4:49 PM Jonathan Wei  wrote:

> Hi all,
>
> It's around the discussed time for the release; I'm planning on cutting the
> 0.14.0 branch tomorrow, and I'll volunteer to serve as the release manager
> for 0.14.0.
>
> Thanks,
> Jon
>
> On Mon, Jan 7, 2019 at 5:50 PM Benedict Jin 
> wrote:
>
> >
> >
> > On 2019/01/04 21:06:40, Gian Merlino  wrote:
> > > It feels like 0.13.0 was just recently released, but it was branched
> off
> > > back in October, and it has almost been 3 months since then. How do we
> > feel
> > > about doing an 0.14 branch cut at the end of January (Thu Jan 31) -
> going
> > > back to the every 3 months cycle?
> > >
> > > For this release, based on the feedback we got from the Incubator vote
> > last
> > > time, we'll need to fix up the LICENSE and NOTICE issues that were
> > flagged
> > > but waved through for our first release. (Justin said he would have
> -1'd
> > > based on that if it was anything beyond a first release.)
> > > +1
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>


Re: Spark batch with Druid

2019-02-06 Thread Gian Merlino
Hey Rajiv,

There's an unofficial Druid/Spark adapter at:
https://github.com/metamx/druid-spark-batch. If you want to stick to
official things, then the best approach would be to use Spark to write data
to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
native batch ingestion. (Or even write it to Kafka using Spark Streaming
and ingest from Kafka into Druid using Druid's Kafka indexing service.)

On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani 
wrote:

> Is there a best practice for how to load data from druid to use in a spark
> batch job? I asked this question on the user alias but got no response
> hence reposting here.
>
>
>   *   Rajiv
>


Re: Spark batch with Druid

2019-02-06 Thread Gian Merlino
Ah, you're right. I misread the original question.

In that case, also try checking out:
https://github.com/implydata/druid-hadoop-inputformat, an unofficial Druid
InputFormat. Spark can use that to read Druid data into an RDD - check the
example in the README. It's also unofficial and, currently, unmaintained,
so you'd be taking on some maintenance effort if you want to use it.

On Wed, Feb 6, 2019 at 3:01 PM Julian Jaffe 
wrote:

> I think this question is going the other way (e.g. how to read data into
> Spark, as opposed to into Druid). For that, the quickest and dirtiest
> approach is probably to use Spark's json support to parse a Druid response.
> You may also be able to repurpose some code from
> https://github.com/SparklineData/spark-druid-olap, but I don't think
> there's any official guidance on this.
>
> On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino  wrote:
>
> > Hey Rajiv,
> >
> > There's an unofficial Druid/Spark adapter at:
> > https://github.com/metamx/druid-spark-batch. If you want to stick to
> > official things, then the best approach would be to use Spark to write
> data
> > to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
> > native batch ingestion. (Or even write it to Kafka using Spark Streaming
> > and ingest from Kafka into Druid using Druid's Kafka indexing service.)
> >
> > On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani
>  > >
> > wrote:
> >
> > > Is there a best practice for how to load data from druid to use in a
> > spark
> > > batch job? I asked this question on the user alias but got no response
> > > hence reposting here.
> > >
> > >
> > >   *   Rajiv
> > >
> >
>


Re: docker build

2019-02-08 Thread Gian Merlino
First off thanks a lot for your work here Don!!

I really do think, though, that we need to be careful about the inclusion
of the MySQL connector jar. ASF legal has been clear in the past that ASF
projects should not distribute it as part of binary convenience releases:
https://issues.apache.org/jira/browse/LEGAL-200. I think having the
Dockerfile in the repo is probably fine: in that case we are not
distributing the jar itself, just, essentially, a pointer to how to
download it. But if we start offering a prebuilt Docker image, it is less
clear to me if that is fine or not. In the interests of resolving this
question one way or the other, I opened a question asking about this
specific situation: https://issues.apache.org/jira/browse/LEGAL-437.

About Dylan's questions: my feeling is that we should go ahead and enable
automated pushes to Docker Hub, and provide some appropriate language
around what people should expect out of it. I don't think 'experimental' is
the right word, but we should be clear around exactly what contract we are
adhering to. Is it something people can expect to be published with each
release? Is it something that we are going to build and test as part of the
release process, or are we going to publish it via automation without any
testing? Is it something we expect people to use in production, or
something we only expect people to use for evaluation? If it is something
we expect people to use in production, do we expect them to use it in any
particular way? Will we be guaranteeing certain things (file layout, etc)
that provide a stable interface for people to build derived images from?

The path of least resistance to answering these questions is to say that
the Docker image is provided in the hopes that it is useful, but that it is
done via an automated build, without any pre-release testing, and without
any particular guarantees about the 'interface' it provides. If this is the
case then I would suggest putting it up on Docker Hub with an appropriate
disclaimer and not promoting it too much. (We might very well end up
pushing images every once in a while that don't work right, and it would
reflect poorly on the project to have those be prominently linked-to.) It
becomes easier to strengthen these guarantees if we add an automated test
suite that we can run before releases which verifies functionality and
interface adherence.

On Fri, Feb 8, 2019 at 7:14 AM Rajiv Mordani 
wrote:

> This is purely a packaging exercise. I don't see a reason to mark this as
> experimental.
>
> Rajiv.
> 
> From: Dylan Wylie 
> Sent: Friday, February 8, 2019 6:08:47 AM
> To: dev@druid.apache.org
> Subject: Re: docker build
>
> I believe all we have to do is submit a ticket to Apache's Infrastructure
> team, and then an automated process will update Docker Hub with images
> for each release.
>
> I guess there's two open questions I think we should reach a consensus on
> (others feel free to add more!).
>
> - Are we as a community happy to "support" an additional release artefact?
> I'm happy to try to incorporate this into my employer's testing
> infrastructure to help catch any regressions on future releases but that's
> just one data point on each release.
>
> - Along the same vein, do we follow the same process as we do with new
> features and mark this as experimental for some time?
>
> On Fri, 8 Feb 2019 at 13:25, Don Bowman  wrote:
>
> > Now that https://github.com/apache/incubator-druid/pull/6896 is merged
> > (thank you!)
> >
> > who can get this set to build into Dockerhub? Presumably automatically
> on a
> > 'tag' of the repo.
> >
> > Once that is done it is much more convenient for folks to use this tool.
> >
> > --don
> >
>


Re: docker build

2019-02-08 Thread Gian Merlino
I don't think anything is strictly needed from you at this point, but
things happen when people drive them, and participation in that effort
would help make sure it gets done. I think at this point the tasks on our
end are watching LEGAL-437 for advice (or making it moot by removing the
MySQL jar), asking Infra to set up automated builds once that is sorted
out, and building some kind of consensus around how we'll label and promote
the Docker images.
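
[Editor's note: as context for the MySQL-removal option raised in this thread (switching the image's default metadata store to PostgreSQL, as Don suggests below) — pointing Druid's metadata store at PostgreSQL is a configuration change along these lines; property names are from the Druid docs of this era, and the connection values are placeholders:]

```properties
druid.extensions.loadList=["postgresql-metadata-storage"]
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=changeme
```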

On Fri, Feb 8, 2019 at 12:13 PM Don Bowman  wrote:

> i'd be fine w/ removing the mysql, i'm using postgresql for the metadata.
> if this is the case we should consider reflecting postgres as the default
> metadata in the docs.
> however, i think this is mere aggregation under the gpl license, and the
> docker image tends to have other (e.g. bash) gpl code. druid's start
> scripts are all bash-specific as an example.
>
> I'm not clear if anything further is needed of me, i'm hoping to get an
> automated build going into dockerhub, and tagged w/ each release. i think
> this will help adoption.
>
>
>
> On Fri, 8 Feb 2019 at 14:22, Gian Merlino  wrote:
>
> > First off thanks a lot for your work here Don!!
> >
> > I really do think, though, that we need to be careful about the inclusion
> > of the MySQL connector jar. ASF legal has been clear in the past that ASF
> > projects should not distribute it as part of binary convenience releases:
> > https://issues.apache.org/jira/browse/LEGAL-200. I think having the
> > Dockerfile in the repo is probably fine: in that case we are not
> > distributing the jar itself, just, essentially, a pointer to how to
> > download it. But if we start offering a prebuilt Docker image, it is less
> > clear to me if that is fine or not. In the interests of resolving this
> > question one way or the other, I opened a question asking about this
> > specific situation: https://issues.apache.org/jira/browse/LEGAL-437.
> >
> > About Dylan's questions: my feeling is that we should go ahead and enable
> > automated pushes to Docker Hub, and provide some appropriate language
> > around what people should expect out of it. I don't think 'experimental'
> is
> > the right word, but we should be clear around exactly what contract we
> are
> > adhering to. Is it something people can expect to be published with each
> > release? Is it something that we are going to build and test as part of
> the
> > release process, or are we going to publish it via automation without any
> > testing? Is it something we expect people to use in production, or
> > something we only expect people to use for evaluation? If it is something
> > we expect people to use in production, do we expect them to use it in any
> > particular way? Will we be guaranteeing certain things (file layout, etc)
> > that provide a stable interface for people to build derived images from?
> >
> > The path of least resistance to answering these questions is to say that
> > the Docker image is provided in the hopes that it is useful, but that it
> is
> > done via an automated build, without any pre-release testing, and without
> > any particular guarantees about the 'interface' it provides. If this is
> the
> > case then I would suggest putting it up on Docker Hub with an appropriate
> > disclaimer and not promoting it too much. (We might very well end up
> > pushing images every once in a while that don't work right, and it would
> > reflect poorly on the project to have those be prominently linked-to.) It
> > becomes easier to strengthen these guarantees if we add an automated test
> > suite that we can run before releases which verifies functionality and
> > interface adherence.
> >
> > On Fri, Feb 8, 2019 at 7:14 AM Rajiv Mordani  >
> > wrote:
> >
> > > This is purely a packaging exercise. I don't see a reason to mark this
> as
> > > experimental.
> > >
> > > Rajiv.
> > > 
> > > From: Dylan Wylie 
> > > Sent: Friday, February 8, 2019 6:08:47 AM
> > > To: dev@druid.apache.org
> > > Subject: Re: docker build
> > >
> > > I believe all we have to do is submit a ticket to Apache's
> > > Infrastructure team and then we'll have an automated process that
> > > updates docker-hub with images relating to each release.
> > >
> > > I guess there are two open questions I think we should reach a
> > > consensus on (others feel free to add more!).
> > >
> > > - Are we as a community 

Re: Auto-closing old PRs

2019-02-11 Thread Gian Merlino
IMO it makes sense to keep PRs open if they have a milestone or have a
Security or Bug label. 60 days with no activity as a threshold sounds good
to me - it's a pretty long time.

On Mon, Feb 11, 2019 at 11:22 AM Jihoon Son  wrote:

> Hi Dylan, thank you for starting a discussion.
>
> I think this is a good idea. We currently have 159 open PRs, but many PRs
> have gone too stale. For example, the earliest PR was opened on Jan 26,
> 2016.
> I do believe that this would help us to focus on more active PRs and
> encourage more people to get involved in the review process.
>
> The policy for the timeline looks good to me. But, for milestone, we can
> assign it on any PRs and remove it later if it shouldn't block the release.
> (See
>
> https://lists.apache.org/thread.html/371ffb06447debb93eec01863802aab13a08a9c37356466e6750c007@%3Cdev.druid.apache.org%3E
> and
>
> https://lists.apache.org/thread.html/b9cd3aaf2d01801751f16ee0b2beb2cebc39e2a42160ffb268dc6918@%3Cdev.druid.apache.org%3E
> for the discussion of the milestone policy).
>
> I think we should exempt bug PRs from auto-closing rather than the ones
> assigned a milestone.
>
> Best,
> Jihoon
>
> On Mon, Feb 11, 2019 at 8:27 AM Dylan Wylie  wrote:
>
> > Hey folks,
> >
> > What are opinions on automatically closing old pull requests?
> >
> > There are a lot that are outdated and abandoned. I think some sort of
> > automated process will tidy away those that are truly abandoned while
> > highlighting those that aren't by encouraging their authors to poke
> > committers for review.
> >
> > I've taken Apache Beam's stalebot configuration and adjusted it slightly
> > here - https://github.com/apache/incubator-druid/pull/7031
> >
> > This will:
> > - Leave a comment and mark PRs as stale when they haven't had any
> activity
> > for 60 days.
> > - After a further 7 days of no activity the PR will be closed.
> > - Ignore any PR that has the label "Security" or a milestone assigned.
> >
> > I've left issues out for now but open to suggestions on the timelines for
> > those if we were to enact a similar process.
> >
> > Best regards,
> > Dylan
> >
>
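
The behavior described above maps onto a probot-stale configuration roughly like the following. This is a sketch reconstructed from the thread, not the actual file from PR #7031; key names follow the probot-stale app's documented options.

```yaml
# .github/stale.yml (sketch)
only: pulls            # apply to pull requests only, not issues
daysUntilStale: 60     # no activity for 60 days -> mark as stale
daysUntilClose: 7      # a further 7 days of silence -> close
exemptLabels:
  - Security
exemptMilestones: true # any PR with a milestone assigned is never touched
staleLabel: stale
markComment: >
  This pull request has been marked as stale due to 60 days of inactivity.
  It will be closed in 1 week if no further activity occurs.
closeComment: >
  This pull request has been closed due to inactivity.
```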


Re: Segment files for ITTwitterQueryTest and ITWikipediaQueryTest

2019-02-11 Thread Gian Merlino
The keys should be in the repo. I think the ones in
"integration-tests/docker/historical.conf" will work.

On Mon, Feb 11, 2019 at 2:19 PM Jihoon Son  wrote:

> Good question. I have always been curious about this too.
> Does anyone know about it?
>
> Jihoon
>
> On Mon, Feb 11, 2019 at 2:15 PM Atul Mohan 
> wrote:
>
> > Hi,
> > Presently the druid integration tests ITTwitterQueryTest and
> > ITWikipediaQueryTest look like they are using pre-ingested segments for
> > datasources *twitterstream* and *wikipedia_editstream* respectively. Is
> > there a way to access these segment files? I tried accessing it from S3
> > using the demo account credentials, but it gave me an AllAccessDisabled
> > exception.
> >
> > Thanks
> > --
> > Atul Mohan
> > 
> >
>


Re: Segment files for ITTwitterQueryTest and ITWikipediaQueryTest

2019-02-11 Thread Gian Merlino
Ah. I think they have Get but not List permissions. So you can retrieve
objects but you need to know what their paths are first. Check out the file
"integration-tests/docker/sample-data.sql" for the paths that the ITs use.

On Mon, Feb 11, 2019 at 2:33 PM Atul Mohan  wrote:

> The keys are valid but the account does not seem to have the required
> access. Attempting to access /shared/storage with this account gives the
> error:
> *An error occurred (AllAccessDisabled) when calling the ListObjectsV2
> operation: All access to this object has been disabled*
>
> Atul
>
> On Mon, Feb 11, 2019 at 4:25 PM Gian Merlino  wrote:
>
> > The keys should be in the repo. I think the ones in
> > "integration-tests/docker/historical.conf" will work.
> >
> > On Mon, Feb 11, 2019 at 2:19 PM Jihoon Son  wrote:
> >
> > > Good question. I have always been curious about this too.
> > > Does anyone know about it?
> > >
> > > Jihoon
> > >
> > > On Mon, Feb 11, 2019 at 2:15 PM Atul Mohan 
> > > wrote:
> > >
> > > > Hi,
> > > > Presently the druid integration tests ITTwitterQueryTest and
> > > > ITWikipediaQueryTest look like they are using pre-ingested segments for
> > > > datasources *twitterstream* and *wikipedia_editstream* respectively.
> Is
> > > > there a way to access these segment files? I tried accessing it from
> S3
> > > > using the demo account credentials, but it gave me an
> AllAccessDisabled
> > > > exception.
> > > >
> > > > Thanks
> > > > --
> > > > Atul Mohan
> > > > 
> > > >
> > >
> >
>
>
> --
> Atul Mohan
> 
>


Re: Druid Auto Field Type Detection

2019-02-11 Thread Gian Merlino
Yeah that's a good point. Maybe we should store some extra information
about what the type was in the original input.

On Sat, Jan 26, 2019 at 4:04 AM Furkan KAMACI 
wrote:

> Hi Gian,
>
> Same problem applies to null fields too. When the first record is null, it
> will not be possible to detect such a field's type.
>
> However, the problem is different in my case. You may have an ad-hoc field
> which is not defined at the beginning. Such a field should have a strict
> type, but one that is not known up front. In your example case, we may
> define such a field as Integer and throw an error or skip an entry which
> has a value of "foo", because the field was initialized as Integer. On the
> other hand, sending a datum as:
>
> field: 3
>
> and
>
> field: "3"
>
> may be treated differently. The second one could be String but the first
> one should be Integer.
>
> I think that Solr could be an example for us of such a schemaless mode.
> What do you think?
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Jan 25, 2019 at 8:56 PM Gian Merlino  wrote:
>
> > Hey Furkan,
> >
> > Right now when Druid detects dimensions (so called "schemaless" mode,
> what
> > you get when you have an empty dimensions list at ingestion time), it
> > assumes they are all strings. It would definitely be better if it did
> some
> > analysis on the incoming data and chose the most appropriate type. I
> think
> > the main consideration here is that Druid has to pick a type as soon as
> it
> > sees a new column, but it might not get it right just by looking at the
> > first record. Imagine some JSON data where you have a field that is the
> > number 3 for the first row Druid sees, but the string "foo" in the
> second.
> > The right type would be string, but Druid wouldn't know that when it gets
> > the first row.
> >
> > Maybe it would work to do some mechanism where auto-detected fields are
> > ingested as strings initially into IncrementalIndex, and then potentially
> > converted to a different type when written to disk.
> >
> > On Thu, Jan 10, 2019 at 12:43 AM Furkan KAMACI 
> > wrote:
> >
> > > Hi All,
> > >
> > > I can define auto type detection for timestamp as follows:
> > >
> > > "timestampSpec" : {
> > >  "format" : "auto",
> > >  "column" : "ts"
> > > }
> > >
> > > In a similar manner, I cannot detect the field type via parseSpec. I mean:
> > >
> > >
> > >
> >
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}
> > >
> > >
> > >
> >
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}
> > >
> > > Both properties-key1 and properties-key2 are indexed as String. I
> > > expect to index properties-key2 as Integer in Druid.
> > >
> > > So, is there any mechanism in Druid for auto field type detection of a
> > > newly created field? If not, I would like to implement such a feature.
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> >
>
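
The type-widening idea from this thread — pick a type from the first value seen, then fall back to string on conflict — can be sketched in a few lines of Python. This is illustrative only: the function names are invented, and it is not Druid's IncrementalIndex code.

```python
# Sketch of type widening for schemaless ingestion: start with the type
# suggested by the first non-null value seen for a column, and widen to
# string as soon as a conflicting value arrives.

def widen(current, value):
    """Return the narrowest type name that fits both the current column
    type and the new value. None means 'no values seen yet'."""
    observed = type(value).__name__ if value is not None else None
    if observed is None or current == observed:
        return current
    if current is None:
        return observed
    return "str"  # conflicting types: widen to string

def infer_column_types(rows):
    """Fold widen() over every (column, value) pair in a list of dicts."""
    types = {}
    for row in rows:
        for col, value in row.items():
            types[col] = widen(types.get(col), value)
    return types

# The example from the thread: 3 first, then "foo" -> string wins.
rows = [
    {"field": 3, "other": None},
    {"field": "foo", "other": 7},
]
print(infer_column_types(rows))  # {'field': 'str', 'other': 'int'}
```

A null first value simply leaves the column type undecided until a non-null value arrives, which addresses Furkan's point as well.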


Re: Off list major development

2019-02-12 Thread Gian Merlino
Does anyone have thoughts on the above suggestions?

On Fri, Feb 1, 2019 at 2:16 PM Gian Merlino  wrote:

> I think we should clarify the process too. Might I suggest,
>
> 1) Add a GitHub issue template with proposal headers and some description
> of what each section should be, so people can fill them in easily.
> 2) Suggest that for any change that would need a design review per
> http://druid.io/community/, the author also creates a proposal issue
> following that template. It can be very short if the change is simple. The
> design discussion should take place on the proposal issue, and the code
> review should take place on the PR. A +1 on either the issue or the PR
> would be considered a +1 for the design, while only a +1 on the PR would be
> considered a +1 for the code itself.
> 3) Update http://druid.io/community/ and our CONTRIBUTING.md with
> guidance about (2) and encouraging that the proposal issues are created
> early in the dev cycle.
>
> I am thinking of "suggest" rather than "require" in (2) so we can start
> slow and see how we like this process before making it mandatory.
>
> On Fri, Feb 1, 2019 at 2:22 AM Clint Wylie  wrote:
>
>> +1 for proposal template.
>>
>> Do we also need to clarify the process that goes along with the proposals?
>> (It seems clear to me we've reached consensus in wanting a proposal
>> process, but less clear if we have a clear picture of or have reached
>> consensus on the process itself). Things like when voting happens,
>> appropriate PR timing, voting period, announcements to dev list,
>> significance of silence (implicit +1 or -1?), etc. Even if just adapting
>> Kafka's I think it might be a good idea to lay it out in this thread.
>>
>> Beyond putting reference to this stuff in top level github readme and on
>> the website, is there anything more we should do to guide people that are
>> thinking about contributing to use the proposal process?
>>
>> On Thu, Jan 31, 2019 at 2:47 PM Jonathan Wei  wrote:
>>
>> > That structure sounds good:
>> > - expanding rejected alternatives to a broader rationale section sounds
>> > good
>> > - I like "operational impact" as suggested by Slim and Gian more than
>> the
>> > corresponding KIP terminology
>> > - Future work is a good addition
>> >
>> > For test plan, I don't have a very strong opinion on this, but I'm
>> thinking
>> > it could make sense as an optional section (if someone has one, that's
>> > cool, if not, that's cool too, but perhaps having it present in the
>> > template would encourage ppl to think about testing strategies early on
>> if
>> > they aren't already)
>> >
>> >
>> > On Thu, Jan 31, 2019 at 2:17 PM Jihoon Son 
>> wrote:
>> >
>> > > Thanks Gian.
>> > > The suggested template looks good to me.
>> > >
>> > > Jihoon
>> > >
>> > > On Thu, Jan 31, 2019 at 9:27 AM Gian Merlino  wrote:
>> > >
>> > > > If it's not clear - I am agreeing with Jihoon and Slim that a
>> separate
>> > > > "Rationale" section makes sense in addition to a couple other
>> suggested
>> > > > tweaks.
>> > > >
>> > > > On Wed, Jan 30, 2019 at 3:46 PM Gian Merlino 
>> wrote:
>> > > >
>> > > > > I think it'd also be nice to tweak a couple parts of the KIP
>> template
>> > > > > (Motivation; Public Interfaces; Proposed Changes; Compatibility,
>> > > > > Deprecation, and Migration Plan; Test Plan; Rejected
>> Alternatives). A
>> > > > > couple people have suggested adding a "Rationale" section, how
>> about
>> > > > adding
>> > > > > that and removing "Rejected alternatives" -- rolling them in
>> > together?
>> > > > And
>> > > > > dropping "test plan", since IMO that discussion can be deferred to
>> > the
>> > > PR
>> > > > > itself, when there is code ready. Finally, adding "future work",
>> > > > detailing
>> > > > > where this change might lead us.
>> > > > >
>> > > > > So in particular the template I am suggesting would be something
>> like
>> > > > this.
>> > > > >
>> > > > > 1) Motivation: A description of the problem.
>> > > > > 2) Proposed cha

Re: Dev sync

2019-02-13 Thread Gian Merlino
I personally join about 1 in 10 of them so, from that perspective, I feel
that I am getting what I need in terms of communication out of the lists
and github and don't need extra utility from the dev syncs. Even if we stop
doing them, meeting face to face is still nice, and I always like to see
people at meetups :)

On Tue, Feb 12, 2019 at 9:41 AM Charles Allen
 wrote:

> I am unable to host the dev sync this week.
>
> Is anyone finding utility out of these? The dev list seems pretty active
> these days, so the legacy utility of the dev sync is very muted (this is a
> good thing). Unless people are finding specific utility out of a weekly
> video sync up, I propose it be postponed indefinitely until a need can be
> identified.
>
> Thoughts?
>


Re: Spark batch with Druid

2019-02-13 Thread Gian Merlino
I'd guess the majority of users are just using Druid itself to process
Druid data, although there are a few people out there that export it into
other systems using techniques like the above.

On Wed, Feb 13, 2019 at 2:00 PM Rajiv Mordani 
wrote:

> Am curious to know how people are generally processing data from druid. We
> want to be able to do Spark processing in a distributed fashion using
> DataFrames.
>
> - Rajiv
>
> On 2/11/19, 1:04 PM, "Julian Jaffe"  wrote:
>
> Spark can convert an RDD of JSON strings into an RDD/DataFrame/DataSet
> of
> objects parsed from the JSON (something like
> `sparkSession.read.json(jsonStringRDD)`). You could hook this up to a
> Druid
> response, but I would definitely recommend looking through the code
> that
> Gian posted instead - it reads data from deep storage instead of
> sending an
> HTTP request to the Druid cluster and waiting for the response.
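
As a hedged illustration of the "quick and dirty" route described above: flatten a groupBy-style Druid response (a JSON array of rows whose payload sits under an "event" key) into one JSON string per row, which is the shape `sparkSession.read.json(jsonStringRDD)` accepts. The function name and sample response body are invented for the example; the Spark call itself is omitted.

```python
import json

def druid_groupby_to_json_lines(response_text):
    """Flatten a Druid groupBy response into one JSON string per row,
    lifting each row's "event" payload to the top level and carrying
    the row timestamp along."""
    rows = json.loads(response_text)
    out = []
    for row in rows:
        event = dict(row.get("event", {}))
        event["timestamp"] = row.get("timestamp")
        out.append(json.dumps(event))
    return out

# Hypothetical response body from a Druid broker:
response = '''[
  {"version": "v1", "timestamp": "2019-02-01T00:00:00Z",
   "event": {"page": "Foo", "edits": 12}},
  {"version": "v1", "timestamp": "2019-02-01T01:00:00Z",
   "event": {"page": "Bar", "edits": 3}}
]'''
lines = druid_groupby_to_json_lines(response)
print(lines[0])
# {"page": "Foo", "edits": 12, "timestamp": "2019-02-01T00:00:00Z"}
```

The resulting list of strings could then be parallelized into an RDD and handed to `sparkSession.read.json`, though as noted above, reading from deep storage avoids loading the whole response through the Druid cluster.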
>
> On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani
> 
> wrote:
>
> > Thanks Julian,
> > See some questions in-line:
> >
> > On 2/6/19, 3:01 PM, "Julian Jaffe" 
> wrote:
> >
> > I think this question is going the other way (e.g. how to read
> data
> > into
> > Spark, as opposed to into Druid). For that, the quickest and
> dirtiest
> > approach is probably to use Spark's json support to parse a Druid
> > response.
> >
> > [Rajiv] Can you please expand more here?
> >
> > You may also be able to repurpose some code from
> >
> >
> https://github.com/SparklineData/spark-druid-olap,
> > but I don't think
> > there's any official guidance on this.
> >
> >
> >
> > On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino 
> wrote:
> >
> > > Hey Rajiv,
> > >
> > > There's an unofficial Druid/Spark adapter at:
> > >
> >
> https://github.com/metamx/druid-spark-batch.
> > If you want to stick to
> > > official things, then the best approach would be to use Spark
> to
> > write data
> > > to HDFS or S3 and then ingest it into Druid using Druid's
> > Hadoop-based or
> > > native batch ingestion. (Or even write it to Kafka using Spark
> > Streaming
> > > and ingest from Kafka into Druid using Druid's Kafka indexing
> > service.)
> > >
> > > On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani
> >  > > >
> > > wrote:
> > >
> > > > Is there a best practice for how to load data from druid to
> use in
> > a
> > > spark
> > > > batch job? I asked this question on the user alias but got no
> > response
> > > > hence reposting here.
> > > >
> > > >
> > > >   *   Rajiv
> > > >
> > >
> >
> >
> >
>
>
>


Re: Make a regular issue template to de-emphasize "Proposals"

2019-02-18 Thread Gian Merlino
Sounds good to me. IMO it also would make sense to remove the license
header from the templates, and add them as a rat exclusion if needed, and
also exclude them from release tarballs. The header is ugly and I don't
think there is a strong need to label these files if we don't plan to
include them in releases.

On Sat, Feb 16, 2019 at 2:43 PM Roman Leventov  wrote:

> The current interface of creating an issue in Druid draws a lot of
> attention to "Proposals":
> [image: image.png]
> I think we should at least create a "regular issue" template and make it
> go higher in this list of templates to restore the balance.
>


Re: Knowledge sharing between Druid developers via technical talks

2019-02-18 Thread Gian Merlino
I am interested especially if the format is something live. An in-person
meetup with a recording distributed afterwards would be my preference, if
people are into that. Maybe something at one of the Druid meetups?

On Wed, Feb 13, 2019 at 8:38 PM Eyal Yurman
 wrote:

> Hi,
>
> This is something usually done within companies, but I think it is useful
> for any community, especially our community which is so distributed.
>
> I think it would be absolutely wonderful if we can find people willing to
> share their knowledge with other contributors in the form of a tech-talk.
> I.e. it would be very useful if someone could take a subject (Just for
> example, groupBy query) and present the high-level
> architecture/implementation.
>
> I know this requires significant effort, but I hope to convince you of the
> benefits it would provide to the Druid project:
> - Helping any newcomer being more effective, thus providing better
> contribution ROI against work effort.
> - Serving as a high-quality medium of communication within the group of
> committers, which would lead to more trust and understanding.
>
> Recording and uploading such sessions will make them Apache-Way compatible
> (Along with serving future viewers).
>
> So, anyone up to the challenge? :)
>
> Eyal.
>


Re: Use add support for using dropwizard metrics

2019-02-18 Thread Gian Merlino
The only caveat I could think of is whether/how it would integrate with the
existing metrics emitter system (Emitter, ServiceEmitter, LoggingEmitter, &
friends). I am not too familiar with Dropwizard so I don't have much to say
about how the integration could work.

On Thu, Feb 14, 2019 at 1:15 PM Lucilla Chalmer
 wrote:

> Hello all,
>
> I'd like to add support for using dropwizard metrics
>  and was wondering if there are any
> caveats or if anybody else is already working on this.
>
> --
> *Lucilla Chalmer*
> *Pinterest | Ads Onsite Reporting*
>


Re: docker build

2019-02-18 Thread Gian Merlino
A discussion is progressing on
https://issues.apache.org/jira/browse/LEGAL-437. It doesn't seem to have
got anywhere firm yet.

On Fri, Feb 8, 2019 at 12:23 PM Gian Merlino  wrote:

> I don't think anything is strictly needed from you at this point, but
> things happen when people drive them, and participation in that effort
> would help make sure it gets done. I think at this point the tasks on our
> end are watching LEGAL-437 for advice (or making it moot by removing the
> MySQL jar), asking Infra to set up automated builds once that is sorted
> out, and building some kind of consensus around how we'll label and promote
> the Docker images.
>
> On Fri, Feb 8, 2019 at 12:13 PM Don Bowman  wrote:
>
>> i'd be fine w/ removing the mysql, i'm using postgresql for the metadata.
>> if this is the case we should consider reflecting postgres as the default
>> metadata store in the docs.
>> however, i think this is mere aggregation under the gpl license, and the
>> docker image tends to have other (e.g. bash) gpl code. druid's start
>> scripts are all bash-specific as an example.
>>
>> I'm not clear if anything further is needed of me, i'm hoping to get an
>> automated build going into dockerhub, and tagged w/ each release. i think
>> this will help adoption.
>>
>>
>>
>> On Fri, 8 Feb 2019 at 14:22, Gian Merlino  wrote:
>>
>> > First off thanks a lot for your work here Don!!
>> >
>> > I really do think, though, that we need to be careful about the
>> inclusion
>> > of the MySQL connector jar. ASF legal has been clear in the past that
>> ASF
>> > projects should not distribute it as part of binary convenience
>> releases:
>> > https://issues.apache.org/jira/browse/LEGAL-200. I think having the
>> > Dockerfile in the repo is probably fine: in that case we are not
>> > distributing the jar itself, just, essentially, a pointer to how to
>> > download it. But if we start offering a prebuilt Docker image, it is
>> less
>> > clear to me if that is fine or not. In the interests of resolving this
>> > question one way or the other, I opened a question asking about this
>> > specific situation: https://issues.apache.org/jira/browse/LEGAL-437.
>> >
>> > About Dylan's questions: my feeling is that we should go ahead and
>> enable
>> > automated pushes to Docker Hub, and provide some appropriate language
>> > around what people should expect out of it. I don't think
>> 'experimental' is
>> > the right word, but we should be clear around exactly what contract we
>> are
>> > adhering to. Is it something people can expect to be published with each
>> > release? Is it something that we are going to build and test as part of
>> the
>> > release process, or are we going to publish it via automation without
>> any
>> > testing? Is it something we expect people to use in production, or
>> > something we only expect people to use for evaluation? If it is
>> something
>> > we expect people to use in production, do we expect them to use it in
>> any
>> > particular way? Will we be guaranteeing certain things (file layout,
>> etc)
>> > that provide a stable interface for people to build derived images from?
>> >
>> > The path of least resistance to answering these questions is to say that
>> > the Docker image is provided in the hopes that it is useful, but that
>> it is
>> > done via an automated build, without any pre-release testing, and
>> without
>> > any particular guarantees about the 'interface' it provides. If this is
>> the
>> > case then I would suggest putting it up on Docker Hub with an
>> appropriate
>> > disclaimer and not promoting it too much. (We might very well end up
>> > pushing images every once in a while that don't work right, and it would
>> > reflect poorly on the project to have those be prominently linked-to.)
>> It
>> > becomes easier to strengthen these guarantees if we add an automated
>> test
>> > suite that we can run before releases which verifies functionality and
>> > interface adherence.
>> >
>> > On Fri, Feb 8, 2019 at 7:14 AM Rajiv Mordani
>> 
>> > wrote:
>> >
>> > > This is purely a packaging exercise. I don't see a reason to mark
>> this as
>> > > experimental.
>> > >
>> > > Rajiv.
>> > > 
>> > > From: Dylan Wylie 
>> 
