[ANNOUNCE] - New Apache Drill Committer - Chris Westin

2016-12-01 Thread Jacques Nadeau
On behalf of the Apache Drill PMC, I am very pleased to announce that Chris
Westin has accepted the invitation to become a committer in the project.

Welcome Chris and thanks for your great contributions!


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: [DISCUSS] Apache Drill Version after 1.9.0, etc.

2016-11-29 Thread Jacques Nadeau
Hey Sudheesh,

Thanks for asking my opinion given my statements back in April. I
appreciate the thought but I prefer to defer to others who are more
actively contributing than myself.

With regards to (C): I ran numerous releases previously where we simply
forward ported any fixes that were wanted from a release branch back into
master. There is no formal requirement for a release commit to be on the
master branch. In general, once the release branch is started, I typically
suggest simply rolling forward the master branch to the next snapshot
version immediately. Note that different projects think about this
differently. As an example: on the Calcite project we typically lock the
master branch so that the release is in the master branch.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Nov 28, 2016 at 7:49 PM, Aman Sinha  wrote:

> (A) I am leaning to 1.10 for the reasons already mentioned in your email.
> (B) sounds good.
> (C) Does it matter if there are a few commits in the master branch already?
> What's the implication of just updating the pom files (not a force-push)?
>
> On Mon, Nov 28, 2016 at 3:25 PM, Sudheesh Katkam 
> wrote:
>
> > Hi all,
> >
> > -
> >
> > (A) I had asked the question about what the release version should be
> > after 1.9.0. Since this is part of the next release plan, a vote is
> > required based on the discussion. For approval, the vote requires a lazy
> > majority of active committers over 3 days.
> >
> > Here are some comments from that thread:
> >
> > Quoting Paul:
> >
> > > For release numbers, 1.10 (then 1.11, 1.12, …) seems like a good idea.
> > >
> > > At first it may seem odd to go to 1.10 from 1.9. Might people get
> > confused between 1.10 and 1.1.0? But, there is precedence. Tomcat’s
> latest
> > 7-series release is 7.0.72. Java is on 8u112. And so on.
> > >
> > > I like the idea of moving to 2.0 later when the team introduces a major
> > change, rather than by default just because the numbers roll around. For
> > example, Hadoop went to 2.x when YARN was introduced. Impala appears to
> > have moved to 2.0 when they added spill to disk for some (all?)
> operators.
> >
> >
> > Quoting Parth:
> >
> > > Specifically what did you want to discuss about the release number
> after
> > 1.9?  Ordinarily you would just go to 2.0. The only reason for holding
> off
> > on 2.0, that I can think of, is if you want to make breaking changes in
> the
> > 2.0 release and those are not going to be ready for the next release
> cycle.
> > Are any devs planning on such breaking changes? If so, we should discuss
> > that (or any other reason we might have for deferring 2.0) in a separate
> > thread?
> > > I'm +0 on any version number we choose.
> >
> >
> > I am +1 on Paul’s suggestion for 1.10.0, unless, as Parth noted, we plan
> > to make breaking changes in the next release cycle.
> >
> > @Jacques, any comments? You had mentioned this a while back [1].
> >
> > -
> >
> > (B) Until discussion on (A) is complete, which may take a while, I
> propose
> > we move the master to 1.10.0-SNAPSHOT to unblock committing to master
> > branch. If there are no objections, I will do this tomorrow, once 1.9.0
> > release artifacts are propagated.
> >
> > -
> >
> > (C) I noticed there are some changes committed to master branch before
> the
> > commit that moves to the next snapshot version. Did we face this issue in
> > the past? If so, how did we resolve the issue? Is 'force push' an option?
> >
> > -
> >
> > Thank you,
> > Sudheesh
> >
> > [1] http://mail-archives.apache.org/mod_mbox/drill-dev/201604.mbox/%
> > 3CCAJrw0OTiXLnmW25K0aQtsVmh3A4vxfwZzvHntxeYJjPdd-PnYQ%40mail.gmail.com
> %3E
>


Re: [RESULT] [VOTE] Release Apache Drill 1.9.0 RC1

2016-11-18 Thread Jacques Nadeau
It sounds like the issue is constrained only to JDBC then, despite my
previous concerns. It also isn't a regression. As such, I guess it
shouldn't really be a blocker to the release. When I first saw the trace, I
thought it was related to the new parallelization changes and was a
regression.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Nov 18, 2016 at 9:28 AM, Sudheesh Katkam 
wrote:

> Venki, could you please take a look, since you are most familiar with that
> piece of code? Or anyone else wants to take a look?
>
> The issue can be reproduced with a simple unit test. In
> TestJdbcPluginWithDerbyIT, add this test, and then run "mvn install" in the
> storage-jdbc sub-project.
>
> @Test // DRILL-4984
> public void limit0() throws Exception {
>   testNoResult("SELECT * FROM derby.DRILL_DERBY_TEST.PERSON LIMIT 0");
> }
>
> In the ticket, Holger suggested "adding a check for null in
> FindHardDistributionScans.java @line 55 before calling getDrillTable()".
> But that check may not be sufficient (I could be wrong) because the check
> does not indicate whether "contains" should be set to true or false. The
> call to unwrap() returns a different type of table (not DrillTable or
> DrillTranslatableTable), and that may need to be investigated.
>
> Thank you,
> Sudheesh
>
> > On Nov 17, 2016, at 10:09 PM, Jacques Nadeau  wrote:
> >
> > It might make sense for someone to look at this jira before rolling
> another
> > release: DRILL-4984
> >
> > The stacktrace looks like it might be an issue with the new hard
> > parallelization algorithm which could potentially influence all sources.
> It
> > might not have shown up in traditional regression tests if those always
> > have source/drillbit affinity (just a random guess).
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Nov 17, 2016 at 10:50 AM, Sudheesh Katkam 
> > wrote:
> >
> >> Hi all,
> >>
> >> I had not noticed that Gautam mentioned a potential bug. That is a
> >> -1 from me on the proposed candidate; the bug is a regression in
> behavior.
> >> I did not push the release artifacts until now, and the announcement is
> not
> >> out.
> >>
> >> The issue is that the query profile is not displayed past the point of
> >> failure (trying to show a changed string option). So I will propose
> another
> >> candidate once this issue is fixed [1, 2].
> >>
> >> In the mean time, please test the candidate for other regressions.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >> [1] https://issues.apache.org/jira/browse/DRILL-5047
> >> [2] https://github.com/apache/drill/pull/655
> >>
> >>> On Nov 16, 2016, at 7:15 PM, Sudheesh Katkam 
> >> wrote:
> >>>
> >>> The proposal passes!
> >>>
> >>> Final tally:
> >>>
> >>> 3 binding +1s
> >>> + Sudheesh
> >>> + Aman
> >>> + Parth
> >>>
> >>> 12 non-binding +1s
> >>> + Khurram
> >>> + Dechang
> >>> + Rahul
> >>> + Chunhui
> >>> + Karthikeyan
> >>> + Robert
> >>> + Paul
> >>> + Krystal
> >>> + Sorabh
> >>> + Abhishek
> >>> + Kunal
> >>> + Gautam
> >>>
> >>> No 0s or -1s
> >>>
> >>> I'll push the release artifacts, and send an announcement once
> >> propagated. Thanks to everyone involved!
> >>>
> >>> Thank you,
> >>> Sudheesh
> >>>
> >>>> On Nov 16, 2016, at 6:23 PM, Gautam Parai 
> wrote:
> >>>>
> >>>> +1 (non-binding)
> >>>>
> >>>> Built from source on Linux VM and Mac.
> >>>> Ran unit tests.
> >>>> Ran new tests derived from bugs (Drill-4986/Drill-4771/Drill-
> >>>> 4792/Drill-4927)
> >>>> Ran some random queries
> >>>>
> >>>> Found a potential bug (NON-blocker) in Drill-4792.
> >>>>
> >>>> LGTM
> >>>>
> >>>> On Wed, Nov 16, 2016 at 5:52 PM, Kunal Khatua 
> >> wr

Re: [RESULT] [VOTE] Release Apache Drill 1.9.0 RC1

2016-11-17 Thread Jacques Nadeau
It might make sense for someone to look at this jira before rolling another
release: DRILL-4984

The stacktrace looks like it might be an issue with the new hard
parallelization algorithm which could potentially influence all sources. It
might not have shown up in traditional regression tests if those always
have source/drillbit affinity (just a random guess).

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Nov 17, 2016 at 10:50 AM, Sudheesh Katkam 
wrote:

> Hi all,
>
> I had not noticed that Gautam mentioned a potential bug. That is a
> -1 from me on the proposed candidate; the bug is a regression in behavior.
> I did not push the release artifacts until now, and the announcement is not
> out.
>
> The issue is that the query profile is not displayed past the point of
> failure (trying to show a changed string option). So I will propose another
> candidate once this issue is fixed [1, 2].
>
> In the mean time, please test the candidate for other regressions.
>
> Thank you,
> Sudheesh
>
> [1] https://issues.apache.org/jira/browse/DRILL-5047
> [2] https://github.com/apache/drill/pull/655
>
> > On Nov 16, 2016, at 7:15 PM, Sudheesh Katkam 
> wrote:
> >
> > The proposal passes!
> >
> > Final tally:
> >
> > 3 binding +1s
> > + Sudheesh
> > + Aman
> > + Parth
> >
> > 12 non-binding +1s
> > + Khurram
> > + Dechang
> > + Rahul
> > + Chunhui
> > + Karthikeyan
> > + Robert
> > + Paul
> > + Krystal
> > + Sorabh
> > + Abhishek
> > + Kunal
> > + Gautam
> >
> > No 0s or -1s
> >
> > I'll push the release artifacts, and send an announcement once
> propagated. Thanks to everyone involved!
> >
> > Thank you,
> > Sudheesh
> >
> >> On Nov 16, 2016, at 6:23 PM, Gautam Parai  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> Built from source on Linux VM and Mac.
> >> Ran unit tests.
> >> Ran new tests derived from bugs (Drill-4986/Drill-4771/Drill-
> >> 4792/Drill-4927)
> >> Ran some random queries
> >>
> >> Found a potential bug (NON-blocker) in Drill-4792.
> >>
> >> LGTM
> >>
> >> On Wed, Nov 16, 2016 at 5:52 PM, Kunal Khatua 
> wrote:
> >>
> >>> +1 (non-binding)
> >>>
> >>> Built from the GitHub repo and deployed on a 10-node setup.
> >>> Ran a bunch of queries and verified the profiles as well.
> >>>
> >>> LGTM.
> >>>
> >>>
> >>> On Wed 16-Nov-2016 3:41:03 PM, Abhishek Girish 
> wrote:
> >>> +1 (non-binding)
> >>>
> >>> Built from source. Ran Functional and Advanced tests from [1]. Sanity
> >>> tested Sqlline and Web UI. Looks good.
> >>>
> >>>
> >>> [1] https://github.com/mapr/drill-test-framework.git
> >>>
> >>>
> >>> On Wed, Nov 16, 2016 at 3:37 PM, Sorabh Hamirwasia
> >>>> wrote:
> >>>
> >>>> +1 (non-binding)
> >>>> Built from source and successfully ran unit tests.
> >>>> Ran both in embedded and distributed mode.
> >>>> Verified DRILL-4972 / DRILL-4964
> >>>> Ran some basic query on sys tables and sample data.
> >>>>
> >>>> Looks good.
> >>>>
> >>>>
> >>>> On Wed, Nov 16, 2016 at 2:49 PM, Krystal Nguyen
> >>>> wrote:
> >>>>
> >>>>> +1 (non-binding)
> >>>>> Built from source. Tested the WebUI including authentication. Tested
> >>>>> sqlline.
> >>>>>
> >>>>> On Wed, Nov 16, 2016 at 1:59 PM, Paul Rogers
> >>>> wrote:
> >>>>>
> >>>>>> +1 (non-binding)
> >>>>>> Built from source
> >>>>>> Ran script unit tests to verify config settings, etc.
> >>>>>>
> >>>>>> Looks good.
> >>>>>>
> >>>>>> - Paul
> >>>>>>
> >>>>>>> On Nov 16, 2016, at 1:46 PM, Robert Hou wrote:
> >>>>>>>
> >>>>>>> +1 (non-binding)
> >>>>>>>
> >>>>>>> Built from source.
> >>>>>>> Tested parquet filter pushdown.
> >>>>>>>

Re: [HANGOUT] Topics for 10/04/16

2016-10-05 Thread Jacques Nadeau
-user (since this is really a dev discussion)

Looking at DRILL-4280, I see the comment about C++ changes but don't see
them there or in the linked squash branch. Maybe I'm missing something
obvious?

The key performance benefits of the provided patch come from all the
other interrogations that Tableau (and other BI tools) make against the
ODBC metadata interfaces (not prepare). With the new driver, those items
can be answered directly (such as the list of schemas) rather than through
ODBC driver-generated queries. I've commonly seen situations where even a
simple Tableau workflow can take 7-10s even though the actual
"work" query (and associated limit 0 query) is quite short (<1s). This has
been especially problematic in situations where you're interacting with a
large number of sources, since those information schema queries are not
always well matched to what is actually needed.

With regards to benchmarks, we haven't done anything formal but you should
be able to easily do a comparison and see the improvements, especially when
dealing with larger catalogs.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 5, 2016 at 9:49 AM, Parth Chandra  wrote:

> Yup. I agree that we need to make sure that both clients are in sync. I
> believe DRILL-4280's PR refers to making changes in both APIs as well.
>
> Do you have a sense of how these changes give us a performance boost? As
> far as I can see, the APIs result in nearly the same code path being
> executed, with the difference being that the limit 0 query is now submitted
> by the server instead of the client.
>
> I don't know much about the tweaking of performance for various BI tools;
> is there something that Tableau et al. do differently? I don't see how, since
> the ODBC/JDBC interface remains the same. Just trying to understand
> this.
>
> Anyway, any performance gain is wonderful. Do you have any numbers to
> share?
>
>
> On Tue, Oct 4, 2016 at 10:29 AM, Jacques Nadeau 
> wrote:
>
> > Both the C++ and the JDBC changes are updates that leverage a number of
> > pre-existing APIs already on the server. In our initial evaluations, we
> > have already seen substantially improved BI tool performance with the
> > proposed changes (with no additional server-side changes). Are you seeing
> something
> > different? If you haven't yet looked at the changes in that light, I
> > suggest you do.
> >
> > If anything, I'm more concerned about client feature proposals that don't
> > cover both the C++ and Java client. For example, I think we should be
> > cautious about merging something like DRILL-4280. We should be cautious
> > about introducing new server APIs unless there is a concrete plan around
> > support in all clients.
> >
> > So I agree with the spirit of your ask: change proposals should be
> > "complete". However, I don't think it reasonably applies to the changes
> > proposed by Laurent. His changes "complete" the already introduced
> metadata
> > and prepare APIs the server exposes. They provide an improved BI user
> > experience. They also introduce unit tests in the C++ client, something
> > that was previously sorely missing.
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Oct 4, 2016 at 9:47 AM, Parth Chandra 
> > wrote:
> >
> > > Hi guys,
> > >
> > >   I won't be able to join the hangout but it would be good to discuss
> the
> > > plan for the related backend changes.
> > >
> > >   As I mentioned before I would like to see a concrete proposal for the
> > > backend that will accompany these changes. Without that, I feel there
> is
> > no
> > > point to adding so much new code.
> > >
> > > Thanks
> > >
> > > Parth
> > >
> > >
> > > On Mon, Oct 3, 2016 at 7:52 PM, Laurent Goujon 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm currently working on improving metadata support for both the JDBC
> > > > driver and the C++ connector, more specifically the following JIRAs:
> > > >
> > > > DRILL-4853: Update C++ protobuf source files
> > > > DRILL-4420: Server-side metadata and prepared-statement support for
> C++
> > > > connector
> > > > DRILL-4880: Support JDBC driver registration using ServiceLoader
> > > > DRILL-4925: Add tableType filter to GetTables metadata query
> > > > DRILL-4730: Update JDBC DatabaseMetaData implementation to use new

Re: [HANGOUT] Topics for 10/04/16

2016-10-04 Thread Jacques Nadeau
Both the C++ and the JDBC changes are updates that leverage a number of
pre-existing APIs already on the server. In our initial evaluations, we have
already seen substantially improved BI tool performance with the proposed
changes (with no additional server-side changes). Are you seeing something
different? If you haven't yet looked at the changes in that light, I
suggest you do.

If anything, I'm more concerned about client feature proposals that don't
cover both the C++ and Java client. For example, I think we should be
cautious about merging something like DRILL-4280. We should be cautious
about introducing new server APIs unless there is a concrete plan around
support in all clients.

So I agree with the spirit of your ask: change proposals should be
"complete". However, I don't think it reasonably applies to the changes
proposed by Laurent. His changes "complete" the already introduced metadata
and prepare APIs the server exposes. They provide an improved BI user
experience. They also introduce unit tests in the C++ client, something that
was previously sorely missing.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Oct 4, 2016 at 9:47 AM, Parth Chandra  wrote:

> Hi guys,
>
>   I won't be able to join the hangout but it would be good to discuss the
> plan for the related backend changes.
>
>   As I mentioned before I would like to see a concrete proposal for the
> backend that will accompany these changes. Without that, I feel there is no
> point to adding so much new code.
>
> Thanks
>
> Parth
>
>
> On Mon, Oct 3, 2016 at 7:52 PM, Laurent Goujon  wrote:
>
> > Hi,
> >
> > I'm currently working on improving metadata support for both the JDBC
> > driver and the C++ connector, more specifically the following JIRAs:
> >
> > DRILL-4853: Update C++ protobuf source files
> > DRILL-4420: Server-side metadata and prepared-statement support for C++
> > connector
> > DRILL-4880: Support JDBC driver registration using ServiceLoader
> > DRILL-4925: Add tableType filter to GetTables metadata query
> > DRILL-4730: Update JDBC DatabaseMetaData implementation to use new
> Metadata
> > APIs
> >
> > I  already opened multiple pull requests for those (the list is available
> > at https://github.com/apache/drill/pulls/laurentgo)
> >
> > I'm planning to join tomorrow hangout in case people have questions about
> > those.
> >
> > Cheers,
> >
> > Laurent
> >
> > On Mon, Oct 3, 2016 at 10:28 AM, Subbu Srinivasan <
> ssriniva...@zscaler.com
> > >
> > wrote:
> >
> > > Can we close on https://github.com/apache/drill/pull/518 ?
> > >
> > > On Mon, Oct 3, 2016 at 10:27 AM, Sudheesh Katkam 
> > > wrote:
> > >
> > > > Hi drillers,
> > > >
> > > > Our bi-weekly hangout is tomorrow (10/04/16, 10 AM PT). If you have
> any
> > > > suggestions for hangout topics, you can add them to this thread. We
> > will
> > > > also ask around at the beginning of the hangout for topics.
> > > >
> > > > Thank you,
> > > > Sudheesh
> > > >
> > >
> >
>


Re: [DISCUSS] Release cadence

2016-09-07 Thread Jacques Nadeau
+1 on the versioning scheme and the rest.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Sep 7, 2016 at 11:00 AM, Parth Chandra  wrote:

> Completely agree with you on allowing a release if the need is felt. The
> general release cadence would provide predictability, as you said, but we
> absolutely should be able to do releases with fixes if we need to.
> I would suggest we use a numbering of *major.minor* for the regular
> releases and a *major.minor.revision* for any release outside of that.
>
>
> On Wed, Sep 7, 2016 at 10:04 AM, Jacques Nadeau 
> wrote:
>
> > I'm +1 for communicating to the user community a particular expected
> > release cadence. It helps set expectations. I'm +0 on 3 months being what
> > is communicated.
> >
> > I'm -1 on this being a reason to vote down a release proposed by someone.
> > If a member of the PMC wants to start a release because they perceive a
> > need, they should be able to. A general release cadence is not a reason
> to
> > vote down a release.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Sep 6, 2016 at 5:48 PM, Parth Chandra  wrote:
> >
> > > As we discussed in the hangout today, based on the last few releases,
> it
> > > looks like a slightly longer time period between releases is probably
> > > called for. The 1.7 release was almost four months and folks had
> started
> > > asking questions about the release while the 1.8 release was done in
> much
> > > less time and we found quite a few show stopper issues at the last
> > minute.
> > > It seems that a three month cycle is probably appropriate at this time
> > > since that does not keep folks waiting for a new release and also
> > provides
> > > enough time for the team to test things thoroughly before a release.
> > >
> > > What does everyone think?
> > >
> > > Parth
> > >
> >
>


Re: [DISCUSS] Release cadence

2016-09-07 Thread Jacques Nadeau
I'm +1 for communicating to the user community a particular expected
release cadence. It helps set expectations. I'm +0 on 3 months being what
is communicated.

I'm -1 on this being a reason to vote down a release proposed by someone.
If a member of the PMC wants to start a release because they perceive a
need, they should be able to. A general release cadence is not a reason to
vote down a release.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Sep 6, 2016 at 5:48 PM, Parth Chandra  wrote:

> As we discussed in the hangout today, based on the last few releases, it
> looks like a slightly longer time period between releases is probably
> called for. The 1.7 release was almost four months and folks had started
> asking questions about the release while the 1.8 release was done in much
> less time and we found quite a few show stopper issues at the last minute.
> It seems that a three month cycle is probably appropriate at this time
> since that does not keep folks waiting for a new release and also provides
> enough time for the team to test things thoroughly before a release.
>
> What does everyone think?
>
> Parth
>


Re: [DISCUSS] - Design Docs

2016-09-07 Thread Jacques Nadeau
+1 for better design docs
-1 for using something disconnected from the mailing list

I think we need a way for the design doc to be more connected to the
mailing lists. Design docs in Google Docs are problematic because a bunch
of discussion happens independently of the list, which is contrary to the
Apache way. (Remember the old mantra: if it didn't happen on the list, it
didn't happen.) If a developer happens to miss the post announcing "here is
a new design doc", they will never realize a conversation is happening and
important design decisions are occurring. I think this has constantly been
a problem when we've tried using Google Docs in the past. I'm up for other
ideas. Some options: Google Docs where all comments and edits come back to
the dev list (possible? how?), using JIRA for design docs, markdown design
docs in a separate Drill branch shaped through GitHub comments, or a better
idea?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Sep 6, 2016 at 11:13 PM, Aman Sinha  wrote:

> +1 on using a design doc template for features that are of moderate or
> higher complexity.   Many of the sections are optional, so it should
> hopefully be considered 'lightweight' enough to encourage more people to
> adopt it.
>
> On Tue, Sep 6, 2016 at 10:22 PM, Khurram Faraaz 
> wrote:
>
> > Should we have the Document history table at the beginning of the
> document,
> > that way reviewers and readers of the design document will know if the
> > document has already gone through a few review cycles ?
> >
> > On Wed, Sep 7, 2016 at 7:28 AM, Gautam Parai 
> wrote:
> >
> > > Thanks so much for writing design documents for complex projects! They
> > are
> > > very helpful in learning about Drill Internals especially for new
> > > contributors like me - most recently Drill 4280.
> > >
> > > The design document template [2] looks good to me.
> > >
> > > For the reviews, I like Google Docs since it makes the document easy to
> > > share and review :)
> > >
> > > Gautam
> > >
> > >
> > > On Tue, Sep 6, 2016 at 5:49 PM, Parth Chandra 
> wrote:
> > >
> > > > We had a discussion on the dev list nearly a year ago about getting
> > > better
> > > > at documenting designs in Drill [1].  We were all mostly in agreement
> > > that
> > > > we should write better design documents and I just wanted to revisit
> > the
> > > > topic.
> > > >
> > > > Some of the more complex features being worked on recently,
> DRILL-4800
> > > and
> > > > DRILL-4820 to name a couple, have used a common format for the
> design,
> > > and
> > > > it has proven to be quite useful.
> > > >
> > > > I've put a basic template at [2].  Do folks have any comments about
> the
> > > > template? I would like to encourage folks working on complex features
> > to
> > > > use this as a guideline to writing design proposals and for reviewers
> > to
> > > > use while reviewing. I don't think every JIRA needs a design document
> > > > (sometimes the JIRA is enough), and I would leave it open for the
> > > > contributor to use whatever technology they feel comfortable with
> > > (provided
> > > > reviewers can comment  easily).
> > > >
> > > > What do people think? If everyone agrees I would like to provide a
> link
> > > to
> > > > this document from the Contribute to Drill page.
> > > >
> > > >
> > > > Parth
> > > >
> > > >
> > > > [1]
> > > > http://mail-archives.apache.org/mod_mbox/drill-dev/201510.
> > > > mbox/%3CCAAOiHjFDOZE%2Br2zmn%2BYWF%3DbKc4JAocVKGcvaCpfTj0gXdfxLUw
> > > > %40mail.gmail.com%3E
> > > > [2]
> > > > https://docs.google.com/document/d/1PnBiOMV5mYBi5N6fLci-
> > > > bRTva1gieCuxwlSYH9crMhU/edit?usp=sharing
> > > >
> > >
> >
>


Re: WIP for prepared statement/metadata querying in C++ native client

2016-08-24 Thread Jacques Nadeau
I'm not sure what unused code you are worried about. The backend already
implements these APIs and these changes would simply expose that interface
to the C++ consumers. I'd expect the ODBC driver to be updated to take
advantage of these APIs and thus all the code would be exercised all the
time. The main difference is: once we have the new client in user's hands,
we can iterate on performance improvements without having to upgrade the
client again in the future. Since this is an additive change, the existing
clients will continue to work without issue.

I think we already have some JIRAs about improving metadata performance and
BI tool experience. I'll try to dig them up. The thought was to start as
simple as possible, which is what the backend does now: do what the client
is doing but on the server. The next clear step for metadata preparation is
to do single parsing & planning (without the limit 0) so that in known
schema cases, we can avoid double parsing/planning (especially expensive in
partition pruning and metadata cache cases). Note: I'd expect that this
next version will still maintain the same behavior if the RowType is not
known on the initial DrillTable (where we don't have the limit 0 short
circuit today).

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Aug 23, 2016 at 6:30 PM, Parth Chandra  wrote:

> I would think that the Java client would be sufficient for experimentation.
> But what I'm looking for is an actual proposal for the backend changes. If
> we don't have one (not sure if there is a JIRA open for it), then we should
> start that now.
> Otherwise I'm afraid we will end up with a large amount of code
> that is not used. In particular, for the C++ client, I would like to avoid
> that.
>
>
>
>
> On Tue, Aug 23, 2016 at 6:09 PM, Jacques Nadeau 
> wrote:
>
> > The clear quick win would be caching parsing/planning/pruning on the
> server
> > and reusing (if executed within time t or until statement/connection are
> > closed).
> >
> > My thinking is to get an implementation of the client that is opaque to
> the
> > server implementation so that we can iterate on preparation without
> having
> > to constantly update the client. From there, developers can easily
> > experiment with different mechanisms to find what works best for BI
> tools.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Aug 23, 2016 at 1:54 PM, Laurent Goujon 
> > wrote:
> >
> > > I'm  currently focusing on the client work, and making sure the C++
> > client
> > > is not lagging behind the Java one. I personally haven't worked on
> > backend
> > > changes for prepared statements.
> > >
> > > Laurent
> > >
> > > On Mon, Aug 22, 2016 at 7:32 PM, Parth Chandra 
> > wrote:
> > >
> > > > Hi Laurent,
> > > >
> > > >   I'll take a look at this in the next few days.
> > > >
> > > >   On a related note, do you or Venki have a proposal for the backend
> > > > changes (i.e actual implementation of prepare)? It would be a good
> idea
> > > to
> > > > start a discussion on that.
> > > >
> > > > Parth
> > > >
> > > > On Mon, Aug 22, 2016 at 3:24 PM, Laurent Goujon 
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I just started working on adding support for prepared statements
> and
> > > > > metadata querying in the C++ Drill client. Hopefully, nobody else
> has
> > > > > started working on this (The Drill jiras don't mention any activity
> > on
> > > > > this), but if it is not the case, let me know.
> > > > >
> > > > > My working branch is
> > > > > https://github.com/laurentgo/drill/tree/laurent/improve-
> > native-client.
> > > > >
> > > > > For now, I just have a basic interface API (
> > > > > https://github.com/laurentgo/drill/commit/
> > > 1f55a3e631cd97016b113b9d4bca07
> > > > > b5e016a25e),
> > > > > and it would be nice if people knowledgeable about the C++ client
> > could
> > > > > review it and give me some feedback. I'm also adding an actual
> > initial
> > > > > implementation in the coming days.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Laurent
> > > > >
> > > >
> > >
> >
>


Re: WIP for prepared statement/metadata querying in C++ native client

2016-08-23 Thread Jacques Nadeau
The clear quick win would be caching parsing/planning/pruning on the server
and reusing (if executed within time t or until statement/connection are
closed).

My thinking is to get an implementation of the client that is opaque to the
server implementation so that we can iterate on preparation without having
to constantly update the client. From there, developers can easily
experiment with different mechanisms to find what works best for BI tools.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Aug 23, 2016 at 1:54 PM, Laurent Goujon  wrote:

> I'm  currently focusing on the client work, and making sure the C++ client
> is not lagging behind the Java one. I personally haven't worked on backend
> changes for prepared statements.
>
> Laurent
>
> On Mon, Aug 22, 2016 at 7:32 PM, Parth Chandra  wrote:
>
> > Hi Laurent,
> >
> >   I'll take a look at this in the next few days.
> >
> >   On a related note, do you or Venki have a proposal for the backend
> > changes (i.e actual implementation of prepare)? It would be a good idea
> to
> > start a discussion on that.
> >
> > Parth
> >
> > On Mon, Aug 22, 2016 at 3:24 PM, Laurent Goujon 
> > wrote:
> >
> > > Hi,
> > >
> > > I just started working on adding support for prepared statements and
> > > metadata querying in the C++ Drill client. Hopefully, nobody else has
> > > started working on this (The Drill jiras don't mention any activity on
> > > this), but if it is not the case, let me know.
> > >
> > > My working branch is
> > > https://github.com/laurentgo/drill/tree/laurent/improve-native-client.
> > >
> > > For now, I just have a basic interface API (
> > > https://github.com/laurentgo/drill/commit/
> 1f55a3e631cd97016b113b9d4bca07
> > > b5e016a25e),
> > > and it would be nice if people knowledgeable about the C++ client could
> > > review it and give me some feedback. I'm also adding an actual initial
> > > implementation in the coming days.
> > >
> > > Cheers,
> > >
> > > Laurent
> > >
> >
>


Re: [jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-07-26 Thread Jacques Nadeau
Sorry my first email wasn't clearer (and had missing words).

My question was: what is the maximum direct byte throughput of the
underlying filesystem you're reading against (when not cached)? Let's call
that the Optimal case. One way to measure this might be to do a parallel
hdfs dfs -cat "file" > /dev/null.

The second question is about kernel, user, and IO wait time per workload. So
we could get a snapshot something like this:

| Reader    | Transfer Rate | Kernel | User | IO |
| Drill 1.7 |               |        |      |    |
| Other     |               |        |      |    |
| Solo      |               |        |      |    |
| Optimal   |               |        |      |    |

If the specific kernel and user times are too difficult to collect (mostly in
the 1.7 and Other cases, probably), maybe just IO wait, CPU load, and total
test duration for a fixed workload for each would suffice?

Even if this isn't possible, that's lots of great stuff in what you put
together. Was just trying to understand the bounding box.

thanks,
Jacques



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Jul 25, 2016 at 3:17 PM, Parth Chandra 
wrote:

> Didn't quite catch your question there. But I do have the following numbers
> from the file system -
>
> |                        | AvgIOR OpSize (KB) | Estimated Ops/Disk |
> | Drill 1.7.0 - uncached | 239                | 103                |
> | Solo Uncached          | 240                | 281                |
>
> The numbers are approximate as these are captured by scripts on all the
> nodes and then averaged by another script.
>
> Solo is close to as fast as is possible from disk.
>
> Is that what you were looking for?
>


Re: [jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-07-25 Thread Jacques Nadeau
It would also be helpful to see raw fs performance for the same 11 nodes.
I'm worried that the read pattern is worse than it should be when not
cached, causing additional issues. If we know that the solo reader's 764 (or
whatever it was) is 90% of physical performance, that is very different than
if it is 60% of physical performance.

On Jul 25, 2016 10:59 AM, "Parth Chandra (JIRA)"  wrote:

>
> [
> https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392388#comment-15392388
> ]
>
> Parth Chandra commented on DRILL-4800:
> --
>
> Good point. I'll include that in the benchmarking phase after making the
> first set of changes.
>
> > Improve parquet reader performance
> > --
> >
> > Key: DRILL-4800
> > URL: https://issues.apache.org/jira/browse/DRILL-4800
> > Project: Apache Drill
> >  Issue Type: Improvement
> >Reporter: Parth Chandra
> >
> > Reported by a user in the field -
> > We're generally getting read speeds of about 100-150 MB/s/node on
> PARQUET scan operator. This seems a little low given the number of drives
> on the node - 24. We're looking for options we can improve the performance
> of this operator as most of our queries are I/O bound.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: Question about Fragment Stats start time

2016-06-16 Thread Jacques Nadeau
It has to do with fast schema. It doesn't matter when the start time is
recorded because all the fragments will execute right at the beginning of the
query and propagate a first "schema batch". You could possibly modify it to
be the start time of the second batch to get what you want. (This would
thus also make the Gantt chart in the UI useful again, as it was before we
added fast schema.)

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Jun 16, 2016 at 11:29 AM, Abdel Hakim Deneche  wrote:

> Hey all,
>
> In the query profile, fragment's start time is taken when the fragment is
> first initialized. For leaf fragments that's fine as they'll start running
> right away, but for intermediate/root fragments, a long time may pass before
> they effectively start running (submitted to the execution pool).
>
> Is there a specific reason we did it this way? Would it make sense to
> measure the start time when the fragment executor's run method is called for
> the very first time?
>
> Thanks
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>


Re: [GitHub] drill issue #507: DRILL-4690: CORS in REST API

2016-06-08 Thread Jacques Nadeau
FYI, since the JDBC driver doesn't include the webservice, the extra jar
should be able to be excluded with no ill effects.
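
For context on the "implemented the CORS manually" approach mentioned in the
quoted discussion below, a rough sketch using only the plain servlet API (no
extra Jetty jar) could look like the following. This is illustrative only;
the class name and header values are assumptions, not the actual Drill patch.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Minimal hand-rolled CORS filter: sets the response headers itself instead
// of relying on Jetty's CrossOriginFilter from the additional dependency.
public class SimpleCorsFilter implements Filter {

  @Override
  public void init(FilterConfig config) throws ServletException { }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletResponse httpResp = (HttpServletResponse) resp;
    httpResp.setHeader("Access-Control-Allow-Origin", "*");
    httpResp.setHeader("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
    httpResp.setHeader("Access-Control-Allow-Headers", "Content-Type");
    chain.doFilter(req, resp);
  }

  @Override
  public void destroy() { }
}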

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Jun 8, 2016 at 11:25 AM, Chunhui Shi  wrote:

> I think the way to avoid that size increase is to revert to the previous
> change, which implemented CORS manually and did not introduce this extra
> jar file. If our goal is to eat only one egg, we don't need to buy a hen.
>
> On Mon, Jun 6, 2016 at 2:36 PM, sudheeshkatkam  wrote:
>
> > Github user sudheeshkatkam commented on the issue:
> >
> > https://github.com/apache/drill/pull/507
> >
> > I am not familiar with CORS. One question: why is this enabled by
> > default?
> >
> > Also, there is a discussion about not increasing the size of the
> > jdbc-all jar (subject: _drill-jdbc-all-1.7.0-SNAPSHOT.jar max size_). Any
> > way to avoid that change?
> >
> >
> >
>


Re: drill-jdbc-all-1.7.0-SNAPSHOT.jar max size

2016-06-06 Thread Jacques Nadeau
I bet some thorough class cleansing could mean keeping this limit as opposed
to increasing it.

I suggest the JIRA instead be about reducing the current size by 2 MB. In
the past I've done this by expanding the archive and determining all the
large chunks of classes that shouldn't be included. Note that the current
filter lists at [1] and [2] need to be continuously updated. It looks like
neither has been updated since January.

[1] https://github.com/apache/drill/blob/master/exec/jdbc-all/pom.xml#L280
[2] https://github.com/apache/drill/blob/master/exec/jdbc-all/pom.xml#L386




--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Jun 6, 2016 at 6:52 AM, Arina Yelchiyeva  wrote:

> Hi all!
>
> Drill has an enforcer rule for the drill-jdbc-all-1.7.0-SNAPSHOT.jar max
> size. The max size is 20,000,000 bytes.
> Currently on master the jar size is 19,956,787 bytes, so only
> 43,213 bytes are left until the limit. I exceeded this limit just by
> adding a couple of new classes.
>
> I am going to create a JIRA to update this limit.
> Just wanted to know your opinion on the new max size. Will 30,000,000 be OK?
>
>
> Kind regards
> Arina
>


Re: Precedence of List and Map

2016-06-02 Thread Jacques Nadeau
Why do we need any precedence information for implementing new specific
type functions?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Jun 2, 2016 at 9:34 AM, Vitalii Diravka 
wrote:

> Thanks for the reply.
>
> It is necessary for implementing the count function on complex data types.
> That's why I'm interested only in "precedenceMap" now.
> I'm going to add simple cast rules for Map and List:
> rules.put(MinorType.MAP, Sets.newHashSet(MinorType.MAP));
> rules.put(MinorType.LIST, Sets.newHashSet(MinorType.LIST));
> since cast from any other type isn't supported now.
>
> I agree with placing Map and List at the end of "precedenceMap" and
> before the Union type.
> Does it matter whether Map or List comes first in that position?
>
>
>
> Kind regards
> Vitalii
>
> 2016-06-01 17:57 GMT+00:00 Aman Sinha :
>
> > What are the implicit casting rules for promoting a data type to a List
> or
> > Map ?  It seems to me the reverse mapping is more useful:  casting a List
> > or Map to a VARCHAR is possible, so for instance I can do a join between
> a
> > Map containing {x: 1, y: 2}  and a Varchar containing the exact same
> > string.  To handle this you would add the mapping to the
> > ResolverTypePrecedence.secondaryImplicitCastRules.
> >
> > If there is a valid promotion to List or Map in the precedenceMap, since
> > these are complex types I would think it belongs to the end just before
> the
> > UNION type (since Union is the superset).
> >
> > On Wed, Jun 1, 2016 at 9:24 AM, Vitalii Diravka <
> vitalii.dira...@gmail.com
> > >
> > wrote:
> >
> > > Hi all!
> > >
> > > I need to add List and Map data types into "precedenceMap" in the
> > > "ResolverTypePrecedence" class.
> > > And I am interested in the precedence values of these data types.
> > > What are your thoughts about it?
> > >
> > >
> > > You can see all current precedence map below.
> > >
> > >
> > > > precedenceMap = new HashMap<MinorType, Integer>();
> > > > precedenceMap.put(MinorType.NULL, i += 2);   // NULL is legal to
> > > > implicitly be promoted to any other type
> > > > precedenceMap.put(MinorType.FIXEDBINARY, i += 2); // Fixed-length is
> > > > promoted to var length
> > > > precedenceMap.put(MinorType.VARBINARY, i += 2);
> > > > precedenceMap.put(MinorType.FIXEDCHAR, i += 2);
> > > > precedenceMap.put(MinorType.VARCHAR, i += 2);
> > > > precedenceMap.put(MinorType.FIXED16CHAR, i += 2);
> > > > precedenceMap.put(MinorType.VAR16CHAR, i += 2);
> > > > precedenceMap.put(MinorType.BIT, i += 2);
> > > > precedenceMap.put(MinorType.TINYINT, i += 2);   //type with few bytes
> > is
> > > > promoted to type with more bytes ==> no data loss.
> > > > precedenceMap.put(MinorType.UINT1, i += 2); //signed is legal to
> > > > implicitly be promoted to unsigned.
> > > > precedenceMap.put(MinorType.SMALLINT, i += 2);
> > > > precedenceMap.put(MinorType.UINT2, i += 2);
> > > > precedenceMap.put(MinorType.INT, i += 2);
> > > > precedenceMap.put(MinorType.UINT4, i += 2);
> > > > precedenceMap.put(MinorType.BIGINT, i += 2);
> > > > precedenceMap.put(MinorType.UINT8, i += 2);
> > > > precedenceMap.put(MinorType.MONEY, i += 2);
> > > > precedenceMap.put(MinorType.FLOAT4, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL9, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL18, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL28DENSE, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL28SPARSE, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL38DENSE, i += 2);
> > > > precedenceMap.put(MinorType.DECIMAL38SPARSE, i += 2);
> > > > precedenceMap.put(MinorType.FLOAT8, i += 2);
> > > > precedenceMap.put(MinorType.DATE, i += 2);
> > > > precedenceMap.put(MinorType.TIMESTAMP, i += 2);
> > > > precedenceMap.put(MinorType.TIMETZ, i += 2);
> > > > precedenceMap.put(MinorType.TIMESTAMPTZ, i += 2);
> > > > precedenceMap.put(MinorType.TIME, i += 2);
> > > > precedenceMap.put(MinorType.INTERVALDAY, i+= 2);
> > > > precedenceMap.put(MinorType.INTERVALYEAR, i+= 2);
> > > > precedenceMap.put(MinorType.INTERVAL, i+= 2);
> > > > precedenceMap.put(MinorType.UNION, i += 2);
> > >
> > >
> > >
> > > Kind regards
> > > Vitalii
> > >
> >
>


Re: Hash Aggregate Memory usage

2016-05-27 Thread Jacques Nadeau
There was a presentation I gave a year or so ago at the MapR sales
kickoff that covers the memory characteristics of operators. Unfortunately,
I don't have access to the content, but hopefully someone internal to MapR
should have it. (Maybe Ellen or Neeraja.)

Approximately (from memory):

total hash aggregate size = entries * (links (4 bytes) + hash code (4 bytes)
    + aggregate key size + aggregate workspace variable size)
aggregate key size = fixed value size of all keys + variable value size of
    all keys
fixed value size = fixed-width size (e.g. 4 bytes for a four-byte int) +
    nullability (1 or 0 bytes)
variable value size = offset (4 bytes) + length of data + nullability (1 or
    0 bytes)
aggregate workspace variable size = value size of each aggregate function's
    workspace field

Note that entries is actually rounded to the nearest power of two.
Additionally, every vector is also rounded up to the nearest power of two
(this includes the key vectors, workspace vectors, links, and hash code
vectors).
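
As a back-of-envelope illustration of the sizing rules above, here is a small
sketch (not Drill code). The per-entry constants come straight from the
description above; the power-of-two rounding is applied only to the entry
count here, while real vectors are also rounded individually.

public class HashAggMemoryEstimate {

  static long roundUpToPowerOfTwo(long n) {
    long p = 1;
    while (p < n) {
      p <<= 1;
    }
    return p;
  }

  /**
   * @param entries         number of distinct groups
   * @param fixedKeyBytes   total fixed-width key bytes per entry
   * @param varKeyDataBytes average variable-width key data bytes per entry
   * @param nullableKeys    number of nullable key columns (1 byte each)
   * @param workspaceBytes  workspace bytes per entry across aggregate functions
   */
  static long estimate(long entries, long fixedKeyBytes, long varKeyDataBytes,
                       int nullableKeys, long workspaceBytes) {
    long roundedEntries = roundUpToPowerOfTwo(entries);
    long link = 4;                          // 4-byte link per entry
    long hashCode = 4;                      // 4-byte hash code per entry
    long varKeySize = 4 + varKeyDataBytes;  // 4-byte offset + data bytes
    long keySize = fixedKeyBytes + varKeySize + nullableKeys;
    return roundedEntries * (link + hashCode + keySize + workspaceBytes);
  }

  public static void main(String[] args) {
    // Example: 1M groups, one 4-byte int key, one ~10-byte varchar key,
    // no nullable keys, one 8-byte workspace value for a single max().
    System.out.println(estimate(1_000_000, 4, 10, 0, 8) + " bytes (approx.)");
  }
}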

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, May 27, 2016 at 11:21 AM, Aman Sinha  wrote:

> Rahul,  can you send me the query profile separately ?  Also, can you try
> group-by on fixed-width columns instead of Varchar ?
> With single group, the hash table itself should be consuming relatively
> small amount of memory.
>
> On Fri, May 27, 2016 at 11:14 AM, Zelaine Fong  wrote:
>
> > My guess would be that for hashing, a hash table is pre-allocated based
> on
> > the number of keys in the hash.  That would explain why with more keys,
> the
> > memory usage grows.  But that's just my guess.  Someone who really
> > understands how this works should chime in :).
> >
> > -- Zelaine
> >
> > On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> > > Any inputs on this one?
> > >
> > > On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > Its using hash aggregation.
> > > > On May 25, 2016 7:48 PM, "Zelaine Fong"  wrote:
> > > >
> > > >> What does the explain plan show?  I.e., is the group by being done
> > via a
> > > >> hash agg or a streaming agg?  If it's a streaming agg, then you
> still
> > > have
> > > >> to sort the entire data set before you reduce it down to a single
> > group.
> > > >> That would explain the increase in memory as you add group by keys.
> > > >>
> > > >> -- Zelaine
> > > >>
> > > >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> > > >> challapallira...@gmail.com> wrote:
> > > >>
> > > >> > I am trying to understand the memory usage patterns for hash
> > > aggregate.
> > > >> The
> > > >> > below query completes in 9.163 seconds and uses 24 MB of memory
> for
> > > >> > hash-aggregate (according to profile)
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by
> > > d.c1,
> > > >> > d.c2, d.c3, d.c4, d.c5;
> > > >> >
> > > >> > Adding one more constant column to the group by, the below query
> > takes
> > > >> > 11.638 seconds and uses 29 MB of ram
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from
> mem_heavy1) d
> > > >> group
> > > >> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> > > >> >
> > > >> > The below query with one more constant column added to group by
> > 14.622
> > > >> > seconds and uses 33 MB memory
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> > > >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> > > >> >
> > > >> >
> > > >> > As you can see, there is only one disctinct group in all the above
> > > >> cases.
> > > >> > It looks like the memory usage is proportional to no of elements
> in
> > > the
> > > >> > group by clause. Is this expected?
> > > >> >
> > > >> > Is the increase in time expected between the above queries? (As we
> > did
> > > >> not
> > > >> > introduce any new groups)
> > > >> >
> > > >> > - Rahul
> > > >> >
> > > >>
> > > >
> > >
> >
>


Re: Manta Object Store Support

2016-05-26 Thread Jacques Nadeau
It should "just work". Nothing in the logs?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, May 25, 2016 at 7:01 PM, Elijah Zupancic  wrote:

> Hi Tomer,
>
> Thanks for your advice about creating a Hadoop FileSystem implementation.
> I just finished a prototypical implementation of a Hadoop file system for
> Manta: https://github.com/dekobon/hadoop-manta <
> https://github.com/dekobon/hadoop-manta>
>
> I see the example for enabling S3 with Apache Drill and I’ve verified that
> it works. However, when I attempt to replicate the configuration of S3 for
> Manta, I’m unable to get the Hadoop FileSystem driver to load. I’ve
> verified that the FileSystem driver works in Hadoop by checking all of the
> hdfs dfs -* commands and I’ve got a fair bit of automated testing around it.
>
> What’s the magic to get it turned on with Drill? Do I need to do something
> to make the jar load other than copy it into jars/3rdparty? Right now, I’m
> just testing in drill-embedded for what it is worth.
>
> Thanks,
> Elijah Zupancic
>
> > On May 6, 2016, at 4:11 PM, Tomer Shiran  wrote:
> >
> > Does Manta have a Hadoop FileSystem API implementation? That's what Drill
> > uses for S3, HDFS, MapR-FS, Azure Blob Storage, etc. You could
> potentially
> > write a Drill storage plugin, but you get a lot for free if you already
> > have the file system implementation.
> >
> > On Fri, May 6, 2016 at 9:43 AM, Elijah Zupancic 
> wrote:
> >
> >> I'm trying to get started contributing to Apache Drill. I've got the
> >> project checked out and it is building to my satisfaction. Right now,
> I'm
> >> trying to add support for the open source object store Manta (
> >> https://github.com/joyent/manta). I thought that this would be a good
> >> learning project.
> >>
> >> Initially, I want to add support in the same way that S3 has support.
> >> However, I can't seem to find a reference to the S3 storage driver in
> the
> >> code base. Is the s3 storage driver part of a different project? How
> would
> >> you suggest that I get started?
> >>
> >> Thank you,
> >> Elijah Zupancic
> >>
>
>


[ANNOUNCE] New PMC Chair of Apache Drill

2016-05-25 Thread Jacques Nadeau
I'm pleased to announce that the Drill PMC has voted to elect Parth Chandra
as the new PMC chair of Apache Drill. Please join me in congratulating
Parth!

thanks,
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: UDF is not recognized by drill / Validation Error

2016-05-20 Thread Jacques Nadeau
Can you run jar tf myudf.jar against your jar files? Since Drill is not
detecting the jar file, we need to resolve that first. The
drill-module.conf must be in the root of each jar file that should be
included. Let's start by verifying that.
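
For reference, a complete version of the function discussed downthread might
look like the sketch below. The package and function name come from Julian's
mail; the holder field names and method bodies are illustrative assumptions.
The drill-module.conf with drill.classpath.scanning.packages += "org.julian"
must sit at the root of the packaged jar, as noted above.

package org.julian;

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.IntHolder;

public class MyUDF {

  @FunctionTemplate(name = "myaddints",
      scope = FunctionTemplate.FunctionScope.SIMPLE,
      nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
  public static class IntIntAdd implements DrillSimpleFunc {

    @Param IntHolder left;    // first input column
    @Param IntHolder right;   // second input column
    @Output IntHolder out;    // value returned to the query

    public void setup() {
      // no per-query state needed for a simple add
    }

    public void eval() {
      out.value = left.value + right.value;
    }
  }
}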

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, May 20, 2016 at 6:08 AM, Julian Feinauer 
wrote:

> Dear all,
>
> thank you very much for all your replies.
> I tried everything but it is still not working.
>
> 1. I copied both files (classes and sources) into the /jars/3rdparty directory
> 2. I restarted the drillbit after this (I use only one drillbit and
> drill-conf, both running on my local machine)
> 3. I changed the class to a static subclass
> 4. I have the drill-module.conf in my resources
> 5. The error appears in the drillbit.log as soon as I call the UDF because
> it is not recognized by Drill.
> On startup drillbit.log lists all the packages and jars that are scanned,
> and my custom jar is not listed there.
> Therefore I think it is a problem with the class loader or something
> related?
>
> Could this be possible?
>
> Greetings
> Julian
>
> > Am 20.05.2016 um 14:36 schrieb Tugdual Grall :
> >
> > Hi
> >
> > Be sure you deploy 2 jars on each node:
> > - the jar containing the classes
> > - the jar containing the sources
> >
> > The pom.xml in the simple function examples contains the Maven
> > configuration to generate these 2 files; be sure you have the same in
> your
> > project:
> >
> https://github.com/tgrall/drill-simple-mask-function/blob/master/pom.xml#L24-L37
> >
> > and you have restarted the drillbit
> >
> > Regards
> > Tug
> > @tgrall
> >
> >
> >
> > On Fri, May 20, 2016 at 12:34 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> >> wrote:
> >
> >> the example I gave you was incomplete, here is what I meant to send:
> >>
> >> public class MyUDF {
> >>
> >>   @FunctionTemplate(name = "myaddints", scope = FunctionTemplate.
> >>   FunctionScope.SIMPLE, nulls =
> >> FunctionTemplate.NullHandling.NULL_IF_NULL)
> >>   public *static *class IntIntAdd implements DrillSimpleFunc {
> >>  ...
> >>   }
> >>
> >> }
> >>
> >>
> >> On Thu, May 19, 2016 at 3:33 PM, Abdel Hakim Deneche <
> >> adene...@maprtech.com>
> >> wrote:
> >>
> >>> Hey Julian,
> >>>
> >>> one more thing you could try out: declare the UDF as a static class
> >> inside
> >>> another class:
> >>>
> >>> public class MyUDF {
> >>>
> >>>   @FunctionTemplate(name = "myaddints", scope = FunctionTemplate.
> >>>   FunctionScope.SIMPLE, nulls = FunctionTemplate.NullHandling.
> >>> NULL_IF_NULL)
> >>>   public class IntIntAdd implements DrillSimpleFunc {
> >>>  ...
> >>>   }
> >>>
> >>> }
> >>>
> >>> Take a look at the following page to see an examples of UDFs:
> >>> http://drill.apache.org/docs/custom-function-interfaces/
> >>>
> >>> If this doesn't work check the drillbit log, it should print an error
> >>> message when it's starting up if something's wrong with your UDF.
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Thu, May 19, 2016 at 3:31 AM, Julian Feinauer <
> julian.feina...@web.de
> >>>
> >>> wrote:
> >>>
> >>>> Dear folks,
> >>>>
> >>>> I’m currently experimenting with user defined functions in drill but
> I’m
> >>>> not able to get them to work on my drillbits.
> >>>> I always get the error: Error: VALIDATION ERROR: From line 1, column 8
> >> to
> >>>> line 1, column 41: No match found for function signature
> >> myaddints(,
> >>>> ).
> >>>>
> >>>> I already went through all the tips I found in the mailing list.
> >>>> The jar contains a drill-module.conf with the content:
> >>>> drill.classpath.scanning.packages += "org.julian"
> >>>> And the UDF is defined as:
> >>>> package org.julian;
> >>>>
> >>>> import ...
> >>>>
> >>>> @FunctionTemplate(name = "myaddints", scope =
> >>>> FunctionTemplate.FunctionScope.SIMPLE, nulls =
> >>>> FunctionTemplate.NullHandling.NULL_IF_NULL)

Re: Operator unit test framework merged

2016-04-20 Thread Jacques Nadeau
Great Jason, thanks for pulling this together!

Jacques
On Apr 20, 2016 9:24 AM, "Jason Altekruse"  wrote:

> Hello all,
>
> I finally got a chance to do some final minor fixes and merge the operator
> unit test framework I posted a while back; thanks again to Parth for doing a
> review on it. There are still some enhancements I would like to add to make
> the tests more flexible, but for examples of what can be done with the
> current version please check out the tests that were included with the
> patch [1]. Please don't hesitate to ask questions or suggest improvements.
> I think that writing tests in smaller units like this could go a long way
> in improving our coverage and ensure that we can write tests that
> consistently cover a particular execution path, independent of the query
> planner.
>
> For anyone looking to get more familiar with how Drill executes operations,
> these tests might be an easier way to start getting acquainted with
> the internals of Drill. The tests mock a number of the more complex parts
> of the system and try to produce a minimal environment where a single
> operation can run.
>
> [1] -
>
> https://github.com/apache/drill/blob/d93a3633815ed1c7efd6660eae62b7351a2c9739/exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/BasicPhysicalOpUnitTest.java
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>


[jira] [Resolved] (DRILL-4113) memory leak reported while handling query or shutting down

2016-04-19 Thread Jacques Nadeau (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau resolved DRILL-4113.
---
Resolution: Cannot Reproduce

> memory leak reported while handling query or shutting down
> --
>
> Key: DRILL-4113
> URL: https://issues.apache.org/jira/browse/DRILL-4113
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Chun Chang
>Priority: Critical
>
> With impersonation enabled, I've seen two memory leaks. One reported at query 
> time, one at shutdown.
> At query time:
> {noformat}
> 2015-11-17 19:11:03,595 [29b413b7-958e-c1f3-9d37-c34f96e7bf6a:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 29b413b7-958e-c1f3-9d37-c34f96e7bf6a: use `dfs.window_functions`
> 2015-11-17 19:11:03,666 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 29b413b7-edbc-9722-120d-66ab3611f250:0:0: State change requested 
> AWAITING_ALLOCATION --> RUNNING
> 2015-11-17 19:11:03,666 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] INFO  
> o.a.d.e.w.f.FragmentStatusReporter - 
> 29b413b7-edbc-9722-120d-66ab3611f250:0:0: State to report: RUNNING
> 2015-11-17 19:11:03,669 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 29b413b7-edbc-9722-120d-66ab3611f250:0:0: State change requested RUNNING --> 
> FAILED
> 2015-11-17 19:11:03,669 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 29b413b7-edbc-9722-120d-66ab3611f250:0:0: State change requested FAILED --> 
> FAILED
> 2015-11-17 19:11:03,669 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 29b413b7-edbc-9722-120d-66ab3611f250:0:0: State change requested FAILED --> 
> FINISHED
> 2015-11-17 19:11:03,674 [29b413b7-edbc-9722-120d-66ab3611f250:frag:0:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalStateException: 
> Failure while closing accountor.  Expected private and shared pools to be set 
> to initial values.  However, one or more were not.  Stats are
> zoneinitallocated   delta
> private 100 738112  261888
> shared  00  261888  -261888.
> Fragment 0:0
> [Error Id: 6df67be9-69d4-4a3b-9eae-43ab2404c6d3 on drillats1.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalStateException: Failure while closing accountor.  Expected private and 
> shared pools to be set to initial values.  However, one or more were not.  
> Stats are
> zoneinitallocated   delta
> private 100 738112  261888
> shared  00  261888  -261888.
> Fragment 0:0
> [Error Id: 6df67be9-69d4-4a3b-9eae-43ab2404c6d3 on drillats1.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_45]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_45]
> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> Caused by: java.lang.IllegalStateException: Failure while closing accountor.  
> Expected private and shared pools to be set to initial values.  However, one 
> or more were not.  Stats are
> zoneinitallocated   delta
> private 100 738112  261888
> shared  00  261888  -261888.
> at 
> org.apache.drill.exec.memory.AtomicRemainder.close(AtomicRemainder.java:199) 
> ~[drill-memory-impl-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.memory.AccountorImpl.close(AccountorImpl.java:365) 
> ~[drill-

Re: Hangout Anyone?

2016-04-19 Thread Jacques Nadeau
Quick one today:

Attendees:

Kumiko, Arrina, Jacques, Jason, Vitalli

Main topic of discussion was missing column behavior. Jason explained that
you can enable union type and use case statements to get exactly the
behavior you need.

thanks,
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 19, 2016 at 10:01 AM, Jacques Nadeau  wrote:

> https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>


Hangout Anyone?

2016-04-19 Thread Jacques Nadeau
https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Getting back on Calcite master: only a few steps left

2016-04-18 Thread Jacques Nadeau
Hey All,

Following up to get a status update. We made some good initial progress but
it seems like people may have hit some challenges (or distractions). Can
everyone report on how they are doing?

Jinfeng, how are tests for CALCITE-1150 going? Can Minji help get together
test cases for CALCITE-1150? Maybe you could provide guidance on the set of
queries to test?

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 31, 2016 at 4:19 PM, Julian Hyde  wrote:

> I’ve closed 1149, if we don’t need the feature.
>
> Yes, we need a unit test for 1151. I offered a suggestion how.
>
> > On Mar 31, 2016, at 11:59 AM, Sudheesh Katkam 
> wrote:
> >
> > I submitted a patch for CALCITE-1151 <
> https://issues.apache.org/jira/browse/CALCITE-1151> (with changes to
> resolve a checkstyle error). I am waiting for comments regarding the unit
> test.
> >
> > I added a comment to CALCITE-1149 <
> https://issues.apache.org/jira/browse/CALCITE-1149> with the workaround
> being used.
> >
> > Thank you,
> > Sudheesh
> >
> >> On Mar 16, 2016, at 5:19 PM, Jacques Nadeau  wrote:
> >>
> >> Yes, I'm trying to work through the failing unit tests.
> >>
> >> I merged your change.
> >>
> >> In the future you can pick compare & create pull request on your branch
> and
> >> then change the target repo from apache to mine.
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Wed, Mar 16, 2016 at 4:39 PM, Aman Sinha 
> wrote:
> >>
> >>> Jacques, I wasn't sure how to create a pull request against your
> branch;
> >>> for  CALCITE-1108 you can cherry-pick from here:
> >>>
> >>>
> https://github.com/amansinha100/incubator-calcite/commits/calcite-drill-2
> >>>
> >>> BTW,  there are unit test failures on your branch which I assume is
> >>> expected for now ?
> >>>
> >>> On Tue, Mar 15, 2016 at 6:56 PM, Jacques Nadeau 
> >>> wrote:
> >>>
> >>>> Why don't you guys propose patches for my branch and I'll incorporate
> >>> until
> >>>> we get to a good state. Once we feel good about it, I'll clean up the
> >>>> revision history.
> >>>>
> >>>> --
> >>>> Jacques Nadeau
> >>>> CTO and Co-Founder, Dremio
> >>>>
> >>>> On Tue, Mar 15, 2016 at 11:01 AM, Jinfeng Ni 
> >>>> wrote:
> >>>>
> >>>>> I'll add test for CALCITE-1150.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 15, 2016 at 9:45 AM, Sudheesh Katkam <
> skat...@maprtech.com
> >>>>
> >>>>> wrote:
> >>>>>> CALCITE-1149 [Extend CALCITE-845] <
> >>>>>
> >>>>
> >>>
> https://github.com/mapr/incubator-calcite/commit/bd73728a8297e15331ae956096eab0e15b3f
> >>>>>
> >>>>> does not need to be committed into Calcite. DRILL-4372 <
> >>>>> https://issues.apache.org/jira/browse/DRILL-4372> supersedes that
> >>> patch.
> >>>>>>
> >>>>>> I will add a test case for CALCITE-1151.
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Sudheesh
> >>>>>>
> >>>>>>> On Mar 15, 2016, at 9:04 AM, Aman Sinha 
> >>> wrote:
> >>>>>>>
> >>>>>>> I'll add a test for CALCITE-1108.   For 1105 I am not yet sure but
> >>>> will
> >>>>>>> look through the old drill commits to see what test was added
> there.
> >>>>>>>
> >>>>>>> On Sun, Mar 13, 2016 at 11:15 PM, Minji Kim 
> >>> wrote:
> >>>>>>>
> >>>>>>>> I will add more test cases to CALCITE-1148 in addition to the ones
> >>>>> already
> >>>>>>>> there.  I noticed a few more problems while testing the patch
> >>> against
> >>>>> drill
> >>>>>>>> master.  I am still working through these issues, so I will add
> >>> more
> >>>>> test
> >>>>>>>> cases as I find/fix them.  -Minji
> >>>>>>>>
> 

[jira] [Created] (DRILL-4612) Drill profiles no longer include verbose error

2016-04-17 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4612:
-

 Summary: Drill profiles no longer include verbose error
 Key: DRILL-4612
 URL: https://issues.apache.org/jira/browse/DRILL-4612
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jacques Nadeau
Assignee: Steven Phillips
Priority: Critical


It looks like something has broken in Drill and now the profiles no longer 
include the verbose error. This makes troubleshooting a remote user issue with 
only the profile more difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4608) Csv with Headers reader is not case insensitive

2016-04-14 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4608:
-

 Summary: Csv with Headers reader is not case insensitive
 Key: DRILL-4608
 URL: https://issues.apache.org/jira/browse/DRILL-4608
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Text & CSV
Reporter: Jacques Nadeau






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-13 Thread Jacques Nadeau
A number of the vector problems are an issue for the Java client but not
the C client since the C client doesn't support complex or union types yet.

Agree on your other points:

- We need to get to rolling upgrades. (Note that I suggest that we try to
get to "minor" compatibility for Drillbit <--> Drillbit by the end of the
2.x series in my notes)
- Also agree that we should always work to avoid changing any of the
interfaces described in the doc, no matter what the external commitment is.
The performance analog is a good one.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Apr 13, 2016 at 5:54 PM, Parth Chandra  wrote:

> Thanks for putting this doc together Jacques. This gives us a clear
> framework for discussion.
> Just as a clarification (I haven't yet been able to do more than glance at
> the doc), for 2.0, I was suggesting client-server compatibility not
> drillbit-drillbit compatibility. It seems some of the items you noted
> earlier (null lists, union vectors, etc.) may break drillbit-drillbit
> compatibility but not necessarily affect client-server compatibility. So we
> may be in agreement on some things here.
> In general, though, as the size of user's clusters grows, it will be
> required that we permit rolling upgrades. As I said in the hangout, it's
> like performance, we have to consider it at every instance; and take a
> decision to not support backward compatibility only after due
> consideration. At the moment, some of the functionality we are talking
> about might justify breaking drillbit-drillbit compatibility. Our design
> decisions for these implementations, though, must keep the requirement for
> future backward compatibility in mind.
> I'll add further comments in the JIRA.
>
> On Tue, Apr 12, 2016 at 6:47 PM, Jacques Nadeau 
> wrote:
>
> > A general policy shouldn't hold up a specific decision. Even after we
> > establish a guiding policy, there will be exceptions that we will
> consider.
> > I'm looking for concrete counterpoint to the cost of maintaining
> backwards
> > compatibility.
> >
> > That being said, I have put together an initial proposal of the
> > compatibility commitments we should make to the users. It is important to
> > note that my outline is about our public commitment. As a development
> > community, we should always work to avoid disruptive or backwards
> > incompatible changes on public APIs even if our public commitment
> > policy doesn't dictate it.
> >
> > The proposal is attached here:
> > https://issues.apache.org/jira/browse/DRILL-4600
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Apr 12, 2016 at 5:54 PM, Neeraja Rentachintala <
> > nrentachint...@maprtech.com> wrote:
> >
> > > Makes sense to postpone the debate : )
> > > Will Look forward for the proposal.
> > >
> > > On Tuesday, April 12, 2016, Zelaine Fong  wrote:
> > >
> > > > As we discussed at this morning's hangout, Jacques took the action to
> > put
> > > > together a strawman compatibility points document.  Would it be
> better
> > to
> > > > wait for that document before we debate this further?
> > > >
> > > > -- Zelaine
> > > >
> > > > On Tue, Apr 12, 2016 at 4:39 PM, Jacques Nadeau  > > > > wrote:
> > > >
> > > > > I agree with Paul, too. Perfect compatibility would be great. I
> > > recognize
> > > > > the issues that a version break could cause.  These are some of the
> > > > issues
> > > > > that I believe require a version break to address:
> > > > > - Support nulls in lists.
> > > > > - Distinguish null maps from empty maps.
> > > > > - Distinguish null arrays from empty arrays.
> > > > > - Support sparse maps (analogous to Parquet maps instead of our
> > current
> > > > > approach analogous to structs in Parquet lingo).
> > > > > - Clean up decimal and enable it by default.
> > > > > - Support full Avro <> Parquet roundtrip (and Parquet files
> generated
> > > by
> > > > > other tools).
> > > > > - Enable union type by default.
> > > > > - Improve execution performance of nullable values.
> > > > >
> > > > > I think these things need to be addressed in the 2.x line (let's
> say
> > > that
> > > > > is ~12 months). This is all about tradeoffs which is why I keep
> > asking
> &

Re: Drill on YARN

2016-04-13 Thread Jacques Nadeau
It sounds like Paul and John would both benefit from reviewing [1] & [2].

Drill has memory management, respects limits, and has a hierarchy of
allocators to do this. The framework for constraining certain operations,
fragments or queries all exists. (Note that this is entirely focused on
off-heap memory, in general Drill tries to avoid ever moving data on heap.)
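
As a rough illustration of that hierarchy (a sketch only; the factory and method names here are recalled from the allocator code referenced in [1] below and should be treated as assumptions rather than a verified API), each child allocator carves a reservation and limit out of its parent, and every allocation is accounted against the whole chain:

{code}
import org.apache.drill.common.config.DrillConfig;
import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.memory.RootAllocatorFactory;

import io.netty.buffer.DrillBuf;

public class AllocatorHierarchySketch {
  public static void main(String[] args) throws Exception {
    DrillConfig config = DrillConfig.create();

    // Root allocator owns the process-wide direct-memory budget.
    try (BufferAllocator root = RootAllocatorFactory.newRoot(config);
         // Children reserve/limit memory per fragment, then per operator.
         BufferAllocator fragment = root.newChildAllocator("fragment", 1_000_000, 10_000_000);
         BufferAllocator operator = fragment.newChildAllocator("operator", 0, 5_000_000)) {

      DrillBuf buf = operator.buffer(64 * 1024);  // counted against operator, fragment and root
      // ... use the buffer ...
      buf.release();  // must be released before the owning allocators close
    }
  }
}
{code}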

Workload management is another topic and there is an initial proposal out
on that for comment here: [2]

The parallelization algorithms don't currently support heterogeneous nodes.
I'd suggest that initial work be done on adding or removing same sized
nodes. A separate substantial effort would be involved in better lopsided
parallelization and workload decisions. (Let's get the basics right first.)

With regards to Paul's comments on 'inside Drill' threading, I think you're
jumping to some incorrect conclusions. There hasn't been any formal
proposals to change the threading model. There was a very short discussion
a month or two back where Hanifi said he'd throw out some prototype code
but nothing has been shared since. I suggest you assume the current
threading model until there is a consensus around something new.

[1]
https://github.com/apache/drill/blob/master/exec/memory/base/src/main/java/org/apache/drill/exec/memory/README.md
[2]
https://docs.google.com/document/d/1xK6CyxwzpEbOrjOdmkd9GXf37dVaf7z0BsvBNLgsZWs/edit





--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 28, 2016 at 8:43 AM, John Omernik  wrote:

> Great summary.  I'll fill in some "non-technical" explanations of some
> challenges with Memory as I see. Drill Devs, please keep Paul and I
> accurate in our understanding.
>
> First,  Memory is already set at the drillbit level... sorta.  It's set via
> ENV in drill-env, and is not a cluster specific thing. However, I believe
> there are some challenges that come into play when you have bits of
> different sizes. Drill "may" assume that bits are all the same size, and
> thus, if you run a query, depending on which bit is the foreman, and which
> fragments land where, the query may succeed or fail. That's not an ideal
> situation. I think for a holistic discussion on memory, we need to get some
> definitives around how Drill handles memory, especially different sized
> nodes, and what changes would need to be made for bits of different size to
> work well together on a production cluster.
>
> This discussion forms the basis of almost all work around memory
> management. If we can realistically only have bits of one size in it's
> current form, then static allocations are where we are going to be for the
> initial Yarn work. I love the idea of scaling up and down, but it will be
> difficult to scale an entire cluster worth of bits up and down, so
> heterogeneous resource allocations must be a prerequisite to dynamic
> allocation discussions (other than just adding and removing whole bits).
>
> Second, this also plays into the multiple drillbits per node discussion.
> If static sized bits are our only approach, then the initial reaction is to
> make them smaller so you have some granularity in scaling up and down.
> This may actually hurt a cluster.  Large queries may be challenged by
> trying to fit it's fragments on 3 nodes of say 8GB of direct RAM, but that
> query would run fine on bits of 24GB of direct RAM.  Drill Devs: Keep me
> honest here. I am going off of lots of participation in this memory/cpu
> discussions when I first started Drill/Marathon integration, and that is
> the feeling I got in talking to folks on and off list about memory
> management.
>
> This is a hard topic, but one that I am glad you are spearheading Paul,
>  because as we see more and more clusters get folded together, having a
> citizen that plays nice with others, and provides flexibility with regards
> to performance vs resource tradeoffs will be a huge selling/implementation
> point of any analytics tool.  If it's hard to implement and test at scale
> without dedicated hardware, it won't get a fair shake.
>
> John
>
>
> On Sun, Mar 27, 2016 at 3:25 PM, Paul Rogers  wrote:
>
> > Hi John,
> >
> > The other main topic of your discussion is memory management. Here we
> seem
> > to have 6 topics:
> >
> > 1. Setting the limits for Drill.
> > 2. Drill respects the limits.
> > 3. Drill lives within its memory “budget.”
> > 4. Drill throttles work based on available memory.
> > 5. Drill adapts memory usage to available memory.
> > 6. Some means to inform Drill of increases (or decreased) in memory
> > allocation.
> >
> > YARN, via container requests, solves the first problem. Someone (the
> > network admin) has to

[jira] [Created] (DRILL-4605) Flatten doesn't return nested arrays correctly when Union is enabled

2016-04-13 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4605:
-

 Summary: Flatten doesn't return nested arrays correctly when Union 
is enabled
 Key: DRILL-4605
 URL: https://issues.apache.org/jira/browse/DRILL-4605
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jacques Nadeau


File: 
{code}
{a:[[1,2,3], [4]]}
{code}

{code}
set `exec.enable_union_type` = false;
{code}

{code}
select flatten(a) as a from dfs.tmp.`blue.json`;
+--+
|a |
+--+
| [1,2,3]  |
| [4]  |
+--+
{code}

{code}
set `exec.enable_union_type` = true;
{code}

{code}
select flatten(a) as a from dfs.tmp.`blue.json`;
+---+
|   a   |
+---+
| null  |
| null  |
+---+
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-13 Thread Jacques Nadeau
For anyone following this thread, some of the people here reached out to me
privately to better detail some concerns that they don't feel comfortable
sharing publicly.

I'm working with them to come up with a sanitized way to share the specific
requirements that they are seeing so that we can come to a consensus on
what is the best thing to do for v2 on the list.

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 12, 2016 at 6:47 PM, Jacques Nadeau  wrote:

> A general policy shouldn't hold up a specific decision. Even after we
> establish a guiding policy, there will be exceptions that we will consider.
> I'm looking for concrete counterpoint to the cost of maintaining backwards
> compatibility.
>
> That being said, I have put together an initial proposal of the
> compatibility commitments we should make to the users. It is important to
> note that my outline is about our public commitment. As a development
> community, we should always work to avoid disruptive or backwards
> incompatible changes on public apis even if the our public commitment
> policy doesn't dictate it.
>
> The proposal is attached here:
> https://issues.apache.org/jira/browse/DRILL-4600
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Apr 12, 2016 at 5:54 PM, Neeraja Rentachintala <
> nrentachint...@maprtech.com> wrote:
>
>> Makes sense to postpone the debate : )
>> Will Look forward for the proposal.
>>
>> On Tuesday, April 12, 2016, Zelaine Fong  wrote:
>>
>> > As we discussed at this morning's hangout, Jacques took the action to
>> put
>> > together a strawman compatibility points document.  Would it be better
>> to
>> > wait for that document before we debate this further?
>> >
>> > -- Zelaine
>> >
>> > On Tue, Apr 12, 2016 at 4:39 PM, Jacques Nadeau > > > wrote:
>> >
>> > > I agree with Paul, too. Perfect compatibility would be great. I
>> recognize
>> > > the issues that a version break could cause.  These are some of the
>> > issues
>> > > that I believe require a version break to address:
>> > > - Support nulls in lists.
>> > > - Distinguish null maps from empty maps.
>> > > - Distinguish null arrays from empty arrays.
>> > > - Support sparse maps (analogous to Parquet maps instead of our
>> current
>> > > approach analogous to structs in Parquet lingo).
>> > > - Clean up decimal and enable it by default.
>> > > - Support full Avro <> Parquet roundtrip (and Parquet files generated
>> by
>> > > other tools).
>> > > - Enable union type by default.
>> > > - Improve execution performance of nullable values.
>> > >
>> > > I think these things need to be addressed in the 2.x line (let's say
>> that
>> > > is ~12 months). This is all about tradeoffs which is why I keep asking
>> > > people to provide concrete impact. If you think at least one of these
>> > > should be resolved, you're arguing for breaking wire compatibility
>> > between
>> > > 1.x and 2.x.
>> > >
>> > > So let's get concrete:
>> > >
>> > > - How many users are running multiple clusters and using a single
>> client
>> > to
>> > > connect them?
>> > > - What BI tools are most users using? What is the primary driver they
>> are
>> > > using?
>> > > - What BI tools are packaging a Drill driver? If any, what is the
>> update
>> > > process and lead time?
>> > > - How many users are skipping multiple Drill versions (e.g. going from
>> > 1.2
>> > > to 1.6)? (Beyond the MapR tick-tock pattern)
>> > > - How many users are delaying driver upgrade substantially? Are there
>> > > customers using the 1.0 driver?
>> > > - What is the average number of deployed clients per Drillbit cluster?
>> > >
>> > > These are some of the things that need to be evaluated to determine
>> > whether
>> > > we choose to implement a compatibility layer or simply make a full
>> break.
>> > > (And in reality, I'm not sure we have the resources to build and
>> carry a
>> > > complex compatibility layer for these changes.)
>> > >
>> > > Whatever the policy we agree upon for future commitments to the user
>> > base,
>> > > we're in a situation where there are very important reasons to move
>> the
>&g

[jira] [Resolved] (DRILL-3940) Make RecordBatch AutoCloseable

2016-04-13 Thread Jacques Nadeau (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau resolved DRILL-3940.
---
Resolution: Fixed
  Assignee: Jacques Nadeau  (was: Chris Westin)

RecordBatch wasn't made AutoCloseable. Instead, CloseableRecordBatch was
created, and it is managed by the framework so the interface doesn't leak into the
operators.
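
The resulting shape is presumably along these lines (a sketch for illustration, not the exact source):

{code}
// Operators continue to implement RecordBatch; only the framework sees the
// combined interface, so close() never leaks into operator implementations.
public interface CloseableRecordBatch extends RecordBatch, AutoCloseable {
}
{code}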

> Make RecordBatch AutoCloseable
> --
>
> Key: DRILL-3940
> URL: https://issues.apache.org/jira/browse/DRILL-3940
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.2.0
>Reporter: Chris Westin
>Assignee: Jacques Nadeau
>
> This made it easier to find RecordBatches that were not cleaned up (because 
> the compiler complains about AutoCloseable resources that haven't been 
> closed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Hangout starting at 10am

2016-04-12 Thread Jacques Nadeau
Note, per the first item above, I've put together an initial proposal of
the compatibility commitments we should make to the users. It is important
to note that my outline is about our public commitment. As a development
community, we should always work to avoid disruptive or backwards
incompatible changes on public APIs even if our public commitment
policy doesn't dictate it.

The proposal is attached here:
https://issues.apache.org/jira/browse/DRILL-4600



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 12, 2016 at 11:10 AM, Jacques Nadeau  wrote:

> Notes:
>
> Attendees: Paul, Parth, Zelaine, Jacques, Arrina, Vittali, Aman
>
> Main topics of discussion:
>
> Backwards compatibility. Everybody thinks that striving for backwards
> compatibility is good. However, we need to be formal about our goals as
> well as real costs to maintain. Jacques to put together and propose on the
> list a strawman of the various compatibility points in the product as well
> as what types of compatibility are critical, nice-to-have, low priority,
> etc.
>
> Arrow code patch merge. Some concerns were raised about the impact of
> relying on Arrow as an external project. The size and complexity of the
> patch was also challenging to consume. Jacques to work with Steven to make
> patch more consumable and then have a follow-on discussion around the
> purpose of various changes.
>
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Apr 12, 2016 at 9:50 AM, Jacques Nadeau 
> wrote:
>
>> https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>
>


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-12 Thread Jacques Nadeau
A general policy shouldn't hold up a specific decision. Even after we
establish a guiding policy, there will be exceptions that we will consider.
I'm looking for concrete counterpoint to the cost of maintaining backwards
compatibility.

That being said, I have put together an initial proposal of the
compatibility commitments we should make to the users. It is important to
note that my outline is about our public commitment. As a development
community, we should always work to avoid disruptive or backwards
incompatible changes on public APIs even if our public commitment
policy doesn't dictate it.

The proposal is attached here:
https://issues.apache.org/jira/browse/DRILL-4600


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 12, 2016 at 5:54 PM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Makes sense to postpone the debate : )
> Will Look forward for the proposal.
>
> On Tuesday, April 12, 2016, Zelaine Fong  wrote:
>
> > As we discussed at this morning's hangout, Jacques took the action to put
> > together a strawman compatibility points document.  Would it be better to
> > wait for that document before we debate this further?
> >
> > -- Zelaine
> >
> > On Tue, Apr 12, 2016 at 4:39 PM, Jacques Nadeau  > > wrote:
> >
> > > I agree with Paul, too. Perfect compatibility would be great. I
> recognize
> > > the issues that a version break could cause.  These are some of the
> > issues
> > > that I believe require a version break to address:
> > > - Support nulls in lists.
> > > - Distinguish null maps from empty maps.
> > > - Distinguish null arrays from empty arrays.
> > > - Support sparse maps (analogous to Parquet maps instead of our current
> > > approach analogous to structs in Parquet lingo).
> > > - Clean up decimal and enable it by default.
> > > - Support full Avro <> Parquet roundtrip (and Parquet files generated
> by
> > > other tools).
> > > - Enable union type by default.
> > > - Improve execution performance of nullable values.
> > >
> > > I think these things need to be addressed in the 2.x line (let's say
> that
> > > is ~12 months). This is all about tradeoffs which is why I keep asking
> > > people to provide concrete impact. If you think at least one of these
> > > should be resolved, you're arguing for breaking wire compatibility
> > between
> > > 1.x and 2.x.
> > >
> > > So let's get concrete:
> > >
> > > - How many users are running multiple clusters and using a single
> client
> > to
> > > connect them?
> > > - What BI tools are most users using? What is the primary driver they
> are
> > > using?
> > > - What BI tools are packaging a Drill driver? If any, what is the
> update
> > > process and lead time?
> > > - How many users are skipping multiple Drill versions (e.g. going from
> > 1.2
> > > to 1.6)? (Beyond the MapR tick-tock pattern)
> > > - How many users are delaying driver upgrade substantially? Are there
> > > customers using the 1.0 driver?
> > > - What is the average number of deployed clients per Drillbit cluster?
> > >
> > > These are some of the things that need to be evaluated to determine
> > whether
> > > we choose to implement a compatibility layer or simply make a full
> break.
> > > (And in reality, I'm not sure we have the resources to build and carry
> a
> > > complex compatibility layer for these changes.)
> > >
> > > Whatever the policy we agree upon for future commitments to the user
> > base,
> > > we're in a situation where there are very important reasons to move the
> > > codebase forward and change the wire protocol for 2.x.
> > >
> > > I think it is noble to strive towards backwards compatibility. We
> should
> > > always do this. However, I also think that--especially early in a
> > product's
> > > life--it is better to resolve technical debt issues and break a few
> eggs
> > > than defer and carry a bunch of extra code around.
> > >
> > > Yes, it can suck for users. Luckily, we should also be giving users a
> > bunch
> > > of positive reasons that it is worth upgrading and dealing with this
> > > version break. These include better perf, better compatibility with
> other
> > > tools, union type support, faster bi tool behaviors and a number of
> other
> > > things.
> > >
> > > I for one vote for moving forward and 

[jira] [Created] (DRILL-4600) Document Public Compatibility Commitments

2016-04-12 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4600:
-

 Summary: Document Public Compatibility Commitments
 Key: DRILL-4600
 URL: https://issues.apache.org/jira/browse/DRILL-4600
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Jacques Nadeau
Assignee: Jacques Nadeau






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-12 Thread Jacques Nadeau
I agree with Paul, too. Perfect compatibility would be great. I recognize
the issues that a version break could cause.  These are some of the issues
that I believe require a version break to address:
- Support nulls in lists.
- Distinguish null maps from empty maps.
- Distinguish null arrays from empty arrays.
- Support sparse maps (analogous to Parquet maps instead of our current
approach analogous to structs in Parquet lingo).
- Clean up decimal and enable it by default.
- Support full Avro <> Parquet roundtrip (and Parquet files generated by
other tools).
- Enable union type by default.
- Improve execution performance of nullable values.

I think these things need to be addressed in the 2.x line (let's say that
is ~12 months). This is all about tradeoffs which is why I keep asking
people to provide concrete impact. If you think at least one of these
should be resolved, you're arguing for breaking wire compatibility between
1.x and 2.x.

So let's get concrete:

- How many users are running multiple clusters and using a single client to
connect them?
- What BI tools are most users using? What is the primary driver they are
using?
- What BI tools are packaging a Drill driver? If any, what is the update
process and lead time?
- How many users are skipping multiple Drill versions (e.g. going from 1.2
to 1.6)? (Beyond the MapR tick-tock pattern)
- How many users are delaying driver upgrade substantially? Are there
customers using the 1.0 driver?
- What is the average number of deployed clients per Drillbit cluster?

These are some of the things that need to be evaluated to determine whether
we choose to implement a compatibility layer or simply make a full break.
(And in reality, I'm not sure we have the resources to build and carry a
complex compatibility layer for these changes.)

Whatever the policy we agree upon for future commitments to the user base,
we're in a situation where there are very important reasons to move the
codebase forward and change the wire protocol for 2.x.

I think it is noble to strive towards backwards compatibility. We should
always do this. However, I also think that--especially early in a product's
life--it is better to resolve technical debt issues and break a few eggs
than defer and carry a bunch of extra code around.

Yes, it can suck for users. Luckily, we should also be giving users a bunch
of positive reasons that it is worth upgrading and dealing with this
version break. These include better perf, better compatibility with other
tools, union type support, faster bi tool behaviors and a number of other
things.

I for one vote for moving forward and making sure that the 2.x branch is
the highest quality and best version of Drill yet rather than focusing on
minimizing the upgrade cost. All upgrades are a cost/benefit analysis.
Drill is too young to focus on only minimizing the cost. We should be
working to make sure the other part of the equation (benefit) is where
we're spending the vast majority of our time.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 12, 2016 at 3:38 PM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> I agree with Paul. Great points.
> I would also add the partners aspect to it. Majority of Drill users use it
> in conjunction with a BI tool.
>
>
> -Neeraja
>
> On Tue, Apr 12, 2016 at 3:34 PM, Paul Rogers  wrote:
>
> > Hi Jacques,
> >
> > My two cents…
> >
> > The unfortunate reality is that enterprise customers move slowly. There
> is
> > a delay in the time it takes for end users to upgrade to a new release.
> > When a third-party tool must also upgrade, the delay becomes even longer.
> >
> > At a high level, we need to provide a window of time in which old/new
> > clients work with old/new servers. I may have a 1.6 client. The cluster
> > upgrades to 1.8. I need time to upgrade my client to 1.8 — especially if
> I
> > have to wait for the vendor to provide a new package.
> >
> > If I connect to two clusters, I may upgrade my client to 1.8 for one, but
> > I still need to connect to 1.6 for the other if they upgrade on different
> > schedules.
> >
> > This is exactly why we need to figure out a policy: how do we give users
> a
> > sufficient window of time to complete upgrades, even across the 1.x/2.x
> > boundary?
> >
> > The cost of not providing such a window? Broken production systems,
> > unpleasant escalations and unhappy customers.
> >
> > Thanks,
> >
> > - Paul
> >
> > > On Apr 12, 2016, at 3:14 PM, Jacques Nadeau 
> wrote:
> > >
> > >>> What I am suggesting is that we need to maintain backward
> > compatibility with
> > > a defined set of 1.x version clients when Drill 2.0 version is out.
> &g

Re: Contributing to Drill?

2016-04-12 Thread Jacques Nadeau
Can a committer respond to this request?

thanks,
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Apr 7, 2016 at 9:07 AM, Zelaine Fong  wrote:

> A Drill user is eager to contribute a new XML plugin, and was planning to
> follow the instructions noted at
> https://drill.apache.org/docs/apache-drill-contribution-guidelines/.
> However, my understanding is we're no longer following that process.
>
> Can one of the dev commiters update that Web page with the latest
> guidelines?
>
> Thanks.
>
> -- Zelaine
>


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-12 Thread Jacques Nadeau
>>What I am suggesting is that we need to maintain backward compatibility with
a defined set of 1.x version clients when Drill 2.0 version is out.

I'm asking you to be concrete on why. There is definitely a cost to
maintaining this compatibility. What are the real costs if we don't?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Apr 6, 2016 at 9:21 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Jacques
> can you elaborate on what you mean by 'internal' implementation changes but
> maintain external API.
> I thought that changes that are being discussed here are the Drill client
> library changes.
> What I am suggesting is that we need to maintain backward compatibility
> with a defined set of 1.x version clients when Drill 2.0 version is out.
>
> Neeraja
>
> On Tue, Apr 5, 2016 at 12:12 PM, Jacques Nadeau 
> wrote:
>
> > Thanks for bringing this up. BI compatibility is super important.
> >
> > The discussions here are primarily about internal implementation changes
> as
> > opposed to external API changes. From a BI perspective, I think (hope)
> > everyone shares the goal of having zero (to minimal) changes in terms of
> > ODBC and JDBC behaviors in v2. The items outlined in DRILL-4417 are also
> > critical to strong BI adoption as numerous patterns right now are
> > suboptimal and we need to get them improved.
> >
> > In terms of your request of the community, it makes sense to have a
> > strategy around this. It sounds like you have a bunch of considerations
> > that should be weighed, but your presentation doesn't actually share
> > the concrete details. To date, there has been no formal consensus or
> > commitment to any particular compatibility behavior. We've had an
> informal
> > "don't change wire compatibility within a major version". If we are going
> > to have a rich dialog about pros and cons of different approaches, we
> need
> > to make sure that everybody has the same understanding of the dynamics.
> For
> > example:
> >
> > Are you saying that someone has packaged the Apache Drill drivers in
> their
> > BI solution? If so, what version? Is this the Apache release artifact or
> a
> > custom version? Has someone certified them? Did anyone commit a
> particular
> > compatibility pattern to a BI vendor on behalf of the community?
> >
> > To date, I'm not aware of any of these types of decisions being discussed
> > in the community so it is hard to evaluate how important they are versus
> > other things. Knowing that DRILL-4417 is outstanding and critical to the
> > best BI experience, I think we should be very cautious of requiring
> > long-term support of the existing (internal) implementation. Guaranteeing
> > ODBC and JDBC behaviors should be satisfactory for the vast majority of
> > situations. Anything beyond this needs to have a very public cost/benefit
> > tradeoff. In other words, please expose your thinking 100x more so that
> we
> > can all understand the ramifications of different strategies.
> >
> > thanks!
> > Jacques
> >
> >
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Apr 5, 2016 at 10:01 AM, Neeraja Rentachintala <
> > nrentachint...@maprtech.com> wrote:
> >
> > > Sorry for coming back to this thread late.
> > > I have some feedback on the compatibility aspects of 2.0.
> > >
> > > We are working with a variety of BI vendors to certify Drill and
> provide
> > > native connectors for Drill. Having native access from BI tools helps
> > with
> > > seamless experience for the users with performance and functionality.
> > This
> > > work is in progress and they are (and will be) working with 1.x
> versions
> > of
> > > Drill as part of the development because thats what we have now. Some
> of
> > > these connectors will be available before 2.0 and some of them can come
> > in
> > > post 2.0 as certification is a long process. We don't want to be in a
> > > situation where the native connectors are just released by certain BI
> > > vendor and the connector is immediately obsolete or doesn't work
> because
> > we
> > > have 2.0 release out now.
> > > So the general requirement should be that we maintain backward
> > > compatibility with certain number of prior releases. This is very
> > important
> > > for the success of the project and adoption by eco system. I am happy
> to
> > > discus

Re: Hangout starting at 10am

2016-04-12 Thread Jacques Nadeau
Notes:

Attendees: Paul, Parth, Zelaine, Jacques, Arrina, Vittali, Aman

Main topics of discussion:

Backwards compatibility. Everybody thinks that striving for backwards
compatibility is good. However, we need to be formal about our goals as
well as real costs to maintain. Jacques to put together and propose on the
list a strawman of the various compatibility points in the product as well
as what types of compatibility are critical, nice-to-have, low priority,
etc.

Arrow code patch merge. Some concerns were raised about the impact of
relying on Arrow as an external project. The size and complexity of the
patch was also challenging to consume. Jacques to work with Steven to make
patch more consumable and then have a follow-on discussion around the
purpose of various changes.




--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 12, 2016 at 9:50 AM, Jacques Nadeau  wrote:

> https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>


Hangout starting at 10am

2016-04-12 Thread Jacques Nadeau
https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: are random numbers broken?

2016-04-11 Thread Jacques Nadeau
There is already a function annotation for whether or not a function is
deterministic called "isRandom" [1] where true means that the function is
NOT deterministic. It sounds like either the random function is not
correctly annotated or the constant expression elimination isn't respecting
this flag.
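
For illustration, a correctly annotated non-deterministic function would look roughly like the sketch below (the class and holder names are placeholders, not Drill's actual random() source):

{code}
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;

// isRandom = true marks the function as NOT deterministic, so constant-expression
// elimination must not pre-evaluate it once and reuse the result.
@FunctionTemplate(name = "my_random",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    isRandom = true)
public class MyRandomFunction implements DrillSimpleFunc {

  @Workspace java.util.Random generator;
  @Output Float8Holder out;

  public void setup() {
    generator = new java.util.Random();
  }

  public void eval() {
    out.value = generator.nextDouble();
  }
}
{code}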

This sounds very familiar so I'm wondering if this came up in the last
couple months on the list.




--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Apr 11, 2016 at 2:36 PM, Chunhui Shi  wrote:

> The rand([seed optional]) is loaded from hive's UDF, the 'random()'
> function is drill's implementation.
>
> Since Drill has logic to reuse the previous function container, the
> previous result is reused. I would say this is a bug in the random generator.
> The fix should allow some functions not to hold the previous result, so
> each call of the function returns a new random value. Also, we need to
> decide whether we want to keep both 'rand' and 'random'.
>
> Could you open a bug for this?
>
> Thanks,
>
> Chunhui
>
>
>
>
> On Mon, Apr 11, 2016 at 9:13 AM, Ted Dunning 
> wrote:
>
> > I am trying to generate some random numbers. I have a large base file
> (foo)
> > this is what I get:
> >
> > 0: jdbc:drill:>  select floor(1000*random()) as x, floor(1000*random())
> as
> > y, floor(1000*rand()) as z from (select * from maprfs.tdunning.foo) a
> limit
> > 20;
> > ++++
> > |   x|   y|   z|
> > ++++
> > | 556.0  | 556.0  | 618.0  |
> > | 564.0  | 564.0  | 618.0  |
> > | 129.0  | 129.0  | 618.0  |
> > | 48.0   | 48.0   | 618.0  |
> > | 696.0  | 696.0  | 618.0  |
> > | 642.0  | 642.0  | 618.0  |
> > | 535.0  | 535.0  | 618.0  |
> > | 440.0  | 440.0  | 618.0  |
> > | 894.0  | 894.0  | 618.0  |
> > | 24.0   | 24.0   | 618.0  |
> > | 508.0  | 508.0  | 618.0  |
> > | 28.0   | 28.0   | 618.0  |
> > | 816.0  | 816.0  | 618.0  |
> > | 717.0  | 717.0  | 618.0  |
> > | 334.0  | 334.0  | 618.0  |
> > | 978.0  | 978.0  | 618.0  |
> > | 646.0  | 646.0  | 618.0  |
> > | 787.0  | 787.0  | 618.0  |
> > | 260.0  | 260.0  | 618.0  |
> > | 711.0  | 711.0  | 618.0  |
> > ++++
> >
> > On this page, https://drill.apache.org/docs/math-and-trig/, the rand
> > function is described and random() is not. But it appears that rand()
> > delivers a constant instead (although a different constant each time the
> > query is run) and it appears that random() delivers the same value when
> > used multiple times in each returned value.
> >
> > This seems very, very wrong.
> >
> > The fault does not seem to be related to my querying a table:
> >
> > 0: jdbc:drill:> select rand(), random(), random() from (values
> (1),(2),(3))
> > x;
> > +-+---+---+
> > |   EXPR$0|EXPR$1 |EXPR$2 |
> > +-+---+---+
> > | 0.1347749257216052  | 0.36724556209765014   | 0.36724556209765014   |
> > | 0.1347749257216052  | 0.006087161689924625  | 0.006087161689924625  |
> > | 0.1347749257216052  | 0.09417099142512142   | 0.09417099142512142   |
> > +-+---+---+
> >
> > For reference, postgres doesn't have rand() and does the right thing with
> > random().
> >
>


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-05 Thread Jacques Nadeau
Thanks for bringing this up. BI compatibility is super important.

The discussions here are primarily about internal implementation changes as
opposed to external API changes. From a BI perspective, I think (hope)
everyone shares the goal of having zero (to minimal) changes in terms of
ODBC and JDBC behaviors in v2. The items outlined in DRILL-4417 are also
critical to strong BI adoption as numerous patterns right now are
suboptimal and we need to get them improved.

In terms of your request of the community, it makes sense to have a
strategy around this. It sounds like you have a bunch of considerations
that should be weighed, but your presentation doesn't actually share
the concrete details. To date, there has been no formal consensus or
commitment to any particular compatibility behavior. We've had an informal
"don't change wire compatibility within a major version". If we are going
to have a rich dialog about pros and cons of different approaches, we need
to make sure that everybody has the same understanding of the dynamics. For
example:

Are you saying that someone has packaged the Apache Drill drivers in their
BI solution? If so, what version? Is this the Apache release artifact or a
custom version? Has someone certified them? Did anyone commit a particular
compatibility pattern to a BI vendor on behalf of the community?

To date, I'm not aware of any of these types of decisions being discussed
in the community so it is hard to evaluate how important they are versus
other things. Knowing that DRILL-4417 is outstanding and critical to the
best BI experience, I think we should be very cautious of requiring
long-term support of the existing (internal) implementation. Guaranteeing
ODBC and JDBC behaviors should be satisfactory for the vast majority of
situations. Anything beyond this needs to have a very public cost/benefit
tradeoff. In other words, please expose your thinking 100x more so that we
can all understand the ramifications of different strategies.

thanks!
Jacques





--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Apr 5, 2016 at 10:01 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Sorry for coming back to this thread late.
> I have some feedback on the compatibility aspects of 2.0.
>
> We are working with a variety of BI vendors to certify Drill and provide
> native connectors for Drill. Having native access from BI tools helps with
> seamless experience for the users with performance and functionality. This
> work is in progress and they are (and will be) working with 1.x versions of
> Drill as part of the development because that's what we have now. Some of
> these connectors will be available before 2.0 and some of them can come in
> post 2.0 as certification is a long process. We don't want to be in a
> situation where the native connectors are just released by certain BI
> vendor and the connector is immediately obsolete or doesn't work because we
> have 2.0 release out now.
> So the general requirement should be that we maintain backward
> compatibility with a certain number of prior releases. This is very important
> for the success of the project and adoption by the ecosystem. I am happy to
> discuss further.
>
> -Neeraja
>
> On Tue, Apr 5, 2016 at 8:44 AM, Jacques Nadeau  wrote:
>
> > I'm going to take this as lazy consensus. I'll create the branch.
> >
> > Once created, all merges to the master (1.x branch) should also go to the
> > v2 branch unless we have a discussion here that they aren't applicable.
> > When committing, please make sure to commit to both locations.
> >
> > thanks,
> > Jacques
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Sat, Mar 26, 2016 at 7:26 PM, Jacques Nadeau 
> > wrote:
> >
> > > Re Compatibility:
> > >
> > > I actually don't even think 1.0 clients work with 1.6 server, do they?
> > >
> > > I would probably decrease the cross-compatibility requirement burden. A
> > > nice goal would be cross compatibility across an extended series of
> > > releases. However, given all the things we've learned in the last year,
> > we
> > > shouldn't try to maintain more legacy than is necessary. As such, I
> > propose
> > > that we consider the requirement of 2.0 to be:
> > >
> > > 1.lastX works with 2.firstX. (For example, if 1.8 is the last minor
> > > release of the 1.x series, 1.8 would work with 2.0.)
> > >
> > > This simplifies testing (we don't have to worry about things like does
> > 1.1
> > > work with 2.3, etc) and gives people an upgrade path as they desire.
> This
> > > also allows us to deci

Re: Next Release

2016-04-05 Thread Jacques Nadeau
I'm going to rescind my offer to be the 1.7 release manager and move it to
the 2.0 release.

The features I was going to shepherd into the release have now moved to the
v2 release. I'd like to focus all my time on helping get the v2 features
merged and at parity with the 1.x release line (per other threads).

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 22, 2016 at 3:13 PM, Parth Chandra 
wrote:

> Wonderful !
> From my experience with 1.6, I was going to suggest we start the process of
> 'finalizing' the items that are a must have for the release a week before
> the end of the month. Otherwise the release takes an extra week.
>
> Parth
>
> On Tue, Mar 22, 2016 at 8:55 AM, Jacques Nadeau 
> wrote:
>
> > Hey All,
> >
> > I'd like to volunteer to be the 1.7 release manager. I'd also like to
> plan
> > putting together a target feature list for the release now so we can all
> > plan ahead. I'll share an initial stab at this later today if people
> think
> > that sounds good.
> >
> > Thanks
> > Jacques
> >
>


Re: Proposal: Create v2 branch to work on breaking changes

2016-04-05 Thread Jacques Nadeau
I'm going to take this as lazy consensus. I'll create the branch.

Once created, all merges to the master (1.x branch) should also go to the
v2 branch unless we have a discussion here that they aren't applicable.
When committing, please make sure to commit to both locations.

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sat, Mar 26, 2016 at 7:26 PM, Jacques Nadeau  wrote:

> Re Compatibility:
>
> I actually don't even think 1.0 clients work with 1.6 server, do they?
>
> I would probably decrease the cross-compatibility requirement burden. A
> nice goal would be cross compatibility across an extended series of
> releases. However, given all the things we've learned in the last year, we
> shouldn't try to maintain more legacy than is necessary. As such, I propose
> that we consider the requirement of 2.0 to be:
>
> 1.lastX works with 2.firstX. (For example, if 1.8 is the last minor
> release of the 1.x series, 1.8 would work with 2.0.)
>
> This simplifies testing (we don't have to worry about things like does 1.1
> work with 2.3, etc) and gives people an upgrade path as they desire. This
> also allows us to decide what pieces of the compatibility shim go in the
> 2.0 server versus the 1.lastX client. (I actually lean towards allowing a
> full break between v1 and v2 server/client but understand that that level
> of coordination is hard in many organizations since analysts are separate
> from IT). Hopefully, what I'm proposing can be a good compromise between
> progress and deployment ease.
>
> Thoughts?
>
> Re: Branches/Dangers
>
> Good points on this Julian.
>
> How about this:
>
> - small fixes and enhancements PRs should be made against v1
> - new feature PRs should be made against v2
> - v2 should continue to always pass all precommit tests during its life
> - v2 becomes master in two months
>
> I definitely don't want to create instability in the v2 branch.
>
> The other option I see is we can only do bug fix releases and branch the
> current master into a maintenance branch and treat master as v2.
>
> Other ideas?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Mar 26, 2016 at 6:07 PM, Julian Hyde  wrote:
>
>> Do you plan to be doing significant development on both the v1 and v2
>> branches, and if so, for how long? I have been bitten badly by that pattern
>> in the past. Developers put lots of unrelated, destabilizing changes into
>> v2, it took longer than expected to stabilize v2, product management lost
>> confidence in v2 and shifted resources back to v1, and v2 never caught up
>> with v1.
>>
>> One important question: Which branch will you ask people to target for
>> pull requests? v1, v2 or both? If they submit to v2, and v2 is broken, how
>> will you know whether the patches are good?
>>
>> My recommendation is to choose one of the following: (1) put a strict
>> time limit of say 2 months after which v2 would become the master branch
>> (and v1 master would become a maintenance branch), or (2) make v2 focused
>> on a particular architectural feature; create multiple independent feature
>> branches with breaking API changes if you need to.
>>
>> Julian
>>
>>
>> > On Mar 26, 2016, at 1:41 PM, Paul Rogers  wrote:
>> >
>> > Hi All,
>> >
>> > 2.0 is a good opportunity to enhance our ZK information. See
>> DRILL-4543: Advertise Drill-bit ports, status, capabilities in ZooKeeper.
>> This change will simplify YARN integration.
>> >
>> > This enhancement will change the “public API” in ZK. To Parth’s point,
>> we can do so in a way that old clients work - as long as a Drill-bit uses
>> default ports.
>> >
>> > I’ve marked this JIRA as a candidate for 2.0.
>> >
>> > Thanks,
>> >
>> > - Paul
>> >
>> >> On Mar 24, 2016, at 4:11 PM, Parth Chandra  wrote:
>> >>
>> >> What's our proposal for backward compatibility between 1.x and 2.x?
>> >> My thoughts:
>> >> Optional  -  Allow a mixture of 1.x and 2.x drillbits in a cluster.
>> >> Required - 1.x clients should be able to talk to 2.x drillbits.
>> >>
>> >>
>> >>
>> >> On Thu, Mar 24, 2016 at 8:55 AM, Jacques Nadeau 
>> wrote:
>> >>
>> >>> There are some changes that either have reviews pending or are in
>> progress
>> >>> that would require breaking changes to Drill core.
>> >>>
>> >>> Examples Include:
>> >>> DRILL-4455 (arrow integration)
>> >>> DRILL-4417 (jdbc/odbc/rpc changes)
>> >>> DRILL-4534 (improve null performance)
>> >>>
>> >>> I've created a new 2.0.0 release version in JIRA and moved these
>> tasks to
>> >>> that umbrella.
>> >>>
>> >>> I'd like to propose a new v2 release branch where we can start
>> >>> incorporating these changes without disrupting v1 stability and
>> >>> compatibility.
>> >>>
>> >>>
>> >>> --
>> >>> Jacques Nadeau
>> >>> CTO and Co-Founder, Dremio
>> >>>
>> >
>>
>>
>


Re: [DISCUSS] Nonblocking RPC

2016-04-01 Thread Jacques Nadeau
I think you're going to really have to break the encapsulation model to
accomplish this in the RPC layer.  What about updating the serialized
executor for those situations to do a resubmission rather than a blocking
operation? Basically, it seems like we want a two-phase termination:
request termination and then confirm termination. It seems like both should
be non-blocking.
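
A generic sketch of the resubmission idea (plain Java with illustrative names, not Drill's actual RPC or fragment classes): rather than awaiting uninterruptibly on the serialized executor, the cancellation handler requests termination and re-enqueues itself until the fragment can confirm.

{code}
import java.util.concurrent.ExecutorService;

// Illustrative only: FragmentHandle stands in for whatever tracks setup/termination state.
class CancelRequestHandler implements Runnable {

  interface FragmentHandle {
    boolean isReadyToTerminate();
    void requestTermination();   // phase 1: ask the fragment to stop
    void confirmTermination();   // phase 2: acknowledge that it has stopped
  }

  private final ExecutorService serializedExecutor;
  private final FragmentHandle fragment;

  CancelRequestHandler(ExecutorService serializedExecutor, FragmentHandle fragment) {
    this.serializedExecutor = serializedExecutor;
    this.fragment = fragment;
  }

  @Override
  public void run() {
    if (!fragment.isReadyToTerminate()) {
      // Request termination, then get out of the way: re-queue this handler so
      // other requests on the executor make progress instead of blocking behind us.
      fragment.requestTermination();
      serializedExecutor.submit(this);
      return;
    }
    // Setup has finished or been aborted; confirm termination without ever blocking.
    fragment.confirmTermination();
  }
}
{code}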

The other option is to rethink the model around termination. It might be
worth a hangout brainstorm to see if we can come up with ideas that are
more outside of the box.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Apr 1, 2016 at 2:28 PM, Sudheesh Katkam  wrote:

> Hey y’all,
>
> There are some blocking requests that could make an event loop *await
> uninterruptibly*. At this point, the Drillbit might seem unresponsive. This
> is worsened if the event loop is not unblocked (due to a bug), which
> requires a Drillbit restart. Although Drill supports *offloading from the
> event loop* (experimental), this is not sufficient as the thread handling
> the queue of requests would still block.
>
> AFAIK there are two such requests:
> + when the user cancels
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L1184
> >
> the query during planning
> + a fragment is canceled
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/FragmentExecutor.java#L150
> >
> or terminated
> <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/FragmentExecutor.java#L501
> >
> early during setup
>
> I think a simple solution would be to *re-queue *such requests (possible in
> above cases). That way other requests get their chance, and all requests
> would be eventually handled. Thoughts?
>
> Thank you,
> Sudheesh
>


Re: Continued Avro Frustration

2016-04-01 Thread Jacques Nadeau
Stefan,

It makes sense to me to mark the Avro plugin experimental. Clearly, there
are bugs. I also want to note your requirements and expectations haven't
always been in alignment with what the Avro plugin developers
built/envisioned (especially around schemas). As part of trying to address
these gaps, I'd like to ask again for you to provide actual data and test
cases so we can make sure that the Avro plugin includes those as future test
cases. (This is absolutely the best way to ensure that the project
continues to work for your use case.)

The bigger issue I see here is that you expect the community to spend time
doing what you want. You have already received a lot of that via free
support and numerous bug fixes by myself, Jason and others. You need to
remember: this community is run by a bunch of volunteers. Everybody here
has a day job. A lot of time I spend in the community is at the cost of my
personal life. For others, it is the same.

This is a good place to ask for help but you should never demand it. If you
want paid support, I know Ted offered this from MapR and I'm sure if you
went that route, your issues would get addressed very quickly. If you don't
want to go that route, then I suggest that you help by creating more
example data and test cases and focusing on the most important
issues that you need to solve. From there, you can continue to expect that
people will help you--as they can. There are no guarantees in open source.
Everything comes through the kindness and shared goals of those in the
community.

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Apr 1, 2016 at 5:43 AM, Stefán Baxter 
wrote:

> Hi,
>
> Is it at all possible that we are the only company trying to use Avro with
> Drill to some serious extent?
>
> We continue to come across all sorts of embarrassing shortcomings like the
> one we are dealing with now where a schema change exception is thrown even
> when working with a single Avro file (that has the same schema).
>
> Can a non-project member call for a discussion on this topic and the level
> of support that is offered for Avro in Drill?
>
> My discussion topics would be:
>
>- Strange schema validation that ... :
>... currently fails on a single file
>... prevents dirX variables from working
>... would require Drill to scan all Avro files to establish the schema (even
>when pruning would be used)
>... would ALWAYS fail for old queries if an old Avro file, containing
>the original fields, was removed and could not be scanned
>... does not square with the "eliminate ETL" and "Evolving Schema" goals
>of Drill
>
>- Simple union types do not work to declare nullable fields
>
>- Drill can not read Parquet that is created by parquet-mr-avro
>
>- What is the intention for Avro in Drill
>- Should we select some other format to buffer/batch data before
>creating a Parquet file for it?
>
>- The culture here regarding talking about boring/hard topics like this
>- Where serious complaints/issues are met with silence
>- I know full well that my frustration shines through here and that it is
>not helping, but this Drill+Avro mess is really getting too much for us to
>handle
>
> I look forward to discussing this here or during the next hangout.
>
> Regards,
>  -Stefán (or ... mr. old & frustrated)
>


Re: adding int96 datatype in Drill

2016-04-01 Thread Jacques Nadeau
It sounds like Hive doesn't work with the full Parquet spec. Why don't we
just fix that?

Let's add Hive handling so that an annotated INT64 date works.
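
For background, the INT96 timestamp that Hive and Impala expect is a 12-byte
value: 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day number.
A minimal decoding sketch of that assumed layout (my own illustration, not
Drill code):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int96Timestamp {
  static long toEpochMillis(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();      // first 8 bytes
    int julianDay = buf.getInt();         // last 4 bytes
    long epochDay = julianDay - 2440588L; // Julian day number of 1970-01-01
    return epochDay * 86_400_000L + nanosOfDay / 1_000_000L;
  }
}

Writing the type would just be the reverse encoding; the open question in this
thread is how to tell Drill to produce it.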

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Apr 1, 2016 at 4:33 AM, Arina Yelchiyeva  wrote:

> Hi all!
>
> There is a business requirement to read parquet files generated by Drill
> using Hive.
> Hive reads timestamps only if they are int96, while Drill writes timestamps
> as binary.
>
> I just wanted to know the community's opinion on adding an int96 datatype.
> For example, there could be a session property that tells Drill to
> write timestamps as int96.
>
> Any suggestions?
>
> Kind regards
> Arina
>


Re: REPEATED_CONTAINS

2016-03-29 Thread Jacques Nadeau
I think the best answer is to test it and share your findings.
Hypothesizing about performance in complicated systems is also suspect :)

That said, I'll make a guess...

In general, I would expect the flatten to be faster in your example since a
flatten without a cartesian is a trivial operation and can be done in a
vectorized fashion because of the shape of how data is held in memory. This
is different from how complex UDFs are written today (using the FieldReader
model). These UDFs are object-based execution, record by record. So the
comparison is vectorized, fully code-generated execution versus record-by-record
object traversal.

That being said, if you changed your code to be something more like [select
a,b,c,d,e,f,g, flatten(t.fillings) as fill], you might see the two be
closer together. This is because this would then require a cartesian copy
of all the fields abcdefg, which then have to be filtered out. In this
case, the extra cost of the copies might be more expensive than the object
overhead required for traversing the complex object structure.

In general, start with the methodology that works. If you don't see the
performance needed to satisfy your use case, we can see if we can suggest some
things. (For example, supporting operation pushdowns that push through
FLATTEN would probably be very helpful.)
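
As a plain-Java illustration (not Drill code) of the two shapes being compared
for one record such as {"name":"classic","fillings":[...]}:

import java.util.List;
import java.util.Map;

class FillingsExample {
  // UDF-style: walk the repeated field in place; no per-element row copies.
  static boolean anyFillingMatches(List<Map<String, Object>> fillings, String prefix) {
    for (Map<String, Object> filling : fillings) {
      if (((String) filling.get("name")).startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  // FLATTEN-style: materialize one element per array entry, then filter. With
  // extra projected columns, each materialized element also carries those copies.
  static long countMatchesAfterFlatten(List<Map<String, Object>> fillings, String prefix) {
    return fillings.stream()
        .map(f -> (String) f.get("name"))        // the flatten/copy step
        .filter(name -> name.startsWith(prefix))
        .count();
  }
}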



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 29, 2016 at 6:37 PM, Jean-Claude Cote  wrote:

> I've noticed drill offers a REPEATED_CONTAINS which can be applied to
> fields which are arrays.
>
> https://drill.apache.org/docs/repeated-contains/
>
> I have a schema stored in parquet files which contain a repeated field
> containing a key and a value. However such structures can't be queried
> using the REPEATED_CONTAINS. I was thinking of writing a user defined
> function to look through it.
>
> My question is: is it worth it? Will it be faster than doing this?
>
> {"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
> {"name":"flour","cal":300} ] }
>
> SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
> dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';
>
> Specifically what's the cost of using FLATTEN compared to iterating over
> the array right in a UDF?
>
> Thanks
> Jean-Claude
>


Re: Drill Hangout Starting

2016-03-29 Thread Jacques Nadeau
Just a quick call today.

Attendees:

Vitalii
Arina
Laurent
Jacques


Discussions around DESCRIBE SCHEMA & DESCRIBE TABLE: where should they live,
Calcite or Drill?

- Propose initially committing to Drill. Also open two bugs to move to
Calcite once DRILL-3993 is done.

Question around progress on DRILL-3993?

- Jacques and Arina to both bump the thread to figure out next steps.

Question: why does the Parquet plugin only read, not write, the INT96 type?
Should we add an INT96 type?

- INT96 is read so we can convert to Drill's internal date type when data is
produced by Impala. Write isn't implemented because there is no current way
to tell Drill to output data in that format (as there is no concept of a
96-bit integer inside Drill).
- Propose opening a discussion on the Drill mailing list if there is a desire
to add the type. Also a question of whether the real issue is that we need to
enhance Hive to support Parquet-defined timestamp types for consumption.
Jacques noted that extra types can be expensive and Drill probably needs to
deprecate types instead of adding types.

Question: Why does Drill have var16char?

- Vitalii is making changes to remove var16char from the Hive translation
- We should probably remove var16char in Drill v2

Short discussion around removing spill files; Vitalii to update the PR to clean
up earlier than JVM exit.


thanks everyone who attended!

Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 29, 2016 at 10:01 AM, Jacques Nadeau  wrote:

> https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>


Drill Hangout Starting

2016-03-29 Thread Jacques Nadeau
https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Proposal: Create v2 branch to work on breaking changes

2016-03-26 Thread Jacques Nadeau
Re Compatibility:

I actually don't even think 1.0 clients work with 1.6 server, do they?

I would probably decrease the cross-compatibility requirement burden. A
nice goal would be cross compatibility across an extended series of
releases. However, given all the things we've learned in the last year, we
shouldn't try to maintain more legacy than is necessary. As such, I propose
that we consider the requirement of 2.0 to be:

1.lastX works with 2.firstX. (For example, if 1.8 is the last minor release
of the 1.x series, 1.8 would work with 2.0.)

This simplifies testing (we don't have to worry about things like does 1.1
work with 2.3, etc) and gives people an upgrade path as they desire. This
also allows us to decide what pieces of the compatibility shim go in the
2.0 server versus the 1.lastX client. (I actually lean towards allowing a
full break between v1 and v2 server/client but understand that that level
of coordination is hard in many organizations since analysts are separate
from IT). Hopefully, what I'm proposing can be a good compromise between
progress and deployment ease.

Thoughts?

Re: Branches/Dangers

Good points on this Julian.

How about this:

- small fixes and enhancements PRs should be made against v1
- new feature PRs should be made against v2
- v2 should continue to always pass all precommit tests during its life
- v2 becomes master in two months

I definitely don't want to create instability in the v2 branch.

The other option I see is we can only do bug fix releases and branch the
current master into a maintenance branch and treat master as v2.

Other ideas?


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sat, Mar 26, 2016 at 6:07 PM, Julian Hyde  wrote:

> Do you plan to be doing significant development on both the v1 and v2
> branches, and if so, for how long? I have been bitten badly by that pattern
> in the past. Developers put lots of unrelated, destabilizing changes into
> v2, it look longer than expected to stabilize v2, product management lost
> confidence in v2 and shifted resources back to v1, and v2 never caught up
> with v1.
>
> One important question: Which branch will you ask people to target for
> pull requests? v1, v2 or both? If they submit to v2, and v2 is broken, how
> will you know whether the patches are good?
>
> My recommendation is to choose one of the following: (1) put a strict time
> limit of say 2 months after which v2 would become the master branch (and v1
> master would become a maintenance branch), or (2) make v2 focused on a
> particular architectural feature; create multiple independent feature
> branches with breaking API changes if you need to.
>
> Julian
>
>
> > On Mar 26, 2016, at 1:41 PM, Paul Rogers  wrote:
> >
> > Hi All,
> >
> > 2.0 is a good opportunity to enhance our ZK information. See DRILL-4543:
> Advertise Drill-bit ports, status, capabilities in ZooKeeper. This change
> will simplify YARN integration.
> >
> > This enhancement will change the “public API” in ZK. To Parth’s point,
> we can do so in a way that old clients work - as long as a Drill-bit uses
> default ports.
> >
> > I’ve marked this JIRA as a candidate for 2.0.
> >
> > Thanks,
> >
> > - Paul
> >
> >> On Mar 24, 2016, at 4:11 PM, Parth Chandra  wrote:
> >>
> >> What's our proposal for backward compatibility between 1.x and 2.x?
> >> My thoughts:
> >> Optional  -  Allow a mixture of 1.x and 2.x drillbits in a cluster.
> >> Required - 1.x clients should be able to talk to 2.x drillbits.
> >>
> >>
> >>
> >> On Thu, Mar 24, 2016 at 8:55 AM, Jacques Nadeau 
> wrote:
> >>
> >>> There are some changes that either have reviews pending or are in
> progress
> >>> that would require breaking changes to Drill core.
> >>>
> >>> Examples Include:
> >>> DRILL-4455 (arrow integration)
> >>> DRILL-4417 (jdbc/odbc/rpc changes)
> >>> DRILL-4534 (improve null performance)
> >>>
> >>> I've created a new 2.0.0 release version in JIRA and moved these tasks
> to
> >>> that umbrella.
> >>>
> >>> I'd like to propose a new v2 release branch where we can start
> >>> incorporating these changes without disrupting v1 stability and
> >>> compatibility.
> >>>
> >>>
> >>> --
> >>> Jacques Nadeau
> >>> CTO and Co-Founder, Dremio
> >>>
> >
>
>


epoll disconnections?

2016-03-25 Thread Jacques Nadeau
Hey All,

If I recall correctly, many months ago Sudheesh discovered that we were
having instability in RPC connections in some situations due to bugs in the
epoll implementation that are fixed in a later version of Netty (~4.0.31?).
At the time, we shelved switching Netty because it also changed the memory
caching behavior (same thread to all threads) which seemed like a high-risk
change. I thought that as part of this we decided the safest change was to
disable epoll RPC in our distribution. However, reviewing drill-env, it
doesn't look like we do this. See here [1].

Thoughts?

[1]
https://github.com/apache/drill/blob/master/distribution/src/resources/drill-env.sh#L19
--
Jacques Nadeau
CTO and Co-Founder, Dremio


[jira] [Created] (DRILL-4539) Add support for Null Equality Joins

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4539:
-

 Summary: Add support for Null Equality Joins
 Key: DRILL-4539
 URL: https://issues.apache.org/jira/browse/DRILL-4539
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Jacques Nadeau
Assignee: Venki Korukanti


Tableau frequently generates queries similar to this:

{code}
SELECT `t0`.`city` AS `city`,
  `t2`.`X_measure__B` AS `max_Calculation_DFIDBHHAIIECCJFDAG_ok`,
  `t0`.`state` AS `state`,
  `t0`.`sum_stars_ok` AS `sum_stars_ok`
FROM (
  SELECT `business`.`city` AS `city`,
`business`.`state` AS `state`,
SUM(`business`.`stars`) AS `sum_stars_ok`
  FROM `mongo.academic`.`business` `business`
  GROUP BY `business`.`city`,
`business`.`state`
) `t0`
  INNER JOIN (
  SELECT MAX(`t1`.`X_measure__A`) AS `X_measure__B`,
`t1`.`city` AS `city`,
`t1`.`state` AS `state`
  FROM (
SELECT `business`.`city` AS `city`,
  `business`.`state` AS `state`,
  `business`.`business_id` AS `business_id`,
  SUM(`business`.`stars`) AS `X_measure__A`
FROM `mongo.academic`.`business` `business`
GROUP BY `business`.`city`,
  `business`.`state`,
  `business`.`business_id`
  ) `t1`
  GROUP BY `t1`.`city`,
`t1`.`state`
) `t2` ON (((`t0`.`city` = `t2`.`city`) OR ((`t0`.`city` IS NULL) AND 
(`t2`.`city` IS NULL))) AND ((`t0`.`state` = `t2`.`state`) OR ((`t0`.`state` IS 
NULL) AND (`t2`.`state` IS NULL))))
{code}

If you look at the join condition, you'll note that it is an equality 
condition which also allows null=null. We should add a planning 
rewrite rule and execution join option to allow null equality so that we don't 
treat this as a cartesian join.
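
For reference, the semantics the rewrite needs to recognize are those of SQL's 
IS NOT DISTINCT FROM; a tiny illustration of the intended behavior (not the 
planner rule itself):

{code}
class NullSafeEquality {
  // Null-safe equality: (a = b) OR (a IS NULL AND b IS NULL).
  static boolean nullSafeEquals(Object a, Object b) {
    if (a == null && b == null) {
      return true;   // NULL matches NULL
    }
    if (a == null || b == null) {
      return false;  // NULL never matches a non-null value
    }
    return a.equals(b);
  }
}
{code}

Recognizing that pattern lets the planner keep city and state as equi-join keys 
(e.g. for a hash join) instead of falling back to a cartesian join plus filter.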



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposal: Create v2 branch to work on breaking changes

2016-03-24 Thread Jacques Nadeau
I also propose that we should turn on the union type as part of this as
well. I've opened DRILL-4538 to track that.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 24, 2016 at 8:55 AM, Jacques Nadeau  wrote:

> There are some changes that either have reviews pending or are in progress
> that would require breaking changes to Drill core.
>
> Examples Include:
> DRILL-4455 (arrow integration)
> DRILL-4417 (jdbc/odbc/rpc changes)
> DRILL-4534 (improve null performance)
>
> I've created a new 2.0.0 release version in JIRA and moved these tasks to
> that umbrella.
>
> I'd like to propose a new v2 release branch where we can start
> incorporating these changes without disrupting v1 stability and
> compatibility.
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>


[jira] [Created] (DRILL-4538) Turn on Union type by default

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4538:
-

 Summary: Turn on Union type by default
 Key: DRILL-4538
 URL: https://issues.apache.org/jira/browse/DRILL-4538
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Jacques Nadeau
 Fix For: 2.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Drill on YARN

2016-03-24 Thread Jacques Nadeau
Your proposed allocation approach makes a lot of sense. I think it will
solve a large number of use cases. Thanks for giving an overview of the
different frameworks. I wonder if they got too focused on the simple use
case.

Have you looked at Llama to see whether we could extend it for our needs?
It's Apache-licensed and probably has at least a start at a bunch of things
we're trying to do.

https://github.com/cloudera/llama
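
As a concrete reference point for the user-controlled, semi-static allocation
discussed below, here is a rough sketch of an application master asking YARN
for a fixed set of drillbit containers. The container sizing, placement, and
launch details are assumptions for illustration; no such Drill module exists
today, and registration with the RM plus the allocate() heartbeat loop are
omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class DrillbitContainerRequester {
  public static void main(String[] args) throws Exception {
    // Assumed sizing: 8 GB / 4 vcores per drillbit; adjust per cluster.
    Resource perDrillbit = Resource.newInstance(8192, 4);
    Priority priority = Priority.newInstance(0);

    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // Semi-static allocation: request a fixed number of containers up front.
    // A small client tool could later add or remove requests as the user
    // grows or shrinks the cluster.
    int drillbitCount = 3;
    for (int i = 0; i < drillbitCount; i++) {
      rmClient.addContainerRequest(
          new ContainerRequest(perDrillbit, null /* nodes */, null /* racks */, priority));
    }
    // Container launch (running drillbit.sh) and ZK-based health monitoring
    // would follow in a real application master.
  }
}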

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 22, 2016 at 7:42 PM, Paul Rogers  wrote:

> Hi Jacques,
>
> I’m thinking of “semi-static” allocation at first. Spin up a cluster of
> Drill-bits, after which the user can add or remove nodes while the cluster
> runs. (The add part is easy, the remove part is a bit tricky since we don’t
> yet have a way to gracefully shut down a Drill-bit.) Once we get the basics
> to work, we can incrementally try out dynamics. For example, someone could
> whip up a script to look at load and use the proposed YARN client app to
> adjust resources. Later, we can fold dynamic load management into the
> solution once we’re sure what folks want.
>
> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had
> great hope for Slider. But, it turns out that Slider and Weave have each
> built an elaborate framework to isolate us from YARN. The Slider framework
> (written in Python) seems harder to understand than YARN itself. At least,
> one has to be an expert in YARN to understand what all that Python code
> does. And, just looking at the class count in the Twill Javadoc was
> overwhelming. Slider and Twill have to solve the general case. If we build
> our own Java solution, we only have to solve the Drill case, which is
> likely much simpler.
>
> A bespoke solution would seem to offer some other advantages. It lets us
> do things like integrate ZK monitoring so we can learn of zombie drill bits
> (haven’t exited, but not sending heartbeat messages.) We can also gather
> metrics and historical data about the cluster as a whole. We can try out
> different cluster topologies. (Run Drill-bits on x of y nodes on a rack,
> say.) And, we can eventually do the dynamic load management we discussed
> earlier.
>
> But first, I look forward to hearing what others have tried and what we’ve
> learned about how people want to use Drill in a production YARN cluster.
>
> Thanks,
>
> - Paul
>
>
> > On Mar 22, 2016, at 5:45 PM, Jacques Nadeau  wrote:
> >
> > This is great news, welcome!
> >
> > What are you thinking in regards to static versus dynamic resource
> > allocation? We have some conversations going regarding workload
> management
> > but they are still early so it seems like starting with user-controlled
> > allocation makes sense initially.
> >
> > Also, have you spent much time evaluating whether one of the existing
> YARN
> > frameworks such as Slider would be useful? Does anyone on the list have
> any
> > feedback on the relative merits of these technologies?
> >
> > Again, glad to see someone picking this up.
> >
> > Jacques
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers 
> wrote:
> >
> >> Hi All,
> >>
> >> I’m a new member of the Drill Team here at MapR. We’d like to take a
> look
> >> at running Drill on YARN for production customers. JIRA suggests some
> early
> >> work may have been done (DRILL-142 <
> >> https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170 <
> >> https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675 <
> >> https://issues.apache.org/jira/browse/DRILL-3675>).
> >>
> >> YARN is a complex beast and the Drill community is large and growing.
> So,
> >> a good place to start is to ask if anyone has already done work on
> >> integrating Drill with YARN (see DRILL-142)?  Or has thought about what
> >> might be needed?
> >>
> >> DRILL-1170 (YARN support for Drill) seems a good place to gather
> >> requirements, designs and so on. I’ve posted a “starter set” of
> >> requirements to spur discussion.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >>
>
>


[jira] [Created] (DRILL-4537) FileSystemStoragePlugin optimizer rules leaking outside StoragePlugin

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4537:
-

 Summary: FileSystemStoragePlugin optimizer rules leaking outside 
StoragePlugin
 Key: DRILL-4537
 URL: https://issues.apache.org/jira/browse/DRILL-4537
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jacques Nadeau
Assignee: Jason Altekruse


I was recently trying to use Drill without having a FileSystemPlugin 
configured. Even in this case, it pulls in optimizer rules that are specific to 
the FileSystemPlugin (specifically Parquet rules at least). The 
FileSystemPlugin should use the same plugin interfaces as all the other 
StoragePlugins and expose its rules through the StoragePlugin interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Remove required type

2016-03-24 Thread Jacques Nadeau
Sorry if that is what you thought I was referring to.

My main question at the top of this thread was about the customer impact.
Since I'm now proposing a coupling so there is no regression, I think your
customer concern should be addressed. My statement about theoretical
regressions was specifically in reference to future features.


[jira] [Created] (DRILL-4534) Replace declarative null type with observed null type in execution layer

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4534:
-

 Summary: Replace declarative null type with observed null type in 
execution layer
 Key: DRILL-4534
 URL: https://issues.apache.org/jira/browse/DRILL-4534
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Jacques Nadeau
Assignee: Jacques Nadeau
 Fix For: 2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Remove required type

2016-03-24 Thread Jacques Nadeau
I've created DRILL-4534 to track this issue.

The reality is that the fear around theoretical performance regressions is
holding this back. By making this both the removal of the required type and
the switch to columnar null evaluation, those fears should be allayed. I've
created both as subtasks under DRILL-4534. This is a breaking change (it
changes the UDF interface) so I've associated this with the v2 release.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 24, 2016 at 8:48 AM, Jacques Nadeau  wrote:

> My numbers show a declarative approach is unnecessary in execution.
>
> >> Having the right tools would help...
>
> Declarative is great in planning and should continue to exist. The right
> tools will continue to exist.
>
> It seems like a number of people here are worried performance of future
> features. I'm also focused on performance. Cleaning up mistakes is the way
> we're going to get to the next level of performance.
>
> It is clear from my numbers that a columnar observation approach would be
> a huge win across virtually all current workloads.
>
> I think there is a second dynamic here: this type of change, much like the
> a few others proposed right now are not trivial changes: there are huge
> benefits to what we're proposing but it is possible that some workloads
> won't be as good. That seems like a step-function change (and some would
> call it a breaking change). I'm going to start a new thread on creation of
> a v2 branch.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Mar 24, 2016 at 8:38 AM, Aman Sinha  wrote:
>
>> With regard to the following:
>>
>>
>>
>> *>> The only time we use the "required" path is if the underlying data >>
>> guarantees that all the data will be non-null. I believe that path is >>
>> rarely used, poorly tested and provides only a small gain in performance
>> >>
>> when used.*
>>
>> The main reason this code path is less used is because currently there is
>> no declarative way of specifying the required type.  Going forward, at
>> least 2 features (probably several more) that would require a declarative
>> approach:
>>
>>1. INSERT INTO:   I recall discussions from last year where we wanted
>> to
>>keep the merged schema in some metadata file.  This would allow an
>> insert
>>row to be quickly rejected if its schema did not match the merged
>> schema.
>>2. Sort physical property of a column in files in order to do
>> merge-join
>>or streaming aggregate without re-sorting the data.  This physical
>> property
>>would also be declared in the metadata file.
>>
>> Once this functionality is added (I am not sure of the timeline but
>> likely in a few months) we could leverage the same declarative way for NOT
>> NULL attributes for the underlying data.
>>
>> For data warehouse offloads (a major use-case of Drill), we need to make
>> the ForeignKey-PrimaryKey joins (assume both are guaranteed to be non-null
>> for this scenario) as fast as possible to compete with the RDBMSs.
>>  Having
>> the right tools would help...
>>
>>
>> On Wed, Mar 23, 2016 at 2:55 PM, Jacques Nadeau 
>> wrote:
>>
>> > There seems to be a lot of confusion on this thread.
>> >
>> > We have large amount of code that separates physical representations of
>> > data that can be possibly null versus data that can't be null. We have a
>> > rigid concept in MajorType of whether data is nullable or required. If
>> we
>> > change from one to the other, that is a schema change inside of Drill
>> (and
>> > treated much the same as changing from Integer to Map). As we compile
>> > expression trees, we have to constantly manage whether or not items are
>> > null or not null. We also don't cast between the two. So UDF, Vector
>> > classes, code generation, schema management, schema change are all much
>> > more complicated because of this fact. I proposed this complexity
>> initially
>> > but looking at the continued cost and nominal benefit, think it was a
>> > mistake.
>> >
>> > The only time we use the "required" path is if the underlying data
>> > guarantees that all the data will be non-null. I believe that path is
>> > rarely used, poorly tested and provides only a small gain in performance
>> > when used. In essence, it creates a permutation nightmare (just like us
>> > having too many minor types) with marginal benefit.
> >

Re: [DISCUSS] Remove required type

2016-03-24 Thread Jacques Nadeau
My numbers show a declarative approach is unnecessary in execution.

>> Having the right tools would help...

Declarative is great in planning and should continue to exist. The right
tools will continue to exist.

It seems like a number of people here are worried about the performance of future
features. I'm also focused on performance. Cleaning up mistakes is the way
we're going to get to the next level of performance.

It is clear from my numbers that a columnar observation approach would be a
huge win across virtually all current workloads.

I think there is a second dynamic here: this type of change, much like a
few others proposed right now, is not a trivial change: there are huge
benefits to what we're proposing but it is possible that some workloads
won't be as good. That seems like a step-function change (and some would
call it a breaking change). I'm going to start a new thread on creation of
a v2 branch.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 24, 2016 at 8:38 AM, Aman Sinha  wrote:

> With regard to the following:
>
>
>
> *>> The only time we use the "required" path is if the underlying data >>
> guarantees that all the data will be non-null. I believe that path is >>
> rarely used, poorly tested and provides only a small gain in performance >>
> when used.*
>
> The main reason this code path is less used is because currently there is
> no declarative way of specifying the required type.  Going forward, at
> least 2 features (probably several more) that would require a declarative
> approach:
>
>1. INSERT INTO:   I recall discussions from last year where we wanted to
>keep the merged schema in some metadata file.  This would allow an
> insert
>row to be quickly rejected if its schema did not match the merged
> schema.
>2. Sort physical property of a column in files in order to do merge-join
>or streaming aggregate without re-sorting the data.  This physical
> property
>would also be declared in the metadata file.
>
> Once this functionality is added (I am not sure of the timeline but
> likely in a few months) we could leverage the same declarative way for NOT
> NULL attributes for the underlying data.
>
> For data warehouse offloads (a major use-case of Drill), we need to make
> the ForeignKey-PrimaryKey joins (assume both are guaranteed to be non-null
> for this scenario) as fast as possible to compete with the RDBMSs.   Having
> the right tools would help...
>
>
> On Wed, Mar 23, 2016 at 2:55 PM, Jacques Nadeau 
> wrote:
>
> > There seems to be a lot of confusion on this thread.
> >
> > We have large amount of code that separates physical representations of
> > data that can be possibly null versus data that can't be null. We have a
> > rigid concept in MajorType of whether data is nullable or required. If we
> > change from one to the other, that is a schema change inside of Drill
> (and
> > treated much the same as changing from Integer to Map). As we compile
> > expression trees, we have to constantly manage whether or not items are
> > null or not null. We also don't cast between the two. So UDF, Vector
> > classes, code generation, schema management, schema change are all much
> > more complicated because of this fact. I proposed this complexity
> initially
> > but looking at the continued cost and nominal benefit, think it was a
> > mistake.
> >
> > The only time we use the "required" path is if the underlying data
> > guarantees that all the data will be non-null. I believe that path is
> > rarely used, poorly tested and provides only a small gain in performance
> > when used. In essence, it creates a permutation nightmare (just like us
> > having too many minor types) with marginal benefit.
> >
> > The proposal here is to correct that mistake.
> >
> > **Separately**, Drill should take better advantage of observed not-null
> > data.
> >
> > >> You may not generate not-null data, but a lot of data is not-null.
> >
> > Yes! You are 100% correct. Drill often chews through large amounts of
> data
> > that is annotated as nullable but has no nulls. For example, we run
> > benchmarks on TPCH data. The TPCH dataset doesn't have nulls. However, we
> > store the data as nullable (to be consistent with how virtually all
> systems
> > generate the data). As such, *Drill uses the nullable path* for the
> > entirety of execution. This is a great opportunity for performance
> > improvements. However, it is orthogonal to whether we remove the code
> path
> > above ** since it doesn't use it**. Ultimately we should allow the
>

[jira] [Created] (DRILL-4536) Modify Project such that NULL_IF_NULL handling operates columnar

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4536:
-

 Summary: Modify Project such that NULL_IF_NULL handling operates 
columnar
 Key: DRILL-4536
 URL: https://issues.apache.org/jira/browse/DRILL-4536
 Project: Apache Drill
  Issue Type: Sub-task
Reporter: Jacques Nadeau






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4535) Remove declarative null type

2016-03-24 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4535:
-

 Summary: Remove declarative null type
 Key: DRILL-4535
 URL: https://issues.apache.org/jira/browse/DRILL-4535
 Project: Apache Drill
  Issue Type: Sub-task
Reporter: Jacques Nadeau






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Proposal: Create v2 branch to work on breaking changes

2016-03-24 Thread Jacques Nadeau
There are some changes that either have reviews pending or are in progress
that would require breaking changes to Drill core.

Examples Include:
DRILL-4455 (arrow integration)
DRILL-4417 (jdbc/odbc/rpc changes)
DRILL-4534 (improve null performance)

I've created a new 2.0.0 release version in JIRA and moved these tasks to
that umbrella.

I'd like to propose a new v2 release branch where we can start
incorporating these changes without disrupting v1 stability and
compatibility.


--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Drill automation framework update

2016-03-23 Thread Jacques Nadeau
Can you confirm that you've successfully executed the tests on Apache HDFS
2.7.1? I note that you have modified the plans to remove the maprfs prefix;
however, you have kept the individual file names. I believe the ordering of
these files is not the same in HDFS versus MapRFS and thus tests will fail.
Can you confirm or dispute that issue?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 23, 2016 at 4:19 PM, Chun Chang  wrote:

> Hi drillers,
>
> MapR recently made changes to the automation framework* to make it easier
> running against HDFS cluster. Please refer to the updated README file for
> detail. Let us know if you encounter any issues.
>
> Thanks,
> -Chun
>
> *https://github.com/mapr/drill-test-framework
>


Re: [DISCUSS] Remove required type

2016-03-23 Thread Jacques Nadeau
There seems to be a lot of confusion on this thread.

We have a large amount of code that separates physical representations of
data that can be possibly null versus data that can't be null. We have a
rigid concept in MajorType of whether data is nullable or required. If we
change from one to the other, that is a schema change inside of Drill (and
treated much the same as changing from Integer to Map). As we compile
expression trees, we have to constantly manage whether or not items are
null or not null. We also don't cast between the two. So UDF, Vector
classes, code generation, schema management, schema change are all much
more complicated because of this fact. I proposed this complexity initially
but, looking at the continued cost and nominal benefit, I think it was a
mistake.

The only time we use the "required" path is if the underlying data
guarantees that all the data will be non-null. I believe that path is
rarely used, poorly tested and provides only a small gain in performance
when used. In essence, it creates a permutation nightmare (just like us
having too many minor types) with marginal benefit.

The proposal here is to correct that mistake.

**Separately**, Drill should take better advantage of observed not-null
data.

>> You may not generate not-null data, but a lot of data is not-null.

Yes! You are 100% correct. Drill often chews through large amounts of data
that is annotated as nullable but has no nulls. For example, we run
benchmarks on TPCH data. The TPCH dataset doesn't have nulls. However, we
store the data as nullable (to be consistent with how virtually all systems
generate the data). As such, *Drill uses the nullable path* for the
entirety of execution. This is a great opportunity for performance
improvements. However, it is orthogonal to whether we remove the code path
above **since it doesn't use it**. Ultimately we should allow the
execution engine to decide the operation path **rather than having a schema
level concept** that creates more code combinations and schema change.

My additional perspective is that keeping the cruft from that mistake means that
doing the right thing (using observed nulls instead of annotated nulls) is
substantially harder to implement and reduces the likelihood that it will
be implemented.

With regards to columnar benefits for calculations (which I again argue is
actually orthogonal to the initial proposal), I put together an ideal-condition
test. In reality, we have more indirection and I'd actually expect a larger
benefit from moving to columnar null evaluation than this test shows. (For
example: (1) everybody still runs with bounds checking, which
introduces an additional check for each null bit and (2) we always read
memory values in addition to null bits before inspecting the null bits). As
you can see below, having a columnar approach means that performance varies
little depending on nullability. Optimizing for the columnar no-nulls case
provides 5-6% additional performance which seems like a late optimization
compared to where we should be focused: moving to columnar execution.

Benchmark                                           Mode  Cnt     Score     Error  Units
ColumnarComparisons.a_plus_b_columnar               avgt  200  2059.743 ±   9.625  ns/op
ColumnarComparisons.a_plus_b_non_null               avgt  200  1934.380 ±  10.279  ns/op
ColumnarComparisons.a_plus_b_current_drill          avgt  200  6737.569 ± 396.452  ns/op
ColumnarComparisons.a_plus_b_plus_c_columnar        avgt  200  2565.702 ±  12.139  ns/op
ColumnarComparisons.a_plus_b_plus_c_non_null        avgt  200  2437.322 ±  12.875  ns/op
ColumnarComparisons.a_plus_b_plus_c_current_drill   avgt  200  9010.913 ± 475.392  ns/op

This comes out as:

columnar a+b 0.5ns/record
current a+b 1.7ns/record
no-null a+b 0.5ns/record
columnar a+b+c 0.6ns/record
current a+b+c 2.25ns/record
no-null a+b+c 0.6ns/record

relative differences:
columnar versus current (a+b): 3.2x
columnar versus current (a+b+c): 3.5x
columnar no-nulls versus columnar null eval (a+b): 1.06x
columnar no-nulls versus columnar null eval (a+b+c): 1.05x

Code here: https://gist.github.com/jacques-n/70fa5afdeadba28ea398
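
For anyone who doesn't want to open the gist, the shape of the comparison is
roughly the following (a minimal sketch of my own, not the benchmark code):
compute validity word-at-a-time and do the arithmetic unconditionally, versus
branching on the null bit for every record.

class NullEvaluationShapes {
  // Columnar shape: derive output validity 64 records at a time, then add all
  // values unconditionally; slots whose validity bit is clear are ignored later.
  static void addColumnar(int[] a, long[] aValid, int[] b, long[] bValid,
                          int[] out, long[] outValid, int count) {
    for (int w = 0; w < (count + 63) / 64; w++) {
      outValid[w] = aValid[w] & bValid[w];
    }
    for (int i = 0; i < count; i++) {
      out[i] = a[i] + b[i];
    }
  }

  // Current shape: a per-record null check interleaved with the arithmetic.
  static void addPerRecord(int[] a, boolean[] aSet, int[] b, boolean[] bSet,
                           int[] out, boolean[] outSet, int count) {
    for (int i = 0; i < count; i++) {
      if (aSet[i] && bSet[i]) {
        out[i] = a[i] + b[i];
        outSet[i] = true;
      } else {
        outSet[i] = false;
      }
    }
  }
}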





--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 23, 2016 at 11:58 AM, Parth Chandra  wrote:

> Hmm. I may not have expressed my thoughts clearly.
> What I was suggesting was that 'non-null' data exists in all data sets. (I
> have at least two data sets from users with Drill in production (sorry,
> cannot share the data), that have required fields in parquet files). The
> fields may not be marked as such in the metadata, or the data source may
> not have any such metadata, but if we can identify the type as non-null,
> then we can (and should) take advantage of it.
> If we are already taking advantage of it, then we should not make any
> changes without understanding the tradeoffs.
> So in the spirit of understanding that, I'd like to ask two questions -
> 1) Where specifica

Re: [DISCUSS] Remove required type

2016-03-23 Thread Jacques Nadeau
I agree that we should focus on real benefits versus theories.

Reduction in code complexity is a real benefit. Performance benefits from
having required types are theoretical. Dot drill files don't exist, so they
should have little bearing on this conversation.

We rarely generate required data. Most tools never generate it. The reason
the question is about actual deployments is that would be a real factor to
counterbalance the drive for code simplification rather than something
theoretical. A theoretical future performance regression shouldn't stop
code improvement. If it did, we wouldn't make any progress.

What about your own internal benchmark tests? If removing required types
doesn't impact them, doesn't that mean this hasn't been a point of focus?
On Mar 22, 2016 8:36 PM, "Parth Chandra"  wrote:

> I don't know if the main question is whether people have parquet (or other
> ) files which have required fields or not. With something like a dot drill
> file, a user can supply schema or format for data that does not carry
> schema, and we can certainly use the same to indicate knowledge of
> nullability. The question is whether we can take advantage of knowing
> whether data is null or not to get better performance.
>
> Any argument that applies to taking advantage of non-nullability at the
> batch level applies to taking advantage of non-nullability at the schema
> level.
>
> I'm not entirely convinced that the reduction of code complexity is
> ultimately leading to performance gain. Sure, it improves maintainability,
> but what specific improvements are you thinking of that will increase
> performance?
>
> If you recommend some areas of improvement that become possible as a result
> of this change, then I would suggest we run some experiments before we make
> any change.
>
> It is a capital mistake to theorize before one has data, etc...
>
> A 15% performance drop is not something to be ignored, I would think.
>
> Parth
>
> On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau 
> wrote:
> >
> > Re Performance:
> >
> > I think the main question is what portion of people's data is actually
> > marked as non-nullable in Parquet files? (We already treat json, avro,
> > kudu, and hbase (except row key) as nullable. We do treat csv as
> > non-nullable (array) but I think these workloads start with conversion to
> > Parquet.)  Early on, we typically benchmarked Drill using required fields
> > in Parquet. At the time, we actually hacked the Pig code to get something
> > to even generate this format. (I believe, to this day, Pig only generates
> > nullable fields in Parquet.) After some time, we recognized that
> basically
> > every tool was producing Parquet files that were nullable and ultimately
> > moved the benchmark infrastructure to using nullable types to do a better
> > job of representing real-world workloads.
> >
> > Based on my (fuzzy) recollection, working with nullable types had a
> 10-15%
> > performance impact versus working on required types so I think there is a
> > performance impact but I think the population of users who have
> > non-nullable Parquet files are small. If I recall, I believe Impala also
> > creates nullable Parquet files. Not sure what Spark does. I believe Hive
> > has also made this change recently or is doing it (deprecating non-nulls
> in
> > their internals).
> >
> > If we move forward with this, I would expect there initially would be a
> > decrease in performance when data is held as non-nullable given we
> > previously observed this. However, I believe the reduction in code
> > complexity would leads us to improve other things more quickly. Which
> leads
> > me to...
> >
> > Re: Why
> >
> > Drill suffers from code complexity. This hurts forward progress. One
> > example is the fact that we have to generate all nullable permutations of
> > functions. (For example, if we have three arguments, we have to generate
> 8
> > separate functions to work with the combination of argument
> nullabilities).
> > This leads to vastly more reliance on compile-time templating which is a
> > maintenance headache. Additionally, it makes the runtime code generation
> > more complicated and error prone. Testing is also more expensive because
> we
> > now have twice as many paths to both validate and maintain.
> Realistically,
> > we should try to move to more columnar algorithms, which would provide a
> > bigger lift than this declared schema nullability optimization. This is
> > because many workloads have rare nulls so we can actually optimize better
> > at the batch leve

Re: Drill on YARN

2016-03-22 Thread Jacques Nadeau
This is great news, welcome!

What are you thinking in regards to static versus dynamic resource
allocation? We have some conversations going regarding workload management
but they are still early so it seems like starting with user-controlled
allocation makes sense initially.

Also, have you spent much time evaluating whether one of the existing YARN
frameworks such as Slider would be useful? Does anyone on the list have any
feedback on the relative merits of these technologies?

Again, glad to see someone picking this up.

Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers  wrote:

> Hi All,
>
> I’m a new member of the Drill Team here at MapR. We’d like to take a look
> at running Drill on YARN for production customers. JIRA suggests some early
> work may have been done (DRILL-142 <
> https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170 <
> https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675 <
> https://issues.apache.org/jira/browse/DRILL-3675>).
>
> YARN is a complex beast and the Drill community is large and growing. So,
> a good place to start is to ask if anyone has already done work on
> integrating Drill with YARN (see DRILL-142)?  Or has thought about what
> might be needed?
>
> DRILL-1170 (YARN support for Drill) seems a good place to gather
> requirements, designs and so on. I’ve posted a “starter set” of
> requirements to spur discussion.
>
> Thanks,
>
> - Paul
>
>


Re: [DISCUSS] Remove required type

2016-03-22 Thread Jacques Nadeau
Re Performance:

I think the main question is what portion of people's data is actually
marked as non-nullable in Parquet files? (We already treat json, avro,
kudu, and hbase (except row key) as nullable. We do treat csv as
non-nullable (array) but I think these workloads start with conversion to
Parquet.)  Early on, we typically benchmarked Drill using required fields
in Parquet. At the time, we actually hacked the Pig code to get something
to even generate this format. (I believe, to this day, Pig only generates
nullable fields in Parquet.) After some time, we recognized that basically
every tool was producing Parquet files that were nullable and ultimately
moved the benchmark infrastructure to using nullable types to do a better
job of representing real-world workloads.

Based on my (fuzzy) recollection, working with nullable types had a 10-15%
performance impact versus working on required types so I think there is a
performance impact, but I think the population of users who have
non-nullable Parquet files is small. If I recall, I believe Impala also
creates nullable Parquet files. Not sure what Spark does. I believe Hive
has also made this change recently or is doing it (deprecating non-nulls in
their internals).

If we move forward with this, I would expect there initially would be a
decrease in performance when data is held as non-nullable given we
previously observed this. However, I believe the reduction in code
complexity would leads us to improve other things more quickly. Which leads
me to...

Re: Why

Drill suffers from code complexity. This hurts forward progress. One
example is the fact that we have to generate all nullable permutations of
functions. (For example, if we have three arguments, we have to generate 8
separate functions to work with the combination of argument nullabilities).
This leads to vastly more reliance on compile-time templating which is a
maintenance headache. Additionally, it makes the runtime code generation
more complicated and error prone. Testing is also more expensive because we
now have twice as many paths to both validate and maintain.  Realistically,
we should try to move to more columnar algorithms, which would provide a
bigger lift than this declared schema nullability optimization. This is
because many workloads have rare nulls so we can actually optimize better
at the batch level. Creating three code paths (nullable observed non-null,
nullable observed null, and non-null) makes this substantially more
complicated. We want to invest in this area, but the code complexity of
nullable versus required makes these tasks less likely to happen and harder. So,
in essence, I'm arguing that it is a small short-term loss that leads to
better code quality and ultimately faster performance.
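
To make the permutation point concrete, here is an illustration (hand-written,
not generated Drill code) of what declared nullability forces for a
two-argument add with INTERNAL null handling; a third argument doubles the
count again to eight variants, while nullable-only vectors need just one:

class AddVariants {
  // required, required
  static int addRequiredRequired(int a, int b) {
    return a + b;
  }

  // nullable, required
  static Integer addNullableRequired(Integer a, int b) {
    return a == null ? null : a + b;
  }

  // required, nullable
  static Integer addRequiredNullable(int a, Integer b) {
    return b == null ? null : a + b;
  }

  // nullable, nullable
  static Integer addNullableNullable(Integer a, Integer b) {
    return (a == null || b == null) ? null : a + b;
  }
}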

Do others have real-world observations on the frequency of required fields
in Parquet files?

thanks,
Jacques



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra  wrote:

> I'm not entirely convinced that this would have no performance impact. Do
> we have any experiments?
>
>
> On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau 
> wrote:
>
> > My suggestion is we use explicit observation at the batch level. If there
> > are no nulls we can optimize this batch. This would ultimately improve
> over
> > our current situation where most parquet and all json data is nullable so
> > we don't optimize. I'd estimate that the vast majority of Drills
> workloads
> > are marked nullable whether they are or not. So what we're really
> > suggesting is deleting a bunch of code which is rarely in the execution
> > path.
> > On Mar 22, 2016 1:22 PM, "Aman Sinha"  wrote:
> >
> > > I was thinking about it more after sending the previous concerns.
> Agree,
> > > this is an execution side change...but some details need to be worked
> > out.
> > > If the planner indicates to the executor that a column is non-nullable
> > (e.g
> > > a primary key),  the run-time generated code is more efficient since it
> > > does not have to check the null bit.  Are you thinking we would use the
> > > existing nullable vector and add some additional metadata (at a record
> > > batch level rather than record level) to indicate non-nullability ?
> > >
> > >
> > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau 
> > > wrote:
> > >
> > > > Hey Aman, I believe both Steven and I were only suggesting removal
> only
> > > > from execution, not planning. It seems like your concerns are all
> > related
> > > > to planning. Iit seems like the real tradeoffs in execution are
> > nominal.
> > > > On Mar 22, 2016 9:03 AM, "Aman Sinha"  wrote:
> > > >
> > > >

Fwd: drill git commit: DRILL-3623: For limit 0 queries, optionally use a shorter execution path when result column types are known

2016-03-22 Thread Jacques Nadeau
Awesome job on this Sudheesh.  Thanks for all the hard work. Thanks also to
Sean for all his work on the previous patch.
-- Forwarded message --
From: 
Date: Mar 22, 2016 4:33 PM
Subject: drill git commit: DRILL-3623: For limit 0 queries, optionally use
a shorter execution path when result column types are known
To: 
Cc:

Repository: drill
Updated Branches:
  refs/heads/master 600ba9ee1 -> 5dbaafbe6


DRILL-3623: For limit 0 queries, optionally use a shorter execution path
when result column types are known

+ "planner.enable_limit0_optimization" option is disabled by default

+ Print plan in PlanTestBase if TEST_QUERY_PRINTING_SILENT is set
+ Fix DrillTestWrapper to verify expected and actual schema
+ Correct the schema of results in TestInbuiltHiveUDFs#testXpath_Double

This closes #405


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/5dbaafbe
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/5dbaafbe
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/5dbaafbe

Branch: refs/heads/master
Commit: 5dbaafbe6651b0a284fef69d5c952d82ce506e20
Parents: 600ba9e
Author: Sudheesh Katkam 
Authored: Tue Mar 22 15:21:51 2016 -0700
Committer: Sudheesh Katkam 
Committed: Tue Mar 22 16:19:01 2016 -0700

--
 .../drill/exec/fn/hive/TestInbuiltHiveUDFs.java |   2 +-
 .../org/apache/drill/exec/ExecConstants.java|   3 +
 .../drill/exec/physical/base/ScanStats.java |   6 +-
 .../apache/drill/exec/planner/PlannerPhase.java |   2 +
 .../planner/logical/DrillDirectScanRel.java |  70 ++
 .../exec/planner/physical/DirectScanPrule.java  |  49 ++
 .../planner/sql/handlers/DefaultSqlHandler.java |  12 +
 .../planner/sql/handlers/FindLimit0Visitor.java | 124 +++-
 .../server/options/SystemOptionManager.java |   1 +
 .../exec/store/direct/DirectGroupScan.java  |  27 +-
 .../java/org/apache/drill/DrillTestWrapper.java |  25 +-
 .../java/org/apache/drill/PlanTestBase.java |   9 +-
 .../impl/limit/TestEarlyLimit0Optimization.java | 663 +++
 13 files changed, 963 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/drill/blob/5dbaafbe/contrib/storage-hive/core/src/test/java/org/apache/drill/exec/fn/hive/TestInbuiltHiveUDFs.java
--
diff --git
a/contrib/storage-hive/core/src/test/java/org/apache/drill/exec/fn/hive/TestInbuiltHiveUDFs.java
b/contrib/storage-hive/core/src/test/java/org/apache/drill/exec/fn/hive/TestInbuiltHiveUDFs.java
index a287c89..a126aaa 100644
---
a/contrib/storage-hive/core/src/test/java/org/apache/drill/exec/fn/hive/TestInbuiltHiveUDFs.java
+++
b/contrib/storage-hive/core/src/test/java/org/apache/drill/exec/fn/hive/TestInbuiltHiveUDFs.java
@@ -58,7 +58,7 @@ public class TestInbuiltHiveUDFs extends HiveTestBase {

 final TypeProtos.MajorType majorType =
TypeProtos.MajorType.newBuilder()
 .setMinorType(TypeProtos.MinorType.FLOAT8)
-.setMode(TypeProtos.DataMode.REQUIRED)
+.setMode(TypeProtos.DataMode.OPTIONAL)
 .build();

 final List> expectedSchema =
Lists.newArrayList();

http://git-wip-us.apache.org/repos/asf/drill/blob/5dbaafbe/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
--
diff --git
a/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
b/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
index b8f25ad..963934d 100644
--- a/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
+++ b/exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
@@ -202,6 +202,9 @@ public interface ExecConstants {
   String AFFINITY_FACTOR_KEY = "planner.affinity_factor";
   OptionValidator AFFINITY_FACTOR = new
DoubleValidator(AFFINITY_FACTOR_KEY, 1.2d);

+  String EARLY_LIMIT0_OPT_KEY = "planner.enable_limit0_optimization";
+  BooleanValidator EARLY_LIMIT0_OPT = new
BooleanValidator(EARLY_LIMIT0_OPT_KEY, false);
+
   String ENABLE_MEMORY_ESTIMATION_KEY =
"planner.memory.enable_memory_estimation";
   OptionValidator ENABLE_MEMORY_ESTIMATION = new
BooleanValidator(ENABLE_MEMORY_ESTIMATION_KEY, false);


http://git-wip-us.apache.org/repos/asf/drill/blob/5dbaafbe/exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/ScanStats.java
--
diff --git
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/ScanStats.java
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/ScanStats.java
index ba36931..1886c14 100644
---
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/ScanStats.java
+++
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/ScanStats.java
@@ -17,13 

Re: [DISCUSS] Remove required type

2016-03-22 Thread Jacques Nadeau
My suggestion is we use explicit observation at the batch level. If there
are no nulls we can optimize this batch. This would ultimately improve over
our current situation where most parquet and all json data is nullable so
we don't optimize. I'd estimate that the vast majority of Drill's workloads
are marked nullable whether they are or not. So what we're really
suggesting is deleting a bunch of code which is rarely in the execution
path.
On Mar 22, 2016 1:22 PM, "Aman Sinha"  wrote:

> I was thinking about it more after sending the previous concerns.  Agree,
> this is an execution side change...but some details need to be worked out.
> If the planner indicates to the executor that a column is non-nullable (e.g
> a primary key),  the run-time generated code is more efficient since it
> does not have to check the null bit.  Are you thinking we would use the
> existing nullable vector and add some additional metadata (at a record
> batch level rather than record level) to indicate non-nullability ?
>
>
> On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau 
> wrote:
>
> > Hey Aman, I believe both Steven and I were only suggesting removal only
> > from execution, not planning. It seems like your concerns are all related
> > to planning. Iit seems like the real tradeoffs in execution are nominal.
> > On Mar 22, 2016 9:03 AM, "Aman Sinha"  wrote:
> >
> > > While it is true that there is code complexity due to the required
> type,
> > > what would we be trading off ?  some important considerations:
> > >   - We don't currently have null count statistics which would need to
> be
> > > implemented for various data sources
> > >   - Primary keys in the RDBMS sources (or rowkeys in hbase) are always
> > > non-null, and although today we may not be doing optimizations to
> > leverage
> > > that,  one could easily add a rule that converts  WHERE primary_key IS
> > NULL
> > > to a FALSE filter.
> > >
> > >
> > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <
> doshin...@commvault.com>
> > > wrote:
> > >
> > > > Hi Jacques,
> > > > Marginally related to this, I made a small change in PR-372
> > (DRILL-4184)
> > > > to support variable widths for decimal quantities in Parquet.  I
> found
> > > the
> > > > (decimal) vectoring code to be very difficult to understand (probably
> > > > because it's overly complex, but also because I'm new to Drill code
> in
> > > > general), so I made a small, surgical change in my pull request to
> > > support
> > > > keeping track of variable widths (lengths) and null booleans within
> the
> > > > existing fixed width decimal vectoring scheme.  Can my changes be
> > > > reviewed/accepted, and then we discuss how to fix properly long-term?
> > > >
> > > > Thanks,
> > > > Dave Oshinsky
> > > >
> > > > -Original Message-
> > > > From: Jacques Nadeau [mailto:jacq...@dremio.com]
> > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > To: dev
> > > > Subject: Re: [DISCUSS] Remove required type
> > > >
> > > > Definitely in support of this. The required type is a huge
> maintenance
> > > and
> > > > code complexity nightmare that provides little to no benefit. As you
> > > point
> > > > out, we can do better performance optimizations through null count
> > > > observation since most sources are nullable anyway.
> > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" 
> wrote:
> > > >
> > > > > I have been thinking about this for a while now, and I feel it
> would
> > > > > be a good idea to remove the Required vector types from Drill, and
> > > > > only use the Nullable version of vectors. I think this will greatly
> > > > simplify the code.
> > > > > It will also simplify the creation of UDFs. As is, if a function
> has
> > > > > custom null handling (i.e. INTERNAL), the function has to be
> > > > > separately implemented for each permutation of nullability of the
> > > > > inputs. But if drill data types are always nullable, this wouldn't
> > be a
> > > > problem.
> > > > >
> > > > > I don't think there would be much impact on performance. In
> practice,
> > > > > I think the required type is used very rarely. And there are other
> > > > > ways we can optimize for when a column is known to have no nulls.
> > > > >
> > > > > Thoughts?
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
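
For illustration, here is a minimal, hypothetical sketch (not code from the Drill project) of the batch-level observation discussed in this thread: if a record batch reports a null count of zero, the per-value null check can be skipped entirely.

// Hypothetical illustration only; names and data layout are invented for this sketch.
public class BatchSum {

  /** Sums an integer column for one record batch. */
  static long sumColumn(int[] values, boolean[] isNull, int nullCount) {
    long sum = 0;
    if (nullCount == 0) {
      // Batch observed to contain no nulls: use the tight, null-free loop.
      for (int i = 0; i < values.length; i++) {
        sum += values[i];
      }
    } else {
      // General path: consult the validity flag for every value.
      for (int i = 0; i < values.length; i++) {
        if (!isNull[i]) {
          sum += values[i];
        }
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] values = {1, 2, 3, 4};
    boolean[] isNull = new boolean[4];                  // no nulls in this batch
    System.out.println(sumColumn(values, isNull, 0));   // prints 10 via the fast path
  }
}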


Re: [DISCUSS] Remove required type

2016-03-22 Thread Jacques Nadeau
Hey Aman, I believe both Steven and I were only suggesting removal
from execution, not planning. It seems like your concerns are all related
to planning. It seems like the real tradeoffs in execution are nominal.
On Mar 22, 2016 9:03 AM, "Aman Sinha"  wrote:

> While it is true that there is code complexity due to the required type,
> what would we be trading off ?  some important considerations:
>   - We don't currently have null count statistics which would need to be
> implemented for various data sources
>   - Primary keys in the RDBMS sources (or rowkeys in hbase) are always
> non-null, and although today we may not be doing optimizations to leverage
> that,  one could easily add a rule that converts  WHERE primary_key IS NULL
> to a FALSE filter.
>
>
> On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky 
> wrote:
>
> > Hi Jacques,
> > Marginally related to this, I made a small change in PR-372 (DRILL-4184)
> > to support variable widths for decimal quantities in Parquet.  I found
> the
> > (decimal) vectoring code to be very difficult to understand (probably
> > because it's overly complex, but also because I'm new to Drill code in
> > general), so I made a small, surgical change in my pull request to
> support
> > keeping track of variable widths (lengths) and null booleans within the
> > existing fixed width decimal vectoring scheme.  Can my changes be
> > reviewed/accepted, and then we can discuss how to fix it properly long-term?
> >
> > Thanks,
> > Dave Oshinsky
> >
> > -Original Message-
> > From: Jacques Nadeau [mailto:jacq...@dremio.com]
> > Sent: Monday, March 21, 2016 11:43 PM
> > To: dev
> > Subject: Re: [DISCUSS] Remove required type
> >
> > Definitely in support of this. The required type is a huge maintenance
> and
> > code complexity nightmare that provides little to no benefit. As you
> point
> > out, we can do better performance optimizations through null count
> > observation since most sources are nullable anyway.
> > On Mar 21, 2016 7:41 PM, "Steven Phillips"  wrote:
> >
> > > I have been thinking about this for a while now, and I feel it would
> > > be a good idea to remove the Required vector types from Drill, and
> > > only use the Nullable version of vectors. I think this will greatly
> > simplify the code.
> > > It will also simplify the creation of UDFs. As is, if a function has
> > > custom null handling (i.e. INTERNAL), the function has to be
> > > separately implemented for each permutation of nullability of the
> > > inputs. But if drill data types are always nullable, this wouldn't be a
> > problem.
> > >
> > > I don't think there would be much impact on performance. In practice,
> > > I think the required type is used very rarely. And there are other
> > > ways we can optimize for when a column is known to have no nulls.
> > >
> > > Thoughts?
> > >
> >
> >
> >
>


Next Release

2016-03-22 Thread Jacques Nadeau
Hey All,

I'd like to volunteer to be the 1.7 release manager. I'd also like to plan
putting together a target feature list for the release now so we can all
plan ahead. I'll share an initial stab at this later today if people think
that sounds good.

Thanks
Jacques


Re: [DISCUSS] Remove required type

2016-03-21 Thread Jacques Nadeau
Definitely in support of this. The required type is a huge maintenance and
code complexity nightmare that provides little to no benefit. As you point
out, we can do better performance optimizations through null count
observation since most sources are nullable anyway.
On Mar 21, 2016 7:41 PM, "Steven Phillips"  wrote:

> I have been thinking about this for a while now, and I feel it would be a
> good idea to remove the Required vector types from Drill, and only use the
> Nullable version of vectors. I think this will greatly simplify the code.
> It will also simplify the creation of UDFs. As is, if a function has custom
> null handling (i.e. INTERNAL), the function has to be separately
> implemented for each permutation of nullability of the inputs. But if drill
> data types are always nullable, this wouldn't be a problem.
>
> I don't think there would be much impact on performance. In practice, I
> think the required type is used very rarely. And there are other ways we
> can optimize for when a column is known to have no nulls.
>
> Thoughts?
>
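
To illustrate the UDF point above, here is a rough sketch of how a function with INTERNAL null handling ends up duplicated per input nullability. The annotations and holder classes are Drill's UDF API, but the function itself (is_positive) is invented for this example; if Drill vectors were always nullable, only the first variant would be needed.

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.BitHolder;
import org.apache.drill.exec.expr.holders.IntHolder;
import org.apache.drill.exec.expr.holders.NullableIntHolder;

public class IsPositiveFunctions {

  /** Variant for a nullable input; with nulls = INTERNAL it must check isSet itself. */
  @FunctionTemplate(name = "is_positive", scope = FunctionScope.SIMPLE,
      nulls = NullHandling.INTERNAL)
  public static class IsPositiveNullableInt implements DrillSimpleFunc {
    @Param NullableIntHolder in;
    @Output BitHolder out;

    public void setup() { }

    public void eval() {
      out.value = (in.isSet == 1 && in.value > 0) ? 1 : 0;
    }
  }

  /** Separate variant that is required today for a non-nullable (required) input. */
  @FunctionTemplate(name = "is_positive", scope = FunctionScope.SIMPLE,
      nulls = NullHandling.INTERNAL)
  public static class IsPositiveRequiredInt implements DrillSimpleFunc {
    @Param IntHolder in;
    @Output BitHolder out;

    public void setup() { }

    public void eval() {
      out.value = in.value > 0 ? 1 : 0;
    }
  }
}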


Re: Moving to HBase 1.1 [DRILL-4199]

2016-03-21 Thread Jacques Nadeau
+1

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 21, 2016 at 1:18 PM, Aditya  wrote:

> Hi,
>
> HBase has moved to 1.1 branch as their latest stable release[1] and since
> it is wire compatible with 0.98 releases, I'd like to propose that Drill
> updates its supported HBase release to 1.1.
>
> Essentially, it means that we update the HBase clients bundled with Drill
> distribution to latest stable version of 1.1 branch. I do not expect any
> code change.
>
> I have assigned DRILL-4199 to myself and unless someone has a reason to not
> to, I'd like to move to HBase 1.1 in Drill 1.7 release.
>
> aditya...
>
> [1] https://dist.apache.org/repos/dist/release/hbase/stable
> [2] https://issues.apache.org/jira/browse/DRILL-4199
>


[jira] [Created] (DRILL-4521) Drill doesn't correctly treat VARIANCE and STDDEV as two phase aggregates

2016-03-19 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4521:
-

 Summary: Drill doesn't correctly treat VARIANCE and STDDEV as two 
phase aggregates
 Key: DRILL-4521
 URL: https://issues.apache.org/jira/browse/DRILL-4521
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jacques Nadeau
Assignee: MinJi Kim


These are supposed to be synonyms with STDDEV_POP and VARIANCE_POP but they are 
handled differently. This causes the reduce aggregates rule to not reduce these 
and thus they are handled as single phase aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
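
For context on why these aggregates can run in two phases: population variance decomposes into partial aggregates (count, sum, sum of squares) that fragments can compute independently and a final phase can merge:

  VAR_POP(x)    = ( sum(x_i^2) - (sum(x_i))^2 / n ) / n
  STDDEV_POP(x) = sqrt( VAR_POP(x) )

The reduce-aggregates rule mentioned in the report performs this decomposition for the _POP forms; the bug is that the VARIANCE and STDDEV synonyms bypass it and therefore stay single phase.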


Re: [RESULT] [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-19 Thread Jacques Nadeau
Can you confirm that you also run the unit tests in JDK8?

thanks!

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 16, 2016 at 4:46 PM, Abhishek Girish 
wrote:

> I built Drill from source (Github - 1.6.0 branch
> <https://github.com/apache/drill/commits/1.6.0>). Deployed on a 4 node
> cluster.
>
> Drill gitCommitID: d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb | openjdk
> version "1.8.0_71" | CentOS 6.6 | MapR 5.0.0
>
> On Wed, Mar 16, 2016 at 4:21 PM, Jacques Nadeau 
> wrote:
>
> > That's good to hear. Was that based on the source tarball and a full
> build
> > and test or the binary tarball?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Wed, Mar 16, 2016 at 4:15 PM, Abhishek Girish <
> > abhishek.gir...@gmail.com>
> > wrote:
> >
> > > While my vote is non-binding, I thought I'll share that I ran a subset
> > (all
> > > but hive/hbase) of the Functional Regression tests in a Java 8
> > environment.
> > > Also verified some customer issues we previously had in this area.
> > >
> > > On Wed, Mar 16, 2016 at 2:25 PM, Jacques Nadeau 
> > > wrote:
> > >
> > > > Hey All,
> > > >
> > > > I just saw the 1.6 release notes state that Drill now supports JDK8.
> I
> > > > didn't do JDK8 validation as part of my vote as I didn't know this
> was
> > > > expected to work [1]. Which of the binding votes were based on
> > building &
> > > > testing the release on JDK8?
> > > >
> > > > thanks,
> > > > Jacques
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/DRILL-3488?focusedCommentId=15176563&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15176563
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Tue, Mar 15, 2016 at 8:17 AM, Parth Chandra 
> > > wrote:
> > > >
> > > > > *The vote* passes. Thanks everyone for your time. Final tally:
> > > > >
> > > > > 6x +1 (binding):  Parth, Aman, Aditya, Venki, Jacques, Jinfeng
> > > > >
> > > > > 6x +1 (non-binding) : Abdel Hakim, Sean, Norris, Abhishek,
> Sudheesh,
> > > Chun
> > > > >
> > > > > No -1s.
> > > > >
> > > > > I'll push the *release* artifacts and send an announcement once
> > > > propagated.
> > > > >
> > > > > Thanks,
> > > > > Parth
> > > > >
> > > > > On Mon, Mar 14, 2016 at 3:56 PM, Jinfeng Ni  >
> > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > - Download src tgz and do a full maven build on CentOS
> > > > > > - Run yelp tutorial queries.
> > > > > > - Verify query profiles on Web-UI
> > > > > > - Run couple of partition pruning related queries.
> > > > > >
> > > > > > All look good.
> > > > > >
> > > > > > Jinfeng
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 14, 2016 at 2:48 PM, Jacques Nadeau <
> > jacq...@dremio.com>
> > > > > > wrote:
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > - Download src tgz and build and test
> > > > > > > - Download binary tgz, test execution of a number of queries
> and
> > > > verify
> > > > > > > profiles
> > > > > > > - Enable socket level logging and confirm new planning phase +
> > time
> > > > > > logging
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
> > > > > > >
> > > > > > > On Mon, Mar 14, 2016 at 1:45 PM, Chun Chang <
> cch...@maprtech.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> +1 (non-binding)
> > > > > > >>
> >

Re: [RESULT] [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-19 Thread Jacques Nadeau
Hey All,

I just saw the 1.6 release notes state that Drill now supports JDK8. I
didn't do JDK8 validation as part of my vote as I didn't know this was
expected to work [1]. Which of the binding votes were based on building &
testing the release on JDK8?

thanks,
Jacques

[1]
https://issues.apache.org/jira/browse/DRILL-3488?focusedCommentId=15176563&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15176563


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 15, 2016 at 8:17 AM, Parth Chandra  wrote:

> *The vote* passes. Thanks everyone for your time. Final tally:
>
> 6x +1 (binding):  Parth, Aman, Aditya, Venki, Jacques, Jinfeng
>
> 6x +1 (non-binding) : Abdel Hakim, Sean, Norris, Abhishek, Sudheesh, Chun
>
> No -1s.
>
> I'll push the *release* artifacts and send an announcement once propagated.
>
> Thanks,
> Parth
>
> On Mon, Mar 14, 2016 at 3:56 PM, Jinfeng Ni  wrote:
>
> > +1 (binding)
> >
> > - Download src tgz and do a full maven build on CentOS
> > - Run yelp tutorial queries.
> > - Verify query profiles on Web-UI
> > - Run couple of partition pruning related queries.
> >
> > All look good.
> >
> > Jinfeng
> >
> >
> > On Mon, Mar 14, 2016 at 2:48 PM, Jacques Nadeau 
> > wrote:
> > > +1 (binding)
> > >
> > > - Download src tgz and build and test
> > > - Download binary tgz, test execution of a number of queries and verify
> > > profiles
> > > - Enable socket level logging and confirm new planning phase + time
> > logging
> > >
> > >
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Mon, Mar 14, 2016 at 1:45 PM, Chun Chang 
> wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> -ran functional and advanced automation
> > >>
> > >> On Mon, Mar 14, 2016 at 1:09 PM, Sudheesh Katkam <
> skat...@maprtech.com>
> > >> wrote:
> > >>
> > >> > +1 (non-binding)
> > >> >
> > >> > * downloaded and built from source tar-ball; ran unit tests
> > successfully
> > >> > on Ubuntu
> > >> > * ran simple queries (including cancellations) in embedded mode on
> > Mac;
> > >> > verified states in web UI
> > >> > * ran simple queries (including cancellations) on a 3 node cluster;
> > >> > verified states in web UI
> > >> >
> > >> > * tested maven artifacts (drill-jdbc) using a sample application <
> > >> > https://github.com/sudheeshkatkam/drill-example>.
> > >> > This application is based on DrillClient, and not JDBC API. I had to
> > make
> > >> > two changes for this application to work (i.e. not backward
> > compatible).
> > >> > However, these changes are not related to this release (commits
> > >> > responsible: 1fde9bb <
> > >> >
> > >>
> >
> https://github.com/apache/drill/commit/1fde9bb1505f04e0b0a1afb542a1aa5dfd20ed1b
> > >> >
> > >> > and de00881 <
> > >> >
> > >>
> >
> https://github.com/apache/drill/commit/de008810c815e46e6f6e5d13ad0b9a23e705b13a
> > >> >).
> > >> > We should have a conversation about what constitutes public API and
> > >> changes
> > >> > to this API on a separate thread.
> > >> >
> > >> > Thank you,
> > >> > Sudheesh
> > >> >
> > >> > > On Mar 14, 2016, at 12:04 PM, Abhishek Girish <
> > >> abhishek.gir...@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > +1 (non-binding)
> > >> > >
> > >> > > - Tested Drill in distributed mode (built with MapR profile).
> > >> > > - Ran functional tests from Drill-Test-Framework [1]
> > >> > > - Tested Web UI (basic sanity)
> > >> > > - Tested Sqlline
> > >> > >
> > >> > > Looks good.
> > >> > >
> > >> > >
> > >> > > [1] https://github.com/mapr/drill-test-framework
> > >> > >
> > >> > > On Mon, Mar 14, 2016 at 11:23 AM, Venki Korukanti <
> > >> > venki.koruka...@gmail.com
> > >> > >> wrote:
> > >> > >
> > >> > >> +1
> > >> > >&g

Re: Optimizing SUM(1) query

2016-03-18 Thread Jacques Nadeau
I don't think Julian is saying it does this, I think he is saying it
should. I agree. (This actually is very common Tableau query pattern among
other things.)

Sudip, do you want to open an enhancement JIRA where we rewrite SUM(1) to
COUNT(1)? Then our existing count optimizations can take over.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 16, 2016 at 8:38 AM, Sudip Mukherjee 
wrote:

> I don't see that Drill is transforming the query. I tried with a CSV file.
> Please let me know if I am missing something.
>
> 00-00Screen : rowType = RecordType(INTEGER EXPR$0): rowcount = 1.0,
> cumulative cost = {3.1 rows, 17.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
> = 260
> 00-01  Project(EXPR$0=[$0]) : rowType = RecordType(INTEGER EXPR$0):
> rowcount = 1.0, cumulative cost = {3.0 rows, 17.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 259
> 00-02StreamAgg(group=[{}], EXPR$0=[SUM($0)]) : rowType =
> RecordType(INTEGER EXPR$0): rowcount = 1.0, cumulative cost = {3.0 rows,
> 17.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 258
> 00-03  Project($f0=[1]) : rowType = RecordType(INTEGER $f0):
> rowcount = 1.0, cumulative cost = {2.0 rows, 5.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 257
> 00-04Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/C:/data/company.csv, numFiles=1, columns=[`*`],
> files=[file:/C:/data/company.csv]]]) : rowType = RecordType(): rowcount =
> 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 256
>
> Thanks,
> Sudip
>
> -Original Message-
> From: Julian Hyde [mailto:jh...@apache.org]
> Sent: 16 March 2016 AM 12:50
> To: dev@drill.apache.org
> Subject: Re: Optimizing SUM(1) query
>
> Is there any reason why Drill cannot transform SUM(1) to COUNT(*) at an
> early stage (i.e. using a logical optimization rule) so that this
> optimization does not need to be done for each engine?
>
> > On Mar 15, 2016, at 5:29 AM, Sudip Mukherjee 
> wrote:
> >
> > I was trying to have an Optimizer rule for the solr storage plugin that
> I'm working on for this query. I'm trying to use SOLR field stats for this, so
> that the query is faster.
> > Getting the below exception while transforming project to scan. Could
> you please advise?
> >
> >
> > 2016-03-15 08:20:35,149 [291801ee-33fc-064d-7aff-18391f15ae0e:foreman]
> DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - Drill Logical :
> > DrillScreenRel: rowcount = 1.0, cumulative cost = {60.1 rows, 320.1
> > cpu, 0.0 io, 0.0 network, 176.0 memory}, id = 49
> >  DrillProjectRel(EXPR$0=[$0]): rowcount = 1.0, cumulative cost = {60.0
> rows, 320.0 cpu, 0.0 io, 0.0 network, 176.0 memory}, id = 48
> >DrillAggregateRel(group=[{}], EXPR$0=[SUM($0)]): rowcount = 1.0,
> cumulative cost = {60.0 rows, 320.0 cpu, 0.0 io, 0.0 network, 176.0
> memory}, id = 46
> >  DrillProjectRel($f0=[1]): rowcount = 20.0, cumulative cost = {40.0
> rows, 80.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 44
> >DrillScanRel(table=[[solr, ANalert_494]],
> > groupscan=[SolrGroupScan [SolrScanSpec=SolrScanSpec
> > [solrCoreName=ANalert_494, solrUrl=http://localhost:2/solr/
> > filter=[], solrDocFetchCount=-1, aggreegation=[]], columns=[`*`]]]):
> > rowcount = 20.0, cumulative cost = {20.0 rows, 0.0 cpu, 0.0 io, 0.0
> > network, 0.0 memory}, id = 26
> >
> > 2016-03-15 08:20:35,201 [291801ee-33fc-064d-7aff-18391f15ae0e:foreman]
> > DEBUG o.a.drill.exec.work.foreman.Foreman -
> > 291801ee-33fc-064d-7aff-18391f15ae0e: State change requested PENDING
> > --> FAILED
> > org.apache.drill.exec.work.foreman.ForemanException: Unexpected
> exception during fragment initialization: index (0) must be less than size
> (0)
> >   at
> org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:255)
> [drill-java-exec.jar:1.4.0]
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source) [na:1.8.0_65]
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source) [na:1.8.0_65]
> >   at java.lang.Thread.run(Unknown Source) [na:1.8.0_65] Caused by:
> > java.lang.IndexOutOfBoundsException: index (0) must be less than size (0)
> >   at
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:305)
> ~[com.google.guava-guava.jar:na]
> >   at
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:284)
> ~[com.google.guava-guava.jar:na]
> >   at
> com.google.common.collect.EmptyImmutableList.get(EmptyImmutableList.java:80)
> ~[com.google.guava-guava.jar:na]
> >   at org.apache.calcite.util.Pair$6.get(Pair.j
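
A rough sketch of the logical rewrite suggested above, turning SUM over a projected literal 1 into a COUNT with no arguments. This is hypothetical illustration code against Calcite's rule API (the rule name is invented, and the exact AggregateCall/copy factory signatures vary between Calcite versions), not an existing Drill or Calcite rule.

import java.util.ArrayList;
import java.util.List;

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.core.Aggregate;
import org.apache.calcite.rel.core.AggregateCall;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rex.RexLiteral;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.SqlKind;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;

import com.google.common.collect.ImmutableList;

/** Hypothetical rule: rewrites SUM over a projected literal 1 into COUNT(). */
public class SumOneToCountRule extends RelOptRule {
  public static final SumOneToCountRule INSTANCE = new SumOneToCountRule();

  private SumOneToCountRule() {
    super(operand(Aggregate.class, operand(Project.class, any())), "SumOneToCountRule");
  }

  @Override public void onMatch(RelOptRuleCall call) {
    final Aggregate agg = call.rel(0);
    final Project project = call.rel(1);
    final List<AggregateCall> newCalls = new ArrayList<>();
    boolean rewritten = false;
    for (AggregateCall c : agg.getAggCallList()) {
      if (c.getAggregation().getKind() == SqlKind.SUM
          && c.getArgList().size() == 1
          && isLiteralOne(project.getProjects().get(c.getArgList().get(0)))) {
        // COUNT with an empty argument list is COUNT(*); keep the original type and
        // name so the aggregate's row type is unchanged.
        newCalls.add(AggregateCall.create(SqlStdOperatorTable.COUNT, false,
            ImmutableList.<Integer>of(), -1, c.getType(), c.getName()));
        rewritten = true;
      } else {
        newCalls.add(c);
      }
    }
    if (rewritten) {
      call.transformTo(agg.copy(agg.getTraitSet(), agg.getInput(), agg.indicator,
          agg.getGroupSet(), agg.getGroupSets(), newCalls));
    }
  }

  private static boolean isLiteralOne(RexNode e) {
    return e instanceof RexLiteral
        && ((RexLiteral) e).getValue() != null
        && "1".equals(((RexLiteral) e).getValue().toString());
  }
}

With a rule along these lines registered during logical planning, the existing count optimizations mentioned above could then take over.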

Re: [RESULT] [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-18 Thread Jacques Nadeau
That's good to hear. Was that based on the source tarball and a full build
and test or the binary tarball?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 16, 2016 at 4:15 PM, Abhishek Girish 
wrote:

> While my vote is non-binding, I thought I'll share that I ran a subset (all
> but hive/hbase) of the Functional Regression tests in a Java 8 environment.
> Also verified some customer issues we previously had in this area.
>
> On Wed, Mar 16, 2016 at 2:25 PM, Jacques Nadeau 
> wrote:
>
> > Hey All,
> >
> > I just saw the 1.6 release notes state that Drill now supports JDK8. I
> > didn't do JDK8 validation as part of my vote as I didn't know this was
> > expected to work [1]. Which of the binding votes were based on building &
> > testing the release on JDK8?
> >
> > thanks,
> > Jacques
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/browse/DRILL-3488?focusedCommentId=15176563&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15176563
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 15, 2016 at 8:17 AM, Parth Chandra 
> wrote:
> >
> > > *The vote* passes. Thanks everyone for your time. Final tally:
> > >
> > > 6x +1 (binding):  Parth, Aman, Aditya, Venki, Jacques, Jinfeng
> > >
> > > 6x +1 (non-binding) : Abdel Hakim, Sean, Norris, Abhishek, Sudheesh,
> Chun
> > >
> > > No -1s.
> > >
> > > I'll push the *release* artifacts and send an announcement once
> > propagated.
> > >
> > > Thanks,
> > > Parth
> > >
> > > On Mon, Mar 14, 2016 at 3:56 PM, Jinfeng Ni 
> > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > - Download src tgz and do a full maven build on CentOS
> > > > - Run yelp tutorial queries.
> > > > - Verify query profiles on Web-UI
> > > > - Run couple of partition pruning related queries.
> > > >
> > > > All look good.
> > > >
> > > > Jinfeng
> > > >
> > > >
> > > > On Mon, Mar 14, 2016 at 2:48 PM, Jacques Nadeau 
> > > > wrote:
> > > > > +1 (binding)
> > > > >
> > > > > - Download src tgz and build and test
> > > > > - Download binary tgz, test execution of a number of queries and
> > verify
> > > > > profiles
> > > > > - Enable socket level logging and confirm new planning phase + time
> > > > logging
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Mon, Mar 14, 2016 at 1:45 PM, Chun Chang 
> > > wrote:
> > > > >
> > > > >> +1 (non-binding)
> > > > >>
> > > > >> -ran functional and advanced automation
> > > > >>
> > > > >> On Mon, Mar 14, 2016 at 1:09 PM, Sudheesh Katkam <
> > > skat...@maprtech.com>
> > > > >> wrote:
> > > > >>
> > > > >> > +1 (non-binding)
> > > > >> >
> > > > >> > * downloaded and built from source tar-ball; ran unit tests
> > > > successfully
> > > > >> > on Ubuntu
> > > > >> > * ran simple queries (including cancellations) in embedded mode
> on
> > > > Mac;
> > > > >> > verified states in web UI
> > > > >> > * ran simple queries (including cancellations) on a 3 node
> > cluster;
> > > > >> > verified states in web UI
> > > > >> >
> > > > >> > * tested maven artifacts (drill-jdbc) using a sample
> application <
> > > > >> > https://github.com/sudheeshkatkam/drill-example>.
> > > > >> > This application is based on DrillClient, and not JDBC API. I
> had
> > to
> > > > make
> > > > >> > two changes for this application to work (i.e. not backward
> > > > compatible).
> > > > >> > However, these changes are not related to this release (commits
> > > > >> > responsible: 1fde9bb <
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/drill/commit/1fde9

Re: Getting back on Calcite master: only a few steps left

2016-03-18 Thread Jacques Nadeau
Yes, I'm trying to work through the failing unit tests.

I merged your change.

In the future you can pick compare & create pull request on your branch and
then change the target repo from apache to mine.

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 16, 2016 at 4:39 PM, Aman Sinha  wrote:

> Jacques, I wasn't sure how to create a pull request against your branch;
>  for  CALCITE-1108 you can cherry-pick from here:
>
> https://github.com/amansinha100/incubator-calcite/commits/calcite-drill-2
>
> BTW,  there are unit test failures on your branch which I assume is
> expected for now ?
>
> On Tue, Mar 15, 2016 at 6:56 PM, Jacques Nadeau 
> wrote:
>
> > Why don't you guys propose patches for my branch and I'll incorporate
> until
> > we get to a good state. Once we feel good about it, I'll clean up the
> > revision history.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 15, 2016 at 11:01 AM, Jinfeng Ni 
> > wrote:
> >
> > > I'll add test for CALCITE-1150.
> > >
> > >
> > >
> > > On Tue, Mar 15, 2016 at 9:45 AM, Sudheesh Katkam  >
> > > wrote:
> > > > CALCITE-1149 [Extend CALCITE-845] <
> > >
> >
> https://github.com/mapr/incubator-calcite/commit/bd73728a8297e15331ae956096eab0e15b3f
> > >
> > > does not need to be committed into Calcite. DRILL-4372 <
> > > https://issues.apache.org/jira/browse/DRILL-4372> supersedes that
> patch.
> > > >
> > > > I will add a test case for CALCITE-1151.
> > > >
> > > > Thank you,
> > > > Sudheesh
> > > >
> > > >> On Mar 15, 2016, at 9:04 AM, Aman Sinha 
> wrote:
> > > >>
> > > >> I'll add a test for CALCITE-1108.   For 1105 I am not yet sure but
> > will
> > > >> look through the old drill commits to see what test was added there.
> > > >>
> > > >> On Sun, Mar 13, 2016 at 11:15 PM, Minji Kim 
> wrote:
> > > >>
> > > >>> I will add more test cases to CALCITE-1148 in addition to the ones
> > > already
> > > >>> there.  I noticed a few more problems while testing the patch
> against
> > > drill
> > > >>> master.  I am still working through these issues, so I will add
> more
> > > test
> > > >>> cases as I find/fix them.  -Minji
> > > >>>
> > > >>>
> > > >>> On 3/13/16 10:54 PM, Jacques Nadeau wrote:
> > > >>>
> > > >>>> Hey All,
> > > >>>>
> > > >>>> I've been working on rebasing and tracking all the necessary
> commits
> > > that
> > > >>>> are on the Drill Calcite fork so that we can get back onto master.
> > The
> > > >>>> current working branch is here: [1]. It includes the following
> > commits
> > > >>>>
> > > >>>> [CALCITE-1148] Fix RelTrait conversion (e.g. distribution,
> > collation),
> > > >>>> added test cases. (Minji Kim) #77def4a
> > > >>>> [CALCITE-991] Create separate FunctionCategories for table
> functions
> > > and
> > > >>>> macros (Julien Le Dem) #b1c203d
> > > >>>> [CALCITE-1149] Derive AVG’s return type by a customizable policy
> > > (Sudheesh
> > > >>>> Katkam) #18882cd
> > > >>>> [CALCITE-1151] Overriding the SqlSpecialOperator#createCall method
> > > given
> > > >>>> the usage by CompoundIdentifierConverter (Sudheesh Katkam)
> #2320c7f
> > > >>>> [CALCITE-1108] Don't use 'SumEmptyIsZero' (SUM0) window aggregate
> > > until
> > > >>>> CALCITE-777 is fixed. (Aman Sinha) #13466fa
> > > >>>> [CALCITE-1107] Make SqlSumEmptyIsZeroAggFunction constructor
> public.
> > > >>>> (Jinfeng Ni) #b6c3178
> > > >>>> [CALCITE-1106] Expose Constructor for ProjectJoinTransposeRule.
> > (Aman
> > > >>>> Sinha) #d169c37
> > > >>>> [CALCITE-1105] Add return type-inference strategy for arithmetic
> > > operators
> > > >>>> when one of the arguments is ANY type. (Aman Sinha) #df818c9
> > > >>>> [CALCITE-1150] Add DynamicRecordType and the concept of unresolved
> > > star
> > > >>&

Re: [RESULT] [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-18 Thread Jacques Nadeau
It seems like no one ran the unit tests against JDK8 for this release.

If that is the case, I think we should remove the statement that Drill
supports JDK8.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 16, 2016 at 11:46 PM, Abhishek Girish  wrote:

> @Jacques, No I hadn't run unit tests.
>
> @Khurram, Functional tests were run and no issues were found. But like I
> mentioned, I skipped hive & hbase tests.
>
> On Wed, Mar 16, 2016 at 10:37 PM, Khurram Faraaz 
> wrote:
>
> > Abhishek, you ran Functional Regression tests in a Java 8 environment.
> > Did all Functional tests pass ? Did you see any failures ?
> >
> > On Thu, Mar 17, 2016 at 11:02 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com
> > > wrote:
> >
> > > We didn't fix DRILL-4333, did we? So I would expect some unit tests to
> > > fail in JDK8
> > >
> > > On Thu, Mar 17, 2016 at 1:10 AM, Jacques Nadeau 
> > > wrote:
> > >
> > > > Can you confirm that you also run the unit tests in JDK8?
> > > >
> > > > thanks!
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Wed, Mar 16, 2016 at 4:46 PM, Abhishek Girish <
> > > > abhishek.gir...@gmail.com>
> > > > wrote:
> > > >
> > > > > I built Drill from source (Github - 1.6.0 branch
> > > > > <https://github.com/apache/drill/commits/1.6.0>). Deployed on a 4
> > node
> > > > > cluster.
> > > > >
> > > > > Drill gitCommitID: d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb |
> openjdk
> > > > > version "1.8.0_71" | CentOS 6.6 | MapR 5.0.0
> > > > >
> > > > > On Wed, Mar 16, 2016 at 4:21 PM, Jacques Nadeau <
> jacq...@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > That's good to hear. Was that based on the source tarball and a
> > full
> > > > > build
> > > > > > and test or the binary tarball?
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Wed, Mar 16, 2016 at 4:15 PM, Abhishek Girish <
> > > > > > abhishek.gir...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > While my vote is non-binding, I thought I'll share that I ran a
> > > > subset
> > > > > > (all
> > > > > > > but hive/hbase) of the Functional Regression tests in a Java 8
> > > > > > environment.
> > > > > > > Also verified some customer issues we previously had in this
> > area.
> > > > > > >
> > > > > > > On Wed, Mar 16, 2016 at 2:25 PM, Jacques Nadeau <
> > > jacq...@dremio.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey All,
> > > > > > > >
> > > > > > > > I just saw the 1.6 release notes state that Drill now
> supports
> > > > JDK8.
> > > > > I
> > > > > > > > didn't do JDK8 validation as part of my vote as I didn't know
> > > this
> > > > > was
> > > > > > > > expected to work [1]. Which of the binding votes were based
> on
> > > > > > building &
> > > > > > > > testing the release on JDK8?
> > > > > > > >
> > > > > > > > thanks,
> > > > > > > > Jacques
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/DRILL-3488?focusedCommentId=15176563&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15176563
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jacques Nadeau
> > > > > > > > CTO and Co-Founder, Dremio
> > > > > > > >
> > > > > > > > On Tue, Mar 15, 2016 at 8:17 AM, Parth Chandra <
&

Re: Getting back on Calcite master: only a few steps left

2016-03-15 Thread Jacques Nadeau
Why don't you guys propose patches for my branch and I'll incorporate until
we get to a good state. Once we feel good about it, I'll clean up the
revision history.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 15, 2016 at 11:01 AM, Jinfeng Ni  wrote:

> I'll add test for CALCITE-1150.
>
>
>
> On Tue, Mar 15, 2016 at 9:45 AM, Sudheesh Katkam 
> wrote:
> > CALCITE-1149 [Extend CALCITE-845] <
> https://github.com/mapr/incubator-calcite/commit/bd73728a8297e15331ae956096eab0e15b3f>
> does not need to be committed into Calcite. DRILL-4372 <
> https://issues.apache.org/jira/browse/DRILL-4372> supersedes that patch.
> >
> > I will add a test case for CALCITE-1151.
> >
> > Thank you,
> > Sudheesh
> >
> >> On Mar 15, 2016, at 9:04 AM, Aman Sinha  wrote:
> >>
> >> I'll add a test for CALCITE-1108.   For 1105 I am not yet sure but will
> >> look through the old drill commits to see what test was added there.
> >>
> >> On Sun, Mar 13, 2016 at 11:15 PM, Minji Kim  wrote:
> >>
> >>> I will add more test cases to CALCITE-1148 in addition to the ones
> already
> >>> there.  I noticed a few more problems while testing the patch against
> drill
> >>> master.  I am still working through these issues, so I will add more
> test
> >>> cases as I find/fix them.  -Minji
> >>>
> >>>
> >>> On 3/13/16 10:54 PM, Jacques Nadeau wrote:
> >>>
> >>>> Hey All,
> >>>>
> >>>> I've been working on rebasing and tracking all the necessary commits
> that
> >>>> are on the Drill Calcite fork so that we can get back onto master. The
> >>>> current working branch is here: [1]. It includes the following commits
> >>>>
> >>>> [CALCITE-1148] Fix RelTrait conversion (e.g. distribution, collation),
> >>>> added test cases. (Minji Kim) #77def4a
> >>>> [CALCITE-991] Create separate FunctionCategories for table functions
> and
> >>>> macros (Julien Le Dem) #b1c203d
> >>>> [CALCITE-1149] Derive AVG’s return type by a customizable policy
> (Sudheesh
> >>>> Katkam) #18882cd
> >>>> [CALCITE-1151] Overriding the SqlSpecialOperator#createCall method
> given
> >>>> the usage by CompoundIdentifierConverter (Sudheesh Katkam) #2320c7f
> >>>> [CALCITE-1108] Don't use 'SumEmptyIsZero' (SUM0) window aggregate
> until
> >>>> CALCITE-777 is fixed. (Aman Sinha) #13466fa
> >>>> [CALCITE-1107] Make SqlSumEmptyIsZeroAggFunction constructor public.
> >>>> (Jinfeng Ni) #b6c3178
> >>>> [CALCITE-1106] Expose Constructor for ProjectJoinTransposeRule. (Aman
> >>>> Sinha) #d169c37
> >>>> [CALCITE-1105] Add return type-inference strategy for arithmetic
> operators
> >>>> when one of the arguments is ANY type. (Aman Sinha) #df818c9
> >>>> [CALCITE-1150] Add DynamicRecordType and the concept of unresolved
> star
> >>>> (Jinfeng Ni) #29c7771
> >>>> [CALCITE-1152] Small ANY type fixes (Mehant Baid) #31efdda
> >>>> [CALCITE-528] Ensure uniquification is done in a case aware way
> according
> >>>> to type system and catalog policies. (Jacques Nadeau) #5a3d854
> >>>>
> >>>> Many commits, listed below, don't have tests right now so I'd like to
> get
> >>>> people to raise their hand and work on tests for each of the commits.
> >>>>
> >>>> [CALCITE-991] Create separate FunctionCategories for table functions
> and
> >>>> macros (Julien Le Dem) #b1c203d
> >>>> [CALCITE-1149] Derive AVG’s return type by a customizable policy
> (Sudheesh
> >>>> Katkam) #18882cd
> >>>> [CALCITE-1151] Overriding the SqlSpecialOperator#createCall method
> given
> >>>> the usage by CompoundIdentifierConverter (Sudheesh Katkam) #2320c7f
> >>>> [CALCITE-1108] Don't use 'SumEmptyIsZero' (SUM0) window aggregate
> until
> >>>> CALCITE-777 is fixed. (Aman Sinha) #13466fa
> >>>> [CALCITE-1105] Add return type-inference strategy for arithmetic
> operators
> >>>> when one of the arguments is ANY type. (Aman Sinha) #df818c9
> >>>> [CALCITE-1150] Add DynamicRecordType and the concept of unresolved
> star
> >>>> (Jinfeng Ni) #29c7771
> >>>> [CALCITE-1152] Small ANY type fixes (Mehant Baid) #31efdda
> &g

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
I believe it also suffers from the same issues.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 4:29 PM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> How is this handled for the MongoDB storage plugin, which I believe is a
> case-sensitive DB as well?
>
> On Mon, Mar 14, 2016 at 4:27 PM, Jacques Nadeau 
> wrote:
>
> > I don't think it is that simple since there are some types of things that
> > we can't pushdown that will cause inconsistent results.
> >
> > For example, assuming that all values of x are positive, the following
> two
> > queries should return the same result
> >
> > select * from hbase where x = 5
> > select * from hbase where abs(x) = 5
> >
> > However, if the field x is sometimes 'x' and sometimes 'X', we're going to
> > get different results between the first query and the second. That is why I
> > think we need to guarantee that even when optimization rules fail, we have
> > the same plan meaning. In essence, all plans should be valid. If you get
> to
> > a place where a rule changes the data, then the original plan was
> > effectively invalid.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni 
> wrote:
> >
> > > Project pushdown should always happen. If you see project pushdown
> > > does not happen for your HBase query, then it's a bug.
> > >
> > > However, if you submit two physical plans, one with project pushdown,
> > > another one without project pushdown, but they return different
> > > results for HBase query. I'll not call this a bug.
> > >
> > >
> > >
> > > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau 
> > > wrote:
> > > > Agree with Zelaine, plan changes/optimizations shouldn't change
> > results.
> > > > This is a bug.
> > > >
> > > > Drill is focused on being case-insensitive, case-preserving. Each
> > storage
> > > > plugin implements its own case sensitivity policy when working with
> > > > columns/fields and should be documented. It isn't practical to make
> > HBase
> > > > case-insensitive, so it should behave case-sensitively. DFS formats
> (as
> > > > opposed to HBase) are entirely under Drill's control and thus target
> > > > case-insensitive, case-preserving operation.
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni 
> > > wrote:
> > > >
> > > >> Abhishek
> > > >>
> > > >> Great question. Here is what I understand regarding the case
> sensitive
> > > >> policy.
> > > >>
> > > >> Drill's case sensitivity policy (case insensitive and case
> preserving)
> > > >> applies to the execution engine in Drill; it does not enforce the
> case
> > > >> sensitivity policy to all the storage plugin. A storage plugin could
> > > >> decide and implement its own policy.
> > > >>
> > > >> Why would the pushdown impact the case sensitivity when query HBase?
> > > >> Without project pushdown, HBase storage plugin will return all the
> > > >> data, and it's up to Drill's execution Project operator to apply the
> > > >> case insensitive policy.  With the project pushdown, Drill will pass
> > > >> the list of column names to HBase storage plugin, and HBase decides
> to
> > > >> apply its case sensitivity policy when scanning the data.
> > > >>
> > > >> Adding an option to make a case-sensitive storage plugin honor the
> > > >> case-insensitive policy seems to be a good idea. The question is whether
> > > >> the underlying storage (like HBase) will support such a mode.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong 
> > > wrote:
> > > >> > Abhishek,
> > > >> >
> > > >> > I guess you're arguing that Drill's current behavior of honoring
> the
> > > case
> > > >> > sensitive nature of the underlying data source (in this case,
> HBase
> > > and
> > > >

Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
I don't think it is that simple since there are some types of things that
we can't pushdown that will cause inconsistent results.

For example, assuming that all values of x are positive, the following two
queries should return the same result

select * from hbase where x = 5
select * from hbase where abs(x) = 5

However, if the field x is sometimes 'x' and sometimes 'X', we're going to
get different results between the first query and the second. That is why I
think we need to guarantee that even when optimization rules fail, we have
the same plan meaning. In essence, all plans should be valid. If you get to
a place where a rule changes the data, then the original plan was
effectively invalid.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni  wrote:

> Project pushdown should always happen. If you see project pushdown
> does not happen for your HBase query, then it's a bug.
>
> However, if you submit two physical plans, one with project pushdown,
> another one without project pushdown, but they return different
> results for HBase query. I'll not call this a bug.
>
>
>
> On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau 
> wrote:
> > Agree with Zelaine, plan changes/optimizations shouldn't change results.
> > This is a bug.
> >
> > Drill is focused on being case-insensitive, case-preserving. Each storage
> > plugin implements its own case sensitivity policy when working with
> > columns/fields and should be documented. It isn't practical to make HBase
> > case-insensitive, so it should behave case-sensitively. DFS formats (as
> > opposed to HBase) are entirely under Drill's control and thus target
> > case-insensitive, case-preserving operation.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni 
> wrote:
> >
> >> Abhishek
> >>
> >> Great question. Here is what I understand regarding the case sensitive
> >> policy.
> >>
> >> Drill's case sensitivity policy (case insensitive and case preserving)
> >> applies to the execution engine in Drill; it does not enforce the case
> >> sensitivity policy to all the storage plugin. A storage plugin could
> >> decide and implement its own policy.
> >>
> >> Why would the pushdown impact the case sensitivity when query HBase?
> >> Without project pushdown, HBase storage plugin will return all the
> >> data, and it's up to Drill's execution Project operator to apply the
> >> case insensitive policy.  With the project pushdown, Drill will pass
> >> the list of column names to HBase storage plugin, and HBase decides to
> >> apply its case sensitivity policy when scanning the data.
> >>
> >> Adding an option to make a case-sensitive storage plugin honor the
> >> case-insensitive policy seems to be a good idea. The question is whether
> >> the underlying storage (like HBase) will support such a mode.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong 
> wrote:
> >> > Abhishek,
> >> >
> >> > I guess you're arguing that Drill's current behavior of honoring the
> case
> >> > sensitive nature of the underlying data source (in this case, HBase
> and
> >> > MapR-DB) will be confusing for Drill users who are accustomed to
> Drill's
> >> > case insensitive behavior.
> >> >
> >> > I can see arguments both ways.
> >> >
> >> > But the part I think is confusing is that the behavior differs
> depending
> >> on
> >> > whether or not projections and filters are pushed down to the data
> >> source.
> >> > If the push down is done, then the behavior is case sensitive
> >> > (corresponding to the data source).  But if pushdown doesn't happen,
> then
> >> > the behavior is case insensitive.  That difference seems inconsistent
> and
> >> > undesirable -- unless you argue that there are instances where you
> would
> >> > want one behavior vs the other.  But it seems like that should be
> >> > orthogonal and separate from whether pushdowns are applied.
> >> >
> >> > -- Zelaine
> >> >
> >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
> >> wrote:
> >> >
> >> >> Hello all,
> >> >>
> >> >> As I understand, Drill by design is case-insensitive, w.r.t colum

Calcite: Trait propagation using relset iteration versus remove extraneous trait creation

2016-03-14 Thread Jacques Nadeau
Hey All,

I've been thinking about the SubsetTransformer pattern [1] that we use in
Drill to ensure trait propagation. It was discussed here in Calcite [2]

Julian felt that the correct solution (and the patch he ultimately
applied) was to use a create and then remove behavior. Take a look at his
revision to my test here [3] where he adds the SortRemoveRule in order to
remove an extraneous Sort operation.

It seems like we need to either introduce a new mechanism in Calcite to
accomplish this or we need to adopt the removal behavior. (I also believe
there are a small set of situations where we insert distribution for
parallelization purposes as opposed to a requirement for a particular
operation... we'll need to determine how those work and figure out how to
express correctly in this removal pattern.)

Thoughts?

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/SubsetTransformer.java
[2] https://issues.apache.org/jira/browse/CALCITE-606
[3]
https://github.com/julianhyde/calcite/commit/fb203dc4b9aea89bfed839c22ae3e285044df400#diff-9494b27dde1061ef95e3853cb6222b5bR103
--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Working with Case-Sensitive Data-sources

2016-03-14 Thread Jacques Nadeau
Agree with Zelaine, plan changes/optimizations shouldn't change results.
This is a bug.

Drill is focused on being case-insensitive, case-preserving. Each storage
plugin implements its own case sensitivity policy when working with
columns/fields and should be documented. It isn't practical to make HBase
case-insensitive, so it should behave case-sensitively. DFS formats (as
opposed to HBase) are entirely under Drill's control and thus target
case-insensitive, case-preserving operation.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni  wrote:

> Abhishek
>
> Great question. Here is what I understand regarding the case sensitive
> policy.
>
> Drill's case sensitivity policy (case insensitive and case preserving)
> applies to the execution engine in Drill; it does not enforce the case
> sensitivity policy to all the storage plugin. A storage plugin could
> decide and implement its own policy.
>
> Why would the pushdown impact the case sensitivity when query HBase?
> Without project pushdown, HBase storage plugin will return all the
> data, and it's up to Drill's execution Project operator to apply the
> case insensitive policy.  With the project pushdown, Drill will pass
> the list of column names to HBase storage plugin, and HBase decides to
> apply its case sensitivity policy when scanning the data.
>
> Adding an option to make a case-sensitive storage plugin honor the
> case-insensitive policy seems to be a good idea. The question is whether
> the underlying storage (like HBase) will support such a mode.
>
>
>
>
>
>
> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong  wrote:
> > Abhishek,
> >
> > I guess you're arguing that Drill's current behavior of honoring the case
> > sensitive nature of the underlying data source (in this case, HBase and
> > MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
> > case insensitive behavior.
> >
> > I can see arguments both ways.
> >
> > But the part I think is confusing is that the behavior differs depending
> on
> > whether or not projections and filters are pushed down to the data
> source.
> > If the push down is done, then the behavior is case sensitive
> > (corresponding to the data source).  But if pushdown doesn't happen, then
> > the behavior is case insensitive.  That difference seems inconsistent and
> > undesirable -- unless you argue that there are instances where you would
> > want one behavior vs the other.  But it seems like that should be
> > orthogonal and separate from whether pushdowns are applied.
> >
> > -- Zelaine
> >
> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish 
> wrote:
> >
> >> Hello all,
> >>
> >> As I understand, Drill by design is case-insensitive, w.r.t column names
> >> within a table or file [1]. While this provides great flexibility and
> works
> >> well with many data-sources, there are issues when working with
> >> case-sensitive data-sources such as HBase / MapR-DB.
> >>
> >> Consider the following JSON file:
> >>
> >> {"_id": "ID1",
> >>  *"Name"* : "ABC",
> >>  "Age" : "25",
> >>  "Phone" : null
> >> }
> >> {"_id": "ID2",
> >>  *"name"* : "PQR",
> >>  "Age" : "30",
> >>  "Phone" : "408-123-456"
> >> }
> >> {"_id": "ID3",
> >>  *"NAME"* : "XYZ",
> >>  "Phone" : ""
> >> }
> >>
> >> Note that the case of the name field within the JSON file is of
> mixed-case.
> >>
> >> From Drill, while querying the JSON file directly (or corresponding
> content
> >> in Parquet or Text formats), we get results which we as Drill users have
> >> come to expect:
> >>
> >> > select NAME from mfs.`/tmp/json/a.json`;
> >> +---+
> >> | NAME  |
> >> +---+
> >> | ABC   |
> >> | PQR   |
> >> | XYZ   |
> >> +---+
> >>
> >>
> >> However, while querying a case-sensitive datasource (*with pushdown
> >> enabled*)
> >> the following results are returned. The case provided in the query text
> is
> >> honored and would determine the results. This could come as a *slight
> >> surprise to certain Drill users* exploring/migrating to new Databases
> >> (using new Storage / Format

Re: [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-14 Thread Jacques Nadeau
+1 (binding)

- Download src tgz and build and test
- Download binary tgz, test execution of a number of queries and verify
profiles
- Enable socket level logging and confirm new planning phase + time logging




--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 1:45 PM, Chun Chang  wrote:

> +1 (non-binding)
>
> -ran functional and advanced automation
>
> On Mon, Mar 14, 2016 at 1:09 PM, Sudheesh Katkam 
> wrote:
>
> > +1 (non-binding)
> >
> > * downloaded and built from source tar-ball; ran unit tests successfully
> > on Ubuntu
> > * ran simple queries (including cancellations) in embedded mode on Mac;
> > verified states in web UI
> > * ran simple queries (including cancellations) on a 3 node cluster;
> > verified states in web UI
> >
> > * tested maven artifacts (drill-jdbc) using a sample application <
> > https://github.com/sudheeshkatkam/drill-example>.
> > This application is based on DrillClient, and not JDBC API. I had to make
> > two changes for this application to work (i.e. not backward compatible).
> > However, these changes are not related to this release (commits
> > responsible: 1fde9bb <
> >
> https://github.com/apache/drill/commit/1fde9bb1505f04e0b0a1afb542a1aa5dfd20ed1b
> >
> > and de00881 <
> >
> https://github.com/apache/drill/commit/de008810c815e46e6f6e5d13ad0b9a23e705b13a
> >).
> > We should have a conversation about what constitutes public API and
> changes
> > to this API on a separate thread.
> >
> > Thank you,
> > Sudheesh
> >
> > > On Mar 14, 2016, at 12:04 PM, Abhishek Girish <
> abhishek.gir...@gmail.com>
> > wrote:
> > >
> > > +1 (non-binding)
> > >
> > > - Tested Drill in distributed mode (built with MapR profile).
> > > - Ran functional tests from Drill-Test-Framework [1]
> > > - Tested Web UI (basic sanity)
> > > - Tested Sqlline
> > >
> > > Looks good.
> > >
> > >
> > > [1] https://github.com/mapr/drill-test-framework
> > >
> > > On Mon, Mar 14, 2016 at 11:23 AM, Venki Korukanti <
> > venki.koruka...@gmail.com
> > >> wrote:
> > >
> > >> +1
> > >>
> > >> Installed tar.gz on a 3 node cluster.
> > >> Ran queries on data located in HDFS
> > >> Enabled auth in WebUI, ran few queries and, verified auth and querying
> > >> works fine
> > >> Logged bugs for 2 minor issues/improvements (DRILL-4508
> > >> <https://issues.apache.org/jira/browse/DRILL-4508> & DRILL-4509
> > >> <https://issues.apache.org/jira/browse/DRILL-4509>)
> > >>
> > >> Thanks
> > >> Venki
> > >>
> > >> On Mon, Mar 14, 2016 at 10:56 AM, Norris Lee 
> wrote:
> > >>
> > >>> +1 (Non-binding)
> > >>>
> > >>> Build from source on CentOS. Tested the ODBC driver with queries
> > against
> > >>> hive and DFS (json, parquet, tsv, csv, directories).
> > >>>
> > >>> Norris
> > >>>
> > >>> -Original Message-
> > >>> From: Hsuan Yi Chu [mailto:hyi...@maprtech.com]
> > >>> Sent: Monday, March 14, 2016 10:42 AM
> > >>> To: dev@drill.apache.org; adityakish...@gmail.com
> > >>> Subject: Re: [VOTE] Release Apache Drill 1.6.0 - rc0
> > >>>
> > >>> +1
> > >>> mvn clean install on linux vm; Tried some queries; Looks good.
> > >>>
> > >>> On Mon, Mar 14, 2016 at 9:58 AM, Aditya 
> > wrote:
> > >>>
> > >>>> While I did verify the signature and structure of the maven
> artifacts,
> > >>>> I think Jacques was referring to verify the functionality, which I
> > have
> > >>> not.
> > >>>>
> > >>>> On Mon, Mar 14, 2016 at 8:12 AM, Parth Chandra 
> > >>> wrote:
> > >>>>
> > >>>>> Aditya has verified the maven artifacts. Would it make sense to
> > >>>>> extend
> > >>>> the
> > >>>>> vote by another day to let more people verify the release?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Mon, Mar 14, 2016 at 7:08 AM, Jacques Nadeau <
> jacq...@dremio.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I haven'

Re: [VOTE] Release Apache Drill 1.6.0 - rc0

2016-03-14 Thread Jacques Nadeau
I haven't had a chance to validate yet.  Has anyone checked the maven
artifacts yet?
On Mar 14, 2016 6:37 AM, "Aditya"  wrote:

> +1 (binding).
>
> * Verified checksum and signature of all release artifacts in[1] and maven
> artifacts in [2] and the artifacts are signed using Parth's public key (ID
> 9BAA73B0).
> * Verified that build and tests pass using the source artifact.
> * Verified that Drill can be launched in embedded mode using the
> convenience binary release.
> * Ran sample queries using classpath storage plugin.
>
> p.s. Have enhanced the release verification script [3] to allow automatic
> download and verification of release artifacts through the pull request
> 249[4]. Will merge if someone can review it.
>
> [1] http://home.apache.org/~parthc/drill/releases/1.6.0/rc0/
> [2] https://repository.apache.org/content/repositories/orgapachedrill-1030
> [3] https://github.com/apache/drill/blob/master/tools/verify_release.sh
> [4] https://github.com/apache/drill/pull/249
>
> On Mon, Mar 14, 2016 at 12:51 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > wrote:
>
> > +1
> >
> > built from source with mapr profile and deployed on 2 nodes, then run
> > window functions from Drill's test framework. Also took a quick look at
> the
> > WebUI. Everything looks fine
> >
> > On Sun, Mar 13, 2016 at 5:53 PM, Parth Chandra 
> wrote:
> >
> >> Added GPG key
> >>
> >> On Sat, Mar 12, 2016 at 6:48 PM, Aditya 
> wrote:
> >>
> >> > I couldn't find your signing keys[1].
> >> >
> >> > [1] https://github.com/apache/drill/blob/master/KEYS
> >> >
> >> > On Fri, Mar 11, 2016 at 7:09 AM, Parth Chandra 
> >> wrote:
> >> >
> >> > > Hello all,
> >> > >
> >> > > I'd like to propose the zeroth release candidate (rc0) of Apache
> >> Drill,
> >> > > version 1.6.0.
> >> > > It covers a total of 44 resolved JIRAs [1].
> >> > > Thanks to everyone who contributed to this release.
> >> > >
> >> > > The tarball artifacts are hosted at [2] and the maven artifacts are
> >> > hosted
> >> > > at [3].
> >> > >
> >> > > This release candidate is based on commit
> >> > > d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb located at [4].
> >> > >
> >> > > The vote will be open for the next ~72 hours ending at 7:10 AM
> >> Pacific,
> >> > > March
> >> > > 14, 2016.
> >> > >
> >> > > [ ] +1
> >> > > [ ] +0
> >> > > [ ] -1
> >> > >
> >> > > Here's my vote: +1
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Parth
> >> > >
> >> > > [1]
> >> > >
> >> > >
> >> >
> >>
> https://issues.apache.org/jira/issues/?jql=project%3D%22Apache%20Drill%22%20and%20status%20in%20(resolved%2C%20closed)%20and%20fixVersion%3D1.6.0
> >> > > [2] http://home.apache.org/~parthc/drill/releases/1.6.0/rc0/
> >> > > [3]
> >> >
> https://repository.apache.org/content/repositories/orgapachedrill-1030
> >> > > [4]
> https://github.com/parthchandra/incubator-drill/tree/drill-1.6.0
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
> >
>


Getting back on Calcite master: only a few steps left

2016-03-13 Thread Jacques Nadeau
Hey All,

I've been working on rebasing and tracking all the necessary commits that
are on the Drill Calcite fork so that we can get back onto master. The
current working branch is here: [1]. It includes the following commits

[CALCITE-1148] Fix RelTrait conversion (e.g. distribution, collation),
added test cases. (Minji Kim) #77def4a
[CALCITE-991] Create separate FunctionCategories for table functions and
macros (Julien Le Dem) #b1c203d
[CALCITE-1149] Derive AVG’s return type by a customizable policy (Sudheesh
Katkam) #18882cd
[CALCITE-1151] Overriding the SqlSpecialOperator#createCall method given
the usage by CompoundIdentifierConverter (Sudheesh Katkam) #2320c7f
[CALCITE-1108] Don't use 'SumEmptyIsZero' (SUM0) window aggregate until
CALCITE-777 is fixed. (Aman Sinha) #13466fa
[CALCITE-1107] Make SqlSumEmptyIsZeroAggFunction constructor public.
(Jinfeng Ni) #b6c3178
[CALCITE-1106] Expose Constructor for ProjectJoinTransposeRule. (Aman
Sinha) #d169c37
[CALCITE-1105] Add return type-inference strategy for arithmetic operators
when one of the arguments is ANY type. (Aman Sinha) #df818c9
[CALCITE-1150] Add DynamicRecordType and the concept of unresolved star
(Jinfeng Ni) #29c7771
[CALCITE-1152] Small ANY type fixes (Mehant Baid) #31efdda
[CALCITE-528] Ensure uniquification is done in a case aware way according
to type system and catalog policies. (Jacques Nadeau) #5a3d854

Many commits, listed below, don't have tests right now so I'd like to get
people to raise their hand and work on tests for each of the commits.

[CALCITE-991] Create separate FunctionCategories for table functions and
macros (Julien Le Dem) #b1c203d
[CALCITE-1149] Derive AVG’s return type by a customizable policy (Sudheesh
Katkam) #18882cd
[CALCITE-1151] Overriding the SqlSpecialOperator#createCall method given
the usage by CompoundIdentifierConverter (Sudheesh Katkam) #2320c7f
[CALCITE-1108] Don't use 'SumEmptyIsZero' (SUM0) window aggregate until
CALCITE-777 is fixed. (Aman Sinha) #13466fa
[CALCITE-1105] Add return type-inference strategy for arithmetic operators
when one of the arguments is ANY type. (Aman Sinha) #df818c9
[CALCITE-1150] Add DynamicRecordType and the concept of unresolved star
(Jinfeng Ni) #29c7771
[CALCITE-1152] Small ANY type fixes (Mehant Baid) #31efdda
[CALCITE-528] Ensure uniquification is done in a case aware way according
to type system and catalog policies. (Jacques Nadeau) #5a3d854

Also note that there are currently 15 tests failing in this Calcite branch
that I haven't yet tracked down.

org.apache.calcite.test.SqlToRelConverterTest (10 tests)
org.apache.calcite.test.JdbcTest (2 tests)
org.apache.calcite.test.RelOptRulesTest.txt (1 test)
org.apache.calcite.test.SqlValidatorTest.txt (1 test)
org.apache.calcite.rel.rel2sql.RelToSqlConverterTest (1 test)

Note that I also reworked the Schema changes items so that they don't have
any impact on code paths unless the system returns a DynamicRecordType.
Once we get these changes looking good, we can move to making small
modifications in the Drill codebase to use this new record type.

Can people raise their hands to confirm they will be able to write test
cases for the issues they own?

thanks,
Jacques

[1] https://github.com/jacques-n/incubator-calcite/tree/calcite-drill-2

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Time for the 1.6 Release

2016-03-10 Thread Jacques Nadeau
You want to roll forward the current branch to 1.7.0-SNAPSHOT so we can
continue developing/merging stuff?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 10, 2016 at 4:09 PM, Parth Chandra  wrote:

> Hi guys,
>
>   I've created the 1.6.0 branch and will be rolling out the first release
> candidate as soon as I get the go ahead from the QA team, in the next
> couple of hours.
>
> Thanks
>
> Parth
>
> On Thu, Mar 10, 2016 at 3:20 PM, Parth Chandra 
> wrote:
>
> > It is usually not a good idea to try to rush in a patch at the last
> > minute. One of the reasons for having monthly releases is so people don't
> > have to wait too long for fixes and developers don't rush fixes in.
> > QA is almost done with their validation, so I'm afraid this might have to
> > go into the next release.
> >
> > On Thu, Mar 10, 2016 at 1:19 PM, Jason Altekruse 
> wrote:
> >
> >> I hadn't actually tested out the patch, what I had said was that I could
> >> add a flag to make avro files behave like parquet and JSON, without
> schema
> >> validation. The patch made it so the behavior of directories would be
> >> different from that of individual files, removing the schema
> validation. I
> >> tried applying it just now and it still doesn't appear to make the dirN
> >> columns work, but I don't understand why. I will try to take a look
> >> tonight
> >> and post a patch. It will be up to Parth if he wants to put it in the
> >> release once the full fix is merged.
> >>
> >> Jason Altekruse
> >> Software Engineer at Dremio
> >> Apache Drill Committer
> >>
> >> On Thu, Mar 10, 2016 at 1:09 AM, Stefán Baxter <
> ste...@activitystream.com
> >> >
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > This issue is still unresolved:
> >> > https://issues.apache.org/jira/browse/DRILL-4120
> >> >
> >> > It would mean a great deal to us if it was.
> >> > The solution is, as I understood Jason and Jacques, ready and only
> >> needs to
> >> > be merged.
> >> >
> >> > Regards,
> >> >  -Stefán
> >> >
> >> > On Thu, Mar 10, 2016 at 3:50 AM, Parth Chandra 
> >> wrote:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > >   Just a note to  update everyone that the QA team is testing out
> the
> >> > build
> >> > > from master.
> >> > >   There are no further commits expected for the 1.6.0 release.
> >> > >   The repo is open for commits but try not to break anything :)
> >> > >
> >> > >
> >> > > Parth
> >> > >
> >> > > On Tue, Mar 8, 2016 at 5:16 PM, Parth Chandra 
> >> wrote:
> >> > >
> >> > > > Okay we are down to the final one -
> >> > > >
> >> > > > DRILL-4482 - Avro no longer selects data correctly from a
> >> > > > sub-structure.(Jason)
> >> > > >
> >> > > > Note that MapR QA team is going to start testing 1.6 snapshot now
> >> > before
> >> > > I
> >> > > > roll out the release candidate. DRILL-4482 can be merged in later
> >> as it
> >> > > is
> >> > > > not likely to affect the release. Hopefully there will be no show
> >> > > > stoppers.
> >> > > >
> >> > > > The plan is to roll out the release candidate by Thursday.
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > > Parth
> >> > > >
> >> > > >
> >> > > > On Tue, Mar 8, 2016 at 9:31 AM, Parth Chandra 
> >> > wrote:
> >> > > >
> >> > > >> OK, let's leave it out then.
> >> > > >>
> >> > > >> On Tue, Mar 8, 2016 at 9:25 AM, Jason Altekruse <
> >> > > altekruseja...@gmail.com
> >> > > >> > wrote:
> >> > > >>
> >> > > >>> To be honest I was expecting a longer review cycle so I hadn't
> run
> >> > the
> >> > > >>> unit
> >> > > >>> tests before posting it for review. There were only very minor
> >> > > functional
> >> > > >>> changes, so I wasn't

Re: Questions about Event Loop Groups

2016-03-09 Thread Jacques Nadeau
We should probably have three main categories with a thread group for each.
Generally, the second and third can probably be very small:

Bit <> Bit (Data)
Data Client, DataServer

Bit <> Bit (Control)
ControlServer, ControlClient

Bit <> User
User Client, User Server

For the question with regard to the control client/server, I believe we still
use a peer-to-peer approach for the control channel (unless we deprecated
this). As such, whoever creates the connection first is a ControlClient, but
messages can be sent in either direction. So no, I don't think you can remove
the handle method. This is different from the data channel, where I believe
our behavior is to establish separate sockets for each direction of
communication.
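
For illustration only, here is a minimal sketch of that three-way split
(class names, pool names, and sizes are assumptions, not the actual Drill
code):

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

// Hypothetical holder for the three categories described above.
public class RpcEventLoops implements AutoCloseable {
  // Bit <> Bit (Data): DataClient, DataServer
  private final EventLoopGroup dataLoop =
      new NioEventLoopGroup(8, new DefaultThreadFactory("bit-data-rpc"));
  // Bit <> Bit (Control): ControlServer, ControlClient (can be very small)
  private final EventLoopGroup controlLoop =
      new NioEventLoopGroup(2, new DefaultThreadFactory("bit-control-rpc"));
  // Bit <> User: UserClient, UserServer (can be very small)
  private final EventLoopGroup userLoop =
      new NioEventLoopGroup(2, new DefaultThreadFactory("user-rpc"));

  public EventLoopGroup dataLoop() { return dataLoop; }
  public EventLoopGroup controlLoop() { return controlLoop; }
  public EventLoopGroup userLoop() { return userLoop; }

  // Shut each loop down where it was created, per the proposal below.
  @Override
  public void close() {
    dataLoop.shutdownGracefully();
    controlLoop.shutdownGracefully();
    userLoop.shutdownGracefully();
  }
}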

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 9, 2016 at 11:55 AM, Sudheesh Katkam 
wrote:

> Slightly unrelated question: ControlClient does not handle requests; it is
> the requestor. So shouldn't the handle method <
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/rpc/control/ControlClient.java#L92>
> throw an UnsupportedOperationException (just like DataClient)?
>
> > On Mar 9, 2016, at 11:44 AM, Sudheesh Katkam 
> wrote:
> >
> > There are two event loop groups to handle bit-to-bit communication (“bit
> server” and “bit client”).
> >
> > (1) The “bit server” loop is used by DataServer, ControlServer and
> ControlClient, and “bit client” loop is used by DataClient. Is there a
> reason why ControlClient does not use the “bit client” loop?
> >
> > (2) The event loop groups are shutdown only when *Server are shutdown.
> So the “bit server” loop is shutdown twice, and the “bit client” loop is
> not shutdown.
> >
> > To avoid confusion, I propose these loops to be shutdown in (close
> methods of) classes that create them. Thoughts?
> >
> > Thank you,
> > Sudheesh
>
>


Re: Time for the 1.6 Release

2016-03-07 Thread Jacques Nadeau
The new bug (currently filed under DRILL-4384) is a completely different
bug from the original (the original has to do with profile metrics, this one
has to do with plan text). I'll try to look at it tonight if no one can get
to it sooner.


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 7, 2016 at 12:37 PM, Parth Chandra 
wrote:

> DRILL-4384 is a blocker for the release though
>
> On Mon, Mar 7, 2016 at 12:01 PM, Sudheesh Katkam 
> wrote:
>
> > I reopened DRILL-4384 <https://issues.apache.org/jira/browse/DRILL-4384>
> > (blocker); it is assigned to Jacques.
> >
> > On the latest master, the visualized and physical plan tabs on web UI are
> > empty.
> >
> > Thank you,
> > Sudheesh
> >
> > > On Mar 7, 2016, at 11:39 AM, Jason Altekruse  >
> > wrote:
> > >
> > > I don't know if there are any specific time constraints for getting out
> > the
> > > release, but I'm inclined to go with Vicky on DRILL-4477, at least some
> > > investigation into the scope of a fix would be good. I think it's a
> > > reasonably big problem whether it's a regression or not.
> > >
> > > On Mon, Mar 7, 2016 at 11:35 AM, Zelaine Fong 
> > wrote:
> > >
> > >> Hakim,
> > >>
> > >> Yes, we'll include this in the release.
> > >>
> > >> -- Zelaine
> > >>
> > >> On Mon, Mar 7, 2016 at 9:31 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com
> > >>>
> > >> wrote:
> > >>
> > >>> If we still have time, I would like to include DRILL-4457 [1], it's a
> > >> wrong
> > >>> results issue, I already have a fix and it's passing all tests, I am
> > just
> > >>> waiting for a review [2]
> > >>>
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/DRILL-4457
> > >>> [2] https://github.com/apache/drill/pull/410
> > >>>
> > >>> On Mon, Mar 7, 2016 at 4:50 PM, Parth Chandra 
> > wrote:
> > >>>
> > >>>> Hi guys,
> > >>>>
> > >>>> I'm still waiting for the following to be reviewed/merged by today.
> > >>>>
> > >>>> DRILL-4437 (and others)/pr 394 (Operator unit test framework).
> Waiting
> > >> to
> > >>>> be merged (Jason)
> > >>>>
> > >>>> DRILL-4372/pr 377(?) (Drill Operators and Functions should correctly
> > >>> expose
> > >>>> their types within Calcite.) - (Jinfeng to review)
> > >>>>
> > >>>> DRILL-4313/pr 396  (Improved client randomization. Update JIRA with
> > >>>> warnings about using the feature ) (Hanifi/Sudheesh/Paul - patch
> > >>> reviewed.
> > >>>> No +1)
> > >>>>
> > >>>> DRILL-4375/pr 402 (Fix the maven release profile) - (Jason - patch
> > >>>> reviewed. Ready to merge?)
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> Parth
> > >>>>
> > >>>> On Sun, Mar 6, 2016 at 12:01 PM, Aditya 
> > >> wrote:
> > >>>>
> > >>>>> DRILL-4375/pr 402 - reviewed.
> > >>>>>
> > >>>>> On Sun, Mar 6, 2016 at 12:48 AM, Stefán Baxter <
> > >>>> ste...@activitystream.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Please review this and then consider as a potential blocker:
> > >>>>>>
> > >>>>>> https://issues.apache.org/jira/browse/DRILL-4482
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sat, Mar 5, 2016 at 3:15 AM, Parth Chandra 
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Okay here's the list  of  JIRA's pending. It looks like we need
> > >> to
> > >>>> get
> > >>>>>> some
> > >>>>>>> more time to get the PRs still under review merged, so I'll wait
> > >>> over
> > >>>>> the
> > >>>>>>> weekend.
> > >>>>>>> It looks like the PRs that have no reviewers assigned in the list
> > >>>>>>> below may

Re: Time for the 1.6 Release

2016-03-04 Thread Jacques Nadeau
Awesome. Thanks Chun!
On Mar 4, 2016 5:51 PM, "Chun Chang"  wrote:

> Jacques submitted a PR for fixing the failed baselines. I've merged them
> into automation master and confirmed the failed tests are all passing now.
> Thanks.
>
> -Chun
>
>
> On Thu, Mar 3, 2016 at 10:48 PM, Jacques Nadeau 
> wrote:
>
> > I think we need to include DRILL-4467
> > <https://issues.apache.org/jira/browse/DRILL-4467>. I think it is a one
> > line patch; the bug produces unpredictable plans at a minimum and may also
> > produce invalid results. Still need to think through the second half. I've
> > seen this plan instability in some of my recent test runs (even without
> > Java 8) when running extended HBase tests.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Mar 3, 2016 at 10:02 PM, Parth Chandra 
> wrote:
> >
> > > Updated list  (I'll follow up with the folks named here separately) -
> > >
> > > Committed for 1.6 -
> > >
> > > DRILL-4384 - Query profile is missing important information on WebUi -
> > > Merged
> > > DRILL-3488/pr 388 (Java 1.8 support) - Merged.
> > > DRILL-4410/pr 380 (listvector should initialize bits...) - Merged
> > > DRILL-4383/pr 375 (Allow custom configs for S3, Kerberos, etc) - Merged
> > > DRILL-4465/pr 401 (Simplify Calcite parsing & planning integration) -
> > > Waiting to be merged
> > > DRILL-4437 (and others)/pr 394 (Operator unit test framework). Waiting
> to
> > > be merged.
> > >
> > > DRILL-4281/pr 400 (Drill should support inbound impersonation) (Jacques
> > to
> > > review)
> > > DRILL-4372/pr 377(?) (Drill Operators and Functions should correctly
> > expose
> > > their types within Calcite.) - Waiting for Aman to review. (Owners:
> > Hsuan,
> > > Jinfeng, Aman, Sudheesh)
> > > DRILL-4313/pr 396  (Improved client randomization. Update JIRA with
> > > warnings about using the feature ) (Sudheesh to review.)
> > > DRILL-4449/pr 389 (Wrong results when metadata cache is used..) (Aman
> to
> > > review)
> > > DRILL-4069/pr 352 Enable RPC thread offload by default (Owner:
> Sudheesh)
> > >
> > > Need review -
> > > DRILL-4375/pr 402 (Fix the maven release profile)
> > > DRILL-4452/pr 395 (Update Avatica Driver to latest Calcite)
> > > DRILL-4332/pr 389 (Make vector comparison order stable in test
> framework)
> > > DRILL-4411/pr 381 (hash join over-memory condition)
> > > DRILL-4387/pr 379 (GroupScan should not use star column)
> > > DRILL-4184/pr 372 (support variable length decimal fields in parquet)
> > > DRILL-4120 - dir0 does not work when the directory structure contains
> > Avro
> > > files - Partial patch available.
> > > DRILL-4203/pr 341 (fix dates written into parquet files to conform to
> > > parquet format spec)
> > >
> > > Not included (yet) -
> > > DRILL-3149 - No patch available
> > > DRILL-4441 - IN operator does not work with Avro reader - No patch
> > > available
> > > DRILL-3745/pr 399 - Hive char support - New feature - Needs QA - Not
> > > included in 1.6
> > > DRILL-3623 - Limit 0 should avoid execution when querying a known
> schema.
> > > (Need to add limitations of current impl). Intrusive change; should be
> > > included at beginning of release cycle.
> > > DRILL-4416/pr 385 (quote path separator) (Owner: Hanifi) - Causes leak.
> > >
> > > Others -
> > > DRILL-2517   - Already resolved.
> > > DRILL-3688/pr 382 (skip.header.line.count in hive). - Already merged.
> PR
> > > needs to be closed.
> > >
> > >
> > > On Thu, Mar 3, 2016 at 9:44 PM, Parth Chandra 
> wrote:
> > >
> > > > Right. My mistake. Thanks, Jacques, for reviewing.
> > > >
> > > > On Thu, Mar 3, 2016 at 9:08 PM, Zelaine Fong 
> > wrote:
> > > >
> > > >> DRILL-4281/pr 400 (Drill should support inbound impersonation)
> > (Sudheesh
> > > >> to
> > > >> review)
> > > >>
> > > >> Sudheesh is the fixer of DRILL-4281, so I don't think he can be the
> > > >> reviewer :).
> > > >>
> > > >> -- Zelaine
> > > >>
> > > >> On Thu, Mar 3, 2016 at 6:30 PM, Parth Chandra 
> > > wrote:
> > > >>
> > > >> > Here's an updated list with names of reviewers

Re: Heads up on trivial project fix: plan baselines in regression suite need updating

2016-03-04 Thread Jacques Nadeau
I think it is ~40 in our suite. Not sure on yours. If you have a failed run
on your side, share the output and I may be able to propose a patch to your
suite:

It is a one line change.

https://github.com/apache/drill/commit/edea8b1cf4e5476d803e8b87c79e08e8c3263e04#diff-ca259849558f34142f1e17066df42a9fR259

Are you guys convinced that this isn't ever a correctness issue? That would
be my main hesitation about removing this. It is clearly a bug.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Mar 4, 2016 at 10:35 AM, Parth Chandra  wrote:

> I'd be more comfortable if we merged this in after the release. Updating
> the test baselines will delay the release considerably - I would want the
> new baselines to be verified manually which is always time consuming.
> How many tests are affected?
>
>
>
> On Fri, Mar 4, 2016 at 10:03 AM, Jacques Nadeau 
> wrote:
>
> > Do you think we should back out? It seemed like this could likely cause
> > correctness issues although we may be safe with our name based
> resolution.
> > On Mar 4, 2016 9:56 AM, "Aman Sinha"  wrote:
> >
> > > @jacques, thanks for the heads-up, although it comes too close to the
> > > release date :).  I agree that the plan tests should be targeted to a
> > > narrow scope by specifying the sub-pattern it is supposed to test.
>  That
> > > said, it is a lot easier for the tester to capture the entire plan
> since
> > > he/she may miss an important detail if a sub-plan is captured, so this
> > > requires close interaction with the developer (which depending on
> various
> > > factors may take longer while the test needs to be checked-in).
> > > BTW, Calcite unit tests capture entire plan.  I am not sure if similar
> > > issue has been discussed on Calcite dev list in the past.
> > >
> > > -Aman
> > >
> > > On Fri, Mar 4, 2016 at 4:19 AM, Jacques Nadeau 
> > wrote:
> > >
> > > > I just merged a simple fix that Laurent found for DRILL-4467.
> > > >
> > > > This fix ensures consistent column ordering when pushing a projection
> > > > into a scan; the previous behavior produced invalid plans and was causing
> > > > excessive operators and pushdown failures in some cases.
> > > >
> > > > However, this fix removes a number of trivial projects (that were
> > > > previously not detected as such) in a large set of queries. This
> means
> > > that
> > > > a number of plan baselines will need to be updated in the extended
> > > > regression suite to avoid consideration of the trivial project. This
> > > > underscores an issue I see in these tests. In virtually all cases
> I've
> > > > seen, the purpose of the test shouldn't care whether the trivial
> > project
> > > is
> > > > part of the plan. However, the baseline is over-reaching in its
> > > definition,
> > > > including a bunch of nodes irrelevant to the purpose of the test. One
> > > > example might be here:
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/plan/q23.res
> > > >
> > > > In this baseline, we're testing that the filter is pushed past the
> > > > aggregation. That means what we really need to be testing is a
> > multiline
> > > > plan pattern of
> > > >
> > > > HashAgg.*Filter.*Scan.*
> > > >
> > > > or better
> > > >
> > > > HashAgg.*Filter\(condition=\[=\(\$0, 10\)\]\).*Scan.*
> > > >
> > > > However, you can see that the actual expected result includes the
> > > > entire structure of the plan (but not the pushed down filter
> > > > condition). This causes the plan to fail now that DRILL-4467 is
> > > > merged. As part of the fixes to these plans, we should really make
> > > > sure that the scope of the baseline is only focused on the relevant
> > > > issue to avoid nominal changes from causing testing false positives.
> > > >
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > >
> >
>


Re: Heads up on trivial project fix: plan baselines in regression suite need updating

2016-03-04 Thread Jacques Nadeau
Do you think we should back out? It seemed like this could likely cause
correctness issues although we may be safe with our name based resolution.
On Mar 4, 2016 9:56 AM, "Aman Sinha"  wrote:

> @jacques, thanks for the heads-up, although it comes too close to the
> release date :).  I agree that the plan tests should be targeted to a
> narrow scope by specifying the sub-pattern it is supposed to test.   That
> said, it is a lot easier for the tester to capture the entire plan since
> he/she may miss an important detail if a sub-plan is captured, so this
> requires close interaction with the developer (which depending on various
> factors may take longer while the test needs to be checked-in).
> BTW, Calcite unit tests capture entire plan.  I am not sure if similar
> issue has been discussed on Calcite dev list in the past.
>
> -Aman
>
> On Fri, Mar 4, 2016 at 4:19 AM, Jacques Nadeau  wrote:
>
> > I just merged a simple fix that Laurent found for DRILL-4467.
> >
> > This fix ensures consistent column ordering when pushing a projection into
> > a scan; the previous behavior produced invalid plans and was causing
> > excessive operators and pushdown failures in some cases.
> >
> > However, this fix removes a number of trivial projects (that were
> > previously not detected as such) in a large set of queries. This means
> that
> > a number of plan baselines will need to be updated in the extended
> > regression suite to avoid consideration of the trivial project. This
> > underscores an issue I see in these tests. In virtually all cases I've
> > seen, the purpose of the test shouldn't care whether the trivial project
> is
> > part of the plan. However, the baseline is over-reaching in its
> definition,
> > including a bunch of nodes irrelevant to the purpose of the test. One
> > example might be here:
> >
> >
> >
> https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/plan/q23.res
> >
> > In this baseline, we're testing that the filter is pushed past the
> > aggregation. That means what we really need to be testing is a multiline
> > plan pattern of
> >
> > HashAgg.*Filter.*Scan.*
> >
> > or better
> >
> > HashAgg.*Filter\(condition=\[=\(\$0, 10\)\]\).*Scan.*
> >
> > However, you can see that the actual expected result includes the
> > entire structure of the plan (but not the pushed down filter
> > condition). This causes the plan to fail now that DRILL-4467 is
> > merged. As part of the fixes to these plans, we should really make
> > sure that the scope of the baseline is only focused on the relevant
> > issue to avoid nominal changes from causing testing false positives.
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
>


Heads up on trivial project fix: plan baselines in regression suite need updating

2016-03-04 Thread Jacques Nadeau
I just merged a simple fix that Laurent found for DRILL-4467.

This fix ensures consistent column ordering when pushing a projection into a
scan; the previous behavior produced invalid plans and was causing excessive
operators and pushdown failures in some cases.

However, this fix removes a number of trivial projects (that were
previously not detected as such) in a large set of queries. This means that
a number of plan baselines will need to be updated in the extended
regression suite to avoid consideration of the trivial project. This
underscores an issue I see in these tests. In virtually all cases I've
seen, the purpose of the test shouldn't care whether the trivial project is
part of the plan. However, the baseline is over-reaching in its definition,
including a bunch of nodes irrelevant to the purpose of the test. One
example might be here:

https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/plan/q23.res

In this baseline, we're testing that the filter is pushed past the
aggregation. That means what we really need to be testing is a multiline
plan pattern of

HashAgg.*Filter.*Scan.*

or better

HashAgg.*Filter\(condition=\[=\(\$0, 10\)\]\).*Scan.*

However, you can see that the actual expected result includes the
entire structure of the plan (but not the pushed-down filter
condition). This causes the plan test to fail now that DRILL-4467 is
merged. As part of the fixes to these plans, we should really make
sure that the baseline is focused only on the relevant issue, so that
nominal changes don't cause false test failures.
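
For reference, a minimal sketch of what such a narrow check could look like
in a test (the helper name is an assumption, not the actual test-framework
API):

import java.util.regex.Pattern;

// DOTALL lets '.*' cross line breaks, so one pattern can span multiple plan lines.
static void assertPlanMatches(String planText, String expectedPattern) {
  if (!Pattern.compile(expectedPattern, Pattern.DOTALL).matcher(planText).find()) {
    throw new AssertionError(
        "Plan did not match pattern " + expectedPattern + ":\n" + planText);
  }
}

// Usage: assert only that the filter stays below the aggregation, ignoring
// trivial projects and the rest of the plan structure:
// assertPlanMatches(plan, "HashAgg.*Filter\\(condition=\\[=\\(\\$0, 10\\)\\]\\).*Scan");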



--
Jacques Nadeau
CTO and Co-Founder, Dremio


[jira] [Resolved] (DRILL-4467) Invalid projection created using PrelUtil.getColumns

2016-03-04 Thread Jacques Nadeau (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau resolved DRILL-4467.
---
Resolution: Fixed

Fixed with edea8b1cf4e5476d803e8b87c79e08e8c3263e04

> Invalid projection created using PrelUtil.getColumns
> 
>
> Key: DRILL-4467
> URL: https://issues.apache.org/jira/browse/DRILL-4467
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Laurent Goujon
>        Assignee: Jacques Nadeau
>Priority: Critical
> Fix For: 1.6.0
>
>
> In {{DrillPushProjIntoScan}}, a new scan and a new projection are created 
> using {{PrelUtil#getColumns(RelDataType, List)}}.
> The returned {{ProjectPushInfo}} instance has several fields, one of which is 
> {{desiredFields}}, the list of projected fields. There's one instance 
> per {{RexNode}}, but because instances were initially added to a set, they 
> might not be in the order they were created.
> The issue happens in the following code:
> {code:java}
>   List<RexNode> newProjects = Lists.newArrayList();
>   for (RexNode n : proj.getChildExps()) {
>     newProjects.add(n.accept(columnInfo.getInputRewriter()));
>   }
> {code}
> This code creates a new list of projects out of the initial ones, by mapping 
> the indices from the old projects to the new projects, but the indices of the 
> new RexNode instances might be out of order (because of the ordering of 
> desiredFields). And if indices are out of order, the check 
> {{ProjectRemoveRule.isTrivial(newProj)}} will fail.
> My guess is that desiredFields ordering should be preserved when instances 
> are added, to satisfy the condition above.
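
A minimal sketch of that suggestion (the class and field types below are
placeholders, not the actual change that went in): keep the desired fields in
an insertion-ordered set so de-duplication does not reshuffle them.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Placeholder for ProjectPushInfo's field bookkeeping. A LinkedHashSet still
// de-duplicates, but iterates in insertion order, so indices assigned from it
// follow the order the fields were first seen and the rewritten projection
// can be recognized as trivial by ProjectRemoveRule.isTrivial().
class DesiredFieldsSketch {
  private final Set<String> desiredFields = new LinkedHashSet<>();

  void add(String fieldName) {
    desiredFields.add(fieldName);   // duplicates collapse, order is preserved
  }

  List<String> inCreationOrder() {
    return new ArrayList<>(desiredFields);
  }
}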



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4473) Removing trivial projects reveals bugs in handling of nonexistent columns in StreamingAggregate

2016-03-04 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created DRILL-4473:
-

 Summary: Removing trivial projects reveals bugs in handling of 
nonexistent columns in StreamingAggregate
 Key: DRILL-4473
 URL: https://issues.apache.org/jira/browse/DRILL-4473
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jacques Nadeau


We see a couple of unit test failures in working with nonexistent columns once 
DRILL-4467 is fixed. This is because trivial projects no longer protect 
StreamingAggregate from nonexistent columns. This is likely due to an 
incorrect check before throwing an Unsupported error. An unknown/ANY type should 
probably be allowed when using sum/max/stddev.
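
A rough sketch of the kind of relaxed check intended (class, method, and set
names are assumptions, not the actual operator code):

import java.util.EnumSet;
import org.apache.drill.common.types.TypeProtos.MinorType;

class AggInputTypeCheckSketch {
  // Placeholder set; the real operator derives supported types from the function.
  private static final EnumSet<MinorType> SUPPORTED_TYPES = EnumSet.of(
      MinorType.INT, MinorType.BIGINT, MinorType.FLOAT4, MinorType.FLOAT8);

  static void checkInputType(MinorType type) {
    if (type == MinorType.LATE) {   // type not known yet, e.g. a nonexistent column
      return;                       // allow it instead of throwing
    }
    if (!SUPPORTED_TYPES.contains(type)) {
      throw new UnsupportedOperationException(
          "Unsupported input type for sum/max/stddev: " + type);
    }
  }
}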

{code:title=Plan before DRILL-4467}
VOLCANO:Physical Planning (71ms):
ScreenPrel: rowcount = 1.0, cumulative cost = {464.1 rows, 2375.1 cpu, 0.0 io, 
0.0 network, 0.0 memory}, id = 185
  ProjectPrel(col1=[$0], col2=[$1], col3=[$2], col4=[$3], col5=[$4]): rowcount 
= 1.0, cumulative cost = {464.0 rows, 2375.0 cpu, 0.0 io, 0.0 network, 0.0 
memory}, id = 184
StreamAggPrel(group=[{}], col1=[SUM($0)], col2=[SUM($1)], col3=[SUM($2)], 
col4=[SUM($3)], col5=[SUM($4)]): rowcount = 1.0, cumulative cost = {464.0 rows, 
2375.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 183
  LimitPrel(offset=[0], fetch=[0]): rowcount = 1.0, cumulative cost = 
{463.0 rows, 2315.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 182
ProjectPrel(int_col=[$0], bigint_col=[$3], float4_col=[$4], 
float8_col=[$1], interval_year_col=[$2]): rowcount = 463.0, cumulative cost = 
{463.0 rows, 2315.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 181
  ScanPrel(groupscan=[EasyGroupScan 
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, 
`bigint_col`, `float4_col`, `float8_col`, `interval_year_col`], 
files=[classpath:/employee.json]]]): rowcount = 463.0, cumulative cost = {463.0 
rows, 2315.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 160
{code}

{code:title=Plan after DRILL-4467}
VOLCANO:Physical Planning (63ms):
ScreenPrel: rowcount = 1.0, cumulative cost = {464.1 rows, 2375.1 cpu, 0.0 io, 
0.0 network, 0.0 memory}, id = 151
  ProjectPrel(col1=[$0], col2=[$1], col3=[$2], col4=[$3], col5=[$4]): rowcount 
= 1.0, cumulative cost = {464.0 rows, 2375.0 cpu, 0.0 io, 0.0 network, 0.0 
memory}, id = 150
StreamAggPrel(group=[{}], col1=[SUM($0)], col2=[SUM($1)], col3=[SUM($2)], 
col4=[SUM($3)], col5=[SUM($4)]): rowcount = 1.0, cumulative cost = {464.0 rows, 
2375.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 149
  LimitPrel(offset=[0], fetch=[0]): rowcount = 1.0, cumulative cost = 
{463.0 rows, 2315.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 148
ScanPrel(groupscan=[EasyGroupScan 
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, 
`bigint_col`, `float4_col`, `float8_col`, `interval_year_col`], 
files=[classpath:/employee.json]]]): rowcount = 463.0, cumulative cost = {463.0 
rows, 2315.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 141


Tests disabled referring to this bug in TestAggregateFunctions show multiple 
examples of this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

