Re: Impala Column Masking Behavior Design

2019-11-12 Thread Todd Lipcon
I'd agree that applying it at the innermost column ref makes the most sense
from a security perspective. Otherwise it's trivial to "binary search" your
way to the value of a masked column, even if the masking is
completely "xed" out.

I'm surprised to hear that DB2 implements it otherwise, though quick
googling agrees with that. Perhaps the assumption there is that anyone who
binary-searches to expose data will be caught by audit or other
security features.
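
For reference, the table masking Quanlong describes for behavior (a) below
amounts to rewriting the table reference itself, so every predicate only ever
sees the masked value. A rough sketch -- the mask_show_last_4() name here is
purely illustrative, not the actual Ranger/Hive masking expression:

  -- For user X,  SELECT ... FROM customer WHERE phone = '123456789'
  -- is effectively evaluated as:
  SELECT name, phone FROM (
    SELECT name, mask_show_last_4(phone) AS phone FROM customer
  ) customer
  WHERE phone = '123456789';   -- matches nothing once the mask is applied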

-Todd

On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong 
wrote:

> I think compatibility with Hive is pretty important - the default
> expectation will be that Ranger policies behave consistently across SQL
> engines. I think it would be hard to argue for differing default behaviour
> if it's in some sense less secure.
>
> On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab 
> wrote:
>
> > Hey Quanlong,
> >
> > For me it seems more important not to leak confidential information so
> I'd
> > vote for (a). I wonder what others think.
> >
> > Gabor
> >
> > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang 
> > wrote:
> >
> > > Hi all,
> > >
> > > We are adding the support for Ranger column masking and need to reach a
> > > consensus on the behavior design.
> > >
> > > A column masking policy is something like "only show the last 4 chars of
> > > the phone column to user X". When user X reads the phone column, the value
> > > would be something like "x6789" instead of the real value "123456789".
> > >
> > > The behavior is clear when the query is simple. However, there are two
> > > possible behaviors when the query contains subqueries. The key question is
> > > where we should perform the masking: in the outermost select list, or in
> > > the select list of the innermost subquery.
> > >
> > > To be specific, consider these two queries:
> > > (1) subquery contains predicates on unmasked value
> > >   SELECT concat(name, phone) FROM (
> > > SELECT name, phone FROM customer WHERE phone = '123456789'
> > >   ) t;
> > > (2) subquery contains predicates on masked value
> > >   SELECT concat(name, phone) FROM (
> > > SELECT name, phone FROM customer WHERE phone = 'x6789'
> > >   ) t;
> > >
> > > Let's say there's actually one row in table 'customer' satisfying
> phone =
> > > '123456789'. When user X runs the queries, the two different behaviors
> > are:
> > > (a) Query1 returns nothing. Query2 returns one result: "Bobx6789".
> > > (b) Query1 returns one result: "Bobx6789". Query2 returns nothing.
> > >
> > > Hive follows behavior (a), since it does table masking that replaces the
> > > TableRef with a subquery containing the masked columns. See the code here:
> > >
> > >
> >
> https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > and some experiments I did:
> > >
> > >
> >
> https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > >
> > > Kurt mentions that traditional DBs like DB2 follow behavior (b). I think
> > > we need to decide which behavior we'd like to support. The pro of behavior
> > > (a) is that there is no security leak, because user X can't guess whether
> > > there are any customers with phone number '123456789'. The pro of behavior
> > > (b) is that users don't need to rewrite their existing queries after an
> > > admin applies column masking policies.
> > >
> > > What do you think?
> > >
> > > Thanks,
> > > Quanlong
> > >
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Ubuntu 18.04 in pre-merge tests?

2019-06-24 Thread Todd Lipcon
On Sun, Jun 23, 2019 at 8:23 PM Jim Apple  wrote:

> >
> > Generally I think precommit running on something closer to the oldest
> > supported OS is better than running on the newest, since it's more likely
> > that new OSes are backward-compatible. Otherwise it's very easy to
> > introduce code that uses features not available on el7, for example.
> >
>
> I find that argument compelling. Do you think we should switch the
> pre-commit job to CentOS 7?
>

Seems reasonable to me, if someone wants to volunteer the time to do it :)
Perhaps others have some reason why it wouldn't work well, though.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Ubuntu 18.04 in pre-merge tests?

2019-06-20 Thread Todd Lipcon
Alexey and I looked at this today and realized the issue is with OpenSSL
1.1.1, which adds support for TLS 1.3. This breaks the TLS negotiation in
the krpc library. Likely Impala's usage of krpc would also break in this
environment when wire encryption is enabled. I put up a temporary fix
(disable TLS 1.3) here: http://gerrit.cloudera.org:8080/13683

Likely we need to cross-port this to Impala's krpc copy as well.

-Todd

On Wed, Jun 19, 2019 at 3:30 PM Alexey Serbin  wrote:

> Yep, some time ago over a weekend I started with an attempt to get a
> Fedora 29 machine, but I got stuck while trying to provision such a
> thing.  I.e., the machine was eventually provisioned, but I could not
> access it.  That was where I left it.
>
> Having Ubuntu 18 as a target machine is better since at least it's easier
> to create one for me.  I've provisioned one already and I'm starting a Kudu
> build there in an attempt to take a look at the issue later tonight.
>
> I'll keep you posted on my findings.
>
>
> Kind regards,
>
> Alexey
>
> On Wed, Jun 19, 2019 at 2:53 PM Todd Lipcon  wrote:
>
>> This same issue was reported a month or two ago for Kudu on Fedora 29. I
>> think Alexey Serbin had started to look into it. Alexey, did we figure out
>> what was going on here?
>>
>> -Todd
>>
>> On Wed, Jun 19, 2019 at 6:00 AM Laszlo Gaal 
>> wrote:
>>
>>> Having looked at the failing build Jim quoted above, the failure seems to
>>> come from the security area.
>>> This is from the Kudu master's log, from the startup sequence (see
>>>
>>> https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/16/artifact/Impala/logs_static/logs/cluster/cdh6-node-1/kudu/master/kudu-master.INFO/*view*/
>>> ),
>>> all this in the context of an Impala minicluster:
>>>
>>> I0612 04:12:56.129866  8515 sys_catalog.cc:424] T
>>>  P 58a05ce6efa74b30907ac4d679bd0515
>>> [sys.catalog]: configured and running, proceeding with master startup.
>>> W0612 04:12:56.130080  8522 catalog_manager.cc:1113] T
>>>  P 58a05ce6efa74b30907ac4d679bd0515:
>>> acquiring CA information for follower catalog manager: Not found: root CA
>>> entry not found
>>> W0612 04:12:56.130123  8522 catalog_manager.cc:596] Not found: root CA
>>> entry not found: failed to prepare follower catalog manager, will retry
>>> I0612 04:12:56.130151  8521 catalog_manager.cc:1055] Loading table and
>>> tablet metadata into memory...
>>> I0612 04:12:56.130228  8521 catalog_manager.cc:1066] Initializing Kudu
>>> internal certificate authority...
>>> W0612 04:12:56.167639  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50174: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.170145  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50176: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.172571  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50178: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.182530  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50180: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.185034  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50182: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.187453  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50184: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> I0612 04:12:56.197146  8521 catalog_manager.cc:950] Generated new
>>> certificate authority record
>>> I0612 04:12:56.198005  8521 catalog_manager.cc:1075] Loading token
>>> signing
>>> keys...
>>> W0612 04:12:56.293697  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 127.0.0.1:50186: expected TLS_HANDSHAKE step: SASL_INITIATE
>>> W0612 04:12:56.295320  8636 negotiation.cc:320] Unauthorized connection
>>> attempt: Server connection negotiation failed: server connection from
>>> 

Re: Ubuntu 18.04 in pre-merge tests?

2019-06-19 Thread Todd Lipcon
c:587] Failed to heartbeat to
> 127.0.0.1:7051 (62 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:14:56.556850  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (122 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:15:56.694403  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (182 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:16:56.826400  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (242 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:17:56.955927  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (302 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:18:57.103503  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (362 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:19:57.237712  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (422 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:20:57.393489  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (482 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:21:57.522513  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (542 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:22:57.652271  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (602 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:23:57.782537  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (662 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
> W0612 04:24:57.910481  8897 heartbeater.cc:587] Failed to heartbeat to
> 127.0.0.1:7051 (722 consecutive failures): Not authorized: Failed to ping
> master at 127.0.0.1:7051: Client connection negotiation failed: client
> connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected
> TLS_HANDSHAKE step: SASL_INITIATE
>
>
> On Mon, Jun 17, 2019 at 9:08 PM Todd Lipcon  wrote:
>
> > On Sat, Jun 15, 2019 at 2:20 PM Jim Apple  wrote:
> >
> > > My goal is to have Impala keep up with (what I perceive to be) the most
> > > popular version of the most popular Linux distribution, for the purpose
> > of
> > > easing the workflow of developers, especially new developers.
> > >
> >
> > Sure, that makes sense. I use Ubuntu 18 myself, but tend to develop
> Impala
> > on a remote box running el7 because the dev environment is too
> heavy-weight
> > to realistically run on my laptop.
> >
> >
> > >
> > > 18.04 stopped being able to load data some time between June 9th and
> > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/14/ an

Re: Ubuntu 18.04 in pre-merge tests?

2019-06-17 Thread Todd Lipcon
On Sat, Jun 15, 2019 at 2:20 PM Jim Apple  wrote:

> My goal is to have Impala keep up with (what I perceive to be) the most
> popular version of the most popular Linux distribution, for the purpose of
> easing the workflow of developers, especially new developers.
>

Sure, that makes sense. I use Ubuntu 18 myself, but tend to develop Impala
on a remote box running el7 because the dev environment is too heavy-weight
to realistically run on my laptop.


>
> 18.04 stopped being able to load data some time between June 9th and
> https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/14/ and June 12
> and
>
> https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/16/artifact/Impala/logs_static/logs/data_loading/catalogd.ERROR/*view*/
> .
> I tried reproducing the June 9 run with the same git checkouts (Impala and
> Impala-LZO) as #14 today, and data loading still failed.
>
> What RHEL 7 components did you have in mind that are closer to Ubuntu 16.04
> than 18.04?
>

Stuff like libc, openssl, krb5, sasl, etc are pretty different
version-wise. At least, I know when we made Kudu pass tests on Ubuntu 18,
we dealt with issues mostly in those libraries, which aren't part of the
toolchain (for security reasons we rely on OS-provided libs).

Generally I think precommit running on something closer to the oldest
supported OS is better than running on the newest, since it's more likely
that new OSes are backward-compatible. Otherwise it's very easy to
introduce code that uses features not available on el7, for example.


>
> On Wed, May 22, 2019 at 10:41 AM Todd Lipcon  wrote:
>
> > On Mon, May 20, 2019 at 8:36 PM Jim Apple  wrote:
> >
> > > Maybe now would be a good time to implement Everblue jobs that ping
> dev@
> > > when they fail. Thoughts?
> > >
> >
> > Mixed feelings on that. We already get many test runs per day of the
> > "default" config because people are running precommit builds. Adding an
> > additional cron-based job to the mix that runs the same builds doesn't
> seem
> > like it adds much unless it tests some other config (eg Ubuntu 18 or a
> > longer suite of tests). One thing I could get on board with would be
> > switching the precommit builds to run just "core" tests or some other
> > faster subset, and defer the exhaustive/long runs to scheduled builds or
> as
> > an optional precommit for particularly invasive patches. I think that
> would
> > increase dev quality of life substantially (I find my productivity is
> often
> > hampered by only getting two shots at a precommit run per work day).
> >
> > I'm not against adding a cron-triggered full test/build on Ubuntu 18, but
> > would like to know if someone plans to sign up to triage it when it
> fails.
> > My experience with other Apache communities is that collective ownership
> > over test triage duty (ie "email the dev list on failure" doesn't work. I
> > seem to recall we had such builds back in 2010 or so on Hadoop and they
> > just always got ignored. In various "day job" teams I've seen this work
> via
> > a prescriptive rotation ("all team members take a triage/build-cop
> shift")
> > but that's not really compatible with the nature of Apache projects being
> > volunteer communities.
> >
> > So, I think I'll put the question back to you: as a committer you can
> spend
> > your time as you like. If you think an Ubuntu 18 job running on a
> schedule
> > would be useful and willing to sign up to triage failures, sounds great
> to
> > me :) Personally I don't develop on Ubuntu 18 and in my day job it's not
> a
> > particularly important deployment platform, so I personally don't think
> > I'll spend much time triaging that build.
> >
> > Todd
> >
> >
> > >
> > > On Mon, May 20, 2019 at 9:09 AM Todd Lipcon  wrote:
> > >
> > > > Adding a build-only job for 18.04 makes sense to me. A full test run
> on
> > > > every precommit seems a bit expensive but doing one once a week or
> > > > something like that might be a good idea to prevent runtime
> > regressions.
> > > >
> > > > As for switching the precommit from 16.04 to 18.04, I'd lean towards
> > > > keeping to 16.04 due to it being closer in terms of component
> versions
> > to
> > > > common enterprise distros like RHEL 7.
> > > >
> > > > -Todd
> > > >
> > > > On Sun, May 19, 2019 at 5:03 PM Jim Apple 
> wrote:
> > > >
> > > > > HEAD now passes on Ub

Re: Ubuntu 18.04 in pre-merge tests?

2019-05-22 Thread Todd Lipcon
On Mon, May 20, 2019 at 8:36 PM Jim Apple  wrote:

> Maybe now would be a good time to implement Everblue jobs that ping dev@
> when they fail. Thoughts?
>

Mixed feelings on that. We already get many test runs per day of the
"default" config because people are running precommit builds. Adding an
additional cron-based job to the mix that runs the same builds doesn't seem
like it adds much unless it tests some other config (eg Ubuntu 18 or a
longer suite of tests). One thing I could get on board with would be
switching the precommit builds to run just "core" tests or some other
faster subset, and defer the exhaustive/long runs to scheduled builds or as
an optional precommit for particularly invasive patches. I think that would
increase dev quality of life substantially (I find my productivity is often
hampered by only getting two shots at a precommit run per work day).

I'm not against adding a cron-triggered full test/build on Ubuntu 18, but
would like to know if someone plans to sign up to triage it when it fails.
My experience with other Apache communities is that collective ownership
over test triage duty (ie "email the dev list on failure") doesn't work. I
seem to recall we had such builds back in 2010 or so on Hadoop and they
just always got ignored. In various "day job" teams I've seen this work via
a prescriptive rotation ("all team members take a triage/build-cop shift")
but that's not really compatible with the nature of Apache projects being
volunteer communities.

So, I think I'll put the question back to you: as a committer you can spend
your time as you like. If you think an Ubuntu 18 job running on a schedule
would be useful and willing to sign up to triage failures, sounds great to
me :) Personally I don't develop on Ubuntu 18 and in my day job it's not a
particularly important deployment platform, so I personally don't think
I'll spend much time triaging that build.

Todd


>
> On Mon, May 20, 2019 at 9:09 AM Todd Lipcon  wrote:
>
> > Adding a build-only job for 18.04 makes sense to me. A full test run on
> > every precommit seems a bit expensive but doing one once a week or
> > something like that might be a good idea to prevent runtime regressions.
> >
> > As for switching the precommit from 16.04 to 18.04, I'd lean towards
> > keeping to 16.04 due to it being closer in terms of component versions to
> > common enterprise distros like RHEL 7.
> >
> > -Todd
> >
> > On Sun, May 19, 2019 at 5:03 PM Jim Apple  wrote:
> >
> > > HEAD now passes on Ubuntu 18.04:
> > >
> > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/
> > >
> > > Thanks to the community members who have made this happen!
> > >
> > > Should we add Ubuntu 18.04 to our pre-merge Jenkins job, replace 16.04
> > with
> > > 18.04 in our pre-merge Jenkins job, or neither?
> > >
> > > I propose adding 18.04 for now (and so running both 16.04 and 18.04 on
> > > merge) and removing 16.04 when it starts to become inconvenient.
> > >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Ubuntu 18.04 in pre-merge tests?

2019-05-20 Thread Todd Lipcon
Adding a build-only job for 18.04 makes sense to me. A full test run on
every precommit seems a bit expensive but doing one once a week or
something like that might be a good idea to prevent runtime regressions.

As for switching the precommit from 16.04 to 18.04, I'd lean towards
keeping to 16.04 due to it being closer in terms of component versions to
common enterprise distros like RHEL 7.

-Todd

On Sun, May 19, 2019 at 5:03 PM Jim Apple  wrote:

> HEAD now passes on Ubuntu 18.04:
>
> https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/
>
> Thanks to the community members who have made this happen!
>
> Should we add Ubuntu 18.04 to our pre-merge Jenkins job, replace 16.04 with
> 18.04 in our pre-merge Jenkins job, or neither?
>
> I propose adding 18.04 for now (and so running both 16.04 and 18.04 on
> merge) and removing 16.04 when it starts to become inconvenient.
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Enabled backend tests for UBSAN

2019-05-06 Thread Todd Lipcon
That's great to hear. Has anyone tried running any BE tests with TSAN in
recent years? It's been invaluable for Kudu, and while I'm sure it would
take a bit of effort to get passing on Impala, the ongoing returns would be
worth it IMO.

One point, though, is that for TSAN to work well, you need to instrument
thirdparty libs as well, so we'd need to start from native-toolchain and
work our way up.

-Todd

On Mon, May 6, 2019 at 10:49 AM Tim Armstrong 
wrote:

> Hi All,
>   I turned on backend tests under UBSAN in our precommit tests - Jim Apple
> let me know that they were passing after he fixed the final set of issues
> there.
>
> - Tim
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Remote read testing in precommit

2019-04-25 Thread Todd Lipcon
Perhaps simplest is to just check for the existence of that script, and if
it doesn't exist, 'exit 0' from the job, so it doesn't get marked failed?

On Thu, Apr 25, 2019 at 8:40 PM Tim Armstrong 
wrote:

> I'm sorry about that - I should have thought about the 2.x branch! I rolled
> back the config change for now and will come up with a plan to skip the
> tests on 2.x.
>
> On Thu, Apr 25, 2019 at 6:24 PM Quanlong Huang 
> wrote:
>
> > Sorry to be late. Can we skip the ubuntu-16.04-dockerised-tests job for
> > branch 2.x or add an option to disable it? Just hit a failure due to
> this:
> > https://jenkins.impala.io/job/gerrit-verify-dryrun/4064/
> > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/72/console
> >
> > File ./bin/jenkins/dockerized-impala-bootstrap-and-test.sh is not found
> in
> > branch 2.x, so it will eventually fail.
> >
> > Thanks,
> > Quanlong
> >
> > On Thu, Apr 25, 2019 at 2:37 AM Tim Armstrong 
> > wrote:
> >
> > > I tested it here:
> > > https://jenkins.impala.io/job/parallel-all-tests-tarmstrong/ and it
> > works
> > > fine, so I made the corresponding change in precommit at
> > >
> > >
> >
> https://jenkins.impala.io/job/parallel-all-tests/jobConfigHistory/showDiffFiles?timestamp1=2019-01-18_01-08-25&timestamp2=2019-04-24_18-35-23
> > >
> > > Let me know if you see any issues.
> > >
> > > On Mon, Apr 22, 2019 at 2:19 PM Lars Volker  wrote:
> > >
> > > > +1 for turning it on
> > > >
> > > > On Mon, Apr 22, 2019 at 2:14 PM Tim Armstrong <
> tarmstr...@cloudera.com
> > >
> > > > wrote:
> > > >
> > > > > It's been stable for a while now, with the exception of hitting a
> > flaky
> > > > > test that is also flaky on the non-dockerised minicluster
> > > (IMPALA-8124) -
> > > > > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/
> > > > >
> > > > > Are there any objections to me modifying parallel-all-tests and
> > > therefore
> > > > > precommit to run this job? I'll wait a couple of days for lazy
> > > consensus
> > > > > then go ahead.
> > > > >
> > > > > Thanks,
> > > > > Tim
> > > > >
> > > > > On Fri, Apr 5, 2019 at 3:03 PM Lars Volker 
> wrote:
> > > > >
> > > > > > +1, thanks for working on this!
> > > > > >
> > > > > > On Fri, Apr 5, 2019 at 11:18 AM Jim Apple 
> > > wrote:
> > > > > >
> > > > > > > I'm in favor. Given the importance of remote reads, I would
> even
> > be
> > > > in
> > > > > > > favor of these if it DID extend the critical path.
> > > > > > >
> > > > > > > On Fri, Apr 5, 2019 at 10:41 AM Tim Armstrong <
> > > > tarmstr...@cloudera.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > This is really about testing the dockerised minicluster, but
> > > gives
> > > > us
> > > > > > > > coverage of remote read code paths for free, and more people
> > care
> > > > > about
> > > > > > > > that right now.
> > > > > > > >
> > > > > > > > I got the core end-to-end tests passing locally as part of
> > > > > > > > https://issues.apache.org/jira/browse/IMPALA-7995. That
> change
> > > is
> > > > up
> > > > > > for
> > > > > > > > review here https://gerrit.cloudera.org/c/12639/. The next
> > step
> > > is
> > > > > to
> > > > > > > get
> > > > > > > > a
> > > > > > > > Jenkins job running, which I've been working on.
> > > > > > > >
> > > > > > > > I'd like to run it regularly so we can catch any regressions.
> > > > > Initially
> > > > > > > > I'll just have it email me when it fails, but after it's
> stable
> > > > for a
> > > > > > > week
> > > > > > > > or two I'd like to make it part of the regular set of jobs.
> > > > > > > >
> > > > > > > > My preference is to run it as part of the precommit jobs, in
> > > > parallel
> > > > > > to
> > > > > > > > the Ubuntu 16.04 tests. It should not extend the critical
> path
> > of
> > > > > > > precommit
> > > > > > > > because it only runs the end-to-end tests. We could
> > alternatively
> > > > run
> > > > > > it
> > > > > > > as
> > > > > > > > a scheduled post-commit job, but that tends to create
> > additional
> > > > work
> > > > > > > when
> > > > > > > > it breaks.
> > > > > > > >
> > > > > > > > What do people think?
> > > > > > > >
> > > > > > > > - Tim
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Remove 'hive-benchmark' workload?

2019-04-25 Thread Todd Lipcon
Posted a review: http://gerrit.cloudera.org:8080/13117

On Thu, Apr 25, 2019 at 8:37 AM Bharath Vissapragada 
wrote:

> +1, never used it!
>
> On Thu, Apr 25, 2019 at 8:20 AM Tim Armstrong 
> wrote:
>
> > +1 - do it!
> >
> > On Thu, Apr 25, 2019 at 1:43 AM Todd Lipcon  wrote:
> >
> > > It seems this is some ancient workload that hasn't been touched since
> > 2013
> > > and has rotted:
> > > - fails to load because the name has a '-' in it, and we use unquoted
> > > database names in the load script
> > > - the SQL to load it seems to refer to testdata files which don't exist
> > >
> > > Even if it did load, it looks like it's only a set of ~10 trivial
> queries
> > > which I doubt have any real benchmark interest these days.
> > >
> > > Anyone mind if I excise this cruft?
> > >
> > > -Todd
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Remove 'hive-benchmark' workload?

2019-04-25 Thread Todd Lipcon
It seems this is some ancient workload that hasn't been touched since 2013
and has rotted:
- fails to load because the name has a '-' in it, and we use unquoted
database names in the load script
- the SQL to load it seems to refer to testdata files which don't exist

Even if it did load, it looks like it's only a set of ~10 trivial queries
which I doubt have any real benchmark interest these days.

Anyone mind if I excise this cruft?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: A problem of left outer join when statement contains two aggregation for the same column

2019-04-04 Thread Todd Lipcon
Sounds like we should file a high priority JIRA (any "wrong results" bugs
should probably be considered critical or blocker). Quanlong, any interest
in working on this issue?
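
As a side note on the mechanics in Csaba's analysis quoted below: the
planner-generated predicate sum(amount) = sum(amount) is what drops the
outer-joined row, because a comparison between two NULLs yields NULL rather
than TRUE. A minimal illustration, independent of the tables in the report:

  SELECT NULL = NULL;
  -- returns NULL, which a WHERE/other-join predicate treats as "not true",
  -- so the row whose aggregates are NULL is filtered out and count(1) = 1.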

-Todd

On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang 
wrote:

> +1 for Csaba's analysis. Looks similar to this:
> https://issues.apache.org/jira/browse/IMPALA-3126
>
> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer 
> wrote:
>
> > Hi!
> >
> > I have checked the queries, and I can verify that Impala incorrectly
> > returns 1 row while the same query with Hive (or common sense..) returns
> 2
> > rows.
> >
> > > "but if remove the "t2.amount2"  like this:"
> > Indeed, the issue seems to be related to returning the same aggregate
> twice
> > + the fact that one of these values is NULL. The planner introduces a
> > predicate that checks if amount1=amount2, which is false if both values
> > are NULL:
> >
> > explain select * from (select t2.a_id,t2.amount1,t2.amount2
> > from( select a_id from a) t1
> > left outer join (
> > select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > on t1.a_id = t2.a_id) t;
> >
> > results in:
> > PLAN-ROOT SINK
> > |
> > 05:HASH JOIN [RIGHT OUTER JOIN]
> > |  hash predicates: c.a_id = a_id
> > |  other predicates: sum(amount) = sum(amount) <- I don't know why
> this
> > predicate is added.
> > |  runtime filters: RF000 <- a_id
> > |  row-size=16B cardinality=2
> > 
> >
> > If I EXPLAIN the query without the outer select, the sum(amount) =
> > sum(amount) is not added, which explains the difference.
> >
> > I do not know why the planner adds this predicate; my guess is that this
> is
> > some kind of bug in Impala.
> >
> >
> > On Thu, Apr 4, 2019 at 2:27 PM skyyws  wrote:
> >
> > > Hi all, I met a problem with left outer join recently, and I reproduced
> > > this problem with some simple test data using three tables a, b, c:
> > > table A
> > > +--+
> > > | a_id |
> > > +--+
> > > | 1|
> > > | 2|
> > > +--+
> > > table B
> > > +--++
> > > | b_id | amount |
> > > +--++
> > > | 1| 10 |
> > > | 1| 20 |
> > > | 2| NULL   |
> > > +--++
> > > table C
> > > +--+--+
> > > | a_id | b_id |
> > > +--+--+
> > > | 1| 1|
> > > | 2| 2|
> > > +--+--+
> > > The sql below:
> > > select count(1) from (
> > > select t2.a_id,t2.amount1,t2.amount2
> > > from( select a_id from a) t1
> > > left outer join (
> > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > on t1.a_id = t2.a_id
> > > ) t;
> > > +--+
> > > | count(1) |
> > > +--+
> > > | 1|
> > > +--+
> > > but if I remove the "t2.amount2", like this:
> > > select count(1) from (
> > > select t2.a_id,t2.amount1
> > > from( select a_id from a) t1
> > > left outer join (
> > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > on t1.a_id = t2.a_id
> > > ) t;
> > > +--+
> > > | count(1) |
> > > +--+
> > > | 2|
> > > +--+
> > > Here are the results of the two subqueries without count(1):
> > > +--+-+-+
> > > | a_id | amount1 | amount2 |
> > > +--+-+-+
> > > | 1| 30  | 30  |
> > > | 2| NULL| NULL|
> > > +--+-+-+
> > > Why is the count(1) of this result set 1?
> > > +--+-+
> > > | a_id | amount1 |
> > > +--+-+
> > > | 1| 30  |
> > > | 2| NULL|
> > > +--+-+
> > > Why is the count(1) of this result set 2?
> > > I want to ask why the first SQL returns just 1 but the second returns 2.
> > > Is this correct, or is this an Impala bug? How does Impala handle the
> > > count aggregation? If I change the sum to another aggregate function like
> > > count/max/min, the result is the same. I tested this on versions 2.12.0
> > > and 3.1.0.
> > >
> > >
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: +2ability on gerrit

2019-03-29 Thread Todd Lipcon
Ah, try now. Seems I added you to the wrong group

On Fri, Mar 29, 2019 at 10:33 AM Jim Apple  wrote:

> Hm, didn't seem to work.
>
> On Thu, Mar 28, 2019 at 9:28 AM Todd Lipcon  wrote:
>
> > I think you're all set. Give it a shot?
> >
> > -Todd
> >
> > On Tue, Mar 26, 2019 at 8:05 PM Jim Apple  wrote:
> >
> > > Hello! I am now using "jbapple", not "jbapple-cloudera", as my gerrit
> > > handle. Can someone with an admin login give me the auths to +2
> changes?
> > >
> > > Thanks in advance!
> > >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: +2ability on gerrit

2019-03-28 Thread Todd Lipcon
I think you're all set. Give it a shot?

-Todd

On Tue, Mar 26, 2019 at 8:05 PM Jim Apple  wrote:

> Hello! I am now using "jbapple", not "jbapple-cloudera", as my gerrit
> handle. Can someone with an admin login give me the auths to +2 changes?
>
> Thanks in advance!
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Runtime filter publishing from a join node to its cousin

2019-01-30 Thread Todd Lipcon
 always start before or
> concurrently
>with the join build with the subtree containing the destination scan
> node.
>
>
Yeah, definitely merits some further thought so we don't end up with such a
wait-cycle either due to explicit data dependencies or due to our
scheduling policies.

BTW, on the plus side, my example query went from taking 387sec without my
optimization to <1sec with the optimization. TPCDS Q64 SF1000 on a 9-node
cluster goes from 44sec to 8sec. A good number of the other plans seem to
have gotten new RFs as well, so re-running them to see if there are any
other big speedups:

Files query11.sql.explain-old and query11.sql.explain-new differ
Files query17.sql.explain-old and query17.sql.explain-new differ
Files query1.sql.explain-old and query1.sql.explain-new differ
Files query25.sql.explain-old and query25.sql.explain-new differ
Files query29.sql.explain-old and query29.sql.explain-new differ
Files query32.sql.explain-old and query32.sql.explain-new differ
Files query39.sql.explain-old and query39.sql.explain-new differ
Files query58.sql.explain-old and query58.sql.explain-new differ
Files query64.sql.explain-old and query64.sql.explain-new differ
Files query65.sql.explain-old and query65.sql.explain-new differ
Files query72.sql.explain-old and query72.sql.explain-new differ
Files query74.sql.explain-old and query74.sql.explain-new differ
Files query83.sql.explain-old and query83.sql.explain-new differ
Files query84.sql.explain-old and query84.sql.explain-new differ
Files query92.sql.explain-old and query92.sql.explain-new differ
Files query95.sql.explain-old and query95.sql.explain-new differ

-Todd


>
> On Thu, Jan 31, 2019 at 7:24 AM Philip Zeyliger 
> wrote:
>
> > (If folks are interested in a refresher,
> >
> >
> https://impala.apache.org/docs/build/html/topics/impala_runtime_filtering.html
> > is useful.)
> >
> > I stared at that code for a bit, and I agree with you that it's
> plausible.
> > I'm also confused by the "bottom-up" comment of generateFilters(): it
> seems
> > like we walk the plan depth first, and the assignment happens in that
> > depth-first walk, before all the runtime filters are generated.
> >
> > A concern is making sure that there are no outer joins impacting
> > correctness.
> >
> > -- Philip
> >
> > On Wed, Jan 30, 2019 at 9:36 PM Todd Lipcon  wrote:
> >
> > > Hey folks,
> > >
> > > I've been digging into a couple of TPCDS queries that are unexpectedly
> > slow
> > > and uncovered what seems to be some surprising behavior in the planner
> > > concerning runtime filter assignment. Consider the following query in
> the
> > > TPCDS schema:
> > >
> > > select straight_join *
> > >   from promotion a
> > >   join item b on a.p_item_sk = b.i_item_sk
> > >   join [shuffle] store_sales c on b.i_item_sk = c.ss_item_sk
> > >   where b.i_color = 'xxx'
> > >
> > > This generates a plan that looks like this:
> > > http://people.apache.org/~todd/plan.png
> > >
> > > Or in text form:
> > >
> > >
> >
> +-+
> > > | Explain String
> > >   |
> > >
> > >
> >
> +-+
> > > | Max Per-Host Resource Reservation: Memory=85.88MB
> > >|
> > > | Per-Host Resource Estimates: Memory=298.14GB
> > >   |
> > > |
> > >|
> > > | F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> > >|
> > > | |  Per-Host Resources: mem-estimate=0B mem-reservation=0B
> > >|
> > > | PLAN-ROOT SINK
> > >   |
> > > | |  mem-estimate=0B mem-reservation=0B
> > >|
> > > | |
> > >|
> > > | 08:EXCHANGE [UNPARTITIONED]
> > >|
> > > | |  mem-estimate=0B mem-reservation=0B
> > >|
> > > | |  tuple-ids=0,1,2 row-size=911B cardinality=14201171
> > >|
> > > | |
> > >|
> > > | F03:PLAN FRAGMENT [HASH(b.i_item_sk)] hosts=1 instances=1
> > >|
> > > | Per-Host Resources: mem-estimate=295.06GB mem-reservation=50.0

Runtime filter publishing from a join node to its cousin

2019-01-30 Thread Todd Lipcon
rs: RF000[bloom] -> b.i_item_sk
  |
| | stored statistics:
  |
| |   table: rows=30 size=28.28MB
   |
| |   columns: all
  |
| | extrapolated-rows=disabled
  |
| | parquet statistics predicates: b.i_color = 'xxx'
  |
| | parquet dictionary predicates: b.i_color = 'xxx'
  |
| | mem-estimate=880.00MB mem-reservation=0B
  |
| | tuple-ids=1 row-size=496B cardinality=3226
  |
| |
   |
| 00:SCAN HDFS [tpcds_1000_parquet.promotion a, RANDOM]
   |
|partitions=1/1 files=1 size=85.50KB
  |
|runtime filters: RF000[bloom] -> a.p_item_sk, RF002[bloom] ->
a.p_item_sk|
|stored statistics:
   |
|  table: rows=1500 size=85.50KB
  |
|  columns: all
   |
|extrapolated-rows=disabled
   |
|mem-estimate=304.00MB mem-reservation=0B
   |
|tuple-ids=0 row-size=315B cardinality=1500
   |
+-+


Here because of the equi-joins, we have a slot equivalence for all of the
'item_sk' columns. So, it seems it would be valid to take the runtime
filter RF002 (generated from b.i_item_sk at HASH JOIN node 03) and apply
it at the scan node 02 for store_sales. Notably, doing so is extremely
beneficial in this manufactured query, since the scan of 'item' takes only
a second or so to determine that in fact there are no items with color
'xxx'. This could completely short circuit the scan 02 of the very large
store_sales table.
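
To make that slot equivalence concrete, here is a hand-written query that
expresses the pruning the "cousin" runtime filter would provide -- just an
illustration of why forwarding RF002 would be semantically valid, not a
proposed rewrite:

  -- a.p_item_sk = b.i_item_sk and b.i_item_sk = c.ss_item_sk, so a filter
  -- built from b.i_item_sk (after b.i_color = 'xxx') is also a valid filter
  -- on c.ss_item_sk:
  select straight_join *
    from promotion a
    join item b on a.p_item_sk = b.i_item_sk
    join [shuffle] store_sales c on b.i_item_sk = c.ss_item_sk
    where b.i_color = 'xxx'
      and c.ss_item_sk in (select i_item_sk from item where i_color = 'xxx');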

Obviously this query is canned, but I'm seeing some cases in TPCDS (eg Q64)
where a much more complex variant of this plan ends up being generated and
failing to utilize an available runtime filter.

I spent some time digging in the code, and I think the faulty logic might
be here in RuntimeFilterGenerator.java:

  generateFilters(ctx, root.getChild(0));
  // Finalize every runtime filter of that join. This is to ensure that
we don't
  // assign a filter to a scan node from the right subtree of joinNode
or ancestor
  // join nodes in case we don't find a destination node in the left
subtree.
  for (RuntimeFilter runtimeFilter: filters)
finalizeRuntimeFilter(runtimeFilter);

I'm not quite sure how to understand the comment, but it seems this is
preventing the runtime filter at the hash join from being sent to any node
which is either one of its right descendants or higher up the tree. In other
words, it's limiting the runtime filter to _only_ be applied at nodes in
its left subtree.

Per my above example query, I think this is unnecessarily restrictive.
Based on my understanding of Impala scheduling/parallelism, all of the
build sides of joins start running in parallel from bottom up, and it's
always possible for a build to complete somewhere deep in the tree before a
build completes higher up the tree, in which case sending the RF to the
hash join's "cousin" is beneficial. I think the only necessary restriction
is that an RF should not be sent from a hash join node to any descendant of
its right child.

Keep in mind I'm very new to the Impala planner code and particularly to
the runtime filter portion thereof, so I may have missed something. But
does the above sound like a plausible bug/missed optimization?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Adding profile info from Frontend

2018-09-04 Thread Todd Lipcon
I have a WIP patch up here in case anyone's interested in taking an early
look: https://gerrit.cloudera.org/c/11388/

-Todd


On Tue, Sep 4, 2018 at 5:02 PM, Todd Lipcon  wrote:

> On Tue, Sep 4, 2018 at 4:46 PM, Andrew Sherman 
> wrote:
>
>> Hi Todd,
>>
>> I'm going to put the patch up for review any minute. After I finish first
>> time setup on https://gerrit.cloudera.org
>
>
> OK, I guess that settles whose patch goes in first, then, because I
> haven't written a line of code yet :-D
>
> -Todd
>
>
>>
>>
>> -Andrew
>>
>> On Tue, Sep 4, 2018 at 4:43 PM Todd Lipcon  wrote:
>>
>> > On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman 
>> > wrote:
>> >
>> > > Hi Todd,
>> > >
>> > > I am making a simple fix for
>> > > IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add
>> > > missing
>> > > Query Compilation section to profiles.
>> > > which would add the Timeline to all responses to createExecRequest.
>> > >
>> > > It sounds like your change goes deeper. If you go ahead with your
>> change
>> > > it  sounds like my change might be redundant.
>> > >
>> >
>> > I'm not sure if it's totally redundant. Do you have a WIP patch already?
>> > I'm not sure why the existing timeline doesn't show up in all statement
>> > types, so maybe some changes are needed related to that, and then those
>> > changes will still be necessary when exposing a full profile node?
>> >
>> > Agreed we're likely to conflict, at the least, though. Do you have an
>> > estimate of when your patch will be up for review so we can coordinate
>> > which one goes in first?
>> >
>> > -Todd
>> >
>> > >
>> > >
>> > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon  wrote:
>> > >
>> > > > Hey folks,
>> > > >
>> > > > I'm working on a patch to add some more diagnostics from the
>> planning
>> > > > process into query profiles.
>> > > >
>> > > > Currently, only the planning "Timeline" is reported back as part of
>> the
>> > > > response to createExecRequest. As part of the fetch-on-demand
>> catalog
>> > > work
>> > > > I'd like to start exposing various counters such as cache hit/miss
>> > > counts,
>> > > > time spent on remote calls to the catalog, etc. Even in the existing
>> > code
>> > > > paths, I can see some similar metrics being useful.
>> > > >
>> > > > My current thinking is to remove the 'timeline'  (TEventSequence)
>> field
>> > > > from TExecRequest and replace it with a full TRuntimeProfileNode.
>> I'd
>> > > then
>> > > > add some capability in the Java side to fill in counters, etc, in
>> this
>> > > > structure.
>> > > >
>> > > > Any concerns with this approach before I go down this path? Are
>> there
>> > any
>> > > > compatibility guarantees I need to uphold with the profile output of
>> > > > queries?
>> > > >
>> > > > -Todd
>> > > > --
>> > > > Todd Lipcon
>> > > > Software Engineer, Cloudera
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Adding profile info from Frontend

2018-09-04 Thread Todd Lipcon
On Tue, Sep 4, 2018 at 4:46 PM, Andrew Sherman 
wrote:

> Hi Todd,
>
> I'm going to put the patch up for review any minute. After I finish first
> time setup on https://gerrit.cloudera.org


OK, I guess that settles whose patch goes in first, then, because I haven't
written a line of code yet :-D

-Todd


>
>
> -Andrew
>
> On Tue, Sep 4, 2018 at 4:43 PM Todd Lipcon  wrote:
>
> > On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman 
> > wrote:
> >
> > > Hi Todd,
> > >
> > > I am making a simple fix for
> > > IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add
> > > missing
> > > Query Compilation section to profiles.
> > > which would add the Timeline to all responses to createExecRequest.
> > >
> > > It sounds like your change goes deeper. If you go ahead with your
> change
> > > it  sounds like my change might be redundant.
> > >
> >
> > I'm not sure if it's totally redundant. Do you have a WIP patch already?
> > I'm not sure why the existing timeline doesn't show up in all statement
> > types, so maybe some changes are needed related to that, and then those
> > changes will still be necessary when exposing a full profile node?
> >
> > Agreed we're likely to conflict, at the least, though. Do you have an
> > estimate of when your patch will be up for review so we can coordinate
> > which one goes in first?
> >
> > -Todd
> >
> > >
> > >
> > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon  wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I'm working on a patch to add some more diagnostics from the planning
> > > > process into query profiles.
> > > >
> > > > Currently, only the planning "Timeline" is reported back as part of
> the
> > > > response to createExecRequest. As part of the fetch-on-demand catalog
> > > work
> > > > I'd like to start exposing various counters such as cache hit/miss
> > > counts,
> > > > time spent on remote calls to the catalog, etc. Even in the existing
> > code
> > > > paths, I can see some similar metrics being useful.
> > > >
> > > > My current thinking is to remove the 'timeline'  (TEventSequence)
> field
> > > > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd
> > > then
> > > > add some capability in the Java side to fill in counters, etc, in
> this
> > > > structure.
> > > >
> > > > Any concerns with this approach before I go down this path? Are there
> > any
> > > > compatibility guarantees I need to uphold with the profile output of
> > > > queries?
> > > >
> > > > -Todd
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Adding profile info from Frontend

2018-09-04 Thread Todd Lipcon
On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman 
wrote:

> Hi Todd,
>
> I am making a simple fix for
> IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add
> missing
> Query Compilation section to profiles.
> which would add the Timeline to all responses to createExecRequest.
>
> It sounds like your change goes deeper. If you go ahead with your change
> it  sounds like my change might be redundant.
>

I'm not sure if it's totally redundant. Do you have a WIP patch already?
I'm not sure why the existing timeline doesn't show up in all statement
types, so maybe some changes are needed related to that, and then those
changes will still be necessary when exposing a full profile node?

Agreed we're likely to conflict, at the least, though. Do you have an
estimate of when your patch will be up for review so we can coordinate
which one goes in first?

-Todd

>
>
> On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > I'm working on a patch to add some more diagnostics from the planning
> > process into query profiles.
> >
> > Currently, only the planning "Timeline" is reported back as part of the
> > response to createExecRequest. As part of the fetch-on-demand catalog
> work
> > I'd like to start exposing various counters such as cache hit/miss
> counts,
> > time spent on remote calls to the catalog, etc. Even in the existing code
> > paths, I can see some similar metrics being useful.
> >
> > My current thinking is to remove the 'timeline'  (TEventSequence) field
> > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd
> then
> > add some capability in the Java side to fill in counters, etc, in this
> > structure.
> >
> > Any concerns with this approach before I go down this path? Are there any
> > compatibility guarantees I need to uphold with the profile output of
> > queries?
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Adding profile info from Frontend

2018-09-04 Thread Todd Lipcon
On Tue, Sep 4, 2018 at 4:19 PM, Philip Zeyliger  wrote:

> I'm certainly aware of tools that read TEventSequence and expect it to have
> certain structures. If we change the profiles significantly, we should
> insert a version sigil in there so that a client could at least figure out
> that they're looking at something new.
>

OK, I'll see if I can keep the existing timeline in the same spot in the
resulting profile (potentially with additional fields in that same profile
node)


>
> A bigger modeling conversation is the use of string keys everywhere. For
> the timeline, for example, it's a list of (description, time) pairs, with
> the descriptions being strings. An alternative would have been to put the
> strings in Thrift itself. The latter, I think, makes tooling easier to
> write (since you're calling "get_query_started_time()" instead of "for
> description, time in timelines: if description == 'query started': return
> time". I recognize working with Thrift is kind of a pain, but I want to
> throw the thought out there if we're adding things to profiles.
>

Agreed -- or a set of well-known fields defined as a thrift enum instead of
relying on strings... but I don't want to bite off too much in one go here :)


>
> -- Philip
>
>
> On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > I'm working on a patch to add some more diagnostics from the planning
> > process into query profiles.
> >
> > Currently, only the planning "Timeline" is reported back as part of the
> > response to createExecRequest. As part of the fetch-on-demand catalog
> work
> > I'd like to start exposing various counters such as cache hit/miss
> counts,
> > time spent on remote calls to the catalog, etc. Even in the existing code
> > paths, I can see some similar metrics being useful.
> >
> > My current thinking is to remove the 'timeline'  (TEventSequence) field
> > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd
> then
> > add some capability in the Java side to fill in counters, etc, in this
> > structure.
> >
> > Any concerns with this approach before I go down this path? Are there any
> > compatibility guarantees I need to uphold with the profile output of
> > queries?
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Adding profile info from Frontend

2018-09-04 Thread Todd Lipcon
Hey folks,

I'm working on a patch to add some more diagnostics from the planning
process into query profiles.

Currently, only the planning "Timeline" is reported back as part of the
response to createExecRequest. As part of the fetch-on-demand catalog work
I'd like to start exposing various counters such as cache hit/miss counts,
time spent on remote calls to the catalog, etc. Even in the existing code
paths, I can see some similar metrics being useful.

My current thinking is to remove the 'timeline'  (TEventSequence) field
from TExecRequest and replace it with a full TRuntimeProfileNode. I'd then
add some capability in the Java side to fill in counters, etc, in this
structure.

Any concerns with this approach before I go down this path? Are there any
compatibility guarantees I need to uphold with the profile output of
queries?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Issues with query/fragment lifecycle on trunk?

2018-08-30 Thread Todd Lipcon
On Thu, Aug 30, 2018 at 2:48 PM, Todd Lipcon  wrote:

> On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <
> pooja.nilange...@cloudera.com> wrote:
>
>> Hi Todd,
>>
>> I believe you are right. There are a couple of other race conditions in
>> the
>> HdfsScanNode which have resulted in broken-builds in the last couple of
>> weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for
>> IMPALA-7335, Tim and Dan suggested that we need to cleanup the state
>> invariants of the HdfsScanNode to find and fix other such races. I am
>> going
>> to create a JIRA to track it, I could add this issue to the list of races
>> to be fixed.  If this issue occurs more frequently, we could add a
>> temporary fix for now. Let me know what you'd suggest.
>>
>
> Makes sense. I'm currently trying to put together a simple test and fix
> that targets this race in particular, but definitely agree that the
> interactions between the scanner threads and scan node are hard to reason
> about and could be improved more generally.
>
> For now I'm trying to use a DebugAction to inject probabilistic failures
> into the soft memory limit checks to see if I can reproduce it more easily.
>

Put this up here with a fix: https://gerrit.cloudera.org/c/11369/

-Todd


>
> -Todd
>
>
>> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon  wrote:
>>
>> > I spent another 45 minutes looking at the core and think I figured this
>> > out. I found that for the stuck scanner, both of its scanner threads had
>> > run and exited, but for some reason it was never marked as "done".
>> >
>> > I think this might be a regression due to IMPALA-7096
>> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
>> > following code:
>> >
>> > // Stop extra threads if we're over a soft limit in order to free up
>> > memory.
>> > if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT))
>> {
>> >   break;
>> > }
>> >
>> > // Done with range and it completed successfully
>> > if (progress_.done()) {
>> >   // All ranges are finished.  Indicate we are done.
>> >   SetDone();
>> >   break;
>> > }
>> >
>> > if (scan_range == nullptr && num_unqueued_files == 0) {
>> >   unique_lock l(lock_);
>> >   // All ranges have been queued and DiskIoMgr has no more new
>> ranges
>> > for this scan
>> >   // node to process. This means that every range is either done or
>> > being processed by
>> >   // another thread.
>> >   all_ranges_started_ = true;
>> >   break;
>> > }
>> >   }
>> >
>> > What if we have the following scenario:
>> >
>> > T1) grab scan range 1 and start processing
>> >
>> > T2) grab scan range 2 and start processing
>> >
>> > T1) finish scan range 1 and see that 'progress_' is not done()
>> > T1) loop around, get no scan range (there are no more), so set
>> > all_ranges_started_ and break
>> > T1) thread exits
>> >
>> > T2) finish scan range 2
>> > T2) happen to hit a soft memory limit error due to pressure from other
>> exec
>> > nodes, etc. Since we aren't the first thread, we break. (even though the
>> > first thread is no longer running)
>> > T2) thread exits
>> >
>> > Note that no one got to the point of calling SetDone() because we break
>> due
>> > to the memory limit error _before_ checking progress_.Done().
>> >
>> > Thus, the query will hang forever.
>> >
>> > It seems to me that the order of these conditions needs to be
>> rejiggered:
>> > the breaking due to memory limit should be the last thing in the loop so
>> > that, if we were the thread to finish the last scan range, we properly
>> > SetDone before exiting.
>> >
>> > Does that make sense? If so, I'll file a JIRA. Suggestions also
>> welcome on
>> > how to write a test for this (do we have an easy facility to inject
>> > probabilistic mem-limit failures in that code path?)
>> >
>> > -Todd
>> >
>> >
>> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho  wrote:
>> >
>> > > FWIW, please see the discussion at
>> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck
>> IO"
>> > may
&

Re: Issues with query/fragment lifecycle on trunk?

2018-08-30 Thread Todd Lipcon
On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <
pooja.nilange...@cloudera.com> wrote:

> Hi Todd,
>
> I believe you are right. There are a couple of other race conditions in the
> HdfsScanNode which have resulted in broken builds in the last couple of
> weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for
> IMPALA-7335, Tim and Dan suggested that we need to cleanup the state
> invariants of the HdfsScanNode to find and fix other such races. I am going
> to create a JIRA to track it, I could add this issue to the list of races
> to be fixed.  If this issue occurs more frequently, we could add a
> temporary fix for now. Let me know what you'd suggest.
>

Makes sense. I'm currently trying to put together a simple test and fix
that targets this race in particular, but definitely agree that the
interactions between the scanner threads and scan node are hard to reason
about and could be improved more generally.

For now I'm trying to use a DebugAction to inject probabilistic failures
into the soft memory limit checks to see if I can reproduce it more easily.
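
Something like this is what I have in mind (just a sketch of the idea, not
Impala's actual DebugAction plumbing, and the flag name is made up):

  // Hypothetical stress flag, for illustration only:
  // DEFINE_double(stress_fail_soft_mem_limit_prob, 0.0, "...");
  bool SoftMemLimitExceededForTest(MemTracker* tracker) {
    // With some probability, pretend we're over the soft limit so that the
    // "stop extra threads" branch gets exercised in tests.
    if (FLAGS_stress_fail_soft_mem_limit_prob > 0 &&
        rand() / (RAND_MAX + 1.0) < FLAGS_stress_fail_soft_mem_limit_prob) {
      return true;
    }
    return tracker->AnyLimitExceeded(MemLimit::SOFT);
  }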

-Todd


> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon  wrote:
>
> > I spent another 45 minutes looking at the core and think I figured this
> > out. I found that for the stuck scanner, both of its scanner threads had
> > run and exited, but for some reason it was never marked as "done".
> >
> > I think this might be a regression due to IMPALA-7096
> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> > following code:
> >
> > // Stop extra threads if we're over a soft limit in order to free up
> > memory.
> > if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT))
> {
> >   break;
> > }
> >
> > // Done with range and it completed successfully
> > if (progress_.done()) {
> >   // All ranges are finished.  Indicate we are done.
> >   SetDone();
> >   break;
> > }
> >
> > if (scan_range == nullptr && num_unqueued_files == 0) {
> >   unique_lock l(lock_);
> >   // All ranges have been queued and DiskIoMgr has no more new ranges
> > for this scan
> >   // node to process. This means that every range is either done or
> > being processed by
> >   // another thread.
> >   all_ranges_started_ = true;
> >   break;
> > }
> >   }
> >
> > What if we have the following scenario:
> >
> > T1) grab scan range 1 and start processing
> >
> > T2) grab scan range 2 and start processing
> >
> > T1) finish scan range 1 and see that 'progress_' is not done()
> > T1) loop around, get no scan range (there are no more), so set
> > all_ranges_started_ and break
> > T1) thread exits
> >
> > T2) finish scan range 2
> > T2) happen to hit a soft memory limit error due to pressure from other
> exec
> > nodes, etc. Since we aren't the first thread, we break. (even though the
> > first thread is no longer running)
> > T2) thread exits
> >
> > Note that no one got to the point of calling SetDone() because we break
> due
> > to the memory limit error _before_ checking progress_.done().
> >
> > Thus, the query will hang forever.
> >
> > It seems to me that the order of these conditions needs to be rejiggered:
> > the breaking due to memory limit should be the last thing in the loop so
> > that, if we were the thread to finish the last scan range, we properly
> > SetDone before exiting.
> >
> > Does that make sense? If so, I'll file a JIRA. Suggestions also welcome
> on
> > how to write a test for this (do we have an easy facility to inject
> > probabilistic mem-limit failures in that code path?)
> >
> > -Todd
> >
> >
> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho  wrote:
> >
> > > FWIW, please see the discussion at
> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO"
> > may
> > > render cancellation in HDFS scan node ineffective. We have seen users
> > > report similar backtraces in the past due to hung RPCs to HDFS name
> > node.
> > > Not saying this is necessarily the case here.
> > >
> > > A couple of things which may be interesting to look at:
> > > - is there any scan node in the profile which doesn't finish any
> assigned
> > > scan ranges ?
> > > - if you happen to have a core, it may help to inspect the stack traces
> > of
> > > the scanner threads and the disk

Re: Issues with query/fragment lifecycle on trunk?

2018-08-30 Thread Todd Lipcon
I spent another 45 minutes looking at the core and think I figured this
out. I found that for the stuck scanner, both of its scanner threads had
run and exited, but for some reason it was never marked as "done".

I think this might be a regression due to IMPALA-7096
(7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
following code:

// Stop extra threads if we're over a soft limit in order to free up memory.
if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
  break;
}

// Done with range and it completed successfully
if (progress_.done()) {
  // All ranges are finished.  Indicate we are done.
  SetDone();
  break;
}

if (scan_range == nullptr && num_unqueued_files == 0) {
  unique_lock<mutex> l(lock_);
  // All ranges have been queued and DiskIoMgr has no more new ranges for
  // this scan node to process. This means that every range is either done
  // or being processed by another thread.
  all_ranges_started_ = true;
  break;
}
  }

What if we have the following scenario:

T1) grab scan range 1 and start processing

T2) grab scan range 2 and start processing

T1) finish scan range 1 and see that 'progress_' is not done()
T1) loop around, get no scan range (there are no more), so set
all_ranges_started_ and break
T1) thread exits

T2) finish scan range 2
T2) happen to hit a soft memory limit error due to pressure from other exec
nodes, etc. Since we aren't the first thread, we break. (even though the
first thread is no longer running)
T2) thread exits

Note that no one got to the point of calling SetDone() because we break due
to the memory limit error _before_ checking progress_.done().

Thus, the query will hang forever.

It seems to me that the order of these conditions needs to be rejiggered:
the breaking due to memory limit should be the last thing in the loop so
that, if we were the thread to finish the last scan range, we properly
SetDone before exiting.
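
Concretely, something like this (an untested sketch, just to show the ordering
I mean, reusing the member names from the snippet above):

  // Completion checks first, so the thread that finishes the last range
  // always gets a chance to call SetDone().
  if (progress_.done()) {
    // All ranges are finished.  Indicate we are done.
    SetDone();
    break;
  }

  if (scan_range == nullptr && num_unqueued_files == 0) {
    unique_lock<mutex> l(lock_);
    all_ranges_started_ = true;
    break;
  }

  // Only now stop extra threads that are over the soft memory limit.
  if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
    break;
  }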

Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on
how to write a test for this (do we have an easy facility to inject
probabilistic mem-limit failures in that code path?)

-Todd


On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho  wrote:

> FWIW, please see the discussion at
> https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
> render cancellation in HDFS scan node ineffective. We have seen users
> report similar backtraces in the past due to hung RPCs to HDFS name node.
> Not saying this is necessarily the case here.
>
> A couple of things which may be interesting to look at:
> - is there any scan node in the profile which doesn't finish any assigned
> scan ranges ?
> - if you happen to have a core, it may help to inspect the stack traces of
> the scanner threads and the disk io mgr threads to understand their states.
>
> On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > I ran into some issues with a local core test run last night. This is on
> my
> > own branch with some uncommitted work, but I haven't touched the backend
> > and these issues seem to be backend-focused.
> >
> > Initially, I noticed the problem because a test run that I started last
> > night was still "stuck" this morning. I had three queries which had been
> > running for 10 hours and not making progress. The 'fragments' page for
> the
> > query showed that one of the backends had not reported to the coordinator
> > for many hours. In attempting to debug this, I managed to get the
> > StateStore to declare that node dead, and those queries eventually were
> > cancelled due to that.
> >
> > I resumed the node that I had been debugging with gdb, and it was
> declared
> > live again. I didn't restart the process, though, which might have led to
> > further issues:
> >
> > Next, I also noticed that /queries on my coordinator node showed "Number
> of
> > running queries with fragments on this host" at 100+ on all three nodes,
> > even though no queries were running anymore. These numbers are stable. I
> > can continue to issue new queries and they complete normally. While
> > queries are running, the count of fragments goes up appropriately, and when
> > the query finishes, it goes back down. But, the "base" numbers are stuck
> at
> > {108, 391, 402} with nothing running.
> >
> > I also found that the node that had had the original problems has three
> > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > Looking in the logs, I see continuous spew that it's trying to cancel th

Issues with query/fragment lifecycle on trunk?

2018-08-30 Thread Todd Lipcon
Hey folks,

I ran into some issues with a local core test run last night. This is on my
own branch with some uncommitted work, but I haven't touched the backend
and these issues seem to be backend-focused.

Initially, I noticed the problem because a test run that I started last
night was still "stuck" this morning. I had three queries which had been
running for 10 hours and not making progress. The 'fragments' page for the
query showed that one of the backends had not reported to the coordinator
for many hours. In attempting to debug this, I managed to get the
StateStore to declare that node dead, and those queries eventually were
cancelled due to that.

I resumed the node that I had been debugging with gdb, and it was declared
live again. I didn't restart the process, though, which might have led to
further issues:

Next, I also noticed that /queries on my coordinator node showed "Number of
running queries with fragments on this host" at 100+ on all three nodes,
even though no queries were running anymore. These numbers are stable. I
can continue to issue new queries and they complete normally. While
queries are running, the count of fragments goes up appropriately, and when
the query finishes, it goes back down. But, the "base" numbers are stuck at
{108, 391, 402} with nothing running.

I also found that the node that had had the original problems has three
stuck fragments, all waiting on HdfsScanNode::GetNext:
https://gist.github.com/57a99bf4c00f1575810af924013c259d
Looking in the logs, I see continuous spew that it's trying to cancel the
fragment instances for this query, but apparently the cancellation is not
working. I talked to Michael Ho about this and it sounds like this is a
known issue with cancellation.

So, there are still two mysteries:
- why did it get stuck in the first place?
- why are my "number of running queries" counters stuck at non-zero values?

Does anything above ring a bell for anyone?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Breaking change for query hints?

2018-08-23 Thread Todd Lipcon
I'm curious: can you describe the view using Hive to see what the stored
query consists of?

On Thu, Aug 23, 2018, 7:44 PM Quanlong Huang 
wrote:

> Hi all,
>
> After we upgrade Impala from 2.5 to 2.12, we found some queries on views
> failed. The views contain a query hint which I think is the cause. I can't
> find anything about such kind of incompatibility was mentioned in
>
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html
> or
>
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html
>
> The error can be reproduced by
>
> [impala-shell:21000] > use test;
> [impala-shell:21000] > create view view_2_12 as select /*
> +`queryId=abc` */ * from video_parq;
> Query: create view view_2_12 as select /* +`queryId=abc` */ * from
> video_parq
> +------------------------+
> | summary                |
> +------------------------+
> | View has been created. |
> +------------------------+
> WARNINGS: PLAN hint not recognized: queryId=abc
>
> Fetched 1 row(s) in 24.00s
> [impala-shell:21000] > desc formatted view_2_12;
> Query: describe formatted view_2_12
> ERROR: AnalysisException: Failed to parse view-definition statement of
> view: test.view_2_12
> CAUSED BY: TableLoadingException: Failed to parse view-definition
> statement of view: test.view_2_12
>
>
> Since this is an emergency for us, I'm posting an email here for help. I'm also
> looking into the history of the .cup file and looking for a hotfix for this.
> Thanks if anyone can give some pointers!
>
> Thanks,
> Quanlong
>


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
One more parting thought: why don't we just call 'GetCatalogDelta()'
directly from the catalog callback in order to do a direct handoff, instead
of storing the results in this 'pending' struct? Given the statestore uses a
dedicated thread per subscriber (right?) it seems like it would be fine for
the update callback to take a long time, no?

-Todd

On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon  wrote:

> Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead
> of Thrift we'll alleviate some of the scalability issues and maybe we can
> look into increasing frequency or doing a "push" to the statestore, etc. I
> probably won't work on this in the near term to avoid complicating the
> ongoing changes with catalog.
>
> -Todd
>
> On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong 
> wrote:
>
>> This is somewhat relevant for admission control too - I had thought about
>> some of these issues in that context, because reducing the latency of
>> admission controls state propagation helps avoid overadmission but having
>> a
>> very low statestore frequency is very inefficient and doesn't scale well
>> to
>> larger clusters.
>>
>> For the catalog updates I agree we could do something with long polls
>> since
>> it's a single producer so that the "idle" state of the system has a thread
>> sitting in the update callback on catalogd waiting for an update.
>>
>> I'd also thought at one point about allowing subscribers to notify the
>> statestore that they had something to add to the topic. That could be
>> treated as a hint to the statestore to schedule the subscriber update
>> sooner. This would also work for admission control since coordinators
>> could
>> notify the statestore when the first query was admitted after the previous
>> statestore update.
>>
>> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:
>>
>> > Hey folks,
>> >
>> > In my recent forays into the catalog->statestore->impalad metadata
>> > propagation code base, I noticed that the latency of any update is
>> > typically between 2-4 seconds with the standard 2-second statestore
>> polling
>> > interval. That's because the code currently works as follows:
>> >
>> > 1. in the steady state with no recent metadata changes, the catalogd's
>> > state is:
>> > -- topic_updates_ready_ = true
>> > -- pending_topic_updates_ = empty
>> >
>> > 2. some metadata change happens, which modifies the version numbers in
>> the
>> > Java catalog but doesn't modify any of the C++ side state
>> >
>> > 3. the next statestore poll happens due to the normal interval
>> expiring. On
>> > average, this will take *1/2 the polling interval*
>> > -- this sees that pending_topic_updates_ is empty, so returns no
>> results.
>> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread
>> >
>> > 4. the "gather" thread wakes up and gathers updates, filling in
>> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
>> > (typically subsecond in smallish catalogs, so this happens before the
>> next
>> > poll)
>> >
>> > 5. wait *another full statestore polling interval* (2 seconds) after
>> step
>> > #3 above, at which point we deliver the metadata update to the
>> statestore
>> >
>> > 6. wait on average* 1/2 the polling interval* until any particular
>> impalad
>> > gets the update from #4
>> >
> >> > So, in the absolute best case, we wait one full polling interval (2
>> > seconds), and in the worst case we wait two polling intervals (4
>> seconds).
>> >
>> > Has anyone looked into optimizing this at all? It seems like we could
>> have
>> > metadata changes trigger an immediate "collection" into the C++ side,
>> and
>> > have the statestore update callback wait ("long poll" style) for an
>> update
>> > rather than skip if there is nothing available.
>> >
>> > -Todd
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead of
Thrift we'll alleviate some of the scalability issues and maybe we can look
into increasing frequency or doing a "push" to the statestore, etc. I
probably won't work on this in the near term to avoid complicating the
ongoing changes with catalog.

-Todd

On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong 
wrote:

> This is somewhat relevant for admission control too - I had thought about
> some of these issues in that context, because reducing the latency of
> admission controls state propagation helps avoid overadmission but having a
> very low statestore frequency is very inefficient and doesn't scale well to
> larger clusters.
>
> For the catalog updates I agree we could do something with long polls since
> it's a single producer so that the "idle" state of the system has a thread
> sitting in the update callback on catalogd waiting for an update.
>
> I'd also thought at one point about allowing subscribers to notify the
> statestore that they had something to add to the topic. That could be
> treated as a hint to the statestore to schedule the subscriber update
> sooner. This would also work for admission control since coordinators could
> notify the statestore when the first query was admitted after the previous
> statestore update.
>
> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > In my recent forays into the catalog->statestore->impalad metadata
> > propagation code base, I noticed that the latency of any update is
> > typically between 2-4 seconds with the standard 2-second statestore
> polling
> > interval. That's because the code currently works as follows:
> >
> > 1. in the steady state with no recent metadata changes, the catalogd's
> > state is:
> > -- topic_updates_ready_ = true
> > -- pending_topic_updates_ = empty
> >
> > 2. some metadata change happens, which modifies the version numbers in
> the
> > Java catalog but doesn't modify any of the C++ side state
> >
> > 3. the next statestore poll happens due to the normal interval expiring.
> On
> > average, this will take *1/2 the polling interval*
> > -- this sees that pending_topic_updates_ is empty, so returns no results.
> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread
> >
> > 4. the "gather" thread wakes up and gathers updates, filling in
> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
> > (typically subsecond in smallish catalogs, so this happens before the
> next
> > poll)
> >
> > 5. wait *another full statestore polling interval* (2 seconds) after step
> > #3 above, at which point we deliver the metadata update to the statestore
> >
> > 6. wait on average* 1/2 the polling interval* until any particular
> impalad
> > gets the update from #4
> >
> > So, in the absolute best case, we wait one full polling interval (2
> > seconds), and in the worst case we wait two polling intervals (4
> seconds).
> >
> > Has anyone looked into optimizing this at all? It seems like we could
> have
> > metadata changes trigger an immediate "collection" into the C++ side, and
> > have the statestore update callback wait ("long poll" style) for an
> update
> > rather than skip if there is nothing available.
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Improving latency of catalog update propagation?

2018-08-21 Thread Todd Lipcon
Hey folks,

In my recent forays into the catalog->statestore->impalad metadata
propagation code base, I noticed that the latency of any update is
typically between 2-4 seconds with the standard 2-second statestore polling
interval. That's because the code currently works as follows:

1. in the steady state with no recent metadata changes, the catalogd's
state is:
-- topic_updates_ready_ = true
-- pending_topic_updates_ = empty

2. some metadata change happens, which modifies the version numbers in the
Java catalog but doesn't modify any of the C++ side state

3. the next statestore poll happens due to the normal interval expiring. On
average, this will take *1/2 the polling interval*
-- this sees that pending_topic_updates_ is empty, so returns no results.
-- it sets topic_updates_ready_ = false and triggers the "gather" thread

4. the "gather" thread wakes up and gathers updates, filling in
'pending_topic_updates_' and setting 'topic_updates_ready_' back to true
(typically subsecond in smallish catalogs, so this happens before the next
poll)

5. wait *another full statestore polling interval* (2 seconds) after step
#3 above, at which point we deliver the metadata update to the statestore

6. wait on average *1/2 the polling interval* until any particular impalad
gets the update from #4

So, in the absolute best case, we wait one full polling interval (2
seconds), and in the worst case we wait two polling intervals (4 seconds).

Has anyone looked into optimizing this at all? It seems like we could have
metadata changes trigger an immediate "collection" into the C++ side, and
have the statestore update callback wait ("long poll" style) for an update
rather than skip if there is nothing available.
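
As a rough sketch of the long-poll idea (not the actual catalog-server code;
'pending_topic_updates_' is real, but the lock, condvar, and deadline names
here are just illustrative):

  // Statestore update callback: instead of returning an empty delta and
  // waiting for the next poll, block until the gather thread has produced
  // something or a deadline passes.
  std::unique_lock<std::mutex> l(topic_lock_);
  while (pending_topic_updates_.empty() &&
         topic_cv_.wait_until(l, deadline) != std::cv_status::timeout) {
  }
  if (!pending_topic_updates_.empty()) {
    // ... hand the gathered update back to the statestore ...
  }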

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


java target version in FE pom?

2018-08-20 Thread Todd Lipcon
Hey folks,

I noticed that whenever I import the FE into Eclipse it ends up defaulting
to Java 1.7, and then gets errors trying to run the tests:

F0820 15:21:17.184092 59731 frontend.cc:109]
java.lang.UnsupportedClassVersionError:
org/apache/hadoop/conf/Configuration : Unsupported major.minor version 52.0

It seems this is because the pom has the following:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.3</version>
    <configuration>
      <source>1.7</source>
      <target>1.7</target>
    </configuration>
  </plugin>


Given that Hadoop is now being built for 1.8, should we change our own
target and source versions accordingly?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Update on catalog changes

2018-08-07 Thread Todd Lipcon
t ID will never be reused with
different data. On a REFRESH or other partition metadata change, a new ID
can be assigned to the partition. This allows us to safely cache partitions
(the largest bit of metadata) with long TTLs, and to apply very cheap updates
when a change affects only a subset of partitions.
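
As a rough illustration of the property this buys us (names made up, not the
real code): an ID-keyed cache never serves stale metadata, because a changed
partition gets a fresh ID and the old entry simply ages out.

  // Sketch only: partition IDs are never reused with different data, so
  // cached entries never need explicit invalidation.
  std::unordered_map<int64_t, std::shared_ptr<const CachedPartition>> partition_cache_;

  std::shared_ptr<const CachedPartition> FindPartition(int64_t partition_id) {
    auto it = partition_cache_.find(partition_id);
    return it == partition_cache_.end() ? nullptr : it->second;
  }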

In terms of the original "fetch from source systems" design, I do continue
to think it has promise longer-term as it simplifies the overall
architecture and can help with cross-system metadata consistency. But we
can treat fetch-from-catalogd as a nice interim step that should bring most of
the performance and scalability benefits to users sooner and with less risk.

I'll plan to update the original design document to reflect this in coming
days.

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: #pragma once?

2018-08-01 Thread Todd Lipcon
Yea, when we looked into it I recall that it worked fine on all compilers
from the last 10 years or something (in fact I remember using #pragma once
on Metrowerks Codewarrior more than fifteen years ago). Given we require
C++14 I don't think #pragma once is going to be the limiting factor in
compiler version portability.
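
For reference, the two idioms (guard name made up):

  // Traditional include guard:
  #ifndef IMPALA_UTIL_FOO_H
  #define IMPALA_UTIL_FOO_H
  // ... declarations ...
  #endif  // IMPALA_UTIL_FOO_H

  // Equivalent with the pragma:
  #pragma once
  // ... declarations ...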

-Todd

On Wed, Aug 1, 2018 at 12:01 PM, Sailesh Mukil  wrote:

> An advantage of using #pragma once is potentially improved compilation
> speed. However, a con is that it's non-standard and therefore its
> behavior can change at any point and can also vary across compilers,
> potentially making the code even less portable.
>
> That being said, since Kudu has been using it for a while and has had no
> issues, we can do the same since the potential benefits outweigh the cons.
>
> On Wed, Aug 1, 2018 at 11:48 AM, Tim Armstrong <
> tarmstr...@cloudera.com.invalid> wrote:
>
> > Todd brought up our include guards on a code review, asking why we don't
> > use #pragma once instead: https://gerrit.cloudera.org/#/c/10988/5 . It
> > sounds like Kudu has switched to it
> >
> > #pragma once does seem cleaner and our GCC and Clang versions are modern
> > enough to support it.
> >
> > What do people think about switching to that as the preferred way of
> > including headers only once?
> >
> > - Tim
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Re: Impala Incident Sharing

2018-08-01 Thread Todd Lipcon
On Wed, Aug 1, 2018 at 6:02 AM, Quanlong Huang 
wrote:

> Hi Todd and Philip,
>
>
> The idea of separate ClassLoader is cool. But I would vote for executing
> java UDFs in a separate JVM. We don't need to create a JVM for each UDF.
> They can share the same JVM with a configurable hard limit. If the JVM for
> UDFs runs out of heap size, we can destroy the JVM and fail the queries.
> (Not sure if this is doable) The drawback is that some innocent queries
> using java UDFs will fail too, but the cluster will still be healthy.
>

I think this would be great, but we need to make sure not to regress
performance significantly. For example, maybe we'd need to expose row
batches as a shared memory region and then map them as a DirectByteBuffer
on the Java side. I think this would probably be a lot of work, though, and
maybe the Java UDF usage is uncommon enough that it may not be worth all of
the coding?

-Todd


>
> The patch of adding JVM pause logs and expose JMX metrics is really
> helpful. I'd like to join the code review.
>
>
> Thanks,
> Quanlong
>
> At 2018-08-01 05:47:25, "Philip Zeyliger" 
> wrote:
> >Hi Quanlong,
> >
> >Thank you for the incident note! You might be interested in
> >https://gerrit.cloudera.org/#/c/10998/ which is adding some
> instrumentation
> >to make it easier to notice with monitoring tools that we're running out
> of
> >memory or having large GC pauses.
> >
> >-- Philip
> >
> >On Tue, Jul 31, 2018 at 8:21 AM Todd Lipcon 
> >wrote:
> >
> >> Hi Quanlong,
> >>
> >> Thanks for the writeup. I wonder if we could address two of the issues with a
> >> single approach of using a separate ClassLoader instance when we load
> and
> >> instantiate custom UDFs. For one, using a custom ClassLoader means that
> we
> >> could probably ensure that the user's jar is checked first even if
> there is
> >> a conflicting class on the Impala classpath. For two, if we close the
> >> custom classloader and ensure that we don't retain any references to it
> or
> >> to classes created by it (i.e the UDF classes), then we should be able
> to
> >> collect leaked memory on the next GC. In the particular case you pointed
> >> out, the leak is in static fields of the UDF, and those static fields
> would
> >> get destructed when destructing the classloader.
> >>
> >> On the downside, making this change could negatively harm performance if
> >> any UDFs are relying on their static initializer blocks to do expensive
> >> setup. Maybe a compromise would be to make this behavior optional or to
> >> cache the ClassLoader via a WeakReference across instantiations of a
> UDF so
> >> that, if there isn't memory pressure, the UDF doesn't need to be
> >> re-class-loaded repeatedly.
> >>
> >> As for tracking memory explicitly, the way that HBase does it is via
> >> somewhat careful accounting inline with all of the code that allocates
> >> memory. I don't think that approach is feasible in UDFs because the UDF
> >> code is provided by end users, and they aren't likely to want to
> carefully
> >> track their heap usage. I think the only way we could effectively
> "sandbox"
> >> UDFs would be to actually execute them in a separate JVM with a limited
> >> heap size, but that would be somewhat complex to implement without a big
> >> performance drop.
> >>
> >> -Todd
> >>
> >> On Tue, Jul 31, 2018 at 7:53 AM, Quanlong Huang  >
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> >
> >> > Just to draw your attention of a patch, I'd like to share an incident
> >> with
> >> > you.
> >> >
> >> >
> >> > Two months ago we encountered an incident in one of our Impala
> clusters
> >> in
> >> > version 2.5-cdh5.7.3. Queries were pilled up and most of them were in
> >> > CREATED state. It looks like an infamous incident when the catalogd is
> >> > loading metadata of huge tables and finally OOM so metadata loading
> >> > requests from impalad are stuck. However, the catalogd looks good this
> >> > time. The logs showed it's not loading any metadata. Then I looked
> into
> >> the
> >> > statestored and found many timeout logs like "Unable to send topic
> update
> >> > message to subscriber impalad@xxx:22000, received error: RPC timed
> out".
> >> > Chose one of the problemati

Re: Enabling automatic code review precommit job

2018-07-31 Thread Todd Lipcon
On Tue, Jul 31, 2018 at 9:02 AM, Tim Armstrong <
tarmstr...@cloudera.com.invalid> wrote:

> I agree that it might be a little strict at the moment and disallow some
> reasonable formatting. The tool is controlled by the setup.cfg file in the
> repo so it's easy enough to change the behaviour.
>
> I think we have been a little sloppy with Python style in general so I
> think some of these would be good to change over time. I think the main
> thing I'd like is to align the tool's behaviour with code reviews - if
> we're going to be strict about PEP 8 compliance in code reviews we should
> keep the tool strict.
>
> >109 : E125 continuation line with same indent as next logical line
> I think this formatting is hard to read and confusing, I'm in favour of
> leaving this enabled.
>
> >110 : E701 multiple statements on one line (colon)
> This is because of the one-line if statements we have in the code. I don't
> feel strongly either way as long as we're consistent.
>
> >129 : E231 missing whitespace after ','
> >185 : E251 unexpected spaces around keyword / parameter equals
> >288 : E502 the backslash is redundant between brackets
> These seem like sloppy/inconsistent formatting to me; I'm in favour of
> keeping these enabled and fixing existing code as we go.
>
> > 368 : E302 expected 2 blank lines, found 1
> Our Python code is very inconsistent here, would be nice to make it more
> consistent. I'm in favour of keeping it enabled and fixing as we go.
>
> >187 : E121 continuation line under-indented for hanging indent
> >   1295 : E128 continuation line under-indented for visual indent
> On one hand it would be nice to follow the PEP 8 style here (
> https://www.python.org/dev/peps/pep-0008/#indentation) but the other
> idioms
> seem fine to me. I've been asked on Python code reviews to switch to the
> PEP 8 indentation style before so I think we should align the tools
> behaviour with what we're going to ask for code reviews.
>


Alright, I agree with all of the above. One suggestion, though: is it
possible to get something like 'autopep8' to run in a 'patch formatting'
mode where it only reformats changed lines? Then we could more easily just
run a single command to ensure that our patches are properly formatted
before submitting to review. Or, at the very least, some instructions for
running the same flake8-against-only-my-changed-lines that gerrit is
running?

-Todd


>
>
> On Mon, Jul 30, 2018 at 7:48 PM, Todd Lipcon 
> wrote:
>
> > It seems like flake8 might need some adjustment of its policies. Here are
> > the most common issues in the current test code:
> >
> > 109 : E125 continuation line with same indent as next logical line
> > 110 : E701 multiple statements on one line (colon)
> > 129 : E231 missing whitespace after ','
> > 185 : E251 unexpected spaces around keyword / parameter equals
> > 187 : E121 continuation line under-indented for hanging indent
> > 288 : E502 the backslash is redundant between brackets
> > 368 : E302 expected 2 blank lines, found 1
> >1295 : E128 continuation line under-indented for visual indent
> >
> > Maybe worth just disabling some of the indentation-related ones to start?
> >
> >
> > On Mon, Jul 30, 2018 at 4:09 PM, Tim Armstrong <
> > tarmstr...@cloudera.com.invalid> wrote:
> >
> > > I have a few updates.
> > >
> > > I added an automatic build job for docs changes:
> > > https://jenkins.impala.io/job/gerrit-docs-auto-test/
> > >
> > > I'm going to disable the "Build started" message for
> > > gerrit-code-review-checks. It seems a bit too chatty. Let me know if
> you
> > > disagree.
> > >
> > > I'm adding a job that automatically does some checks on the diff and
> > posts
> > > code review comments. I started off with Python flake8 comments.
> > >
> > > Let me know if you see any problems or if it turns out to be too noisy.
> > >
> > >
> > >
> > > On Mon, Jul 23, 2018 at 11:55 AM, Tim Armstrong <
> tarmstr...@cloudera.com
> > >
> > > wrote:
> > >
> > > > Hi All,
> > > >   I'm enabling an automatic precommit job for code reviews uploaded
> to
> > > > gerrit that will run RAT, clang-tidy and a GCC debug compilation.
> This
> > is
> > > > to provide faster feedback on code reviews:
> https://issues.apache.org/
> > > > jira/browse/IMPALA-7317 . I'll add some more checks but I'm wanting
> to
> > > > test the basic mechanism for a bit now.
> > > >
> > > > It excludes docs/ commits.
> > > >
> > > > Let me know if you see any problems with it.
> > > >
> > > > Thanks,
> > > > Tim
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Impala Incident Sharing

2018-07-31 Thread Todd Lipcon
E-16196). That's why the impalads got OOM. Memory in the JVM is not
> tracked. The bug in UdfJson is fixed in later hive versions (
> https://github.com/cloudera/hive/commit/8397f933ea3ee5e2087b4def6b58ed
> 5d77026108#diff-e0b4cff52301169fede5c9550e968b67  ). However, using new
> hive jars (e.g. hive-exec-1.1.0-cdh5.16.0-SNAPSHOT-core.jar) in the
> CREATE FUNCTION statement does not work, since the class name is still the
> same and the hive jars in version cdh5.7.3 are already in the CLASSPATH.
> Impala will still load the old implementation of UDFJson. To work around
> this, I copied the implementation of UDFJson, used another class name, and
> then used my custom jar file instead.
>
>
> This is not the final fix since memory used in java UDFs is not tracked.
> It's still unsafe. Looking into the C++ implementation of get_json_object (
> https://github.com/nazgul33/impala-get-json-object-udf) provided by an
> earlier comment in IMPALA-376, it still does not track the memory. So I wrote
> another implementation and it's ready for code review:
> https://gerrit.cloudera.org/c/10950/ (it has passed the pre-review test:
> https://jenkins.impala.io/job/pre-review-test/189/ ). Yes, this is the
> purpose of this email. Thank you for reading!
>
>
> Further thoughts:
> * We might need to mention in the docs that adding java UDFs may not use
> the given jar file if the impala CLASSPATH already contains a jar file
> containing the same class.
> * We should avoid using Hive builtin UDFs and any other Java UDFs since
> their memory is not tracked.
> * How to track memory used in JVM? HBase, a pure java project, is able to
> track its MemStore and BlockCache size. Can we learn from it?
>
>
> Thanks,
> Quanlong
> --
> Quanlong Huang
> Software Developer, Hulu




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Enabling automatic code review precommit job

2018-07-30 Thread Todd Lipcon
It seems like flake8 might need some adjustment of its policies. Here are
the most common issues in the current test code:

109 : E125 continuation line with same indent as next logical line
110 : E701 multiple statements on one line (colon)
129 : E231 missing whitespace after ','
185 : E251 unexpected spaces around keyword / parameter equals
187 : E121 continuation line under-indented for hanging indent
288 : E502 the backslash is redundant between brackets
368 : E302 expected 2 blank lines, found 1
   1295 : E128 continuation line under-indented for visual indent

Maybe worth just disabling some of the indentation-related ones to start?


On Mon, Jul 30, 2018 at 4:09 PM, Tim Armstrong <
tarmstr...@cloudera.com.invalid> wrote:

> I have a few updates.
>
> I added an automatic build job for docs changes:
> https://jenkins.impala.io/job/gerrit-docs-auto-test/
>
> I'm going to disable the "Build started" message for
> gerrit-code-review-checks. It seems a bit too chatty. Let me know if you
> disagree.
>
> I'm adding a job that automatically does some checks on the diff and posts
> code review comments. I started off with Python flake8 comments.
>
> Let me know if you see any problems or if it turns out to be too noisy.
>
>
>
> On Mon, Jul 23, 2018 at 11:55 AM, Tim Armstrong 
> wrote:
>
> > Hi All,
> >   I'm enabling an automatic precommit job for code reviews uploaded to
> > gerrit that will run RAT, clang-tidy and a GCC debug compilation. This is
> > to provide faster feedback on code reviews: https://issues.apache.org/
> > jira/browse/IMPALA-7317 . I'll add some more checks but I'm wanting to
> > test the basic mechanism for a bit now.
> >
> > It excludes docs/ commits.
> >
> > Let me know if you see any problems with it.
> >
> > Thanks,
> > Tim
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Order of error messages printed

2018-07-24 Thread Todd Lipcon
I suppose the advantage of it being ordered is that at least the output
order is deterministic. That is to say, if a new warning shows up, it won't
re-shuffle the ordering of all prior warnings. And we'll get the same
results regardless of whether we're using libstdc++ or libc++, which might
make test assertions easier.
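
Concretely, the change under discussion is roughly the following (I'm writing
the value type as ErrorLogEntry from memory, so take that detail with a grain
of salt):

  // Today (deterministic: iteration, and hence PrintErrorMap() output, is
  // sorted by error code, the same on every run and standard library):
  typedef std::map<TErrorCode::type, ErrorLogEntry> ErrorLogMap;

  // Proposed alternative (iteration order depends on hashing and bucket
  // layout, which can differ between standard library implementations):
  // typedef std::unordered_map<TErrorCode::type, ErrorLogEntry> ErrorLogMap;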

-Todd

On Tue, Jul 24, 2018 at 12:26 PM, Sailesh Mukil <
sail...@cloudera.com.invalid> wrote:

> The ErrorLogMap's key is "TErrorCode::type" which is an Enum. I don't think
> we can derive any inherent meaning from the order of the errors listed in
> the enum (other than OK=0, which wouldn't be in this map anyway). I'm not
> aware of any clients relying on this order, and even if there were, it
> would probably be incorrect to do so. So, unless anyone else chimes in with
> a good reason to keep this as a map instead of an unordered_map, I'm all
> for changing it.
>
> On Tue, Jul 24, 2018 at 12:12 PM, Michael Ho 
> wrote:
>
> > Hi,
> >
> > While making some changes for be/src/util/error-util.cc, I stumbled upon
> > the question of whether we need a map instead of an unordered map for
> > ErrorLogMap (used for tracking error messages for each error code). This
> > affects the order in which error messages are printed in PrintErrorMap().
> > Looking through the code, it doesn't appear to me that there is any
> > ordering requirement. However, it's unclear to me if any client of Impala
> > makes assumptions about the ordering of the output in PrintErrorMap(). So,
> > sending this email to the list in case anyone knows anything.
> >
> > --
> > Thanks,
> > Michael
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Broken Impala Build?

2018-07-18 Thread Todd Lipcon
Figured it out: I had to cd fe && mvn -U clean install

At least now my impalad's starting. We'll see if the data load works.

It seems Maven needs to be forced to pick up the new artifacts after
being pointed at a new repository.

-Todd

On Wed, Jul 18, 2018 at 5:32 PM, Pooja Nilangekar <
pooja.nilange...@cloudera.com.invalid> wrote:

> I am still having the same issue.  (I haven't tried git clean either).
>
> Thanks,
> Pooja
>
>
> On Wed, Jul 18, 2018 at 5:30 PM Todd Lipcon 
> wrote:
>
> > Anyone else still having problems? I followed Fredy's directions but the
> > data load is failing with the same NoClassDefFound issue while attempting
> > to start the impalad. I haven't tried a full "git clean" yet but I guess
> > that iwll be my last resort.
> >
> > On Wed, Jul 18, 2018 at 4:22 PM, Fredy Wijaya
>  > >
> > wrote:
> >
> > > The CR is merged and the build looks good now. Please remove
> > > $IMPALA_HOME/toolchain and run a full build (./buildall.sh -format
> > > -testdata -notests). It's also recommended to exit any shell sessions
> to
> > > avoid issue related to stale environment variable.
> > >
> > > On Wed, Jul 18, 2018 at 2:16 PM Fredy Wijaya 
> > wrote:
> > >
> > > > ​A CR to fix the issue: https://gerrit.cloudera.org/c/10981/
> > > > I'm running a dry-run to make sure everything is good with the new
> > build
> > > > number.​
> > > >
> > > > On Wed, Jul 18, 2018 at 12:24 PM Todd Lipcon
>  > >
> > > > wrote:
> > > >
> > > >> Yea, I just got the same on a gerrit build. I would guess this
> somehow
> > > >> means we are ending up with mismatched versions of the "HDFS" and
> > > "common"
> > > >> Hadoop jars on our classpath. Possibly one is getting picked up from
> > > Maven
> > > >> whereas another is getting picked up from toolchain?
> > > >>
> > > >> -Todd
> > > >>
> > > >> On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya
> > > >>  > > >> > wrote:
> > > >>
> > > >> > I got this error trying to do a clean build in HEAD. Did anyone
> > > >> encounter
> > > >> > the same issue?
> > > >> >
> > > >> > I0718 10:03:13.658380 23734 jni-util.cc:230]
> > > >> > java.lang.NoClassDefFoundError:
> > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> > > >> > at
> > > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.
> > > >> > getChecksumCombineModeFromConf(DfsClientConf.java:314)
> > > >> > at
> > > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(
> > > >> > DfsClientConf.java:184)
> > > >> > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.
> java:302)
> > > >> > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.
> java:286)
> > > >> > at
> > > >> > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
> > > >> > DistributedFileSystem.java:167)
> > > >> > at
> > > >> > org.apache.hadoop.fs.FileSystem.createFileSystem(
> > > FileSystem.java:3288)
> > > >> > at org.apache.hadoop.fs.FileSystem.access$200(
> > > FileSystem.java:123)
> > > >> > at
> > > >> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
> > > FileSystem.java:3337)
> > > >> > at org.apache.hadoop.fs.FileSystem$Cache.get(
> > > FileSystem.java:3305)
> > > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
> > > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
> > > >> > at
> > > >> > org.apache.impala.service.JniFrontend.checkFileSystem(
> > > >> > JniFrontend.java:778)
> > > >> > at
> > > >> > org.apache.impala.service.JniFrontend.checkConfiguration(
> > > >> > JniFrontend.java:690)
> > > >> > I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError:
> > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> > > >> > @  0x1951b3a
> > > >> > @  0x1ff307f
> > > >> > @  0x1e9ec51
> > > >> > @  0x1eba8bb
> > > >> > @  0x1eb6df2
> > > >> > @  0x190d4bf
> > > >> > @ 0x7fe6e95bf82f
> > > >> > @  0x190d338
> > > >> > E0718 10:03:13.659101 23734 impala-server.cc:289]
> > > NoClassDefFoundError:
> > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> > > >> > E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala
> > > Server
> > > >> > startup due to improper configuration. Impalad exiting.
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Todd Lipcon
> > > >> Software Engineer, Cloudera
> > > >>
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Broken Impala Build?

2018-07-18 Thread Todd Lipcon
Anyone else still having problems? I followed Fredy's directions but the
data load is failing with the same NoClassDefFound issue while attempting
to start the impalad. I haven't tried a full "git clean" yet but I guess
that will be my last resort.

On Wed, Jul 18, 2018 at 4:22 PM, Fredy Wijaya 
wrote:

> The CR is merged and the build looks good now. Please remove
> $IMPALA_HOME/toolchain and run a full build (./buildall.sh -format
> -testdata -notests). It's also recommended to exit any shell sessions to
> avoid issue related to stale environment variable.
>
> On Wed, Jul 18, 2018 at 2:16 PM Fredy Wijaya  wrote:
>
> > ​A CR to fix the issue: https://gerrit.cloudera.org/c/10981/
> > I'm running a dry-run to make sure everything is good with the new build
> > number.​
> >
> > On Wed, Jul 18, 2018 at 12:24 PM Todd Lipcon 
> > wrote:
> >
> >> Yea, I just got the same on a gerrit build. I would guess this somehow
> >> means we are ending up with mismatched versions of the "HDFS" and
> "common"
> >> Hadoop jars on our classpath. Possibly one is getting picked up from
> Maven
> >> whereas another is getting picked up from toolchain?
> >>
> >> -Todd
> >>
> >> On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya
> >>  >> > wrote:
> >>
> >> > I got this error trying to do a clean build in HEAD. Did anyone
> >> encounter
> >> > the same issue?
> >> >
> >> > I0718 10:03:13.658380 23734 jni-util.cc:230]
> >> > java.lang.NoClassDefFoundError:
> >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> >> > at
> >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.
> >> > getChecksumCombineModeFromConf(DfsClientConf.java:314)
> >> > at
> >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(
> >> > DfsClientConf.java:184)
> >> > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:302)
> >> > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:286)
> >> > at
> >> > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
> >> > DistributedFileSystem.java:167)
> >> > at
> >> > org.apache.hadoop.fs.FileSystem.createFileSystem(
> FileSystem.java:3288)
> >> > at org.apache.hadoop.fs.FileSystem.access$200(
> FileSystem.java:123)
> >> > at
> >> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
> FileSystem.java:3337)
> >> > at org.apache.hadoop.fs.FileSystem$Cache.get(
> FileSystem.java:3305)
> >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
> >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
> >> > at
> >> > org.apache.impala.service.JniFrontend.checkFileSystem(
> >> > JniFrontend.java:778)
> >> > at
> >> > org.apache.impala.service.JniFrontend.checkConfiguration(
> >> > JniFrontend.java:690)
> >> > I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError:
> >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> >> > @  0x1951b3a
> >> > @  0x1ff307f
> >> > @  0x1e9ec51
> >> > @  0x1eba8bb
> >> > @  0x1eb6df2
> >> > @  0x190d4bf
> >> > @ 0x7fe6e95bf82f
> >> > @  0x190d338
> >> > E0718 10:03:13.659101 23734 impala-server.cc:289]
> NoClassDefFoundError:
> >> > org/apache/hadoop/fs/Options$ChecksumCombineMode
> >> > E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala
> Server
> >> > startup due to improper configuration. Impalad exiting.
> >> >
> >>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Broken Impala Build?

2018-07-18 Thread Todd Lipcon
Yea, I just got the same on a gerrit build. I would guess this somehow
means we are ending up with mismatched versions of the "HDFS" and "common"
Hadoop jars on our classpath. Possibly one is getting picked up from Maven
whereas another is getting picked up from toolchain?

-Todd

On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya  wrote:

> I got this error trying to do a clean build in HEAD. Did anyone encounter
> the same issue?
>
> I0718 10:03:13.658380 23734 jni-util.cc:230]
> java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/Options$ChecksumCombineMode
> at
> org.apache.hadoop.hdfs.client.impl.DfsClientConf.
> getChecksumCombineModeFromConf(DfsClientConf.java:314)
> at
> org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(
> DfsClientConf.java:184)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:302)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:286)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
> DistributedFileSystem.java:167)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
> at
> org.apache.impala.service.JniFrontend.checkFileSystem(
> JniFrontend.java:778)
> at
> org.apache.impala.service.JniFrontend.checkConfiguration(
> JniFrontend.java:690)
> I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError:
> org/apache/hadoop/fs/Options$ChecksumCombineMode
> @  0x1951b3a
> @  0x1ff307f
> @  0x1e9ec51
> @  0x1eba8bb
> @  0x1eb6df2
> @  0x190d4bf
> @ 0x7fe6e95bf82f
> @  0x190d338
> E0718 10:03:13.659101 23734 impala-server.cc:289] NoClassDefFoundError:
> org/apache/hadoop/fs/Options$ChecksumCombineMode
> E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala Server
> startup due to improper configuration. Impalad exiting.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: impalad running as superuser in tests

2018-07-18 Thread Todd Lipcon
On Tue, Jul 17, 2018 at 5:27 PM, Sailesh Mukil  wrote:

> On Tue, Jul 17, 2018 at 2:47 PM, Todd Lipcon 
> wrote:
>
> > Hey folks,
> >
> > I'm working on a regression test for IMPALA-7311 and found something
> > interesting. It appears that in our normal minicluster setup, impalad
> runs
> > as the same username as the namenode (namely, the username of the
> > developer, in my case 'todd').
> >
> > This means that the NN treats impala as a superuser, and therefore
> doesn't
> > actually enforce permissions. So, tests about the behavior of Impala on
> > files that it doesn't have access to are somewhat tricky to write.
> >
> >
> What kind of files do you specifically mean? Something that the daemon
> tries to access directly (Eg: keytab file, log files, etc.) ? I'm guessing
> it's not this since you mentioned the NN.
>
> Or files that belong to a table/partition in HDFS? If it's this case, we
> would go through Sentry before accessing files that belong to a table, and
> access would be determined by Sentry on the "session user" (not the impalad
> user) before Impala even tries to access HDFS. (Eg:
> tests/authorization/test_authorization.py)
>

Right, files on HDFS. I mean that, in cases where Sentry is not enabled or
set up, and even in some cases where it is set up but not synchronized with
HDFS, it's possible that the user can point table metadata at files or
directories that aren't writable by the 'impala' user on HDFS. For example,
I can do:

CREATE EXTERNAL TABLE foo (...) LOCATION '/user/todd/my-dir';

and it's likely that 'my-dir' is not writable by 'impala' on a real
cluster. Thus, if I try to insert into it, I get an error because "impala"
does not have HDFS permissions to access this directory.

Currently, the frontend does some checks here to try to produce a nice
error. But those checks are based on cached metadata, which could be
inaccurate. In the case that it's inaccurate, the error will be thrown from
the backend when it tries to create a file in a non-writable location.

In the minicluster environment, it's impossible to test this case (actual
permissions enforced by the NN causing an error) because the backend is
running as an HDFS superuser. That is to say, it has full permissions
everywhere. That's due to the special case behavior that HDFS has: it
determines the name of the superuser to be the username that is running the
NN. Since in the minicluster, both impala and the NN run as 'todd' in my
case, impala acts as superuser. In a real cluster (even with security
disabled) impala typically runs as 'impala' whereas the NN runs as 'hdfs'
and thus impala does not have superuser privileges.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


impalad running as superuser in tests

2018-07-17 Thread Todd Lipcon
Hey folks,

I'm working on a regression test for IMPALA-7311 and found something
interesting. It appears that in our normal minicluster setup, impalad runs
as the same username as the namenode (namely, the username of the
developer, in my case 'todd').

This means that the NN treats impala as a superuser, and therefore doesn't
actually enforce permissions. So, tests about the behavior of Impala on
files that it doesn't have access to are somewhat tricky to write.

Has anyone run into this before? Should we consider running either the
impalad or the namenode as a different spoofed username so that the
minicluster environment is more authentic to true cluster environments? We
can do this easily by setting the HADOOP_USER_NAME environment variable or
system property.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Inconsistent handling of schema in Avro tables

2018-07-16 Thread Todd Lipcon
On Thu, Jul 12, 2018 at 5:07 PM, Bharath Vissapragada <
bhara...@cloudera.com.invalid> wrote:

> On Thu, Jul 12, 2018 at 12:03 PM Todd Lipcon 
> wrote:
>
>
> > So, I think my proposal here is:
> >
> > 1. Query behavior on existing tables
> > - If the table-level format is non-Avro,
> > - AND the table contains column types incompatible with Avro (eg
> tinyint),
> > - AND the table has an existing avro partition,
> > - THEN the query will yield an error about incompatible types
> >
> > 2. Try to prevent shooting in the foot
> > - If the table-level format is non-Avro,
> > - AND the table contains column types incompatible with Avro (eg
> tinyint),
> > - THEN disallow changing the file format of an existing partition to Avro
> >
>
> This approach sounds good to me. Is the existing behavior preserved for the
> non-mixed-partition Avro based tables?
>

Yes, if the table-level properties indicate it's an Avro table we'll let
the Avro schema (explicit or implicit) override the columns in the
metastore.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Inconsistent handling of schema in Avro tables

2018-07-12 Thread Todd Lipcon
Again there's inconsistency with Hive: the presence of a single Avro
partition doesn't change the table-level schema.

The interesting thing is that, when I modified Impala to have a similar
behavior, I got the following error from the backend when trying to query
the data:

WARNINGS: Unresolvable types for column 'tinyint_col': declared column
type: TINYINT, table's Avro schema type: int (1 of 2 similar)

Hive, however, when presented with such a situation, does an incorrect type
coercion of the avro file's "int" into the table's "tinyint" schema. What's
incorrect about the type coercion is that it doesn't handle out-of-range
values in a sensible way. Rather, it just lets them overflow into the
smaller target type -- when I dumped an Avro file containing the int
"1" into an Avro partition of a Parquet table with schema "tinyint", I
got the value "16" output from Hive (1%256=16). This downcasting
behavior is consistent with Impala and Hive's behavior with a CAST(1 as
tinyint) expression (see IMPALA-1821)


So, I think my proposal here is:

1. Query behavior on existing tables
- If the table-level format is non-Avro,
- AND the table contains column types incompatible with Avro (eg tinyint),
- AND the table has an existing avro partition,
- THEN the query will yield an error about incompatible types

2. Try to prevent shooting in the foot
- If the table-level format is non-Avro,
- AND the table contains column types incompatible with Avro (eg tinyint),
- THEN disallow changing the file format of an existing partition to Avro


-Todd






On Wed, Jul 11, 2018 at 9:32 PM, Todd Lipcon  wrote:

> Turns out it's even a bit more messy. The presence of one or more avro
> partitions can change the types of existing columns, even if there is no
> explicit avro schema specified for the table:
> https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8
>
> Not quite sure how to handle this one in a world where we don't load all
> of the partitions up front. Perhaps the best approach is to just throw an
> error and then provide a command for the user to "re-sync" the schema to
> the appropriate avro-supported types? Hive provides ALTER TABLE <table>
> UPDATE COLUMNS for something like this, though still I don't think that
> would iterate over all partitions in the case of a mixed table.
>
> -Todd
>
> On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada <
> bhara...@cloudera.com.invalid> wrote:
>
>> Agreed.
>>
>> On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon 
>> wrote:
>>
>> > Your commit message there makes sense, Bharath -- we should set
>> > 'avroSchema' in the descriptor in case any referenced partition is avro,
>> > because the scanner needs that info. However, we don't need to also
>> > override the table-level schema. So, I think we can preserve the fix
>> that
>> > you made while also making the behavior less surprising.
>> >
>> > -Todd
>> >
>> > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada <
>> > bhara...@cloudera.com.invalid> wrote:
>> >
>> > > I added this functionality
>> > > <https://github.com/apache/impala/commit/49610e2cfa40aa10b62
>> 6c5ae41d7f0
>> > > d99d7cabc5>
>> > >  where adding an Avro partition in a mixed partition table resets the
>> > table
>> > > level schema. While I don't exactly remember why we chose this path,
>> I do
>> > > recall that we debated quite a bit about Avro schema evolution causing
>> > > schema inconsistencies across partitions. AFAICT there is no specific
>> > > reason Impala chose to differ from Hive. Now that I see your email,
>> > > Hive's behavior makes more sense to me, especially in the context of
>> lazy
>> > > loading of metadata.
>> > >
>> > > Also, agree with Edward that the whole mixed partitions + Avro schema
>> > > evolution is a mess and I doubt if any serious user relies on a
>> specific
>> > > behavior.
>> > >
>> > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo <
>> edlinuxg...@gmail.com>
>> > > wrote:
>> > >
>> > > > I know that Hive can deal with schema being different per partition,
>> > but
>> > > I
>> > > > really hesitate to understand why someone would want to do this. If
>> > > someone
>> > > > asked me to support a mixed avro/parquet table I would suggest they
>> > > create
>> > > > a view. If they kept insisting I would reply "Well it is your funeral."

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Turns out it's even a bit more messy. The presence of one or more avro
partitions can change the types of existing columns, even if there is no
explicit avro schema specified for the table:
https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8

Not quite sure how to handle this one in a world where we don't load all of
the partitions up front. Perhaps the best approach is to just throw an
error and then provide a command for the user to "re-sync" the schema to
the appropriate avro-supported types? Hive provides ALTER TABLE <table>
UPDATE COLUMNS for something like this, though still I don't think that
would iterate over all partitions in the case of a mixed table.

-Todd

On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada <
bhara...@cloudera.com.invalid> wrote:

> Agreed.
>
> On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon 
> wrote:
>
> > Your commit message there makes sense, Bharath -- we should set
> > 'avroSchema' in the descriptor in case any referenced partition is avro,
> > because the scanner needs that info. However, we don't need to also
> > override the table-level schema. So, I think we can preserve the fix that
> > you made while also making the behavior less surprising.
> >
> > -Todd
> >
> > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada <
> > bhara...@cloudera.com.invalid> wrote:
> >
> > > I added this functionality
> > > <https://github.com/apache/impala/commit/
> 49610e2cfa40aa10b626c5ae41d7f0
> > > d99d7cabc5>
> > >  where adding an Avro partition in a mixed partition table resets the
> > table
> > > level schema. While I don't exactly remember why we chose this path, I
> do
> > > recall that we debated quite a bit about Avro schema evolution causing
> > > schema inconsistencies across partitions. AFAICT there is no specific
> > > reason Impala chose to differ from Hive. Now that I see your email,
> > > Hive's behavior makes more sense to me, especially in the context of
> lazy
> > > loading of metadata.
> > >
> > > Also, agree with Edward that the whole mixed partitions + Avro schema
> > > evolution is a mess and I doubt if any serious user relies on a
> specific
> > > behavior.
> > >
> > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo  >
> > > wrote:
> > >
> > > > I know that Hive can deal with schema being different per partition,
> > but
> > > I
> > > > really hesitate to understand why someone would want to do this. If
> > > someone
> > > > asked me to support a mixed avro/parquet table I would suggest they
> > > create
> > > > a view. If they kept insisting I would reply "Well it is your
> funeral."
> > > >
> > > > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon
>  > >
> > > > wrote:
> > > >
> > > > > Hey folks,
> > > > >
> > > > > I'm trying to understand the current behavior of tables that
> contain
> > > > > partitions of mixed format, specifically when one or more
> partitions
> > is
> > > > > stored as Avro. Impala seems to be doing a number of things which I
> > > find
> > > > > surprising, and I'm not sure if they are intentional or should be
> > > > > considered bugs.
> > > > >
> > > > > *Surprise 1*: the _presence_ of an Avro-formatted partition can
> > change
> > > > the
> > > > > table schema
> > > > > https://gist.github.com/74bdef8a69b558763e4453ac21313649
> > > > >
> > > > > - create a table that is Parquet-formatted, but with an
> > > 'avro.schema.url'
> > > > > property
> > > > > - the Avro schema is ignored, and we see whatever schema we
> specified
> > > > > (*makes
> > > > > sense, because the table is Parquet*)
> > > > > - add a partition
> > > > > - set the new partition's format to Avro
> > > > > - refresh the table
> > > > > - the schema for the table now reflects the Avro schema, because it
> > has
> > > > at
> > > > > least one Avro partition
> > > > >
> > > > > *Surprise 2*: the above is inconsistent with Hive and Spark
> > > > >
> > > > > Hive seems to still reflect the table-level defined schema, and
> > ignore
> > > > the
> > > > > avro.schema.url property in this mixed scenario.

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Your commit message there makes sense, Bharath -- we should set
'avroSchema' in the descriptor in case any referenced partition is avro,
because the scanner needs that info. However, we don't need to also
override the table-level schema. So, I think we can preserve the fix that
you made while also making the behavior less surprising.

-Todd

On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada <
bhara...@cloudera.com.invalid> wrote:

> I added this functionality
> <https://github.com/apache/impala/commit/49610e2cfa40aa10b626c5ae41d7f0
> d99d7cabc5>
>  where adding an Avro partition in a mixed partition table resets the table
> level schema. While I don't exactly remember why we chose this path, I do
> recall that we debated quite a bit about Avro schema evolution causing
> schema inconsistencies across partitions. AFAICT there is no specific
> reason Impala chose to differ from Hive. Now that I see your email,
> Hive's behavior makes more sense to me, especially in the context of lazy
> loading of metadata.
>
> Also, agree with Edward that the whole mixed partitions + Avro schema
> evolution is a mess and I doubt if any serious user relies on a specific
> behavior.
>
> On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo 
> wrote:
>
> > I know that Hive can deal with schema being different per partition, but
> I
> > really hesitate to understand why someone would want to do this. If
> someone
> > asked me to support a mixed avro/parquet table I would suggest they
> create
> > a view. If they kept insisting I would reply "Well it is your funeral."
> >
> > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon 
> > wrote:
> >
> > > Hey folks,
> > >
> > > I'm trying to understand the current behavior of tables that contain
> > > partitions of mixed format, specifically when one or more partitions is
> > > stored as Avro. Impala seems to be doing a number of things which I
> find
> > > surprising, and I'm not sure if they are intentional or should be
> > > considered bugs.
> > >
> > > *Surprise 1*: the _presence_ of an Avro-formatted partition can change
> > the
> > > table schema
> > > https://gist.github.com/74bdef8a69b558763e4453ac21313649
> > >
> > > - create a table that is Parquet-formatted, but with an
> 'avro.schema.url'
> > > property
> > > - the Avro schema is ignored, and we see whatever schema we specified
> > > (*makes
> > > sense, because the table is Parquet*)
> > > - add a partition
> > > - set the new partition's format to Avro
> > > - refresh the table
> > > - the schema for the table now reflects the Avro schema, because it has
> > at
> > > least one Avro partition
> > >
> > > *Surprise 2*: the above is inconsistent with Hive and Spark
> > >
> > > Hive seems to still reflect the table-level defined schema, and ignore
> > the
> > > avro.schema.url property in this mixed scenario. That is to say, with
> the
> > > state set up by the above, we have the following behavior:
> > >
> > > Impala:
> > > - uses the external avro schema for all table-level info, SELECT *,
> etc.
> > > - "compute stats" detects the inconsistency and tells the user to
> > recreate
> > > the table.
> > > - if some existing partitions (eg in Parquet) aren't compatible with
> that
> > > avro schema, errors result from the backend that there are missing
> > columns
> > > in the Parquet data files
> > >
> > > Hive:
> > > - uses the table-level schema defined in the HMS for describe, etc
> > > - queries like 'select *' again use the table-level HMS schema. The
> > > underlying reader that reads the Avro partition seems to use the
> defined
> > > external Avro schema, resulting in nulls for missing columns.
> > > - computing stats (analyze table mixedtable partition (y=1) compute
> stats
> > > for columns) seems to end up only recording stats against the column
> > > defined in the table-level Schema.
> > >
> > > Spark:
> > > - DESCRIBE TABLE shows the table-level info
> > > - select * fails, because apparently Spark doesn't support multi-format
> > > tables at all (it tries to read the avro files as a parquet file)
> > >
> > >
> > > It seems to me that Hive's behavior is a bit better.* I'd like to
> propose
> > > we treat this as a bug and move to the following behavior:*

Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Hey folks,

I'm trying to understand the current behavior of tables that contain
partitions of mixed format, specifically when one or more partitions is
stored as Avro. Impala seems to be doing a number of things which I find
surprising, and I'm not sure if they are intentional or should be
considered bugs.

*Surprise 1*: the _presence_ of an Avro-formatted partition can change the
table schema
https://gist.github.com/74bdef8a69b558763e4453ac21313649

- create a table that is Parquet-formatted, but with an 'avro.schema.url'
property
- the Avro schema is ignored, and we see whatever schema we specified (*makes
sense, because the table is Parquet*)
- add a partition
- set the new partition's format to Avro
- refresh the table
- the schema for the table now reflects the Avro schema, because it has at
least one Avro partition

*Surprise 2*: the above is inconsistent with Hive and Spark

Hive seems to still reflect the table-level defined schema, and ignore the
avro.schema.url property in this mixed scenario. That is to say, with the
state set up by the above, we have the following behavior:

Impala:
- uses the external avro schema for all table-level info, SELECT *, etc.
- "compute stats" detects the inconsistency and tells the user to recreate
the table.
- if some existing partitions (eg in Parquet) aren't compatible with that
avro schema, errors result from the backend that there are missing columns
in the Parquet data files

Hive:
- uses the table-level schema defined in the HMS for describe, etc
- queries like 'select *' again use the table-level HMS schema. The
underlying reader that reads the Avro partition seems to use the defined
external Avro schema, resulting in nulls for missing columns.
- computing stats (analyze table mixedtable partition (y=1) compute stats
for columns) seems to end up only recording stats against the column
defined in the table-level schema.

Spark:
- DESCRIBE TABLE shows the table-level info
- select * fails, because apparently Spark doesn't support multi-format
tables at all (it tries to read the avro files as a parquet file)


It seems to me that Hive's behavior is a bit better.* I'd like to propose
we treat this as a bug and move to the following behavior:*

- if a table's properties indicate it's an avro table, parse and adopt the
external avro schema as the table schema
- if a table's properties indicate it's _not_ an avro table, but there is
an external avro schema defined in the table properties, then parse the
avro schema and include it in the TableDescriptor (for use by avro
partitions) but do not adopt it as the table schema.

The added benefit of the above proposal (and the reason why I started
looking into this in the first place) is that, in order to service a simple
query like DESCRIBE, our current behavior requires all partition metadata
to be loaded to know whether there is any avro-formatted partition. With
the proposed new behavior, we can avoid looking at all partitions. This is
important for any metadata design which supports fine-grained loading of
metadata to the coordinator.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: boost::scoped_ptr vs std::unique_ptr

2018-07-05 Thread Todd Lipcon
Definitely in favor.

Personally I never found the "this pointer isn't movable" to be a
worthwhile distinction. With unique_ptr you need to pretty explicitly move
it using std::move, so you don't get "accidental" moves like you used to
with std::auto_ptr.

Looking briefly at Kudu we have 129 unique_ptr members and only 7 of them
are marked const, so at least we haven't found that to be a particularly
useful pattern.
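To make those two points concrete, here is a small generic C++ sketch (not
Impala or Kudu code): transferring ownership of a unique_ptr requires an
explicit std::move, and declaring it const rejects moves at compile time.

  #include <memory>
  #include <utility>

  struct Widget { int value = 0; };

  int main() {
    std::unique_ptr<Widget> a = std::make_unique<Widget>();
    // std::unique_ptr<Widget> b = a;          // won't compile: copying is deleted
    std::unique_ptr<Widget> b = std::move(a);  // ownership transfer is explicit

    const std::unique_ptr<Widget> c = std::make_unique<Widget>();
    // std::unique_ptr<Widget> d = std::move(c);  // won't compile: can't move from const
    return 0;
  }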

-Todd

On Thu, Jul 5, 2018 at 5:23 PM, Jim Apple 
wrote:

> I suspect we could also make our own immobile_ptr with minimal effort,
> thereby both making the difference more visible and reducing boost
> dependence.
>
> On Thu, Jul 5, 2018 at 5:17 PM, Sailesh Mukil  >
> wrote:
>
> > I'm in favor.
> >
> > Since the main distinction is that a unique_ptr is moveable, whereas a
> > scoped_ptr is not, we should just make sure that we do our due diligence
> > during code reviews so that we catch those cases.
> >
> > Also, making a unique_ptr const disallows moving it, since the move
> > constructor takes a non-const unique_ptr container. It probably won't
> work
> > in all places, but we could enforce that in certain parts of the code.
> >
> > On Thu, Jul 5, 2018 at 4:49 PM, Thomas Tauber-Marshall <
> > tmarsh...@cloudera.com.invalid> wrote:
> >
> > > I'm definitely in favor of using more standard c++ to reduce both
> > confusion
> > > and our reliance on boost, especially as I suspect a lot of people (eg.
> > me)
> > > don't know the subtle difference between scoped_ptr and unique_ptr off
> > the
> > > top of their head anyways.
> > >
> > > Fwiw, I was under the impression from talking with people in the past
> > that
> > > we were already trying to make this move, and the
> > > PartitionedAggregationNode refactor that just went in made the switch
> to
> > > unique_ptr, though no one commented on it in the review.
> > >
> > > On Thu, Jul 5, 2018 at 4:39 PM Tim Armstrong
> > >  wrote:
> > >
> > > > I was just talking with Michael Ho on a review about this
> > > > https://gerrit.cloudera.org/#/c/10810/7/be/src/exec/scan-node.h@271
> > > >
> > > > For a while we've continued using scoped_ptr in some places because
> it
> > > > supports a smaller set of operators and implies that the pointer
> isn't
> > > > movable. See
> > > >
> > > > https://cwiki.apache.org/confluence/display/IMPALA/
> > > Resource+Management+Best+Practices+in+Impala
> > > > .
> > > >
> > > > I don't think we're consistently following this pattern and it seems
> to
> > > > cause some confusion about what the best practice is, particularly
> for
> > > > people coming from other code bases. I personally like the
> distinction,
> > > but
> > > > I don't feel that strongly about it.
> > > >
> > > > What do people think? Should we continue using scoped_ptr or move
> away
> > > from
> > > > it. There is already a JIRA to make the change but we haven't done it
> > > > because of the above reasons:
> > > > https://issues.apache.org/jira/browse/IMPALA-3444
> > > >
> > > > - Tim
> > > >
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Proposal for a new approach to Impala metadata

2018-07-05 Thread Todd Lipcon
Hi Marcel,

Sorry for the slow response, I was out of the office for a short vacation.
Comments inline:

On Sat, Jun 30, 2018 at 5:07 PM, Marcel Kornacker  wrote:

> Responses/comments inline.
>
> Before those, a note about process:
> It looks like the work on this proposal is already underway. However, the
> doc you sent out still has very many unanswered questions, one of which is
> "to what extent will these changes deteriorate existing use cases that are
> currently well supported".
>

I thought I'd answered that one, but just to be clear: we do not want to
deteriorate existing use cases that are well supported. The goal is to not
break any existing use cases, and we expect to substantially improve the
performance and reliability for many, as well as enable new larger-scale
use cases than possible with the current design.


> I think it is very important to spell out the goals and non-goals of this
> proposal explicitly, preferably in a separate section, so that there can be
> a meaningful discussion of those in the Impala community.
>
>
Right, sorry that I didn't get a chance to do this after we'd discussed it
previously. I will prioritize that this week.


> The proposal lists 6 contributors, all of whom are/were Cloudera employees
> at the time of writing. It appears that there was a fair amount of
> discussion and experimentation that went into this proposal, but this
> wasn't done publicly.
>

Indeed I did discuss the proposal with some colleagues prior to publishing
the doc on the mailing list. However, the doc itself was only started the
day before it was published, and I solicited a few comments from coworkers
as more of a "proof read". I personally wrote the document as well as
personally did the prototype work discussed therein. I wanted to be
inclusive of the other folks who had helped with the doc proof-reading so I
included them in the heading. I'll remove all of the names including my own
now, since the doc should be seen as a product of the open source community
rather than any particular contributor(s).

> This can create the impression that community involvement is seen as an
> afterthought rather than an integral part of the process (and the warning
> in red at the top that 'this document is public' doesn't really do anything
> to counter that impression :)).
>
>
Sure, I can see that. I figured that such a warning would be appropriate
because the document discusses a lot of workload-specific data, and I
didn't want anyone to comment "In my work at FooCorp we have a table
storing info about our 12,300 customers." Such a comment might be by a
fellow Cloudera employee, or by someone at some other contributing
organization. Happy to remove that heading if it seems like it's not inclusive.


-Todd



> On Fri, Jun 8, 2018 at 9:54 AM, Todd Lipcon  wrote:
>
>> On Thu, Jun 7, 2018 at 9:27 AM, Marcel Kornacker 
>> wrote:
>>
>>> Hey Todd,
>>>
>>> thanks for putting that together, that area certainly needs some work.
>>> I think there are a number of good ideas in the proposal, and I also
>>> think there are a number of ways to augment it to make it even better:
>>> 1. The size of the cache is obviously a result of the caching
>>> granularity (or lack thereof), and switching to a bounded size with
>>> more granularity seems like a good choice.
>>> 2. However, one of the biggest usability problems stems from the need
>>> for manual metadata discovery (via Invalidate/Refresh). Introducing a
>>> TTL for cached objects seems like a bad approach, because it creates a
>>> conflict between metadata freshness and caching
>>> effectiveness/performance. This isn't really something that users want
>>> to have to trade off.
>>>
>>
>> Agreed. In the "future directions" part of the document I included one
>> item about removing the need for invalidate/refresh, perhaps by subscribing
>> to the push notification mechanisms provided by HMS/NN/S3/etc. With those
>> notification mechanisms, the cache TTL could be dialed way up (perhaps to
>> infinity) and only rely on cache capacity for eviction.
>>
>> In some cases the notification mechanisms provide enough information to
>> update the cache in place (i.e the notification has the new information)
>> whereas in others it would just contain enough information to perform
>> invalidation. But from the user's perspective it achieves near-real-time
>> freshness either way.
>>
>> Note, though, that, absent a transactional coordination between Impala
>> and source systems, we'll always be in the realm of "soft guarantees". That
>> is to say, the notif

Re: Proposal for a new approach to Impala metadata

2018-06-08 Thread Todd Lipcon
- keep a central daemon responsible for interacting with source systems
- impalads dont directly access source systems, but instead send metadata
fetch requests to the central coordinator
- impalads maintain a small cache of their own, and the central coordinator
maintains a larger cache. If the impalad requests some info that the
central daemon doesn't have, it fetches on demand from the source.
- central coordinator propagates invalidations to impalads

In essence, it's like the current catalog design, but based on a
fine-grained "pull" design instead of a coarse-grained "push/broadcast"
design.
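As a generic illustration of the pull-through idea only (a hypothetical C++
sketch; the class and method names are made up and not part of the proposed
catalog API), a miss in the local cache triggers an on-demand fetch, and
invalidations pushed from the central daemon evict stale entries:

  #include <map>
  #include <string>

  class MetadataCache {
   public:
    // On a miss, pull the entry on demand; in the proposal this would be an
    // RPC from the impalad to the central catalog daemon, which in turn may
    // consult the source system (HMS/NN/S3) if it also misses.
    std::string GetTableMeta(const std::string& table) {
      auto it = cache_.find(table);
      if (it != cache_.end()) return it->second;   // local hit
      std::string meta = FetchFromCatalog(table);  // miss: fetch on demand
      cache_[table] = meta;
      return meta;
    }

    // Invalidations pushed from the central daemon evict stale entries.
    void Invalidate(const std::string& table) { cache_.erase(table); }

   private:
    std::string FetchFromCatalog(const std::string& table) {
      return "schema-for-" + table;  // placeholder for the remote fetch
    }

    std::map<std::string, std::string> cache_;
  };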

This allows impalads to scale out without linearly increasing the amount of
load on source systems. The downsides, though, are:
- extra hop between impalad and source system on a cache miss
- extra complexity in failure cases (source system becomes a SPOF, and if
we replicate it we have more complex consistency to worry about)
- scalability benefits may only be really important on large clusters.
Small clusters can often be served sufficiently by just one or two
coordinator impalads.

-Todd


> On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon  wrote:
> > Hey Impala devs,
> >
> > Over the past 3 weeks I have been investigating various issues with
> > Impala's treatment of metadata. Based on data from a number of user
> > deployments, and after discussing the issues with a number of Impala
> > contributors and committers, I've come up with a proposal for a new
> design.
> > I've also developed a prototype to show that the approach is workable and
> > is likely to achieve its goals.
> >
> > Rather than describe the design in duplicate, I've written up a proposal
> > document here:
> > https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaa
> qh97fC_PvqVGCk/edit?ts=5b04a6b8#
> >
> > Please take a look and provide any input, questions, or concerns.
> >
> > Additionally, if any users on this list have experienced metadata-related
> > problems in the past and would be willing to assist in testing or
> > contribute workloads, please feel free to respond to me either on or off
> > list.
> >
> > Thanks
> > Todd
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Palo, a fork of Impala (?), proposed for the incubator.

2018-06-08 Thread Todd Lipcon
Yea, definitely seems like it would be beneficial to join efforts given the
code similarity (at least as of last time I looked at the Palo source).
I'll join in on the incubator thread. Thanks for noticing this and bringing
to our attention

-Todd

On Thu, Jun 7, 2018 at 10:58 PM, Jim Apple  wrote:

> https://lists.apache.org/thread.html/74a3f3f945403b50515c658047d328
> 4955288a637207e4f97ecc15d1@%3Cgeneral.incubator.apache.org%3E
>
> I think it’s worth considering how the two communities could work together
> for the benefit of all.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: jenkins.impala.io account

2018-06-01 Thread Todd Lipcon
On Fri, Jun 1, 2018 at 11:37 AM, Michael Brown  wrote:

> Thanks for taking an interest!
>
> Is https://jenkins.impala.io/job/gerrit-verify-dryrun-external/ suitable
> for you? It's open to all contributors. It's just a wrapper around
> gerrit-verify-dryrun, sans commit-after-passing-build bit.
>

Will this give the proper "+1 verified" mark on a gerrit such that it can
be committed?

-Todd


> On Fri, Jun 1, 2018 at 11:35 AM, Todd Lipcon  wrote:
>
> > Hey folks,
> >
> > Would someone mind generating an account for me on the Impala jenkins
> > server so I can run gerrit verify jobs?
> >
> > Though I'm officially a committer (due to participation as a mentor
> during
> > incubation) I'm not yet an experienced patch contributor, so I'll be
> > prudent with my review/commit privileges.
> >
> > Thanks
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


jenkins.impala.io account

2018-06-01 Thread Todd Lipcon
Hey folks,

Would someone mind generating an account for me on the Impala jenkins
server so I can run gerrit verify jobs?

Though I'm officially a committer (due to participation as a mentor during
incubation) I'm not yet an experienced patch contributor, so I'll be
prudent with my review/commit privileges.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Statically link Kudu client

2018-05-22 Thread Todd Lipcon
Hi Quanlong,

I think the comment you found must be out of date. I think Thomas may know
the latest state of how the Kudu client is built as part of
"native-toolchain".

One thing to note is that, since Impala builds with its own toolchain, you
need to make sure that it dynamic-links against the libkudu_client that
comes from the toolchain. If you end up linking against one built using
your system toolchain there is a chance that you'll get ABI mismatches and
strange crashes. Perhaps that's what you saw when you first tried.

-Todd

On Tue, May 22, 2018 at 3:21 PM, Quanlong Huang 
wrote:

> Hi all,
>
>
> Recently when I tried to upgrade our Impala cluster in CDH5.7.1 to
> Impala-2.12, impalad crashed when inserting data to kudu tables. However,
> it works when reading from kudu tables.
> Finally, I found out that the kudu client 
> (/usr/lib/x86_64-linux-gnu/libkudu_client.so.0)
> is still linked dynamically. Issues are resolved after I update
> libkudu_client.so.0.
>
>
> In kudu-table-sink.cc, the comments said the kudu-client can be linked
> statically: https://github.com/apache/impala/blob/2.12.0/be/src/
> exec/kudu-table-sink.cc#L151
> However, when building with -static in ./buildall.sh, the kudu-client is
> still linked dynamically (see `ldd be/build/latest/service/impalad`). Is
> there a build option to link it statically?
>
>
> Thanks,
> Quanlong




-- 
Todd Lipcon
Software Engineer, Cloudera


Proposal for a new approach to Impala metadata

2018-05-22 Thread Todd Lipcon
Hey Impala devs,

Over the past 3 weeks I have been investigating various issues with
Impala's treatment of metadata. Based on data from a number of user
deployments, and after discussing the issues with a number of Impala
contributors and committers, I've come up with a proposal for a new design.
I've also developed a prototype to show that the approach is workable and
is likely to achieve its goals.

Rather than describe the design in duplicate, I've written up a proposal
document here:
https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8#

Please take a look and provide any input, questions, or concerns.

Additionally, if any users on this list have experienced metadata-related
problems in the past and would be willing to assist in testing or
contribute workloads, please feel free to respond to me either on or off
list.

Thanks
Todd


Re: Re: Best practice to compile and deploy Impala

2018-04-30 Thread Todd Lipcon
On Fri, Apr 27, 2018 at 7:09 AM, Quanlong Huang 
wrote:

> Thank you, Todd! But my IMPALA_HOME occupied 14GB after compilation...
>
>
> The be directory contributes 4.4GB but I think only impalad and
> libfesupport.so are needed. They're 981MB in total in a debug build. It's ok.
> However, the toolchain directory contributes 8.5GB. It's hard to pick up
> things we actually need (i.e. libkudu_client.so.0, libstdc++.so.6,
> libgcc_s.so.1, and cdh_components, etc.).
> I think I may not handle them in an elegant way. Could you give some more
> advice?
>
>
Maybe run 'ldd' on your impalad binary to figure out which shared objects
are actually necessary?

I also find that during development I have a good sense of which pieces
actually changed. For example if I updated only the backend I only bother
to re-copy impalad and not any of the jars. Vice versa, if I only updated
something in the front end I'd only re-copy the FE jar to the cluster. That
way you only pay the expensive deployment step the first time you set up
your cluster.

-Todd


> At 2018-04-26 00:49:05, "Todd Lipcon"  wrote:
> >Hi Quanlong,
> >
> >If you dont need full debuginfo on your cluster, you might consider
> running
> >'strip --strip-debug' on the impalad binary that you output. Between that
> >and using 'rsync' instead of copying a new full directory of jars every
> >time, it's usually not that many MB (<100?) I usually do builds on a
> >machine in the same network as the machines I plan on deploying on so that
> >the copy runs at several hundred MB/second, rather than building on my
> >laptop far away.
> >
> >Tools like pscp and pssh are also handy of course.
> >
> >Hope that helps,
> >-Todd
> >
> >On Wed, Apr 25, 2018 at 1:48 AM, Quanlong Huang 
> >wrote:
> >
> >> Hi all,
> >>
> >>
> >> Recently when I tried Impala-2.12.0-rc1, I found it really hard to
> >> deploy Impala manually (I used to do this with Cloudera Manager). The
> >> directory is huge after compilation, so I only distributed the pieces I
> >> thought were really needed. This work is tedious and prone to errors.
> >>
> >>
> >> Is there a best practice for packaging and distributing the binaries
> after
> >> compiling?
> >>
> >>
> >> Thanks,
> >> Quanlong
> >
> >
> >
> >
> >--
> >Todd Lipcon
> >Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Best practice to compile and deploy Impala

2018-04-25 Thread Todd Lipcon
Hi Quanlong,

If you dont need full debuginfo on your cluster, you might consider running
'strip --strip-debug' on the impalad binary that you output. Between that
and using 'rsync' instead of copying a new full directory of jars every
time, it's usually not that many MB (<100?) I usually do builds on a
machine in the same network as the machines I plan on deploying on so that
the copy runs at several hundred MB/second, rather than building on my
laptop far away.

Tools like pscp and pssh are also handy of course.

Hope that helps,
-Todd

On Wed, Apr 25, 2018 at 1:48 AM, Quanlong Huang 
wrote:

> Hi all,
>
>
> Recently when I tried Impala-2.12.0-rc1, I found it really hard to
> deploy Impala manually (I used to do this with Cloudera Manager). The
> directory is huge after compilation, so I only distributed the pieces I
> thought were really needed. This work is tedious and prone to errors.
>
>
> Is there a best practice for packaging and distributing the binaries after
> compiling?
>
>
> Thanks,
> Quanlong




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Help with fixing clang-diagnostic-over-aligned

2018-02-14 Thread Todd Lipcon
This probably means that QueryExecMgr (or one of its descendant members) is
annotated with __attribute__((aligned(64))), perhaps through something like
a CACHELINE_ALIGNED macro.

The issue is that, as the warning says, normal libc malloc doesn't make any
hard guarantees about the alignment of allocations, and the way that
'operator new' is implemented by default only gets the size and not the
desired alignment. I believe C++17 is adding support for an 'operator new'
overload that also passes the desired alignment, so that it can call down
into posix_memalign and ensure the appropriate allocation.

The "correct" fix prior to C++17 is probably to do something like overload
'operator new' for the type, or to use posix_memalign manually to allocate
aligned memory, and then placement-new it into that memory.

Another reasonable fix would probably be to add a CHECK in the constructor
that 'this' is aligned as desired so that, if you ever run on an allocator
that isn't guaranteeing what you think it is, you'll know immediately
rather than getting a surprising perf or atomicity regression due to
cross-cacheline loads/stores.
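As a rough pre-C++17 sketch of both ideas (illustrative only, not the actual
Impala fix; the class name and the assert standing in for a CHECK are
placeholders), the type can overload operator new to route through
posix_memalign and then verify the alignment it actually received:

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <cstdlib>
  #include <new>

  // Stand-in for a cache-line-aligned class like QueryExecMgr.
  struct alignas(64) AlignedThing {
    // Route heap allocation through posix_memalign so the 64-byte requirement
    // is honored rather than relying on malloc's default 16-byte alignment.
    static void* operator new(std::size_t size) {
      void* ptr = nullptr;
      if (posix_memalign(&ptr, 64, size) != 0) throw std::bad_alloc();
      return ptr;
    }
    static void operator delete(void* ptr) noexcept { free(ptr); }

    AlignedThing() {
      // Stand-in for a CHECK: fail fast if some allocator hands back
      // misaligned memory instead of risking cross-cacheline accesses.
      assert(reinterpret_cast<std::uintptr_t>(this) % 64 == 0);
    }

    char data[64];
  };

  int main() {
    AlignedThing* t = new AlignedThing();  // uses the overloaded operator new
    delete t;
    return 0;
  }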

-Todd


On Wed, Feb 14, 2018 at 12:33 PM, Sailesh Mukil 
wrote:

> Does anyone have experience with fixing the following type of
> clang-tidy warning?
>
>
> /home/ubuntu/Impala/be/src/runtime/exec-env.cc:159:21: warning: type
> 'impala::QueryExecMgr' requires 64 bytes of alignment and the default
> allocator only guarantees 16 bytes [clang-diagnostic-over-aligned]
> query_exec_mgr_(new QueryExecMgr()),
>
>
> /home/ubuntu/Impala/be/src/service/impalad-main.cc:80:49: warning:
> type 'impala::ImpalaServer' requires 64 bytes of alignment and the
> default allocator only guarantees 16 bytes
> [clang-diagnostic-over-aligned]
>   boost::shared_ptr impala_server(new
> ImpalaServer(&exec_env));
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: ORC scanner - points for discussion

2018-02-09 Thread Todd Lipcon
On Fri, Feb 9, 2018 at 5:05 PM, Tim Armstrong 
wrote:

> Quanlong has done a bunch of work implementing an ORC scanner. I've been
> playing around with it and it works pretty nicely - I can load and run
> TPC-H with no problem!
>
> It's a big addition to Impala and the integration with the external library
> has caused some implementation challenges, so I wanted to summarise the
> issues so that other community members can weigh in.
>
> *1. in-tree versus third-party library*
> The current patchset imports the ORC source code into the Impala source
> tree, rather than treating it as an external library.
>
> My feeling is that, initially, we don't want to take on the burden of
> maintaining a fork of the library, so we're best off keeping it external.
> This way we can upgrade the library to pick up new ORC features instead of
> having to bring the changes into the Impala codebase. I do think we should
> generally try to collaborate with other Apache projects and contribute back
> improvements where possible, instead of forking their code.
>
> We could re-evaluate this at some point, but reducing the maintenance
> burden for now seems most important.
>
> *2. C++ Exceptions*
> The ORC library relies on exceptions for error handling. Our coding style
> guide disallows exceptions, but we do have to deal with them in a few
> places interfacing with external libraries like boost (which sucks). If
> we're interfacing with the library, I don't see how we can avoid throwing
> and catching exceptions at the boundaries with the library, which is what
> Quanlong's patch does. I looked at the ORC source code and the exceptions
> seem fairly pervasive.
>
> My feeling is that we can live with this, provided that we're very careful
> to catch exceptions at the library boundary and it is only
> hdfs-orc-scanner.cc that has to deal with the exceptions.
>
> *3. What is the quality bar?*
> We definitely want all the functional tests to pass, same as other file
> formats like Avro. I also asked Quanlong to add it to the fuzz test to get
> that extra coverage. It would be helpful if others could confirm that we
> have enough test coverage.
>
> *4. What is the bar for perf and resource consumption?*
> The initial version of the scanner won't match Parquet in these categories,
> mainly because it isn't tightly integrated with the Impala runtime and is
> missing a few things like codegen. There may be some limits to how far we
> can improve this without modifying the ORC library.
>

For quality, perf, etc, I think it would be reasonable to ship it for a
release or two as an explicitly labeled experimental feature. I think the
experience in the past is that most file format integrations appear to work
with synthetic datasets but have new and exciting failures and
compatibility issues in the wild once run against the variety of actual
implementations that produce the format.

One thing we have found helpful in the Kudu community is to actually
enforce the "experimental" nature of new features for a couple of releases.
We do this by having a gflag called --unlock-experimental-flags. Users who
decide to flip that on are explicitly acknowledging that they're treading
on some unproven ground and they might get crashes, correctness issues, etc.
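For anyone unfamiliar with the pattern, here is a rough gflags-based sketch of
that kind of gating (the flag names and wiring are hypothetical, not Kudu's or
Impala's actual implementation):

  #include <cstdio>
  #include <cstdlib>
  #include <gflags/gflags.h>

  // Hypothetical flags for illustration only.
  DEFINE_bool(unlock_experimental_flags, false,
              "Allow flags that enable experimental, unsupported features.");
  DEFINE_bool(enable_orc_scanner, false,
              "(Experimental) Enable the ORC scanner.");

  int main(int argc, char** argv) {
    gflags::ParseCommandLineFlags(&argc, &argv, /*remove_flags=*/true);

    // Refuse to start if an experimental feature is requested without the
    // explicit opt-in, so users can't stumble into unproven code paths.
    if (FLAGS_enable_orc_scanner && !FLAGS_unlock_experimental_flags) {
      std::fprintf(stderr, "--enable_orc_scanner is experimental; pass "
                           "--unlock_experimental_flags to use it.\n");
      return EXIT_FAILURE;
    }
    std::printf("orc scanner %s\n",
                FLAGS_enable_orc_scanner ? "enabled" : "disabled");
    return 0;
  }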

Of course the goal should be that, if after a release or two of use the
feedback from users is that it works well and few issues have been found,
it gets marked non-experimental.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera