Re: Impala Column Masking Behavior Design
I'd agree that applying it at the innermost column ref makes the most sense from a security perspective. Otherwise it's trivial to "binary search" your way to the value of a masked column, even if the masking is completely "xed" out. I'm surprised to hear that DB2 implements it otherwise, though quick googling agrees with that. Perhaps the assumption there is that anyone who is binary-searching to expose data will be caught by audit or other security features. -Todd On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong wrote: > I think compatibility with Hive is pretty important - the default > expectation will be that Ranger policies behave consistently across SQL > engines. I think it would be hard to argue for differing default behaviour > if it's in some sense less secure. > > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab > wrote: > > > Hey Quanlong, > > > > For me it seems more important not to leak confidential information so > I'd > > vote for (a). I wonder what others think. > > > > Gabor > > > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang > > wrote: > > > > > Hi all, > > > > > > We are adding the support for Ranger column masking and need to reach a > > > consensus on the behavior design. > > > > > > A column masking policy is something like "only show last 4 chars of > > phone > > > column to user X". When user X reads the phone column, the value would > be > > > something like "x6789" instead of the real value "123456789". > > > > > > The behavior is clear when the query is simple. However, there're two > > > different behaviors when the query contains subqueries. The key part is > > > where we should perform the masking, whether in the outermost select > > list, > > > or in the select list of the innermost subquery. > > > > > > To be specific, consider these two queries: > > > (1) subquery contains predicates on unmasked value > > > SELECT concat(name, phone) FROM ( > > > SELECT name, phone FROM customer WHERE phone = '123456789' > > > ) t; > > > (2) subquery contains predicates on masked value > > > SELECT concat(name, phone) FROM ( > > > SELECT name, phone FROM customer WHERE phone = 'x6789' > > > ) t; > > > > > > Let's say there's actually one row in table 'customer' satisfying > phone = > > > '123456789'. When user X runs the queries, the two different behaviors > > are: > > > (a) Query1 returns nothing. Query2 returns one result: "Bobx6789". > > > (b) Query1 returns one result: "Bobx6789". Query2 returns nothing. > > > > > > Hive is in behavior (a) since it does a table masking that replaces the > > > TableRef with a subquery containing masked columns. See more in codes: > > > > > > > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155 > > > and some experiments I did: > > > > > > > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing > > > > > > Kurt mentions that traditional dbs like DB2 are in behavior (b). I > think > > we > > > need to decide which behavior we'd like to support. The pros for > behavior > > > (a) is no security leak. Because user X can't guess whether there are > > some > > > customers with phone number '123456789'. The pros for behavior (b) is > > users > > > don't need to rewrite their existing queries after admin applies column > > > masking policies. > > > > > > What do you think? > > > > > > Thanks, > > > Quanlong > > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Ubuntu 18.04 in pre-merge tests?
On Sun, Jun 23, 2019 at 8:23 PM Jim Apple wrote: > > > > Generally I think precommit running on something closer to the oldest > > supported OS is better than running on the newest, since it's more likely > > that new OSes are backward-compatible. Otherwise it's very easy to > > introduce code that uses features not available on el7, for example. > > > > I find that argument compelling. Do you think we should switch the > pre-commit job to CentOS 7? > Seems reasonable to me, if someone wants to volunteer the time to do it :) Perhaps others have some reason why it wouldn't work well, though. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Ubuntu 18.04 in pre-merge tests?
Alexey and I looked at this today and realized the issue is with OpenSSL 1.1.1, which adds support for TLS 1.3. This breaks the TLS negotiation in the krpc library. Likely Impala's usage of krpc would also break in this environment when wire encryption is enabled. I put up a temporary fix (disable TLS 1.3) here: http://gerrit.cloudera.org:8080/13683 Likely we need to cross-port this to Impala's krpc copy as well. -Todd On Wed, Jun 19, 2019 at 3:30 PM Alexey Serbin wrote: > Yep, some time ago over a weekend I started with an attempt to get a > Fedora29 machine, but I got stuck there while trying to provision such a > thing. I.e., the machine has been eventually provisioned, but I could not > access it. That was where I left it. > > Having Ubuntu18 as a target machine is better since at least it's easier > to create one for me. I've provisioned one already and I'm starting a Kudu > build there in an attempt to take a look at the issue later tonight. > > I'll keep you posted on my findings. > > > Kind regards, > > Alexey > > On Wed, Jun 19, 2019 at 2:53 PM Todd Lipcon wrote: > >> This same issue was reported a month or two ago for Kudu on Fedora 29. I >> think Alexey Serbin had started to look into it. Alexey, did we figure out >> what was going on here? >> >> -Todd >> >> On Wed, Jun 19, 2019 at 6:00 AM Laszlo Gaal >> wrote: >> >>> Having looked at the failing build Jim quoted above, the failure seems to >>> come from the security area. >>> This is from the Kudu master's log, from the startup sequence (see >>> >>> https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/16/artifact/Impala/logs_static/logs/cluster/cdh6-node-1/kudu/master/kudu-master.INFO/*view*/ >>> ), >>> all this in the context of an Impala minicluster: >>> >>> I0612 04:12:56.129866 8515 sys_catalog.cc:424] T >>> P 58a05ce6efa74b30907ac4d679bd0515 >>> [sys.catalog]: configured and running, proceeding with master startup. >>> W0612 04:12:56.130080 8522 catalog_manager.cc:1113] T >>> P 58a05ce6efa74b30907ac4d679bd0515: >>> acquiring CA information for follower catalog manager: Not found: root CA >>> entry not found >>> W0612 04:12:56.130123 8522 catalog_manager.cc:596] Not found: root CA >>> entry not found: failed to prepare follower catalog manager, will retry >>> I0612 04:12:56.130151 8521 catalog_manager.cc:1055] Loading table and >>> tablet metadata into memory... >>> I0612 04:12:56.130228 8521 catalog_manager.cc:1066] Initializing Kudu >>> internal certificate authority...
>>> W0612 04:12:56.167639 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50174: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.170145 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50176: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.172571 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50178: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.182530 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50180: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.185034 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50182: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.187453 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50184: expected TLS_HANDSHAKE step: SASL_INITIATE >>> I0612 04:12:56.197146 8521 catalog_manager.cc:950] Generated new >>> certificate authority record >>> I0612 04:12:56.198005 8521 catalog_manager.cc:1075] Loading token >>> signing >>> keys... >>> W0612 04:12:56.293697 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>> 127.0.0.1:50186: expected TLS_HANDSHAKE step: SASL_INITIATE >>> W0612 04:12:56.295320 8636 negotiation.cc:320] Unauthorized connection >>> attempt: Server connection negotiation failed: server connection from >>>
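The temporary fix referenced in the message above is to disable TLS 1.3 so that an OpenSSL 1.1.1 build negotiates at most TLS 1.2, which is what the krpc negotiation sequence expects. The snippet below is only a minimal sketch of the standard OpenSSL 1.1.x API for capping the protocol version on a context; it is not the actual krpc patch from the gerrit link above, and the function name is made up for illustration.

// Minimal sketch (not the krpc change): cap an OpenSSL server context at
// TLS 1.2 so that an OpenSSL 1.1.1 build does not negotiate TLS 1.3.
#include <cstdio>
#include <openssl/err.h>
#include <openssl/ssl.h>

SSL_CTX* CreateTls12CappedServerContext() {
  SSL_CTX* ctx = SSL_CTX_new(TLS_server_method());
  if (ctx == nullptr) return nullptr;
  // Available since OpenSSL 1.1.0; returns 1 on success.
  if (SSL_CTX_set_max_proto_version(ctx, TLS1_2_VERSION) != 1) {
    ERR_print_errors_fp(stderr);
    SSL_CTX_free(ctx);
    return nullptr;
  }
  // An equivalent knob on OpenSSL 1.1.1:
  //   SSL_CTX_set_options(ctx, SSL_OP_NO_TLSv1_3);
  return ctx;
}

Capping the version this way keeps the existing SASL/TLS negotiation steps working unchanged until the handshake code is taught about the differences TLS 1.3 introduces.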
Re: Ubuntu 18.04 in pre-merge tests?
c:587] Failed to heartbeat to > 127.0.0.1:7051 (62 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:14:56.556850 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (122 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:15:56.694403 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (182 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:16:56.826400 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (242 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:17:56.955927 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (302 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:18:57.103503 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (362 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:19:57.237712 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (422 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:20:57.393489 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (482 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:21:57.522513 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (542 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:22:57.652271 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (602 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:23:57.782537 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (662 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: 
FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > W0612 04:24:57.910481 8897 heartbeater.cc:587] Failed to heartbeat to > 127.0.0.1:7051 (722 consecutive failures): Not authorized: Failed to ping > master at 127.0.0.1:7051: Client connection negotiation failed: client > connection to 127.0.0.1:7051: FATAL_UNAUTHORIZED: Not authorized: expected > TLS_HANDSHAKE step: SASL_INITIATE > > > On Mon, Jun 17, 2019 at 9:08 PM Todd Lipcon wrote: > > > On Sat, Jun 15, 2019 at 2:20 PM Jim Apple wrote: > > > > > My goal is to have Impala keep up with (what I perceive to be) the most > > > popular version of the most popular Linux distribution, for the purpose > > of > > > easing the workflow of developers, especially new developers. > > > > > > > Sure, that makes sense. I use Ubuntu 18 myself, but tend to develop > Impala > > on a remote box running el7 because the dev environment is too > heavy-weight > > to realistically run on my laptop. > > > > > > > > > > 18.04 stopped being able to load data some time between June 9th and > > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/14/ an
Re: Ubuntu 18.04 in pre-merge tests?
On Sat, Jun 15, 2019 at 2:20 PM Jim Apple wrote: > My goal is to have Impala keep up with (what I perceive to be) the most > popular version of the most popular Linux distribution, for the purpose of > easing the workflow of developers, especially new developers. > Sure, that makes sense. I use Ubuntu 18 myself, but tend to develop Impala on a remote box running el7 because the dev environment is too heavy-weight to realistically run on my laptop. > > 18.04 stopped being able to load data some time between June 9th and > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/14/ and June 12 > and > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/16/artifact/Impala/logs_static/logs/data_loading/catalogd.ERROR/*view*/ > . > I tried reproducing the June 9 run with the same git checkouts (Impala and > Impala-LZO) as #14 today, and data loading still failed. > > What RHEL 7 components did you have in mind that are closer to Ubuntu 16.04 > than 18.04? > Stuff like libc, openssl, krb5, sasl, etc are pretty different version-wise. At least, I know when we made Kudu pass tests on Ubuntu 18, we dealt with issues mostly in those libraries, which aren't part of the toolchain (for security reasons we rely on OS-provided libs). Generally I think precommit running on something closer to the oldest supported OS is better than running on the newest, since it's more likely that new OSes are backward-compatible. Otherwise it's very easy to introduce code that uses features not available on el7, for example. > > On Wed, May 22, 2019 at 10:41 AM Todd Lipcon wrote: > > > On Mon, May 20, 2019 at 8:36 PM Jim Apple wrote: > > > > > Maybe now would be a good time to implement Everblue jobs that ping > dev@ > > > when they fail. Thoughts? > > > > > > > Mixed feelings on that. We already get many test runs per day of the > > "default" config because people are running precommit builds. Adding an > > additional cron-based job to the mix that runs the same builds doesn't > seem > > like it adds much unless it tests some other config (eg Ubuntu 18 or a > > longer suite of tests). One thing I could get on board with would be > > switching the precommit builds to run just "core" tests or some other > > faster subset, and defer the exhaustive/long runs to scheduled builds or > as > > an optional precommit for particularly invasive patches. I think that > would > > increase dev quality of life substantially (I find my productivity is > often > > hampered by only getting two shots at a precommit run per work day). > > > > I'm not against adding a cron-triggered full test/build on Ubuntu 18, but > > would like to know if someone plans to sign up to triage it when it > fails. > > My experience with other Apache communities is that collective ownership > > over test triage duty (ie "email the dev list on failure" doesn't work. I > > seem to recall we had such builds back in 2010 or so on Hadoop and they > > just always got ignored. In various "day job" teams I've seen this work > via > > a prescriptive rotation ("all team members take a triage/build-cop > shift") > > but that's not really compatbile with the nature of Apache projects being > > volunteer communities. > > > > So, I think I'll put the question back to you: as a committer you can > spend > > your time as you like. 
If you think an Ubuntu 18 job running on a > schedule > > would be useful and willing to sign up to triage failures, sounds great > to > > me :) Personally I don't develop on Ubuntu 18 and in my day job it's not > a > > particularly important deployment platform, so I personally don't think > > I'll spend much time triaging that build. > > > > Todd > > > > > > > > > > On Mon, May 20, 2019 at 9:09 AM Todd Lipcon wrote: > > > > > > > Adding a build-only job for 18.04 makes sense to me. A full test run > on > > > > every precommit seems a bit expensive but doing one once a week or > > > > something like that might be a good idea to prevent runtime > > regressions. > > > > > > > > As for switching the precommit from 16.04 to 18.04, I'd lean towards > > > > keeping to 16.04 due to it being closer in terms of component > versions > > to > > > > common enterprise distros like RHEL 7. > > > > > > > > -Todd > > > > > > > > On Sun, May 19, 2019 at 5:03 PM Jim Apple > wrote: > > > > > > > > > HEAD now passes on Ub
Re: Ubuntu 18.04 in pre-merge tests?
On Mon, May 20, 2019 at 8:36 PM Jim Apple wrote: > Maybe now would be a good time to implement Everblue jobs that ping dev@ > when they fail. Thoughts? > Mixed feelings on that. We already get many test runs per day of the "default" config because people are running precommit builds. Adding an additional cron-based job to the mix that runs the same builds doesn't seem like it adds much unless it tests some other config (eg Ubuntu 18 or a longer suite of tests). One thing I could get on board with would be switching the precommit builds to run just "core" tests or some other faster subset, and defer the exhaustive/long runs to scheduled builds or as an optional precommit for particularly invasive patches. I think that would increase dev quality of life substantially (I find my productivity is often hampered by only getting two shots at a precommit run per work day). I'm not against adding a cron-triggered full test/build on Ubuntu 18, but would like to know if someone plans to sign up to triage it when it fails. My experience with other Apache communities is that collective ownership over test triage duty (ie "email the dev list on failure") doesn't work. I seem to recall we had such builds back in 2010 or so on Hadoop and they just always got ignored. In various "day job" teams I've seen this work via a prescriptive rotation ("all team members take a triage/build-cop shift") but that's not really compatible with the nature of Apache projects being volunteer communities. So, I think I'll put the question back to you: as a committer you can spend your time as you like. If you think an Ubuntu 18 job running on a schedule would be useful and willing to sign up to triage failures, sounds great to me :) Personally I don't develop on Ubuntu 18 and in my day job it's not a particularly important deployment platform, so I personally don't think I'll spend much time triaging that build. Todd > > On Mon, May 20, 2019 at 9:09 AM Todd Lipcon wrote: > > > Adding a build-only job for 18.04 makes sense to me. A full test run on > > every precommit seems a bit expensive but doing one once a week or > > something like that might be a good idea to prevent runtime regressions. > > > > As for switching the precommit from 16.04 to 18.04, I'd lean towards > > keeping to 16.04 due to it being closer in terms of component versions to > > common enterprise distros like RHEL 7. > > > > -Todd > > > > On Sun, May 19, 2019 at 5:03 PM Jim Apple wrote: > > > > > HEAD now passes on Ubuntu 18.04: > > > > > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/ > > > > > > Thanks to the community members who have made this happen! > > > > > > Should we add Ubuntu 18.04 to our pre-merge Jenkins job, replace 16.04 > > with > > > 18.04 in our pre-merge Jenkins job, or neither? > > > > > > I propose adding 18.04 for now (and so running both 16.04 and 18.04 on > > > merge) and removing 16.04 when it starts to become inconvenient. > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Ubuntu 18.04 in pre-merge tests?
Adding a build-only job for 18.04 makes sense to me. A full test run on every precommit seems a bit expensive but doing one once a week or something like that might be a good idea to prevent runtime regressions. As for switching the precommit from 16.04 to 18.04, I'd lean towards keeping to 16.04 due to it being closer in terms of component versions to common enterprise distros like RHEL 7. -Todd On Sun, May 19, 2019 at 5:03 PM Jim Apple wrote: > HEAD now passes on Ubuntu 18.04: > > https://jenkins.impala.io/job/ubuntu-18.04-from-scratch/ > > Thanks to the community members who have made this happen! > > Should we add Ubuntu 18.04 to our pre-merge Jenkins job, replace 16.04 with > 18.04 in our pre-merge Jenkins job, or neither? > > I propose adding 18.04 for now (and so running both 16.04 and 18.04 on > merge) and removing 16.04 when it starts to become inconvenient. > -- Todd Lipcon Software Engineer, Cloudera
Re: Enabled backend tests for UBSAN
That's great to hear. Has anyone tried running any BE tests with TSAN in recent years? It's been invaluable for Kudu, and while I'm sure it would take a bit of effort to get passing on Impala, the ongoing returns would be worth it IMO. One point, though, is that for TSAN to work well, you need to instrument thirdparty libs as well, so we'd need to start from native-toolchain and work our way up. -Todd On Mon, May 6, 2019 at 10:49 AM Tim Armstrong wrote: > Hi All, > I turned on backend tests under UBSAN in our precommit tests - Jim Apple > let me know that they were passing after he fixed the final set of issues > there. > > - Tim > -- Todd Lipcon Software Engineer, Cloudera
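For anyone who has not used ThreadSanitizer before, the following is a small self-contained example of the kind of data race it reports, together with a generic clang invocation; this is plain TSAN usage, not Impala's or Kudu's actual build setup.

// tsan_example.cc -- a deliberate data race that ThreadSanitizer flags.
// Generic build command (not Impala's build scripts):
//   clang++ -std=c++14 -g -fsanitize=thread tsan_example.cc -o tsan_example
#include <cstdio>
#include <thread>

int counter = 0;  // shared and intentionally unsynchronized

void Increment() {
  for (int i = 0; i < 100000; ++i) counter++;  // racy read-modify-write
}

int main() {
  std::thread t1(Increment);
  std::thread t2(Increment);
  t1.join();
  t2.join();
  // Under -fsanitize=thread this run prints a
  // "WARNING: ThreadSanitizer: data race" report pointing at Increment().
  printf("counter = %d\n", counter);
  return 0;
}

The point about third-party libraries is that TSAN only understands synchronization it has instrumented, so uninstrumented dependencies tend to produce false positives or hide real races -- hence the suggestion to start from native-toolchain and work up.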
Re: Remote read testing in precommit
Perhaps simplest is to just check for the existence of that script, and if it doesn't exist, 'exit 0' from the job, so it doesn't get marked failed? On Thu, Apr 25, 2019 at 8:40 PM Tim Armstrong wrote: > I'm sorry about that - I should have thought about the 2.x branch! I rolled > back the config change for now and will come up with a plan to skip the > tests on 2.x. > > On Thu, Apr 25, 2019 at 6:24 PM Quanlong Huang > wrote: > > > Sorry to be late. Can we skip the ubuntu-16.04-dockerised-tests job for > > branch 2.x or add an option to disable it? Just hit a failure due to > this: > > https://jenkins.impala.io/job/gerrit-verify-dryrun/4064/ > > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/72/console > > > > File ./bin/jenkins/dockerized-impala-bootstrap-and-test.sh is not found > in > > branch 2.x so it will finally fail. > > > > Thanks, > > Quanlong > > > > On Thu, Apr 25, 2019 at 2:37 AM Tim Armstrong > > wrote: > > > > > I tested it here: > > > https://jenkins.impala.io/job/parallel-all-tests-tarmstrong/ and it > > works > > > fine, so I made the corresponding change in precommit at > > > > > > > > > https://jenkins.impala.io/job/parallel-all-tests/jobConfigHistory/showDiffFiles?timestamp1=2019-01-18_01-08-25×tamp2=2019-04-24_18-35-23 > > > > > > Let me know if you see any issues. > > > > > > On Mon, Apr 22, 2019 at 2:19 PM Lars Volker wrote: > > > > > > > +1 for turning it on > > > > > > > > On Mon, Apr 22, 2019 at 2:14 PM Tim Armstrong < > tarmstr...@cloudera.com > > > > > > > wrote: > > > > > > > > > It's been stable for a while now, with the exception of hitting a > > flaky > > > > > test that is also flaky on the non-dockerised minicluster > > > (IMPALA-8124) - > > > > > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/ > > > > > > > > > > Are there any objections to me modifying parallel-all-tests and > > > therefore > > > > > precommit to run this job? I'll wait a couple of days for lazy > > > consensus > > > > > then go ahead. > > > > > > > > > > Thanks, > > > > > Tim > > > > > > > > > > On Fri, Apr 5, 2019 at 3:03 PM Lars Volker > wrote: > > > > > > > > > > > +1, thanks for working on this! > > > > > > > > > > > > On Fri, Apr 5, 2019 at 11:18 AM Jim Apple > > > wrote: > > > > > > > > > > > > > I'm in favor. Given the importance of remote reads, I would > even > > be > > > > in > > > > > > > favor of these if it DID extend the critical path. > > > > > > > > > > > > > > On Fri, Apr 5, 2019 at 10:41 AM Tim Armstrong < > > > > tarmstr...@cloudera.com > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > This is really about testing the dockerised minicluster, but > > > gives > > > > us > > > > > > > > coverage of remote read code paths for free, and more people > > care > > > > > about > > > > > > > > that right now. > > > > > > > > > > > > > > > > I got the core end-to-end tests passing locally as part of > > > > > > > > https://issues.apache.org/jira/browse/IMPALA-7995. That > change > > > is > > > > up > > > > > > for > > > > > > > > review here https://gerrit.cloudera.org/c/12639/. The next > > step > > > is > > > > > to > > > > > > > get > > > > > > > > a > > > > > > > > Jenkins job running, which I've been working on. > > > > > > > > > > > > > > > > I'd like to run it regularly so we can catch any regressions. > > > > > Initially > > > > > > > > I'll just have it email me when it fails, but after it's > stable > > > > for a > > > > > > > week > > > > > > > > or two I'd like to make it part of the regular set of jobs. 
> > > > > > > > > > > > > > > > My preference is to run it as part of the precommit jobs, in > > > > parallel > > > > > > to > > > > > > > > the Ubuntu 16.04 tests. It should not extend the critical > path > > of > > > > > > > precommit > > > > > > > > because it only runs the end-to-end tests. We could > > alternatively > > > > run > > > > > > it > > > > > > > as > > > > > > > > a scheduled post-commit job, but that tends to create > > additional > > > > work > > > > > > > when > > > > > > > > it breaks. > > > > > > > > > > > > > > > > What do people think? > > > > > > > > > > > > > > > > - Tim > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Remove 'hive-benchmark' workload?
Posted a review: http://gerrit.cloudera.org:8080/13117 On Thu, Apr 25, 2019 at 8:37 AM Bharath Vissapragada wrote: > +1, never used it! > > On Thu, Apr 25, 2019 at 8:20 AM Tim Armstrong > wrote: > > > +1 - do it! > > > > On Thu, Apr 25, 2019 at 1:43 AM Todd Lipcon wrote: > > > > > It seems this is some ancient workload that hasn't been touched since > > 2013 > > > and has rotted: > > > - fails to load because the name has a '-' in it, and we use unquoted > > > database names in the load script > > > - the SQL to load it seems to refer to testdata files which don't exist > > > > > > Even if it did load, it looks like it's only a set of ~10 trivial > queries > > > which I doubt have any real benchmark interest in modern days. > > > > > > Anyone mind if I excise this cruft? > > > > > > -Todd > > > -- > > > Todd Lipcon > > > Software Engineer, Cloudera > > > > > > -- Todd Lipcon Software Engineer, Cloudera
Remove 'hive-benchmark' workload?
It seems this is some ancient workload that hasn't been touched since 2013 and has rotted: - fails to load because the name has a '-' in it, and we use unquoted database names in the load script - the SQL to load it seems to refer to testdata files which don't exist Even if it did load, it looks like it's only a set of ~10 trivial queries which I doubt have any real benchmark interest in modern days. Anyone mind if I excise this cruft? -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: A problem of left outer join when statement contains two aggregation for the same column
Sounds like we should file a high priority JIRA (any "wrong results" bugs should probably be considered critical or blocker). Quanlong, any interest in working on this issue? -Todd On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang wrote: > +1 for Csaba's analysis. Looks like similiar to this > https://issues.apache.org/jira/browse/IMPALA-3126 > > On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer > wrote: > > > Hi! > > > > I have checked the queries, and I can verify that Impala incorrectly > > returns 1 row while the same query with Hive (or common sense..) returns > 2 > > rows. > > > > > "but if remove the "t2.amount2" like this:" > > Indeed, the issue seems to be related to returning the same aggregate > twice > > + the fact that one of these values is NULL. The planner introduces a > > predicate that checks if amount1=amount2, which is false, if both values > > are NULL: > > > > explain select * from (select t2.a_id,t2.amount1,t2.amount2 > > from( select a_id from a) t1 > > left outer join ( > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > on t1.a_id = t2.a_id) t; > > > > results in: > > PLAN-ROOT SINK > > | > > 05:HASH JOIN [RIGHT OUTER JOIN] > > | hash predicates: c.a_id = a_id > > | other predicates: sum(amount) = sum(amount) <- I don't know why > this > > predicate is added. > > | runtime filters: RF000 <- a_id > > | row-size=16B cardinality=2 > > > > > > If I EXPLAIN the query without the outer select, the sum(amount) = > > sum(amount) is not added, which explains the difference. > > > > I do not know why the planner adds this predicate, my guess is that this > is > > some kind of bug in Impala. > > > > > > On Thu, Apr 4, 2019 at 2:27 PM skyyws wrote: > > > > > Hi all, I met a problem of left outer join recently, and I reproduce > this > > > problem by some simple test data with three tables a, b, c: > > > table A > > > +--+ > > > | a_id | > > > +--+ > > > | 1| > > > | 2| > > > +--+ > > > table B > > > +--++ > > > | b_id | amount | > > > +--++ > > > | 1| 10 | > > > | 1| 20 | > > > | 2| NULL | > > > +--++ > > > table C > > > +--+--+ > > > | a_id | b_id | > > > +--+--+ > > > | 1| 1| > > > | 2| 2| > > > +--+--+ > > > The sql below: > > > select count(1) from ( > > > select t2.a_id,t2.amount1,t2.amount2 > > > from( select a_id from a) t1 > > > left outer join ( > > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > > on t1.a_id = t2.a_id > > > ) t; > > > +--+ > > > | count(1) | > > > +--+ > > > | 1| > > > +--+ > > > but if remove the "t2.amount2" like this: > > > select count(1) from ( > > > select t2.a_id,t2.amount1 > > > from( select a_id from a) t1 > > > left outer join ( > > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > > on t1.a_id = t2.a_id > > > ) t; > > > +--+ > > > | count(1) | > > > +--+ > > > | 2| > > > +--+ > > > Here is the result of two subquery without count(1): > > > +--+-+-+ > > > | a_id | amount1 | amount2 | > > > +--+-+-+ > > > | 1| 30 | 30 | > > > | 2| NULL| NULL| > > > +--+-+-+ why the count(1) of this > > > resultset is 1? > > > +--+-+ > > > | a_id | amount1 | > > > +--+-+ > > > | 1| 30 | > > > | 2| NULL| > > > +--+-+ why the count(1) of this > > > resultset is 2? > > > I want to ask why the first sql return just 1, but second return 2,is > > this > > > correct or impala bug?How impala deal with count aggr.? 
> > > If I change the sum to other aggr. function like count/max/min, result > is > > > same. I test this on 2.12.0 and 3.1.0 version. > > > > > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: +2ability on gerrit
Ah, try now. Seems I added you to the wrong group On Fri, Mar 29, 2019 at 10:33 AM Jim Apple wrote: > Hm, didn't seem to work. > > On Thu, Mar 28, 2019 at 9:28 AM Todd Lipcon wrote: > > > I think you're all set. Give it a shot? > > > > -Todd > > > > On Tue, Mar 26, 2019 at 8:05 PM Jim Apple wrote: > > > > > Hello! I am now using "jbapple", not "jbapple-cloudera", as my gerrit > > > handle. Can someone with an admin login give me the auths to +2 > changes? > > > > > > Thanks in advance! > > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: +2ability on gerrit
I think you're all set. Give it a shot? -Todd On Tue, Mar 26, 2019 at 8:05 PM Jim Apple wrote: > Hello! I am now using "jbapple", not "jbapple-cloudera", as my gerrit > handle. Can someone with an admin login give me the auths to +2 changes? > > Thanks in advance! > -- Todd Lipcon Software Engineer, Cloudera
Re: Runtime filter publishing from a join node to its cousin
always start before or > concurrently >with the join build with the subtree containing the destination scan > node. > > yea, definitely merits some further thought so we don't end up with such a wait-cycle either due to explicit data dependencies or due to our scheduling policies. BTW, on the plus side, my example query went from taking 387sec without my optimization to <1sec with the optimization. TPCDS Q64 SF1000 on a 9-node cluster goes from 44sec to 8sec. A good number of the other plans seem to have gotten new RFs as well, so re-running them to see if there are any other big speedups: Files query11.sql.explain-old and query11.sql.explain-new differ Files query17.sql.explain-old and query17.sql.explain-new differ Files query1.sql.explain-old and query1.sql.explain-new differ Files query25.sql.explain-old and query25.sql.explain-new differ Files query29.sql.explain-old and query29.sql.explain-new differ Files query32.sql.explain-old and query32.sql.explain-new differ Files query39.sql.explain-old and query39.sql.explain-new differ Files query58.sql.explain-old and query58.sql.explain-new differ Files query64.sql.explain-old and query64.sql.explain-new differ Files query65.sql.explain-old and query65.sql.explain-new differ Files query72.sql.explain-old and query72.sql.explain-new differ Files query74.sql.explain-old and query74.sql.explain-new differ Files query83.sql.explain-old and query83.sql.explain-new differ Files query84.sql.explain-old and query84.sql.explain-new differ Files query92.sql.explain-old and query92.sql.explain-new differ Files query95.sql.explain-old and query95.sql.explain-new differ -Todd > > On Thu, Jan 31, 2019 at 7:24 AM Philip Zeyliger > wrote: > > > (If folks are interested in a refresher, > > > > > https://impala.apache.org/docs/build/html/topics/impala_runtime_filtering.html > > is useful.) > > > > I stared at that code for a bit, and I agree with you that it's > plausible. > > I'm also confused by the "bottom-up" comment of generateFilters(): it > seems > > like we walk the plan depth first, and the assignment happens in that > > depth-first walk, before all the runtime filters are generated. > > > > A concern is making sure that there are no outer joins impacting > > correctness. > > > > -- Philip > > > > On Wed, Jan 30, 2019 at 9:36 PM Todd Lipcon wrote: > > > > > Hey folks, > > > > > > I've been digging into a couple of TPCDS queries that are unexpectedly > > slow > > > and uncovered what seems to be some surprising behavior in the planner > > > concerning runtime filter assignment. 
Consider the following query in > the > > > TPCDS schema: > > > > > > select straight_join * > > > from promotion a > > > join item b on a.p_item_sk = b.i_item_sk > > > join [shuffle] store_sales c on b.i_item_sk = c.ss_item_sk > > > where b.i_color = 'xxx' > > > > > > This generates a plan that looks like this: > > > http://people.apache.org/~todd/plan.png > > > > > > Or in text form: > > > > > > > > > +-+ > > > | Explain String > > > | > > > > > > > > > +-+ > > > | Max Per-Host Resource Reservation: Memory=85.88MB > > >| > > > | Per-Host Resource Estimates: Memory=298.14GB > > > | > > > | > > >| > > > | F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 > > >| > > > | | Per-Host Resources: mem-estimate=0B mem-reservation=0B > > >| > > > | PLAN-ROOT SINK > > > | > > > | | mem-estimate=0B mem-reservation=0B > > >| > > > | | > > >| > > > | 08:EXCHANGE [UNPARTITIONED] > > >| > > > | | mem-estimate=0B mem-reservation=0B > > >| > > > | | tuple-ids=0,1,2 row-size=911B cardinality=14201171 > > >| > > > | | > > >| > > > | F03:PLAN FRAGMENT [HASH(b.i_item_sk)] hosts=1 instances=1 > > >| > > > | Per-Host Resources: mem-estimate=295.06GB mem-reservation=50.0
Runtime filter publishing from a join node to its cousin
rs: RF000[bloom] -> b.i_item_sk | | | stored statistics: | | | table: rows=30 size=28.28MB | | | columns: all | | | extrapolated-rows=disabled | | | parquet statistics predicates: b.i_color = 'xxx' | | | parquet dictionary predicates: b.i_color = 'xxx' | | | mem-estimate=880.00MB mem-reservation=0B | | | tuple-ids=1 row-size=496B cardinality=3226 | | | | | 00:SCAN HDFS [tpcds_1000_parquet.promotion a, RANDOM] | |partitions=1/1 files=1 size=85.50KB | |runtime filters: RF000[bloom] -> a.p_item_sk, RF002[bloom] -> a.p_item_sk| |stored statistics: | | table: rows=1500 size=85.50KB | | columns: all | |extrapolated-rows=disabled | |mem-estimate=304.00MB mem-reservation=0B | |tuple-ids=0 row-size=315B cardinality=1500 | +-+ Here because of the equi-joins, we have a slot equivalence for all of the 'item_id' columns. So, it seems it would be valid to take the runtime filter RF002 (generated from b.sk_item_id at HASH JOIN node 03) and apply it at the scan node 02 for store_sales. Notably, doing so is extremely beneficial in this manufactured query, since the scan of 'item' takes only a second or so to determine that in fact there are no items with color 'xxx'. This could completely short circuit the scan 02 of the very large store_sales table. Obviously this query is canned, but I'm seeing some cases in TPCDS (eg Q64) where a much more complex variant of this plan ends up being generated and failing to utilize an available runtime filter. I spent some time digging in the code, and I think the faulty logic might be here in RuntimeFilterGenerator.java: generateFilters(ctx, root.getChild(0)); // Finalize every runtime filter of that join. This is to ensure that we don't // assign a filter to a scan node from the right subtree of joinNode or ancestor // join nodes in case we don't find a destination node in the left subtree. for (RuntimeFilter runtimeFilter: filters) finalizeRuntimeFilter(runtimeFilter); I'm not quite sure how to understand the comment, but it seems this is preventing the runtime filter at the hash join from being sent to any node which is either on its right descendent or higher up the tree. In other words, it's limiting the runtime filter to _only_ be applied at nodes in its left subtree. Per my above example query, I think this is unnecessarily restrictive. Based on my understanding of Impala scheduling/parallelism, all of the build sides of joins start running in parallel from bottom up, and it's always possible for a build to complete somewhere deep in the tree before a build completes higher up the tree, in which case sending the RF to the hash join's "cousin" is beneficial. I think the only necessary restriction is that a RF should not be sent from a hash join node to any descendent of its right child. Keep in mind I'm very new to the Impala planner code and particularly to the runtime filter portion thereof, so I may have missed something. But does the above sound like a plausible bug/missed optimization? -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Adding profile info from Frontend
I have a WIP patch up here in case anyone's interested in taking an early look: https://gerrit.cloudera.org/c/11388/ -Todd On Tue, Sep 4, 2018 at 5:02 PM, Todd Lipcon wrote: > On Tue, Sep 4, 2018 at 4:46 PM, Andrew Sherman > wrote: > >> Hi Todd, >> >> I'm going to put the patch up for review any minute. After I finish first >> time setup on https://gerrit.cloudera.org > > > OK, I guess that settles whose patch goes in first, then, because I > haven't written a line of code yet :-D > > -Todd > > >> >> >> -Andrew >> >> On Tue, Sep 4, 2018 at 4:43 PM Todd Lipcon wrote: >> >> > On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman >> > wrote: >> > >> > > Hi Todd, >> > > >> > > I am making a simple fix for >> > > IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add >> > > missing >> > > Query Compilation section to profiles. >> > > which would add the Timeline to all responses to createExecRequest. >> > > >> > > It sounds like your change is more deep. If you go ahead with your >> change >> > > it sounds like my change might be redundant. >> > > >> > >> > I'm not sure if it's totally redundant. Do you have a WIP patch already? >> > I'm not sure why the existing timeline doesn't show up in all statement >> > types, so maybe some changes are needed related to that, and then those >> > changes will still be necessary when exposing a full profile node? >> > >> > Agreed we're likely to conflict, at the least, though. Do you have an >> > estimate of when your patch will be up for review so we can coordinate >> > which one goes in first? >> > >> > -Todd >> > >> > > >> > > >> > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon wrote: >> > > >> > > > Hey folks, >> > > > >> > > > I'm working on a patch to add some more diagnostics from the >> planning >> > > > process into query profiles. >> > > > >> > > > Currently, only the planning "Timeline" is reported back as part of >> the >> > > > response to createExecRequest. As part of the fetch-on-demand >> catalog >> > > work >> > > > I'd like to start exposing various counters such as cache hit/miss >> > > counts, >> > > > time spent on remote calls to the catalog, etc. Even in the existing >> > code >> > > > paths, I can see some similar metrics being useful. >> > > > >> > > > My current thinking is to remove the 'timeline' (TEventSequence) >> field >> > > > from TExecRequest and replace it with a full TRuntimeProfileNode. >> I'd >> > > then >> > > > add some capability in the Java side to fill in counters, etc, in >> this >> > > > structure. >> > > > >> > > > Any concerns with this approach before I go down this path? Are >> there >> > any >> > > > compatibility guarantees I need to uphold with the profile output of >> > > > queries? >> > > > >> > > > -Todd >> > > > -- >> > > > Todd Lipcon >> > > > Software Engineer, Cloudera >> > > > >> > > >> > >> > >> > >> > -- >> > Todd Lipcon >> > Software Engineer, Cloudera >> > >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
Re: Adding profile info from Frontend
On Tue, Sep 4, 2018 at 4:46 PM, Andrew Sherman wrote: > Hi Todd, > > I'm going to put the patch up for review any minute. After I finish first > time setup on https://gerrit.cloudera.org OK, I guess that settles whose patch goes in first, then, because I haven't written a line of code yet :-D -Todd > > > -Andrew > > On Tue, Sep 4, 2018 at 4:43 PM Todd Lipcon wrote: > > > On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman > > wrote: > > > > > Hi Todd, > > > > > > I am making a simple fix for > > > IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add > > > missing > > > Query Compilation section to profiles. > > > which would add the Timeline to all responses to createExecRequest. > > > > > > It sounds like your change is more deep. If you go ahead with your > change > > > it sounds like my change might be redundant. > > > > > > > I'm not sure if it's totally redundant. Do you have a WIP patch already? > > I'm not sure why the existing timeline doesn't show up in all statement > > types, so maybe some changes are needed related to that, and then those > > changes will still be necessary when exposing a full profile node? > > > > Agreed we're likely to conflict, at the least, though. Do you have an > > estimate of when your patch will be up for review so we can coordinate > > which one goes in first? > > > > -Todd > > > > > > > > > > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon wrote: > > > > > > > Hey folks, > > > > > > > > I'm working on a patch to add some more diagnostics from the planning > > > > process into query profiles. > > > > > > > > Currently, only the planning "Timeline" is reported back as part of > the > > > > response to createExecRequest. As part of the fetch-on-demand catalog > > > work > > > > I'd like to start exposing various counters such as cache hit/miss > > > counts, > > > > time spent on remote calls to the catalog, etc. Even in the existing > > code > > > > paths, I can see some similar metrics being useful. > > > > > > > > My current thinking is to remove the 'timeline' (TEventSequence) > field > > > > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd > > > then > > > > add some capability in the Java side to fill in counters, etc, in > this > > > > structure. > > > > > > > > Any concerns with this approach before I go down this path? Are there > > any > > > > compatibility guarantees I need to uphold with the profile output of > > > > queries? > > > > > > > > -Todd > > > > -- > > > > Todd Lipcon > > > > Software Engineer, Cloudera > > > > > > > > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Adding profile info from Frontend
On Tue, Sep 4, 2018 at 4:28 PM, Andrew Sherman wrote: > Hi Todd, > > I am making a simple fix for > IMPALA-6568 <https://issues.apache.org/jira/browse/IMPALA-6568> add > missing > Query Compilation section to profiles. > which would add the Timeline to all responses to createExecRequest. > > It sounds like your change is more deep. If you go ahead with your change > it sounds like my change might be redundant. > I'm not sure if it's totally redundant. Do you have a WIP patch already? I'm not sure why the existing timeline doesn't show up in all statement types, so maybe some changes are needed related to that, and then those changes will still be necessary when exposing a full profile node? Agreed we're likely to conflict, at the least, though. Do you have an estimate of when your patch will be up for review so we can coordinate which one goes in first? -Todd > > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon wrote: > > > Hey folks, > > > > I'm working on a patch to add some more diagnostics from the planning > > process into query profiles. > > > > Currently, only the planning "Timeline" is reported back as part of the > > response to createExecRequest. As part of the fetch-on-demand catalog > work > > I'd like to start exposing various counters such as cache hit/miss > counts, > > time spent on remote calls to the catalog, etc. Even in the existing code > > paths, I can see some similar metrics being useful. > > > > My current thinking is to remove the 'timeline' (TEventSequence) field > > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd > then > > add some capability in the Java side to fill in counters, etc, in this > > structure. > > > > Any concerns with this approach before I go down this path? Are there any > > compatibility guarantees I need to uphold with the profile output of > > queries? > > > > -Todd > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Adding profile info from Frontend
On Tue, Sep 4, 2018 at 4:19 PM, Philip Zeyliger wrote: > I'm certainly aware of tools that read TEventSequence and expect it to have > certain structures. If we change the profiles significantly, we should > insert a version sigil in there so that a client could at least figure out > that they're looking at something new. > OK, I'll see if I can keep the existing timeline in the same spot in the resulting profile (potentially with additional fields in that same profile node) > > A bigger modeling conversation is the use of string keys everywhere. For > the timeline, for example, it's a list of (description, time) pairs, with > the descriptions being strings. An alternative would have been to put the > strings in Thrift itself. The latter, I think, makes tooling easier to > write (since you're calling "get_query_started_time()" instead of "for > description, time in timelines: if description == 'query started': return > time"). I recognize working with Thrift is kind of a pain, but I want to > throw the thought out there if we're adding things to profiles. > Agreed -- or a set of well-known fields defined as a thrift enum instead of relying on strings.. but I don't want to bite off too much in one go here :) > > -- Philip > > On Tue, Sep 4, 2018 at 4:02 PM Todd Lipcon wrote: > > > Hey folks, > > > > I'm working on a patch to add some more diagnostics from the planning > > process into query profiles. > > > > Currently, only the planning "Timeline" is reported back as part of the > > response to createExecRequest. As part of the fetch-on-demand catalog > work > > I'd like to start exposing various counters such as cache hit/miss > counts, > > time spent on remote calls to the catalog, etc. Even in the existing code > > paths, I can see some similar metrics being useful. > > > > My current thinking is to remove the 'timeline' (TEventSequence) field > > from TExecRequest and replace it with a full TRuntimeProfileNode. I'd > then > > add some capability in the Java side to fill in counters, etc, in this > > structure. > > > > Any concerns with this approach before I go down this path? Are there any > > compatibility guarantees I need to uphold with the profile output of > > queries? > > > > -Todd > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Adding profile info from Frontend
Hey folks, I'm working on a patch to add some more diagnostics from the planning process into query profiles. Currently, only the planning "Timeline" is reported back as part of the response to createExecRequest. As part of the fetch-on-demand catalog work I'd like to start exposing various counters such as cache hit/miss counts, time spent on remote calls to the catalog, etc. Even in the existing code paths, I can see some similar metrics being useful. My current thinking is to remove the 'timeline' (TEventSequence) field from TExecRequest and replace it with a full TRuntimeProfileNode. I'd then add some capability in the Java side to fill in counters, etc, in this structure. Any concerns with this approach before I go down this path? Are there any compatibility guarantees I need to uphold with the profile output of queries? -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Issues with query/fragment lifecycle on trunk?
On Thu, Aug 30, 2018 at 2:48 PM, Todd Lipcon wrote: > On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar < > pooja.nilange...@cloudera.com> wrote: > >> Hi Todd, >> >> I believe you are right. There are a couple of other race conditions in >> the >> HdfsScanNode which have resulted in broken-builds in the last couple of >> weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for >> IMPALA-7335, Tim and Dan suggested that we need to cleanup the state >> invariants of the HdfsScanNode to find and fix other such races. I am >> going >> to create a JIRA to track it, I could add this issue to the list of races >> to be fixed. If this issue occurs more frequently, we could add a >> temporary fix for now. Let me know what you'd suggest. >> > > Makes sense. I'm currently trying to put together a simple test and fix > that targets this race in particular, but definitely agree that the > interactions between the scanner threads and scan node are hard to reason > about and could be improved more generally. > > For now I'm trying to use a DebugAction to inject probabilistic failures > into the soft memory limit checks to see if I can reproduce it more easily. > Put this up here with a fix: https://gerrit.cloudera.org/c/11369/ -Todd > > -Todd > > >> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon wrote: >> >> > I spent another 45 minutes looking at the core and think I figured this >> > out. I found that for the stuck scanner, both of its scanner threads and >> > run and exited, but for some reason it was never marked as "done". >> > >> > I think this might be a regression due to IMPALA-7096 >> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the >> > following code: >> > >> > // Stop extra threads if we're over a soft limit in order to free up >> > memory. >> > if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) >> { >> > break; >> > } >> > >> > // Done with range and it completed successfully >> > if (progress_.done()) { >> > // All ranges are finished. Indicate we are done. >> > SetDone(); >> > break; >> > } >> > >> > if (scan_range == nullptr && num_unqueued_files == 0) { >> > unique_lock l(lock_); >> > // All ranges have been queued and DiskIoMgr has no more new >> ranges >> > for this scan >> > // node to process. This means that every range is either done or >> > being processed by >> > // another thread. >> > all_ranges_started_ = true; >> > break; >> > } >> > } >> > >> > What if we have the following scenario: >> > >> > T1) grab scan range 1 and start processing >> > >> > T2) grab scan range 2 and start processing >> > >> > T1) finish scan range 1 and see that 'progress_' is not done() >> > T1) loop around, get no scan range (there are no more), so set >> > all_ranges_satrted_ and break >> > T1) thread exits >> > >> > T2) finish scan range 2 >> > T2) happen to hit a soft memory limit error due to pressure from other >> exec >> > nodes, etc. Since we aren't the first thread, we break. (even though the >> > first thread is no longer running) >> > T2) thread exits >> > >> > Note that no one got to the point of calling SetDone() because we break >> due >> > to the memory limit error _before_ checking progress_.Done(). >> > >> > Thus, the query will hang forever. >> > >> > It seems to me that the order of these conditions needs to be >> rejiggered: >> > the breaking due to memory limit should be the last thing in the loop so >> > that, if we were the thread to finish the last scan range, we properly >> > SetDone before exiting. 
>> > >> > Does that makes sense? If so, I'll file a JIRA. Suggestions also >> welcome on >> > how to write a test for this (do we have an easy facility to inject >> > probabilistic mem-limit failures in that code path?) >> > >> > -Todd >> > >> > >> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho wrote: >> > >> > > FWIW, please see the discussion at >> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck >> IO" >> > may &
Re: Issues with query/fragment lifecycle on trunk?
On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar < pooja.nilange...@cloudera.com> wrote: > Hi Todd, > > I believe you are right. There are a couple of other race conditions in the > HdfsScanNode which have resulted in broken-builds in the last couple of > weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for > IMPALA-7335, Tim and Dan suggested that we need to cleanup the state > invariants of the HdfsScanNode to find and fix other such races. I am going > to create a JIRA to track it, I could add this issue to the list of races > to be fixed. If this issue occurs more frequently, we could add a > temporary fix for now. Let me know what you'd suggest. > Makes sense. I'm currently trying to put together a simple test and fix that targets this race in particular, but definitely agree that the interactions between the scanner threads and scan node are hard to reason about and could be improved more generally. For now I'm trying to use a DebugAction to inject probabilistic failures into the soft memory limit checks to see if I can reproduce it more easily. -Todd > On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon wrote: > > > I spent another 45 minutes looking at the core and think I figured this > > out. I found that for the stuck scanner, both of its scanner threads and > > run and exited, but for some reason it was never marked as "done". > > > > I think this might be a regression due to IMPALA-7096 > > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the > > following code: > > > > // Stop extra threads if we're over a soft limit in order to free up > > memory. > > if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) > { > > break; > > } > > > > // Done with range and it completed successfully > > if (progress_.done()) { > > // All ranges are finished. Indicate we are done. > > SetDone(); > > break; > > } > > > > if (scan_range == nullptr && num_unqueued_files == 0) { > > unique_lock l(lock_); > > // All ranges have been queued and DiskIoMgr has no more new ranges > > for this scan > > // node to process. This means that every range is either done or > > being processed by > > // another thread. > > all_ranges_started_ = true; > > break; > > } > > } > > > > What if we have the following scenario: > > > > T1) grab scan range 1 and start processing > > > > T2) grab scan range 2 and start processing > > > > T1) finish scan range 1 and see that 'progress_' is not done() > > T1) loop around, get no scan range (there are no more), so set > > all_ranges_satrted_ and break > > T1) thread exits > > > > T2) finish scan range 2 > > T2) happen to hit a soft memory limit error due to pressure from other > exec > > nodes, etc. Since we aren't the first thread, we break. (even though the > > first thread is no longer running) > > T2) thread exits > > > > Note that no one got to the point of calling SetDone() because we break > due > > to the memory limit error _before_ checking progress_.Done(). > > > > Thus, the query will hang forever. > > > > It seems to me that the order of these conditions needs to be rejiggered: > > the breaking due to memory limit should be the last thing in the loop so > > that, if we were the thread to finish the last scan range, we properly > > SetDone before exiting. > > > > Does that makes sense? If so, I'll file a JIRA. Suggestions also welcome > on > > how to write a test for this (do we have an easy facility to inject > > probabilistic mem-limit failures in that code path?) 
> > > > -Todd > > > > > > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho wrote: > > > > > FWIW, please see the discussion at > > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" > > may > > > render cancellation in HDFS scan node ineffective. We have seen users > > > reported similar backtraces in the past due to hung RPCs to HDFS name > > node. > > > Not saying this is the necessarily the case here. > > > > > > A couple of things which may be interesting to look at: > > > - is there any scan node in the profile which doesn't finish any > assigned > > > scan ranges ? > > > - if you happen to have a core, it may help to inspect the stack traces > > of > > > the scanner threads and the disk
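Concretely, the kind of probabilistic injection described above could look something like the standalone sketch below, gated behind a debug-only option so it never fires in release builds. The helper name and the 10% rate are invented purely for illustration -- this is not the existing DebugAction API, just the shape of check that could be wired into the soft-limit branch:

    #include <iostream>
    #include <random>

    // Returns true with probability 'p', simulating the soft mem-limit check
    // spuriously reporting "limit exceeded". (Invented name, not real Impala code.)
    static bool MaybeInjectSoftLimitFailure(double p) {
      static thread_local std::mt19937 gen{std::random_device{}()};
      return std::bernoulli_distribution(p)(gen);
    }

    int main() {
      int injected = 0;
      for (int i = 0; i < 1000; ++i) {
        if (MaybeInjectSoftLimitFailure(0.1)) ++injected;  // ~10% of checks "fail"
      }
      std::cout << "injected " << injected << " of 1000 soft-limit checks" << std::endl;
      return 0;
    }

Hooked into the !first_thread branch of the scanner-thread loop, something along these lines should make the hang reproducible in a targeted test rather than only under rare memory pressure.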
Re: Issues with query/fragment lifecycle on trunk?
I spent another 45 minutes looking at the core and think I figured this out. I found that for the stuck scanner, both of its scanner threads had run and exited, but for some reason it was never marked as "done". I think this might be a regression due to IMPALA-7096 (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the following code: // Stop extra threads if we're over a soft limit in order to free up memory. if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) { break; } // Done with range and it completed successfully if (progress_.done()) { // All ranges are finished. Indicate we are done. SetDone(); break; } if (scan_range == nullptr && num_unqueued_files == 0) { unique_lock l(lock_); // All ranges have been queued and DiskIoMgr has no more new ranges for this scan // node to process. This means that every range is either done or being processed by // another thread. all_ranges_started_ = true; break; } } What if we have the following scenario: T1) grab scan range 1 and start processing T2) grab scan range 2 and start processing T1) finish scan range 1 and see that 'progress_' is not done() T1) loop around, get no scan range (there are no more), so set all_ranges_started_ and break T1) thread exits T2) finish scan range 2 T2) happen to hit a soft memory limit error due to pressure from other exec nodes, etc. Since we aren't the first thread, we break. (even though the first thread is no longer running) T2) thread exits Note that no one got to the point of calling SetDone() because we break due to the memory limit error _before_ checking progress_.Done(). Thus, the query will hang forever. It seems to me that the order of these conditions needs to be rejiggered: the breaking due to memory limit should be the last thing in the loop so that, if we were the thread to finish the last scan range, we properly SetDone before exiting. Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on how to write a test for this (do we have an easy facility to inject probabilistic mem-limit failures in that code path?) -Todd On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho wrote: > FWIW, please see the discussion at > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may > render cancellation in HDFS scan node ineffective. We have seen users > report similar backtraces in the past due to hung RPCs to HDFS name node. > Not saying this is necessarily the case here. > > A couple of things which may be interesting to look at: > - is there any scan node in the profile which doesn't finish any assigned > scan ranges ? > - if you happen to have a core, it may help to inspect the stack traces of > the scanner threads and the disk io mgr threads to understand their states. > > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon wrote: > > > Hey folks, > > > > I ran into some issues with a local core test run last night. This is on > my > > own branch with some uncommitted work, but I haven't touched the backend > > and these issues seem to be backend-focused. > > > > Initially, I noticed the problem because a test run that I started last > > night was still "stuck" this morning. I had three queries which had been > > running for 10 hours and not making progress. The 'fragments' page for > the > > query showed that one of the backends had not reported to the coordinator > > for many hours. In attempting to debug this, I managed to get the > > StateStore to declare that node dead, and those queries eventually were > > cancelled due to that.
> > > > I resumed the node that I had been debugging with gdb, and it was > declared > > live again. I didn't restart the process, though, which might have led to > > further issues: > > > > Next, I also noticed that /queries on my coordinator node showed "Number > of > > running queries with fragments on this host" at 100+ on all three nodes, > > even though no queries were running anymore. These numbers are stable. I > > can continue to issue new queries and they complete normally. While a > > queries are running, the count of fragments goes up appropriate, and when > > the query finishes, it goes back down. But, the "base" numbers are stuck > at > > {108, 391, 402} with nothing running. > > > > I also found that the node that had had the original problems has three > > stuck fragments, all waiting on HdfsScanNode::GetNext: > > https://gist.github.com/57a99bf4c00f1575810af924013c259d > > Looking in the logs, I see continuous spew that it's trying to cancel th
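To make the proposed reordering concrete, a rough sketch (untested, using the same member names as the loop quoted above) with the soft-limit bailout moved to the end of the loop body:

      // Done with range and it completed successfully
      if (progress_.done()) {
        // All ranges are finished. Indicate we are done.
        SetDone();
        break;
      }

      if (scan_range == nullptr && num_unqueued_files == 0) {
        unique_lock<mutex> l(lock_);
        // All ranges have been queued and DiskIoMgr has no more new ranges for this
        // scan node to process.
        all_ranges_started_ = true;
        break;
      }

      // Only now that we know this thread did not just finish the final range is it
      // safe to stop extra threads in order to free up memory under a soft limit.
      if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) break;

That way the thread that completes the last range always reaches SetDone() before it can bail out.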
Issues with query/fragment lifecycle on trunk?
Hey folks, I ran into some issues with a local core test run last night. This is on my own branch with some uncommitted work, but I haven't touched the backend and these issues seem to be backend-focused. Initially, I noticed the problem because a test run that I started last night was still "stuck" this morning. I had three queries which had been running for 10 hours and not making progress. The 'fragments' page for the query showed that one of the backends had not reported to the coordinator for many hours. In attempting to debug this, I managed to get the StateStore to declare that node dead, and those queries eventually were cancelled due to that. I resumed the node that I had been debugging with gdb, and it was declared live again. I didn't restart the process, though, which might have led to further issues: Next, I also noticed that /queries on my coordinator node showed "Number of running queries with fragments on this host" at 100+ on all three nodes, even though no queries were running anymore. These numbers are stable. I can continue to issue new queries and they complete normally. While queries are running, the count of fragments goes up appropriately, and when the query finishes, it goes back down. But, the "base" numbers are stuck at {108, 391, 402} with nothing running. I also found that the node that had had the original problems has three stuck fragments, all waiting on HdfsScanNode::GetNext: https://gist.github.com/57a99bf4c00f1575810af924013c259d Looking in the logs, I see continuous spew that it's trying to cancel the fragment instances for this query, but apparently the cancellation is not working. I talked to Michael Ho about this and it sounds like this is a known issue with cancellation. So, there are still two mysteries: - why did it get stuck in the first place? - why are my "number of running queries" counters stuck at non-zero values? Does anything above ring a bell for anyone? -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Breaking change for query hints?
I'm curious: can you describe the view using Hive to see what the stored query consists of? On Thu, Aug 23, 2018, 7:44 PM Quanlong Huang wrote: > Hi all, > > After we upgrade Impala from 2.5 to 2.12, we found some queries on views > failed. The views contain a query hint which I think is the cause. I can't > find anything about such kind of incompatibility was mentioned in > > https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html > or > > https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_incompatible_changes.html > > The error can be reproduced by > > [impala-shell:21000] > use test; > [impala-shell:21000] > create view view_2_12 as select /* > +`queryId=abc` */ * from video_parq; > Query: create view view_2_12 as select /* +`queryId=abc` */ * from > video_parq > ++ > | summary| > ++ > | View has been created. | > ++ > WARNINGS: PLAN hint not recognized: queryId=abc > > Fetched 1 row(s) in 24.00s > [impala-shell:21000] > desc formatted view_2_12; > Query: describe formatted view_2_12 > ERROR: AnalysisException: Failed to parse view-definition statement of > view: test.view_2_12 > CAUSED BY: TableLoadingException: Failed to parse view-definition > statement of view: test.view_2_12 > > > Since this's an emergency for us, I post an email here for help. I'm also > looking into the history of the cup file and looking for a hotfix for this. > Thanks if anyone can give some pointers! > > Thanks, > Quanlong >
Re: Improving latency of catalog update propagation?
One more parting thought: why don't we just call 'GetCatalogDelta()' directly from the catalog callback in order to do a direct handoff, instead of storing them in this 'pending' struct? Given the statestore uses a dedicated thread per subscriber (right?) it seems like it would be fine for the update callback to take a long time, no? -Todd On Tue, Aug 21, 2018 at 11:09 AM, Todd Lipcon wrote: > Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead > of Thrift we'll alleviate some of the scalability issues and maybe we can > look into increasing frequency or doing a "push" to the statestore, etc. I > probably won't work on this in the near term to avoid complicating the > ongoing changes with catalog. > > -Todd > > On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong > wrote: > >> This is somewhat relevant for admission control too - I had thought about >> some of these issues in that context, because reducing the latency of >> admission controls state propagation helps avoid overadmission but having >> a >> very low statestore frequency is very inefficient and doesn't scale well >> to >> larger clusters. >> >> For the catalog updates I agree we could do something with long polls >> since >> it's a single producer so that the "idle" state of the system has a thread >> sitting in the update callback on catalogd waiting for an update. >> >> I'd also thought at one point about allowing subscribers to notify the >> statestore that they had something to add to the topic. That could be >> treated as a hint to the statestore to schedule the subscriber update >> sooner. This would also work for admission control since coordinators >> could >> notify the statestore when the first query was admitted after the previous >> statestore update. >> >> On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote: >> >> > Hey folks, >> > >> > In my recent forays into the catalog->statestore->impalad metadata >> > propagation code base, I noticed that the latency of any update is >> > typically between 2-4 seconds with the standard 2-second statestore >> polling >> > interval. That's because the code currently works as follows: >> > >> > 1. in the steady state with no recent metadata changes, the catalogd's >> > state is: >> > -- topic_updates_ready_ = true >> > -- pending_topic_updates_ = empty >> > >> > 2. some metadata change happens, which modifies the version numbers in >> the >> > Java catalog but doesn't modify any of the C++ side state >> > >> > 3. the next statestore poll happens due to the normal interval >> expiring. On >> > average, this will take *1/2 the polling interval* >> > -- this sees that pending_topic_updates_ is empty, so returns no >> results. >> > -- it sets topic_updates_ready_ = false and triggers the "gather" thread >> > >> > 4. the "gather" thread wakes up and gathers updates, filling in >> > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true >> > (typically subsecond in smallish catalogs, so this happens before the >> next >> > poll) >> > >> > 5. wait *another full statestore polling interval* (2 seconds) after >> step >> > #3 above, at which point we deliver the metadata update to the >> statestore >> > >> > 6. wait on average* 1/2 the polling interval* until any particular >> impalad >> > gets the update from #4 >> > >> > So. in the absolute best case, we wait one full polling interval (2 >> > seconds), and in the worst case we wait two polling intervals (4 >> seconds). >> > >> > Has anyone looked into optimizing this at all? 
It seems like we could >> have >> > metadata changes trigger an immediate "collection" into the C++ side, >> and >> > have the statestore update callback wait ("long poll" style) for an >> update >> > rather than skip if there is nothing available. >> > >> > -Todd >> > -- >> > Todd Lipcon >> > Software Engineer, Cloudera >> > >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
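For illustration, the handoff could be as simple as a condition variable that the statestore update callback waits on. This is a toy sketch with invented names -- not the actual catalog or statestore code -- and it assumes the statestore really does dedicate a thread per subscriber, so blocking the callback for a while is harmless:

    #include <chrono>
    #include <condition_variable>
    #include <deque>
    #include <iterator>
    #include <mutex>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy stand-in for a gathered catalog topic delta.
    struct TopicDelta { std::string payload; };

    class PendingUpdates {
     public:
      // Called by the "gather" path as soon as a metadata change is collected into
      // the C++ side.
      void Offer(TopicDelta delta) {
        std::lock_guard<std::mutex> l(lock_);
        pending_.push_back(std::move(delta));
        cv_.notify_one();
      }

      // Called from the statestore update callback: wait ("long poll" style) up to
      // 'timeout' for something to send instead of returning empty-handed.
      std::vector<TopicDelta> WaitForUpdates(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> l(lock_);
        cv_.wait_for(l, timeout, [this] { return !pending_.empty(); });
        std::vector<TopicDelta> result(std::make_move_iterator(pending_.begin()),
                                       std::make_move_iterator(pending_.end()));
        pending_.clear();
        return result;
      }

     private:
      std::mutex lock_;
      std::condition_variable cv_;
      std::deque<TopicDelta> pending_;
    };

With that shape, a metadata change wakes an already-waiting subscriber thread immediately, so the extra half-to-full polling interval between "gather" and "deliver" mostly disappears.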
Re: Improving latency of catalog update propagation?
Thanks, Tim. I'm guessing once we switch over these RPCs to KRPC instead of Thrift we'll alleviate some of the scalability issues and maybe we can look into increasing frequency or doing a "push" to the statestore, etc. I probably won't work on this in the near term to avoid complicating the ongoing changes with catalog. -Todd On Tue, Aug 21, 2018 at 10:22 AM, Tim Armstrong wrote: > This is somewhat relevant for admission control too - I had thought about > some of these issues in that context, because reducing the latency of > admission controls state propagation helps avoid overadmission but having a > very low statestore frequency is very inefficient and doesn't scale well to > larger clusters. > > For the catalog updates I agree we could do something with long polls since > it's a single producer so that the "idle" state of the system has a thread > sitting in the update callback on catalogd waiting for an update. > > I'd also thought at one point about allowing subscribers to notify the > statestore that they had something to add to the topic. That could be > treated as a hint to the statestore to schedule the subscriber update > sooner. This would also work for admission control since coordinators could > notify the statestore when the first query was admitted after the previous > statestore update. > > On Tue, Aug 21, 2018 at 9:41 AM, Todd Lipcon wrote: > > > Hey folks, > > > > In my recent forays into the catalog->statestore->impalad metadata > > propagation code base, I noticed that the latency of any update is > > typically between 2-4 seconds with the standard 2-second statestore > polling > > interval. That's because the code currently works as follows: > > > > 1. in the steady state with no recent metadata changes, the catalogd's > > state is: > > -- topic_updates_ready_ = true > > -- pending_topic_updates_ = empty > > > > 2. some metadata change happens, which modifies the version numbers in > the > > Java catalog but doesn't modify any of the C++ side state > > > > 3. the next statestore poll happens due to the normal interval expiring. > On > > average, this will take *1/2 the polling interval* > > -- this sees that pending_topic_updates_ is empty, so returns no results. > > -- it sets topic_updates_ready_ = false and triggers the "gather" thread > > > > 4. the "gather" thread wakes up and gathers updates, filling in > > 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true > > (typically subsecond in smallish catalogs, so this happens before the > next > > poll) > > > > 5. wait *another full statestore polling interval* (2 seconds) after step > > #3 above, at which point we deliver the metadata update to the statestore > > > > 6. wait on average* 1/2 the polling interval* until any particular > impalad > > gets the update from #4 > > > > So. in the absolute best case, we wait one full polling interval (2 > > seconds), and in the worst case we wait two polling intervals (4 > seconds). > > > > Has anyone looked into optimizing this at all? It seems like we could > have > > metadata changes trigger an immediate "collection" into the C++ side, and > > have the statestore update callback wait ("long poll" style) for an > update > > rather than skip if there is nothing available. > > > > -Todd > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Improving latency of catalog update propagation?
Hey folks, In my recent forays into the catalog->statestore->impalad metadata propagation code base, I noticed that the latency of any update is typically between 2-4 seconds with the standard 2-second statestore polling interval. That's because the code currently works as follows: 1. in the steady state with no recent metadata changes, the catalogd's state is: -- topic_updates_ready_ = true -- pending_topic_updates_ = empty 2. some metadata change happens, which modifies the version numbers in the Java catalog but doesn't modify any of the C++ side state 3. the next statestore poll happens due to the normal interval expiring. On average, this will take *1/2 the polling interval* -- this sees that pending_topic_updates_ is empty, so returns no results. -- it sets topic_updates_ready_ = false and triggers the "gather" thread 4. the "gather" thread wakes up and gathers updates, filling in 'pending_topic_updates_' and setting 'topic_updates_ready_' back to true (typically subsecond in smallish catalogs, so this happens before the next poll) 5. wait *another full statestore polling interval* (2 seconds) after step #3 above, at which point we deliver the metadata update to the statestore 6. wait on average* 1/2 the polling interval* until any particular impalad gets the update from #4 So. in the absolute best case, we wait one full polling interval (2 seconds), and in the worst case we wait two polling intervals (4 seconds). Has anyone looked into optimizing this at all? It seems like we could have metadata changes trigger an immediate "collection" into the C++ side, and have the statestore update callback wait ("long poll" style) for an update rather than skip if there is nothing available. -Todd -- Todd Lipcon Software Engineer, Cloudera
java target version in FE pom?
Hey folks, I noticed that whenever I import the FE into Eclipse it ends up defaulting to Java 1.7, and then gets errors trying to run the tests: F0820 15:21:17.184092 59731 frontend.cc:109] java.lang.UnsupportedClassVersionError: org/apache/hadoop/conf/Configuration : Unsupported major.minor version 52.0 It seems this is because the pom configures the compiler plugin as follows:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.3</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>

Given that Hadoop is now being built for 1.8, should we change our own target and source versions accordingly? -Todd -- Todd Lipcon Software Engineer, Cloudera
Update on catalog changes
t ID will never be reused with different data. On a REFRESH or other partition metadata change, a new ID can be assigned to the partition. This allows us to safely cache partitions (the largest bit of metadata) with long TTLs, and to do very cheap updates after any update affects only a subset of partitions. In terms of the original "fetch from source systems" design, I do continue to think it has promise longer-term as it simplifies the overall architecture and can help with cross-system metadata consistency. But we can treat fetch-from-catalogd as a nice interim that should bring most of the performance and scalability benefits to users sooner and with less risk. I'll plan to update the original design document to reflect this in coming days. Thanks -Todd -- Todd Lipcon Software Engineer, Cloudera
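Roughly the shape of cache this invariant enables (a sketch only; the type and member names are invented for illustration, and TTL/eviction is omitted): because an ID is never reused for different data, a hit can be trusted for as long as we care to keep it, and an update only has to insert entries for the IDs that changed.

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Immutable snapshot of one partition's metadata (invented for illustration).
    struct PartitionMeta {
      int64_t id;                       // never reused with different contents
      std::string location;
      std::vector<std::string> files;
    };

    class PartitionCache {
     public:
      // A hit is always valid: a given ID always refers to the same bytes.
      std::shared_ptr<const PartitionMeta> Lookup(int64_t id) const {
        auto it = cache_.find(id);
        return it == cache_.end() ? nullptr : it->second;
      }

      // A REFRESH assigns new IDs to the partitions it changed, so an update is just
      // an insert of the new entries; unchanged partitions keep their cached values.
      void Insert(std::shared_ptr<const PartitionMeta> meta) {
        int64_t id = meta->id;
        cache_[id] = std::move(meta);
      }

     private:
      std::unordered_map<int64_t, std::shared_ptr<const PartitionMeta>> cache_;
    };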
Re: #pragma once?
Yea, when we looked into it I recall that it worked fine on all compilers from the last 10 years or something (in fact I remember using #pragma once on Metrowerks Codewarrior more than fifteen years ago). Given we require C++14 I don't think #pragma once is going to be the limiting factor in compiler version portability. -Todd On Wed, Aug 1, 2018 at 12:01 PM, Sailesh Mukil wrote: > An advantage of using #pragma once is potential improved compilation > speeds. However, a con is that it's non-standard and therefore, its > behavior can change at any point and can also vary across compilers, > potentially making the code even less portable. > > That being said, since Kudu has been using it for a while and has had no > issues, we can do the same since the potential benefits outweigh the cons. > > On Wed, Aug 1, 2018 at 11:48 AM, Tim Armstrong < > tarmstr...@cloudera.com.invalid> wrote: > > > Todd brought up our include guards on a code review, asking why we don't > > use #pragma once instead: https://gerrit.cloudera.org/#/c/10988/5 . It > > sounds like Kudu has switched to it > > > > #pragma once does seem cleaner and our GCC and Clang versions are modern > > enough to support it. > > > > What do people think about switching to that as the preferred way of > > including headers only once? > > > > - Tim > > > -- Todd Lipcon Software Engineer, Cloudera
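For anyone who hasn't seen the two forms side by side, the change per header is just this (guard and header names made up for the example):

    // Traditional include guard:
    #ifndef IMPALA_UTIL_FOO_H
    #define IMPALA_UTIL_FOO_H
    // ... declarations ...
    #endif  // IMPALA_UTIL_FOO_H

    // Equivalent with the (non-standard but widely supported) directive:
    #pragma once
    // ... declarations ...

Beyond being shorter, it avoids guard-name collisions and stale guard names when files get moved or copied.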
Re: Re: Impala Incident Sharing
On Wed, Aug 1, 2018 at 6:02 AM, Quanlong Huang wrote: > Hi Todd and Philip, > > > The idea of separate ClassLoader is cool. But I would vote for executing > java UDFs in a separate JVM. We don't need to create a JVM for each UDF. > They can share the same JVM with a configurable hard limit. If the JVM for > UDFs runs out of heap size, we can destroy the JVM and fail the queries. > (Not sure if this is doable) The drawback is that some innocent queries > using java UDFs will fail too, but the cluster will still be healthy. > I think this would be great, but we need to make sure not to regress performance significantly. For example, maybe we'd need to expose row batches as a shared memory region and then map them as a DirectByteBuffer on the Java side. I think this would probably be a lot of work, though, and maybe the Java UDF usage is uncommon enough that it may not be worth all of the coding? -Todd > > The patch of adding JVM pause logs and expose JMX metrics is really > helpful. I'd like to join the code review. > > > Thanks, > Quanlong > > At 2018-08-01 05:47:25, "Philip Zeyliger" > wrote: > >Hi Quanlong, > > > >Thank you for the incident note! You might be interested in > >https://gerrit.cloudera.org/#/c/10998/ which is adding some > instrumentation > >to make it easier to notice with monitoring tools that we're running out > of > >memory or having large GC pauses. > > > >-- Philip > > > >On Tue, Jul 31, 2018 at 8:21 AM Todd Lipcon > >wrote: > > > >> Hi Quanlong, > >> > >> Thanks for the writeup. I wonder if we could two of the issues with a > >> single approach of using a separate ClassLoader instance when we load > and > >> instantiate custom UDFs. For one, using a custom ClassLoader means that > we > >> could probably ensure that the user's jar is checked first even if > there is > >> a conflicting class on the Impala classpath. For two, if we close the > >> custom classloader and ensure that we don't retain any references to it > or > >> to classes created by it (i.e the UDF classes), then we should be able > to > >> collect leaked memory on the next GC. In the particular case you pointed > >> out, the leak is in static fields of the UDF, and those static fields > would > >> get destructed when destructing the classloader. > >> > >> On the downside, making this change could negatively harm performance if > >> any UDFs are relying on their static initializer blocks to do expensive > >> setup. Maybe a compromise would be to make this behavior optional or to > >> cache the ClassLoader via a WeakReference across instantiations of a > UDF so > >> that, if there isn't memory pressure, the UDF doesn't need to be > >> re-class-loaded repeatedly. > >> > >> As for tracking memory explicitly, the way that HBase does it is via > >> somewhat careful accounting inline with all of the code that allocates > >> memory. I don't think that approach is feasible in UDFs because the UDF > >> code is provided by end users, and they aren't likely to want to > carefully > >> track their heap usage. I think the only way we could effectively > "sandbox" > >> UDFs would be to actually execute them in a separate JVM with a limited > >> heap size, but that would be somewhat complex to implement without a big > >> performance drop. > >> > >> -Todd > >> > >> On Tue, Jul 31, 2018 at 7:53 AM, Quanlong Huang > > >> wrote: > >> > >> > Hi all, > >> > > >> > > >> > Just to draw your attention of a patch, I'd like to share an incident > >> with > >> > you. 
> >> > > >> > > >> > Two months ago we encountered an incident in one of our Impala > clusters > >> in > >> > version 2.5-cdh5.7.3. Queries were pilled up and most of them were in > >> > CREATED state. It looks like an infamous incident when the catalogd is > >> > loading metadata of huge tables and finally OOM so metadata loading > >> > requests from impalad are stuck. However, the catalogd looks good this > >> > time. The logs showed it's not loading any metadata. Then I looked > into > >> the > >> > statestored and found many timeout logs like "Unable to send topic > update > >> > message to subscriber impalad@xxx:22000, received error: RPC timed > out". > >> > Chose one of the problemati
Re: Enabling automatic code review precommit job
On Tue, Jul 31, 2018 at 9:02 AM, Tim Armstrong < tarmstr...@cloudera.com.invalid> wrote: > I agree that it might be a little strict at the moment and disallow some > reasonable formatting. The tool is controlled by the setup.cfg file in the > repo so it's easy enough to change the behaviour. > > I think we have been a little sloppy with Python style in general so I > think some of these would be good to change over time. I think the main > thing I'd like is to align the tool's behaviour with code reviews - if > we're going to be strict about PEP 8 compliance in code reviews we should > keep the tool strict. > > >109 : E125 continuation line with same indent as next logical line > I think this formatting is hard to read and confusing, I'm in favour of > leaving this enabled. > > >110 : E701 multiple statements on one line (colon) > This if because of the one-line if statements we have in the code. I don't > feel strongly either way as long as we're consistent. > > >129 : E231 missing whitespace after ',' > >185 : E251 unexpected spaces around keyword / parameter equals > >288 : E502 the backslash is redundant between brackets > These seems like sloppy/inconsistent formatting to me, I'm in favour of > keeping these enabled and fixing existing code as we go. > > > 368 : E302 expected 2 blank lines, found 1 > Our Python code is very inconsistent here, would be nice to make it more > consistent. I'm in favour of keeping it enabled and fixing as we go. > > >187 : E121 continuation line under-indented for hanging indent > > 1295 : E128 continuation line under-indented for visual indent > On one hand it would be nice to follow the PEP 8 style here ( > https://www.python.org/dev/peps/pep-0008/#indentation) but the other > idioms > seem fine to me. I've been asked on Python code reviews to switch to the > PEP 8 indentation style before so I think we should align the tools > behaviour with what we're going to ask for code reviews. > Alright, I agree with all of the above. One suggestion, though: is it possible to get something like 'autopep8' to run in a 'patch formatting' mode where it only reformats changed lines? Then we could more easily just run a single command to ensure that our patches are properly formatted before submitting to review. Or, at the very least, some instructions for running the same flake8-against-only-my-changed-lines that gerrit is running? -Todd > > > On Mon, Jul 30, 2018 at 7:48 PM, Todd Lipcon > wrote: > > > It seems like flake8 might need some adjustment of its policies. Here are > > the most common issues in the current test code: > > > > 109 : E125 continuation line with same indent as next logical line > > 110 : E701 multiple statements on one line (colon) > > 129 : E231 missing whitespace after ',' > > 185 : E251 unexpected spaces around keyword / parameter equals > > 187 : E121 continuation line under-indented for hanging indent > > 288 : E502 the backslash is redundant between brackets > > 368 : E302 expected 2 blank lines, found 1 > >1295 : E128 continuation line under-indented for visual indent > > > > Maybe worth just disabling some of the indentation-related ones to start? > > > > > > On Mon, Jul 30, 2018 at 4:09 PM, Tim Armstrong < > > tarmstr...@cloudera.com.invalid> wrote: > > > > > I have a few updates. > > > > > > I added an automatic build job for docs changes: > > > https://jenkins.impala.io/job/gerrit-docs-auto-test/ > > > > > > I'm going to disable the "Build started" message for > > > gerrit-code-review-checks. It seems a bit too chatty. 
Let me know if > you > > > disagree. > > > > > > I'm adding a job that automatically does some checks on the diff and > > posts > > > code review comments. I started off with Python flake8 comments. > > > > > > Let me know if you see any problems or if it turns out to be too noisy. > > > > > > > > > > > > On Mon, Jul 23, 2018 at 11:55 AM, Tim Armstrong < > tarmstr...@cloudera.com > > > > > > wrote: > > > > > > > Hi All, > > > > I'm enabling an automatic precommit job for code reviews uploaded > to > > > > gerrit that will run RAT, clang-tidy and a GCC debug compilation. > This > > is > > > > to provide faster feedback on code reviews: > https://issues.apache.org/ > > > > jira/browse/IMPALA-7317 . I'll add some more checks but I'm wanting > to > > > > test the basic mechanism for a bit now. > > > > > > > > It excludes docs/ commits. > > > > > > > > Let me know if you see any problems with it. > > > > > > > > Thanks, > > > > Tim > > > > > > > > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Impala Incident Sharing
E-16196). That's why the impalads got OOM. Memory in the JVM is not > tracked. The bug in UdfJson is fixed in later hive versions ( > https://github.com/cloudera/hive/commit/8397f933ea3ee5e2087b4def6b58ed > 5d77026108#diff-e0b4cff52301169fede5c9550e968b67 ). However, using new > hive jars (e.g. hive-exec-1.1.0-cdh5.16.0-SNAPSHOT-core.jar) in the > CREATE FUNCTION statement does not work, since the class name is still the > same and the hive jars in version cdh5.7.3 are already in the CLASSPATH. > Impala will still load the old implementation of UDFJson. To work around > this, I copied the implementation of UDFJson and use another class name, > then use my custom jar file instead. > > This is not the final fix since memory used in java UDFs is not tracked. > It's still unsafe. Looking into the C++ implementation of get_json_object ( > https://github.com/nazgul33/impala-get-json-object-udf) provided by an > early comment in IMPALA-376, it still does not track the memory. So I wrote > another implementation and it's ready for code review: > https://gerrit.cloudera.org/c/10950/ (It has passed the pre-review-test: > https://jenkins.impala.io/job/pre-review-test/189/ ) Yes, this is the > purpose of this email. Thank you for reading! > > Further thoughts: > * We might need to mention in the docs that adding java UDFs may not use > the given jar file if the impala CLASSPATH already contains a jar file > containing the same class. > * We should avoid using Hive builtin UDFs and any other Java UDFs since > their memory is not tracked. > * How to track memory used in JVM? HBase, a pure java project, is able to > track its MemStore and BlockCache size. Can we learn from it? > > Thanks, > Quanlong > -- > Quanlong Huang > Software Developer, Hulu -- Todd Lipcon Software Engineer, Cloudera
Re: Enabling automatic code review precommit job
It seems like flake8 might need some adjustment of its policies. Here are the most common issues in the current test code: 109 : E125 continuation line with same indent as next logical line 110 : E701 multiple statements on one line (colon) 129 : E231 missing whitespace after ',' 185 : E251 unexpected spaces around keyword / parameter equals 187 : E121 continuation line under-indented for hanging indent 288 : E502 the backslash is redundant between brackets 368 : E302 expected 2 blank lines, found 1 1295 : E128 continuation line under-indented for visual indent Maybe worth just disabling some of the indentation-related ones to start? On Mon, Jul 30, 2018 at 4:09 PM, Tim Armstrong < tarmstr...@cloudera.com.invalid> wrote: > I have a few updates. > > I added an automatic build job for docs changes: > https://jenkins.impala.io/job/gerrit-docs-auto-test/ > > I'm going to disable the "Build started" message for > gerrit-code-review-checks. It seems a bit too chatty. Let me know if you > disagree. > > I'm adding a job that automatically does some checks on the diff and posts > code review comments. I started off with Python flake8 comments. > > Let me know if you see any problems or if it turns out to be too noisy. > > > > On Mon, Jul 23, 2018 at 11:55 AM, Tim Armstrong > wrote: > > > Hi All, > > I'm enabling an automatic precommit job for code reviews uploaded to > > gerrit that will run RAT, clang-tidy and a GCC debug compilation. This is > > to provide faster feedback on code reviews: https://issues.apache.org/ > > jira/browse/IMPALA-7317 . I'll add some more checks but I'm wanting to > > test the basic mechanism for a bit now. > > > > It excludes docs/ commits. > > > > Let me know if you see any problems with it. > > > > Thanks, > > Tim > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Order of error messages printed
I suppose the advantage of it being ordered is that at least the output order is deterministic. That is to say, if a new warning shows up, it won't re-shuffle the ordering of all prior warnings. And we'll get the same results regardless of whether we're using libstdcxx or libc++, which might make test assertions easier. -Todd On Tue, Jul 24, 2018 at 12:26 PM, Sailesh Mukil < sail...@cloudera.com.invalid> wrote: > The ErrorLogMap's key is "TErrorCode::type" which is an Enum. I don't think > we can derive any inherent meaning from the order of the errors listed in > the enum (other than OK=0, which wouldn't be in this map anyway). I'm not > aware of any clients relying on this order, and even if there were, it > would probably be incorrect to do so. So, unless anyone else chimes in with > a good reason to keep this as a map instead of an unordered_map, I'm all > for changing it. > > On Tue, Jul 24, 2018 at 12:12 PM, Michael Ho > wrote: > > > Hi, > > > > While making some changes for be/src/util/error-util.cc, I stumbled upon > > the question of whether we need a map instead of an unordered map for > > ErrorLogMap (used for tracking error messages for each error code). This > > affects the order in which error messages are printed in PrintErrorMap(). > > Looking through the code, it doesn't appear to me that there is any > > ordering requirement. However, it's unclear to me if any client of Impala > > makes assumption about the ordering of the output in PrintErrorMap(). So, > > sending this email to the list in case anyone knows anything. > > > > -- > > Thanks, > > Michael > > > -- Todd Lipcon Software Engineer, Cloudera
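A tiny standalone illustration of the determinism point (not Impala code):

    #include <iostream>
    #include <map>
    #include <unordered_map>

    int main() {
      // std::map iterates in key order, so this prints "1 2 3" on every run and on
      // every standard library implementation, regardless of insertion order.
      std::map<int, const char*> ordered = {{3, "c"}, {1, "a"}, {2, "b"}};
      for (const auto& kv : ordered) std::cout << kv.first << " ";
      std::cout << std::endl;

      // std::unordered_map iteration order is unspecified and can differ between
      // libstdc++ and libc++ (and across versions), so any test or client that
      // asserts on the printed order could become flaky if we switched.
      std::unordered_map<int, const char*> unordered = {{3, "c"}, {1, "a"}, {2, "b"}};
      for (const auto& kv : unordered) std::cout << kv.first << " ";
      std::cout << std::endl;
      return 0;
    }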
Re: Broken Impala Build?
Figured it out: I had to cd fe && mvn -U clean install At least now my impalad's starting. We'll see if the data load works. It seems maven needs to be forced to go pick up the new artifacts after being pointed at a new repository. -Todd On Wed, Jul 18, 2018 at 5:32 PM, Pooja Nilangekar < pooja.nilange...@cloudera.com.invalid> wrote: > I am still having the same issue. (I haven't tried git clean either). > > Thanks, > Pooja > > > On Wed, Jul 18, 2018 at 5:30 PM Todd Lipcon > wrote: > > > Anyone else still having problems? I followed Fredy's directions but the > > data load is failing with the same NoClassDefFound issue while attempting > > to start the impalad. I haven't tried a full "git clean" yet but I guess > > that iwll be my last resort. > > > > On Wed, Jul 18, 2018 at 4:22 PM, Fredy Wijaya > > > > > wrote: > > > > > The CR is merged and the build looks good now. Please remove > > > $IMPALA_HOME/toolchain and run a full build (./buildall.sh -format > > > -testdata -notests). It's also recommended to exit any shell sessions > to > > > avoid issue related to stale environment variable. > > > > > > On Wed, Jul 18, 2018 at 2:16 PM Fredy Wijaya > > wrote: > > > > > > > A CR to fix the issue: https://gerrit.cloudera.org/c/10981/ > > > > I'm running a dry-run to make sure everything is good with the new > > build > > > > number. > > > > > > > > On Wed, Jul 18, 2018 at 12:24 PM Todd Lipcon > > > > > > > wrote: > > > > > > > >> Yea, I just got the same on a gerrit build. I would guess this > somehow > > > >> means we are ending up with mismatched versions of the "HDFS" and > > > "common" > > > >> Hadoop jars on our classpath. Possibly one is getting picked up from > > > Maven > > > >> whereas another is getting picked up from toolchain? > > > >> > > > >> -Todd > > > >> > > > >> On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya > > > >> > > >> > wrote: > > > >> > > > >> > I got this error trying to do a clean build in HEAD. Did anyone > > > >> encounter > > > >> > the same issue? > > > >> > > > > >> > I0718 10:03:13.658380 23734 jni-util.cc:230] > > > >> > java.lang.NoClassDefFoundError: > > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > > > >> > at > > > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf. > > > >> > getChecksumCombineModeFromConf(DfsClientConf.java:314) > > > >> > at > > > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.( > > > >> > DfsClientConf.java:184) > > > >> > at org.apache.hadoop.hdfs.DFSClient.(DFSClient. > java:302) > > > >> > at org.apache.hadoop.hdfs.DFSClient.(DFSClient. 
> java:286) > > > >> > at > > > >> > org.apache.hadoop.hdfs.DistributedFileSystem.initialize( > > > >> > DistributedFileSystem.java:167) > > > >> > at > > > >> > org.apache.hadoop.fs.FileSystem.createFileSystem( > > > FileSystem.java:3288) > > > >> > at org.apache.hadoop.fs.FileSystem.access$200( > > > FileSystem.java:123) > > > >> > at > > > >> > org.apache.hadoop.fs.FileSystem$Cache.getInternal( > > > FileSystem.java:3337) > > > >> > at org.apache.hadoop.fs.FileSystem$Cache.get( > > > FileSystem.java:3305) > > > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476) > > > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225) > > > >> > at > > > >> > org.apache.impala.service.JniFrontend.checkFileSystem( > > > >> > JniFrontend.java:778) > > > >> > at > > > >> > org.apache.impala.service.JniFrontend.checkConfiguration( > > > >> > JniFrontend.java:690) > > > >> > I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError: > > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > > > >> > @ 0x1951b3a > > > >> > @ 0x1ff307f > > > >> > @ 0x1e9ec51 > > > >> > @ 0x1eba8bb > > > >> > @ 0x1eb6df2 > > > >> > @ 0x190d4bf > > > >> > @ 0x7fe6e95bf82f > > > >> > @ 0x190d338 > > > >> > E0718 10:03:13.659101 23734 impala-server.cc:289] > > > NoClassDefFoundError: > > > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > > > >> > E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala > > > Server > > > >> > startup due to improper configuration. Impalad exiting. > > > >> > > > > >> > > > >> > > > >> > > > >> -- > > > >> Todd Lipcon > > > >> Software Engineer, Cloudera > > > >> > > > > > > > > > > > > > > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Broken Impala Build?
Anyone else still having problems? I followed Fredy's directions but the data load is failing with the same NoClassDefFound issue while attempting to start the impalad. I haven't tried a full "git clean" yet but I guess that iwll be my last resort. On Wed, Jul 18, 2018 at 4:22 PM, Fredy Wijaya wrote: > The CR is merged and the build looks good now. Please remove > $IMPALA_HOME/toolchain and run a full build (./buildall.sh -format > -testdata -notests). It's also recommended to exit any shell sessions to > avoid issue related to stale environment variable. > > On Wed, Jul 18, 2018 at 2:16 PM Fredy Wijaya wrote: > > > A CR to fix the issue: https://gerrit.cloudera.org/c/10981/ > > I'm running a dry-run to make sure everything is good with the new build > > number. > > > > On Wed, Jul 18, 2018 at 12:24 PM Todd Lipcon > > wrote: > > > >> Yea, I just got the same on a gerrit build. I would guess this somehow > >> means we are ending up with mismatched versions of the "HDFS" and > "common" > >> Hadoop jars on our classpath. Possibly one is getting picked up from > Maven > >> whereas another is getting picked up from toolchain? > >> > >> -Todd > >> > >> On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya > >> >> > wrote: > >> > >> > I got this error trying to do a clean build in HEAD. Did anyone > >> encounter > >> > the same issue? > >> > > >> > I0718 10:03:13.658380 23734 jni-util.cc:230] > >> > java.lang.NoClassDefFoundError: > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > >> > at > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf. > >> > getChecksumCombineModeFromConf(DfsClientConf.java:314) > >> > at > >> > org.apache.hadoop.hdfs.client.impl.DfsClientConf.( > >> > DfsClientConf.java:184) > >> > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:302) > >> > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:286) > >> > at > >> > org.apache.hadoop.hdfs.DistributedFileSystem.initialize( > >> > DistributedFileSystem.java:167) > >> > at > >> > org.apache.hadoop.fs.FileSystem.createFileSystem( > FileSystem.java:3288) > >> > at org.apache.hadoop.fs.FileSystem.access$200( > FileSystem.java:123) > >> > at > >> > org.apache.hadoop.fs.FileSystem$Cache.getInternal( > FileSystem.java:3337) > >> > at org.apache.hadoop.fs.FileSystem$Cache.get( > FileSystem.java:3305) > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476) > >> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225) > >> > at > >> > org.apache.impala.service.JniFrontend.checkFileSystem( > >> > JniFrontend.java:778) > >> > at > >> > org.apache.impala.service.JniFrontend.checkConfiguration( > >> > JniFrontend.java:690) > >> > I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError: > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > >> > @ 0x1951b3a > >> > @ 0x1ff307f > >> > @ 0x1e9ec51 > >> > @ 0x1eba8bb > >> > @ 0x1eb6df2 > >> > @ 0x190d4bf > >> > @ 0x7fe6e95bf82f > >> > @ 0x190d338 > >> > E0718 10:03:13.659101 23734 impala-server.cc:289] > NoClassDefFoundError: > >> > org/apache/hadoop/fs/Options$ChecksumCombineMode > >> > E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala > Server > >> > startup due to improper configuration. Impalad exiting. > >> > > >> > >> > >> > >> -- > >> Todd Lipcon > >> Software Engineer, Cloudera > >> > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Broken Impala Build?
Yea, I just got the same on a gerrit build. I would guess this somehow means we are ending up with mismatched versions of the "HDFS" and "common" Hadoop jars on our classpath. Possibly one is getting picked up from Maven whereas another is getting picked up from toolchain? -Todd On Wed, Jul 18, 2018 at 10:21 AM, Fredy Wijaya wrote: > I got this error trying to do a clean build in HEAD. Did anyone encounter > the same issue? > > I0718 10:03:13.658380 23734 jni-util.cc:230] > java.lang.NoClassDefFoundError: > org/apache/hadoop/fs/Options$ChecksumCombineMode > at > org.apache.hadoop.hdfs.client.impl.DfsClientConf. > getChecksumCombineModeFromConf(DfsClientConf.java:314) > at > org.apache.hadoop.hdfs.client.impl.DfsClientConf.( > DfsClientConf.java:184) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:302) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:286) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize( > DistributedFileSystem.java:167) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225) > at > org.apache.impala.service.JniFrontend.checkFileSystem( > JniFrontend.java:778) > at > org.apache.impala.service.JniFrontend.checkConfiguration( > JniFrontend.java:690) > I0718 10:03:13.659090 23734 status.cc:125] NoClassDefFoundError: > org/apache/hadoop/fs/Options$ChecksumCombineMode > @ 0x1951b3a > @ 0x1ff307f > @ 0x1e9ec51 > @ 0x1eba8bb > @ 0x1eb6df2 > @ 0x190d4bf > @ 0x7fe6e95bf82f > @ 0x190d338 > E0718 10:03:13.659101 23734 impala-server.cc:289] NoClassDefFoundError: > org/apache/hadoop/fs/Options$ChecksumCombineMode > E0718 10:03:13.659122 23734 impala-server.cc:292] Aborting Impala Server > startup due to improper configuration. Impalad exiting. > -- Todd Lipcon Software Engineer, Cloudera
Re: impalad running as superuser in tests
On Tue, Jul 17, 2018 at 5:27 PM, Sailesh Mukil wrote: > On Tue, Jul 17, 2018 at 2:47 PM, Todd Lipcon > wrote: > > > Hey folks, > > > > I'm working on a regression test for IMPALA-7311 and found something > > interesting. It appears that in our normal minicluster setup, impalad > runs > > as the same username as the namenode (namely, the username of the > > developer, in my case 'todd'). > > > > This means that the NN treats impala as a superuser, and therefore > doesn't > > actually enforce permissions. So, tests about the behavior of Impala on > > files that it doesn't have access to are somewhat tricky to write. > > > > > What kind of files do you specifically mean? Something that the daemon > tries to access directly (Eg: keytab file, log files, etc.) ? I'm guessing > it's not this since you mentioned the NN. > > Or files that belong to a table/partition in HDFS? If it's this case, we > would go through Sentry before accessing files that belong to a table, and > access would be determined by Sentry on the "session user" (not the impalad > user) before Impala even tries to access HDFS. (Eg: > tests/authorization/test_authorization.py) > Right, files on HDFS. I mean that, in cases where Sentry is not enabled or set up, and even in some cases where it is set up but not synchronized with HDFS, it's possible that the user can point table metadata at files or directories that aren't writable to the 'impala' user on HDFS. For example, I can do: CREATE EXTERNAL TABLE foo (...) LOCATION '/user/todd/my-dir'; and it's likely that 'my-dir' is not writable by 'impala' on a real cluster. Thus, if I try to insert into it, I get an error because "impala" does not have HDFS permissions to access this directory. Currently, the frontend does some checks here to try to produce a nice error. But, those checks are based on cached metadata which could be in accurate. In the case that it's inaccurate, the error will be thrown from the backend when it tries to create a file in a non-writable location. In the minicluster environment, it's impossible to test this case (actual permissions enforced by the NN causing an error) because the backend is running as an HDFS superuser. That is to say, it has full permissions everywhere. That's due to the special case behavior that HDFS has: it determines the name of the superuser to be the username that is running the NN. Since in the minicluster, both impala and the NN run as 'todd' in my case, impala acts as superuser. In a real cluster (even with security disabled) impala typically runs as 'impala' whereas the NN runs as 'hdfs' and thus impala does not have superuser privileges. -Todd -- Todd Lipcon Software Engineer, Cloudera
impalad running as superuser in tests
Hey folks, I'm working on a regression test for IMPALA-7311 and found something interesting. It appears that in our normal minicluster setup, impalad runs as the same username as the namenode (namely, the username of the developer, in my case 'todd'). This means that the NN treats impala as a superuser, and therefore doesn't actually enforce permissions. So, tests about the behavior of Impala on files that it doesn't have access to are somewhat tricky to write. Has anyone run into this before? Should we consider running either the impalad or the namenode as a different spoofed username so that the minicluster environment is more authentic to true cluster environments? We can do this easily by setting the HADOOP_USER_NAME environment variable or system property. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Inconsistent handling of schema in Avro tables
On Thu, Jul 12, 2018 at 5:07 PM, Bharath Vissapragada < bhara...@cloudera.com.invalid> wrote: > On Thu, Jul 12, 2018 at 12:03 PM Todd Lipcon > wrote: > > > > So, I think my proposal here is: > > > > 1. Query behavior on existing tables > > - If the table-level format is non-Avro, > > - AND the table contains column types incompatible with Avro (eg > tinyint), > > - AND the table has an existing avro partition, > > - THEN the query will yield an error about incompatible types > > > > 2. Try to prevent shooting in the foot > > - If the table-level format is non-Avro, > > - AND the table contains column types incompatible with Avro (eg > tinyint), > > - THEN disallow changing the file format of an existing partition to Avro > > > > This approach sounds good to me. Is the existing behavior preserved for the > non-mixed-partition Avro based tables? > Yes, if the table-level properties indicate it's an Avro table we'll let the Avro schema (explicit or implicit) override the columns in the metastore. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Inconsistent handling of schema in Avro tables
Again there's inconsistency with Hive: the presence of a single Avro partition doesn't change the table-level schema. The interesting thing is that, when I modified Impala to have a similar behavior, I got the following error from the backend when trying to query the data: WARNINGS: Unresolvable types for column 'tinyint_col': declared column type: TINYINT, table's Avro schema type: int (1 of 2 similar) Hive, however, when presented with such a situation, does an incorrect type coercion of the avro file's "int" into the table's "tinyint" schema. What's incorrect about the type coercion is that it doesn't handle out-of-range values in a sensible way. Rather, it just lets them overflow into the smaller target type -- when I dumped an Avro file containing the int "1" into an Avro partition of a Parquet table with schema "tinyint", I got the value "16" output from Hive (1%256=16). This downcasting behavior is consistent with Impala and Hive's behavior with a CAST(1 as tinyint) expression (see IMPALA-1821) So, I think my proposal here is: 1. Query behavior on existing tables - If the table-level format is non-Avro, - AND the table contains column types incompatible with Avro (eg tinyint), - AND the table has an existing avro partition, - THEN the query will yield an error about incompatible types 2. Try to prevent shooting in the foot - If the table-level format is non-Avro, - AND the table contains column types incompatible with Avro (eg tinyint), - THEN disallow changing the file format of an existing partition to Avro -Todd On Wed, Jul 11, 2018 at 9:32 PM, Todd Lipcon wrote: > Turns out it's even a bit more messy. The presence of one or more avro > partitions can change the types of existing columns, even if there is no > explicit avro schema specified for the table: > https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8 > > Not quite sure how to handle this one in a world where we don't load all > of the partitions up front. Perhaps the best approach is to just throw an > error and then provide a command for the user to "re-sync" the schema to > the appropriate avro-supported types? Hive provides ALTER TABLE > UPDATE COLUMNS for something like this, though still I don't think that > would iterate over all partitions in the case of a mixed table. > > -Todd > > On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada < > bhara...@cloudera.com.invalid> wrote: > >> Agreed. >> >> On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon >> wrote: >> >> > Your commit message there makes sense, Bharath -- we should set >> > 'avroSchema' in the descriptor in case any referenced partition is avro, >> > because the scanner needs that info. However, we don't need to also >> > override the table-level schema. So, I think we can preserve the fix >> that >> > you made while also making the behavior less surprising. >> > >> > -Todd >> > >> > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada < >> > bhara...@cloudera.com.invalid> wrote: >> > >> > > I added this functionality >> > > <https://github.com/apache/impala/commit/49610e2cfa40aa10b62 >> 6c5ae41d7f0 >> > > d99d7cabc5> >> > > where adding an Avro partition in a mixed partition table resets the >> > table >> > > level schema. While I don't exactly remember why we chose this path, >> I do >> > > recall that we debated quite a bit about Avro schema evolution causing >> > > schema inconsistencies across partitions. AFAICT there is no specific >> > > reason Impala chose to different from Hive. 
Now that I see your email, >> > > Hive's behavior makes more sense to me, especially in the context of >> lazy >> > > loading of metadata. >> > > >> > > Also, agree with Edward that the whole mixed partitions + Avro schema >> > > evolution is a mess and I doubt if any serious user relies on a >> specific >> > > behavior. >> > > >> > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo < >> edlinuxg...@gmail.com> >> > > wrote: >> > > >> > > > I know that Hive can deal with schema being different per partition, >> > but >> > > I >> > > > really hesitate to understand why someone would want to do this. If >> > > someone >> > > > asked me to support a mixed avro/parquet table I would suggest they >> > > create >> > > > a view. If they kept insist
Re: Inconsistent handling of schema in Avro tables
Turns out it's even a bit more messy. The presence of one or more avro partitions can change the types of existing columns, even if there is no explicit avro schema specified for the table: https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8 Not quite sure how to handle this one in a world where we don't load all of the partitions up front. Perhaps the best approach is to just throw an error and then provide a command for the user to "re-sync" the schema to the appropriate avro-supported types? Hive provides ALTER TABLE UPDATE COLUMNS for something like this, though still I don't think that would iterate over all partitions in the case of a mixed table. -Todd On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada < bhara...@cloudera.com.invalid> wrote: > Agreed. > > On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon > wrote: > > > Your commit message there makes sense, Bharath -- we should set > > 'avroSchema' in the descriptor in case any referenced partition is avro, > > because the scanner needs that info. However, we don't need to also > > override the table-level schema. So, I think we can preserve the fix that > > you made while also making the behavior less surprising. > > > > -Todd > > > > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada < > > bhara...@cloudera.com.invalid> wrote: > > > > > I added this functionality > > > <https://github.com/apache/impala/commit/ > 49610e2cfa40aa10b626c5ae41d7f0 > > > d99d7cabc5> > > > where adding an Avro partition in a mixed partition table resets the > > table > > > level schema. While I don't exactly remember why we chose this path, I > do > > > recall that we debated quite a bit about Avro schema evolution causing > > > schema inconsistencies across partitions. AFAICT there is no specific > > > reason Impala chose to different from Hive. Now that I see your email, > > > Hive's behavior makes more sense to me, especially in the context of > lazy > > > loading of metadata. > > > > > > Also, agree with Edward that the whole mixed partitions + Avro schema > > > evolution is a mess and I doubt if any serious user relies on a > specific > > > behavior. > > > > > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo > > > > wrote: > > > > > > > I know that Hive can deal with schema being different per partition, > > but > > > I > > > > really hesitate to understand why someone would want to do this. If > > > someone > > > > asked me to support a mixed avro/parquet table I would suggest they > > > create > > > > a view. If they kept insisting I would reply "Well it is your > funeral." > > > > > > > > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon > > > > > > > wrote: > > > > > > > > > Hey folks, > > > > > > > > > > I'm trying to understand the current behavior of tables that > contain > > > > > partitions of mixed format, specifically when one or more > partitions > > is > > > > > stored as Avro. Impala seems to be doing a number of things which I > > > find > > > > > surprising, and I'm not sure if they are intentional or should be > > > > > considered bugs. 
> > > > > > > > > > *Surprise 1*: the _presence_ of an Avro-formatted partition can > > change > > > > the > > > > > table schema > > > > > https://gist.github.com/74bdef8a69b558763e4453ac21313649 > > > > > > > > > > - create a table that is Parquet-formatted, but with an > > > 'avro.schema.url' > > > > > property > > > > > - the Avro schema is ignored, and we see whatever schema we > specified > > > > > (*makes > > > > > sense, because the table is Parquet)* > > > > > - add an partition > > > > > - set the new partition's format to Avro > > > > > - refresh the table > > > > > - the schema for the table now reflects the Avro schema, because it > > has > > > > at > > > > > least one Avro partition > > > > > > > > > > *Surprise 2*: the above is inconsistent with Hive and Spark > > > > > > > > > > Hive seems to still reflect the table-level defined schema, and > > ignore > > > > the > > > > > avro.schema.url property in this m
Re: Inconsistent handling of schema in Avro tables
Your commit message there makes sense, Bharath -- we should set 'avroSchema' in the descriptor in case any referenced partition is avro, because the scanner needs that info. However, we don't need to also override the table-level schema. So, I think we can preserve the fix that you made while also making the behavior less surprising. -Todd On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada < bhara...@cloudera.com.invalid> wrote: > I added this functionality > <https://github.com/apache/impala/commit/49610e2cfa40aa10b626c5ae41d7f0 > d99d7cabc5> > where adding an Avro partition in a mixed partition table resets the table > level schema. While I don't exactly remember why we chose this path, I do > recall that we debated quite a bit about Avro schema evolution causing > schema inconsistencies across partitions. AFAICT there is no specific > reason Impala chose to different from Hive. Now that I see your email, > Hive's behavior makes more sense to me, especially in the context of lazy > loading of metadata. > > Also, agree with Edward that the whole mixed partitions + Avro schema > evolution is a mess and I doubt if any serious user relies on a specific > behavior. > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo > wrote: > > > I know that Hive can deal with schema being different per partition, but > I > > really hesitate to understand why someone would want to do this. If > someone > > asked me to support a mixed avro/parquet table I would suggest they > create > > a view. If they kept insisting I would reply "Well it is your funeral." > > > > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon > > wrote: > > > > > Hey folks, > > > > > > I'm trying to understand the current behavior of tables that contain > > > partitions of mixed format, specifically when one or more partitions is > > > stored as Avro. Impala seems to be doing a number of things which I > find > > > surprising, and I'm not sure if they are intentional or should be > > > considered bugs. > > > > > > *Surprise 1*: the _presence_ of an Avro-formatted partition can change > > the > > > table schema > > > https://gist.github.com/74bdef8a69b558763e4453ac21313649 > > > > > > - create a table that is Parquet-formatted, but with an > 'avro.schema.url' > > > property > > > - the Avro schema is ignored, and we see whatever schema we specified > > > (*makes > > > sense, because the table is Parquet)* > > > - add an partition > > > - set the new partition's format to Avro > > > - refresh the table > > > - the schema for the table now reflects the Avro schema, because it has > > at > > > least one Avro partition > > > > > > *Surprise 2*: the above is inconsistent with Hive and Spark > > > > > > Hive seems to still reflect the table-level defined schema, and ignore > > the > > > avro.schema.url property in this mixed scenario. That is to say, with > the > > > state set up by the above, we have the following behavior: > > > > > > Impala: > > > - uses the external avro schema for all table-level info, SELECT *, > etc. > > > - "compute stats" detects the inconsistency and tells the user to > > recreate > > > the table. > > > - if some existing partitions (eg in Parquet) aren't compatible with > that > > > avro schema, errors result from the backend that there are missing > > columns > > > in the Parquet data files > > > > > > Hive: > > > - uses the table-level schema defined in the HMS for describe, etc > > > - queries like 'select *' again use the table-level HMS schema. 
The > > > underlying reader that reads the Avro partition seems to use the > defined > > > external Avro schema, resulting in nulls for missing columns. > > > - computing stats (analyze table mixedtable partition (y=1) compute > stats > > > for columns) seems to end up only recording stats against the column > > > defined in the table-level Schema. > > > > > > Spark: > > > - DESCRIBE TABLE shows the table-level info > > > - select * fails, because apparently Spark doesn't support multi-format > > > tables at all (it tries to read the avro files as a parquet file) > > > > > > > > > It seems to me that Hive's behavior is a bit better.* I'd like to > propose > > > we treat this as a bug and move to the following behavior:* > &g
Inconsistent handling of schema in Avro tables
Hey folks, I'm trying to understand the current behavior of tables that contain partitions of mixed format, specifically when one or more partitions is stored as Avro. Impala seems to be doing a number of things which I find surprising, and I'm not sure if they are intentional or should be considered bugs. *Surprise 1*: the _presence_ of an Avro-formatted partition can change the table schema https://gist.github.com/74bdef8a69b558763e4453ac21313649 - create a table that is Parquet-formatted, but with an 'avro.schema.url' property - the Avro schema is ignored, and we see whatever schema we specified (*makes sense, because the table is Parquet)* - add an partition - set the new partition's format to Avro - refresh the table - the schema for the table now reflects the Avro schema, because it has at least one Avro partition *Surprise 2*: the above is inconsistent with Hive and Spark Hive seems to still reflect the table-level defined schema, and ignore the avro.schema.url property in this mixed scenario. That is to say, with the state set up by the above, we have the following behavior: Impala: - uses the external avro schema for all table-level info, SELECT *, etc. - "compute stats" detects the inconsistency and tells the user to recreate the table. - if some existing partitions (eg in Parquet) aren't compatible with that avro schema, errors result from the backend that there are missing columns in the Parquet data files Hive: - uses the table-level schema defined in the HMS for describe, etc - queries like 'select *' again use the table-level HMS schema. The underlying reader that reads the Avro partition seems to use the defined external Avro schema, resulting in nulls for missing columns. - computing stats (analyze table mixedtable partition (y=1) compute stats for columns) seems to end up only recording stats against the column defined in the table-level Schema. Spark: - DESCRIBE TABLE shows the table-level info - select * fails, because apparently Spark doesn't support multi-format tables at all (it tries to read the avro files as a parquet file) It seems to me that Hive's behavior is a bit better.* I'd like to propose we treat this as a bug and move to the following behavior:* - if a table's properties indicate it's an avro table, parse and adopt the external avro schema as the table schema - if a table's properties indicate it's _not_ an avro table, but there is an external avro schema defined in the table properties, then parse the avro schema and include it in the TableDescriptor (for use by avro partitions) but do not adopt it as the table schema. The added benefit of the above proposal (and the reason why I started looking into this in the first place) is that, in order to service a simple query like DESCRIBE, our current behavior requires all partition metadata to be loaded to know whether there is any avro-formatted partition. With the proposed new behavior, we can avoid looking at all partitions. This is important for any metadata design which supports fine-grained loading of metadata to the coordinator. -Todd -- Todd Lipcon Software Engineer, Cloudera
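To make the proposed rule concrete, here is a minimal C++ sketch. The type and function names below are invented for illustration and are not Impala's actual catalog code; the point is only that the external Avro schema is adopted as the table-level schema solely when the table itself is declared Avro, while still being carried in the descriptor for any Avro-formatted partitions.

  #include <optional>
  #include <string>

  // Hypothetical, simplified types, for illustration only.
  struct Schema { std::string column_defs; };
  enum class Format { AVRO, PARQUET, TEXT };

  struct TableInput {
    Format table_format;                   // format declared at the table level
    Schema hms_schema;                     // column list stored in the HMS
    std::optional<Schema> external_avro;   // parsed from avro.schema.url/literal, if present
  };

  struct TableDescriptorSketch {
    Schema table_schema;                   // what DESCRIBE and SELECT * see
    std::optional<Schema> avro_schema;     // handed to Avro partition scanners
  };

  TableDescriptorSketch ResolveSchema(const TableInput& t) {
    TableDescriptorSketch desc;
    desc.table_schema = t.hms_schema;      // default: the HMS-defined schema
    if (t.external_avro) {
      // Keep the Avro schema available for any Avro-formatted partitions...
      desc.avro_schema = t.external_avro;
      // ...but only adopt it as the table schema if the table itself is Avro.
      if (t.table_format == Format::AVRO) desc.table_schema = *t.external_avro;
    }
    return desc;
  }

Under a rule like this, answering DESCRIBE on a mixed-format table never requires looking at partition metadata at all, which is the property the fine-grained metadata loading work relies on.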
Re: boost::scoped_ptr vs std::unique_ptr
Definitely in favor. Personally I never found the "this pointer isn't movable" to be a worthwhile distinction. With unique_ptr you need to pretty explicitly move it using std::move, so you don't get "accidental" moves like you used to with std::auto_ptr. Looking briefly at Kudu we have 129 unique_ptr members and only 7 of them are marked const, so at least we haven't found that a particularly useful pattern. -Todd On Thu, Jul 5, 2018 at 5:23 PM, Jim Apple wrote: > I suspect we could also make own own immobile_ptr with minimal effort, > thereby both making the difference more visible and reducing boost > dependence. > > On Thu, Jul 5, 2018 at 5:17 PM, Sailesh Mukil > > wrote: > > > I'm in favor. > > > > Since the main distinction is that a unique_ptr is moveable, whereas a > > scoped_ptr is not, we should just make sure that we do our due diligence > > during code reviews so that we catch those cases. > > > > Also, making a unique_ptr const disallows moving it, since the move > > constructor takes a non-const unique_ptr container. It probably won't > work > > in all places, but we could enforce that in certain parts of the code. > > > > On Thu, Jul 5, 2018 at 4:49 PM, Thomas Tauber-Marshall < > > tmarsh...@cloudera.com.invalid> wrote: > > > > > I'm definitely in favor of using more standard c++ to reduce both > > confusion > > > and our reliance on boost, especially as I suspect a lot of people (eg. > > me) > > > don't know the subtle difference between scoped_ptr and unique_ptr off > > the > > > top of their head anyways. > > > > > > Fwiw, I was under the impression from talking with people in the past > > that > > > we were already trying to make this move, and the > > > PartitionedAggregationNode refactor that just went in made the switch > to > > > unique_ptr, though no one commented on it in the review. > > > > > > On Thu, Jul 5, 2018 at 4:39 PM Tim Armstrong > > > wrote: > > > > > > > I was just talking with Michael Ho on a review about this > > > > https://gerrit.cloudera.org/#/c/10810/7/be/src/exec/scan-node.h@271 > > > > > > > > For a while we've continued using scoped_ptr in some places because > it > > > > supports a smaller set of operators and implies that the pointer > isn't > > > > movable. See > > > > > > > > https://cwiki.apache.org/confluence/display/IMPALA/ > > > Resource+Management+Best+Practices+in+Impala > > > > . > > > > > > > > I don't think we're consistently following this pattern and it seems > to > > > > cause some confusion about what the best practice is, particularly > for > > > > people coming from other code bases. I personally like the > distinction, > > > but > > > > I don't feel that strongly about it. > > > > > > > > What do people think? Should we continue using scoped_ptr or move > away > > > from > > > > it. There is already a JIRA to make the change but we haven't done it > > > > because of the above reasons: > > > > https://issues.apache.org/jira/browse/IMPALA-3444 > > > > > > > > - Tim > > > > > > > > > > -- Todd Lipcon Software Engineer, Cloudera
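A tiny example of the distinction being discussed (plain standard C++, nothing Impala-specific): a unique_ptr only transfers ownership when you explicitly ask for it, and marking it const pins it in place much like scoped_ptr did.

  #include <memory>
  #include <utility>

  struct Widget {};

  int main() {
    std::unique_ptr<Widget> a(new Widget());
    // std::unique_ptr<Widget> b = a;            // does not compile: no implicit copy
    std::unique_ptr<Widget> b = std::move(a);    // ownership transfer must be explicit

    const std::unique_ptr<Widget> pinned(new Widget());
    // std::unique_ptr<Widget> c = std::move(pinned);  // does not compile: the move
    // constructor takes a non-const unique_ptr, so 'const' makes the pointer immobile.
    return 0;
  }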
Re: Proposal for a new approach to Impala metadata
Hi Marcel, Sorry for the slow response, I was out of the office for a short vacation. Comments inline: On Sat, Jun 30, 2018 at 5:07 PM, Marcel Kornacker wrote: > Responses/comments inline. > > Before those, a note about process: > It looks like the work on this proposal is already underway. However, the > doc you sent out still has very many unanswered questions, one of which is > "to what extent will these changes deteriorate existing use cases that are > currently well supported". > I thought I'd answered that one, but just to be clear: we do not want to deteroriate existing use cases that are well supported. The goal is to not break any existing use cases, and we expect to substantially improve the performance and reliability for many, as well as enable new larger-scale use cases than possible with the current design. > I think it is very important to spell out the goals and non-goals of this > proposal explicitly, preferably in a separate section, so that there can be > a meaningful discussion of those in the Impala community. > > Right, sorry that I didn't get a chance to do this after we'd discussed it previously. I will prioritize that this week. > The proposal lists 6 contributors, all of whom are/were Cloudera employees > at the time of writing. It appears that there was a fair amount of > discussion and experimentation that went into this proposal, but this > wasn't done publicly. > Indeed I did discuss the proposal with some colleagues prior to publishing the doc on the mailing list. However, the doc itself was only started the day before it was published, and I solicited a few comments from coworkers as more of a "proof read". I personally wrote the document as well as personally did the prototype work discussed therein. I wanted to be inclusive of the other folks who had helped with the doc proof-reading so I included them in the heading. I'll remove all of the names including my own now, since the doc should be seen as a product of the open source community rather than any particular contributor(s). This can create the impression that community involvement is seen as an > afterthought rather than an integral part of the process (and the warning > in red at the top that 'this document is public' doesn't really do anything > to counter that impression :)). > > Sure, I can see that. I figured that such a warning would be appropriate because the document discusses a lot of workload-specific data, and I didn't want anyone to comment "In my work at FooCorp we have a table storing info about our 12,300 customers." Such a comment might be by a fellow Cloudera employee, or by someone at some other contributor. Happy to remove that heading if it seems like it's not inclusive. -Todd > On Fri, Jun 8, 2018 at 9:54 AM, Todd Lipcon wrote: > >> On Thu, Jun 7, 2018 at 9:27 AM, Marcel Kornacker >> wrote: >> >>> Hey Todd, >>> >>> thanks for putting that together, that area certainly needs some work. >>> I think there are a number of good ideas in the proposal, and I also >>> think there are a number of ways to augment it to make it even better: >>> 1. The size of the cache is obviously a result of the caching >>> granularity (or lack thereof), and switching to a bounded size with >>> more granularity seems like a good choice. >>> 2. However, one of the biggest usability problems stems from the need >>> for manual metadata discovery (via Invalidate/Refresh). 
Introducing a >>> TTL for cached objects seems like a bad approach, because it creates a >>> conflict between metadata freshness and caching >>> effectiveness/performance. This isn't really something that users want >>> to have to trade off. >>> >> >> Agreed. In the "future directions" part of the document I included one >> item about removing the need for invalidate/refresh, perhaps by subscribing >> to the push notification mechanisms provided by HMS/NN/S3/etc. With those >> notification mechanisms, the cache TTL could be dialed way up (perhaps to >> infinity) and only rely on cache capacity for eviction. >> >> In some cases the notification mechanisms provide enough information to >> update the cache in place (i.e the notification has the new information) >> whereas in others it would just contain enough information to perform >> invalidation. But from the user's perspective it achieves near-real-time >> freshness either way. >> >> Note, though, that, absent a transactional coordination between Impala >> and source systems, we'll always be in the realm of "soft guarantees". That >> is to say, the notif
Re: Proposal for a new approach to Impala metadata
- keep a central daemon responsible for interacting with source systems - impalads don't directly access source systems, but instead send metadata fetch requests to the central coordinator - impalads maintain a small cache of their own, and the central coordinator maintains a larger cache. If the impalad requests some info that the central daemon doesn't have, it fetches on demand from the source. - central coordinator propagates invalidations to impalads In essence, it's like the current catalog design, but based on a fine-grained "pull" design instead of a coarse-grained "push/broadcast" design. This allows impalads to scale out without linearly increasing the amount of load on source systems. The downsides, though, are: - extra hop between impalad and source system on a cache miss - extra complexity in failure cases (source system becomes a SPOF, and if we replicate it we have more complex consistency to worry about) - scalability benefits may only be really important on large clusters. Small clusters can often be served sufficiently by just one or two coordinator impalads. -Todd > On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon wrote: > > Hey Impala devs, > > > > Over the past 3 weeks I have been investigating various issues with > > Impala's treatment of metadata. Based on data from a number of user > > deployments, and after discussing the issues with a number of Impala > > contributors and committers, I've come up with a proposal for a new > design. > > I've also developed a prototype to show that the approach is workable and > > is likely to achieve its goals. > > > > Rather than describe the design in duplicate, I've written up a proposal > > document here: > > https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaa > qh97fC_PvqVGCk/edit?ts=5b04a6b8# > > > > Please take a look and provide any input, questions, or concerns. > > > > Additionally, if any users on this list have experienced metadata-related > > problems in the past and would be willing to assist in testing or > > contribute workloads, please feel free to respond to me either on or off > > list. > > > > Thanks > > Todd > -- Todd Lipcon Software Engineer, Cloudera
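The "pull" shape described above can be sketched in a few lines. This is illustrative only (the names are invented, and it omits thread safety and any capacity bound, both of which a real implementation would need): the impalad-side cache answers hits locally, fetches individual entries from the central daemon on a miss, and drops entries when an invalidation arrives.

  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <utility>

  // Illustrative fetch-on-miss metadata cache for an impalad.
  class LocalMetadataCache {
   public:
    using FetchFn = std::function<std::string(const std::string& key)>;
    explicit LocalMetadataCache(FetchFn fetch_from_central_daemon)
        : fetch_(std::move(fetch_from_central_daemon)) {}

    const std::string& Get(const std::string& key) {
      auto it = cache_.find(key);
      if (it == cache_.end()) {
        // Cache miss: pull just this entry from the central daemon, which in
        // turn fetches from the source system if it doesn't have it either.
        it = cache_.emplace(key, fetch_(key)).first;
      }
      return it->second;
    }

    // Invoked when the central daemon propagates an invalidation.
    void Invalidate(const std::string& key) { cache_.erase(key); }

   private:
    FetchFn fetch_;
    std::unordered_map<std::string, std::string> cache_;
  };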
Re: Palo, a fork of Impala (?), proposed for the incubator.
Yea, definitely seems like it would be beneficial to join efforts given the code similarity (at least as of last time I looked at the Palo source). I'll join in on the incubator thread. Thanks for noticing this and bringing it to our attention. -Todd On Thu, Jun 7, 2018 at 10:58 PM, Jim Apple wrote: > https://lists.apache.org/thread.html/74a3f3f945403b50515c658047d328 > 4955288a637207e4f97ecc15d1@%3Cgeneral.incubator.apache.org%3E > > I think it’s worth considering how the two communities could work together > for the benefit of all. > -- Todd Lipcon Software Engineer, Cloudera
Re: jenkins.impala.io account
On Fri, Jun 1, 2018 at 11:37 AM, Michael Brown wrote: > Thanks for taking an interest! > > Is https://jenkins.impala.io/job/gerrit-verify-dryrun-external/ suitable > for you? It's open to all contributors. It's just a wrapper around > gerrit-verify-dryrun, sans commit-after-passing-build bit. > Will this give the proper "+1 verified" mark on a gerrit such that it can be committed? -Todd > On Fri, Jun 1, 2018 at 11:35 AM, Todd Lipcon wrote: > > > Hey folks, > > > > Would someone mind generating an account for me on the Impala jenkins > > server so I can run gerrit verify jobs? > > > > Though I'm officially a committer (due to participation as a mentor > during > > incubation) I'm not yet an experienced patch contributor, so I'll be > > prudent with my review/commit privileges. > > > > Thanks > > -Todd > > -- > > Todd Lipcon > > Software Engineer, Cloudera > > > -- Todd Lipcon Software Engineer, Cloudera
jenkins.impala.io account
Hey folks, Would someone mind generating an account for me on the Impala jenkins server so I can run gerrit verify jobs? Though I'm officially a committer (due to participation as a mentor during incubation) I'm not yet an experienced patch contributor, so I'll be prudent with my review/commit privileges. Thanks -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Statically link Kudu client
Hi Quanlong, I think the comment you found must be out of date. I think Thomas may know the latest state of how the Kudu client is built as part of "native-toolchain". One thing to note is that, since Impala builds with its own toolchain, you need to make sure that it dynamic-links against the libkudu_client that comes from the toolchain. If you end up linking against one built using your system toolchain there is a chance that you'll get ABI mismatches and strange crashes. Perhaps that's what you saw when you first tried. -Todd On Tue, May 22, 2018 at 3:21 PM, Quanlong Huang wrote: > Hi all, > > > Recently when I tried to upgrade our Impala cluster in CDH5.7.1 to > Impala-2.12, impalad crashed when inserting data to kudu tables. However, > it works when reading from kudu tables. > Finally, I found out that the kudu client > (/usr/lib/x86_64-linux-gnu/libkudu_client.so.0) > is still linked dynamically. Issues are resolved after I update > libkudu_client.so.0. > > > In kudu-table-sink.cc, the comments said the kudu-client can be linked > statically: https://github.com/apache/impala/blob/2.12.0/be/src/ > exec/kudu-table-sink.cc#L151 > However, when building with -static in ./buildall.sh, the kudu-client is > still linked dynamically (see `ldd be/build/latest/service/impalad`). Is > there a build option to link it statically? > > > Thanks, > Quanlong -- Todd Lipcon Software Engineer, Cloudera
Proposal for a new approach to Impala metadata
Hey Impala devs, Over the past 3 weeks I have been investigating various issues with Impala's treatment of metadata. Based on data from a number of user deployments, and after discussing the issues with a number of Impala contributors and committers, I've come up with a proposal for a new design. I've also developed a prototype to show that the approach is workable and is likely to achieve its goals. Rather than describe the design in duplicate, I've written up a proposal document here: https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8# Please take a look and provide any input, questions, or concerns. Additionally, if any users on this list have experienced metadata-related problems in the past and would be willing to assist in testing or contribute workloads, please feel free to respond to me either on or off list. Thanks Todd
Re: Re: Best practice to compile and deploy Impala
On Fri, Apr 27, 2018 at 7:09 AM, Quanlong Huang wrote: > Thank you, Todd! But my IMPALA_HOME occupied 14GB after compilation... > > > The be directory contributes 4.4GB but I think only impalad and > libfesupport.so is needed. They're 981MB in total in a debug build. It's ok. > However, the toolchain directory contributes 8.5GB. It's hard to pick up > things we actually need (i.e. libkudu_client.so.0, libstdc++.so.6, > libgcc_s.so.1, and cdh_components, etc.). > I think I may not handle them in an elegant way. Could you give some more > advise? > > Maybe run 'ldd' on your impalad binary to figure out which shared objects are actually necessary? I also find that during development I have a good sense of which pieces actually changed. For example if I updated only the backend I only bother to re-copy impalad and not any of the jars. Vice versa, if I only updated something in the front end I'd only re-copy the FE jar to the cluster. That way you only pay the expensive deployment step the first time you set up your cluster. -Todd > At 2018-04-26 00:49:05, "Todd Lipcon" wrote: > >Hi Quanlong, > > > >If you dont need full debuginfo on your cluster, you might consider > running > >'strip --strip-debug' on the impalad binary that you output. Between that > >and using 'rsync' instead of copying a new full directory of jars every > >time, it's usually not that many MB (<100?) I usually do builds on a > >machine in the same network as the machines I plan on deploying on so that > >the copy runs at several hundred MB/second, rather than building on my > >laptop far away. > > > >Tools like pscp and pssh are also handy of course. > > > >Hope that helps, > >-Todd > > > >On Wed, Apr 25, 2018 at 1:48 AM, Quanlong Huang > >wrote: > > > >> Hi all, > >> > >> > >> Recently when I have a try on Impala-2.12.0-rc1, I find it really hard > to > >> deploy Impala manually (I used to do this by Cloudera Manager). The > >> directory size is huge after compiled so I only distributed something I > >> thought really needed. This work is tedious and prone to errors. > >> > >> > >> Is there a best practice for packaging and distributing the binaries > after > >> compiling? > >> > >> > >> Thanks, > >> Quanlong > > > > > > > > > >-- > >Todd Lipcon > >Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
Re: Best practice to compile and deploy Impala
Hi Quanlong, If you don't need full debuginfo on your cluster, you might consider running 'strip --strip-debug' on the impalad binary that you output. Between that and using 'rsync' instead of copying a new full directory of jars every time, it's usually not that many MB (<100?). I usually do builds on a machine in the same network as the machines I plan on deploying on so that the copy runs at several hundred MB/second, rather than building on my laptop far away. Tools like pscp and pssh are also handy of course. Hope that helps, -Todd On Wed, Apr 25, 2018 at 1:48 AM, Quanlong Huang wrote: > Hi all, > > > Recently when I have a try on Impala-2.12.0-rc1, I find it really hard to > deploy Impala manually (I used to do this by Cloudera Manager). The > directory size is huge after compiled so I only distributed something I > thought really needed. This work is tedious and prone to errors. > > > Is there a best practice for packaging and distributing the binaries after > compiling? > > > Thanks, > Quanlong -- Todd Lipcon Software Engineer, Cloudera
Re: Help with fixing clang-diagnostic-over-aligned
This probably means that QueryExecMgr (or one of its descendant members) is annotated with __attribute__((aligned(64))), perhaps through something like a CACHELINE_ALIGNED macro. The issue is that, as the warning says, normal libc malloc doesn't make any hard guarantees about the alignment of allocations, and the way that 'operator new' is implemented by default only gets the size and not the desired alignment. I believe C++17 is adding support for an 'operator new' overload that also passes the desired alignment, so that it can call down into posix_memalign and ensure an appropriately aligned allocation. The "correct" fix prior to C++17 is probably to do something like overload 'operator new' for the type, or to use posix_memalign manually to allocate aligned memory, and then placement-new it into that memory. Another reasonable fix would probably be to add a CHECK in the constructor that 'this' is aligned as desired so that, if you ever run on an allocator that isn't guaranteeing what you think it is, you'll know immediately rather than getting a surprising perf or atomicity regression due to cross-cacheline loads/stores. -Todd On Wed, Feb 14, 2018 at 12:33 PM, Sailesh Mukil wrote: > Does anyone have experience with fixing the following type of > clang-tidy warning? > > > /home/ubuntu/Impala/be/src/runtime/exec-env.cc:159:21: warning: type > 'impala::QueryExecMgr' requires 64 bytes of alignment and the default > allocator only guarantees 16 bytes [clang-diagnostic-over-aligned] > query_exec_mgr_(new QueryExecMgr()), > > > /home/ubuntu/Impala/be/src/service/impalad-main.cc:80:49: warning: > type 'impala::ImpalaServer' requires 64 bytes of alignment and the > default allocator only guarantees 16 bytes > [clang-diagnostic-over-aligned] > boost::shared_ptr impala_server(new > ImpalaServer(&exec_env)); -- Todd Lipcon Software Engineer, Cloudera
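For reference, a small pre-C++17 sketch of the first suggestion (illustrative only, not the actual Impala fix, and the type name is a stand-in rather than the real QueryExecMgr): give the over-aligned type a class-specific operator new that allocates via posix_memalign.

  #include <new>
  #include <stdlib.h>

  // Illustrative stand-in for an over-aligned class like QueryExecMgr.
  struct alignas(64) CachelinePadded {
    // Class-specific operator new/delete so that plain 'new' honours the
    // 64-byte alignment even though the default allocator only gives 16.
    static void* operator new(std::size_t size) {
      void* p = nullptr;
      if (posix_memalign(&p, 64, size) != 0) throw std::bad_alloc();
      return p;
    }
    static void operator delete(void* p) noexcept { free(p); }

    char padded_counters[64];  // placeholder for the cacheline-aligned members
  };

  int main() {
    CachelinePadded* obj = new CachelinePadded();  // now 64-byte aligned
    // A constructor-time check could instead assert something like:
    //   DCHECK_EQ(reinterpret_cast<uintptr_t>(this) % 64, 0);
    delete obj;
    return 0;
  }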
Re: ORC scanner - points for discussion
On Fri, Feb 9, 2018 at 5:05 PM, Tim Armstrong wrote: > Quanlong has done a bunch of work implementing an ORC scanner. I've been > playing around with it and it works pretty nicely - I can load and run > TPC-H with no problem! > > It's a big addition to Impala and the integration with the external library > has caused some implementation challenges, so I wanted to summarise the > issues so that other community members can weigh in. > > *1. in-tree versus third-party library* > The current patchset imports the ORC source code into the Impala source > tree, rather than treating it as an external library. > > My feeling is that, initially, we don't want to take on the burden of > maintaining a fork of the library, so we're best off keeping it external. > This way we can upgrade the library to pick up new ORC features instead of > having to bring the changes into the Impala codebase. I do think we should > generally try to collaborate with other Apache projects and contribute back > improvements where possible, instead of forking their code. > > We could re-evaluate this at some point, but reducing the maintenance > burden for now seems most important. > > *2. C++ Exceptions* > The ORC library relies on exceptions for error handling. Our coding style > guide disallows exceptions, but we do have to deal with them in a few > places interfacing with external libraries like boost (which sucks). If > we're interfacing with the library, I don't see how we can avoid throwing > and catching exceptions at the boundaries with the library, which is what > Quanlong's patch does. I looked at the ORC source code and the exceptions > seem fairly pervasive. > > My feeling is that we can live with this, provided that we're very careful > to catch exceptions at the library boundary and it is only > hdfs-orc-scanner.cc that has to deal with the exceptions. > > *3. What is the quality bar?* > We definitely want all the functional tests to pass, same as other file > formats like Avro. I also asked Quanlong to add it to the fuzz test to get > that extra coverage. It would be helpful if others could confirm that we > have enough test coverage. > > *4. What is the bar for perf and resource consumption?* > The initial version of the scanner won't match Parquet in these categories, > mainly because it isn't tightly integrated with the Impala runtime and is > missing a few things like codegen. There may be some limits to how far we > can improve this without modifying the ORC library. > For quality, perf, etc, I think it would be reasonable to ship it for a release or two as an explicitly labeled experimental feature. I think the experience in the past is that most file format integrations appear to work with synthetic datasets but have new and exciting failures and compatibility issues in the wild once run against the variety of actual implementations that produce the format. One thing we have found helpful in the Kudu community is to actually enforce the "experimental" nature of new features for a couple of releases. We do this by having a gflag called --unlock-experimental-flags. Users who decide to flip that on are explicitly acknowledging that they're treading on some unproven ground and they might get crashes, correctness issues, etc. Of course the goal should be that, if after a release or two of use the feedback from users is that it works well and few issues have been found, I'd expect it to be marked non-experimental. -Todd -- Todd Lipcon Software Engineer, Cloudera
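On point 2, the boundary pattern can be kept quite small. A rough sketch of catching library exceptions right at the call site and turning them into a Status (the helper and the simplified Status type here are hypothetical, not the actual hdfs-orc-scanner code):

  #include <exception>
  #include <string>
  #include <utility>

  // Minimal stand-in for Impala's Status type, just for this sketch.
  struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
  };

  // Wrap any call into the ORC library; exceptions never escape this function.
  template <typename Fn>
  Status CallOrcLibrary(Fn&& fn) {
    try {
      fn();
      return Status::OK();
    } catch (const std::exception& e) {
      return Status::Error(std::string("ORC library error: ") + e.what());
    } catch (...) {
      return Status::Error("Unknown error from ORC library");
    }
  }

Usage would look something like Status s = CallOrcLibrary([&] { /* call into the ORC reader here */ });, which keeps the exception handling confined to the scanner boundary as Tim describes.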