I think we should consider changing a couple more defaults, after having an offline conversation with Shant.

We could change COMPRESSION_CODEC to LZ4 or ZSTD as the default. I think LZ4 is the safest option perf-wise, because it will be faster across the board and decompression is now one of the main CPU bottlenecks for Parquet scanning. We might need to double-check that enough of the ecosystem supports LZ4, but this seems like it would be a good improvement.

It *might* be worth enabling compute stats table sampling by default, but I think that could be open for discussion.

We could also consider bumping RUNTIME_FILTER_WAIT_TIME_MS to a higher value, since I think generally higher values have proven to be more robust for complex queries (TPC-DS, etc).

On Tue, Mar 17, 2020 at 11:56 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

>
> - Do we still need the DECIMAL_V2 query option? Seems like this has
>   been true for a while. Maybe we can add it to the list of deprecated flags?
> Maybe we could officially deprecate it and phase it out soonish? It really
> only exists as a workaround for people upgrading from the old behaviour in
> 2.x. It hasn't been terribly bad maintaining the two code paths, but it
> would be nice to simplify it.
>
> - Deprecate support for ADLS, since it has effectively been replaced
>   by ABFS
> Makes sense. It probably isn't too much overhead to keep the old code
> around for a while, is it? Just in case users have a bunch of data still
> sitting in the old ADLS.
>
> - Deprecate (or even remove) support for HDFS caching? Not sure how
>   extensively this is used, removing the code would be nice as it simplifies
>   part of the HDFS read path
> Anecdotally I do see it used, but a lot of times it's to affect scheduling
> rather than because saving memcpy() makes a real difference (with
> compressed parquet, that's rarely the bottleneck). A compromise or
> in-between step would be to remove the special-casing of the zero-copy code
> path in the backend, but keep the scheduling behaviour.
>
> On Tue, Mar 17, 2020 at 11:50 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
>> I think I generally support this. A few specific comments.
>>
>> > Proposal 3: Impala-lzo
>> > Drop support for Impala-lzo/hadoop-lzo
>>
>> Does this mean dropping the plugin text scanner interface entirely? LZO
>> is the only implementation of that that I'm aware of (and we rely on it to
>> test the interface) so seems reasonable to me to remove something that has
>> minimal adoption and not cleanly separated from the scanner implementation
>> of core Impala.
>>
>> > Proposal 5: Sentry
>> > Drop support for Sentry in favor of Ranger.
>>
>> I think moving this direction makes a lot of sense given that activity in
>> the Sentry project has declined a lot (just look at the activity level on
>> the two projects, it's dramatically different), unless someone in the
>> community wants to step up and maintain the integration.
>>
>> > Proposal 6: Metadata
>> > Metadata V2 will become the default. Metadata V1 will be deprecated.
>> Maybe we should set a goal of removing the support in Impala 4.1 or 4.2?
>> That would allow us to remove a lot of complex code
>>
>> On Mon, Mar 16, 2020 at 10:07 AM Joe McDonnell <joemcdonn...@cloudera.com>
>> wrote:
>>
>>> Now that Impala 3.4 is branched and master is Impala 4.0, we need to decide
>>> what breaking changes will happen in Impala 4.0. I have provided a series
>>> of proposals below. I welcome feedback on them. Other proposals are also
>>> welcome.
>>>
>>> Thanks,
>>> Joe
>>>
>>> Proposal 0: Hadoop component versions
>>>
>>> Switch to CDP versions of components by default. This means that Impala
>>> will use Hive 3+ (which is already essentially Hive 4 and may change names
>>> to being Hive 4).
>>> Remove support for CDH versions of components.
>>> This was already discussed in the original thread for Impala 4, so this is
>>> not new.
>>>
>>> Proposal 1: OS support
>>>
>>> Drop support for Centos 6, Ubuntu 14, and Debian (all versions)
>>> Retain support for Ubuntu 16, Ubuntu 18, Centos 7, and SLES 12
>>> Centos 7 development will be focused on newer Centos 7 versions such as 7.6
>>> and 7.7.
>>> Add support for Centos 8
>>> Move main development from Ubuntu 16 to Ubuntu 18 over time.
>>>
>>> Proposal 2: Python support
>>>
>>> Drop support for Python 2.6
>>> Add support for Python 3 over time.
>>>
>>> Proposal 3: Impala-lzo
>>>
>>> Drop support for Impala-lzo/hadoop-lzo
>>>
>>> Proposal 4: Clients
>>>
>>> Deprecate beeswax protocol. This means that it can be removed in the next
>>> major version number, but it would not be removed in Impala 4. Current
>>> users of beeswax would need to start migrating to HS2.
>>>
>>> Proposal 5: Sentry
>>>
>>> Drop support for Sentry in favor of Ranger.
>>>
>>> Proposal 6: Metadata
>>>
>>> Metadata V2 will become the default. Metadata V1 will be deprecated.
>>>
>>> Thanks,
>>> Joe