After an offline conversation with Shant, I think we should consider
changing a couple more defaults.

We could change COMPRESSION_CODEC to LZ4 or ZSTD as the default. I think
LZ4 is the safest option perf-wise, because it will be faster across the
board and the decompression is now one of the main CPU bottlenecks for
Parquet scanning. We might need to double-check that enough of the
ecosystem supports LZ4, but this seems like it would be a good improvement.

It *might* be worth enabling COMPUTE STATS table sampling by default, but I
think that could be open for discussion.

We could also consider bumping RUNTIME_FILTER_WAIT_TIME_MS to a higher
value, since I think generally higher values have proven to be more robust
for complex queries (TPC-DS, etc.).

On Tue, Mar 17, 2020 at 11:56 AM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> >   - Do we still need the DECIMAL_V2 query option? Seems like this has
> > been true for a while. Maybe we can add it to the list of deprecated flags?
>
> Maybe we could officially deprecate it and phase it out soonish? It really
> only exists as a workaround for people upgrading from the old behaviour in
> 2.x. It hasn't been terribly bad maintaining the two code paths, but it
> would be nice to simplify it.
>
> >   - Deprecate support for ADLS, since it has effectively been replaced
> > by ABFS
>
> Makes sense. It probably isn't too much overhead to keep the old code
> around for a while, is it? Just in case users have a bunch of data still
> sitting in the old ADLS.
>
> >   - Deprecate (or even remove) support for HDFS caching? Not sure how
> > extensively this is used, removing the code would be nice as it simplifies
> > part of the HDFS read path
>
> Anecdotally I do see it used, but a lot of times it's to affect scheduling
> rather than because saving memcpy() makes a real difference (with
> compressed Parquet, that's rarely the bottleneck). A compromise or
> in-between step would be to remove the special-casing of the zero-copy code
> path in the backend, but keep the scheduling behaviour.
>
> On Tue, Mar 17, 2020 at 11:50 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
>> I think I generally support this. A few specific comments.
>>
>> > Proposal 3: Impala-lzo
>> > Drop support for Impala-lzo/hadoop-lzo
>>
>> Does this mean dropping the plugin text scanner interface entirely? LZO
>> is the only implementation of that that I'm aware of (and we rely on it to
>> test the interface), so it seems reasonable to me to remove something that
>> has minimal adoption and is not cleanly separated from the scanner
>> implementation of core Impala.
>>
>> > Proposal 5: Sentry
>> > Drop support for Sentry in favor of Ranger.
>>
>> I think moving this direction makes a lot of sense given that activity in
>> the Sentry project has declined a lot (just look at the activity level on
>> the two projects, it's dramatically different), unless someone in the
>> community wants to step up and maintain the integration.
>>
>> > Proposal 6: Metadata
>> > Metadata V2 will become the default. Metadata V1 will be deprecated.
>>
>> Maybe we should set a goal of removing the support in Impala 4.1 or 4.2?
>> That would allow us to remove a lot of complex code.
>>
>> On Mon, Mar 16, 2020 at 10:07 AM Joe McDonnell <joemcdonn...@cloudera.com>
>> wrote:
>>
>>> Now that Impala 3.4 is branched and master is Impala 4.0, we need to
>>> decide what breaking changes will happen in Impala 4.0. I have provided a
>>> series of proposals below. I welcome feedback on them. Other proposals are
>>> also welcome.
>>>
>>> Thanks,
>>> Joe
>>>
>>> Proposal 0: Hadoop component versions
>>>
>>> Switch to CDP versions of components by default. This means that Impala
>>> will use Hive 3+ (which is already essentially Hive 4 and may change
>>> names to being Hive 4).
>>> Remove support for CDH versions of components.
>>> This was already discussed in the original thread for Impala 4, so this
>>> is not new.
>>>
>>> Proposal 1: OS support
>>>
>>> Drop support for Centos 6, Ubuntu 14, and Debian (all versions)
>>> Retain support for Ubuntu 16, Ubuntu 18, Centos 7, and SLES 12
>>> Centos 7 development will be focused on newer Centos 7 versions such as
>>> 7.6 and 7.7.
>>> Add support for Centos 8
>>> Move main development from Ubuntu 16 to Ubuntu 18 over time.
>>>
>>> Proposal 2: Python support
>>>
>>> Drop support for Python 2.6
>>> Add support for Python 3 over time.
>>>
>>> Proposal 3: Impala-lzo
>>>
>>> Drop support for Impala-lzo/hadoop-lzo
>>>
>>> Proposal 4: Clients
>>>
>>> Deprecate beeswax protocol. This means that it can be removed in the next
>>> major version number, but it would not be removed in Impala 4. Current
>>> users of beeswax would need to start migrating to HS2.
>>>
>>> Proposal 5: Sentry
>>>
>>> Drop support for Sentry in favor of Ranger.
>>>
>>> Proposal 6: Metadata
>>>
>>> Metadata V2 will become the default. Metadata V1 will be deprecated.
>>>
>>> Thanks,
>>> Joe
>>>
>>
