Re: Automatic Update statistics on ORC tables in Hive

Mich Talebzadeh Mon, 28 Mar 2016 11:28:12 -0700

Hi Alan,

Thanks for the clarification. I gather you are referring to the following
note in the Jira:

"Given the work that's going on in HIVE-11160
<https://issues.apache.org/jira/browse/HIVE-11160> and HIVE-12763
<https://issues.apache.org/jira/browse/HIVE-12763> I don't think it makes
sense to continue down this path. These JIRAs will lay the groundwork for
auto-gathering stats on data as it is inserted rather than having a
background process do the work."

I concur. I am not a fan of automatic update statistics either, although
many RDBMS vendors were touting it in the early days. The whole thing
turned out to be a hindrance, as UPDATE STATISTICS would fire in the middle
of the business day and take resources away from the workload.

Most vendors base the need for updating/gathering stats on the number of
rows changed, relying on some function, say datachange(). When the
datachange() function indicates that 10% of the rows have changed, it is
time for update stats to run. Again, in my opinion this is rather arbitrary
and lacks any scientific basis. For Hive the important operation is Inserts.
For transactional tables one will have Updates and Deletes as well. My
understanding is that the classical approach is to report how many "row
change operations", say Inserts, have been performed since the last time any
kind of analyze statistics was run.
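
To make the threshold approach concrete, here is a minimal sketch in
Python. The function names, the 10% default and the row counts are purely
illustrative assumptions; this is not any vendor's actual implementation,
only the general shape of the check.

```python
# Hypothetical sketch of a datachange()-style trigger (names are assumptions).

def datachange(rows_changed_since_analyze: int, total_rows: int) -> float:
    """Return the percentage of rows changed since stats were last gathered."""
    if total_rows <= 0:
        # Empty or never-analyzed table: treat any change as 100%.
        return 100.0 if rows_changed_since_analyze > 0 else 0.0
    return 100.0 * rows_changed_since_analyze / total_rows


def needs_update_stats(rows_changed_since_analyze: int,
                       total_rows: int,
                       threshold_pct: float = 10.0) -> bool:
    """True once the change percentage reaches the (arbitrary) threshold."""
    return datachange(rows_changed_since_analyze, total_rows) >= threshold_pct
```

For a table of 1,000,000 rows, needs_update_stats(100_000, 1_000_000)
returns True while needs_update_stats(99_999, 1_000_000) returns False,
which shows how sharp and arbitrary the cut-off is.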

This came to mind as I was using Spark to load CSV files and to create and
insert into Hive ORC tables. The problem I have is that analyze statistics
through Spark fails. This is not a show stopper, as the load shell script
invokes beeline to log in to Hive and analyze statistics on the newly
created table. Although some proponents might argue for saving data in
Spark as Parquet files, when one has millions and millions of rows stats
matter, and that is where ORC adds its value.
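
As a rough illustration of the beeline workaround, the sketch below builds
the command line that a load script could run after the Spark load. It is
written in Python rather than shell, and the helper name, the JDBC URL and
the table name are hypothetical; it also assumes an unpartitioned table.
Only beeline's -u and -e options and the standard ANALYZE TABLE ... COMPUTE
STATISTICS statement are used.

```python
import re
from typing import List

# Accept only simple [db.]table identifiers so the name can be placed in SQL.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_analyze_command(jdbc_url: str, table: str,
                          for_columns: bool = False) -> List[str]:
    """Build (but do not run) a beeline command that analyzes one table."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    sql = f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if for_columns:
        # Column-level statistics in addition to basic table statistics.
        sql += " FOR COLUMNS"
    return ["beeline", "-u", jdbc_url, "-e", sql + ";"]
```

The returned list can be handed to subprocess.run(cmd, check=True) once the
table has been loaded, for example with a URL such as
jdbc:hive2://localhost:10000/default (an assumed value).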

Cheers

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 28 March 2016 at 18:43, Alan Gates <alanfga...@gmail.com> wrote:

> I resolved that as Won’t Fix.  See the last comment on the JIRA for my
> rationale.
>
> Alan.
>
> > On Mar 28, 2016, at 03:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >
> > Thanks. This does not seem to be implemented although the Jira says
> > resolved. It also mentions the timestamp of the last update stats. I do not
> > see it yet.
> >
> > Regards,
> >
> > Mich
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > On 28 March 2016 at 06:19, Gopal Vijayaraghavan <gop...@apache.org> wrote:
> >
> > > This might be a bit far fetched but is there any plan for background
> > > ANALYZE STATISTICS to be performed on ORC tables
> >
> >
> > https://issues.apache.org/jira/browse/HIVE-12669
> >
> > Cheers,
> > Gopal
> >
> >
> >
>
>
