Re: Speed up refresh table partitions in batch

2017-08-22 Thread Edward Capriolo
Previously I found that running any command that touches a partition, such
as adding properties, causes a refresh of that partition.
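As a hedged sketch of the workaround Edward describes (table, partition,
and property names here are invented; behavior may vary by Impala version):

```sql
-- Touching a partition with a no-op property change forces the catalog to
-- reload that partition's metadata (all names are placeholders):
ALTER TABLE events PARTITION (day='2017-08-22')
  SET TBLPROPERTIES ('last_touched'='2017-08-22');
```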

On Tue, Aug 22, 2017 at 10:40 PM, yu feng  wrote:

> Hi community,
>
> I have an improvement to Impala in our environment that I want to
> contribute to the Impala community. This is our scenario:
>
> We have a table with three or four partition keys, with almost 1K new
> partitions added, and a Spark streaming job writes new data to existing
> partitions every 15 minutes (touching the most recent 7 days), so we have
> to refresh the partitions of the last 7 days, about 7K partitions.
>
> However, the whole table has 100K (10W) partitions and growing, so we
> have two choices: refresh the whole table, or refresh the 7K partitions
> one by one. We obviously chose to refresh the whole table, but it takes
> 5 minutes to finish. I checked the code (before 2.8.0) and found that
> refreshing a table finally calls the function:
>
> HdfsTable.load(true, client, msTbl, true, true, null);
>
> which reloads the metadata, checks every partition existing in the table,
> and loads every file to determine whether it was updated or newly created
> by checking its last modification time and file length.
>
> In our table there are about 1M (100W) files, so the refresh table
> operation is slow.
>
> Hence, we created a new usage: REFRESH TABLE xxx PARTITION (day = ('xx1',
> 'xx2', 'xx3')); the operation refreshes only the partitions whose day
> matches xx1/xx2/xx3, so we load only the files and partitions of the
> last 7 days.
>
> After our tests, we find this speeds up the operation by about 2x.
>
> Do you have any suggestions about it? Thanks a lot.
>
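The proposal above amounts to the following, shown next to the stock
alternatives. The multi-value PARTITION form is the proposed extension, not
shipped Impala syntax, and the table name and day values are placeholders:

```sql
-- Stock options: reload everything, or one statement per partition.
REFRESH sales;                                -- rescans all ~100K partitions
REFRESH sales PARTITION (day='2017-08-16');   -- ~7K separate statements

-- Proposed extension: refresh a set of partitions in one statement.
REFRESH TABLE sales PARTITION (day = ('2017-08-16', '2017-08-17', '2017-08-18'));
```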


Re: IMPALA-4326 - split() function

2017-07-10 Thread Edward Capriolo
That SQL:2016 standard did not predate Hive's implementation of lateral
view.

On Sunday, July 9, 2017, Greg Rahn  wrote:

> (also commented on IMPALA-4326)
>
> For this functionality, I'd prefer to follow what Postgres does and use its
> well-named functions like string_to_array().
> This becomes powerful when using the unnest() table function, which is
> defined and is part of the ANSI/ISO SQL:2016 spec (vs the non-standard
> lateral view explode Hive syntax).
>
> with t as (
>   select
> 42 as id,
> '1,2,3,4,5,6'::text as string_array
> )
> select
>   t.id,
>   u.l
> from t, unnest(string_to_array(t.string_array,',')) as u(l);
>
>  id | l
> ----+---
>  42 | 1
>  42 | 2
>  42 | 3
>  42 | 4
>  42 | 5
>  42 | 6
>
>
> On Mon, Jun 19, 2017 at 7:40 AM, Alexander Behm  >
> wrote:
>
> > Yes and no. Extending the UDF framework might be hard, but I think
> > implementing a built-in split() is feasible. We already have a built-in
> > Expr that returns an array type to implement unnest.
> >
> > On Mon, Jun 19, 2017 at 6:22 AM, Vincent Tran  > wrote:
> >
> > > This request appears to be blocked by the current UDF framework's
> > > limitation.
> > > As far as I can tell, functions can still only return simple scalar
> > types,
> > > right?
> > >
> >
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Scanning directories recursively for table data

2017-05-05 Thread Edward Capriolo
In Hadoop proper there is a TextInputFormat property that controls
recursion.
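The setting Edward likely refers to can be enabled per session in Hive; a
sketch is below (property names are the standard Hadoop/Hive ones from
memory; verify against your version's documentation):

```sql
-- Enable recursive directory listing for MapReduce input and Hive queries:
SET mapreduce.input.fileinputformat.input.dir.recursive=true;
SET hive.mapred.supportsSubDirectories=true;
```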

On May 5, 2017 3:53 AM, "yu feng"  wrote:

As a user, I think Impala should always turn this feature on.

2017-05-05 14:57 GMT+08:00 Alexander Behm :

> 1. For my understanding, what's the use case for turning this feature on
> and off? Why not have it on all the time?
>
> 2. A query/session option seems awkward because Impala loads the block
> metadata in the catalogd and caches it. How would an impalad know if there
> is already sufficient metadata in the cache? Should we reload the table
> metadata whenever such a SET option is used? I'm thinking of a table that
> does not have data in subdirectories. You could add an additional "loading
> state" to a table to indicate whether it was loaded with/without
> subdirectories. Overall this solution does not seem to fit very well into
> the existing architecture, and sounds overly complicated.
>
> 3. A table property is more consistent with the existing architecture.
>
> On Thu, May 4, 2017 at 11:03 PM, Shant Hovsepian 
> wrote:
>
> > Hi All, what are people's thoughts on IMPALA-4726
> >  and IMPALA-4596
> > ? These concern support for recursing through subdirectories in a
> > table location to search for all data files.
> >
> > Restricting the behavior to external tables only seems like a good idea,
> > but as for turning the behavior on, what are thoughts on making it a
> > runtime session setting with "SET" as Hive does, or potentially making
> > it something permanent like a table property?
> >
> > Thanks!
> >
> > -Shant
> >
>
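If the table-property route Alexander favors were taken, it might look like
the sketch below (the property name is invented for illustration; Impala
has no such property as of this thread):

```sql
-- Hypothetical: mark one external table as recursively listed, leaving
-- all other tables on the flat-directory behavior.
ALTER TABLE ext_logs
  SET TBLPROPERTIES ('impala.recursive.listing'='true');
```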


Apache Hive metastore and Impala

2017-04-05 Thread Edward Capriolo
Hello impala devs!

Let me say that I have used Impala a lot and am very impressed with it.

I know Impala is moving into the Apache Incubator (I have an Incubator
podling, Gossip, so I know this is challenging). There are a few things I
want to bring to your attention/discuss, so that they do not become an
issue or blocker in the future.

1) Code
Your proposal https://wiki.apache.org/incubator/ImpalaProposal lists Hive
as a dependency.

External Dependencies

Apache Hive (Apache Software License v2.0)

I notice that Cloudera Impala has CDH "hive" jars (which are rather old)
in its source tree:

https://github.com/cloudera/Impala/tree/8b621a301329d91fbe10a8aac5e39a2b14d6d25f/thirdparty/hive-1.1.0-cdh5.12.0-SNAPSHOT

A quick search did not find any evidence of that in incubator-impala (which
is good):
https://github.com/apache/incubator-impala/

We (Hive) want people using only official Apache Hive releases for
dependencies. We want to avoid:
1) Full or partial code forks of Apache Hive which still carry the Hive name
2) Artifacts published to central repositories named "*Hive*" which could
be confusing

I am not asserting that Impala is affected by case #1 or #2 currently, but
it is something to be aware of. If you need guidance, feel free to discuss
it further with the Hive PMC.

2) Next topic, the Hive name and statements that imply compatibility:

http://impala.apache.org/

For Apache Hive users, Impala utilizes the same metadata, ODBC driver, SQL
syntax, and user interface as Hive—so you don't have to worry about
re-inventing the implementation wheel.

Apache Hive proposes and adds syntax all the time. For example, this
feature is in the works now (
https://issues.apache.org/jira/browse/HIVE-15986). Even if every effort
were made to keep the languages and features in sync, no one would be able
to make this claim, because Apache Hive does not have compatibility tests
for any of these things (we do not have anything like an ANSI SQL 92
suite).

This text needs to be replaced. It is probably fine to make statements such
as "Impala can run many of the same queries as Apache Hive" or "users of
Apache Hive will find many familiar features in Impala".

Again, welcome to the incubator; I am sure getting Impala through is fun
with the C++-ness of it all!

Thanks,
Edward